
JPGs with found "words" in them, OCR artifacts



I've been a Premium user for years. I vaguely remember contacting Evernote support early on about the unreasonable number of false "words" that Evernote's OCR finds in images. I think the response was along the lines of "Yeah, OCR isn't perfect." I begged for a way to turn off this horrid feature until OCR got better. That wasn't possible short of giving up Premium. I turned to tags for searching Evernote, but I didn't like them and still don't. Typing out the syntax is a pain when you have an impaired hand (especially if the tag is more than one word), and if Evernote and I ever part ways, I'm not sure how I'd make use of these tags.

Anyway, has this OCR issue been addressed? I'm about to scan a mess of paper, and am trying to decide when to go PDF and when to go JPG. Maybe it doesn't matter, and these artifacts are simply unavoidable in anything Evernote ingests? Should I just drop Premium?

In case you don't understand my problem, here's an example. I have a notebook full of landscaping ideas, mostly images. If I search for any random word in that notebook, say the word "tax," the "word" is highlighted in tree trunks, grass, clouds, etc. Examples attached. A search full of false positives isn't much of a search. So I have to search by tag to avoid getting overwhelmed by garbage results.

I have 162 notes in that landscaping notebook. The word "tax" was found correctly in one image and incorrectly in the images of 14 notes. That means about 9% of my notes produce a false hit, which seems really high to me. Or, to put it another way, 14 out of 15 (93%) of my text search results were garbage.
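For anyone who wants to check the arithmetic, here is a minimal sketch using only the counts quoted above. The function names are made up for illustration; nothing here comes from Evernote itself.

```python
def share_of_notes_with_false_hit(false_hits, notes_total):
    """Fraction of all notes in the notebook that matched by mistake."""
    return false_hits / notes_total


def share_of_results_wrong(false_hits, true_hits):
    """Fraction of the returned search results that were false matches."""
    return false_hits / (false_hits + true_hits)


# Counts from the post: 162 notes, 1 correct match, 14 incorrect matches.
print(f"{share_of_notes_with_false_hit(14, 162):.1%}")  # 8.6%
print(f"{share_of_results_wrong(14, 1):.1%}")           # 93.3%
```

The first figure is measured against the whole notebook, the second against the results actually returned, which is why the two percentages differ so much.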

I can't be the only one with this problem, but I couldn't find anything on this topic in a cursory search of these forums.

This does remind me of the days when conspiracy theorists suspected "subliminal" advertising (or maybe it was real!). You know, the word "*****" embedded in every advertising image. In my case, "tax" is everywhere! Talk about unavoidable :)

2016-06-19_1551.png

2016-06-19_1550.png

_DSC6460.png

2016-06-19_1608.png

  • Level 5*

Sympathies with the problem, and I'm sorry that I just don't know whether OCR has improved to the point that this won't be an issue for you. Without knowing a lot more about the nature of the documents you're scanning I can't really suggest a comprehensive fix, but it does seem that this comes under the general heading of 'curation'. The more documents you have, the more likely it is that any given word or key phrase will turn up in irrelevant places. I get this from time to time with new searches, and I improve the construction of the search, change the titles/tags of the required note(s), and/or change the tags/titles of the unwanted ones to eliminate the issue. It hasn't been a problem, just an ongoing task.

If you have some pictures you wish to exclude from searches, you could tag them all 'picture' and include '-tag:picture' maybe?

Good luck with your scans... :)

  • Level 5

I used to see this OCR problem frequently back in 2011 and 2012, but seldom today. I believe the OCR software was significantly improved.

Here are a few steps I take to minimize the problem. The first two are the most important.

  • Everything is scanned to PDF format, never ever to JPG
  • My photos go to Flickr or Google Photos, seldom to Evernote
  • Any note containing a geographical map is tagged with x
  • Any note with an image that reveals a "hidden word" during a search is tagged with x
  • Level 5
6 hours ago, gazumped said:

If you have some pictures you wish to exclude from searches, you could tag them all 'picture' and include '-tag:picture' maybe?

I find (in a brief test) that including "-resource:image/*" in the search term eliminates all pictures/images from the search. Of course, that doesn't help if you're trying to find some images that actually have certain words in them. All in all, I think scanning documents to PDF rather than JPG might be safer.
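If typing these exclusions by hand is a nuisance, the two suggestions in this thread ('-tag:picture' and "-resource:image/*") can be assembled by a small script and pasted into the search box. This is only a sketch: `build_query` is a made-up helper, not part of any Evernote tool, and the quoting of multi-word tags is an assumption about how the search box treats them.

```python
def build_query(term, exclude_tags=(), exclude_images=False):
    """Compose a search string with the exclusions suggested in this thread.

    Hypothetical helper: it only builds text to paste into the search box.
    """
    parts = [term]
    for tag in exclude_tags:
        # Assumption: a multi-word tag needs quotes to stay a single token.
        value = f'"{tag}"' if " " in tag else tag
        parts.append(f"-tag:{value}")
    if exclude_images:
        # Drops every note that contains an image resource.
        parts.append("-resource:image/*")
    return " ".join(parts)


print(build_query("tax", exclude_tags=["picture"]))  # tax -tag:picture
print(build_query("tax", exclude_images=True))       # tax -resource:image/*
```

Keep in mind the trade-off already mentioned: excluding all image resources also hides any image that really does contain the word.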

Link to comment

Thanks for the input, all. I'll scan to PDF to hopefully avoid this issue. I'll ponder the "x" idea and test searching by excluding images.

I did find some older info today on OCR issues (brain fog yesterday). An Evernote employee (link below) suggested comparing the results of two searches to see if it helped with OCR issues like mine:

tax

vs

+text=tax

I've never seen this term "text" in Evernote syntax. I tried it and it eliminated all search results, even positive ones. Maybe I misunderstood her intention.

 


Archived

This topic is now archived and is closed to further replies.
