(Archived) Unreliable OCR

Barry · November 29, 2011

Hi,

I'm wondering if I've just been unlucky or if there are some issues with Evernote's OCR of PDFs. (I'm a premium user, by the way).

As an example, I've just done a search for the word storytelling, and Evernote found 11 notes. The trouble is, only 6 of them actually contain that word!

When I do the search on my Mac so the searched word is highlighted, what I see is that Evernote has recognised the following words as 'storytelling' - display, prompt, behaviours, strategies. I can almost forgive the last one, but the others are just bizarre.

In a search with a relatively small result set this isn't a big deal, but the more results there are the more frustrating it gets.

What's more worrying for me is whether there are other notes that do contain the search term, but Evernote isn't finding them.

I've been using Evernote and DevonThink alongside each other for a couple of years and really want to consolidate everything into just one of them, but I need to completely trust whichever system I'm using, and right now I don't feel that way about Evernote.

Barry
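For reference, the numbers in the post above work out as follows. This is only a quick arithmetic sketch: the variable names are made up for illustration, and the only inputs are the two counts Barry reports (11 notes returned, 6 that really contain the word).

```python
# Counts taken from the post above; variable names are illustrative only.
returned = 11   # notes Evernote returned for "storytelling"
true_hits = 6   # notes that actually contain the word

false_positives = returned - true_hits
precision = true_hits / returned

print(f"false positives: {false_positives}")
print(f"precision: {precision:.0%}")
```

Note that this says nothing about missed notes (false negatives). That figure cannot be computed from the search results alone, which is exactly the worry raised in the post.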

GrumpyMonkey · November 29, 2011

OCR is never 100%. OCR quality depends on your original sources. Are we talking about chicken scratches, blurry copies, or smudged characters? Also, remember that Evernote doesn't do OCR on everything, and it takes a little time before it completes the OCR. There are a few threads in the forums discussing its limitations.

Personally, I prefer to do my own OCR whenever possible (Adobe Acrobat Pro).

BurgersNFries · November 29, 2011

Also, if the results are words in photos, those are OCR'd with a tree of possibilities, since it's having to "guess" what the word is.

If you've scanned handwritten pages, they are best stored as images rather than PDFs for the above reason.

Barry · November 29, 2011

Thanks for the replies.

The originals are all good quality documents produced as PDFs. The documents were added to Evernote between Feb and August this year, so I figure they will have been OCR'ed by now.

gtuckerkellogg · November 29, 2011

I've had just great results from OCR on PDFs. Somewhat less on images, but that's because much of the text in those images is the handwriting of my young daughters.

GrumpyMonkey · November 29, 2011

Barry wrote: "Thanks for the replies. The originals are all good quality documents produced as PDFs. The documents were added to Evernote between Feb and August this year, so I figure they will have been OCR'ed by now."

There are some limitations. See here. My recommendation is to OCR them yourself.

https://support.evernote.com/link/portal/16051/16058/Article/591/How-does-Evernote-s-PDF-processing-work-for-scanned-PDFs

peterfmartin · November 29, 2011

Also, while the problem you cite, false positives, can be really annoying, it doesn't imply that the more serious problem of false negatives is equally likely. For the reasons the previous users said, OCR false positives are more likely, at least for resources that include typed text (like the PDFs you're talking about), than false negatives.
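The trade-off described here can be sketched in a few lines of Python. Everything below is an assumption made for illustration: the candidate lists, the confidence scores, and the function names are hypothetical and do not reflect Evernote's actual index format. The idea is simply that a search which accepts any candidate from a "tree of possibilities" will return more false positives, but is less likely to miss a real occurrence than a search that trusts only the single best guess.

```python
# Hypothetical recognition output: for each word region, a list of
# (candidate, confidence) pairs. Illustration only, not a real OCR format.
regions = [
    [("strategies", 0.81), ("storytelling", 0.12)],
    [("storytelling", 0.93)],
    [("display", 0.88), ("dismay", 0.07)],
]

def matches_any_candidate(regions, term):
    """Lenient search: a hit if the term appears among ANY candidate."""
    return [i for i, cands in enumerate(regions)
            if any(word == term for word, _ in cands)]

def matches_best_candidate(regions, term):
    """Strict search: a hit only if the term is the top-scoring candidate."""
    return [i for i, cands in enumerate(regions)
            if max(cands, key=lambda c: c[1])[0] == term]

lenient = matches_any_candidate(regions, "storytelling")
strict = matches_best_candidate(regions, "storytelling")
print(lenient)  # regions 0 and 1
print(strict)   # region 1 only
```

In this made-up data, the lenient search also returns region 0, where "storytelling" was only a low-confidence alternative to "strategies". That is a false positive, but the same leniency is what would rescue a genuine "storytelling" whose best guess came out wrong.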

Barry · November 29, 2011

@GrumpyMonkey - Thanks for the link, that's useful information. Not sure I want to go to the effort of doing the OCR myself when I can just drop them into DevonThink and have it automated. It just seems like an unnecessary extra step.

@peterfmartin - True, but anything that suggests the results aren't 100% reliable worries me.

Maybe I'm just too demanding!

BurgersNFries · November 29, 2011

If you're worried you won't be able to find a note, simply add the keywords to the note. I do this all the time. Rarely (I'd actually have to say never, so far, in over two years of using EN) do I need to find a PDF based upon the OCR process for all those words embedded within the doc, which I tend to find often distracting (i.e., false positives).

jbenson2 · November 29, 2011

One of the benefits of doing the OCR on my machine is that the PDF is searchable both in Evernote and elsewhere.

The Evernote OCR process only works in Evernote. Once the PDF is pulled and sent elsewhere it is no longer searchable.

Barry · November 29, 2011

@BurgersNFries - Thanks for the suggestion, but the tagging approach doesn't really work for me. It assumes that at the point I put something into the system I know how and why I might look for it later (and that isn't always the case). The storytelling search is a good example; a couple of the documents that came up in the search are ones that I would never have considered tagging as such when I filed them.

@jbenson2 - That's a good point. One of the benefits of using DevonThink is that the OCR'ed PDFs are searchable elsewhere too.

Even though I'm probably leaning more away from Evernote, this has been really useful. Thanks everyone for their replies. Oh, and just to be clear, I'm not knocking Evernote at all; it may just not be right for me.

GrumpyMonkey · November 30, 2011

No OCR is 100%. DevonThink won't be either. Sorry.

As for doing it yourself, it's not a pain at all, in my opinion. But if it doesn't fit in your workflow, then it doesn't.

JMichaelTX · November 30, 2011

peterfmartin wrote: "For the reasons the previous users said, OCR false positives are more likely, at least for resources that include typed text (like the PDFs you're talking about), than false negatives."

I disagree. It only takes one character to be "misread" by the OCR engine to cause the "word" to be wrong.

For "typed text" on a clean copy, the OCR should be near 100%.

The failed examples that the OP gave are, IMO, inexcusable, if the original was a clean typed copy.

Legal discovery companies have been doing very successful OCR with searchable databases for years.

OCR is an old technology.

I'm not sure what is going on with the Evernote OCR'd documents.

But I don't trust the EN OCR.

I'm with jbenson2 -- I do my own OCR using Adobe Acrobat, and it only takes a few moments.
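The point about a single misread character can be illustrated with a short Python sketch. The word list below is invented for the example, and `difflib` from the standard library is used only as a stand-in for approximate matching; nothing here is a claim about how Evernote, Acrobat, or any other OCR product searches its text.

```python
import difflib

# Hypothetical OCR output with one misread character ("l" read as "1").
ocr_words = ["the", "art", "of", "storyte1ling", "in", "strategies"]
term = "storytelling"

# Exact matching: the single wrong character makes the word unfindable.
exact = [w for w in ocr_words if w == term]

# Approximate matching: tolerate small differences between term and word.
fuzzy = difflib.get_close_matches(term, ocr_words, n=3, cutoff=0.85)

print(exact)
print(fuzzy)
```

With exact matching the damaged word is a false negative. Loosening the match recovers it, at the cost of admitting near-misses if the cutoff is set too low, which is the same precision-versus-recall tension discussed earlier in the thread.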

JMichaelTX · November 30, 2011

GrumpyMonkey wrote: "No OCR is 100%."

This is one of those things about never using "never" or "always".

I can generate repeatable 100% OCR: Print a document (Word, for example) using a standard font on bright white paper, and you can scan and OCR it with 100% accuracy.

GrumpyMonkey · November 30, 2011

GrumpyMonkey wrote: "No OCR is 100%."

JMichaelTX wrote: "This is one of those things about never using 'never' or 'always'. I can generate repeatable 100% OCR: Print a document (Word, for example) using a standard font on bright white paper, and you can scan and OCR it with 100% accuracy."

I don't think I used the words "never" or "always," but I stand by my contention that OCR is not 100%. I've had years of experience with software produced by various companies and I haven't found one yet that is 100%. In fact, I haven't seen anyone claim to be that good either.

JMichaelTX · November 30, 2011

I think we can agree that the OCR accuracy is highly dependent on the quality of the source document.

GrumpyMonkey · November 30, 2011

Yep. Garbage in equals garbage out!
