Barry

(Archived) Unreliable OCR

Hi,

I'm wondering if I've just been unlucky or if there are some issues with Evernote's OCR of PDFs. (I'm a premium user, by the way).

As an example, I've just done a search for the word storytelling, and Evernote found 11 notes. The trouble is, only 6 of them actually contain that word!

When I do the search on my Mac so the searched word is highlighted, what I see is that Evernote has recognised the following words as 'storytelling' - display, prompt, behaviours, strategies. I can almost forgive the last one, but the others are just bizarre.

In a search with a relatively small result set this isn't a big deal, but the more results there are the more frustrating it gets.

What's more worrying for me is whether there are other notes that do contain the search term, but Evernote isn't finding them.

I've been using Evernote and DevonThink alongside each other for a couple of years and really want to consolidate everything into just one of them, but I need to completely trust whichever system I'm using, and right now I don't feel that way about Evernote.

Barry

OCR is never 100%. OCR quality depends on your original sources. Are we talking about chicken scratches, blurry copies, or smudged characters? Also, remember that Evernote doesn't do OCR on everything, and it takes a little time before it completes the OCR. There are a few threads in the forums discussing its limitations.

Personally, I prefer to do my own OCR whenever possible (Adobe Acrobat Pro).

Also, if the results are words in photos, those are OCR'd with a tree of possibilities, since it's having to "guess" what the word is.

If you've scanned handwritten pages, they are best stored as images rather than PDFs for the above reason.
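To illustrate the "tree of possibilities" point above, here is a toy sketch of why indexing several guesses per word tends to produce false positives. This is not Evernote's actual implementation; the `build_index` function and the note data are made up for illustration, and the only assumption is that every alternative reading of an ambiguous word ends up searchable.

```python
from collections import defaultdict

def build_index(notes):
    """Map every candidate word to the set of note ids that might contain it."""
    index = defaultdict(set)
    for note_id, regions in notes.items():
        for candidates in regions:      # one list of guesses per word region
            for word in candidates:     # every guess becomes searchable
                index[word.lower()].add(note_id)
    return index

# Hypothetical data: each inner list holds the recognizer's alternative
# readings for one word on the page.
notes = {
    "note-1": [["storytelling"]],                # confident, correct reading
    "note-2": [["strategies", "storytelling"]],  # ambiguous region, wrong alternative kept
    "note-3": [["display"]],                     # confident, unrelated word
}

index = build_index(notes)
hits = sorted(index["storytelling"])
print(hits)
```

In this model note-2 is returned for "storytelling" even though the page really says "strategies". Keeping the extra guesses trades precision for recall: more spurious hits, but fewer missed notes.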

Thanks for the replies.

The originals are all good quality documents produced as PDFs. The documents were added to Evernote between Feb and August this year, so I figure they will have been OCR'ed by now :)

I've had great results from OCR on PDFs. Somewhat less so on images, but that's because much of the text in those images is the handwriting of my young daughters. :)

"Thanks for the replies.

The originals are all good quality documents produced as PDFs. The documents were added to Evernote between Feb and August this year, so I figure they will have been OCR'ed by now :)"

There are some limitations. See here. My recommendation is to OCR them yourself.

https://support.evernote.com/link/portal/16051/16058/Article/591/How-does-Evernote-s-PDF-processing-work-for-scanned-PDFs

Also, while the problem you cite, false positives, can be really annoying, it doesn't imply that the more serious problem of false negatives is equally likely. For the reasons the previous users said, OCR false positives are more likely, at least for resources that include typed text (like the PDFs you're talking about), than false negatives.

@GrumpyMonkey - Thanks for the link, that's useful information. Not sure I want to go to the effort of doing the OCR myself when I can just drop them into DevonThink and have it automated. It just seems like an unnecessary extra step.

@peterfmartin - True, but anything that suggests the results aren't 100% reliable worries me.

Maybe I'm just too demanding! ;)

If you're worried you won't be able to find a note, simply add the keywords to the note. I do this all the time. Rarely (I'd actually have to say never, so far, in over two years of using EN) do I need to find a PDF based on the OCR of all the words embedded within the doc, which I often find distracting (i.e., false positives).

One of the benefits of doing the OCR on my machine is that the PDF is searchable both in Evernote and elsewhere.

The Evernote OCR process only works in Evernote. Once the PDF is pulled and sent elsewhere it is no longer searchable.

@BurgersNFries - Thanks for the suggestion, but the tagging approach doesn't really work for me. It assumes that at the point I put something into the system I know how and why I might look for it later (and that isn't always the case). The storytelling search is a good example; a couple of the documents that came up in the search are ones that I would never have considered tagging as such when I filed them.

@jbenson2 - That's a good point. One of the benefits of using DevonThink is that the OCR'ed PDFs are searchable elsewhere too.

Even though I'm probably leaning more away from Evernote, this has been really useful. Thanks everyone for their replies. Oh, and just to be clear, I'm not knocking Evernote at all; it may just not be right for me.

No OCR is 100%. DevonThink won't be either. Sorry.

As for doing it yourself, it's not a pain at all, in my opinion. But, if it doesn't fit in your workflow, then it doesn't :)

"For the reasons the previous users said, OCR false positives are more likely, at least for resources that include typed text (like the PDFs you're talking about), than false negatives."

I disagree. It only takes one character to be "misread" by the OCR engine to cause the "word" to be wrong.

For "typed text" on a clean copy, the OCR should be near 100%.

The failed examples that the OP gave are, IMO, inexcusable, if the original was a clean typed copy.

Legal discovery companies have been doing very successful OCR with searchable databases for years.

OCR is an old technology.

I'm not sure what is going on with the Evernote OCR'd documents.

But I don't trust the EN OCR.

I'm with jbenson2 -- I do my own OCR using Adobe Acrobat, and it only takes a few moments.
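The point above about a single misread character can be sketched in a few lines. This is only an illustration under assumed inputs, not how Evernote or any other product actually searches: the `exact_hit` and `fuzzy_hit` helpers, the sample words, and the 0.85 threshold are all made up for the example. It uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher

def exact_hit(term, ocr_words):
    """Exact-match search: one wrong character hides the word completely."""
    return term in ocr_words

def fuzzy_hit(term, ocr_words, threshold=0.85):
    """Tolerant search: accept any word whose similarity to the term
    reaches the (arbitrary, illustrative) threshold."""
    return any(
        SequenceMatcher(None, term, word).ratio() >= threshold
        for word in ocr_words
    )

# Hypothetical OCR output where the first 'l' of "telling" was read as 'i'.
ocr_words = ["the", "art", "of", "storyteiling"]

print(exact_hit("storytelling", ocr_words))  # the note is missed (false negative)
print(fuzzy_hit("storytelling", ocr_words))  # the note is found despite the misread
```

A tolerant match recovers the missed note, but loosening the match is also how unrelated words start showing up in results, so the two kinds of error pull against each other.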

"No OCR is 100%."

This is one of those things about never using "never" or "always".

I can generate repeatable 100% OCR: Print a document (Word, for example) using a standard font on bright white paper, and you can scan and OCR it with 100% accuracy.

I don't think I used the words "never" or "always," but I stand by my contention that OCR is not 100%. I've had years of experience with software produced by various companies and I haven't found one yet that is 100%. In fact, I haven't seen anyone claim to be that good either.

I think we can agree that the OCR accuracy is highly dependent on the quality of the source document.
