
(Archived) Unreliable OCR



Hi,

I'm wondering if I've just been unlucky or if there are some issues with Evernote's OCR of PDFs. (I'm a premium user, by the way).

As an example, I've just done a search for the word storytelling, and Evernote found 11 notes. The trouble is, only 6 of them actually contain that word!

When I run the search on my Mac, where the matched word is highlighted, I can see that Evernote has recognised the following words as 'storytelling': display, prompt, behaviours, strategies. I can almost forgive the last one, but the others are just bizarre.

In a search with a relatively small result set this isn't a big deal, but the more results there are the more frustrating it gets.

What's more worrying for me is whether there are other notes that do contain the search term, but Evernote isn't finding them.

I've been using Evernote and DevonThink alongside each other for a couple of years and really want to consolidate everything into just one of them, but I need to completely trust whichever system I'm using, and right now I don't feel that way about Evernote.

Barry

Link to comment
  • Level 5*

OCR is never 100%. OCR quality depends on your original sources. Are we talking about chicken scratches, blurry copies, or smudged characters? Also, remember that Evernote doesn't do OCR on everything, and it takes a little time before it completes the OCR. There are a few threads in the forums discussing its limitations.

Personally, I prefer to do my own OCR whenever possible (Adobe Acrobat Pro).

Link to comment

Thanks for the replies.

The originals are all good-quality documents produced as PDFs. The documents were added to Evernote between Feb and August this year, so I figure they will have been OCR'ed by now :)

Link to comment
  • Level 5*

There are some limitations. See here. My recommendation is to OCR them yourself.

https://support.evernote.com/link/portal/16051/16058/Article/591/How-does-Evernote-s-PDF-processing-work-for-scanned-PDFs

Link to comment

Also, while the problem you cite, false positives, can be really annoying, it doesn't imply that the more serious problem of false negatives is equally likely. For the reasons the previous users said, OCR false positives are more likely, at least for resources that include typed text (like the PDFs you're talking about), than false negatives.
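To put rough numbers on the distinction, here is a minimal sketch using the figures from the original post (11 notes returned, 6 of them genuine). The function names are made up for illustration, and the number of missed notes is unknown, so the recall values below are purely hypothetical.

```python
def precision(returned, correct):
    """Share of returned notes that really contain the search term."""
    return correct / returned


def recall(correct, missed):
    """Share of notes containing the term that the search actually found."""
    return correct / (correct + missed)


# Figures reported in the original post: 11 results, 6 genuine.
returned_notes = 11
genuine_notes = 6
false_positives = returned_notes - genuine_notes

search_precision = precision(returned_notes, genuine_notes)
print(f"false positives: {false_positives}")
print(f"precision: {search_precision:.2f}")

# The number of missed notes (false negatives) is NOT known from the thread.
# These values are hypothetical, only to show how recall would respond.
for hypothetical_missed in (0, 1, 3):
    r = recall(genuine_notes, hypothetical_missed)
    print(f"if {hypothetical_missed} notes were missed, recall would be {r:.2f}")
```

The point of the sketch is that a low precision figure says nothing by itself about recall: the five false positives can be counted from the results list, while missed notes can only be found by checking documents that did not come up.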

Link to comment

@GrumpyMonkey - Thanks for the link, that's useful information. Not sure I want to go to the effort of doing the OCR myself when I can just drop them into DevonThink and have it automated. It just seems like an unnecessary extra step.

@peterfmartin - True, but anything that suggests the results aren't 100% reliable worries me.

Maybe I'm just too demanding! ;)

Link to comment

If you're worried you won't be able to find a note, simply add the keywords to the note. I do this all the time. Rarely (I'd actually have to say never, so far, in over two years of using EN) do I need to find a PDF based on the OCR'd words embedded within the doc, which I often find distracting (i.e., false positives).

Link to comment
  • Level 5

One of the benefits of doing the OCR on my machine is that the PDF is searchable both in Evernote and elsewhere.

The Evernote OCR process only works in Evernote. Once the PDF is pulled and sent elsewhere it is no longer searchable.

Link to comment

@BurgersNFries - Thanks for the suggestion, but the tagging approach doesn't really work for me. It assumes that at the point I put something into the system I know how and why I might look for it later (and that isn't always the case). The storytelling search is a good example; a couple of the documents that came up in the search are ones that I would never have considered tagging as such when I filed them.

@jbenson2 - That's a good point. One of the benefits of using DevonThink is that the OCR'ed PDFs are searchable elsewhere too.

Even though I'm probably leaning more away from Evernote, this has been really useful. Thanks everyone for their replies. Oh, and just to be clear, I'm not knocking Evernote at all; it may just not be right for me.

Link to comment
  • Level 5*

"For the reasons the previous users said, OCR false positives are more likely, at least for resources that include typed text (like the PDFs you're talking about), than false negatives."

I disagree. It only takes one character to be "misread" by the OCR engine to cause the "word" to be wrong.
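As a rough illustration of that point, here is a small sketch using Python's standard `difflib`. The misread word `storytel1ing`, the word list, and the 0.9 threshold are invented for the example; nothing here reflects how Evernote's OCR or search actually works.

```python
from difflib import SequenceMatcher


def exact_match(term, words):
    """Plain search: the term must appear exactly as typed."""
    return term in words


def fuzzy_match(term, words, threshold):
    """Return words whose similarity to the term meets the threshold."""
    return [
        w for w in words
        if SequenceMatcher(None, term, w).ratio() >= threshold
    ]


# Hypothetical OCR output: the second "l" of "storytelling" misread as "1".
ocr_words = ["storytel1ing", "strategies", "display"]

# One wrong character is enough for an exact search to miss the word.
found_exact = exact_match("storytelling", ocr_words)

# A tolerant search recovers it, as long as the threshold stays strict.
found_fuzzy = fuzzy_match("storytelling", ocr_words, threshold=0.9)

print(found_exact)
print(found_fuzzy)
```

Loosening the threshold makes a search more forgiving of misreads, but it can also start admitting unrelated words, which is one plausible way false positives and false negatives end up being traded against each other.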

For "typed text" on a clean copy, the OCR should be near 100%.

The failed examples that the OP gave are, IMO, inexcusable, if the original was a clean typed copy.

Legal Discovery companies have been doing very successful OCR with searchable DB for years.

OCR is an old technology.

I'm not sure what is going on with the Evernote OCR'd documents.

But I don't trust the EN OCR.

I'm with jbenson2 -- I do my own OCR using Adobe Acrobat, and it only takes a few moments.

Link to comment
  • Level 5*

"No OCR is 100%."

This is one of those things about never using "never" or "always".

I can generate repeatable 100% OCR: Print a document (Word, for example) using a standard font on bright white paper, and you can scan and OCR it with 100% accuracy.

Link to comment
  • Level 5*

I don't think I used the words "never" or "always," but I stand by my contention that OCR is not 100%. I've had years of experience with software produced by various companies and I haven't found one yet that is 100%. In fact, I haven't seen anyone claim to be that good either.

Link to comment

Archived

This topic is now archived and is closed to further replies.
