Challenges with Evernote's OCR Functionality and Document Searchability


Dear Forum Members,

I hope this post finds you well. I am reaching out today to share an experience and seek potential solutions about an issue I've been facing with Evernote - specifically in relation to its Optical Character Recognition (OCR) capabilities.

Evernote, as we know, is a tool that stands out for its ability to search for specific keywords within documents. However, I've been encountering difficulties when searching for specific keywords within scanned documents. The text within these scanned documents is clear and legible, and yet Evernote's OCR does not seem to be indexing all of them correctly.

For instance, I have multiple documents that contain correspondence from an individual named "Adam." When I input "Adam" into Evernote's search function, not all relevant documents are returned in the results. This has required me to manually sift through my files to find the necessary information, which, as you can imagine, is time-consuming and defeats the purpose of Evernote's celebrated search function.

I understand that some files, like password-protected PDFs, may not be accessible for OCR due to security measures. However, the documents I am referring to are regular, non-encrypted scanned documents. This has left me rather frustrated with the service.

Has anyone else encountered this issue, and if so, have you found a workaround or solution? Any insights into why Evernote might be failing to fully index legible, non-encrypted scanned documents would be greatly appreciated.

I value this community and am hopeful that someone may have faced and overcome similar challenges, or at least might shed some light on this situation. Thank you in advance for your time and help.

Kind Regards,
Kozz

Link to comment
  • Level 5*

Hi.  There are a couple of help pages that might... er,  help - Tips for searching scanned PDFs,  How Evernote makes text inside images searchable - that first one lists the types of PDF that can't be scanned.  No clue as to why some of your documents are missed out - it would be worth getting in touch with Support if the answer is not pretty clear from these pages...

EDIT - forgot to say:  I scan and OCR all (well, most) of my attachments which may be affecting my experience.

Link to comment

Yes, I've had this issue. I image documents (10k+ at least) directly into Evernote from my archival research, and sometimes realize that it's simply not catching everything. When that happens, I actually type the word in between images to make sure it gets caught the next time. Seems like there aren't many other options.

Link to comment
  • Level 5

Some hints:

  • Text in images (this means real images, not the scanned image of a text) will not be OCRed when in pdfs. It is only made searchable when in pictures (JPG, PNG or GIF).
  • Search uses a dictionary to improve search results. This depends on the language settings in your account settings on the web. The more languages set, the fuzzier OCR will be.
  • Since a dictionary approach sometimes misses a word, the word may be in the search index under a different, similar word.
  • When the OCR was done by EN, search only works inside of EN. Once the pdf is exported, it is a plain picture file, without text information.
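
To illustrate the third hint: if OCR stored a word under a near-miss spelling, a fuzzy comparison can surface it. The sketch below is only an illustration using Python's standard difflib module. The word list is made up, `near_misses` is a hypothetical helper, and nothing here reflects how Evernote's own index works; you would need text of your own (for example, from a local OCR pass) to run it against.

```python
import difflib


def near_misses(query, indexed_words, cutoff=0.6):
    """Return indexed words that closely resemble the query (case-insensitive)."""
    candidates = sorted({word.lower() for word in indexed_words})
    return difflib.get_close_matches(query.lower(), candidates, n=5, cutoff=cutoff)


# Hypothetical OCR output in which "m" was misread as "rn".
words_from_ocr = ["Dear", "Adarn", "results"]
print(near_misses("Adam", words_from_ocr))
```

A lower `cutoff` catches more misreadings but also returns more unrelated words, so it is a trade-off rather than a fix.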

Personally I mostly use the built-in OCR capabilities of the software of my ix500 scanner. This means the pdf receives a full text layer after the scanning. It is then imported into EN - I use Import Folders to move it into a note. These pdf files are searchable not only in EN, but in all programs, including macOS Spotlight.

Link to comment
  • Level 5*

As @BrooklynBen suggested - I was the former custodian of an archive of a few hundred thousand customer details for a big company, and the one thing I and every other IT staff member did on a daily basis was 'curation' - editing details / adding tags / changing titles / listing keywords...  there's always something to do to make searches more accurate and to eliminate false positives.  That's what owning a database involves.  Get used to it!

Link to comment
  • Level 5

Dump everything, forget about it 'til you need it.

At least that’s what I do. But maybe running my own OCR before uploading makes the difference.

It even finds words that are hidden in the fine print at the bottom of business stationery, which is often printed in light grey and hardly legible to the eye.

Link to comment
1 hour ago, PinkElephant said:

Text in images (this means real images, not the scanned image of a text) will not be OCRed when in pdfs. It is only made searchable when in pictures (JPG, PNG or GIF).

This is puzzling because I remember being able to search for keywords within these PDFs, and the program would not only find the keyword, but also highlight it in yellow, making it easier to locate within the document. However, this yellow feature no longer seems to function for me.

To clarify my understanding: PDFs that are text-searchable have undergone OCR in EN, whereas scanned PDFs (which are essentially collections of images) that cannot be searched by keyword have not undergone OCR, is this the case?

I was under the impression that EN would automatically apply OCR to any imported PDF or image. I have been under this impression since I first signed up to EN some years ago...

I assumed that this process might take place in the background after importing a PDF and that, while the OCR might not be immediately available, it would eventually scan any documents and apply OCR to them somehow?

Link to comment
  • Level 5
9 hours ago, KoZz said:

This is puzzling because I remember being able to search for keywords within these PDFs, and the program would not only find the keyword, but also highlight it in yellow, making it easier to locate within the document. However, this yellow feature no longer seems to function for me.

To clarify my understanding: PDFs that are text-searchable have undergone OCR in EN, whereas scanned PDFs (which are essentially collections of images) that cannot be searched by keyword have not undergone OCR, is this the case?

I was under the impression that EN would automatically apply OCR to any imported PDF or image. I have been under this impression since I first signed up to EN some years ago...

I assumed that this process might take place in the background after importing a PDF and that, while the OCR might not be immediately available, it would eventually scan any documents and apply OCR to them somehow?

What I said is this: Text in PDFs will be OCRed, except when in real pictures embedded into the pdf.

As an example: You have a pdf about wine. It shows a picture of a bottle, with the label, and a text below the picture with more information. All of it plain scanned, no text layer.

The OCR will now make all the text below the picture (and every other text not in pictures) searchable.

All text on the label, which is part of the wine bottle's picture, will not be OCRed.

Now you drink the wine, enjoy it and take a picture of the bottle. You open a note and drop the picture there (as jpg, png or gif).

The text from the label inside of the picture will be OCRed.

In the pdf you can search for information from the text surrounding the picture. Text only in the picture of the bottle will not be found. In the note with the picture of the label, text in the picture itself will be found.

Same with handwriting: Handwriting (which is a picture) in PDFs: Not OCRed. Handwriting in picture files will be OCRed.

That’s my understanding of how the OCR works.

About search results: Highlighting works for me (Mac desktop client).

About what to do: You can’t force a second OCR pass. What you can try from the hidden menu are the options to fix notes or to recreate the search index. All this fixes problems with your local database, not on the server.

Link to comment
  • Level 5*

Back in the day local OCR worked better than server OCR.  Not sure if V10 accepts locally OCR'd documents the same as Legacy did.  Easy enough test: OCR and replace in EN a few of the documents that did not appear in your search as they should have.  Then search for Adam again.

Link to comment
  • Level 5

Good point: If there is a text layer in the pdf already on import, it will not pass through the OCR process.

If that text layer is not complete, text that only exists as a picture of a text will not be found.

Link to comment

Thanks for the above posts guys...

I'm going to have to re-read the above a few times so I can wrap my brain around what is going on here. I'm still confused. Earlier today I searched for "latch speed", trying to find a note about how my door latch system works, and it found the note I was looking for. That note was a JPG, and the matching text was highlighted in yellow. Superb. Instant find. I thought that note had been scanned as a PDF, but it wasn't; as @PinkElephant stated above, there are no problems with OCR when it comes to native images. Cool.

Before sending this post I double-checked: under All Notes I typed Adam. Some notes have Adam in the title and so are brought to the surface. Other documents, PDFs, have Adam's name at the end of letters and are also brought up by the search, so I presume OCR is working there, but nothing is highlighted in yellow. I don't mind, as I have the documents.

So I'm thinking the PDFs I'm scanning that contain the name Adam (a physician) must be images amalgamated into a PDF. I really have no idea whether this is the case.

I've scanned all my documents the same way, using the ADF, the only exception being a change of printer when the print head died (HP Officejet Pro 9010 series). So it's strange that Adam is found in some PDFs when I search, while the note I was looking for (super important, as it contains medical results) was not found and I had to search for it manually.

So I'm still confused about why Adam is found in some PDFs where his name is not in the note title (so I assume OCR is working on those) but not in the important PDF.

Anyway, look, I don't want to be a nuisance. It just doesn't make sense why that one particular PDF doesn't want to play ball. A search for Adam pulls up some PDFs with his name in them while others are not returned, yet the same hardware and software were used to scan them.

Anyway guys thank you,. Let me go re-read the above over a cup of coffee and hopefully something pings off in me brain... I need an Ahh Ha moment here .. :blink:

Link to comment
On 7/8/2023 at 6:48 PM, CalS said:

Back in the day local OCR worked better than server OCR.  Not sure if V10 accepts locally OCR'd documents the same as Legacy did.  Easy enough test: OCR and replace in EN a few of the documents that did not appear in your search as they should have.  Then search for Adam again.

What do I use to OCR ?  Something like Adobe Acrobat which I use. I can open and scan for text then save ( which will make it keyword searchable) and then put back into EN ? But doesn't this defeat EN performing the task of carrying out the OCR? 

Link to comment
On 7/8/2023 at 10:00 PM, PinkElephant said:

Good point: If there is a text layer in the pdf already on import, it will not pass through the OCR process.

If that text layer is not complete, text that only exists as a picture of a text will not be found.

Silly question: is there a way to find out if a PDF has a text layer? If so, I can remove it and see if EN then performs the OCR task, no?

Link to comment
  • Level 5

This Photo.pdf does not have a text layer. I took it with the iPhone camera and printed the picture as a pdf. I have checked with PDF Expert that it only contains a graphics layer. The Mac's Preview app, however, immediately identified the text as such and allowed me to copy it right from the open file into the clipboard.

The second pdf was OCRed and has a text layer. The text is editable with PDF Expert, so it exists as text inside of the pdf. To show the text, I have moved the text fields a little to the top and the right of the picture layer.

Photo.pdf Photo with text.pdf
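
For anyone without a PDF editor, the same check can be roughly approximated in a script. This is a heuristic sketch only, using the Python standard library: it looks through the PDF's streams for text-showing operators (`Tj` / `TJ`), which a text layer normally contains and a pure image scan normally does not. The function names are hypothetical, and the sketch assumes the file is unencrypted and that its content streams are either uncompressed or Flate-compressed; other encodings, or text stored as mostly binary strings, can make it miss a text layer.

```python
import re
import zlib

# Raw stream bodies between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A string, hex string, or array operand followed by a text-showing operator.
SHOW_TEXT_RE = re.compile(rb"[)>\]]\s*(?:Tj|TJ)\b")


def mostly_printable(data, sample=2048):
    """Content streams read like text; image data is mostly binary bytes."""
    head = data[:sample]
    if not head:
        return False
    printable = sum(1 for b in head if b in (9, 10, 13) or 32 <= b < 127)
    return printable / len(head) > 0.9


def has_text_layer(pdf_bytes):
    """Heuristic: True if any stream appears to draw text on a page."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the line break left before "endstream".
            body = zlib.decompressobj().decompress(raw)
        except zlib.error:
            body = raw  # not Flate-compressed; inspect the bytes as they are
        if mostly_printable(body) and SHOW_TEXT_RE.search(body):
            return True
    return False
```

Usage would be along the lines of `has_text_layer(open("Photo.pdf", "rb").read())`. A result of False suggests an image-only scan; a result of True only says that some text exists, not that the text layer is complete.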

Link to comment
  • Level 5*
2 hours ago, KoZz said:

What do I use to OCR ?  Something like Adobe Acrobat which I use. I can open and scan for text then save ( which will make it keyword searchable) and then put back into EN ? But doesn't this defeat EN performing the task of carrying out the OCR? 

Not sure about defeating the purpose.  If local OCR is more complete than EN OCR it kind of is what it is.  Until/when/if EN changes its OCR engine you will continue to have missing notes in your searches.

Link to comment
  • Level 5

When EN does the OCR, a second pdf is created, holding the text layer. This pdf file is embedded into the note, invisible to the user. And it won't export. It only serves to support the full text search inside of EN.

So basically what I get when I do the OCR myself before uploading is the same in effect - but I can always export my notes, and still have searchable documents, because they hold their own text layer.

Link to comment
On 7/8/2023 at 3:12 AM, PinkElephant said:

What I said is this: Text in PDFs will be OCRed, except when in real pictures embedded into the pdf.

As an example: You have a pdf about wine. It shows a picture of a bottle, with the label, and a text below the picture with more information. All of it plain scanned, no text layer.

The OCR will now make all the text below the picture (and every other text not in pictures) searchable.

All text on the label, which is part of the wine bottle's picture, will not be OCRed.

This doesn't make any sense to me. I just did a quick test with my own notes and I'm getting the opposite of what you're claiming Evernote does.

I recently used Evernote's Scannable app on my iPhone to create PDFs of a ton of paper documents I had lying around so I could throw them away. So I have a bunch of fresh PDFs that only include images directly from my phone's camera. I did nothing to add OCR data to them, I have no plain text in the PDF anywhere. But I can use Evernote's search and it will return results that only include notes with these PDFs that only have images inside of them, because it managed to OCR the text in those images.

If the PDF already had an application do OCR on the images inside of it, that's when Evernote (as far as I'm aware) doesn't do its own OCR. (which I believe you say) I've had this issue in the past when using a Scansnap scanner where it would use its own OCR software, and the search results were poor in comparison to Evernote's systems. It's definitely "safer" to upload a bunch of images to a note because you won't have to deal with issues of a piece of software inserting its own interpretation of what text is in the PDF, but Evernote has the capability to run its own OCR on images in PDFs.

Link to comment
  • Level 5

You haven’t read the thread, have you? You just read the last postings, and now comment on something that hasn’t been said.

EN does its own OCR on pdfs, but that was not the question here.

EN OCR has three downsides, at least for me: First, you can search, but the text is not extractable. Second, the OCR result is only available for search inside of EN. When you export the pdf, it is a plain graphic file, without the embedded text layer that can be used by other apps outside of EN. Third, to use the built-in search in pdfs, you need to stay subscribed. If you go to Free, you lose a reliable search for pdf content.

This doesn’t mean OCR and search of pdfs in EN would be worthless, but anybody using it should be aware of the system limits.

Link to comment
23 minutes ago, PinkElephant said:

You haven’t read the thread, have you? You just read the last postings, and now comment on something that hasn’t been said.

EN does its own OCR on pdfs, but that was not the question here.

EN OCR has three downsides, at least for me: First, you can search, but the text is not extractable. Second, the OCR result is only available for search inside of EN. When you export the pdf, it is a plain graphic file, without the embedded text layer that can be used by other apps outside of EN. Third, to use the built-in search in pdfs, you need to stay subscribed. If you go to Free, you lose a reliable search for pdf content.

This doesn’t mean OCR and search of pdfs in EN would be worthless, but anybody using it should be aware of the system limits.

All those downsides are fair. But I'm confused about why you say I didn't read the thread. The original post was about search not finding PDFs whose text, held in images, Evernote should be able to search. As I understood what was being said, Evernote couldn't OCR images inside PDFs, so I was providing some evidence that it can.

Link to comment
  • Level 5*

As I said in my post, "back in the day"....

I filed a trouble ticket in August 2018 about search results not matching between iOS and Windows desktop.  The response I got said that the EN server OCR engine had an issue with some fonts.  The essence was that since I OCR'd on the desktop I got a better OCR.  To be clear, this issue may no longer exist, for all I know.  I just brought it up for the OP as a possible explanation for what they were seeing.

I left my EN data in a Basic account since I left.  I have access to legacy desktop and web V10.  The issue still exists.  More notes are returned on the legacy desktop and some valid notes are not returned on V10 web.  One data point in the wilderness.  This was a data issue pre V10 that EN opted not to address in my view.

Salient bit of the email exchange from 2018.

To first elaborate on why it works on Windows when compared to iOS - Windows uses a different search mechanism to extract text from PDFs than the one used by our server, which is what web and iOS use.

The issue you're seeing is related to server-side search, which seems to have an issue with recognizing the wrong font when scanning PDF files. The text in the PDF is printed in a monospaced font, so the letters are wider, but the OCR sometimes incorrectly recognizes it as a non-monospace font, so it sees the gap in position and thinks there's a space there.
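
The failure described in that email (a stray space inserted inside a word) can be worked around when you search text you have extracted yourself. The following is a minimal sketch under that assumption: it builds a regular expression that tolerates whitespace between the letters of a keyword. The helper names and the sample text are hypothetical, and this does not change how Evernote's own search behaves.

```python
import re


def spaced_pattern(keyword):
    """Regex matching the keyword even if whitespace was inserted between letters."""
    letters = [re.escape(ch) for ch in keyword if not ch.isspace()]
    return re.compile(r"\s*".join(letters), re.IGNORECASE)


def find_despite_gaps(keyword, ocr_text):
    """Return every stretch of text that spells the keyword, gaps included."""
    return [m.group(0) for m in spaced_pattern(keyword).finditer(ocr_text)]


# Hypothetical OCR output where a monospaced font produced spurious spaces.
print(find_despite_gaps("Adam", "Kind regards, A d am Smith"))
```

The pattern trades precision for recall: it would also match an unrelated phrase such as "a dam", so the hits still need a quick look.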

Link to comment
  • Level 5

There can be a problem as well when there is a text layer in a pdf - but it is not complete.

EN will not OCR if there is a text layer. It won’t check whether that text layer holds only partial information and skips part of the text that shows visually.

It all depends on the individual documents.

Link to comment
