
Searching text in graphics embedded in PDFs



I have a number of PDF files which contain embedded images. These graphics are scans of receipts for business expenses. I can search for text contained in the PDFs themselves, but I want to be able to search for text in the embedded images.

 

I can't seem to do that with my basic Evernote subscription. Is this something that would be available if I were to upgrade to the premium service?

 

Chris

 

Link to comment
  • Level 5*

Hi. OCR of PDFs is a premium feature, so yes, upgrading will give you access to more in-depth searching, though that also depends on the resolution and quality of your scans. Pre-OCR'd PDFs (text created by you in a 'searchable' file format, or images OCR'd locally in Adobe or another PDF editor) are searchable because there's already an index of the content that Evernote can copy when the file is added to a note. I tend to OCR everything, including receipts, before uploading to Evernote.

Link to comment

I have now signed up for an upgrade to Premium and added a couple of PDFs to my notebooks. I can search for text contained in the main body of the file (as I could with the Basic service), but unfortunately it still doesn't OCR any of the embedded images to make them searchable. I realise this would depend on the quality of the images, but these images are of scanned business receipts and much of the text is very clearly discernible in the graphics.

 

I'm a little disappointed with this because that was my only reason for paying extra for the Premium service.

 

Link to comment
  • Level 5*

Does your PDF file contain both text and image(s)?

 

If so, this may be the cause of your issue.

I have found with Adobe Acrobat Pro that if I attempt to OCR a PDF with both text and images on the same page it sometimes refuses to do the OCR.  I have no idea why.

 

There may be other PDF tools that will do an OCR in this case.  You would need to do the research.

 

You might try a test.  Scan one of your receipts to a PDF file, but add no text.

Then attach to an EN Note, and see if Evernote will do the OCR.
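
Before running that test, it can help to confirm whether a given PDF really mixes a text layer with embedded images. The following is only a minimal sketch using the Python standard library, not a real PDF parser: the function names `inspect_pdf_bytes` and `inspect_pdf` are made up for illustration, and the check assumes an unencrypted PDF whose content streams are either uncompressed or Flate-compressed. It will miss text stored in object streams or behind other filters, so treat the result as a hint.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of each object.
STREAM_RE = re.compile(rb"\bstream\r?\n(.*?)endstream", re.DOTALL)
# A text object (BT ... ET) that actually shows text with Tj or TJ.
TEXT_OP_RE = re.compile(rb"\bBT\b.*?(?:Tj|TJ)\b.*?\bET\b", re.DOTALL)
# Dictionary entry that marks an embedded image.
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def inspect_pdf_bytes(data: bytes) -> dict:
    """Heuristically report whether a PDF has a text layer and/or images."""
    has_image = bool(IMAGE_RE.search(data))
    has_text = False
    for match in STREAM_RE.finditer(data):
        # Look at the object header just before this stream and skip
        # image data, which could contain text-like bytes by accident.
        header_start = max(data.rfind(b"obj", 0, match.start()), 0)
        if IMAGE_RE.search(data[header_start:match.start()]):
            continue
        raw = match.group(1)
        try:
            # Assumption: content streams use FlateDecode (zlib).
            content = zlib.decompressobj().decompress(raw)
        except zlib.error:
            content = raw  # stream was not Flate-compressed
        if TEXT_OP_RE.search(content):
            has_text = True
            break
    return {"has_text": has_text, "has_image": has_image}


def inspect_pdf(path: str) -> dict:
    """Convenience wrapper that reads the file and runs the heuristic."""
    with open(path, "rb") as handle:
        return inspect_pdf_bytes(handle.read())
```

Calling `inspect_pdf("receipts.pdf")` (a hypothetical file name) returns a small dictionary. If both `has_text` and `has_image` come back true, the file is the mixed kind described above, and the image-only test scan is worth trying.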

Link to comment

Hi Chris,

 

If you're on Windows desktop, select a note that is not showing up in search and take a look at the information for that note (hit the "i" button in the note toolbar). Just check and see whether the attachments have been indexed or not. If not, hold Ctrl and click the Help menu (top of interface), then select "Fix current note". Then try searching for text in scanned images within that note's PDF. Just maybe...

 

Also, how are you trying to search the images in your PDFs? Are you trying with the Ctrl+F keyboard shortcut or using the search box? Ctrl+F does not search for OCRed text within images. That text has to be found in a "global" search (or whatever context you're filtering for) using the search box. Same goes for mobile devices. One would not use the magnifying glass icon to search for text within an OCRed image in a note.

 

Have you tried doing the same search on the web client?

Link to comment
  • Level 5*

Chris,

 

You can try to OCR the PDF yourself and see if you get an error message something like "Page contains renderable text".  If there is renderable text in a PDF, it will stop the OCR.  I believe this is more an Adobe thing than anything else.

 

One fix is to print the PDF to a new PDF and then OCR the new PDF.  You have to want the PDF to be searchable to follow this process, so it is okay for one-offs.  I found this out when trying a search on my cell phone bills and decided it wasn't worth the effort.  The company must include an image of their logo or something in some template print as opposed to a pure PDF print.  Guessing on my part since I'm not that familiar with the technology.  I haven't found any OCR tools that get past the renderable text issue.  FWIW.
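
For anyone who would rather script the "re-create the PDF, then OCR it" workaround than print each file by hand, here is a hedged sketch. It assumes the separate open-source command-line tool `ocrmypdf` is installed. That tool is not mentioned anywhere in this thread and has not been tested against Evernote here. Its `--force-ocr` option rasterizes every page before running OCR, which is roughly what printing to a new PDF achieves, and `--skip-text` instead leaves pages that already contain text untouched. The helper names `build_ocr_command` and `run_ocr` are made up for illustration.

```python
import shutil
import subprocess


def build_ocr_command(src: str, dst: str, force: bool = True) -> list:
    """Build the ocrmypdf command line for one input/output pair."""
    # --force-ocr: rasterize pages and OCR them even if text is present.
    # --skip-text: leave pages that already contain text as they are.
    mode = "--force-ocr" if force else "--skip-text"
    return ["ocrmypdf", mode, str(src), str(dst)]


def run_ocr(src: str, dst: str, force: bool = True) -> None:
    """Run ocrmypdf, failing early if the tool is not available."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    # check=True raises CalledProcessError if ocrmypdf reports a failure.
    subprocess.run(build_ocr_command(src, dst, force), check=True)
```

Something like `run_ocr("phone_bill.pdf", "phone_bill_ocr.pdf")` (hypothetical file names) would then produce a new file to attach to the note. Forcing OCR rasterizes the original text as well, so the output can be larger and the text less crisp than in the source PDF.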

Link to comment
  • Level 5*

Frank wrote: "Also, how are you trying to search the images in your PDFs? Are you trying with the Ctrl+F keyboard shortcut or using the search box? Ctrl+F does not search for OCRed text within images. That text has to be found in a "global" search (or whatever context you're filtering for) using the search box. Same goes for mobile devices. One would not use the magnifying glass icon to search for text within an OCRed image in a note."

 

Frank, perhaps EN Win and EN Mac behave differently in that respect.

A FIND (CMD-F) in EN Mac *does* find the OCR'd text in a PDF.  I just tested it in EN Mac 6.0.11 running Mavericks (10.9.5).

The same text in the PDF was found by a SEARCH.

This was a PDF that I scanned and OCR'd using Adobe Acrobat Pro XI before attaching to Evernote.

 

FWIW, I would expect a FIND in a specific note to always find the same text, whether OCR'd or not, as a SEARCH for the same set of characters.  Of course, there have been numerous bugs reported on Evernote Search consistency.

Link to comment

 


 

 

@JMichael,

 

Perhaps it is different... but I am specifically talking about images within a PDF... and any image in general... for example, a comic strip embedded in a PDF (or any image)... or a comic strip (or any image) all on its own in the form of a .GIF, .PNG, .JPEG etc.

 

One cannot search images via the Ctrl+F keyboard shortcut in Windows. I haven't tinkered on Mac yet. The OP mentions the problem as being specifically within embedded images within a PDF... not the text. Chris says that he can search the text... but not the embedded images. I'm guessing that his embedded images of receipts were not previously OCRed, and would rely on Evernote's OCR process. 

 

In order to search the text in images, one has to search using the search box. One cannot search directly within a note, as demonstrated below:

 

[Screenshot: OCR testing 2.png]

 

 

 

... One needs to search via the search box to find OCRed text within images. In fact, that's the only way I could have found the key words within the image in question at all, amidst 18,000 others: 

 

 

[Screenshot: OCR testing 1.PNG]

 

 

If I put the same image into a PDF (which in this case I did, because I used this strip as the basis for an English conversation class), I can search the text within the PDF via Ctrl+F... but not the image within the PDF. The text in the image needs to be found via the search box.

 

 

Most likely Chris' challenge is unrelated to the troubleshooting I'm putting forth... but it would be good to rule this out and move forward  :)

 

UPDATE: I can search for the text within an embedded image in a PDF only on the Web client... for some reason I cannot do the same on the Windows desktop client, even if I have "fixed" the note as I have done with other images imported directly into a note. Either way, still, the key words need to be found via the search box, not Ctrl+F.

Link to comment
  • Level 5*

Geez Frank, I wasn't questioning your veracity, I was simply pointing out a difference in behavior with EN Mac.

 

BTW, I also was talking about images in a PDF.  The basic output of a scanning process is an image, which may or may not be used to create a PDF file.  I'm not sure if a PDF created from an image retains the original image file format or not.  I would think it is documented somewhere, if it is really important to you.

 

As both Cal and I pointed out, Adobe Acrobat (and perhaps other PDF tools) do not like to OCR images in a PDF that also contains text.

I could be wrong, but I suspect that if Acrobat won't OCR it, then the Evernote OCR tool won't either.

 

That's why I suggested that the OP run a simple test to see if it was a basic Evernote issue, or something more complex.

Link to comment

We're on the same wavelength here, @JMichael. Just wanted to clarify what I posted earlier. I'm still learning as I plod along  :P

 

The Evernote tool does in fact OCR my PDFs' images... but as I mentioned in my last post's edit, for some strange reason the OCRed text in my embedded PDF image only came up in a search on the Web client (via the search box). Nothing I can do will make it so on the Windows desktop client  :(

 

Once again (in case the previous suggestion gets lost), I'd ask Chris to take a quick peek on the Web client to see if he can, like myself, search for text within the same PDF images. 

Link to comment

JMichael wrote:

"Does your PDF file contain both text and image(s)? If so, this may be the cause of your issue. I have found that if I attempt to OCR a PDF with both text and images on the same page it sometimes refuses to do the OCR. I have no idea why. There may be other PDF tools that will do an OCR in this case. You would need to do the research. You might try a test. Scan one of your receipts to a PDF file, but add no text. Then attach to an EN Note, and see if Evernote will do the OCR."

 

That's it. The PDF does contain both text and images in the same document. The text is searchable, but not the text contained in the images. I tried your suggestion of creating a PDF that contains no text at all and that one appears to have OCR'd correctly. The only catch was that although EN correctly identified the file containing the searched text, it didn't highlight the text in yellow as it normally does.

 

Frank wrote:

"If you're on Windows desktop, select a note that is not showing up in search and take a look at the information for that note (hit the "i" button in the note toolbar). Just check and see whether the attachments have been indexed or not. If not, hold Ctrl and click the Help menu (top of interface), then select "Fix current note". Then try searching for text in scanned images within that note's PDF. Just maybe...

Also, how are you trying to search the images in your PDFs? Are you trying with the Ctrl+F keyboard shortcut or using the search box? Ctrl+F does not search for OCRed text within images. That text has to be found in a "global" search (or whatever context you're filtering for) using the search box. Same goes for mobile devices. One would not use the magnifying glass icon to search for text within an OCRed image in a note.

Have you tried doing the same search on the web client?"

 

The "Info" box contains a line "Attachment Status: 1 PDF has not been indexed", but interestingly every PDF in every notebook also has that exact same line, even the ones that do OCR correctly. The "Fix current note" command didn't make any difference, and I can confirm that I am searching using the search box at the top of the window. Same problem with the web client too.

 

Cal wrote: "One fix is to print the PDF to a new PDF and then OCR the new PDF. You have to want the PDF to be searchable to follow this process, so it is okay for one-offs. I found this out when trying a search on my cell phone bills and decided it wasn't worth the effort."

 

I think I'm going to have to do something like that, although as you say it would be too much trouble to do that with every PDF. I'll probably try to find a way to have these files created as .jpg files from the start and abandon using PDFs altogether for this particular requirement.
 
I'd just like to thank everyone who spent their time offering suggestions for this. I really do appreciate the help, guys.

 

Link to comment

Hi Chris, have you tried searching for the text in your PDF images on the web client? It works for me, although neither on the desktop nor mobile clients. A search filters the note with your key words... but they are not highlighted in the actual image.

Link to comment


 

Hi Frank. I have tried searching using the web client, but the result is the same. The search comes back with "0 results found"

Link to comment

Archived

This topic is now archived and is closed to further replies.
