(Archived) Quality of OCR of PDFs lower than JPGs

Inverti · January 8, 2013

After testing the free account with scanned JPGs, which were OCRed very well, I bought the paid account so that I could also collect scanned PDFs. Unfortunately, I found that the OCR engine for JPGs is far superior to the one for PDFs. Here's an example:

1. I printed out the document with "Lorem ipsum" text.

2. I scanned it at 300 dpi, grayscale, to JPG and to PDF. I have a flatbed scanner, so the position of the document did not change between the two scans. The two scans look exactly the same, and they were made in perfect conditions for OCR (clear, black, printed text).

3. I uploaded those two documents to Evernote as two separate notes. Here they are:

http://www.evernote.com/shard/s43/sh/a89ba406-2ddc-4d68-85b0-5fda9cbcd9bd/c0414bd0ad35035ef564fe7566a87b48

http://www.evernote.com/shard/s43/sh/9c822ea1-7a20-46d9-aee7-a1e821403d8b/6c1a9334a4124b7dccb1f627fd638580

4. After a while the notes were processed and became searchable.

5. Unfortunately, when I search for words like "lorem", "tempor", or "laborum", only the JPG note is found; Evernote does not return the PDF note as relevant. However, searching for most other words, like "ipsum", works without any problems, and both notes are returned.

That seems strange to me; it looks like the OCR engine for PDFs is much worse than the one for JPGs. I think that uploading "real-life" PDFs (which are not so clear) will cause many more search problems. Could you please verify my observations?

6. When both notes are found, only the words in the JPG are highlighted in yellow in the Windows client. In the PDF nothing is highlighted. Is that the correct behaviour?
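For anyone who wants to repeat this comparison outside Evernote's search box, here is a minimal Python sketch of the check in step 5. It assumes you have somehow obtained the recognized text of each scan as a plain string; the sample strings below are made up for illustration and are not Evernote's actual OCR output, and the function names are hypothetical. It also uses simple whole-word matching, which is not necessarily how Evernote's search decides relevance.

```python
import re

def words_found(ocr_text, queries):
    """Return {query: bool} using case-insensitive whole-word matching."""
    tokens = set(re.findall(r"[a-z]+", ocr_text.lower()))
    return {q: q.lower() in tokens for q in queries}

def compare(jpg_text, pdf_text, queries):
    """Group the queries by which recognized text contains them."""
    jpg = words_found(jpg_text, queries)
    pdf = words_found(pdf_text, queries)
    return {
        "only_jpg": [q for q in queries if jpg[q] and not pdf[q]],
        "only_pdf": [q for q in queries if pdf[q] and not jpg[q]],
        "both": [q for q in queries if jpg[q] and pdf[q]],
    }

# Made-up sample strings standing in for exported OCR text (assumption).
jpg_text = "Lorem ipsum dolor sit amet, tempor incididunt, est laborum."
pdf_text = "Lorern ipsum dolor sit amet, ternpor incididunt, est laborurn."

result = compare(jpg_text, pdf_text, ["lorem", "tempor", "laborum", "ipsum"])
print(result)
```

With these invented inputs, the three words from step 5 land in `only_jpg` and "ipsum" lands in `both`, which is the same pattern as the search results described above. Whether the real PDF text contains misreadings of this kind is something only the actual recognized text can show.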

gazumped · January 8, 2013

Hi - I haven't found any problem with PDF OCR accuracy, though to be fair I do the OCR myself before uploading. Have you tried that?

I downloaded your PDF file but that was the original unOCR'd version you saved (it should be possible for you to download the OCR version if you right-click the file in your local note). Searches for 'lorem' came back 'not found'.

When I OCR'd this file here, the searches worked fine.

Inverti · January 8, 2013

I understand that I could use some external tool to OCR the PDF before uploading it to Evernote, but I don't want to do that for two reasons:

1. I don't have such a tool with my scanner, and buying one separately is expensive.

2. JPGs were OCRed perfectly without any external tools, even when I used the free account. So after buying the paid account, finding that the same scan saved as a PDF is OCRed much worse is a real surprise for me.

jbenson2 · January 8, 2013

To say a PDF OCR is "much worse" is a bit of a stretch when you are OCR'ing a "dead" language.

OCR is not a perfect science - it is not black and white.

I pulled the Searchable PDF that Evernote created from your PDF.

The results look pretty good to me.

http://www.evernote....b732436fc3f12f8

If you are prepared to put up with the downside of storing text as a JPG file, go ahead.

Just realize that you cannot put multiple pages into one JPG file.
