Selectable Text in OCR'd documents


This is a related enquiry about OCR'ing PDFs.

I've noticed that, in addition to some documents being OCR'd while others are not, there is a division among the successfully OCR'd documents.

Some OCR'd, searchable documents allow text to be selected (copy and paste), while others do not.

Here is an example:

 

H5NEQ7t.jpg

Link to comment
  • Level 5*
2 hours ago, spartaaaaa said:

This is a related enquiry about OCR'ing PDFs.

I've noticed that, in addition to some documents being OCR'd while others are not, there is a division among the successfully OCR'd documents.

Some OCR'd, searchable documents allow text to be selected (copy and paste), while others do not.

Here is an example:

I don't think that has anything to do with the OCR process

Some PDFs are actual text, and some are images of text

Link to comment
  • Level 5*

Yeah - it certainly used to be the case that when Evernote OCR'd a file, the text version was kept on the server, and the client's original file was left as uploaded, in line with Evernote's premise of "we'll never mess with your data".

In other words, the file remained images instead of those images being replaced by text. Where PDFs are OCR'd locally, the process replaces the former image(s) with the actual text. That makes the files significantly smaller when uploaded, and means that text highlighted in searches tended to be more accurately highlighted in the local-OCR file than in the Evernote-processed version. That text can also be copy/pasted more easily from the local-OCR files.

Not sure what the process is these days - I routinely OCR my own scanned content and have done for ages. Don't know if Evernote handles it differently these days or not.

Link to comment
6 hours ago, DTLow said:

I don't think that has anything to do with the OCR process

Some PDFs are actual text, and some are images of text

Perhaps I explained it poorly. Actual text PDFs are always selectable and searchable.

I am referring specifically to scanned PDFs, which are images with OCR applied on top. For some of these, it's possible to select parts of the image as text. (See my original image.)

5 hours ago, gazumped said:

Yeah - it certainly used to be the case that when Evernote OCR'd a file, the text version was kept on the server, and the client's original file was left as uploaded, in line with Evernote's premise of "we'll never mess with your data".

In other words, the file remained images instead of those images being replaced by text. Where PDFs are OCR'd locally, the process replaces the former image(s) with the actual text. That makes the files significantly smaller when uploaded, and means that text highlighted in searches tended to be more accurately highlighted in the local-OCR file than in the Evernote-processed version. That text can also be copy/pasted more easily from the local-OCR files.

Not sure what the process is these days - I routinely OCR my own scanned content and have done for ages. Don't know if Evernote handles it differently these days or not.

Interesting. If I may ask, do you have a good OCR scanner in mind for Mac? I really miss ScanSnap, as it comes with ABBYY software (which is really good imo).

Link to comment

So for the benefit of anyone reading this thread, I have some feedback.

After downloading a trial version of ABBYY FineReader Pro, I was able to get my old results back. AFR does an incredible job of OCR while making the text regions selectable. In fact, it even allows one to export a scanned PDF back to a Word document (it recognizes elements such as tables, fonts, etc.).

I now believe this could be the reason why some of my scanned-and-OCR'd documents have selectable text and some do not.

The ones that do have selectable text were originally fed through the Fujitsu ScanSnap software, and OCR was done there. The ones that do not were most likely OCR'd by the Evernote cloud. I recently learned that ABBYY is the engine that powers the ScanSnap software under the hood, which was a bit of an 'aha' moment :)

So where does this leave me? Well, on the one hand, the OCR results of ABBYY are simply stellar. OTOH, it's over 100 bucks! I might just ditch my all-in-one Epson and haul the ScanSnap out of storage just to use the software!

Link to comment
  • Level 5*
4 hours ago, spartaaaaa said:

So for the benefit of anyone reading this thread, I have some feedback.

After downloading a trial version of ABBYY FineReader Pro, I was able to get my old results back.

 

3 hours ago, spartaaaaa said:

FWIW, I selected some text on a recent document. You can see the result here in color. Wonderful.

Thanks for posting
I'm going to give ABBYY a try also

My understanding is that your OCR process is actually altering the document
It is no longer an image based pdf

When you're posting your results, you are actually selecting text in the altered document

Link to comment
6 minutes ago, DTLow said:

Thanks for posting
I'm going to give ABBYY a try too

My pleasure. Sounds good

6 minutes ago, DTLow said:

My understanding is that your OCR process is actually altering the document
It is no longer an image based pdf

When you're posting your results, you are actually selecting text in the altered document

Well, I think that's partly right. It's no longer *purely* image based.

My understanding is it becomes a hybrid using multiple layers. Some layers are text and bounding boxes, whereas the original image layer remains on top or below. (This is configurable in advanced options in ABBYY)
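
For anyone who finds the layer idea easier to follow in code, here is a rough sketch of it. This is only a toy model of my understanding, not how ABBYY or the PDF format actually stores things: the names `Box`, `Word`, `ScannedPage` and `select` are made up for illustration, and so is the tracking number. The scan stays untouched, and an invisible list of OCR'd words with bounding boxes sits alongside it. Selecting a region just returns the words whose boxes overlap it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (origin at top-left)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "Box") -> bool:
        # Two rectangles overlap when they overlap on both axes.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass(frozen=True)
class Word:
    text: str
    box: Box


@dataclass
class ScannedPage:
    """Hypothetical model of one page: the scan plus an optional OCR layer."""
    image: bytes                                          # original scan, never modified
    text_layer: List[Word] = field(default_factory=list)  # empty when no OCR text is embedded

    def select(self, region: Box) -> str:
        """Return the words under the dragged region, in reading order."""
        hits = [w for w in self.text_layer if w.box.intersects(region)]
        hits.sort(key=lambda w: (w.box.y0, w.box.x0))
        return " ".join(w.text for w in hits)


# A label OCR'd locally: the words are embedded next to the image.
label = ScannedPage(
    image=b"scan bytes",
    text_layer=[
        Word("Tracking:", Box(10, 10, 80, 22)),
        Word("1Z999", Box(85, 10, 130, 22)),   # made-up tracking number
        Word("Fragile", Box(10, 40, 60, 52)),
    ],
)

# The same scan with no embedded text layer.
image_only = ScannedPage(image=b"scan bytes")

print(label.select(Box(0, 5, 200, 25)))             # Tracking: 1Z999
print(repr(image_only.select(Box(0, 5, 200, 25))))  # ''
```

The second page is what I suspect my non-selectable files look like: with no words attached to the page itself, there is nothing for the viewer to select.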

 

Link to comment
  • Level 5

Not much to add, except that ABBYY really is a terrific piece of work. It's pricey, but does a tremendous number of things. It comes with a good screen reader, for instance, which you can call up, then select a portion of the screen and copy it as text to the clipboard or directly to Word or a new file. In a wide variety of languages, too.

Link to comment
  • Level 5*
3 hours ago, spartaaaaa said:

Well, I think that's partly right. It's no longer *purely* image based.

My understanding is it becomes a hybrid using multiple layers. Some layers are text and bounding boxes, whereas the original image layer remains on top or below. (This is configurable in advanced options in ABBYY)

 

As you and @gazumped noted, the EN OCR process doesn't work this way.  

They do not alter the document; the OCR is only for the purpose of building a search index

If this is important for you, it would be better to OCR your documents outside of Evernote
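
To illustrate the difference, here is a minimal sketch of a search index that lives apart from the file. It is an assumption-laden toy, not Evernote's actual implementation: `SidecarIndex`, the note ID, and the sample text are all hypothetical. The point is only that a document can be findable by its OCR'd words while its bytes stay exactly as uploaded, so the viewer still has no text to select.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Set


class SidecarIndex:
    """Hypothetical search index stored separately from the documents."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, recognized_text: str) -> None:
        # Record which document each recognized word came from.
        for token in recognized_text.lower().split():
            self._postings[token].add(doc_id)

    def search(self, term: str) -> Set[str]:
        # Return a copy so callers cannot mutate the index.
        return set(self._postings.get(term.lower(), set()))


scan = b"stand-in bytes for an image-only PDF"   # not a real PDF
digest_before = hashlib.sha256(scan).hexdigest()

index = SidecarIndex()
index.add("note-1", "Tracking 1Z999 Fragile")    # text an OCR engine might return

print(index.search("fragile"))                                 # {'note-1'}
print(hashlib.sha256(scan).hexdigest() == digest_before)       # True
```

The file's hash is unchanged after indexing, which is the "we don't alter the document" behaviour in miniature.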

Link to comment
8 minutes ago, DTLow said:

If this is important for you, it might be better to OCR your documents outside of Evernote

Yeah it's kind of a minor feature, and yet one of those things that you only sort of notice when it's not there anymore. For example yesterday I was trying to copy a tracking number out of a scanned label. Slightly annoying to have to manually type it out.

10 minutes ago, DTLow said:

As you noted, the EN OCR process doesn't work this way.  

They do not alter the document; the OCR is only for the purpose of building a search index

That's probably a good thing anyway, or at least as a default for average users. Not that it will ever happen, but as somewhat of a power user, I wouldn't mind having a text OCR layer added via EN cloud if it was an optional toggle. 

Link to comment

Archived

This topic is now archived and is closed to further replies.
