spartaaaaa

Selectable Text in OCR'd documents


This is a related enquiry about OCR'ing PDFs.

I've noticed that, in addition to some documents being OCR'd while others are not, there is a division among the successfully OCR'd documents.

Some OCR'd searchable documents allow text to be selected (cut and paste), while others do not.

Here is an example:

 

[Image: H5NEQ7t.jpg]

2 hours ago, spartaaaaa said:

This is a related enquiry about OCR'ing PDFs.

I've noticed that, in addition to some documents being OCR'd while others are not, there is a division among the successfully OCR'd documents.

Some OCR'd searchable documents allow text to be selected (cut and paste), while others do not.

Here is an example:

I don't think that has anything to do with the OCR process.

Some PDFs are actual text, and some are images of text.
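
For anyone who wants to check which kind of PDF they have, here is a rough sketch in Python using only the standard library. It is a crude heuristic, not a real PDF parser: it assumes page content streams are either uncompressed or Flate-compressed, and it simply looks for text-showing operators (`Tj`/`TJ` inside a `BT` block) and for image XObjects. The function name `classify_pdf_bytes` and the category labels are made up for this illustration.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of a PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
# A text object (BT ...) that shows a string with the Tj or TJ operator.
TEXT_OP_RE = re.compile(rb"\bBT\b.*?[)\]>]\s*(?:Tj|TJ)\b", re.DOTALL)
# An image XObject dictionary entry.
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def _inflate(raw: bytes) -> bytes:
    """Try Flate decompression; fall back to the raw bytes if that fails."""
    try:
        return zlib.decompressobj().decompress(raw)
    except zlib.error:
        return raw  # uncompressed, or a filter this sketch does not handle


def classify_pdf_bytes(data: bytes) -> str:
    """Heuristically label PDF bytes by the kinds of content they contain."""
    has_image = IMAGE_RE.search(data) is not None
    has_text = any(
        TEXT_OP_RE.search(_inflate(match.group(1)))
        for match in STREAM_RE.finditer(data)
    )
    if has_image and has_text:
        return "image+text"   # e.g. a scan with an added OCR text layer
    if has_text:
        return "text-only"    # a "real text" PDF
    if has_image:
        return "image-only"   # a plain scan: nothing to select
    return "unknown"
```

To try it on a file, something like `classify_pdf_bytes(Path("scan.pdf").read_bytes())` should work (the file name is a placeholder). PDFs that use object streams or other filters may come back as "unknown", so treat the answer as a hint only.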


Yeah - it certainly used to be the case that when Evernote OCR'd a file, the text version was kept on the server, and the client's original file was left as uploaded, in line with Evernote's premise of "we'll never mess with your data".

In other words, the file remained images instead of those images being replaced by text. Where PDFs are OCR'd locally, the process replaces the former image(s) with the actual text. That makes the files significantly smaller when uploaded, and means that text highlighted in searches tended to be more accurately highlighted in the local OCR file than in the Evernote-processed version. That text can also be copied and pasted more easily from the local-OCR files.

Not sure what the process is these days - I routinely OCR my own scanned content and have done for ages. Don't know if Evernote handles it differently these days or not.

6 hours ago, DTLow said:

I don't think that has anything to do with the OCR process

Some PDFs are actual text, and some are images of text

Perhaps I explained it poorly. Actual text PDFs are always selectable and searchable.

I am referring specifically to scanned PDFs, that is, images that have had OCR applied to them. For some of these, it's possible to select parts of the image as text. (See my original image.)

5 hours ago, gazumped said:

Yeah - it certainly used to be the case that when Evernote OCR'd a file, the text version was kept on the server, and the client's original file was left as uploaded, in line with Evernote's premise of "we'll never mess with your data".

In other words, the file remained images instead of those images being replaced by text. Where PDFs are OCR'd locally, the process replaces the former image(s) with the actual text. That makes the files significantly smaller when uploaded, and means that text highlighted in searches tended to be more accurately highlighted in the local OCR file than in the Evernote-processed version. That text can also be copied and pasted more easily from the local-OCR files.

Not sure what the process is these days - I routinely OCR my own scanned content and have done for ages. Don't know if Evernote handles it differently these days or not.

Interesting. If I may ask, do you have a good OCR scanner in mind for Mac? I really miss ScanSnap, as it comes with ABBYY software (which is really good, imo).


Hi again @spartaaaaa - I use a venerable ScanSnap 1500 on my Windows system. No experience with Macs, I'm afraid. Mine came with Acrobat for editing, and I use that for batch OCRing scans too.

34 minutes ago, gazumped said:

Hi again @spartaaaaa - I use a venerable ScanSnap 1500 on my Windows system. No experience with Macs, I'm afraid. Mine came with Acrobat for editing, and I use that for batch OCRing scans too.

No problem. Thanks.


So for the benefit of anyone reading this thread, I have some feedback.

After downloading a trial version of ABBYY FineReader Pro, I was able to get my old results back. FineReader does an incredible job of OCR while making the text regions selectable. In fact, it even allows one to export a scanned PDF to a Word document (it recognizes elements such as tables, fonts, etc.).

I now believe this could be the reason why some of my scanned-and-OCR'd documents have selectable text and some do not.

The ones that do have selectable text were originally fed through the Fujitsu ScanSnap software, and the OCR was done there. The ones that do not were most likely OCR'd by the Evernote cloud. I recently learned that ABBYY is the engine that powers the ScanSnap software under the hood, which was a bit of an 'aha' moment :)

So where does this leave me? Well, on the one hand, the OCR results of ABBYY are simply stellar. OTOH, it's over 100 bucks! I might just ditch my all-in-one Epson and haul the ScanSnap out of storage just to use the software!


FWIW, I selected some text on a recent document. You can see the result here in color. Wonderful.

[Image: Untitled.png]

4 hours ago, spartaaaaa said:

So for the benefit of anyone reading this thread, I have some feedback.

After downloading a trial version of ABBYY FineReader Pro, I was able to get my old results back.

3 hours ago, spartaaaaa said:

FWIW, I selected some text on a recent document. You can see the result here in color. Wonderful.

Thanks for posting.
I'm going to give ABBYY a try also.

My understanding is that your OCR process is actually altering the document.
It is no longer an image-based PDF.

When you're posting your results, you are actually selecting text in the altered document.

6 minutes ago, DTLow said:

Thanks for posting.
I'm going to give ABBYY a try too.

My pleasure. Sounds good

6 minutes ago, DTLow said:

My understanding is that your OCR process is actually altering the document
It is no longer an image based pdf

When you're posting your results, you are actually selecting text in the altered document

Well, I think that's partly right. It's no longer *purely* image based.

My understanding is that it becomes a hybrid using multiple layers. Some layers hold text and bounding boxes, while the original image layer remains on top or below. (This is configurable in the advanced options in ABBYY.)
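
To make the layer idea concrete, here is a small Python sketch of how selection could work on such a hybrid page. It is only a mental model, not how ABBYY or any PDF viewer is actually implemented. The class names (`Box`, `Word`, `HybridPage`), the sample words, and the top-left coordinate origin are all assumptions for the illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle; origin assumed top-left, y grows downward."""
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "Box") -> bool:
        # True only when the rectangles overlap with non-zero area.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass(frozen=True)
class Word:
    text: str
    box: Box  # where the OCR engine found this word on the scanned image


@dataclass
class HybridPage:
    image: bytes       # the original scan, kept as the visible layer
    words: List[Word]  # the text layer added by OCR

    def select(self, region: Box) -> str:
        """Return the words whose boxes overlap the dragged selection."""
        hits = [w for w in self.words if w.box.intersects(region)]
        hits.sort(key=lambda w: (w.box.y0, w.box.x0))  # reading order
        return " ".join(w.text for w in hits)


# Made-up sample data: a scanned label with four recognized words.
page = HybridPage(
    image=b"...scanned pixels...",
    words=[
        Word("Tracking", Box(10, 10, 80, 25)),
        Word("No.", Box(85, 10, 110, 25)),
        Word("12345", Box(115, 10, 170, 25)),
        Word("Fragile", Box(10, 40, 70, 55)),
    ],
)
print(page.select(Box(80, 5, 200, 30)))  # prints "No. 12345"
```

A purely image-based page would have an empty `words` list, so any selection would come back empty, which matches what happens with the scans that can't be copied from.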

 


Not much to add, except that ABBYY really is a terrific piece of work. It's pricey, but does a tremendous number of things. It comes with a good screenshot reader, for instance, which you can call up, then select a portion of the screen and copy it as text to the clipboard or directly to Word or a new file. It works in a wide variety of languages, too.

3 hours ago, spartaaaaa said:

Well, I think that's partly right. It's no longer *purely* image based.

My understanding is it becomes a hybrid using multiple layers. Some layers are text and bounding boxes, whereas the original image layer remains on top or below. (This is configurable in advanced options in ABBYY)

 

As you and @gazumped noted, the EN OCR process doesn't work this way.

They do not alter the document; the OCR is only for the purpose of building a search index.

If this is important for you, it would be better to OCR your documents outside of Evernote.
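
As a rough illustration of the "search index only" approach, here is a Python sketch in which the recognized text lives in a separate index and the uploaded file is never modified. This is a guess at the general shape of such a design, not Evernote's actual implementation. `Note`, `SidecarIndex`, and the sample tracking text are invented for the example.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Note:
    note_id: str
    attachment: bytes  # the uploaded scan, stored exactly as received


@dataclass
class SidecarIndex:
    """Word -> note ids, kept apart from the attachments themselves."""
    postings: Dict[str, Set[str]] = field(
        default_factory=lambda: defaultdict(set)
    )

    @staticmethod
    def _tokens(text: str):
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, note_id: str, recognized_text: str) -> None:
        # The OCR output is used only to fill the index.
        for word in self._tokens(recognized_text):
            self.postings[word].add(note_id)

    def search(self, query: str) -> Set[str]:
        words = self._tokens(query)
        if not words:
            return set()
        # A note matches only if it contains every query word.
        return set.intersection(*(self.postings.get(w, set()) for w in words))


note = Note("label-1", b"%PDF-... image-only scan ...")
before = note.attachment

index = SidecarIndex()
index.add(note.note_id, "Tracking number 1Z999")  # pretend OCR output

print(index.search("tracking 1z999"))  # prints {'label-1'}
print(note.attachment == before)       # prints True: file untouched
```

With this kind of design the note is findable by search, but the PDF itself still contains only pixels, so there is nothing for a viewer to select or copy.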

8 minutes ago, DTLow said:

If this is important for you, it might be better to OCR your documents outside of Evernote

Yeah, it's kind of a minor feature, and yet one of those things that you only sort of notice when it's not there anymore. For example, yesterday I was trying to copy a tracking number out of a scanned label. Slightly annoying to have to manually type it out.

10 minutes ago, DTLow said:

As you noted, the EN OCR process doesn't work this way.  

They do not alter the document; the OCR is only for the purpose of building a search index

That's probably a good thing anyway, or at least as a default for average users. Not that it will ever happen, but as somewhat of a power user, I wouldn't mind having a text OCR layer added via EN cloud if it was an optional toggle. 
