Mike Wood

PDF OCR not always working (premium account)


I've just upgraded to Premium and thought PDFs would become searchable like JPGs.

The yellow highlighting doesn't work within PDFs, but I understand this is a known issue!

My problem is that some image PDFs don't seem to get OCRed at all. See these two apparently identical notes: https://www.evernote.com/pub/m1kewood/ocrproblem

They appear to be the same, but one was emailed from PDF Expert to EN using the 'flattened copy' option. This copy searches fine, but the other can't be searched at all.

Perhaps we need a button to force OCR regardless of type, or an indicator that tells us whether EN considers the PDF to be pre-OCRed.


Don't know if I'm being slow, but if one of those files was mailed directly, how did the other one get into EN and when was it added? If you allow Evernote to do your OCR there are occasionally delays...


Both emailed into EN. I did it this way to ensure it was a fair example. Previously I had uploaded normally using the Windows client. I thought it might be the number of pages etc but the only thing that appears to fix it is this flattening option.

I thought it might be a delay too, but in my experience if it's going to work the documents are searchable within 5-10 minutes. I tried various methods and renamed the file etc.... As soon as I flatten, it becomes searchable almost immediately.

It's a solid/repeatable example of the problem.


Google says:

to flatten a document: to prevent PDF file from editing data inside it, finalize the signature on the document and ensure that all annotations can be viewed with any PDF reader application.

- which may indicate that the UNflattened file isn't totally compatible with Evernote's OCR - one of the rules for scanning files is (not unreasonably) - if you can't read the file, don't scan it...

Have you tried the same trick with PDF software that is not PDF Expert?


Obviously we need to understand why this isn't being OCRed. One of the most important features of EN is its ability to index text it sees in the documents/photos you send it. If I have to check every time I upload a PDF, then it's silly.

Also, this is a PDF that has been produced by someone else, so I'm not in control of the process. Yes, I can re-process it, and flattening appears to have worked here, but it may not work every time.

Do I need to report this as an official problem?


I get "around" the OCR issue by home-schooling my PDFs anyway. Means I don't have to worry about when or whether they'll be processed, and my OCR is a portable text layer in the same file - Evernote keeps the PDF file and the OCR'd text separate, so you won't necessarily find the file searchable if you download it to your desktop. Search the Forum for more on the relative merits and whether to OCR in-line or batch process afterward.

You could certainly submit a support ticket on this. At least you would get a definitive response on whether your software is compatible; but you appear to have found a corollary to Gaz's First Law (if it works, don't fix it): if it don't work, try other stuff until you find something that does...



Hi Mike. I don't have an answer for you. As Gaz said, I tend to OCR things before uploading them to my iPad or to my EN account. There are several reasons for this, but let's set those aside and try to figure out why the PDF isn't getting OCR'd. I'll put them in my account and see what happens.

The only thing I can say right now is that they were created using different versions of Adobe's software, and that the non-OCR document can be OCR'd by me on my machine. So, we know that OCR is at least possible! Now, we have to figure out why EN isn't doing it for you. Give me a little while to see how the files fare in my account.


For further intrigue I have re-processed the file in Paperport into a searchable PDF.... I now see why you OCR before uploading!

The yellow highlights now appear/work etc...

Obviously another issue entirely but this is how EN should OCR a PDF and I can't really see why it doesn't!

I have placed this in my public shared Notebook too.


I am quite confused at the moment. Almost immediately after putting the PDFs into my account they became searchable. A search of all notes for "dwellings", sorted by date updated, pulled up your two PDFs at the top of the list on my iPhone. They didn't come up on the Mac, though, until I synced a few more times. Strange.

Even stranger, I am seeing different results in the next few items returned from the search. I would have thought the search results would look identical, but they are not. I cannot explain this, and will report the behavior to the developers so that we can figure it out.

Your files are both being indexed and turn up 46 hits for "dwellings" in each one. So, everything looks good. Could you try them again?


Thanks Grumpy,

I did upload the file a few times previously with the same result. Rather than email, I'll upload via Windows EN, and I'll use a different filename too.

How are you seeing the number of hits?

Also, why is the iPad client so basic? I can't believe it doesn't show the PDF contents: the PDF just shows as an attachment, and if you open it you can't search the contents.


Hi. I used my Mac to see the number of hits. I went into the note and pressed CMD + F to search within the note. I think I would start a support ticket and mention in it that GrumpyMonkey/Christopher Mayo had no problem with OCR for the same files (mention the process too, I downloaded from your shared note, and you put them into your account differently). They'll probably know me, because I have a couple of support tickets open there, LOL :) If there is anything I can do to help, I'd be happy to share my experiences/logs with them.

As for the iPad, that is a whole other issue! I have posted threads with wish lists and so forth for the developers, so I won't go into all of the details about what I want to see.

The way I look at it, the Windows platform is the gold standard right now in terms of functionality, and the closer we can get all of the platforms to it, the better. The Mac developers have done a great job with the layout, and they have really stepped up the performance with fullscreen mode, so I think it will continue to look different than Windows (a good thing), but hopefully gain similar functionality.

iOS is another matter. I expect it will get a lot more useful in the future as EN developers improve it, but I don't expect it will ever quite meet the functionality found on the desktop versions. The iPad platform just has too many constraints. I am an amateur iPad app developer (working on one, but a long way from completion), and I could be wrong about this, but it seems to me that Evernote would somehow have to build its own PDF viewer into the iPad app in order to display PDFs inline, or incorporate the native PDF viewer into the note views. If either of these could be done, it would likely be a lot of work, with very little productivity payoff for most users. I think inline viewing would be a tremendous feat, but Evernote would be impressive enough if it found a way to open up PDFs with its own viewer, search through them, edit / annotate them, and save them back into the note. An integration with PDF Expert or iAnnotate (my app of choice) would be good enough for me :)

The workaround is to OCR the files yourself, extract the text, and paste it into a note. This has two benefits. First, it makes your PDFs searchable offline (via their copy/pasted text). Second, it makes it possible for you to have the data offline without downloading the PDF into your local drive (a smaller footprint). Obviously, you'll have to have the copy/pasted data kept in a separate note from the original PDF in order to take advantage of #2. See my site to get an idea of how this works for me.

http://www.princeton...ganization.html


I have uploaded the PDF again (199432.pdf) to the public notebook. No sign of it working yet!

So what PDF viewer is it loading into when you open it on the iPad? Is it an Apple one? It's just stupid that when it opens in this viewer you can't search. Once in the viewer, you can then send it to iBooks etc., where it is fully searchable, but you have the obvious file duplication issue, which I know is an iPad file system issue.

The answer here is simply a PDF app that talks directly to the Evernote site (a bit like the many that talk to Dropbox). Can't we persuade one of these developers to do something? LOL


The PDF viewer is Apple's native viewer (at least, that is my understanding). It doesn't have search capability, so neither does Evernote (at least, that is my understanding). You can open it in another app, which I think is the way to go forward, but you cannot save it back into the note (this is possible for other document types with at least one app, but not yet with PDFs). I imagine it is pretty tough or someone would have done it by now!

However things get sorted out, I think the iPad will likely always be a hobbled device (bad move by Apple, in my opinion, but I am no expert on this stuff). I still do most of my work on it, and I am quite happy with it, but there are so many constraints with it (like the lack of a file system), I don't expect as much from it. Maybe, if Windows can pull off something decent with their tablets + W8 then Apple will rethink things.


Which app allows editing other document types?

It's just a case of other app developers being persuaded that adding Evernote integration is worthwhile, although the other issue is how easy Evernote have made it. I've seen app developers add Dropbox support virtually overnight. They don't need to communicate with the native iPad app, just with the website, thus avoiding the filesystem issue.


I think apps work really well with Evernote -- you can get things easily into and out of it. The problem is saving something you have opened back to the same place in Evernote. I don't know what the obstacles to doing this might be, but if they could be overcome, you could open in PDF Expert, do your stuff, and then save it back into the app. Apparently, this is what the following company does with other file types, but it is the only company I know of that can do this (am I wrong?), so I don't think the solution is as easy as it might seem.

http://evernote.com/trunk/items/quickoffice-pro-hd?lang=en&layout=default&source=mobile_page


I'll have a play with that app. I think it gets more complicated when you consider what happens when the broadband connection is lost while editing a note etc.; you end up having to implement some sort of local copy and then sync.

But really, I'm not interested in updating at the moment. I want to get to a basic, useful, searchable system on the iPad that works with PDFs meaningfully.

I guess all EN needs to do when I click the PDF attachment is offer the option to select another PDF app, and I guess they could simply pass the note's URL to that app as they start it up.


UPDATE: that file is now OCRing OK, so very odd! I just emailed it in again to double-check as well, and it's instantly searchable!

The one that wasn't still isn't, but I guess it only gets processed once. I wonder whether it was due to the email processor not being made aware of my account type for 24 hours? I only set this account up recently and upgraded to Premium a couple of days ago.


When you upgrade to premium, newly added PDFs get put into the queue to be OCR'd. Existing PDFs get OCR'd over the next few days. Depending upon how many PDFs you have & the current workload of other users, it may take a week for existing PDFs to get processed.

Additionally,


Thanks, Burgers

It was uploaded after I went Premium, but it was probably the same day. And in this case we know this file complies with EN's OCR conditions. I think the answer may be to introduce a flag (column) that indicates whether EN has OCRed a file or, in the case of a PDF, perhaps three states: queued, OCR, embedded (text already embedded).

Without this you have no idea what's going on, or you mistakenly think it's unreliable.
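To make the three-state idea concrete, here is a minimal Python sketch. It is purely illustrative: `OcrStatus`, `ocr_status` and `is_searchable` are made-up names, not anything Evernote exposes, and the two boolean inputs are assumed to be facts the service already knows about each attachment.

```python
from enum import Enum


class OcrStatus(Enum):
    """Hypothetical values for the proposed status column (not an Evernote API)."""
    QUEUED = "queued"      # waiting for server-side OCR
    OCR = "OCR"            # server-side OCR has finished
    EMBEDDED = "embedded"  # the PDF already carried its own text layer


def ocr_status(has_text_layer: bool, ocr_done: bool) -> OcrStatus:
    """Pick the value the status column would show for one attachment."""
    if has_text_layer:
        # Pre-OCRed files never need to enter the queue.
        return OcrStatus.EMBEDDED
    return OcrStatus.OCR if ocr_done else OcrStatus.QUEUED


def is_searchable(status: OcrStatus) -> bool:
    """Only an attachment still in the queue cannot be searched yet."""
    return status is not OcrStatus.QUEUED
```

With something like this, a note list could show "queued" next to a freshly uploaded scan, so you would know to wait rather than assume OCR had failed.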

I have just checked: I went Premium on Wednesday 26th at about 11pm, and these files were emailed in as attachments after 5pm on the 27th.


@grumpy I tried the Quickoffice Pro app. It doesn't work as well as we hoped. It will send a document to the app, but it's named differently, so you have to remember what the note was titled in EN, find it in Quickoffice, and save it over the top.

You can, however, open an Evernote attachment from within Quickoffice, and it will save back as expected, but to find the attachment you can only search by its name. Hmmmm



I haven't used it, so I don't know much about it. It sounds like there are a few issues left to iron out. In my experience, it just works better (at the moment) to use Dropbox for ongoing stuff, and move everything to Evernote once it is completed. You get to take advantage of the strengths of each app on the iPad this way. Obviously, for PDFs, you can sync iAnnotate or PDF Expert to your Dropbox account to record any annotations or changes.


Yeah, I suspect it could be sorted by the EN app storing a temp file in your account with a reference that Quickoffice could use to get it back in the right place. Obviously both developers would have to talk to each other to arrange this. I'm not an Apple developer, but I guess there is very limited information that can be communicated directly between apps.

Or, thinking this through, EN could pass a simple txt file to Quickoffice rather than the document. The file could contain a predefined secret code word telling Quickoffice to treat its contents as a note and attachment name, which would then be opened directly from your account rather than from the EN app.
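A rough Python sketch of what such a hand-off file might look like, just to show the idea is small. Everything here is an assumption: the marker string, the JSON layout and the function names are invented for illustration, and neither Evernote nor Quickoffice is known to support anything like it.

```python
import json

# Hypothetical "secret code word" both apps would have to agree on.
MARKER = "EN-HANDOFF-1"


def build_handoff(note_id: str, attachment_name: str) -> str:
    """Text the sending app would write into the small txt file."""
    return json.dumps(
        {"marker": MARKER, "note": note_id, "attachment": attachment_name}
    )


def parse_handoff(text: str):
    """Return (note_id, attachment_name), or None for an ordinary text file."""
    try:
        data = json.loads(text)
    except ValueError:
        return None  # not JSON at all, so treat it as a normal document
    if not isinstance(data, dict) or data.get("marker") != MARKER:
        return None  # JSON, but not one of our hand-off files
    note, attachment = data.get("note"), data.get("attachment")
    if not isinstance(note, str) or not isinstance(attachment, str):
        return None
    return note, attachment
```

The receiving app would call `parse_handoff` on any text file it is handed; a `None` result means "open it as a normal document", anything else means "fetch this attachment from the account and save it back to the same note".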



What is a "flattened" or "unflattened" file or document?
