I am a new Evernote user so be kind 😁

 

I have added some PDFs to Evernote that were created with MS Print to PDF, but even after syncing and waiting (days, actually) they are not searchable. How can I make sure the PDFs become searchable?

I do have a pro account.

1 minute ago, ridleyrumpus said:

I have added some pdfs to Evernote from MS Print to Pdf but even after syncing and waiting, well actually days, they are not searchable. How can I make sure the pdfs become searchable.

Can you post a sample of the PDFs? I'll test it on my installation.

 

4 hours ago, ridleyrumpus said:

I am a new Evernote user so be kind 😁

 

I have added some pdfs to Evernote from MS Print to Pdf but even after syncing and waiting, well actually days, they are not searchable. How can I make sure the pdfs become searchable.

I do have a pro account.

Are you sure the notes are in a synced notebook and not a Local one?  If you are using Windows, highlight the notes with PDFs that aren't OCR'd, hold the Ctrl key while clicking on Help in the Menu bar, and then select Fix Selected Notes.  See if that helps.

3 hours ago, CalS said:

Are you sure the notes are in a synced notebook and not a Local one?  If you are using Windows, highlight the notes with PDFs that aren't OCR'd, hold the Ctrl key while clicking on Help in the Menu bar, and then select Fix Selected Notes.  See if that helps.

Tried it but made no immediate difference.

What does that do?

3 hours ago, CalS said:

Are you sure the notes are in a synced notebook and not a Local one?  If you are using Windows, highlight the notes with PDFs that aren't OCR'd, hold the Ctrl key while clicking on Help in the Menu bar, and then select Fix Selected Notes.  See if that helps.

I just tried creating a local notebook and it shows its name in "grey", which I presume indicates it is a local notebook?

All the other notebooks have white text, which I presume means these are synced notebooks?

If so, then all the pertinent notebooks are synced.

6 hours ago, ridleyrumpus said:

I can confirm no search text was returned

Just a guess, Evernote thinks the pdf is already a text pdf, or already OCRed

You might want to open a support ticket for this.   https://www.evernote.com/SupportLogin.action

1 minute ago, DTLow said:

I can confirm Evernote is bypassing the OCR process for this pdf - no search text was returned

Just a guess, the pdf is partially OCR'd already

You might want to open a support ticket for this.   https://www.evernote.com/SupportLogin.action

 

Thanks for the confirmation.

I think that the ones Evernote is not OCR'ing were created using the MS Print to PDF "print driver", which is a sod, as there are quite a lot of them and I cannot see how to tell them apart from other PDFs.

Hopefully there will be a way to fix this....

53 minutes ago, ridleyrumpus said:

Hopefully there will be a way to fix this...

As I said, check with Support

fwiw, I OCR'd the pdf externally: Living with autism Perception.pdf

I used app PDF OCR X


This should be checked with support:

Out of curiosity I have tried some things on this pdf as well.

With tools that are usually reliable, I could neither open nor edit the pdf from the google drive source. The search function of google drive shows „0“ hits even when the searched text is right there. So the pdf is not yet OCRed. 

Pretty weird. To me it seems the content is mostly a large picture embedded into a pdf shell instead of the usual pdf content. Only a small footer text is legible pdf content.

23 hours ago, DTLow said:

As I said, check with Support

fwiw, I OCR'd the pdf externally: Living with autism Perception.pdf

I use app PDF OCR X

I have now had a reply from support. They confirm that it is not searchable, and suggested seeing if it was searchable in other applications; it isn't.

But then they suggested opening it and then saving it in another application, so I tried doing that in Adobe and if I open it and then "Save as" it becomes searchable.

Now I have Adobe and Nuance so I have the ability to make a whole folder of pdf's searchable. What I would like to do is make sure that all of the pdf's already entered into Evernote are searchable.

Problems

  • I don't know where the pdf's are stored if at all.
  • In the beginning I was deleting files from my import folder once entered into Evernote
  • Not all pdf's would have been saved into my import folder but "sent" to Evernote.
  • Will processing them into searchable versions remove the tags etc that I have already added?
  • Processing every pdf one by one is not practical as there are hundreds if not thousands in there.

This could be a nightmare and rather removes the reason for using Evernote.

1 hour ago, ridleyrumpus said:

I don't know where the pdf's are stored if at all.

The pdfs are stored within the Evernote database, as attachments to notes.  
The notes can be identified by search resource:application/pdf

The pdf files can be exported (save attachments),
however the import creates new notes - you lose any tags, notes, ...
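
Since a re-import loses the tags, it may help to record which tags belong to which PDF before exporting anything. Below is a minimal sketch, assuming the notebook has first been exported to an Evernote `.enex` file (an assumption, nobody in this thread has tried it) and that the export uses the usual `note` / `title` / `tag` / `resource` / `mime` elements. The function name `pdf_notes` and the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET


def pdf_notes(enex_xml: str):
    """Yield (title, tags, pdf_file_names) for every note with a PDF attachment.

    Assumes the ENEX layout: <note> holds <title>, zero or more <tag>,
    and <resource> elements with <mime> and an optional file name.
    """
    root = ET.fromstring(enex_xml)
    for note in root.iter("note"):
        pdf_names = []
        for res in note.iter("resource"):
            if (res.findtext("mime") or "").strip() == "application/pdf":
                name = res.findtext("resource-attributes/file-name") or "(unnamed)"
                pdf_names.append(name.strip())
        if pdf_names:
            title = (note.findtext("title") or "").strip()
            tags = [t.text.strip() for t in note.findall("tag") if t.text]
            yield title, tags, pdf_names


# Hypothetical, hand-written sample standing in for a real export.
SAMPLE = """<en-export>
  <note>
    <title>Meeting memo</title>
    <tag>tribunal</tag>
    <tag>memo</tag>
    <resource>
      <mime>application/pdf</mime>
      <resource-attributes><file-name>memo.pdf</file-name></resource-attributes>
    </resource>
  </note>
  <note>
    <title>Photo</title>
    <resource><mime>image/png</mime></resource>
  </note>
</en-export>"""

for title, tags, files in pdf_notes(SAMPLE):
    print(title, tags, files)
```

For a real export, the file contents would be read from disk and passed in. The resulting list could then serve as a checklist when re-tagging the re-imported notes.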

>>they confirm that it is not searchable

A point on terminology. 
- We shouldn't need searchable PDFs.  
   Evernote is supposed to do its OCR magic if the pdf is not searchable


Depends on how serious you are about the ability to search full text. Whether in EN or somewhere else, these pdfs are not searchable in their current state. This has nothing to do with Evernote, no other program or indexing search will treat these as searchable files.

Problems top-down:

How are they distributed in the database (number of notebooks, any common properties like specific tags etc.)? First thing would be to let the past be the past for now, and start fresh with importing PDFs that are as you want them to be. Maybe create a tag to be able to identify these. Whenever an old one is converted, stick that tag on it as well.

Import folder? Yes, I do not see any sense in keeping files that are in Evernote now as a second copy somewhere else. My stuff goes into my Inbox notebook, if not otherwise specified. One reason I still use the Win client is the "Import Folder" feature. Next I try to keep the Inbox at zero, which means tagging and moving what is in there. My folders elsewhere I try to clean out every few weeks or so, to get rid of the duplicates and save disk space.

"Otherwise specified" means any direct import into notebooks other than the Inbox. I do this for example with workflows generated by the app ScannerPro: all these are automatically tagged with a tag "ScannerPro" and are mostly sent directly to specified notebooks other than the Inbox. So maybe think about making imported notes more identifiable by source in future. Applying an automatic tag does not cost anything, but lets you search for them. If ScannerPro were to create bad files one day, I could at least find them.

When you make your PDFs searchable with third-party applications, you have to export them from EN into that app, convert, and reimport them. If you manage to get them back into the same note they originated from, all properties will be kept (tags, creation date etc.). These "belong" to the note, not the attached PDF. The new PDF is just a new attachment to that note. If you create new notes instead, you need to re-tag. I think if it is just the tags, re-tagging might be simpler than manually moving the new attachments into the existing notes.

Finally, it is always frustrating if you have invested time and effort, and for some reason things do not work. It is the same when you discover in 2019 that going paperless is a realistic thing today, and look at your x meters of properly filed paper. Looking forward is liberating!

First, I would make sure that any new file is searchable from now on, and that I can identify these.

Second, I would give it a hard guess which of the old stuff I really need searchable in full text mode, and where some tags and note titles will do the job as well.

Happy cleaning !

On 5/26/2019 at 9:23 PM, DTLow said:

As I said, check with Support

fwiw, I OCR'd the pdf externally: Living with autism Perception.pdf

I use app PDF OCR X

Thanks for the reply. That is a "Shame".

I am using Evernote to build up indexable/searchable evidence for an Employment Tribunal case, so having everything searchable is sort of the point.

My problem now is identifying a workflow that creates searchable PDFs; some work is needed and I am on it. Using Adobe, I can tag all the files I convert.

The biggest problem is the data already entered. Is there any way that I can identify those that are NOT searchable without testing manually?


My knowledge of the different pdf formats is not good enough to answer this question. We have 3 levels of possible content here

  • pdfs with embedded pictures, not OCRable unless converted before
  • pdfs that are OCRable by the Evernote mechanism, not yet OCRed, OCR-data will be added by EN to the note (not the pdf)
  • pdfs that are OCRable and that are already OCRed by other software, OCR-data contained in the pdf file

I have no idea how to export all this, find the problematic ones by sorting or any other automatic screening, recreate them and send them back to where they came from. Maybe EN support can tell ...
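
One possible way to screen a folder of exported attachments, offered only as a crude sketch: look in the raw bytes of each PDF for image objects and font references. This is a heuristic under loud assumptions. It reads the file without decompressing anything, so PDFs that keep their objects in compressed streams will come out as "unknown" or be misclassified, and the label names are invented here. It only narrows down which files deserve a manual check.

```python
import re
from pathlib import Path

# Raw-byte markers; assumption: these dictionary entries are not compressed.
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")
FONT_RE = re.compile(rb"/Font\b")  # does not match /FontDescriptor


def classify_pdf_bytes(data: bytes) -> str:
    """Guess the content type of a PDF from uncompressed markers only."""
    has_image = bool(IMAGE_RE.search(data))
    has_font = bool(FONT_RE.search(data))
    if has_image and not has_font:
        return "image-only"   # likely needs OCR before it is searchable
    if has_image and has_font:
        return "image+text"   # may already be OCRed, or only partly text
    if has_font:
        return "text-only"    # probably searchable as it is
    return "unknown"          # markers hidden (e.g. compressed) or absent


def screen_folder(folder: str) -> dict:
    """Classify every *.pdf in a folder; returns {file name: label}."""
    return {
        p.name: classify_pdf_bytes(p.read_bytes())
        for p in sorted(Path(folder).glob("*.pdf"))
    }
```

Files labelled "image-only" or "unknown" would be the first candidates for a manual search test. A proper PDF library would give a more reliable answer than this byte scan.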

Basically I would not blame EN for not processing these files further. I tried programs that usually work on complex PDFs and can open them and make them editable. The PDFs that started this whole thread could not be edited by any means, because they basically contain a picture inside a PDF envelope. I ask myself why a program meant to create a PDF does something so senseless as to create a file that is of no use whatsoever, call it a "pdf" and dump it on the user, who expects something closer to the PDF standard. This is where the problem was built into these PDFs, not when Evernote started to screen them and found them not legible. EN will file it anyhow "as it is", as it does with other files it cannot open to build the search index. Then it simply holds the file as an attachment, and builds the search based on information like note title, other content, creation date, tags etc.


Understood.

However, Evernote indicates it will be able to make notes searchable; it is the main reason I bought in. It now looks like it can, sometimes. But more to the point, when it cannot, it does not let you know (or am I missing a flag somewhere?).

It would be useful if Evernote had the facility to flag notes that it could not OCR to allow you to go back and OCR only those ones manually. 

This is going to be a right PITA almost to the point of looking elsewhere. 

8 minutes ago, ridleyrumpus said:

However Evernote indicates it will be able to make notes searchable, it is the main reason I bought in. It now looks like it can, sometimes.

Here's documentation https://evernote.com/blog/how-evernotes-image-recognition-works/ with the note

For a PDF to be eligible for OCR, it must meet certain requirements:

  1. It must contain a bitmap image
  2. It must not contain selectable text (or, at least, a minimal amount)
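
Read literally, the two requirements above amount to a simple yes/no check. A minimal sketch follows, where the cutoff for "a minimal amount" of selectable text is a made-up number (the blog post does not give one) and the function name is hypothetical:

```python
def is_ocr_eligible(has_bitmap: bool, selectable_chars: int,
                    max_selectable_chars: int = 20) -> bool:
    """Apply the two documented requirements.

    max_selectable_chars is a hypothetical threshold standing in for
    "a minimal amount"; Evernote's real cutoff is not documented here.
    """
    return has_bitmap and selectable_chars <= max_selectable_chars


print(is_ocr_eligible(True, 0))    # scanned page, no text layer
print(is_ocr_eligible(True, 35))   # image plus a short text footer
print(is_ocr_eligible(False, 0))   # no bitmap at all
```

With a rule like this, an image-based PDF carrying a small text footer could fall on the wrong side of the cutoff and never be OCRed, which would fit what was seen with the sample file. That is a guess, not a confirmed explanation.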

 

 

 


Didn't the previous poster say that the pdf given as an example was a pic inside a pdf container? 

The text isn't selectable at least not until it is ocr'd. 


Yes, it is sort of an image, but it is not a plain bitmap. So not only can EN not handle this, but other programs built to modify most PDFs will not work either.

About OCRed before or by EN: When a pdf already contains significant searchable text, EN will not OCR it again. Usually the OCR quality of EN is pretty good, so I would test my usual setup (scanner, software etc.) once against the EN OCR. If the results from EN are better, I would rather disable my local OCR and let EN do the postprocessing. If the local OCR is crappy but has already been applied to a pdf, EN will not do it again, even when its results would be much better.

Bad search results can well be caused by a badly run OCR before loading a file into EN. 



Thanks for the reply. 

I have lots of PDFs that have been supplied to me, so I will have to OCR them in Acrobat prior to import. PITA, but I do not think I have a choice.

Such a pain that Evernote does not let you know when it tried and was unable to OCR a note.

 

On 5/28/2019 at 5:44 PM, ridleyrumpus said:

Didn't the previous poster say that the pdf given as an example was a pic inside a pdf container? 

Yes, the PDF contained both

- a bitmap image

- "selectable text"; just guessing on this

4 hours ago, ridleyrumpus said:

Thanks for the reply. 

I have lots of PDFs that have been supplied to me, so I will have to OCR them in Acrobat prior to import. PITA, but I do not think I have a choice.

Such a pain that Evernote does not let you know when it tried and was unable to OCR a note.

 

Agreed.  Typically I learn of an issue when I am looking for something and can't find it, but I know it is in EN.  Again, typically it is a renderable text issue for me.  Some statements, bills, advices, etc., embed some key piece of data in an image and it won't OCR.  Unless I save the PDF to image/TIFF and then combine the pages in Adobe and then recognize text.  PITA, for sure, but I think EN is on the back end of the problem.  Would be nice to be advised a PDF did not fully OCR though. 

7 hours ago, CalS said:

Agreed.  Typically I learn of an issue when I am looking for something and can't find it, but I know it is in EN.  Again, typically it is a renderable text issue for me.  Some statements, bills, advices, etc., embed some key piece of data in an image and it won't OCR.  Unless I save the PDF to image/TIFF and then combine the pages in Adobe and then recognize text.  PITA, for sure, but I think EN is on the back end of the problem.  Would be nice to be advised a PDF did not fully OCR though. 

I think that given this it will be best to assume that Evernote is NOT going to OCR anything.

 

It is just too much of a problem to try and fix later when it may or may not have worked and you have no way of knowing which it was. 🙁

 


GIGO

Garbage in, garbage out ...

I have several thousand pdf attachments in my EN account, from all sort of sources, and everything is working as needed (full search, highlighting of hits etc.). But if I throw non-conforming stuff at an algorithm, the program will go, check and deliver a non-result.

Where I agree: It would be good to get a feedback if a function on the server failed.

1 minute ago, PinkElephant said:

GIGO

Garbage in, garbage out ...

I have several thousand pdf attachments in my EN account, from all sort of sources, and everything is working as needed (full search, highlighting of hits etc.). But if I throw non-conforming stuff at an algorithm, the program will go, check and deliver a non-result.

Where I agree: It would be good to get a feedback if a function on the server failed.

 

That is my gripe: you can enter notes without knowing straight away whether they will OCR via EN, and if they don't, you have no way of knowing...

It would be nice to have a function to search for all those that are non-readable.

5 hours ago, ridleyrumpus said:

I think that given this it will be best to assume that Evernote is NOT going to OCR anything.  

You might be able to shrink the universe.  For me it only happens with a subset of downloaded PDFs.  I know the offending providers at this point.  PITA that I have to remember to check new ones.  But again, I'm not sure that in my use case the lack of OCR on these is an EN issue (though a notice/flag for lack of OCRing would help).  No OCR on PDFs containing renderable text is universal, as best I can tell.

No issues at all with PDFs created via my ScanSnap (I let it do the OCRing).   

1 hour ago, CalS said:

You might be able to shrink the universe.  For me It only happens with a subset of downloaded PDFs.  I know the offending providers at this point.  PITA I have to remember to check new ones.  But again, not sure in my use case the lack of OCR on these is an EN issue (though adding notice/flag of lack of OCRing would be something that would help).  No OCR on PDFs containing renderable text is universal best I can tell   

No issues at all with PDFs created via my ScanSnap (I let it do the OCRing).   

Nor here.

The offending PDF's have been sent to me in a bundle being used in a legal case. To make things interesting not all of them are a problem.


Just a thought: Knowing that legal counsel (especially the big law firms) extensively use data processing and data mining in legal issues, could it be that these PDFs were intentionally set up in a way to make OCRing them difficult ? I would not know how to do this, but it might make sense:

You hand over a volume of data to the opponent, putting the critical part into PDFs that will most likely not be OCRed. The other side will put everything into their machine, crunch it through and start working on it. Now, the interesting part is not found, because it is there, but not processed, and hidden among all the rest. If your OCR normally works fine, and you do not expect this, you will rely on search and similar functions, and will simply overlook what is there.

Who created these PDFs will not have violated discovery terms, because the information was delivered. Just too bad when the opponent did not find it.

As I have said, this is just free thinking out of thin air ...

54 minutes ago, PinkElephant said:

could it be that these PDFs were intentionally set up in a way to make OCRing them difficult ?

I think it's more an issue of Evernote's decision not to OCR the pdf.  
You'll have to make the same decision.

  1. Avoiding overwriting OCR text
  2. Avoiding unnecessary work load


Maybe. As I have posted above, I had difficulty processing the pdf with any program I have and use to work on PDFs. I do not have the original Adobe suite, but pretty good stuff that usually can open, edit and OCR every pdf I throw at it.

This is why I regard this sample pdf as weird.

And EN does not take a decision to OCR or not to OCR. It is not Hamlet 😉

As I understand it, any file uploaded will enter a queue. If it contains significant text information, it will not again be OCRed. If not, the server will try to OCR it. If this worked, the information will be added to make it searchable. If the OCR comes out blank (like with this file), the algorithm will „think“ there was nothing to OCR, and will move on to the next file to work on. 
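
The flow described in this paragraph can be sketched roughly as below. This is only an illustration of the poster's understanding, not Evernote's actual server code; the names `Outcome` and `process`, and the text threshold, are invented for the example.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    SKIPPED_HAS_TEXT = "already contains significant text, not OCRed again"
    INDEXED = "OCR produced text, added to the search index"
    BLANK = "OCR attempted but returned nothing"


def process(existing_text: str, run_ocr: Callable[[], str],
            max_selectable_chars: int = 20) -> Outcome:
    """Mimic the described queue step for one uploaded file.

    max_selectable_chars is a hypothetical cutoff for "significant" text.
    """
    if len(existing_text.strip()) > max_selectable_chars:
        return Outcome.SKIPPED_HAS_TEXT
    recognized = run_ocr()  # only called when the file has little or no text
    if recognized.strip():
        return Outcome.INDEXED
    return Outcome.BLANK


print(process("", lambda: "recognized words"))  # Outcome.INDEXED
print(process("", lambda: ""))                  # Outcome.BLANK
```

Returning an explicit outcome, rather than silently moving on, is what would make the "could not OCR" flag requested earlier in the thread possible: the `BLANK` and `SKIPPED_HAS_TEXT` cases could be surfaced to the user.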

12 minutes ago, PinkElephant said:

If it contains significant text information, it will not again be OCRed.

That is a decision to OCR or not OCR

My thinking is Evernote made the decision to not OCR the sample pdf


However ...

When I meet John Evernote next time, I will question him about his decisions 😉 I think he will be at the church BBQ next Sunday, together with Bill Appleseed.

31 minutes ago, PinkElephant said:

Maybe. As I have posted above, I had difficulties to process the pdf with any program I have and use to work on PDFs. I do not have the original Adobe suite, but pretty good stuff that usually can open, edit and OCR every pdf I throw on it.

This is why I regard this sample pdf as weird.

And EN does not take a decision to OCR or not to OCR. It is not Hamlet 😉

As I understand it, any file uploaded will enter a queue. If it contains significant text information, it will not again be OCRed. If not, the server will try to OCR it. If this worked, the information will be added to make it searchable. If the OCR comes out blank (like with this file), the algorithm will „think“ there was nothing to OCR, and will move on to the next file to work on. 

If, as I suspect, it is MS Print to pdf that is to blame I do wonder if it is deliberate or just terrible coding by MS. 

If I use the MS Print to pdf they do not work, if I use the Adobe equivalent, no problem.


Maybe it is Microsoft’s way of promoting its own XPS format as a pdf replacement, which up to now is dead in the water and not supported by any serious software provider I can think of.

But as you say, running Microsoft is like being on a submarine: The problems start immediately when you open the first window ... 

15 minutes ago, ridleyrumpus said:

I do wonder if it is deliberate or just terrible coding by MS. 

If I use the MS Print to pdf they do not work, if I use the Adobe equivalent, no problem.

MS is the dark side 🙂, but I don't think there's anything nefarious here.

They added some text to the PDF and Evernote mistook it

7 hours ago, PinkElephant said:

Just a thought: Knowing that legal counsel (especially the big law firms) extensively use data processing and data mining in legal issues, could it be that these PDFs were intentionally set up in a way to make OCRing them difficult ? I would not know how to do this, but it might make sense:

You hand over a volume of data to the opponent, putting the critical part into PDFs that will most likely not be OCRed. The other side will put everything into their machine, crunch it through and start working on it. Now, the interesting part is not found, because it is there, but not processed, and hidden among all the rest. If your OCR normally works fine, and you do not expect this, you will rely on search and similar functions, and will simply overlook what is there.

Who created these PDFs will not have violated discovery terms, because the information was delivered. Just too bad when the opponent did not find it.

As I have said, this is just free thinking out of thin air ...

And a jolly good point it is too.

Should I point out that the documentation was not only un-OCRable, but that it included all documents pertaining to an individual who had worked as a teacher for the past 15 years?

The documents, memos, notes, emails etc etc etc were wrapped up in one BIG pdf and each one was in there randomly by date and type.

I do not believe that the way the pdf was put together was a random act.


Was security applied to the PDF?

5 hours ago, ridleyrumpus said:

un OCRable

Can we get more details on this.

As per @Cals, was it a security thing?

One option is to generate a separate text file instead of updating the pdf

47 minutes ago, CalS said:

Was security applied to the PDF?

No. 

I had to physically collect and sign for the data in person. It was on CD. 

6 hours ago, ridleyrumpus said:

No. 

I had to physically collect and sign for the data in person. It was on CD. 

I meant more along the lines did the originator apply some form of protection to the PDF contents.  I have some financial institution PDFs where it can’t be printed for example.  


Maybe we should get this guy to post his tricks here.

Could be useful one day ...

8 hours ago, DTLow said:

Can we get more details on this.

As per @Cals, was it a security thing?

One option is to generate a separate text file instead of updating the pdf

I don't want to do that, as the files may be used in evidence, so I do not want to materially alter them or rely on a file derived from them that could be questioned etc.

2 hours ago, CalS said:

I meant more along the lines did the originator apply some form of protection to the PDF contents.  I have some financial institution PDFs where it can’t be printed for example.  

 

How does that work, ie the prevention of printing? 

There has to be just a flag on the file that prevents that. I would be surprised if there was not a pdf reader out there that ignored it, though I have been known to be wrong.

 

11 minutes ago, ridleyrumpus said:

How does that work, ie the prevention of printing? 

There has to be just a flag on the file that prevents that. I would be surprised if there was not a pdf reader out there that ignored it, though I have been known to be wrong.

https://www.adobe.com/content/dam/acom/en/products/acrobat/pdfs/adobe-acrobat-xi-protect-pdf-file-with-permissions-tutorial-ue.pdf
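
On the "just a flag" question: as far as I understand the PDF specification, the standard password-security handler stores the permissions as bit flags in an integer (the `/P` entry of the encryption dictionary), and it is up to each viewer to honor them. A small sketch decoding such a value follows. The bit positions are taken from the PDF spec's user access permissions table as I recall it, so treat them as an assumption to verify; the function name is made up.

```python
# Bit positions (1 = lowest-order bit) as listed in the PDF specification's
# user access permissions table; assumption: standard security handler, rev 3+.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy_or_extract": 5,
    "annotate": 6,
    "fill_forms": 9,
    "extract_for_accessibility": 10,
    "assemble": 11,
    "print_high_quality": 12,
}


def decode_permissions(p: int) -> dict:
    """Turn a /P value (a signed 32-bit integer) into named booleans."""
    return {name: bool(p & (1 << (bit - 1)))
            for name, bit in PERMISSION_BITS.items()}


print(decode_permissions(-4)["print"])     # True: all permission bits set
print(decode_permissions(-3904)["print"])  # False: printing not permitted
```

A value of `-4` has every permission bit set, while `-3904` has none of them set, so a compliant viewer would refuse to print the latter.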


When trying to open the pdf, or rather work on it, I did not have the impression that it was password-protected.

I think the issue preventing the OCR was the type or properties of the picture that made up for most of the file content.

46 minutes ago, ridleyrumpus said:

 

How does that work, ie the prevention of printing? 

There has to be just a flag on the file that prevents that. I would be surprised if there was not a pdf reader out there that ignored it, though I have been known to be wrong.

 

Screen shot of security that can be applied to a document via Adobe.  Not an expert, never really use anything other than encrypt a PDF myself.

[Attached screenshot: ScreenClip.png]

2 hours ago, ridleyrumpus said:

I do not want to materially alter them

That matches Evernote's policy;

  1. they will not update our pdfs in any way
  2. the OCR feature only runs if it will not overwrite text in the pdf

10 minutes ago, CalS said:

Well, try printing your problematic PDF to a PDF and OCR the resultant PDF.

My problem wasn't making them OCRable it was that I didn't know which needed to be converted.

I have had to save all attachments in a notebook to a folder, ocr them using Adobe Acrobat and then re import them, but by doing that I lose all the tags etc.

28 minutes ago, ridleyrumpus said:

My problem wasn't making them OCRable it was that I didn't know which needed to be converted.

I have had to save all attachments in a notebook to a folder, ocr them using Adobe Acrobat and then re import them, but by doing that I lose all the tags etc.

Got it.  You could OCR them in place and save the tags.  PITA I suppose....  If they all have the same tags not that hard to mass import and apply the tags.  Good luck in any case.
