ridleyrumpus

May 31, 2019

10 minutes ago, CalS said:

Well, try printing your problematic PDF to a PDF and OCR the resultant PDF.

My problem wasn't making them OCRable; it was that I didn't know which ones needed to be converted.

I have had to save all the attachments in a notebook to a folder, OCR them using Adobe Acrobat and then re-import them, but by doing that I lose all the tags etc.

May 31, 2019

1 hour ago, DTLow said:

https://www.adobe.com/content/dam/acom/en/products/acrobat/pdfs/adobe-acrobat-xi-protect-pdf-file-with-permissions-tutorial-ue.pdf

https://www.labnol.org/software/print-password-protected-pdf-files/29918/

Easy

May 31, 2019

54 minutes ago, CalS said:

Screen shot of security that can be applied to a document via Adobe. Not an expert, never really use anything other than encrypt a PDF myself.

https://www.labnol.org/software/print-password-protected-pdf-files/29918/

Google is your friend.

Tested, and yes: you can easily print a print-protected pdf and permanently remove the restriction.

May 31, 2019

2 hours ago, CalS said:

I meant more along the lines did the originator apply some form of protection to the PDF contents. I have some financial institution PDFs where it can’t be printed for example.

How does that work, ie the prevention of printing?

There has to be just a flag on the file that prevents that. I would be surprised if there was not a pdf reader out there that ignored it, though I have been known to be wrong.
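For anyone wondering what that flag looks like: in the PDF format the permissions are stored as one signed 32-bit integer, the /P entry of the file's encryption dictionary, and each permission is a single bit in it. Below is a minimal sketch of decoding such a value. The bit positions follow the PDF specification; the function name and the dictionary keys are my own labels, and the two sample values are just illustrative numbers, not taken from any of the files discussed here.

```python
# Permission bits of the /P entry, numbered from 1 (lowest bit) as in the PDF spec.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy": 5,
    "annotate": 6,
    "fill_forms": 9,
    "extract_accessibility": 10,
    "assemble": 11,
    "print_high_quality": 12,
}


def decode_permissions(p: int) -> dict[str, bool]:
    """Turn a /P value (signed 32-bit integer) into named True/False permissions."""
    flags = p & 0xFFFFFFFF  # view the signed value as 32 unsigned bits
    return {
        name: bool(flags & (1 << (bit - 1)))
        for name, bit in PERMISSION_BITS.items()
    }


if __name__ == "__main__":
    for value in (-3904, -44):
        print(value, decode_permissions(value))
```

A set bit means the action is allowed. As far as I understand it, when only an owner password is set the reader can still open the content, so the restriction only holds for as long as the reading program chooses to honour the bits.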

May 31, 2019

8 hours ago, DTLow said:

Can we get more details on this.

As per @Cals, was it a security thing?

One option is to generate a separate text file instead of updating the pdf

I don't want to do that, as the files may be used in evidence, so I do not want to materially alter them or rely on a file derived from them that could be questioned etc.

May 31, 2019

47 minutes ago, CalS said:

Was security applied to the PDF?

No.

I had to physically collect and sign for the data in person. It was on CD.

May 30, 2019

7 hours ago, PinkElephant said:

Just a thought: Knowing that legal counsel (especially the big law firms) extensively use data processing and data mining in legal issues, could it be that these PDFs were intentionally set up in a way to make OCRing them difficult ? I would not know how to do this, but it might make sense:

You hand over a volume of data to the opponent, putting the critical part into PDFs that will most likely not be OCRed. The other side will put everything into their machine, crunch it through and start working on it. Now, the interesting part is not found, because it is there, but not processed, and hidden among all the rest. If your OCR normally works fine, and you do not expect this, you will rely on search and similar functions, and will simply overlook what is there.

Who created these PDFs will not have violated discovery terms, because the information was delivered. Just too bad when the opponent did not find it.

As I have said, this is just free thinking out of thin air ...

And a jolly good point it is too.

Should I point out that the documentation was not only un-OCRable but that it included all the documents pertaining to an individual who had worked as a teacher for the past 15 years?

The documents, memos, notes, emails etc etc etc were wrapped up in one BIG pdf and each one was in there randomly by date and type.

I do not believe that the way the pdf was put together was a random act.

May 30, 2019

31 minutes ago, PinkElephant said:

Maybe. As I have posted above, I had difficulties to process the pdf with any program I have and use to work on PDFs. I do not have the original Adobe suite, but pretty good stuff that usually can open, edit and OCR every pdf I throw on it.

This is why I regard this sample pdf as weird.

And EN does not take a decision to OCR or not to OCR. It is not Hamlet 😉

As I understand it, any file uploaded will enter a queue. If it contains significant text information, it will not again be OCRed. If not, the server will try to OCR it. If this worked, the information will be added to make it searchable. If the OCR comes out blank (like with this file), the algorithm will „think“ there was nothing to OCR, and will move on to the next file to work on.

If, as I suspect, it is MS Print to pdf that is to blame, I do wonder if it is deliberate or just terrible coding by MS.

If I use MS Print to pdf the files do not work; if I use the Adobe equivalent, no problem.
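To make the queue behaviour PinkElephant describes above easier to follow, here is a rough sketch of it as I read it. This is not Evernote's actual code: the function, the outcome names and the `min_chars` threshold are all invented for illustration. It only shows why a blank OCR result ends up looking the same as "nothing to do".

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    ALREADY_TEXT = "file already contains text, OCR skipped"
    INDEXED = "OCR succeeded, text added to the search index"
    BLANK = "OCR returned nothing, file stays unsearchable"


def process_upload(
    embedded_text: str,
    run_ocr: Callable[[], str],
    min_chars: int = 20,  # hypothetical cut-off for "significant text"
) -> Outcome:
    """Decide what happens to one queued file, following the described steps."""
    if len(embedded_text.strip()) >= min_chars:
        return Outcome.ALREADY_TEXT
    if run_ocr().strip():
        return Outcome.INDEXED
    # Nothing came back: the queue just moves on and the user is never told.
    return Outcome.BLANK
```

If the behaviour really is like this, the last branch is where a "could not OCR" flag on the note would need to be set.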

May 30, 2019

1 hour ago, CalS said:

You might be able to shrink the universe. For me it only happens with a subset of downloaded PDFs. I know the offending providers at this point. PITA, I have to remember to check new ones. But again, not sure in my use case the lack of OCR on these is an EN issue (though adding a notice/flag of lack of OCRing would be something that would help). No OCR on PDFs containing renderable text is universal, best I can tell.

No issues at all with PDFs created via my ScanSnap (I let it do the OCRing).

Nor here.

The offending PDF's have been sent to me in a bundle being used in a legal case. To make things interesting not all of them are a problem.

May 30, 2019

1 minute ago, PinkElephant said:

GIGO

Garbage in, garbage out ...

I have several thousand pdf attachments in my EN account, from all sorts of sources, and everything is working as needed (full search, highlighting of hits etc.). But if I throw non-conforming stuff at an algorithm, the program will go, check and deliver a non-result.

Where I agree: It would be good to get a feedback if a function on the server failed.

That is my gripe: you can enter notes but not know straight away that they will not OCR via EN, and if they don't you have no way of knowing...

It would be nice to have a function to search for all the non-readable ones.

May 30, 2019

7 hours ago, CalS said:

Agreed. Typically I learn of an issue when I am looking for something and can't find it, but I know it is in EN. Again, typically it is a renderable text issue for me. Some statements, bills, advices, etc., embed some key piece of data in an image and it won't OCR. Unless I save the PDF to image/TIFF and then combine the pages in Adobe and then recognize text. PITA, for sure, but I think EN is on the back end of the problem. Would be nice to be advised a PDF did not fully OCR though.

I think that given this it will be best to assume that Evernote is NOT going to OCR anything.

It is just too much of a problem to try and fix later when it may or may not have worked and you have no way of knowing which it was. 🙁

May 29, 2019

Thanks for the reply.

I have lots of PDFs that have been supplied to me, so I will have to OCR them in Acrobat prior to import. PITA, but I do not think I have a choice.

Such a pain that Evernote does not let you know that it tried and was unable to OCR the note.

May 29, 2019

Thanks for the reply.

Didn't the previous poster say that the pdf given as an example was a pic inside a pdf container?

The text isn't selectable at least not until it is ocr'd.

May 29, 2019

Understood.

However, Evernote indicates it will be able to make notes searchable; it is the main reason I bought in. It now looks like it can, sometimes. But more to the point, when it cannot it does not let you know (or am I missing a flag somewhere?).

It would be useful if Evernote had the facility to flag notes that it could not OCR to allow you to go back and OCR only those ones manually.

This is going to be a right PITA almost to the point of looking elsewhere.

May 27, 2019

2 hours ago, PinkElephant said:

Depends on how serious you are about the ability to search full text. Whether in EN or somewhere else, these pdfs are not searchable in their current state. This has nothing to do with Evernote, no other program or indexing search will treat these as searchable files.

Problems top-down:

How are they distributed in the database (number of notebooks, any common properties like specific tags etc.)? First thing would be to let the past be the past for now, and start fresh with importing pdfs that are as you want them to be. Maybe create a tag to be able to identify these. Whenever an old one is converted, stick that tag on it as well.

Import folder ? Yes, I do not see any sense in keeping files that are in Evernote now as a second copy somewhere else. My stuff goes into my Inbox-notebook, if not otherwise specified. One reason I still use the Win client is the "Import Folder" feature. Next I try to keep the Inbox to zero, which means tagging & moving what is in there. My folders elsewhere I try to clean out every few weeks or so, to get rid of the duplicates and save disk space.

Otherwise specified means any direct import into notebooks other than the Inbox. I do this for example with workflows generated by the app ScannerPro - all these are automatically tagged with a tag "ScannerPro" and are mostly sent directly to specified notebooks other than the Inbox. So maybe think about making imported notes more identifiable by source for the future. Applying an automatic tag does not cost anything, but allows them to be searched for. If ScannerPro were to create bad files one day, I could at least find them.

When you make your pdfs searchable with 3rd party applications, you have to export them from EN into that app, convert, and reimport them. When you manage to get them into the same note they originated from, all properties will be kept (like tags, creation date etc.). These "belong" to the note, not the attached pdf. The new pdfs are just a new attachment to that note. If you create new notes instead, you need to re-tag. I think if it is just the tags, re-tagging might be simpler than manually move the new attachments around into the existing notes.

Finally, it is always cumbersome if you have invested time and effort, and for some reason things do not work. This is the same when you discover in 2019 that going paperless is a realistic thing today - and watch at your x meters of properly filed paper. Looking forward is liberating !

First, I would make sure that any new file is searchable from now on, and that I can identify these.

Second, I would give it a hard guess which of the old stuff I really need searchable in full text mode, and where some tags and note titles will do the job as well.

Happy cleaning !

On 5/26/2019 at 9:23 PM, DTLow said:

As I said, check with Support

fwiw, I OCR'd the pdf externally

Living with autism Perception.pdf

I use app PDF OCR X

Thanks for the reply. That is a "Shame".

I am using Evernote to build up indexable/searchable evidence for an Employment Tribunal case, so having everything searchable is sort of the point.

My problem now is identifying a workflow that creates searchable pdfs; some work is needed and I am on it. Using Adobe, I can tag all the converted files.

The biggest problem is the data already entered: is there any way that I can identify the pdfs that are NOT searchable without testing them manually?
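One possible way to do that check outside Evernote, once the attachments have been exported to a folder, is sketched below. It is a rough heuristic using only the Python standard library, not a proper PDF parser: it unpacks each compressed stream in the file and looks for the operators a PDF uses to draw text (`Tj` / `TJ` inside a `BT` ... `ET` block). The function names are my own, and it assumes the files are not encrypted.

```python
import re
import zlib
from pathlib import Path

# Raw bytes between the "stream" and "endstream" keywords of each PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A string, hex string or array followed by a text-showing operator (Tj or TJ).
TEXT_OP_RE = re.compile(rb"[)\]>]\s*T[jJ]\b")


def has_text_layer(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any stream in the file looks like it draws text."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # Most content streams are Flate (zlib) compressed.
            content = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not Flate data: inspect the bytes as they are.
            content = raw
        if b"BT" in content and TEXT_OP_RE.search(content):
            return True
    return False


def find_unsearchable(folder: str) -> list[Path]:
    """List the PDFs under `folder` that appear to have no text layer."""
    return sorted(
        pdf
        for pdf in Path(folder).rglob("*.pdf")
        if not has_text_layer(pdf.read_bytes())
    )
```

Calling `find_unsearchable("exported_attachments")` (a made-up folder name) would return the files worth running through Acrobat. Two caveats: a pdf that has some real text but keeps the key data in an image will still be reported as searchable, and the check says nothing about what Evernote itself has or has not indexed.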

May 27, 2019

23 hours ago, DTLow said:

As I said, check with Support

fwiw, I OCR'd the pdf externally

Living with autism Perception.pdf

I use app PDF OCR X

I have now had a reply from support and they confirm that it is not searchable. They suggested seeing if it was searchable in other applications; it isn't.

But then they suggested opening it and then saving it in another application, so I tried doing that in Adobe and if I open it and then "Save as" it becomes searchable.

Now I have Adobe and Nuance so I have the ability to make a whole folder of pdf's searchable. What I would like to do is make sure that all of the pdf's already entered into Evernote are searchable.

Problems

I don't know where the pdf's are stored if at all.
In the beginning I was deleting files from my import folder once entered into Evernote
Not all pdf's would have been saved into my import folder but "sent" to Evernote.
Will processing them into searchable versions remove the tags etc that I have already added?
Processing every pdf one by one is not practical as there are hundreds if not thousands in there.

This could be a nightmare and rather removes the reason for using Evernote.

May 26, 2019

1 minute ago, DTLow said:

I can confirm Evernote is bypassing the OCR process for this pdf - no search text was returned

Just a guess, the pdf is partially OCR'd already

You might want to open a support ticket for this. https://www.evernote.com/SupportLogin.action

Thanks for the confirmation.

I think that the ones that Evernote is not OCR'ing were created using the MS Print to pdf "print driver". Which is a sod as there are quite a lot of them and I cannot see how to tell them apart from other pdf's.

Hopefully there will be a way to fix this....

May 26, 2019

3 hours ago, DTLow said:

The pdf is not available. You could try a public link to your note.

Try this

https://drive.google.com/file/d/1UkjsQRc-KGN9yymEuaqixQtWiM5BXSL9/view?usp=sharing

May 26, 2019

3 hours ago, CalS said:

Are you sure the notes are in a synced notebook and not a Local one? If you are using Windows, highlight the notes with PDFs that aren't OCR'd, hold the Ctrl key while clicking on Help in the Menu bar, and then select Fix Selected Notes. See if that helps.

I just tried creating a local notebook and it shows its name in "grey", which I presume indicates it is a local notebook?

All the other notebooks have white text, which I presume means these are synced notebooks?

If so then all the pertinent notebooks are synced.

May 26, 2019

3 hours ago, CalS said:

Are you sure the notes are in a synced notebook and not a Local one? If you are using Windows, highlight the notes with PDFs that aren't OCR'd, hold the Ctrl key while clicking on Help in the Menu bar, and then select Fix Selected Notes. See if that helps.

Tried it, but it made no immediate difference.

What does that do?

May 26, 2019

This one does not seem to work for me.

Living with autism Perception.pdf

May 26, 2019

I am a new Evernote user so be kind 😁

I have added some pdfs to Evernote from MS Print to Pdf, but even after syncing and waiting (days, actually) they are not searchable. How can I make sure the pdfs become searchable?

I do have a pro account.

May 13, 2019

Testing but I reckon it could be Google Backup and Sync that is busy backing up and syncing anything put into the folder.

May 13, 2019

Mmmm.

If I set it to delete the file once imported, I get one copy in Evernote.

If I set it to Keep then I get two copies of each doc in Evernote.

Posts posted by ridleyrumpus

OCR on PDFs

Import Folders files duplicated
