
ridleyrumpus

Level 2
  • Posts

    29

Posts posted by ridleyrumpus

  1. 10 minutes ago, CalS said:

    Well, try printing your problematic PDF to a PDF and OCR the resultant PDF.

    My problem wasn't making them OCRable; it was that I didn't know which ones needed to be converted.

    I have had to save all the attachments in a notebook to a folder, OCR them using Adobe Acrobat and then re-import them, but by doing that I lose all the tags etc.

  2. 2 hours ago, CalS said:

    I meant more along the lines did the originator apply some form of protection to the PDF contents.  I have some financial institution PDFs where it can’t be printed for example.  

     

    How does that work, i.e. the prevention of printing?

    There has to be just a flag on the file that prevents that. I would be surprised if there was not a PDF reader out there that ignored it, though I have been known to be wrong.
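
    For what it is worth, the PDF specification does treat this as a flag. An encrypted PDF carries a permissions integer (the /P entry of its encryption dictionary), and bit 3 of that value says whether printing is allowed. Honouring it is left to the viewer. A minimal sketch of decoding such a value, assuming you have already read /P out of the file by some other means (the function and dictionary names here are made up for illustration):

```python
# Permission bits from the /P entry of a PDF encryption dictionary.
# Bit positions are 1-based, so bit 3 has the value 1 << 2.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy": 5,
    "annotate": 6,
    "fill_forms": 9,
    "extract_for_accessibility": 10,
    "assemble": 11,
    "print_high_quality": 12,
}


def decode_permissions(p: int) -> dict[str, bool]:
    """Return which operations a /P value asks viewers to allow."""
    p &= 0xFFFFFFFF  # /P is a signed 32-bit integer; view it as unsigned
    return {
        name: bool(p & (1 << (bit - 1)))
        for name, bit in PERMISSION_BITS.items()
    }


# Hypothetical example value: everything allowed except printing.
print(decode_permissions(-8)["print"])
```

    Nothing in the file physically stops printing; the bit is only a request to the reading application.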

     

  3. 8 hours ago, DTLow said:

    Can we get more details on this.

    As per @Cals, was it a security thing?

    One option is to generate a separate text file instead of updating the pdf

    I don't want to do that, as the files may be used in evidence, so I do not want to materially alter them or rely on a file derived from them that could be questioned.

  4. 7 hours ago, PinkElephant said:

    Just a thought: Knowing that legal counsel (especially the big law firms) extensively use data processing and data mining in legal issues, could it be that these PDFs were intentionally set up in a way to make OCRing them difficult ? I would not know how to do this, but it might make sense:

    You hand over a volume of data to the opponent, putting the critical part into PDFs that will most likely not be OCRed. The other side will put everything into their machine, crunch it through and start working on it. Now the interesting part is not found, because it is there but not processed, and hidden among all the rest. If your OCR normally works fine, and you do not expect this, you will rely on search and similar functions, and will simply overlook what is there.

    Who created these PDFs will not have violated discovery terms, because the information was delivered. Just too bad when the opponent did not find it.

    As I have said, this is just free thinking out of thin air ...

    And a jolly good point it is too.

    Should I point out that the documentation was not only un-OCRable, but that it included all documents pertaining to an individual who had worked as a teacher for the past 15 years?

    The documents, memos, notes, emails etc etc etc were wrapped up in one BIG pdf and each one was in there randomly by date and type.

    I do not believe that the way the pdf was put together was a random act.

  5. 31 minutes ago, PinkElephant said:

    Maybe. As I have posted above, I had difficulties to process the pdf with any program I have and use to work on PDFs. I do not have the original Adobe suite, but pretty good stuff that usually can open, edit and OCR every pdf I throw on it.

    This is why I regard this sample pdf as weird.

    And EN does not take a decision to OCR or not to OCR. It is not Hamlet 😉

    As I understand it, any file uploaded will enter a queue. If it contains significant text information, it will not again be OCRed. If not, the server will try to OCR it. If this worked, the information will be added to make it searchable. If the OCR comes out blank (like with this file), the algorithm will „think“ there was nothing to OCR, and will move on to the next file to work on. 

    If, as I suspect, it is MS Print to PDF that is to blame, I do wonder whether it is deliberate or just terrible coding by MS.

    PDFs made with MS Print to PDF do not work; those made with the Adobe equivalent are no problem.
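
    The queue behaviour described in the quote above can be sketched as a small decision function. This is only an illustration of that description, not Evernote's actual code: the names, the `run_ocr` callback and the 20-character threshold are all invented.

```python
from enum import Enum
from typing import Callable


class OcrOutcome(Enum):
    SKIPPED_HAS_TEXT = "skipped: file already contains text"
    INDEXED = "OCR text added to the search index"
    BLANK = "OCR returned nothing"


def process_upload(
    embedded_text: str,
    run_ocr: Callable[[], str],
    min_chars: int = 20,  # assumed threshold for "significant text"
) -> OcrOutcome:
    """Decide what happens to one uploaded file in the OCR queue."""
    # Files that already carry a text layer are not OCRed again.
    if len(embedded_text.strip()) >= min_chars:
        return OcrOutcome.SKIPPED_HAS_TEXT
    # Otherwise try OCR; a blank result is treated as "nothing to OCR".
    recognised = run_ocr()
    if recognised.strip():
        return OcrOutcome.INDEXED
    return OcrOutcome.BLANK
```

    The `BLANK` branch is the one that matters here: if it were surfaced to the user as a tag or flag, the problem files could be found later.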

  6. 1 hour ago, CalS said:

    You might be able to shrink the universe.  For me It only happens with a subset of downloaded PDFs.  I know the offending providers at this point.  PITA I have to remember to check new ones.  But again, not sure in my use case the lack of OCR on these is an EN issue (though adding notice/flag of lack of OCRing would be something that would help).  No OCR on PDFs containing renderable text is universal best I can tell   

    No issues at all with PDFs created via my ScanSnap (I let it do the OCRing).   

    Nor here.

    The offending PDF's have been sent to me in a bundle being used in a legal case. To make things interesting not all of them are a problem.

  7. 1 minute ago, PinkElephant said:

    GIGO

    Garbage in, garbage out ...

    I have several thousand pdf attachments in my EN account, from all sort of sources, and everything is working as needed (full search, highlighting of hits etc.). But if I throw non-conforming stuff at an algorithm, the program will go, check and deliver a non-result.

    Where I agree: It would be good to get a feedback if a function on the server failed.

     

    That is my gripe: you can enter notes but not know straight away that they will not OCR via EN, and if they don't you have no way of knowing.

    It would be nice to have a function to search for all those that are not readable.

    • Like 1
  8. 7 hours ago, CalS said:

    Agreed.  Typically I learn of an issue when I am looking for something and can't find it, but I know it is in EN.  Again, typically it is a renderable text issue for me.  Some statements, bills, advices, etc., embed some key piece of data in an image and it won't OCR.  Unless I save the PDF to image/TIFF and then combine the pages in Adobe and then recognize text.  PITA, for sure, but I think EN is on the back end of the problem.  Would be nice to be advised a PDF did not fully OCR though. 

    I think that given this it will be best to assume that Evernote is NOT going to OCR anything.

     

    It is just too much of a problem to try and fix later when it may or may not have worked and you have no way of knowing which it was. 🙁

     

  9. Thanks for the reply. 

    I have lots of PDFs that have been supplied to me, so I will have to OCR them in Acrobat prior to import. PITA, but I do not think I have a choice.

    Such a pain that Evernote does not let you know when it has tried and been unable to OCR a note.

     

  10. Understood.

    However, Evernote indicates it will be able to make notes searchable, and that is the main reason I bought in. It now looks like it can, sometimes. But more to the point, when it cannot, it does not let you know (or am I missing a flag somewhere?).

    It would be useful if Evernote had the facility to flag notes that it could not OCR to allow you to go back and OCR only those ones manually. 

    This is going to be a right PITA almost to the point of looking elsewhere. 

  11. 2 hours ago, PinkElephant said:

    Depends on how serious you are about the ability to search full text. Whether in EN or somewhere else, these pdfs are not searchable in their current state. This has nothing to do with Evernote, no other program or indexing search will treat these as searchable files.

    Problems top-down:

    How are they distributed in the database (number of notebooks, any common properties like specific tags etc.)? First thing would be to let the past be the past for now, and start fresh with importing pdfs that are as you want them to be. Maybe create a tag to be able to identify these. Whenever an old one is converted, stick that tag on it as well.

    Import folder ? Yes, I do not see any sense in keeping files that are in Evernote now as a second copy somewhere else. My stuff goes into my Inbox-notebook, if not otherwise specified. One reason I still use the Win client is the "Import Folder" feature. Next I try to keep the Inbox to zero, which means tagging & moving what is in there. My folders elsewhere I try to clean out every few weeks or so, to get rid of the duplicates and save disk space.

    Otherwise specified means any direct import into notebooks other than the Inbox. I do this for example with workflows generated by the app ScannerPro: all these are automatically tagged with a tag "ScannerPro" and are mostly sent directly to specified notebooks other than Inbox. So maybe think about making imported notes more identifiable by source for the future. Applying an automatic tag does not cost anything, but allows them to be searched for. If ScannerPro were to create bad files one day, I could at least find them.

    When you make your pdfs searchable with 3rd party applications, you have to export them from EN into that app, convert, and reimport them. When you manage to get them into the same note they originated from, all properties will be kept (like tags, creation date etc.). These "belong" to the note, not the attached pdf. The new pdfs are just a new attachment to that note. If you create new notes instead, you need to re-tag. I think if it is just the tags, re-tagging might be simpler than manually move the new attachments around into the existing notes.

    Finally, it is always cumbersome if you have invested time and effort, and for some reason things do not work. This is the same when you discover in 2019 that going paperless is a realistic thing today - and watch at your x meters of properly filed paper. Looking forward is liberating !

    First, I would make sure that any new file is searchable from now on, and that I can identify these.

    Second, I would give it a hard guess which of the old stuff I really need searchable in full text mode, and where some tags and note titles will do the job as well.

    Happy cleaning !

    On 5/26/2019 at 9:23 PM, DTLow said:

    As I said, check with Support

    fwiw, I OCR'd the pdf externally: Living with autism Perception.pdf

    I use app PDF OCR X

    Thanks for the reply. That is a "Shame".

    I am using Evernote to build up indexable/searchable evidence in building up a Employment Tribunal case so having everything searchable is sort of the point. 

    My problem now is identifying a workflow that creates searchable pdfs. Some work is needed and I am on it. Using Adobe, I can tag all the files I convert.

    The biggest problem is the data already entered. Is there any way that I can identify those that are NOT searchable without testing manually?
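
    One rough way to triage PDFs that have been exported to a folder, without opening each one, is to check whether the file declares any fonts at all: an image-only scan normally has none, so it cannot have a text layer. The sketch below uses only the Python standard library. It is a heuristic under stated assumptions, not a real PDF parser: the function names are made up, it will not see inside encrypted files, and a PDF that does declare fonts can still turn out to be unsearchable.

```python
import re
import zlib
from pathlib import Path

# Matches the raw body of every stream object in the file.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def looks_searchable(pdf_bytes: bytes) -> bool:
    """Heuristic: no font resources means no text layer to search."""
    if b"/Font" in pdf_bytes:
        return True
    # Newer PDFs may keep their dictionaries in Flate-compressed
    # object streams, so look inside anything that inflates cleanly.
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (e.g. a JPEG image); skip it
        if b"/Font" in data:
            return True
    return False


def unsearchable_pdfs(folder: str) -> list[Path]:
    """List the PDFs in a folder that appear to have no text layer."""
    return sorted(
        path
        for path in Path(folder).glob("*.pdf")
        if not looks_searchable(path.read_bytes())
    )
```

    Calling `unsearchable_pdfs("exported_attachments")` (a hypothetical folder name) would give a shortlist to put through Acrobat, rather than testing every file by hand.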

  12. 23 hours ago, DTLow said:

    As I said, check with Support

    fwiw, I OCR'd the pdf externally: Living with autism Perception.pdf

    I use app PDF OCR X

    I have now had a reply from support. They confirm that it is not searchable, and suggested seeing whether it was searchable in other applications. It isn't.

    But then they suggested opening it and then saving it in another application, so I tried doing that in Adobe and if I open it and then "Save as" it becomes searchable.

    Now I have Adobe and Nuance so I have the ability to make a whole folder of pdf's searchable. What I would like to do is make sure that all of the pdf's already entered into Evernote are searchable.

    Problems

    • I don't know where the pdf's are stored if at all.
    • In the beginning I was deleting files from my import folder once entered into Evernote
    • Not all pdf's would have been saved into my import folder but "sent" to Evernote.
    • Will processing them into searchable versions remove the tags etc that I have already added?
    • Processing every pdf one by one is not practical as there are hundreds if not thousands in there.

    This could be a nightmare and rather removes the reason for using Evernote.

  13. 1 minute ago, DTLow said:

    I can confirm Evernote is bypassing the OCR process for this pdf - no search text was returned

    Just a guess: the pdf is partially OCR'd already

    You might want to open a support ticket for this.   https://www.evernote.com/SupportLogin.action

     

    Thanks for the confirmation.

    I think that the ones that Evernote is not OCR'ing were created using the MS Print to PDF "print driver". That is a sod, as there are quite a lot of them and I cannot see how to tell them apart from other pdfs.

    Hopefully there will be a way to fix this....

  14. 3 hours ago, CalS said:

    Are you sure the notes are in a synced notebook and not a Local one?  If you are using Windows, highlight the notes with PDFs that aren't OCR'd, hold the Ctrl key while clicking on Help in the Menu bar, and then select Fix Selected Notes.  See if that helps.

    I just tried creating a local notebook and it shows its name in "grey", which I presume indicates it is a local note book?

    All the other note books have white text which I presume means these are synced notebooks?

    If so then all the pertinent note books are synced.

  15. 3 hours ago, CalS said:

    Are you sure the notes are in a synced notebook and not a Local one?  If you are using Windows, highlight the notes with PDFs that aren't OCR'd, hold the Ctrl key while clicking on Help in the Menu bar, and then select Fix Selected Notes.  See if that helps.

    Tried it but made no immediate difference.

    What does that do?

  16. I am a new Evernote user so be kind 😁

     

    I have added some pdfs to Evernote from MS Print to PDF, but even after syncing and waiting (days, actually) they are not searchable. How can I make sure the pdfs become searchable?

    I do have a pro account.

    • Like 1