(Archived) PDF Index Garbage Words


Does EN index the full PDF on the web, or are there garbage words it doesn't index? I searched a PDF and it doesn't seem to find some words that I know are in the document. The only reasons I can think of are that the words are too common, the PDF is too big (it's 9 MB), or it takes a while for the PDF to be fully indexed.

Of course, it is possible (although highly unlikely) that I'm doing something wrong. :-)

Thanks,

Randy

Link to comment
Last I knew, PDFs weren't indexed at all; you'd have to copy the contents and paste them into a note to have them indexed.

Link to comment

I'm searching a PDF via the web interface, but when I search for words that I'm sure are in the document, it doesn't seem to find them. Is it because I initially created the note on Windows and then did a sync, as opposed to doing it on a Mac?

Thanks.

Link to comment

We currently index and search only the text content of PDF documents. If your PDF contains images with words in them, those words are not searchable, because we don't yet do image processing on images within PDFs. When we add this feature, we'll go back and retroactively process your existing documents.

This behavior on the web service is independent of the source of the note ... so we have the same limitation whether you mailed it in or uploaded it from either client.

Thanks
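To illustrate the behavior described above, here is a minimal sketch of a text-only indexer. It is not Evernote's code; the document model (a list of `("text", ...)` and `("image", ...)` parts) and the names `build_index` and `search` are assumptions made up for this example. It shows why a journal-article PDF is findable while a scanned PDF is not.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build a word -> set of document ids mapping.

    `documents` maps a document id to a list of (kind, payload) parts.
    This layout is hypothetical and only stands in for a parsed PDF.
    """
    index = defaultdict(set)
    for doc_id, parts in documents.items():
        for kind, payload in parts:
            if kind != "text":
                continue  # image parts are skipped: this sketch does no OCR
            for word in tokenize(payload):
                index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    "journal.pdf": [("text", "Indexing of journal articles")],
    "scan.pdf": [("image", b"...pixels of a scanned invoice...")],
}
index = build_index(docs)
print(search(index, "journal"))
print(search(index, "invoice"))
```

The word "invoice" exists only as pixels in `scan.pdf`, so a search for it comes back empty no matter which client uploaded the file.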

Link to comment

I think I understand. So PDF files such as journal articles are searchable now, but PDFs coming off my scanner, even if they are scans of text documents, are considered "images" and so are not yet searchable?

I'm thinking about upgrading to the pro features, but I want to be able to search these scanned documents in the near future (otherwise I would just go another OCR route). Any idea of the general timeline for this? Thanks.

Link to comment

If you can scan your pages to an image format directly (PNG, JPEG, GIF), this will work correctly today. I'd like to get image processing in PDFs done in the next 2-3 months, but it's a little tricky due to the number of different image formats that can be thrown into PDF files.
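As a rough illustration of the "different image formats" point: images inside a PDF are stream objects whose `/Filter` entry names the encoding, and the PDF specification defines several of them. The sketch below is not Evernote's code and is deliberately naive. It scans raw bytes with regular expressions, so it will miss objects stored inside compressed object streams, and the function name `image_encodings` and the sample bytes are made up for this example.

```python
import re
from collections import Counter

# Standard PDF stream filters that commonly appear on image objects.
FILTER_FORMATS = {
    b"DCTDecode": "JPEG",
    b"JPXDecode": "JPEG 2000",
    b"CCITTFaxDecode": "CCITT fax (bilevel)",
    b"JBIG2Decode": "JBIG2 (bilevel)",
    b"FlateDecode": "zlib-compressed raw samples",
    b"LZWDecode": "LZW-compressed raw samples",
}

OBJ_RE = re.compile(rb"\d+\s+\d+\s+obj(.*?)endobj", re.DOTALL)
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")
FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/\w+)")


def image_encodings(pdf_bytes):
    """Count the encodings used by image objects found in raw PDF bytes."""
    counts = Counter()
    for match in OBJ_RE.finditer(pdf_bytes):
        # Only look at the dictionary that precedes the stream data.
        header = match.group(1).split(b"stream", 1)[0]
        if not IMAGE_RE.search(header):
            continue
        found = FILTER_RE.search(header)
        names = re.findall(rb"/(\w+)", found.group(1)) if found else []
        if not names:
            counts["uncompressed raw samples"] += 1
        for name in names:
            # Filters not in the table are reported under their own name.
            counts[FILTER_FORMATS.get(name, name.decode("ascii"))] += 1
    return counts


# Hand-written sample bytes, not a complete or valid PDF file.
sample = (
    b"1 0 obj << /Type /XObject /Subtype /Image /Filter /DCTDecode"
    b" /Length 4 >> stream\nabcd\nendstream endobj\n"
    b"2 0 obj << /Type /XObject /Subtype /Image"
    b" /Filter [/ASCII85Decode /FlateDecode] /Length 4 >> stream\nabcd\nendstream endobj\n"
    b"3 0 obj << /Type /Page >> endobj\n"
)
counts = image_encodings(sample)
print(dict(counts))
```

Each encoding needs its own decoder before the pixels can be handed to text recognition, and filters can be chained, which is one reason a scanned PDF is harder to process than a plain PNG or JPEG.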

Link to comment

I'm curious to know why PDF support is hard to do on Windows but is already implemented on the Mac and in the browser. Is there something inherently difficult about it?

The implementation of PDFs in OneNote is very flexible, especially the option to "Insert file as printout so I can add notes to it". I'd like to see PDFs implemented in a similar fashion in Evernote - it gives great flexibility.

Link to comment
The PDF format is fully integrated into OS X and its developer libraries (Cocoa):

http://developer.apple.com/cocoa/pdfkit.html

This means that we can just tell the Mac to stick a PDF somewhere, and it basically works. The Mac includes built-in code to pull out the text from a PDF so that you can put it into the search engine.

On Windows, Microsoft does not acknowledge that PDF is a de facto standard, so everyone who sets up a new computer needs to download a third-party app (Adobe Acrobat) just to look at a PDF. This means that we need to research, test, and license an external PDF-processing library for Windows to pull the text out of a PDF for searching. Displaying a PDF within the note (like the Mac does) would require a similar level of work.

To be clear, we definitely want to get this working on Windows. Since it was several times easier on the Mac, we had to choose between letting Mac users get full support before Windows users and delaying it for everyone until we could get it all finished on Windows.

Link to comment

Archived

This topic is now archived and is closed to further replies.
