
Feature Request: Duplicate File Checking (on upload)


gazumped

Recommended Posts

  • Level 5*

Like lots of others here, I'm busily converting to a paperless state from a folder-driven hard drive (or several) and a small paper mountain.

One of the inefficiencies of my former folder lifestyle was that sometimes files got duplicated - either because I wanted to file something in two or more places, or I just forgot where I had the first copy and created another. It's not a major hazard - I just ran a duplicate checker against a library folder tree of over a thousand files and came up with 43 duplicates - but it does happen.

If I upload a duplicate file to Evernote it's not going to be a major issue either - sooner or later a search will find it/them and I'll kill one or the other copy.

For efficiency and elegance (and bandwidth and the upload limit), though, I wondered: how non-trivial would it be for Evernote to offer an option to run a content check against each file as it is submitted, and to flag duplicates with a "you uploaded this identical file before" warning?

Link to comment
  • 1 month later...

I'd like this too - especially so that I don't waste any time multiple-tagging things I already have. Especially if EN had the smarts to evaluate the actual file and not merely its metadata (file name, size, etc.) - that is, to actually eval the content of the file. Then I would like two levels - one for a title match and then one for a file match.

- Bal

Link to comment

If I upload a duplicate file to Evernote it's not going to be a major issue either - sooner or later a search will find it/them and I'll kill one or the other copy.

Your idea is super and I hope that it gets implemented. Here is a workaround that I have used - not perfect by any means! Each of my notes has a title, like many of yours, I am sure. Using this tip I found in the EN database of wisdom:

Print a list of the note titles:

Run the search

Select all the notes (Ctrl + A)

Right click and "Copy note links"

Paste into a new Evernote note and print, or just view.

Now, I am making an assumption here, and that is that your might-be-duplicate notes have the same title(s). This next step will take some time: scan down your list of titles and find duplicates. When you find one, go back to EN, do a search for that title, and remove one!
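If the list runs long, a few lines of script can do the scanning instead. A minimal sketch, assuming the copied titles have been pasted into a plain text file - the name titles.txt and the one-title-per-line format are illustrative assumptions, not part of the tip:

# Sketch: flag note titles that appear more than once in a pasted list.
# "titles.txt" (one title per line) is a hypothetical file for this example.
from collections import Counter

with open("titles.txt", encoding="utf-8") as f:
    titles = [line.strip() for line in f if line.strip()]

for title, count in Counter(titles).most_common():
    if count < 2:
        break  # most_common() sorts by count, so all duplicates come first
    print(f"{count}x {title}")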

Boy, I sure hope your idea gets implemented!! I have done the above twice and it did pare down my notes!

Regards,

David in Wichita

Link to comment

OK, I'm not at all an expert on this, so how do we make sure the file itself is the same file? I guess if they're identical we can check the hash? And treat different versions of the same file as separate files?

Link to comment
  • Level 5*

My take would be to check the hash of the file - if it's an identical file, then I don't want to store it twice.

I might want to store several evolutions of a file, possibly in different notes, as I add and change the content; but to my mind, different versions are not (by definition) identical files.
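In code terms that test is cheap to sketch: hash the bytes of each file and compare digests - different versions produce different digests, identical files produce the same one. A minimal illustration (Python and SHA-256 are my assumptions; the thread doesn't say what Evernote would actually use, and the file names below are made up):

# Sketch: decide whether two files are byte-identical by comparing content hashes.
import hashlib

def file_digest(path, chunk_size=1 << 20):
    # Hash in chunks so large attachments don't have to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# "report_v1.pdf" / "report_v2.pdf" are placeholder names for illustration.
if file_digest("report_v1.pdf") == file_digest("report_v2.pdf"):
    print("identical files - store once")
else:
    print("different files - e.g. two versions of the same document")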

Link to comment
  • Level 5*

Agreed - was thinking about going into it in more detail, but then decided to let the techsperts earn their corn pulling it to bits. I can see how you could keep a mini-database of file hashes for each upload, then hash each new file and check it against the list, raising a query if they match; but that's going to take time for each upload - more on the web - and it may be one of those things I'd be happy to get... and then switch off so everything ran faster. It'd be a nice feature though, if there's a way to do it without too much of a time penalty!
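To make the mini-database idea concrete, here's one possible shape for it - purely a sketch using Python's built-in sqlite3 module; the file name hashes.db and the schema are invented for illustration, not anything Evernote actually does:

# Sketch: a persistent library of content hashes, consulted on each import.
import hashlib
import sqlite3

db = sqlite3.connect("hashes.db")  # hypothetical local hash library
db.execute("create table if not exists seen (digest text primary key, path text)")

def check_and_record(path):
    # Returns the path of an earlier identical file, or None if this one is new.
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
    row = db.execute("select path from seen where digest = ?", (digest,)).fetchone()
    if row:
        return row[0]  # "you uploaded this identical file before"
    db.execute("insert into seen values (?, ?)", (digest, path))
    db.commit()
    return None

The lookup itself is a single indexed query, so the per-upload time penalty is dominated by hashing the file, not by checking the list.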

Link to comment
  • Level 5*

I'm no expert, but from my point of view: hashes are generally quick to calculate and, depending on the algorithm, can be useful as a first step in determining duplication. Let's face it, if you used a 16-bit hash value, and it was generated in such a way as to be randomly distributed across the 64K values, that would mean relatively few matches per hash key, even in the face of the 100,000-note limit. Generate a 32-bit hash, and that roughly 4-billion-value space overwhelms the note limit.
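Putting rough numbers on that (my own back-of-envelope arithmetic, using the standard expected-colliding-pairs formula n(n-1)/2m for n items uniformly hashed into m values):

# Sketch: expected accidental hash collisions at Evernote's 100,000-note limit.
n = 100_000
for bits in (16, 32):
    m = 2 ** bits  # number of distinct hash values
    pairs = n * (n - 1) / (2 * m)
    print(f"{bits}-bit hash: ~{pairs:,.0f} colliding pairs expected")

So a 16-bit hash could only ever be a coarse first-pass filter (tens of thousands of accidental matches), while a 32-bit hash leaves roughly one accidental pair across a full account - and the 128-bit MD5 that turns up later in this thread makes accidental collisions vanishingly unlikely.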

Link to comment
  • Level 5*

One of the inefficiencies of my former folder lifestyle was that sometimes files got duplicated - either because I wanted to file something in two or more places, or I just forgot where I had the first copy and created another. It's not a major hazard - I just ran a duplicate checker against a library folder tree of over a thousand files and came up with 43 duplicates - but it does happen.

Since Evernote stores all of the attachments on your local hard drive, why not just use the tool you already have?

Link to comment
  • Level 5*

Yup - attachments are part of the database, so there's nothing for any external tool to latch onto for a comparison. Evernote would have to keep a local library of hashes (or an equivalent) to log what's been saved to date, then hash and compare each new offering as files are imported. Since my comparison tool of choice can zap through a whole hard disk's worth of random files and identify the duplicates in 10 minutes or so, you'd think a library of existing Evernote files would be easy to hash in pretty short order (ignoring the slight problem of digging each individual file out of the database in the first place), and that new files could then easily be checked as part of the import process...
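For files that are still loose on disk - the pre-import situation - the kind of scan such a tool performs is easy to sketch (my own illustration, not the actual tool; the folder name is a placeholder):

# Sketch: find duplicate files under a folder tree by grouping on content hash.
import hashlib
from collections import defaultdict
from pathlib import Path

groups = defaultdict(list)
for path in Path("Documents").rglob("*"):  # placeholder root folder
    if path.is_file():
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path)

for paths in groups.values():
    if len(paths) > 1:
        print("duplicates:", [str(p) for p in paths])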

Link to comment
  • 1 year later...
  • Level 5*

Any news on this feature request?

This would be super useful when uploading PDFs - automatically checking whether you have already uploaded one would save me some time!

Thanks

Since Evernote typically does not pre-announce its future development plans, the only news you can usually expect is a release note saying that this feature has been added. It has not.

Link to comment
  • 2 months later...
  • 7 months later...

You can query the .exb file directly using a SQLite client. Its tables already contain the MD5 hash of every resource you have stored.

The SQL queries below should do the trick. The first one should be enough, but somehow the second one gives slightly different results.

select upper(hash), count(*),
       group_concat(ra.[file_name] || ' [' || n.title || ' - ' || ra.source_url || ']')
from resource_attr ra
join note_attr n on n.uid = ra.note
group by hash
having count(*) > 1
order by count(*) desc;

select hex(md5), count(*),
       group_concat(ra.[file_name] || ' [' || ra.source_url || ' - ' || ra.note || ']')
from resources r
join resource_attr ra on ra.uid = r.uid
group by hex(md5)
having count(*) > 1
order by count(*) desc;
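If you'd rather not install a separate SQLite client, Python's built-in sqlite3 module can run the same query - a sketch assuming the .exb really is a plain SQLite database as described above; the file name is a placeholder, and it's safest to work on a copy while Evernote is closed:

# Sketch: run the first duplicate-resource query against a copy of the .exb file.
import sqlite3

db = sqlite3.connect("copy-of-notebook.exb")  # placeholder path - use a copy
query = """
    select upper(hash), count(*),
           group_concat(ra.[file_name] || ' [' || n.title || ' - ' || ra.source_url || ']')
    from resource_attr ra
    join note_attr n on n.uid = ra.note
    group by hash
    having count(*) > 1
    order by count(*) desc
"""
for digest, count, where in db.execute(query):
    print(count, digest, where)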
Link to comment
  • 2 months later...

Archived

This topic is now archived and is closed to further replies.
