
Feature Request: Duplicate File Checking (on upload)


gazumped

Recommended Posts

  • Level 5*

Like lots of others here, I'm busily converting to a paperless state from a folder-driven hard drive (or several) and a small paper mountain.

One of the inefficiencies of my former folder lifestyle was that sometimes files got duplicated - either because I wanted to file something in two or more places, or I just forgot where I had the first copy and created another. It's not a major hazard - I just ran a duplicate checker against a library folder tree of over a thousand files and came up with 43 duplicates - but it does happen.

If I upload a duplicate file to Evernote it's not going to be a major issue either - sooner or later a search will find it/them and I'll kill one or the other copy.

For efficiency and elegance (and bandwidth and the upload limit), though, I wondered: how non-trivial would it be for Evernote to offer an option to run a content check against each file as it is submitted, and to flag duplicates with a "you uploaded this identical file before" warning?

Link to comment
  • 1 month later...

I'd like this too - especially so that I don't waste any time multiple-tagging things I already have. Especially if EN had the smarts to evaluate the actual file and not merely its metadata (file name, size, etc.) - that is, to actually eval the content of the file. Then I would like two levels - one for a title match and then one for a file match.

- Bal

Link to comment

If I upload a duplicate file to Evernote it's not going to be a major issue either - sooner or later a search will find it/them and I'll kill one or the other copy.

Your idea is super and I hope that it gets implemented. Here is a workaround that I have used - not perfect by any means! Each of my notes has a title, like many of yours, I am sure. Using this tip I found in the EN database of wisdom:

Print a list of the note titles:

Run the search

Select all the notes (Ctrl + A)

Right click and "Copy note links"

Paste into a new Evernote note and print, or just view.

Now, I am making an assumption here, and that is that your might-be-duplicate notes have the same title(s). This next step will take some time: scan down your list of titles and find duplicates. When you find one, go back to EN, do a search for that title, and remove one!
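If the list runs long, a few lines of script can do the scanning instead. A minimal sketch, assuming the copied titles have been pasted into a plain text file - the name titles.txt and the one-title-per-line format are illustrative assumptions, not part of the tip:

# Sketch: flag note titles that appear more than once in a pasted list.
# "titles.txt" (one title per line) is a hypothetical file for this example.
from collections import Counter

with open("titles.txt", encoding="utf-8") as f:
    titles = [line.strip() for line in f if line.strip()]

for title, count in Counter(titles).most_common():
    if count < 2:
        break  # most_common() sorts by count, so all duplicates come first
    print(f"{count}x {title}")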

Boy, I sure hope your idea gets implemented!! I have done the above twice and it did pare down my notes!

Regards,

David in Wichita

Link to comment

OK, I'm not at all an expert on this, so how do we make sure the file itself is the same file? I guess if they're identical we can check the hash? And treat different versions of the same file as separate files?

Link to comment
  • Level 5*

My take would be to check the hash of the file - if it's an identical file, then I don't want to store it twice.

I might want to store several evolutions of a file, possibly in different notes, as I add and change the content; but to my mind, different versions are not (by definition) identical files.
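In code terms that test is cheap to sketch: hash the bytes of each file and compare digests - different versions produce different digests, identical files produce the same one. A minimal illustration (Python and SHA-256 are my assumptions; the thread doesn't say what Evernote would actually use, and the file names below are made up):

# Sketch: decide whether two files are byte-identical by comparing content hashes.
import hashlib

def file_digest(path, chunk_size=1 << 20):
    # Hash in chunks so large attachments don't have to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# "report_v1.pdf" / "report_v2.pdf" are placeholder names for illustration.
if file_digest("report_v1.pdf") == file_digest("report_v2.pdf"):
    print("identical files - store once")
else:
    print("different files - e.g. two versions of the same document")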

Link to comment
  • Level 5*

Agreed - was thinking about going into it in more detail, but then decided to let the techsperts earn their corn pulling it to bits. I can see how you could keep a mini-database of file hashes for each upload, then hash each new file and check it against the list, raising a query if they match; but that's going to take time for each upload - more on the web - and it may be one of those things I'd be happy to get... and then switch off so everything ran faster. It'd be a nice feature though, if there's a way to do it without too much of a time penalty!
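To make the mini-database idea concrete, here's one possible shape for it - purely a sketch using Python's built-in sqlite3 module; the file name hashes.db and the schema are invented for illustration, not anything Evernote actually does:

# Sketch: a persistent library of content hashes, consulted on each import.
import hashlib
import sqlite3

db = sqlite3.connect("hashes.db")  # hypothetical local hash library
db.execute("create table if not exists seen (digest text primary key, path text)")

def check_and_record(path):
    # Returns the path of an earlier identical file, or None if this one is new.
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
    row = db.execute("select path from seen where digest = ?", (digest,)).fetchone()
    if row:
        return row[0]  # "you uploaded this identical file before"
    db.execute("insert into seen values (?, ?)", (digest, path))
    db.commit()
    return None

The lookup itself is a single indexed query, so the per-upload time penalty is dominated by hashing the file, not by checking the list.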

Link to comment
  • Level 5*

I'm no expert, but from my point of view: hashes are generally quick to calculate and, depending on the algorithm, can be useful as a first step in determining duplication. Let's face it, if you used a 16-bit hash value, and it was generated in such a way as to be randomly distributed across the 64K values, that would mean relatively few matches per hash key, even in the face of the 100,000-note limit. Generate a 32-bit hash, and that roughly 4-billion-value space overwhelms the note limit.
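Putting rough numbers on that (my own back-of-envelope arithmetic, using the standard expected-colliding-pairs formula n(n-1)/2m for n items uniformly hashed into m values):

# Sketch: expected accidental hash collisions at Evernote's 100,000-note limit.
n = 100_000
for bits in (16, 32):
    m = 2 ** bits  # number of distinct hash values
    pairs = n * (n - 1) / (2 * m)
    print(f"{bits}-bit hash: ~{pairs:,.0f} colliding pairs expected")

So a 16-bit hash could only ever be a coarse first-pass filter (tens of thousands of accidental matches), while a 32-bit hash leaves roughly one accidental pair across a full account - and the 128-bit MD5 that turns up later in this thread makes accidental collisions vanishingly unlikely.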

Link to comment
  • Level 5*

One of the inefficiencies of my former folder lifestyle was that sometimes files got duplicated - either because I wanted to file something in two or more places, or I just forgot where I had the first copy and created another. It's not a major hazard - I just ran a duplicate checker against a library folder tree of over a thousand files and came up with 43 duplicates - but it does happen.

Since Evernote stores all of the attachments on your local hard drive, why not just use the tool you already have?

Link to comment
  • Level 5*

Yup - attachments are part of the database, so there's nothing for any external tool to latch onto for a comparison. Evernote would have to keep a local library of hashes (or an equivalent) to log what's been saved to date, then hash and compare each new offering as files are imported. Since my comparison tool of choice can zap through a whole hard disk's worth of random files and identify the duplicates in 10 minutes or so, you'd think a library of existing Evernote files would be easy to hash in pretty short order (ignoring the slight problem of digging each individual file out of the database in the first place), and that new files could then easily be checked as part of the import process...
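For files that are still loose on disk - the pre-import situation - the kind of scan such a tool performs is easy to sketch (my own illustration, not the actual tool; the folder name is a placeholder):

# Sketch: find duplicate files under a folder tree by grouping on content hash.
import hashlib
from collections import defaultdict
from pathlib import Path

groups = defaultdict(list)
for path in Path("Documents").rglob("*"):  # placeholder root folder
    if path.is_file():
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path)

for paths in groups.values():
    if len(paths) > 1:
        print("duplicates:", [str(p) for p in paths])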

Link to comment
  • 1 year later...
  • Level 5*

Any news on this feature request?

This would be super useful when uploading PDFs - automatically checking whether you have already uploaded one would save me some time!

Thanks

Since Evernote typically does not pre-announce its future development plans, the only news you can usually expect is a release note saying that this feature has been added. It has not.

Link to comment
  • 2 months later...
  • 7 months later...

You can query the .exb file directly using a SQLite client. Its tables already contain the MD5 hash of every resource you have stored.

The SQL queries below should do the trick. The first one should be enough, but somehow the second one gives slightly different results.

select upper(hash), count(*),
       group_concat(ra.[file_name] || ' [' || n.title || ' - ' || ra.source_url || ']')
from resource_attr ra
join note_attr n on n.uid = ra.note
group by hash
having count(*) > 1
order by count(*) desc;

select hex(md5), count(*),
       group_concat(ra.[file_name] || ' [' || ra.source_url || ' - ' || ra.note || ']')
from resources r
join resource_attr ra on ra.uid = r.uid
group by hex(md5)
having count(*) > 1
order by count(*) desc;
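If you'd rather not install a separate SQLite client, Python's built-in sqlite3 module can run the same query - a sketch assuming the .exb really is a plain SQLite database as described above; the file name is a placeholder, and it's safest to work on a copy while Evernote is closed:

# Sketch: run the first duplicate-resource query against a copy of the .exb file.
import sqlite3

db = sqlite3.connect("copy-of-notebook.exb")  # placeholder path - use a copy
query = """
    select upper(hash), count(*),
           group_concat(ra.[file_name] || ' [' || n.title || ' - ' || ra.source_url || ']')
    from resource_attr ra
    join note_attr n on n.uid = ra.note
    group by hash
    having count(*) > 1
    order by count(*) desc
"""
for digest, count, where in db.execute(query):
    print(count, digest, where)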
Link to comment
  • 2 months later...

Archived

This topic is now archived and is closed to further replies.
