
Info: Exporting EN attachments or Save as HTML will not export duplicates



Hello, 

 

I thought I would post this so anyone who comes across it in the future can feel relatively safe in their actions.

 

I had a few notes with 50+ images in them.  I started by selecting a note and using the "save attachments" feature.  Every time, I felt I was missing some images: the folder held 45 or 55 files, while I counted 50+ or even 80+ images in the note itself.

 

The next step was to export as HTML, so that I could at least compare the HTML output to the note.  While the HTML export does not render in a browser exactly as the note does in EN, it is pretty close.  With the HTML export opened in Safari, all the images were there.  I wanted to be sure, so I put a text number above each image, all the way up to 97.

 

Yet the images folder only had 65 images in it.  Hmm, either Evernote (EN) is not exporting all images, or it is being smart and not exporting duplicates.  I took the HTML file and grep'd out the image lines, so I was left with a text file that had 97 lines in it.  OK, so at least the HTML was referencing all 97 images.

 

With a text file of 97 file names in hand, I opened up the terminal, sorted the file and piped it through uniq -c, which prefixes each distinct line with the number of times it occurs.  There were in fact duplicates: a few lines had a 2 or a 3 in front of them, most had a 1.  I then counted the lines of that output with wc -l, and it was...

65!

 

OK, yay, so EN will not write a duplicate image more than once when you save attachments or export as HTML.  It does keep every reference to the image through the HTML links, or perhaps XML/XHTML links in the .enml export format.

 

Here are the shell commands I used:

 

This reads work.html and writes every line containing the string 'img' (case-insensitive) to a file called count.txt, replacing any earlier copy of that file:

grep -i img work.html > count.txt

 

You can then count the lines in that file with 

wc -l count.txt

 

I then opened the file in my text editor and used find and replace to strip out everything but the filename, so I was left with a 97-line file with one filename, such as 4324-4689-3654-5679.JPG, on each line.
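
The find-and-replace step can probably be done in the shell as well. Below is a minimal sketch, assuming each image shows up in the export as an img tag with a double-quoted src attribute; I have not checked this against every EN export. The function name extract_img_names is made up for illustration, it is not an Evernote or system command.

```shell
# Hypothetical helper (not part of Evernote or any standard tool).
# Prints the bare filename from every <img ... src="..."> tag in the file
# given as $1, one filename per line.
# Assumption: src attributes are double-quoted.
extract_img_names() {
  # 1. grep -o prints each matching tag fragment on its own line,
  #    even when several images share one line of HTML.
  # 2. The first sed keeps only the value of the src attribute.
  # 3. The second sed strips any leading folder path.
  grep -oiE '<img[^>]*src="[^"]*"' "$1" |
    sed -E 's/.*[sS][rR][cC]="([^"]*)"/\1/' |
    sed -E 's|.*/||'
}
```

Used as extract_img_names work.html > count.txt, this should leave one filename per line, ready for sort and uniq. Names containing URL-encoded characters (for example %20) would stay encoded.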

 

Next up, we want to sort it, prefix each distinct filename with the number of times it occurs, and count the resulting lines:

sort count.txt | uniq -c | wc -l

 

The above prints the number of distinct filenames, which should match the number of images exported.  Leave off the trailing | wc -l to see the list itself, which looks like this:

   1 E963A26D-ED10-4D6D-A715-EE4DF6491A74.JPG
   2 F60972FA-C8E6-4387-A75B-3E223ECACC72.JPG
   1 F91ACFBF-6833-44B1-89FE-804F8D1BE23D.JPG
 
As you can see, the leading 2 means that image was referenced 2 times, but only exported once.
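
To get the totals and only the repeated filenames in one go, something like the following might help. It is a rough sketch, assuming count.txt holds one filename per line and ends with a newline; dup_summary is a hypothetical name of my own.

```shell
# Hypothetical helper: summarise a file ($1) that has one image filename per line.
dup_summary() {
  # Total number of references (lines) in the file.
  total=$(wc -l < "$1" | tr -d ' ')
  # Number of distinct filenames.
  unique=$(sort "$1" | uniq | wc -l | tr -d ' ')
  echo "referenced: $total"
  echo "unique: $unique"
  echo "repeats not exported separately: $((total - unique))"
  # List only the filenames that occur more than once, with their counts.
  sort "$1" | uniq -c | awk '$1 > 1'
}
```

Run as dup_summary count.txt. For the note described here I would expect it to report 97 referenced and 65 unique, followed by the lines that carry a 2 or a 3.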
 
I am glad I took the time to check this, as I don't want to miss any images.  I am curious what criteria EN uses to determine whether one file is identical to another.  I hope it is some sort of hash comparison, and not something based on width/height, file size, or a combination thereof.
 
I think I could test it by putting in two JPG images that differ by 1px and seeing whether 1 or 2 images get exported.
In testing, I took an image, duplicated it, dropped both copies into EN and then exported the attachments.  EN exported only one of them: the second image, that is, the one with the newest timestamp.
 
I then used a quick sneaky trick, since just opening and re-saving a JPG will massively change it: I echo'd 1 byte onto the end of the first JPG file.  I then exported the note's attachments and EN exported 2 images, as it should, since they now differed by 1 byte.
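
For anyone who wants to repeat the one-byte test, here is a small sketch of how it could be set up. It uses printf where I used echo, and the names make_one_byte_variant and same_bytes are hypothetical. It also assumes image viewers ignore a stray byte after the end of the JPEG data, which I believe is usually the case but have not verified for every viewer.

```shell
# Hypothetical helper: copy the file $1 to $2 and append a single byte to the copy,
# so the two files differ by exactly one byte.
make_one_byte_variant() {
  cp "$1" "$2" && printf 'x' >> "$2"
}

# Hypothetical helper: succeeds (exit status 0) only if the two files
# are byte-for-byte identical.
same_bytes() {
  cmp -s "$1" "$2"
}
```

For example, make_one_byte_variant photo.jpg photo-plus1.jpg followed by same_bytes photo.jpg photo-plus1.jpg || echo "files differ" confirms the copies really are different before you drop them into EN. The file names here are placeholders.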
 
As far as I can tell, all looks good and is working great.  I wanted to post this here in case someone is searching for why their exports contain fewer attachments than they expect; this very well could be the reason.

 
