Duplicate Files and How to Deal With Them
When juggling many files, we all accidentally end up with duplicates, which waste space and cause a lot of confusion. Luckily, there are ways to deal with them.
1 The Problem
While wasting space because of duplicates is a problem in itself, it's fair to say it only becomes a serious one when dealing with very large files such as videos and possibly photos. I once had to help somebody with a photo collection managed by iPhoto, which, on the one hand, appears to have an annoying tendency to duplicate files without asking and, on the other hand, gives such an opaque view of your photo collection that you no longer really know what you're doing. In that particular collection, I found some photos accidentally duplicated up to 6 times. After cleaning it all up, I realised I'd saved several GBs. Even if storage is nowadays large enough to tolerate the occasional duplicate file, this particular clean-up was well worth it after all.
Where I find duplicate files to be even more annoying is how they cause confusion. It's easy to find yourself in situations where you have a given file located in different paths of your filesystems – sometimes even on different devices – and under different names. Luckily, there are several tools which can help you sort things out.
2 Dealing with Duplicates with FDUPES
While some clever usage of the diff command can already help, I quickly got seduced by the very convenient FDUPES. It can quickly search a whole file hierarchy for duplicates and produce a report of files grouped into sets of duplicates.
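For instance, a quick manual check with diff – here with hypothetical file and directory names – will tell you whether two files, or even two whole directory trees, are identical:
diff -q photo-1.jpg photo-2.jpg
diff -rq photos-2021 photos-backup
But that only works when you already suspect which files might be copies of one another; FDUPES does the searching for you.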
FDUPES is fast enough that, in practice, I like using it by simply giving it a large directory to search for duplicates:
fdupes -r large-directory
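The report groups duplicates into sets, one path per line, with a blank line between sets – something like this, with made-up paths:
large-directory/photos/IMG_1234.JPG
large-directory/backup/IMG_1234.JPG

large-directory/docs/report.pdf
large-directory/old-stuff/report-copy.pdf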
Another example: when I get hold of a set of photos from another device and want to know whether I've already got some of them, I just run something like:
fdupes -r large-directory /media/that-other-device
3 Carefully Removing Duplicates
FDUPES has the --delete option, which interactively asks which files you'd like to keep – and it can let you keep all of them if you want to. If you're dealing with thousands of duplicates, this can of course quickly become a long haul.
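In its simplest form – sticking to the same large-directory as before – that's just the recursive search with deletion enabled, and FDUPES then prompts for every set it finds:
fdupes -r --delete large-directory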
What I've also once done – and I wish I hadn't, so don't go there either – is to use the --omitfirst option, which causes FDUPES to no longer list the first copy of a set – only the rest of it. Used with e.g. xargs and rm, this can remove all copies but the first one. In fact, if I'd read the man page, I would have been able to more conveniently rely on the --noprompt option together with --delete to do exactly that. But one way or another, this comes with a serious caveat: which copy of a set comes first is highly unpredictable: in the same FDUPES run, it can seemingly randomly come from one source or another, e.g. from large-directory or /media/that-other-device in my previous example.
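For the record – and precisely because I'd advise against running them unattended – the two variants would look something like this:
# removes everything but the (unpredictably chosen) first copy of each set
fdupes -r --omitfirst large-directory /media/that-other-device | xargs rm
# same effect, with FDUPES itself deleting the extra copies without prompting
fdupes -r --delete --noprompt large-directory /media/that-other-device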
I realised then that it might be altogether safer not to rely on --omitfirst – and therefore not on --delete and --noprompt either – but instead to grep the report for a specific and unique path before piping it on to xargs and rm. Something like this:
fdupes -r large-directory /media/that-other-device | \
grep /media/that-other-device | \
xargs -d '\n' rm
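One word of caution: plain xargs splits on any whitespace, so paths containing spaces would get mangled – hence the -d '\n' above, which assumes GNU xargs. And before letting rm loose, it doesn't hurt to do a dry run with something harmless first:
fdupes -r large-directory /media/that-other-device | \
grep /media/that-other-device | \
xargs -d '\n' ls -l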