Duplicate Files and How to Deal With Them
When juggling many files, we all accidentally end up with duplicates, which waste space and cause a lot of confusion. Luckily, there are ways to deal with them.
1 The Problem
While wasting space because of duplicates is a problem in itself, it's fair to say it only becomes a serious one when dealing with very large files such as videos and possibly photos. I once had to help somebody with a photo collection managed by iPhoto, which, on the one hand, appears to have an annoying tendency to duplicate files without asking and, on the other hand, gives such an opaque view of your photo collection that you no longer really know what you're doing. In that particular collection, I found some photos accidentally duplicated up to 6 times. After cleaning it all up, I realised I'd saved several GBs. Even if storage is nowadays large enough to tolerate the occasional duplicate file, this particular clean-up was well worth it after all.
Where I find duplicate files to be even more annoying is how they cause confusion. It's easy to find yourself in situations where you have a given file located in different paths of your filesystems – sometimes even on different devices – and under different names. Luckily, there are several tools which can help you sort things out.
2 Dealing with Duplicates with FDUPES
While some clever usage of the diff command can already help, I quickly got seduced by the very convenient FDUPES. It can quickly search a whole file hierarchy for duplicates and produce a report of files grouped into sets of duplicates.
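For instance, a quick manual check with diff – here with hypothetical file and directory names – will tell you whether two files, or even two whole directory trees, are identical:
diff -q photo-1.jpg photo-2.jpg
diff -rq photos-2021 photos-backup
But that only works when you already suspect which files might be copies of one another; FDUPES does the searching for you.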
FDUPES is fast enough that, in practice, I like using it by simply giving it a large directory to search for duplicates:
fdupes -r large-directory
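The report groups duplicates into sets, one path per line, with a blank line between sets – something like this, with made-up paths:
large-directory/photos/IMG_1234.JPG
large-directory/backup/IMG_1234.JPG

large-directory/docs/report.pdf
large-directory/old-stuff/report-copy.pdf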
Another example: when I get hold of a set of photos from another device and want to know whether I've already got some of them, I just run something like:
fdupes -r large-directory /media/that-other-device
3 Carefully Removing Duplicates
FDUPES has the --delete option, which interactively asks which files you'd like to keep – and it can let you keep all of them if you want to. If you're dealing with thousands of duplicates, this can of course quickly become a long haul.
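In its simplest form – sticking to the same large-directory as before – that's just the recursive search with deletion enabled, and FDUPES then prompts for every set it finds:
fdupes -r --delete large-directory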
What I've also once done – and I wish I hadn't, so don't go there either – is to use the --omitfirst option, which causes FDUPES to no longer list the first copy of a set – only the rest of it. Used with e.g. xargs and rm, this can remove all copies but the first one. In fact, if I'd read the man page, I would have been able to more conveniently rely on the --noprompt option together with --delete to do exactly that. But one way or another, this comes with a serious caveat: which copy of a set comes first is highly unpredictable: in the same FDUPES run, it can seemingly randomly come from one source or another, e.g. from large-directory or /media/that-other-device in my previous example.
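For the record – and precisely because I'd advise against running them unattended – the two variants would look something like this:
# removes everything but the (unpredictably chosen) first copy of each set
fdupes -r --omitfirst large-directory /media/that-other-device | xargs rm
# same effect, with FDUPES itself deleting the extra copies without prompting
fdupes -r --delete --noprompt large-directory /media/that-other-device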
I realised then that it might be altogether safer not to rely on --omitfirst – and therefore not on --delete and --noprompt either – but instead to grep the report for a specific and unique path before piping it on to xargs and rm. Something like this:
fdupes -r large-directory /media/that-other-device | \
grep /media/that-other-device | \
xargs -d '\n' rm
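One word of caution: plain xargs splits on any whitespace, so paths containing spaces would get mangled – hence the -d '\n' above, which assumes GNU xargs. And before letting rm loose, it doesn't hurt to do a dry run with something harmless first:
fdupes -r large-directory /media/that-other-device | \
grep /media/that-other-device | \
xargs -d '\n' ls -l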