Inotify on Distributed Filesystems
While inotify is a great way to track changes on a filesystem, it can be treacherous on distributed ones if you don't understand how they work.
1 Inotify
Inotify stands for inode notify. Unsurprisingly, the job of this kernel subsystem is to report filesystem events. It is far more efficient and versatile than polling, that is, periodically checking whether an event has taken place on the filesystem.
It comes with a library so your programs can easily collect events on the files and directories you're interested in. There are bindings for a variety of programming languages. I once used pyinotify, a binding for Python.
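To make the mechanism concrete, here is a minimal sketch of watching a directory through the kernel interface. It is Linux-only and calls the C library directly through ctypes to stay dependency-free; it does not use pyinotify. The helper names open_watch and read_events are hypothetical, chosen for this illustration.

```python
import ctypes
import ctypes.util
import os
import struct

# Event bits from <sys/inotify.h>
IN_MODIFY = 0x00000002
IN_CREATE = 0x00000100

# struct inotify_event header: int wd; uint32_t mask, cookie, len;
EVENT_HEADER = struct.Struct("iIII")

libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6", use_errno=True)
libc.inotify_add_watch.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_uint32]


def open_watch(path, mask=IN_CREATE | IN_MODIFY):
    """Create an inotify instance watching `path`; return its file descriptor."""
    fd = libc.inotify_init()
    if fd < 0:
        raise OSError(ctypes.get_errno(), "inotify_init failed")
    if libc.inotify_add_watch(fd, os.fsencode(path), mask) < 0:
        err = ctypes.get_errno()
        os.close(fd)
        raise OSError(err, "inotify_add_watch failed")
    return fd


def read_events(fd):
    """Block until at least one event is queued; return a list of (mask, name)."""
    buf = os.read(fd, 4096)
    events = []
    offset = 0
    while offset < len(buf):
        _wd, mask, _cookie, name_len = EVENT_HEADER.unpack_from(buf, offset)
        offset += EVENT_HEADER.size
        # The name is NUL-padded up to name_len bytes.
        name = buf[offset:offset + name_len].split(b"\0", 1)[0].decode()
        offset += name_len
        events.append((mask, name))
    return events
```

A collector would call open_watch once on the directory of interest and then loop on read_events, which sleeps in the kernel until something happens instead of checking repeatedly.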
2 Context
The CERN batch computing cluster does the number crunching for the LHC and other physics experiments, at the heart of the Worldwide LHC Computing Grid. To schedule jobs fairly across all our user communities and to understand the resources we need to pledge, we have to collect information about each of the 300,000 jobs that physicists submit every day. The batch computing service runs on Platform LSF, which records this information in accounting files. These files are the main data source of the accounting system. I use inotify to collect new job information as soon as it becomes available.
The LSF batch computing cluster comprises 4500 worker nodes, one master node and one failover node to back it up. The master node and its failover share an NFS mount to remain synchronised. The accounting files are also recorded on NFS.
3 The Issue
The accounting files are only accessible from the master nodes, so that is where I deployed the inotify collector daemon. However, I noticed at one point that only one of the two nodes was collecting data. That was in fact exactly what I wanted, since I didn't want two nodes collecting the same data and storing it in the same database. Still, I was uneasy about what initially looked like a bug.
I later understood that inotify can only report changes on the machine where they were made, because events are generated by the local kernel as it handles a filesystem operation. Since the failover node doesn't write jobs to the accounting files, the inotify daemon running on it couldn't see them appear.
This behaviour isn't specific to NFS. CERN is a large AFS site and I saw exactly the same thing there too. In distributed systems, it may very well be intentional and something you'd like to keep this way. But it is certainly worth highlighting, as it can be surprising at first.
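If a collector did need to see records written from another node, one option would be to fall back on polling, since checking the file's size does not depend on local inotify events. The sketch below is an illustration only: the function name read_new_records is hypothetical, and it assumes the accounting file is append-only with one record per line. On NFS, client-side attribute caching may also delay when a new size becomes visible.

```python
import os


def read_new_records(path, offset):
    """Return (lines, new_offset) for complete lines appended since `offset`."""
    size = os.stat(path).st_size
    if size < offset:
        # The file shrank (truncated or rotated): start again from the top.
        offset = 0
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read(size - offset)
    # Keep only complete lines; a partial trailing record is left for next time.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode(errors="replace").splitlines()
    return lines, offset + end
```

A daemon would call this in a loop with a sleep between iterations, carrying the returned offset forward. That is precisely the periodic checking inotify avoids, so it only makes sense where the events cannot be delivered locally.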