Monitoring
A history of monitoring developments, tools and ideas, centred on the CERN Batch Service use case, which is challenging because of the large scale of its cluster and user base.
It seems that many of my jobs over the years involved putting monitoring in place, which sort of makes sense when you realise that monitoring is the key to business success. Everybody seems to agree on that point, which is a problem when nobody seems to agree on how best to implement monitoring systems; the result is that I ended up trying several of them and, rather worryingly, I'm still not sure I've found a single solution that quite cuts the mustard.
One particular trend I've noticed is that there are many suitable ways to collect, store, retrieve and correlate data. But through my endeavours developing monitoring for the CERN Batch System, I found that what's invariably missing is a good way to present data visually.
Sometimes I even find Vim best suited for that job.
1 Lightweight Monitoring
1.1 LSF Queues
Before I tackled the delicate problem of monitoring, there were already some legacy monitoring tools in place, such as a PHP web application that uses RRDtool plots to show how many running and pending jobs are tracked in the 60 queues we provide in our batch system. This actually proved to be good monitoring, as it outlived most of the other monitoring systems we've written since then.
This time, however, it'll have to go. Really.
1.2 Command-Line Tools
When I first wrote monitoring tools for our batch system, I did so while designing the new batch accounting system. I then had at my disposal a large amount of data which I stored in an Oracle database and I thought it was good material for monitoring too. I was regularly asked for a variety of plots to understand this or that behaviour of the batch system. I wrote command line tools for this purpose which would draw them with matplotlib. And soon, it became clear that such useful plots would be even more useful if end-users could generate them themselves easily via a web interface.
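To illustrate the kind of command-line plotting tool described above, here is a minimal Python sketch. The record layout (queue name, running jobs and pending jobs per sample), the function names and the command-line flags are all assumptions made for this example, not the original tools, and a CSV file stands in for the Oracle query.

```python
import argparse
import csv
from collections import defaultdict


def jobs_per_queue(rows):
    """Sum running and pending jobs per queue from (queue, running, pending) rows."""
    totals = defaultdict(lambda: {"running": 0, "pending": 0})
    for queue, running, pending in rows:
        totals[queue]["running"] += int(running)
        totals[queue]["pending"] += int(pending)
    return dict(totals)


def plot_totals(totals, output):
    """Draw a grouped bar chart of the totals and save it to a file."""
    # Imported here so the aggregation works even without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    queues = sorted(totals)
    positions = list(range(len(queues)))
    width = 0.4
    fig, ax = plt.subplots()
    ax.bar([p - width / 2 for p in positions],
           [totals[q]["running"] for q in queues], width, label="running")
    ax.bar([p + width / 2 for p in positions],
           [totals[q]["pending"] for q in queues], width, label="pending")
    ax.set_xticks(positions)
    ax.set_xticklabels(queues, rotation=45, ha="right")
    ax.set_ylabel("jobs")
    ax.legend()
    fig.tight_layout()
    fig.savefig(output)


def main(argv=None):
    """Hypothetical entry point: read a header-less CSV and write a plot."""
    parser = argparse.ArgumentParser(description="Plot jobs per queue from a CSV file.")
    parser.add_argument("csv_file", help="CSV with columns queue,running,pending")
    parser.add_argument("-o", "--output", default="jobs.png", help="output image file")
    args = parser.parse_args(argv)
    with open(args.csv_file, newline="") as handle:
        totals = jobs_per_queue(csv.reader(handle))
    plot_totals(totals, args.output)
```

Calling `main(["samples.csv", "-o", "queues.png"])` would read the samples and write the plot. Keeping the aggregation separate from the drawing is what makes it easy to reuse the same logic behind a web interface later.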
2 NoSQL Databases
I started writing monitoring tools for the batch system about when a new buzzword was introduced: NoSQL databases. I wasn't particularly unhappy with the performance delivered by Oracle, but I thought I'd give this new technology a whirl anyway.
2.1 Fairshare Monitoring
My first encounter with NoSQL databases was with Apache Cassandra, a distributed database system which I used to store LSF fairshare data to help users understand the share they get. I wrote a Django application to generate plots on the fly. It helped us understand one thing about LSF: we don't believe in the share information it reports.
2.2 Live Monitoring, Waiting Time Monitoring and Historical Usage
Cassandra's career was short-lived in the CERN Batch Monitoring business, not because it wasn't good enough, but because we found something even better: OpenTSDB, a time series database which came with its own basic dashboard.
It provides a convenient API which inspired us to write Django applications using data collected live from LSF and more long-term information such as how long a job lives in the batch system from submission to completion and some historical usage data.
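To give an idea of what that API looks like: OpenTSDB's HTTP interface accepts data points as JSON objects with `metric`, `timestamp`, `value` and `tags` fields on `/api/put`, and serves them back from `/api/query`. The sketch below only builds the request body and the query URL, without sending anything. The metric name `batch.jobs.running`, the `queue` tag, the host name and the helper functions are invented for the example.

```python
import json
from urllib.parse import urlencode


def make_datapoint(metric, timestamp, value, **tags):
    """Build one OpenTSDB data point; OpenTSDB requires at least one tag."""
    if not tags:
        raise ValueError("OpenTSDB data points need at least one tag")
    return {"metric": metric, "timestamp": int(timestamp), "value": value, "tags": tags}


def put_body(datapoints):
    """Serialise data points as the JSON array accepted by /api/put."""
    return json.dumps(list(datapoints))


def query_url(base_url, metric, start="1h-ago", aggregator="sum", **tags):
    """Build an /api/query URL of the form ?start=...&m=aggregator:metric{tag=value}."""
    tag_filter = ",".join(f"{key}={value}" for key, value in sorted(tags.items()))
    m = f"{aggregator}:{metric}" + (f"{{{tag_filter}}}" if tag_filter else "")
    # urlencode percent-escapes the colon, braces and equals sign in the m parameter.
    return f"{base_url}/api/query?" + urlencode({"start": start, "m": m})
```

A Django view could then fetch such a URL and hand the returned series to a plotting routine.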
2.3 Job Info
I was busy writing yet another Django application, this time to allow end-users to query information about their own jobs, when I was introduced to yet another solution, a solution so promising that it profoundly changed the way we do monitoring for the batch system, a solution that always raises eyebrows when I say its ridiculous name: Splunk.
2.4 The Splunk Revolution
Despite its distasteful name, Splunk is a tremendous monitoring system. It comes with its own database, its own clever indexing system and, what I truly like about it, its own dashboard toolkit. As far as I was concerned, that is what made it stand out from the previous solutions.
Unfortunately, Splunk didn't come without a storm on the horizon: its pricing was such that we couldn't seriously consider it for all of our monitoring. What's more, I once went to a meeting introducing a new version where they announced they would progressively get rid of their own brilliant dashboard system in favour of custom JavaScript libraries. This was throwing out the baby with the bathwater, and it simply defeated the whole purpose I saw in Splunk. So I decided to give it up.
2.5 Grid Services Statistics
Back to OpenTSDB, then, for another project to display statistics dashboards not only for the batch system but for other services run in our section too. Feeling let down by Splunk, I decided to try another dashboard solution I had actually been considering for a long time: Ext JS, which offers a wealth of GUI widgets, including plots.
However, about the same time we dabbled with Ext JS, we saw the rise of a pair of giants which made us forget all else: Elasticsearch and Kibana.
3 Elasticsearch/Kibana
Elasticsearch does the storage, Kibana displays the dashboards. And together, they're seen as an alternative to Splunk. It's a free solution, not quite as good as Splunk, but good enough.
3.1 Batch Efficiency
Our first project with Elasticsearch/Kibana started about when we became interested in job efficiencies in our batch system. We were baffled at how easy it is to throw data into Elasticsearch and display it in Kibana.
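Part of that ease is that Elasticsearch takes plain JSON documents. As a rough sketch, the snippet below computes a job efficiency and formats records for the bulk API, which expects newline-delimited JSON: an action line followed by the document itself. The definition of efficiency used here (CPU time divided by wall-clock time multiplied by the number of cores), the field names and the index name `batch-efficiency` are assumptions for illustration, not our actual schema.

```python
import json


def efficiency(cpu_seconds, wall_seconds, cores=1):
    """CPU time as a fraction of the available wall-clock time (0.0 for zero-length jobs)."""
    available = wall_seconds * cores
    return cpu_seconds / available if available > 0 else 0.0


def bulk_body(jobs, index="batch-efficiency"):
    """Format job records as an Elasticsearch bulk request body (newline-delimited JSON)."""
    lines = []
    for job in jobs:
        value = efficiency(job["cpu_seconds"], job["wall_seconds"], job.get("cores", 1))
        doc = dict(job, efficiency=value)
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                            # document line
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)
```

The resulting string would be sent to the `_bulk` endpoint with the content type `application/x-ndjson`; once the documents are indexed, Kibana can chart the `efficiency` field directly.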
Kibana comes with a complete web interface to set up your dashboards with a few clicks. Well, with many clicks, in fact.
3.2 Batch Operations
Kibana dashboards are stored as JSON data and the Batch Operations dashboard was a playground to edit that JSON data to automate recurrent GUI components. It's also where we started using templated dashboards allowing users to display custom information from the URL. Kibana goes even further in this respect with scripted dashboards.
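The idea of generating recurrent components instead of clicking them together can be sketched as follows. The dashboard layout used here (`title`, `rows`, `panels`, `query`) is a simplified, made-up structure and not Kibana's actual schema. The point is only that repeated panels come out of a loop and that a parameter, which could be taken from the URL, selects what is shown.

```python
import json


def make_panel(title, query, span=4):
    """One panel definition in the hypothetical layout."""
    return {"title": title, "type": "histogram", "span": span, "query": query}


def make_dashboard(user, queues):
    """Build a dashboard with one panel per queue, filtered on a single user."""
    panels = [
        make_panel(f"{queue} jobs", f"queue:{queue} AND user:{user}")
        for queue in queues
    ]
    return {"title": f"Batch operations for {user}", "rows": [{"panels": panels}]}


def dashboard_json(user, queues):
    """Serialise the generated dashboard so it can be stored or uploaded."""
    return json.dumps(make_dashboard(user, queues), indent=2)
```

With a generator like this, adding a queue to every dashboard is a one-line change rather than another round of clicks.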
3.3 Batch Shares
This is an overhaul of the Fairshare Monitoring discussed earlier on, this time using trustworthy accounting data and all we've learned about Elasticsearch and Kibana.
3.4 LSF Queues – Again
We might have come full circle with this revamp of the due-to-leave LSF Queues dashboard. We'd like to improve it even further with user and group information.
4 Outlook
What all these dashboard systems have in common is that they're web-based, i.e. not particularly quick. I've always wondered if, at least for internal monitoring, we wouldn't be better off with e.g. Qt and Qwt, which are perfect for writing fast local applications and offer a myriad of useful widgets.
Of course, that's not particularly snappy over an SSH connection. I've tried Urwid, which might be of interest in that case. In the same vein, I was once told about termui, which was inspired by blessed-contrib; both look startling.