Jérôme Belleman

Tuning IBM Platform LSF for Performance

28 Nov 2012

The Platform LSF batch system comes with a wealth of parameters to get the best performance out of it. Here are a few which had an impact in our setup at CERN.

The meanings of these parameters, and their consequences and side effects, aren't always straightforward to grasp. I try to describe them below.

1 The Job Accept Interval

The JOB_ACCEPT_INTERVAL parameter determines the time to wait before dispatching the next job, to loosely quote the documentation. If set to 0 in lsb.params, it can put quite a bit of load on the system. It's better to comment it out and leave the default of 1, according to experts.
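
In lsb.params, this simply means leaving the parameter out or commented, e.g. (a sketch; the Parameters section will already exist in your file):

    Begin Parameters
    # Leave JOB_ACCEPT_INTERVAL unset to keep the default of 1;
    # JOB_ACCEPT_INTERVAL = 0 would remove the dispatch delay entirely.
    End Parameters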

2 Maximum Number of Connections

Follow this recipe to decide what to set MAX_SBD_CONNS to:

MAX_SBD_CONNS = # hosts + 2 × LSB_MAX_JOB_DISPATCH_PER_SESSION + 200
              = 4500 + 2 × 3500 + 200
              = 11700

This suggests setting MAX_SBD_CONNS=11700 in lsb.params.
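
In lsb.params, that would read as follows (a sketch; the figures are of course specific to our setup):

    Begin Parameters
    # 4500 hosts + 2 × 3500 (LSB_MAX_JOB_DISPATCH_PER_SESSION) + 200
    MAX_SBD_CONNS = 11700
    End Parameters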

3 Lifting Logging

To relieve LSF of potentially unnecessary load, a possibly useful thing to do is to set LSF_LOG_MASK=LOG_INFO in lsf.conf. Likewise, consider setting LSB_DEBUG_SBD=LC_COMM.
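
In lsf.conf, that would look like this (a sketch; double-check which log classes your LSF version still accepts):

    # Only log messages at INFO level and above
    LSF_LOG_MASK=LOG_INFO
    # Limit SBD debug output to the communication log class
    LSB_DEBUG_SBD=LC_COMM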

4 Number of Job Decisions per Scheduling Session

LSB_MAX_JOB_DISPATCH_PER_SESSION should be set to the number of hosts + 2002 in lsf.conf.
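
For example, a hypothetical 1000-host cluster would get, in lsf.conf (a sketch, merely applying the rule above):

    # 1000 hosts + 2002
    LSB_MAX_JOB_DISPATCH_PER_SESSION=3002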

5 Handshake

Ensure that LSB_ENABLE_SUB_HANDSHAKE=N in lsf.conf, which is the default. This parameter no longer appears in the documentation of more recent LSF versions, which suggests it has become obsolete.

6 Load to Server Hosts

The LSB_LOAD_TO_SERVER_HOSTS parameter is reportedly obsolete and no longer documented. Nevertheless, it may be advisable to set LSB_LOAD_TO_SERVER_HOSTS=Y in lsf.conf, just to be on the safe side.

7 Resource Usage Update Interval

The LSB_RUSAGE_UPDATE_INTERVAL parameter is no longer documented in later versions either. It is suggested to set LSB_RUSAGE_UPDATE_INTERVAL=40 in lsf.conf. In effect, this means the SBD reports to the MBD if ΔCPU > 10%, or every 40 s; by default, it reports if ΔCPU > 10%, or every 20 s.
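
Putting the settings from these last three sections together, the relevant lsf.conf lines would read (a sketch; the first two parameters being undocumented nowadays, treat them with due caution):

    LSB_ENABLE_SUB_HANDSHAKE=N
    LSB_LOAD_TO_SERVER_HOSTS=Y
    # The SBD reports to the MBD if ΔCPU > 10%, or every 40 s
    LSB_RUSAGE_UPDATE_INTERVAL=40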

8 Communications

The rusage – resource usage – is sent from all the SBDs (i.e. from each server) to the MBD. This operation causes some load on the master, and that's one of the reasons you don't want more servers than strictly necessary. Some clean-up of broken/useless hosts should therefore be performed from time to time. You can actually grep "rusageJob rtime" in timing traces to assess the impact.
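
For instance, assuming the timing traces end up in the master's mbatchd log (the path is the one from the Scheduling section below; adapt as needed):

    # Peek at the time spent handling rusage updates
    grep 'rusageJob rtime' /var/log/lsf/mbatchd.log | tail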

9 CPU Binding

I once thought that our CPU binding was wrong because the core IDs in /proc/cpuinfo didn't match, not being in a uniform sequence. It doesn't matter, as it turns out. What does matter, however, is that the first core should go to the LIM. There doesn't seem to be any particular recommendation as to how many cores to bind to each process.

One important aspect when setting up CPU binding is the relationship between the processes: mbatchd doesn't take all the queries for itself; it only takes the bsubs and the dialogue with the LIM. It leaves the bjobs, bqueues and bhosts queries to another mbatchd which it has forked. Having said that, the parent mbatchd forks every time a query (or request, in LSF parlance) comes in, which makes the parent mbatchd very busy anyway. That's why it deserves a few cores.

The taskset tool can come in handy. It's a standard Linux tool which is useful to make sure the binding you've set up is working.
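
For example (a sketch; it assumes standard procps/util-linux tools and that the daemons run under their usual process names):

    # Show the CPU affinity of the LIM and of each mbatchd process
    for pid in $(pgrep -x lim) $(pgrep -x mbatchd); do
        taskset -cp "$pid"
    done

    # Bind the LIM to the first core, as recommended above
    taskset -cp 0 "$(pgrep -x lim)"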

10 Scheduling

If you find in /var/log/lsf/mbatchd.log errors looking like:

start_job: Failed to call sbatchd on host <foobar>: Timeout on connect call to server

... something is definitely amiss. This spells a slowdown in scheduling, as each call from the master to the server's sbatchd may mean waiting for a timeout: clean-up is clearly needed in such a case.
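
A quick way to spot the worst offenders is to count these errors per host (a sketch, reusing the log path above):

    grep 'Timeout on connect call to server' /var/log/lsf/mbatchd.log |
        grep -o 'host <[^>]*>' | sort | uniq -c | sort -rn | head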

11 Dead Processes

Another typical problem is jobs failing to be cleaned up properly. There are two sides to this problem: jobs whose UNIX processes are long dead but not cleaned from LSF's process table and – the other way around – jobs cleaned up from LSF's process table but whose UNIX processes are still running and wasting resources, which is common with MPI jobs with their process intercommunication and zombies.
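
A rough sketch for spotting the second kind on an execution host – zombies and processes re-parented to init are good candidates for leftovers (filtering by your batch users' accounts, not shown here, cuts down the noise considerably):

    # Zombie processes, often left behind by MPI jobs
    ps -eo pid,ppid,user,stat,comm | awk '$4 ~ /^Z/'

    # Processes re-parented to init (PPID 1), i.e. possible orphaned job processes
    ps -eo pid,ppid,user,etime,comm | awk '$2 == 1'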
