Tuning IBM Platform LSF for Performance
The Platform LSF batch system comes with a wealth of parameters for getting the best performance out of it. Here are a few that had an impact in our setup at CERN. The meaning of these parameters, their consequences and their side effects aren't always straightforward to grasp; I try to describe them below.
1 The Job Accept Interval
The JOB_ACCEPT_INTERVAL parameter determines the time to wait before dispatching the next job, to loosely quote the documentation. If set to 0 in lsb.params, it can put quite a bit of load on the system. Better to comment it out and leave the default of 1, according to experts.
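In lsb.params terms, that means not setting the parameter at all; a minimal sketch (the comment is purely illustrative):

Begin Parameters
# Leave JOB_ACCEPT_INTERVAL unset to keep the default of 1;
# JOB_ACCEPT_INTERVAL = 0 would remove the dispatch pause entirely.
End Parameters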
2 Maximum Number of Connections
Follow this recipe to decide what to set MAX_SBD_CONNS to:

MAX_SBD_CONNS = # hosts + 2 × LSB_MAX_JOB_DISPATCH_PER_SESSION + 200
              = 4500 + 2 × 3500 + 200
              = 11700
This suggests setting MAX_SBD_CONNS=11700 in lsb.params.
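Spelled out as an lsb.params entry (a sketch; 4500 hosts and 3500 dispatches per session are the numbers from our setup):

Begin Parameters
# 4500 hosts + 2 x 3500 dispatches per session + 200
MAX_SBD_CONNS = 11700
End Parameters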
3 Lifting Logging
To relieve LSF of potentially unnecessary load, it may be useful to set LSF_LOG_MASK=LOG_INFO in lsf.conf. Likewise, consider setting LSB_DEBUG_SBD=LC_COMM.
4 Number of Job Decisions per Scheduling Session
LSB_MAX_JOB_DISPATCH_PER_SESSION should be set to 3500 in lsf.conf, the value plugged into the MAX_SBD_CONNS recipe above.
5 Handshake
Ensure that LSB_ENABLE_SUB_HANDSHAKE=N in lsf.conf, which is the default. This parameter doesn't appear to be documented anymore in recent LSF versions, suggesting it is obsolete.
6 Load to Server Hosts
The LSB_LOAD_TO_SERVER_HOSTS parameter is reportedly obsolete and no longer documented. Nevertheless, it may be advisable to set LSB_LOAD_TO_SERVER_HOSTS=Y in lsf.conf, just to be on the safe side.
7 Resource Usage Update Interval
The LSB_RUSAGE_UPDATE_INTERVAL parameter is no longer documented for later versions. It is suggested to set LSB_RUSAGE_UPDATE_INTERVAL=40 in lsf.conf. In effect, this means the SBD reports to the MBD if ΔCPU>10%, or at the latest every 40 s; the default is to report if ΔCPU>10% or every 20 s.
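Collected in one place, sections 3 to 7 amount to the following lsf.conf excerpt (a sketch of the values discussed above, not a definitive recommendation):

# Section 3: trim logging
LSF_LOG_MASK=LOG_INFO
LSB_DEBUG_SBD=LC_COMM
# Section 4: job decisions per scheduling session
LSB_MAX_JOB_DISPATCH_PER_SESSION=3500
# Section 5: the default, stated explicitly
LSB_ENABLE_SUB_HANDSHAKE=N
# Sections 6 and 7: undocumented in recent versions, best effort
LSB_LOAD_TO_SERVER_HOSTS=Y
LSB_RUSAGE_UPDATE_INTERVAL=40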
8 Communications
The rusage (resource usage) is sent from all the SBDs (i.e. from each server) to the MBD. This operation puts some load on the master, which is one of the reasons you don't want more servers than strictly necessary. Some clean-up of broken or useless hosts should therefore be performed from time to time. You can actually grep "rusageJob rtime" in timing traces to assess the impact.
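A sketch of such a check, assuming timing traces are enabled for the MBD (e.g. via LSB_TIME_MBD in lsf.conf) and end up in the log file used throughout this setup:

grep "rusageJob rtime" /var/log/lsf/mbatchd.log | tail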
9 CPU Binding
I once thought that our CPU binding was wrong because the core ids in /proc/cpuinfo didn't match, as they're not in a uniform sequence. It turns out this doesn't matter. What does matter, however, is that the first core should be given to LIM. There doesn't seem to be any particular recommendation as to how many cores to bind to each process.
One important aspect when setting up CPU binding is the relationship between the processes: mbatchd doesn't take all the queries for itself; it only takes the bsubs and the dialogue with LIM. It leaves the bjobs, bqueues and bhosts queries to another mbatchd which it has forked. Having said that, the parent mbatchd forks every time a query (or request, in LSF parlance) comes in, which keeps the parent mbatchd very busy anyway. That's why it deserves a few cores.
The taskset tool can come in handy: it's a standard Linux tool, useful to make sure the binding you've set up is actually working.
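For instance, to check the affinity of the running daemons (the pgrep patterns and the output are illustrative):

taskset -cp $(pgrep -x lim)          # e.g. "pid 4242's current affinity list: 0"
taskset -cp $(pgrep -o -x mbatchd)   # -o picks the oldest, i.e. the parent mbatchd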
10 Scheduling
If you find errors looking like the following in /var/log/lsf/mbatchd.log:
start_job: Failed to call sbatchd on host <foobar>: Timeout on connect call to server
... something is definitely amiss. Each call from the master to a server's sbatchd may then mean waiting for a timeout, which slows down scheduling: clean-up is clearly needed in such a case.
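A sketch of such a clean-up, assuming the broken hosts show up as unreachable in bhosts (the host name is the illustrative one from the log above):

bhosts -w | grep unreach   # hosts whose sbatchd cannot be reached
badmin hclose foobar       # close the broken host so the scheduler stops trying it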
11 Dead Processes
Another typical problem is jobs failing to be cleaned up properly. There are two sides to this problem: jobs whose UNIX processes are long dead but which were never cleaned from LSF's process table, and, the other way around, jobs cleaned up from LSF's process table whose UNIX processes are still running and wasting resources. The latter is common with MPI jobs, whose intercommunicating processes tend to leave strays and zombies behind.
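A manual way to spot the second kind (the host name is illustrative):

bjobs -u all -r -m foobar    # what LSF believes is running on the host
ssh foobar ps -ef --forest   # what is actually running there

Any process tree in the second list that belongs to no job in the first is a candidate for clean-up.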