IBM Platform LSF Debugging Techniques
IBM Platform LSF is a very sophisticated batch scheduler. Here are a few ideas I collected from intensive debugging sessions with our setup at CERN.
Platform LSF, like all sophisticated systems, doesn't always make it obvious what it's up to. The debugging techniques described below can help understand its workings.
1 Debug Traces
You can get a precise idea of everything a given LSF command is busy with by setting a few environment variables:
export LSF_LOG_MASK=LOG_DEBUG3
export LSB_DEBUG_CMD="LC_EXEC LC_TRACE LC_COMM"
There's even an LC_HANG debug class you may want to add to the other three. These environment variables can be used both in the master and in the servers. Naturally, in the master, debugging is enabled by setting parameters in lsf.conf (e.g. LSF_LOG_MASK=LOG_INFO, LSB_DEBUG_SBD=LC_COMM).
We used these traces a lot to try and understand why bsub tended to hang (which slowly became one of our main concerns). The command would typically hang on the "waiting for reply timeout=0 ms" message, which suggests bsub was waiting for the master:
Dec 4 17:13:23 2012 22746 8 7.06 call_server: serv_connect() get server sock <1>
Dec 4 17:13:23 2012 22746 8 7.06 chanRpc_(): Entering ... chfd=1
Dec 4 17:13:23 2012 22746 7 7.06 chanRpc_(): sending 6248 bytes
Dec 4 17:13:23 2012 22746 7 7.06 chanRpc_(): waiting for reply timeout=0 ms
<hang>
Dec 4 17:13:24 2012 22746 9 7.06 chanRpc_(): reading reply header
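When a trace runs to thousands of lines, a hang like the one above is easier to spot mechanically. Here is a minimal sketch, assuming the log format shown above (HH:MM:SS in the third field) and a trace that doesn't cross midnight. The function name find_gaps and the one-second default threshold are my own choices, not something LSF provides.

```shell
# find_gaps [threshold_seconds] < logfile
# Prints every log line written at least threshold_seconds after the previous one.
find_gaps() {
    awk -v threshold="${1:-1}" '
        {
            # Field 3 is HH:MM:SS; skip lines that do not carry a timestamp.
            if (split($3, t, ":") != 3) next
            # Convert to seconds since midnight.
            now = t[1] * 3600 + t[2] * 60 + t[3]
            if (seen && now - prev >= threshold)
                printf "%ds gap before: %s\n", now - prev, $0
            prev = now
            seen = 1
        }
    '
}
```

You would feed it whichever trace file the command wrote, e.g. find_gaps 1 < /tmp/bsub.log if that is where your logs end up. Since the timestamps only have one-second resolution, shorter pauses go unnoticed.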
2 Master Debugging
Master debugging is a heavy operation, so you shouldn't let it run indefinitely. We used it a lot to trace what exactly the master was up to. You need to use the mbddebug command:
badmin mbddebug -l 3 -c "LC_EXEC LC_TRACE LC_COMM" -f /tmp/batchdebug
Debugging is turned off with:
badmin mbddebug -o
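Because it's easy to forget the second command, one option is to time-box the whole thing. The following is only a sketch: timed_mbddebug is a hypothetical helper name, the 60-second default is arbitrary, and the badmin options are simply the ones used above.

```shell
# timed_mbddebug [seconds]
# Enables mbatchd debugging with the options shown above, waits, then disables it.
# The body runs in a subshell so the traps do not leak into the calling shell.
timed_mbddebug() (
    duration="${1:-60}"
    # Turn debugging off again however the subshell exits.
    trap 'badmin mbddebug -o' EXIT
    # Make Ctrl-C or a kill go through the EXIT trap as well.
    trap 'exit 1' INT TERM
    badmin mbddebug -l 3 -c "LC_EXEC LC_TRACE LC_COMM" -f /tmp/batchdebug || exit 1
    sleep "$duration"
)
```

Calling timed_mbddebug 120 would leave debugging on for two minutes. If enabling it fails, the trap still runs badmin mbddebug -o, which should be harmless.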
3 Master Timing
Timing mbatchd involves the mbddebug (which makes it expensive too) and mbdtime commands:
badmin mbddebug -l 3 -c LC_COMM -f /tmp/batchdebug
badmin mbdtime -l 5 -f /tmp/batchtiming
Debugging and timing are turned off with:
badmin mbddebug -o
badmin mbdtime -o
4 LIM Debugging
The limdebug command is used in much the same way as the ones discussed previously; as before, -o turns debugging off:
lsadmin limdebug -c "LC_COMM LC_TRACE" -l 1 -f /tmp/limdebug
lsadmin limdebug -o
5 Tracing Processes
Running strace on mbatchd is something we commonly did too, and no one complained about it being heavy. The ltrace command would bring the master to its knees, however. One interesting thing the strace output shows is that mbatchd very often stats /etc/localtime, after which nothing seems to happen for a few seconds. Searching for "stat localtime" suggests that exporting TZ to something along the lines of ":/etc/localtime" may solve this, even though it might be useless altogether. One sample session we ran was:
strace -t -T -e open -o batchtrace -p 12131
We also ran strace on bsub.
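With -T, strace appends the time spent in each call in angle brackets at the end of the line, so slow calls can be filtered out of a long trace. A minimal sketch follows; slow_syscalls is a hypothetical helper name and the one-second default is arbitrary.

```shell
# slow_syscalls [min_seconds] < strace_output
# Prints strace lines whose trailing <duration> (added by -T) is at least min_seconds.
slow_syscalls() {
    awk -v min="${1:-1}" '
        match($0, /<[0-9]+\.[0-9]+>$/) {
            # Strip the angle brackets to get the duration in seconds.
            duration = substr($0, RSTART + 1, RLENGTH - 2)
            if (duration + 0 >= min + 0) print
        }
    '
}
```

For the session above this would be slow_syscalls 0.5 < batchtrace. Note that it only catches time spent inside a traced call; pauses between calls show up as jumps in the -t timestamps instead.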
6 A Sample Debugging Session
The first step is to turn off LSF and start tcpdump on the submission host:
badmin hshutdown
tcpdump -vvv -w /tmp/batchtcpdump "tcp&&(port 3881)"
You should then start debugging in the master:
badmin mbddebug -l 1 -c LC_TRACE -f /tmp/batchdebug
Back on the submission host, fire a few bsub commands:
% date; bsub date; date
Wed Nov 28 09:02:36 CET 2012
Job <335274245> is submitted to default queue <8nm>.
Wed Nov 28 09:02:36 CET 2012
% date; bsub date; date
Wed Nov 28 09:02:37 CET 2012
Job <335274267> is submitted to default queue <8nm>.
Wed Nov 28 09:02:37 CET 2012
% date; bsub date; date
Wed Nov 28 09:02:38 CET 2012
Job <335274274> is submitted to default queue <8nm>.
Wed Nov 28 09:02:38 CET 2012
Don't forget to turn off debugging at the end:
badmin mbddebug -o
You may now look at the traces and TCP dumps.
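The repeated "date; bsub date; date" lines can be wrapped in a small loop that prints the elapsed time directly, which makes it easier to match a slow submission against the traces. This is a sketch only: time_submissions is a hypothetical helper name, and it keeps the same one-second resolution as date.

```shell
# time_submissions [count]
# Submits `date` jobs one after another and reports how long each bsub call took.
time_submissions() {
    count="${1:-3}"
    i=1
    while [ "$i" -le "$count" ]; do
        start=$(date +%s)   # seconds since the epoch, before submitting
        bsub date
        end=$(date +%s)     # and after bsub has returned
        echo "submission $i took $((end - start))s"
        i=$((i + 1))
    done
}
```

Running time_submissions 10 submits ten jobs to the default queue; anything reported as taking more than a second or so is worth looking up in the master trace and the TCP dump.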
7 A Sample Timing Session
Turn on this expensive debugging:
badmin mbddebug -c "LC_TRACE LC_COMM LC_EXEC" -l 1 -f /tmp/batchdebug
badmin mbdtime -l 3 -f /tmp/batchtiming
Turn on debugging before submission:
export LSB_CMD_LOG_MASK=LOG_DEBUG
export LSB_DEBUG_CMD="LC_TRACE LC_COMM LC_EXEC"
export LSB_DEBUG_TIME=3
export LSB_CMD_LOGDIR=/tmp
- Run bsub until a submission hangs a bit.
- Check /tmp/batchdebug.mbatchd.log in the master and /tmp/bsub.log.
These are typically the kind of files to send when requesting support. Don't forget to turn off all the debugging.
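Turning everything off for this session means undoing both the master-side commands and the client-side variables. A minimal sketch, assuming the badmin commands and the unset are run from the same shell on the submission host (which may not match your setup); stop_lsf_debugging is a hypothetical helper name.

```shell
# stop_lsf_debugging
# Turns off master debugging and timing, then clears the client-side variables.
stop_lsf_debugging() {
    badmin mbddebug -o
    badmin mbdtime -o
    # These are the four variables exported before submission.
    unset LSB_CMD_LOG_MASK LSB_DEBUG_CMD LSB_DEBUG_TIME LSB_CMD_LOGDIR
}
```

The unset only affects the shell in which the function is called, so any other terminal where the variables were exported needs the same treatment.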