Jérôme Belleman

IBM Platform LSF Debugging Techniques

4 Dec 2012

IBM Platform LSF is a very sophisticated batch scheduler. Here are a few ideas I collected from intensive debugging sessions with our setup at CERN.

Platform LSF, like all sophisticated systems, doesn't always make it obvious what it's up to. The debugging techniques described below can help you understand its workings.

1 Debug Traces

You can get a precise idea of everything a given LSF command is doing by setting a few environment variables:

export LSF_LOG_MASK=LOG_DEBUG3
export LSB_DEBUG_CMD="LC_EXEC LC_TRACE LC_COMM"

There's even an LC_HANG log class you may want to add to the other three in LSB_DEBUG_CMD. These environment variables can be used both on the master and on the servers. On the master, debugging of the daemons themselves is enabled by setting parameters in lsf.conf (e.g. LSF_LOG_MASK=LOG_INFO, LSB_DEBUG_SBD=LC_COMM).
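To avoid leaving these variables exported in your login shell, one option is to set them for a single command only. The sketch below assumes a POSIX shell; lsf_debug_run is a hypothetical helper name, not an LSF command, and the variable values are simply the ones shown above.

```shell
# lsf_debug_run COMMAND [ARGS...]
# Run one command with the LSF debug variables set, without exporting
# them in the calling shell. (Hypothetical helper, not part of LSF.)
lsf_debug_run() {
    # env sets the variables for this one command only.
    env LSF_LOG_MASK=LOG_DEBUG3 \
        LSB_DEBUG_CMD="LC_EXEC LC_TRACE LC_COMM" \
        "$@"
}
```

You would then call it as lsf_debug_run bsub date. Since env can only start executables, it won't work on shell functions or aliases.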

We used these traces a lot to try and understand why bsub tended to hang (which slowly became one of our main concerns). The command would typically hang on the "waiting for reply timeout=0 ms" message, which suggests bsub was waiting for the master.

Dec  4 17:13:23 2012 22746 8 7.06 call_server: serv_connect() get server sock <1>
Dec  4 17:13:23 2012 22746 8 7.06 chanRpc_(): Entering ... chfd=1
Dec  4 17:13:23 2012 22746 7 7.06 chanRpc_(): sending 6248 bytes
Dec  4 17:13:23 2012 22746 7 7.06 chanRpc_(): waiting for reply timeout=0 ms
<hang>
Dec  4 17:13:24 2012 22746 9 7.06 chanRpc_(): reading reply header
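When a trace gets long, spotting the pause by eye becomes tedious. Here is a minimal sketch that reports every line preceded by a gap of at least a given number of seconds. It assumes the timestamp layout shown above (time of day in the third field) and it ignores midnight rollover; find_gaps is a hypothetical helper name.

```shell
# find_gaps THRESHOLD_SECONDS FILE
# Print each trace line that follows a pause of at least THRESHOLD seconds.
find_gaps() {
    awk -v threshold="$1" '
        # Only consider lines whose third field looks like HH:MM:SS.
        $3 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ {
            split($3, t, ":")
            now = t[1] * 3600 + t[2] * 60 + t[3]
            # Compare with the previous timestamped line, if any.
            if (seen && now - prev >= threshold)
                printf "%d s gap before: %s\n", now - prev, $0
            prev = now
            seen = 1
        }
    ' "$2"
}
```

The trace only has a resolution of one second, so shorter hangs will go unnoticed with this approach.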

2 Master Debugging

To trace what exactly the master is up to, you need to use the mbddebug command. We used it a lot, but it's a heavy operation, so you shouldn't let it run indefinitely:

badmin mbddebug -l 3 -c "LC_EXEC LC_TRACE LC_COMM" -f /tmp/batchdebug

Debugging is turned off with:

badmin mbddebug -o
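Because it's easy to forget the second command, you may want to bound the debugging window. The following is only a sketch: timed_mbddebug is a hypothetical helper, the default duration of 60 seconds is arbitrary, and the badmin options are the ones used above.

```shell
# timed_mbddebug [SECONDS]
# Turn master debugging on, wait, and always turn it off again.
# The function body is a subshell so the traps stay local to it.
timed_mbddebug() (
    duration="${1:-60}"
    # Switch debugging off whenever the subshell exits ...
    trap 'badmin mbddebug -o' EXIT
    # ... including when it is interrupted or terminated.
    trap 'exit 1' INT TERM
    badmin mbddebug -l 3 -c "LC_EXEC LC_TRACE LC_COMM" -f /tmp/batchdebug
    sleep "$duration"
)
```

Running timed_mbddebug 30 would leave debugging on for about 30 seconds, and pressing Ctrl-C in the meantime should still turn it off.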

3 Master Timing

Timing mbatchd involves both the mbddebug command (which makes it expensive too) and the mbdtime command:

badmin mbddebug -l 3 -c LC_COMM -f /tmp/batchdebug
badmin mbdtime -l 5 -f /tmp/batchtiming

Debugging is turned off with:

badmin mbddebug -o
badmin mbdtime -o

4 LIM Debugging

The limdebug command is used in a similar way to the ones discussed previously:

lsadmin limdebug -c "LC_COMM LC_TRACE" -l 1 -f /tmp/limdebug
lsadmin limdebug -o

5 Tracing Processes

Running strace on mbatchd is something we commonly did too, and nobody ever complained about it being heavy. The ltrace command would bring the master to its knees, however. One interesting thing the straces show is that mbatchd very often stats /etc/localtime, after which nothing seems to happen for a few seconds. Searching the web for stat localtime suggests that exporting TZ to something along the lines of ":/etc/localtime" may avoid this, although it may well turn out to make no difference. One sample session we ran was:

strace -t -T -e open -o batchtrace -p 12131

We also straced bsub.
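Since the -T option makes strace append the time spent in each system call as <seconds> at the end of the line, the output file can be filtered for slow calls. This is a rough sketch, assuming the trace was written with -T as above; slow_syscalls is a hypothetical helper name.

```shell
# slow_syscalls MIN_SECONDS FILE
# Print the strace lines whose system call took at least MIN_SECONDS.
slow_syscalls() {
    awk -v min="$1" '
        # strace -T ends each completed call with its duration: <0.000123>
        $NF ~ /^<[0-9.]+>$/ {
            # Strip the angle brackets and force a numeric value.
            secs = substr($NF, 2, length($NF) - 2) + 0
            if (secs >= min)
                print
        }
    ' "$2"
}
```

For instance, slow_syscalls 0.5 batchtrace would list the calls that took half a second or more. Note that pauses between two calls won't show up this way; for those, compare the -t timestamps of consecutive lines.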

6 A Sample Debugging Session

The first step is to shut down LSF and start tcpdump on the submission host:

badmin hshutdown
tcpdump -vvv -w /tmp/batchtcpdump "tcp&&(port 3881)"

You should then start debugging on the master:

badmin mbddebug -l 1 -c LC_TRACE -f /tmp/batchdebug

Back on the submission host, fire a couple of bsubs:

% date; bsub date; date
Wed Nov 28 09:02:36 CET 2012
Job <335274245> is submitted to default queue <8nm>.
Wed Nov 28 09:02:36 CET 2012
% date; bsub date; date
Wed Nov 28 09:02:37 CET 2012
Job <335274267> is submitted to default queue <8nm>.
Wed Nov 28 09:02:37 CET 2012
% date; bsub date; date
Wed Nov 28 09:02:38 CET 2012
Job <335274274> is submitted to default queue <8nm>.
Wed Nov 28 09:02:38 CET 2012

Don't forget to turn off debugging at the end:

badmin mbddebug -o

You may now look at the traces and TCP dumps.
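To read the capture back, tcpdump can print the time elapsed between packets, which makes pauses in the conversation with the master easier to spot. A small sketch follows; read_lsf_dump is a hypothetical helper, and the defaults are the file and port used in the capture command above.

```shell
# read_lsf_dump [FILE] [PORT]
# Replay a capture file, showing the delay between consecutive packets.
read_lsf_dump() {
    dump="${1:-/tmp/batchtcpdump}"
    port="${2:-3881}"
    # -r reads a capture file, -nn skips host and port name lookups,
    # -ttt prints the time elapsed since the previous packet.
    tcpdump -nn -ttt -r "$dump" "tcp and port $port"
}
```

Calling read_lsf_dump without arguments reads /tmp/batchtcpdump and keeps only TCP traffic on port 3881.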

7 A Sample Timing Session

  1. Turn on debugging and timing on the master (remember that this is expensive):

    badmin mbddebug -c "LC_TRACE LC_COMM LC_EXEC" -l 1 -f /tmp/batchdebug
    badmin mbdtime -l 3 -f /tmp/batchtiming

  2. Turn on debugging on the submission host before submitting:

    export LSB_CMD_LOG_MASK=LOG_DEBUG
    export LSB_DEBUG_CMD="LC_TRACE LC_COMM LC_EXEC"
    export LSB_DEBUG_TIME=3
    export LSB_CMD_LOGDIR=/tmp

  3. Run bsub until a submission hangs a bit.
  4. Check /tmp/batchdebug.mbatchd.log on the master and /tmp/bsub.log on the submission host. These are typically the kind of files to send when requesting support.
  5. Don't forget to turn off all the debugging, on the master as well as in your shell.
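Steps 2 and 3 can be scripted so that you don't have to watch for the hang yourself. The following is a minimal sketch under a few assumptions: submit_until_slow is a hypothetical helper, the thresholds are arbitrary, timing is only accurate to the second, and each try submits a real (if trivial) job.

```shell
# submit_until_slow [THRESHOLD_SECONDS] [MAX_TRIES]
# Submit trivial jobs with command debugging on until one submission
# takes at least THRESHOLD seconds, or MAX_TRIES is reached.
submit_until_slow() {
    threshold="${1:-5}"
    max_tries="${2:-20}"
    # Same settings as in step 2; they stay exported afterwards.
    export LSB_CMD_LOG_MASK=LOG_DEBUG
    export LSB_DEBUG_CMD="LC_TRACE LC_COMM LC_EXEC"
    export LSB_DEBUG_TIME=3
    export LSB_CMD_LOGDIR=/tmp
    i=1
    while [ "$i" -le "$max_tries" ]; do
        start=$(date +%s)
        bsub date
        elapsed=$(( $(date +%s) - start ))
        if [ "$elapsed" -ge "$threshold" ]; then
            echo "submission $i took $elapsed s"
            return 0
        fi
        i=$((i + 1))
    done
    echo "no submission took $threshold s or more in $max_tries tries"
    return 1
}
```

Once it reports a slow submission, the log files from step 4 should cover the hang. The function leaves the debugging variables exported, so unset them when you're done.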

8 Reference