Jérôme Belleman

Kdumps to a Remote Server

13 Oct 2014

Sending crash dumps to a remote server instead of keeping them locally seems like a good idea when panicking nodes can't even write to their filesystems.

We once had nodes crash so badly that they never got a chance to write the dump to a local filesystem. Sending it over the network, however, still seemed to work.

1 Recipe

  1. Check that /proc/cmdline contains a crashkernel parameter such as crashkernel=128M@16M, which reserves memory for the crash kernel. It may already be set on your nodes in the computer centre.
  2. Make sure /etc/kdump.conf contains these lines:

    net root@yourkdumpserver.example.com
    core_collector makedumpfile -c -d 31

    The kdump.conf on existing nodes may contain a few lines that you'll want to comment out, because the crashdump NCM component won't remove them. The ext3 /dev/md6 line would cause the dump to be written to the local filesystem, and the path /crash/ line would cause it to be written to a location on the remote crash dump server where there might not be enough room. The following command comments both out on a list of hosts, keeping a dated backup of the original file:

    wassh -t 120 -l root -h list,of,crashing,host,names 'sed -i-20130806 -e "s/^ext3 \/dev\/md/#&/" -e "s/^path \/crash/#&/" /etc/kdump.conf'
  3. Make sure /etc/sysconfig/kdump contains this line:

    MKDUMPRD_ARGS="--builtin=libafs"
    In practice, we don't particularly care about this one. But if service kdump restart fails, this line being absent might be why.
  4. Run service kdump restart. If this fails, it's probably because you need to run service kdump propagate first. A public/private key pair, and an SSH config file referring to it, are needed for the node to be able to SSH to the crash dump server.

    Another problem is that, to do this on many nodes at once with wassh, you first need to add the crash dump server's fingerprint to ~/.ssh/known_hosts. Otherwise, service kdump propagate will cause SSH to ask interactively for confirmation whenever the server isn't known yet.
  5. Test it all out by crashing a sample node:

    echo c > /proc/sysrq-trigger
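
For step 4, the known_hosts entry can probably be pre-seeded without any prompt. The following is only a sketch: add_known_host is a hypothetical helper name, not something kdump provides, and it assumes that ssh-keyscan and ssh-keygen from OpenSSH are installed on the node. It also accepts whatever key the server presents, so it's worth comparing the fingerprint by other means at least once.

```shell
#!/bin/sh
# Hypothetical helper, not part of kdump: record a host's public SSH keys in a
# known_hosts file so that later SSH connections don't prompt for confirmation.
add_known_host() {
    host=$1
    known_hosts=${2:-$HOME/.ssh/known_hosts}

    # ssh-keygen -F prints the matching entries, if any. Empty output means
    # the host isn't recorded yet (or the file doesn't exist).
    if [ -n "$(ssh-keygen -F "$host" -f "$known_hosts" 2>/dev/null)" ]; then
        echo "$host is already in $known_hosts"
        return 0
    fi

    # ssh-keyscan fetches the host keys over the network without logging in.
    ssh-keyscan "$host" >> "$known_hosts" 2>/dev/null
}

# Example, using the placeholder server name from the recipe:
# add_known_host yourkdumpserver.example.com
```

Running it a second time for the same host should leave the file unchanged, which makes it reasonably safe to push to many nodes with wassh.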

One thing worth keeping in mind is that kdump works by starting a new kernel when the main one crashes. This new kernel only has the simple job of dumping the memory, and it reboots the machine once done.
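
Before crashing a node on purpose, it may be worth checking that the pieces from the recipe are in place, including whether that second kernel is actually loaded. Here is a minimal sketch, under a few assumptions: kdump_preflight is a made-up function name, the checks are simple pattern matches rather than a real parser of kdump.conf, and /sys/kernel/kexec_crash_loaded is assumed to be available on your kernel (it reads 1 when a crash kernel is loaded).

```shell
#!/bin/sh
# Sketch of a pre-flight check. The three paths are parameters so the function
# can be tried on copies of the files; the defaults are the real locations.
kdump_preflight() {
    cmdline=${1:-/proc/cmdline}
    conf=${2:-/etc/kdump.conf}
    loaded=${3:-/sys/kernel/kexec_crash_loaded}
    status=0

    # Step 1: memory must be reserved for the crash kernel.
    if ! grep -q 'crashkernel=' "$cmdline"; then
        echo "no crashkernel= parameter in $cmdline"
        status=1
    fi

    # Step 2: an uncommented net target and core_collector line.
    if ! grep -q '^net ' "$conf"; then
        echo "no net line in $conf"
        status=1
    fi
    if ! grep -q '^core_collector ' "$conf"; then
        echo "no core_collector line in $conf"
        status=1
    fi

    # Step 2, continued: the local ext3 and path lines should be commented out.
    if grep -Eq '^(ext3|path) ' "$conf"; then
        echo "uncommented ext3 or path line in $conf"
        status=1
    fi

    # The crash kernel is loaded when this file contains 1.
    if [ "$(cat "$loaded" 2>/dev/null)" != 1 ]; then
        echo "crash kernel not loaded according to $loaded"
        status=1
    fi

    return $status
}

# Example: kdump_preflight && echo "looks ready for a test crash"
```

It prints one line per problem found and returns non-zero if there is any, so it can be chained in front of the sysrq trigger.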

2 Just Rebooting

Sometimes, you're not interested in scrutinising dumps, only in ensuring as much uptime as possible. You might therefore want to just reboot. To do so, add a reboot delay in seconds to /etc/sysctl.conf like so:

kernel.panic = 20
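
If this has to be rolled out to many nodes, the line can be set in a repeatable way. The following is a sketch only: set_panic_delay is a hypothetical helper name, and it assumes the setting lives on a single line of the form kernel.panic = N.

```shell
#!/bin/sh
# Hypothetical helper: set kernel.panic in a sysctl configuration file,
# replacing an existing setting or appending one if there is none.
set_panic_delay() {
    delay=$1
    conf=${2:-/etc/sysctl.conf}

    if grep -q '^kernel\.panic[[:space:]]*=' "$conf" 2>/dev/null; then
        # Rewrite the existing line through a temporary file, then copy the
        # result back so the original file keeps its ownership and mode.
        tmp=$(mktemp)
        sed "s/^kernel\.panic[[:space:]]*=.*/kernel.panic = $delay/" "$conf" > "$tmp" &&
            cat "$tmp" > "$conf"
        rm -f "$tmp"
    else
        echo "kernel.panic = $delay" >> "$conf"
    fi
}

# Example: set_panic_delay 20
```

Running sysctl -p as root afterwards should load the file without waiting for a reboot, and sysctl -n kernel.panic shows the value currently in effect.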

3 References