Kdumps to a Remote Server
Sending crash dumps to a remote server instead of keeping them locally seems like a good idea when panicking nodes can't even write to their filesystems.
We have had cases where nodes crashed so badly that they never got a chance to write the dump to a local filesystem. Sending it over the network, however, still seemed to work.
1 Recipe
- Check /proc/cmdline for something that looks like crashkernel=128M@16M. It might already be the case for your nodes in the computer centre.
- Make sure /etc/kdump.conf contains these lines:
net root@yourkdumpserver.example.com
core_collector makedumpfile -c -d 31
The kdump.conf you'll find in existing nodes may contain a few lines you'll want to comment out because they won't be removed by the crashdump NCM component. The ext3 /dev/md6 one would cause the dump to be written to the local filesystem and the path /crash/ line would cause the dump to be written to a file where there might not be enough room in the remote crash dump server:
wassh -t 120 -l root -h list,of,crashing,host,names 'sed -i-20130806 -e "s/^ext3 \/dev\/md/#&/" -e "s/^path \/crash/#&/" /etc/kdump.conf'
- Make sure /etc/sysconfig/kdump contains this line:
MKDUMPRD_ARGS="--builtin=libafs"
Although, in practice, we don't care, particularly. But if service kdump restart doesn't fly, that might be why.
- Run service kdump restart. If this fails, it's probably because you need to run service kdump propagate first. We need a public/private key pair for this and the SSH config file referring to them to be able to SSH to the crash dump server.
Another problem is that to be able to do so massively with wassh, one needs to add the crash dump server's fingerprint to ~/.ssh/known_hosts because otherwise service kdump propagate will interactively cause SSH to ask for confirmation every time it's not known.
- Test it all out by crashing a sample node:
echo c > /proc/sysrq-trigger
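Before crashing a node on purpose, it can help to confirm that the earlier steps of the recipe actually took effect. The following is only a minimal sketch: check_kdump_prereqs is a hypothetical helper, not something shipped with kdump, and it checks nothing beyond what the recipe above mentions (a crashkernel= boot argument, active net and core_collector lines, and no active ext3 or path lines).

```shell
# Hypothetical helper (not part of kdump): sanity-check a node's kdump
# prerequisites. The paths default to the real files but can be
# overridden, for example to try the function against copies.
check_kdump_prereqs() {
    cmdline=${1:-/proc/cmdline}
    conf=${2:-/etc/kdump.conf}
    rc=0

    # The running kernel must have been booted with a crashkernel= argument.
    grep -q 'crashkernel=' "$cmdline" \
        || { echo "no crashkernel= in $cmdline" >&2; rc=1; }

    # Both remote-dump lines must be present and not commented out.
    grep -q '^net ' "$conf" \
        || { echo "no active 'net' line in $conf" >&2; rc=1; }
    grep -q '^core_collector ' "$conf" \
        || { echo "no active 'core_collector' line in $conf" >&2; rc=1; }

    # Local-target lines should have been commented out by the sed step.
    if grep -Eq '^(ext3|path) ' "$conf"; then
        echo "active ext3/path line still present in $conf" >&2
        rc=1
    fi

    return $rc
}
```

Called without arguments on a node, it reads the real files and returns non-zero if anything looks off, so it could presumably be wrapped in the same wassh invocation used for the sed command.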
Keep in mind that kdump works by starting a new kernel when the main one crashes. This new kernel has the single job of dumping the memory, and it reboots the machine once done.
2 Just Rebooting
Sometimes you're not interested in scrutinising dumps, only in ensuring as much uptime as possible. In that case you might want the node to simply reboot after a panic. Add a reboot delay, in seconds, to /etc/sysctl.conf like so:
kernel.panic = 20
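When rolling this out to many nodes, the line should probably be added idempotently rather than blindly appended. Here is a minimal sketch of one way to do that; set_panic_delay is a hypothetical helper, and the optional file argument exists only so the function can be tried on a copy first.

```shell
# Hypothetical helper: set the kernel.panic reboot delay in a sysctl file.
# Usage: set_panic_delay SECONDS [FILE]   (FILE defaults to /etc/sysctl.conf)
set_panic_delay() {
    delay=$1
    file=${2:-/etc/sysctl.conf}

    if grep -Eq '^[[:space:]]*kernel\.panic[[:space:]]*=' "$file" 2>/dev/null; then
        # A setting already exists: rewrite it in place, keeping a .bak copy.
        sed -i.bak -E \
            "s/^[[:space:]]*kernel\.panic[[:space:]]*=.*/kernel.panic = $delay/" \
            "$file"
    else
        # No setting yet: append one.
        echo "kernel.panic = $delay" >> "$file"
    fi
}
```

Editing the file only affects subsequent boots. Running sysctl -p afterwards reloads /etc/sysctl.conf, and sysctl -w kernel.panic=20 sets the value on the running kernel directly.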