Kdumps to a Remote Server
Sending crash dumps to a remote server instead of keeping them locally seems like a good idea when panicking nodes can't even write to their filesystems.
We have had cases where nodes crashed so badly that they never got the chance to write the dump to a local filesystem. Sending it over the network, however, still seemed to work.
1 Recipe
- Check /proc/cmdline for something that looks like crashkernel=128M@16M. It might already be the case for your nodes in the computer centre.
- Make sure /etc/kdump.conf contains these lines:
    net root@yourkdumpserver.example.com
    core_collector makedumpfile -c -d 31
  The kdump.conf you'll find in existing nodes may contain a few lines you'll want to comment out because they won't be removed by the crashdump NCM component. The ext3 /dev/md6 one would cause the dump to be written to the local filesystem and the path /crash/ line would cause the dump to be written to a file where there might not be enough room in the remote crash dump server:
    wassh -t 120 -l root -h list,of,crashing,host,names 'sed -i-20130806 -e "s/^ext3 \/dev\/md/#&/" -e "s/^path \/crash/#&/" /etc/kdump.conf'
- Make sure /etc/sysconfig/kdump contains this line:
    MKDUMPRD_ARGS="--builtin=libafs"
  Although, in practice, we don't care, particularly. But if service kdump restart doesn't fly, that might be why.
- Run service kdump restart. If this fails, it's probably because you need to run service kdump propagate first. We need a public/private key pair for this and the SSH config file referring to them to be able to SSH to the crash dump server. Another problem is that to be able to do so massively with wassh, one needs to add the crash dump server's fingerprint to ~/.ssh/known_hosts because otherwise service kdump propagate will interactively cause SSH to ask for confirmation every time it's not known.
- Test it all out by crashing a sample node:
echo c > /proc/sysrq-trigger
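Before crashing a node on purpose, it may be worth confirming that the configuration steps of the recipe actually took effect. The following is only a minimal sketch: the function names has_crashkernel and kdump_conf_is_remote are made up for illustration, and the rule that no active ext3 or path line may remain simply mirrors the advice above rather than any requirement of kdump itself.

```shell
#!/bin/sh
# Sketch of pre-flight checks before triggering a test crash.
# Function names and the overridable file arguments are hypothetical.

# Succeeds if the kernel command line reserves memory for a crash kernel.
# $1: file holding the command line (defaults to /proc/cmdline)
has_crashkernel() {
    grep -Eq '(^|[[:space:]])crashkernel=' "${1:-/proc/cmdline}"
}

# Succeeds if kdump.conf has a net target and a core_collector line,
# and no active (uncommented) ext3 or path line is left.
# $1: configuration file (defaults to /etc/kdump.conf)
kdump_conf_is_remote() {
    conf="${1:-/etc/kdump.conf}"
    grep -Eq '^net[[:space:]]' "$conf" || return 1
    grep -Eq '^core_collector[[:space:]]' "$conf" || return 1
    if grep -Eq '^(ext3|path)[[:space:]]' "$conf"; then
        return 1
    fi
    return 0
}
```

On a node, something like has_crashkernel && kdump_conf_is_remote && echo "looks ready" would then report whether both conditions hold.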
Keep in mind that kdump works by starting a new kernel when the main one crashes. This new kernel has only the simple job of dumping the memory, and it reboots the machine once done.
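Because that second kernel has to be loaded ahead of time, one way to see whether kdump is really armed is to ask the running kernel. The sketch below assumes the kernel exposes /sys/kernel/kexec_crash_loaded, which older kernels may not; the function name crash_kernel_loaded is hypothetical.

```shell
#!/bin/sh
# Sketch: report whether a capture (crash) kernel is currently loaded.
# Assumes /sys/kernel/kexec_crash_loaded exists; if the file is missing
# or unreadable, the function simply fails.
# $1: status file (defaults to /sys/kernel/kexec_crash_loaded)
crash_kernel_loaded() {
    state_file="${1:-/sys/kernel/kexec_crash_loaded}"
    [ -r "$state_file" ] && [ "$(cat "$state_file")" = "1" ]
}
```

If crash_kernel_loaded fails after service kdump restart, triggering a crash would most likely just hang or reboot the node without producing a dump.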
2 Just Rebooting
Sometimes, you're not interested in scrutinising dumps, but only in ensuring as much uptime as possible. You might therefore want to just reboot. Just add a reboot delay in seconds to /etc/sysctl.conf like so:
kernel.panic = 20
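When this has to be rolled out to many nodes, it helps if the change can be applied repeatedly without piling up duplicate entries. The following is a minimal sketch under that assumption; set_panic_delay is a hypothetical helper, not something shipped with the system.

```shell
#!/bin/sh
# Sketch: set kernel.panic in a sysctl configuration file, replacing an
# existing entry or appending a new one. The helper name is hypothetical.
# $1: delay in seconds
# $2: configuration file (defaults to /etc/sysctl.conf)
set_panic_delay() {
    delay="$1"
    conf="${2:-/etc/sysctl.conf}"
    if grep -q '^[[:space:]]*kernel\.panic[[:space:]]*=' "$conf"; then
        # Rewrite the existing line; kernel.panic_on_oops and similar
        # keys do not match because "=" must follow "kernel.panic".
        panic_tmp=$(mktemp) || return 1
        sed "s/^[[:space:]]*kernel\.panic[[:space:]]*=.*/kernel.panic = $delay/" "$conf" > "$panic_tmp" &&
            cat "$panic_tmp" > "$conf"
        status=$?
        rm -f "$panic_tmp"
        return "$status"
    else
        printf 'kernel.panic = %s\n' "$delay" >> "$conf"
    fi
}
```

Running set_panic_delay 20 followed by sysctl -p should make the setting take effect without waiting for the next reboot.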