How using KDUMP can help you get to RCA faster - UPDATED!


As much as you want it never to happen, someday, somewhere (probably at the WORST time) your computer or server is going to crash. It could be small or it could be epic. No matter the time or the cause, it’ll be tragic: tears will be shed, teeth will be gnashed, wailing will be heard. How can you prepare yourself for that day, get back online as quickly as possible, and get to Root Cause so the error doesn’t occur again?

Of all the things Scouting taught me, the one that has stuck with me most is the motto: “Always be prepared.” With a few simple steps you can set yourself up to capture vital data about a problem, so the work you do today will pay huge dividends down the road when you can understand exactly what happened on the day of “the Great Server Crash of ’”.

Whenever there’s an issue, a few vital pieces of data can help determine precisely what happened. First is an sosreport, which gathers critical logs and settings so we can see a bit more of the situation. The next, and arguably more important, file is a core file captured by kdump. The core file captures what’s going on in memory at the time of the crash, which gives us the essential data to zero in on which process failed and what was happening.
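On the sosreport side, gathering that data is usually a single command. A quick sketch (package and command names have shifted a bit between RHEL releases):

    # Install the sos package if it isn't already present
    yum install sos
    # Generate the report; the tarball lands under /tmp by default
    sosreport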

First, let me share these helpful articles with you. In them, the process for setting KDUMP up is laid out (and a bare-bones outline follows the links below). It takes some time, so you might want to consider adding it to existing servers during your next change window. Make plans to change your standard builds to incorporate it out of the gate:

How can I collect system information to provide to Red Hat Support for analysis when the system hangs?

How do I troubleshoot kernel crashes with kexec/kdump on Red Hat Enterprise Linux?
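To give you a feel for what those articles walk through, here’s a minimal sketch of enabling kdump on a RHEL 5/6-style system. Treat it as an outline, not a recipe; the crashkernel syntax, config file locations, and service commands vary between releases:

    # Install the kdump tooling
    yum install kexec-tools
    # Reserve memory for the crash kernel at boot by appending to the kernel
    # line in /boot/grub/grub.conf (example value; see the sizing notes below):
    #   crashkernel=128M
    # Tell kdump where to write the vmcore, in /etc/kdump.conf:
    #   path /var/crash
    # Enable the service, then reboot so the memory reservation takes effect
    chkconfig kdump on
    reboot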

Once you've got that vmcore, you'll want to get it to Red Hat for analysis. The following document explains how to provide vmcores to Red Hat Support, which makes the analysis and response faster:

How can I provide large files (vmcore, rhev logcollector, large sosreports, heap dumps, big log files, etc) to Red Hat Support?
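When I've done this, the flow for a big file looked roughly like the sketch below, but confirm the current hostname, login, and file-naming convention in the article above before you upload (the case number prefix here is just an illustration):

    # Compress the vmcore first -- it shrinks dramatically
    tar czf 01234567-vmcore.tar.gz vmcore
    # Upload it per the linked procedure
    sftp dropbox.redhat.com
    sftp> cd /incoming
    sftp> put 01234567-vmcore.tar.gz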

KDUMP is great, but sometimes you need a bit more. Here are two more articles on how you can make small changes and capture even more data from scripts (for application debugging, say), and how to use kdump over a bonded network interface:

How do I run commands using kdump, before vmcore is saved?

How to setup Kdump with bonding and VLAN?
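For the script case, the hooks live in /etc/kdump.conf. A minimal sketch (the script paths here are just illustrations; check the kdump.conf man page for the exit-status semantics on your release):

    # /etc/kdump.conf
    # Run a custom script before the vmcore is saved
    kdump_pre /var/crash/scripts/pre-dump.sh
    # Run another one after the dump completes
    kdump_post /var/crash/scripts/post-dump.sh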

And probably our most-requested topic when it comes to KDUMP is sizing the crash kernel. Sometimes 128 MB of RAM isn’t enough, and you have to make adjustments for systems with larger memory footprints or with third-party products and drivers:

kdump hangs with "Out of memory" error during vmcore collection
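Raising the reservation is a boot-parameter change plus a reboot. A sketch for a grub-based RHEL 5/6 system (256M is just an example value; size it to your hardware, and note that newer releases also understand crashkernel=auto):

    # In /boot/grub/grub.conf, bump the value on the kernel line, e.g.:
    #   kernel /vmlinuz-2.6.32-... ro root=... crashkernel=256M
    # Then reboot so the larger reservation takes effect
    reboot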

And to circle back to the useful axiom of always being prepared: once you set this up, TEST IT. Make sure that your settings are correct, that there’s enough storage, and that the firewall rules are open. Integrate kdump into your standard disaster recovery processes to ensure it’s there and working when you need it.
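A common way to test it (on a box where you can tolerate the crash, obviously) is to trigger a panic by hand with SysRq and confirm a fresh vmcore lands where you expect:

    # Enable the magic SysRq key for this boot
    echo 1 > /proc/sys/kernel/sysrq
    # Force an immediate kernel panic -- the machine WILL crash and reboot
    echo c > /proc/sysrq-trigger
    # After the reboot, check for the new vmcore (default path; yours may differ)
    ls -l /var/crash/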

For those of you concerned about PII or other sensitive information that could be stored within those cores, you'll want to review the output with your local security/governance team. Many companies that deal in sensitive data forbid enabling core dump collection. It goes hand-in-hand with testing: as you introduce any big change like this to your environment, check with your stakeholders (like your storage or network groups and, of course, internal security). Better to overcommunicate and have someone annoyed over deleting one extra email than to be dragged before an audit committee!

One option for folks who might not be able to share data like that outside of their networks is setting up an internal CAS (Core Analysis System) server. You can get some more details here:

https://fedorahosted.org/cas/

https://admin.fedoraproject.org/pkgdb/acls/name/cas

Once this is set up, all you'd need to send over to us would be the backtrace information (saving you a huge upload to the portal).
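If you just need the backtrace, the crash utility will pull it straight out of the vmcore. A rough sketch (the vmlinux must come from the debuginfo package matching the kernel that crashed; the paths here are illustrative):

    # Open the core against the matching debug kernel image
    crash /usr/lib/debug/lib/modules/<crashed-kernel>/vmlinux /var/crash/<timestamp>/vmcore
    # At the crash> prompt, capture the backtrace and basic system info
    crash> bt
    crash> sys
    crash> log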

If you have to install CAS (version 0.16) on a RHEL 5 server, you'll have to upgrade Python to 2.6 first, which complicates things a little because RHEL 5 ships with Python 2.4. In this scenario, I compiled and installed Python 2.6 from source (which puts RHEL 5 in an unsupported state).
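For the record, my build looked roughly like this; installing under its own prefix keeps it clear of the system's Python 2.4 (the prefix and the exact 2.6.x release are my choices, not requirements):

    # Build prerequisites, then compile Python 2.6 into its own prefix
    yum install gcc make zlib-devel
    tar xzf Python-2.6.9.tgz && cd Python-2.6.9
    ./configure --prefix=/opt/python2.6
    make && make install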

Feel free to share any tips you might have about how to be prepared for problems and things you do to ensure you’re protected.
