How using KDUMP can help you get to RCA faster

As much as you want it never to happen, someday, somewhere (probably at the WORST time), your computer or server is going to crash. It could be small or it could be epic. No matter the time or the cause it’ll be tragic: tears will be shed, teeth will be gnashed, wailing will be heard. How can you prepare yourself for that day, get back online as quickly as possible, and get to root cause so the error doesn’t occur again?

 

Of all the things Scouting has taught me, the one that has stuck with me most is its motto: “Be prepared.” With a few simple steps you can set yourself up to capture vital data about a problem, and the work you do today will pay huge dividends down the road when you can understand exactly what happened on the day of “the Great Server Crash of ’<insert date>”.

 

Whenever there’s an issue, a few vital pieces of data can help determine precisely what happened. First off is running an sosreport, which gathers critical logs and settings so we can see a bit more into the situation. The next, and arguably more important, file is a core file captured by kdump. The core file grabs what was going on in memory at the time of the crash, which gives us the essential data to zero in on which process failed and what was happening.
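Generating an sosreport is a one-liner. As a minimal sketch (the package and command names here match RHEL 6-era systems; see the first article below for the details on your release):

    # Install the sos package if it isn't already present
    yum install sos

    # Gather logs and configuration into a compressed tarball;
    # the path to the archive is printed when it finishes
    sosreport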

 

First let me share these helpful articles with you:

How can I collect system information to provide to Red Hat Support for analysis when the system hangs? 

https://access.redhat.com/knowledge/solutions/23069

 

How do I troubleshoot kernel crashes with kexec/kdump on Red Hat Enterprise Linux?

https://access.redhat.com/knowledge/solutions/6038

 

In them, the process for setting kdump up is laid out. It takes some time, so you might want to consider adding it to existing servers during your next change window, and make plans to change your standard builds to incorporate it out of the gate.
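As a minimal sketch of the setup on a RHEL 6-style system (the 128M reservation is just an illustrative starting point, and the exact files and commands vary by release, so follow the article above for your version):

    # Install the kdump tooling
    yum install kexec-tools

    # Reserve memory for the crash kernel by appending crashkernel=
    # to the kernel line in /boot/grub/grub.conf, for example:
    #   kernel /vmlinuz-2.6.32-... ro root=... crashkernel=128M

    # Enable kdump at boot and start it now
    chkconfig kdump on
    service kdump start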

 

KDUMP is great, but sometimes you need a bit more, so here are two more articles on how you can make small changes and capture even more data from scripts (handy for application debugging; see the sketch after these links) and how to use kdump over a bonded network interface:

 

How do I run commands using kdump, before vmcore is saved?

https://access.redhat.com/knowledge/solutions/2163

 

How to setup Kdump with bonding and VLAN?

https://access.redhat.com/knowledge/solutions/71313
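

For the first of those, the hook is the kdump_pre directive in /etc/kdump.conf, which runs a script in the capture kernel before the vmcore is written. A minimal sketch, with a hypothetical script path (what a script can usefully do inside the capture environment varies by release, so treat this as a shape, not a recipe):

    # /etc/kdump.conf
    kdump_pre /var/crash/scripts/pre-dump.sh

    # /var/crash/scripts/pre-dump.sh (must be executable)
    #!/bin/sh
    # Log a timestamped marker to the capture kernel's console so
    # the capture attempt can be correlated with application logs
    echo "kdump: pre-dump hook ran at $(date)"
    exit 0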

 

And probably the most common issue we see with KDUMP is sizing the crash kernel. Sometimes 128 MB of RAM isn’t enough, and you have to make adjustments for systems with larger memory footprints or with third-party products and drivers (see the sketch after the link below):

 

kdump hangs with "Out of memory" error during vmcore collection

https://access.redhat.com/knowledge/solutions/21187
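

Resizing the reservation is just a kernel command-line change plus a reboot. A minimal sketch (256M is an illustrative figure only; size it according to the article above):

    # See what is currently reserved
    grep -o 'crashkernel=[^ ]*' /proc/cmdline

    # Edit the kernel line in your GRUB config to raise it, e.g.
    # change crashkernel=128M to crashkernel=256M, then reboot and
    # restart kdump so it uses the larger reservation
    service kdump restart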

 

And to circle back to the useful axiom of always being prepared: once you set this up, TEST IT. Make sure that your settings are correct, that there’s enough storage, and that the firewall rules are open. Integrate kdump into your standard disaster recovery processes to ensure it’s there and working when you need it.
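A common way to test (on a non-production box, or during a maintenance window, because this genuinely crashes the machine) is to force a panic with the magic SysRq key and confirm a vmcore lands in your configured dump target:

    # Enable the magic SysRq key, then trigger a kernel panic;
    # kdump should boot the capture kernel and save a vmcore
    echo 1 > /proc/sys/kernel/sysrq
    echo c > /proc/sysrq-trigger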

 

Feel free to share any tips you might have about how to be prepared for problems, and the things you do to ensure you’re protected.
