How using KDUMP can help you get to RCA faster
As much as you hope it never happens, someday, somewhere (probably at the WORST time) your computer or server is going to crash. It could be small or it could be epic. No matter the time or the cause, it’ll be tragic: tears will be shed, teeth will be gnashed, wailing will be heard. How can you prepare yourself for that day, get back online as quickly as possible, and get to root cause so the error doesn’t occur again?
Of all the things Scouting has taught me, the one that has stuck with me most is the motto: “Always be prepared.” With a few simple steps you can set yourself up to capture vital data about a problem, so the work you do today will pay huge dividends down the road when you need to understand exactly what happened on the day of “the Great Server Crash of <insert date>”.
Whenever there’s an issue, a few vital pieces of data can help determine precisely what happened. First is an sosreport, which gathers critical logs and settings so we can see a bit more of the situation. The next, and arguably more important, file is a core file captured by kdump. The core file captures what was going on in memory at the time of the crash, which gives us the essential data to zero in on which process failed and what it was doing.
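For example, on RHEL an sosreport can usually be generated with a single command from the sos package (the exact invocation can vary slightly by release):

    # Install the sos package if it isn't already present
    yum install sos
    # Generate the report; the tarball is written under /tmp by default
    sosreport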
First let me share these helpful articles with you:
How can I collect system information to provide to Red Hat Support for analysis when the system hangs?
https://access.redhat.com/knowledge/solutions/23069
How do I troubleshoot kernel crashes with kexec/kdump on Red Hat Enterprise Linux?
https://access.redhat.com/knowledge/solutions/6038
These articles lay out the process for setting KDUMP up. It takes some time, so you might want to add it to existing servers during your next change window, and make plans to change your standard builds to incorporate it out of the gate.
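As a rough sketch (the articles above are the authoritative reference, and details differ between releases), enabling kdump on a RHEL 5 or 6 system looks something like this:

    # Install the kexec/kdump tooling
    yum install kexec-tools

    # Reserve memory for the crash kernel by adding crashkernel=<size>
    # to the kernel line in /boot/grub/grub.conf, for example:
    #   kernel /vmlinuz-... ro root=... crashkernel=128M
    # (a reboot is required for the reservation to take effect)

    # Turn the service on at boot and start it
    chkconfig kdump on
    service kdump start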
KDUMP is great, but sometimes you need a bit more, so here are two more articles on how you can make small changes and capture even more data from scripts (like for application debugging) and how to use kdump over a bonded network interface; a small configuration sketch follows the links:
How do I run commands using kdump, before vmcore is saved?
https://access.redhat.com/knowledge/solutions/2163
How to setup Kdump with bonding and VLAN?
https://access.redhat.com/knowledge/solutions/71313
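For the script case, /etc/kdump.conf supports a kdump_pre directive that runs an executable right before the vmcore is written. A minimal sketch, with a hypothetical script path:

    # /etc/kdump.conf (excerpt)
    # Run a custom script immediately before the dump is taken.
    # If the script exits non-zero, the dump is skipped and the system reboots.
    kdump_pre /var/crash/scripts/pre-dump.sh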
And probably the most common question we get about KDUMP concerns sizing the crash kernel. Sometimes 128 MB of RAM isn’t enough, and you have to make adjustments for systems with larger memory footprints or with third-party products and drivers:
kdump hangs with "Out of memory" error during vmcore collection
https://access.redhat.com/knowledge/solutions/21187
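Adjusting the reservation just means changing the crashkernel= boot parameter and rebooting; the right value depends on your memory size and drivers, so treat the numbers below as examples only:

    # In /boot/grub/grub.conf, change the kernel line, e.g. from
    #   crashkernel=128M
    # to
    #   crashkernel=256M
    # (newer RHEL 6 kernels also accept crashkernel=auto)
    # After rebooting, confirm the reservation:
    cat /proc/cmdline
    dmesg | grep -i crashkernel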
And to circle back to the useful axiom of always being prepared: once you set this up, TEST IT. Make sure your settings are correct, that there’s enough storage, and that the firewall rules are open. Integrate kdump into your standard disaster recovery processes to ensure it’s there and working when you need it.
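A common way to test on a non-production box (or during a change window) is to force a kernel panic with the magic SysRq key and then verify that a vmcore appears in your configured dump target (/var/crash by default):

    # WARNING: this deliberately panics the kernel; the system will reboot.
    echo 1 > /proc/sys/kernel/sysrq
    echo c > /proc/sysrq-trigger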
Feel free to share any tips you might have about how to be prepared for problems and things you do to ensure you’re protected.
Responses
The following document explains how to provide a vmcore to Red Hat Support, which makes analysis and response faster.
How can I provide large files (vmcore, rhev logcollector, large sosreports, heap dumps, big log files, etc) to Red Hat Support?
https://access.redhat.com/knowledge/solutions/2112
One thing to be wary of before enabling core dump collection is what your enterprise’s security folks think of the practice. Many institutions that deal in sensitive data forbid enabling core dump collection. It’d suck to get called on the carpet because a security audit turned up that you violated policy by enabling it.
After a long, drawn-out discussion between the customer’s legal department and our legal department regarding PHI (Protected Health Information) and the possibility that such information could end up in the vmcore file, I suggested installing the CAS server at the customer’s site, which I did just recently. I used an internal document to install it in my test lab before doing it at the customer’s site.
Because my customer follows ITIL almost to the letter and has to comply with HIPAA, they needed things well documented before they could put anything into production, so I customized the internal document and gave it to my customer as a reference document. At some point my customer wants to install a CAS server at each data center and possibly at the hospital’s site.
If you have to install CAS on a RHEL 5 server, you’ll have to upgrade Python to 2.6 before you can install CAS (version 0.16). This complicates things, because RHEL 5 ships with Python 2.4. In this scenario, I compiled and installed Python 2.6 from source (which puts RHEL 5 in an unsupported state). I haven’t tried an older version of CAS that might only need Python 2.4, but I wanted to use the version listed in the internal document. I convinced my customer to stand up a RHEL 6 VM with about 400 GB of storage to hold the kernel debug symbols and the compressed vmcore file.
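For reference, the source build followed the usual configure/make pattern; using make altinstall leaves the system Python 2.4 untouched (the version and prefix below are illustrative):

    # Unpack the Python 2.6 source tarball, then:
    cd Python-2.6.x
    ./configure --prefix=/usr/local
    make
    # altinstall installs python2.6 without overwriting /usr/bin/python
    make altinstall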
Sorry for being long-winded, but I hope this information is helpful. Don’t hesitate to reach out to me if I can help out with anything.
