How to troubleshoot kernel crashes, hangs, or reboots with kdump on Red Hat Enterprise Linux
Environment
- Red Hat Enterprise Linux (RHEL)
Issue
- How do I configure kexec/kdump on RHEL systems?
- Root Cause Analysis (RCA) of kernel panic / server crash is required
- How do I troubleshoot and investigate an unexpected reboot?
- How do I generate a kernel memory core dump (vmcore) on my system?
- Our system entered a hung state or became unresponsive; how can we troubleshoot?
- How much time is required to capture a vmcore?
- How much disk space is required to generate a vmcore?
Resolution
Review the kdump documentation for the Red Hat Enterprise Linux (RHEL) version you are running in order to configure the service to your requirements.
For your convenience, you can refer to the documentation links below:
- RHEL 6 - The kdump Crash Recovery Service
- RHEL 7 - Kernel Crash Dump Guide
- RHEL 8 - Dumping a Crashed Kernel for later Analysis
- RHEL 9 - Chapter 11. Installing kdump
- Note: When making a change to the main kdump configuration file (/etc/kdump.conf), restart the service via the service kdump restart command (on RHEL 7 and later, systemctl restart kdump.service). If you will be rebooting the system later, this step can be skipped.
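For illustration, a minimal local-disk configuration might look like the following sketch; the values are examples to adapt, not required settings:
# cat /etc/kdump.conf
path /var/crash
core_collector makedumpfile -l --message-level 1 -d 31
Here path sets the directory that receives the vmcore, while core_collector invokes makedumpfile with LZO compression (-l) and dump level 31 (-d 31), which discards pages that are not needed for kernel analysis.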
To configure kdump more extensively, or in non-standard environments, please refer to the Extended KDUMP Configurations section.
Contents
- Background / Overview
- Prerequisites
- Installing KDUMP
- Extended KDUMP Configurations
- Using the KDUMP Helper Tool
- Sizing Local Dump Targets
- KDUMP in Clustered Systems
- Testing KDUMP
- Vmcore Capture Time
- Controlling which events trigger a Kernel Panic
- Diagnostic Steps
Background / Overview
kexec is a fast-boot mechanism that allows booting a Linux kernel from the context of an already running kernel without going through the BIOS. Since BIOS checks at startup can be very time consuming (especially on big servers with numerous peripherals), kexec can save a lot of time for developers who need to reboot a machine often for testing purposes. Using kexec to reboot into a normal kernel is simple, but not within the scope of this article; see the kexec(8) man page.
kdump is a reliable kernel crash-dumping mechanism that utilizes the kexec software. The crash dump is captured from the context of a freshly booted kernel, not from the context of the crashed kernel: whenever the system crashes, kdump uses kexec to boot into a second kernel. This second kernel, often called a capture kernel, boots with very little memory and captures the dump image.
The first kernel reserves a section of memory that the second kernel uses to boot. Be aware that the memory reserved for the kdump kernel at boot time cannot be used by the standard kernel, which changes the actual minimum memory requirements of Red Hat Enterprise Linux. To compute the actual minimum memory requirement for a system, take the minimum listed in Red Hat Enterprise Linux technology capabilities and limits and add the amount of memory reserved for kdump.
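As a quick sanity check of the reservation on a running system, you can inspect the kernel command line; on RHEL 7 and later, kdumpctl can also report how much memory is currently reserved:
# grep -o 'crashkernel=[^ ]*' /proc/cmdline
# kdumpctl showmem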
Using kdump allows booting the capture kernel without going through the BIOS, hence the contents of the first kernel's memory are preserved; this preserved memory is essentially the kernel crash dump. At the moment of a kernel panic, the secondary kernel boots up and collects, compresses, and dumps the first kernel's memory according to the kdump configuration.
Note: A memory core dump (vmcore) is a copy of the data stored in a system's memory at the time of a kernel panic. Therefore, it may contain sensitive data, and should be treated as such.
Prerequisites
- For dumping cores to a network target, access to a server over NFS or SSH is required (a configuration sketch follows this list).
- Whether dumping locally or to a network target, a volume, device or directory with enough free disk space is needed to hold the core file. See the Sizing Local Dump Targets section for more information.
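For a network target on RHEL 7 and later, the relevant /etc/kdump.conf directives look roughly like the following sketch; the server names, export path, and user are placeholders (RHEL 5 and 6 use the older net directive instead, and an SSH target also requires a key set up with kdumpctl propagate):
# NFS target (example export)
nfs nfs-server.example.com:/export/crash
# or an SSH target (example user and host)
ssh kdump@crash-server.example.com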
Installing KDUMP
Verify the kexec-tools package is installed:
# rpm -q kexec-tools
If it is not installed, proceed to install it via yum:
# yum install kexec-tools
For the IBM Power (ppc64) architecture up to RHEL 5.x, and for the IBM System z (s390x) architecture up to RHEL 7.x, the capture kernel is provided in a separate package called kernel-kdump, which must be installed for kdump to function:
# yum install kernel-kdump
Note: This package is not necessary (and in fact does not exist) in newer RHEL versions for these architectures, nor on other architectures.
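Once the packages are installed, the service generally needs to be enabled and started; a minimal sketch for the common cases follows.
On RHEL 6 and earlier:
# chkconfig kdump on
# service kdump start
On RHEL 7 and later:
# systemctl enable kdump.service
# systemctl start kdump.service
You can then confirm the service came up with service kdump status or systemctl status kdump.service.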
Extended KDUMP Configurations
If your system or environment requires a more extensive or non-standard kdump configuration, please refer to the links below:
- For RHEL 3 and RHEL 4, netdump must be used. Refer to How do I configure netdump on Red Hat Enterprise Linux 3 and 4?
- For Xen guests, xendump must be used. Refer to How do I configure Xendump on Red Hat Enterprise Linux 5?
- For Hyper-V guests, refer to How to configure kdump for a Red Hat Enterprise Linux system running on Microsoft Hyper-V
- To troubleshoot system hangs on Hyper-V guests, please refer to How to panic a hung RHEL Guest on a Hyper V host using an NMI
- For Azure guests, refer to How to configure kdump to capture a vmcore for Microsoft Azure virtual machines
- To send an NMI to Azure guests, refer to How to send NMI to an Azure VM
- For KVM and RHEV, refer to How to capture vmcore dump from a KVM guest?
- For system hangs on VMware guests, refer to How to capture a vmcore of hung Red Hat Enterprise Linux VMware guest system using vmss2core tool?
- For RHEL 5, or RHEL 6.0, 6.1, and 6.2, on the s390 architecture, refer to How to capture memory dump of a z/VM guest?
- To trigger a panic on systems running on s390x architectures, refer to How do I trigger a panic on a s390x system?
- For Red Hat CoreOS 4 and Red Hat Openshift Container Platform 4, refer to Setting up kdump in Red Hat Openshift Container Platform and Red Hat CoreOS
- For system hangs on Red Hat OpenStack Platform instances, refer to How to capture vmcore of a OpenStack instance?
- For RHEL and AWS EC2 Nitro, refer to Trigger a Kernel Panic on AWS EC2 Nitro Instances by NMI method
- For RHEL and Nutanix AHV, refer to Trigger a Kernel Panic on Nutanix Instances by SYSRQ method to capture vmcore
Note: KVM and RHEV guests are not required to use the aforementioned method, but it is an additional option for capturing a vmcore when the VM is unresponsive.
Using the KDUMP Helper Tool
Red Hat provides the KDump Helper tool to help you set up kdump in RHEL 5 and later.
You can input a minimum amount of information and the tool will generate an all-in-one script for you to set up kdump with a very basic configuration, or you can generate a script to set up kdump with extended configurations for a number of particular scenarios (like system hang, Process D state, or soft lockups).
The generated script figures out the correct crashkernel= parameter and adds it to the currently active GRUB menu entry.
You can refer to the KDump Helper Blog post for more information, and leave any feedback at the KDump Helper App Info.
Sizing Local Dump Targets
The size of the vmcore file, and therefore the amount of disk space necessary to store it, will mainly depend on the following:
- How much of the system’s RAM was in use at the moment of the kernel panic
- What type of data is stored in RAM
- The compression type and dump level set in the core_collector parameter of the /etc/kdump.conf file
In more recent RHEL versions, with the default dump level discarding pages not related to kernel memory and compression enabled, the average size of a vmcore is relatively small compared to total system RAM. You can refer to the latest user statistics in order to estimate the amount of free space to reserve for the dump target.
That being said, the only reliable way to guarantee that a full vmcore is generated is for the dump target to have free space at least equal in size to the physical RAM.
To determine the actual size of a vmcore, and to verify that the desired kdump configuration works, it is recommended to manually crash the system.
Note: Testing requires down time for the intended systems.
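If a test crash is not immediately possible, note that on RHEL 8.3 and later kexec-tools can produce a rough size recommendation for the configured dump target; treat its output as an estimate rather than a guarantee:
# kdumpctl estimate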
KDUMP in Clustered Systems
Cluster nodes can be fenced/rebooted before kdump has time to complete. In clustered environments it is generally necessary to configure additional time for kdump to complete before fencing.
Please refer to the following for more information on clusters running the Red Hat High Availability, Resilient Storage Add-ons, RHEL Advanced Platform Cluster, or Red Hat Cluster Suite:
How do I configure kdump for use with the RHEL High Availability Add-On?
Testing KDUMP
After configuring kdump, please schedule down time for the relevant systems in order to manually test a system crash and to verify that a full vmcore is generated in the configured dump target.
Warning: These testing procedures will panic your kernel, killing all services on the machine.
- We recommend first testing the kdump configuration by issuing a kernel panic via the SysRq trigger.
The SysRq-Facility is a special key combination that, when enabled, allows the user to force the system's kernel to respond to a specific command. This feature is mostly used for troubleshooting kernel-related problems, or for forcing a response from a system in a non-responsive (hung) state.
- After confirming that a full vmcore is generated from a SysRq panic, we recommend continuing the test by issuing a Non-Maskable Interrupt (NMI), which can be triggered by pushing an NMI button.
An NMI is an interrupt that cannot be ignored by standard operating system mechanisms. It is generally used only for critical hardware errors, and it can signal an operating system when other standard input mechanisms (keyboard, ssh, network, etc.) have ceased to function.
- Triggering a panic via the NMI button is a more reliable method of obtaining a vmcore from a hung system than the SysRq-Facility trigger, as in some cases the NMI can force the system to respond even when standard keyboard input is no longer accepted.
The preferred testing procedure is described below:
- Test the kdump configuration by using the SysRq-Facility to trigger a kernel panic. If kdump works correctly, the system is rebooted and a full vmcore is saved.
- If a full vmcore is saved, configure the NMI related sysctl parameters.
- Reboot the system once to make sure the configuration is persistent.
- For testing the NMI button, push the button to trigger a kernel panic. If the NMI button works correctly, the system is rebooted and a full vmcore is saved.
Configuring and manually crashing a system:
First, configure the SysRq-Facility to permit all triggers:
# sysctl -w kernel.sysrq=1
OR
# echo 1 > /proc/sys/kernel/sysrq
You can trigger the panic by issuing the following command:
# echo c > /proc/sysrq-trigger
You can also trigger a SysRq-Facility panic by pressing the <ALT>+<SYSRQ>+C console key combination.
- For more information about the SysRq-Facility, please refer to What is the SysRq-Facility and how do I use it?
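Note that sysctl -w (or writing to /proc) changes the value only for the running kernel. If SysRq should remain enabled across the reboot performed later in this procedure, also persist the setting, for example:
# echo 'kernel.sysrq = 1' >> /etc/sysctl.conf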
Confirm a full vmcore is generated, and move on to configure the NMI related parameters.
If only an incomplete vmcore was saved, please refer to the Sizing Local Dump Targets and the Diagnostic Steps sections.
To configure the kernel to panic and generate a vmcore when the NMI button is pushed, add the following parameters:
# vim /etc/sysctl.conf
…
kernel.unknown_nmi_panic = 1
kernel.panic_on_io_nmi = 1
kernel.panic_on_unrecovered_nmi = 1
Afterwards, reboot the system once and make sure the NMI configuration persisted.
Then, generate an NMI from the respective platform and verify that a full vmcore has been generated in the dump path.
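Assuming the default local dump target, each crash is typically saved in its own subdirectory under /var/crash (named after the host's address and a timestamp), so a quick check after the system comes back up might look like:
# ls -lR /var/crash/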
- For more information on Non-Maskable Interrupts, please refer to An Introduction to Non-Maskable Interrupts (NMIs) and What is an NMI and what can I use it for? at our Knowledge Base.
Please note: The NMI functions depend on the system’s hardware or virtualization platform. If you are unsure how to perform this function, please contact the relevant platform or hardware vendor.
Vmcore Capture Time
The time required to dump a vmcore depends on the options used in its configuration. Refer to How to determine the time required for dumping a vmcore file with kdump?
Controlling which events trigger a Kernel Panic
There are several parameters that control under which circumstances kdump is activated. Most of these can be enabled via sysctl tunable parameters; the most commonly used are described below.
When configuring a sysctl tunable via a sysctl.conf file, make sure to apply the setting and make it persistent by issuing the sysctl -p <file path> command via sudo or as the root user (if a file path is not specified, the default is /etc/sysctl.conf).
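For example, to set one of the tunables described below and apply it immediately (the chosen tunable is only an illustration):
# echo 'kernel.softlockup_panic = 1' >> /etc/sysctl.conf
# sysctl -p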
Note: While it is recommended to enable as many of these tunables as possible, so that a vmcore is generated in as many scenarios as possible, please verify beforehand that each tunable is suitable for the expected workload and environment.
System hangs due to NMI
- Occurs when a Non-Maskable Interrupt is issued, usually due to a hardware fault.
- To configure the kernel to panic when an NMI occurs, add the following to your sysctl.conf file:
# vim /etc/sysctl.conf
…
kernel.unknown_nmi_panic = 1
kernel.panic_on_io_nmi = 1
kernel.panic_on_unrecovered_nmi = 1
- For more information on configuring the system to panic when an NMI is issued, please refer to How can I configure my system to crash when NMI switch is pushed?
Out of memory (OOM) Kill event
- Occurs when a memory request (page fault or kernel memory allocation) is made while not enough memory is available; the system then terminates an active task (usually a non-prioritized process utilizing a lot of memory).
- To configure the kernel to panic when an OOM-Kill event occurs, add the following to your sysctl.conf file:
# vim /etc/sysctl.conf
…
vm.panic_on_oom = 1
- For more information on configuring the system to panic at OOM-Kill, and other relevant tunables, refer to What are the sysctl tunables for the OOM Killer configuration, available for RHEL6?
CPU Soft Lockup event
- Occurs when a task uses the CPU for more time than the allowed threshold (controlled by the kernel.watchdog_thresh tunable; the soft lockup threshold defaults to 20 seconds).
- To configure the kernel to panic when a CPU Soft Lockup occurs, add the following to your sysctl.conf file:
# vim /etc/sysctl.conf
…
kernel.softlockup_panic = 1
- For more information on CPU Soft Lockups, refer to the What is a CPU soft lockup? article.
- Note: This setting is not recommended for virtual machines, as they are more susceptible to Soft Lockups when the hypervisor is over-committed. For more information, refer to Virtual machine reports a "BUG: soft lockup" (or multiple at the same time).
Hung / Blocked Task event
- Occurs when a process is stuck in Uninterruptible Sleep (D state) for more time than the allowed threshold (the kernel.hung_task_timeout_secs tunable, default 120 seconds).
- To configure the kernel to panic when a task becomes hung, add the following to your sysctl.conf file:
# vim /etc/sysctl.conf
…
kernel.hung_task_panic = 1
- For more information regarding the hung task check and relevant tunables, refer to the How do I use hung task check in RHEL? solution.
Diagnostic Steps
If you are encountering issues with configuring kdump, or with generating a full vmcore, please refer to the common KDUMP troubleshooting article.
If these issues persist, or if you are encountering an unexpected behavior, please submit a new Technical Support case.