How to troubleshoot kernel crashes, hangs, or reboots with kdump on Red Hat Enterprise Linux

Solution Verified - Updated Saturday at 11:39 AM

Environment

Red Hat Enterprise Linux (RHEL)

Issue

How do I configure kexec/kdump on RHEL systems?
Root Cause Analysis (RCA) of kernel panic / server crash is required
How do I troubleshoot and investigate an unexpected reboot?
How do I generate a kernel memory core dump (vmcore) on my system?
Our system entered a hung state or became unresponsive. How can we troubleshoot it?
How much time is required to capture a vmcore?
How much disk space is required to generate a vmcore?

Resolution

Review the kdump documentation for the Red Hat Enterprise Linux (RHEL) version you are running, so that you can configure the service to match your requirements.
For your convenience, you can refer to the below documentation links:

RHEL 6 - The kdump Crash Recovery Service

RHEL 7 - Kernel Crash Dump Guide

RHEL 8 - Dumping a Crashed Kernel for later Analysis

RHEL 9 - Installing kdump

Note: After changing the main kdump configuration file (/etc/kdump.conf), restart the service for the change to take effect: use the service kdump restart command on RHEL 6, or systemctl restart kdump on RHEL 7 and later.
- If you will be rebooting the system later, this command can be skipped.

To configure kdump more extensively, or in non-standard environments, please refer to the Extended KDUMP Configurations section.

Background / Overview
Prerequisites
Installing KDUMP
Extended KDUMP Configurations
Using the KDUMP Helper tool
Sizing Local Dump Targets
KDUMP in Clustered Systems
Testing KDUMP
Time required to capture vmcore
Controlling which events trigger a Kernel Panic

Background / Overview

kexec is a fast-boot mechanism that allows booting a Linux kernel from the context of an already running kernel without going through the BIOS. Since BIOS checks at startup can be very time consuming (especially on big servers with numerous peripherals), kexec can save a lot of time for developers who need to reboot a machine often for testing purposes. Using kexec to reboot into a normal kernel is simple, but outside the scope of this article. See the kexec(1) man page.

kdump is a reliable kernel crash-dumping mechanism that utilizes the kexec software. The crash dumps are captured from the context of a freshly booted kernel; not from the context of the crashed kernel. Kdump uses kexec to boot into a second kernel whenever the system crashes. This second kernel, often called a capture kernel, boots with very little memory and captures the dump image.

The first kernel reserves a section of memory that the second kernel uses to boot. Be aware that the memory reserved for the kdump kernel at boot time cannot be used by the standard kernel, which changes the actual minimum memory requirements of Red Hat Enterprise Linux. To compute the actual minimum memory requirements for a system, refer to Red Hat Enterprise Linux technology capabilities and limits for the listed minimum memory requirements and add the amount of memory used by kdump to determine the actual minimum memory requirements.
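The calculation described above is a simple addition. The following is a minimal sketch, not an official sizing tool: `actual_min_memory_mib` is a hypothetical helper name, and the 1536 MiB and 256 MiB figures are illustrative assumptions rather than values taken from the capabilities and limits article. The sysfs file `/sys/kernel/kexec_crash_size` reports the current reservation in bytes on kernels that expose it.

```shell
# Sketch: actual minimum memory = listed minimum + memory reserved for kdump.
# Both arguments are in MiB; the figures used below are illustrative only.
actual_min_memory_mib() {
    echo $(( $1 + $2 ))
}

# Example with assumed values: a 1536 MiB listed minimum and a 256 MiB reservation.
actual_min_memory_mib 1536 256

# On a running system, the current reservation (in bytes) can be read from sysfs.
if [ -r /sys/kernel/kexec_crash_size ]; then
    reserved_bytes=$(cat /sys/kernel/kexec_crash_size)
    echo "crashkernel reservation: $(( reserved_bytes / 1024 / 1024 )) MiB"
fi
```

A reported reservation of 0 would suggest that no memory is currently set aside for the capture kernel.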

Because kexec boots the capture kernel without going through the BIOS, the contents of the first kernel's memory are preserved; this preserved memory is essentially the kernel crash dump. At the moment of a kernel panic, the secondary kernel boots, then collects, compresses, and dumps the first kernel’s memory according to the kdump configuration.

Note: A memory core dump (vmcore) is a copy of the data stored in a system's memory at the time of a kernel panic. Therefore, it may contain sensitive data, and should be treated as such.

Prerequisites

For dumping cores to a network target, access to a server over NFS or SSH is required.
Whether dumping locally or to a network target, a volume, device or directory with enough free disk space is needed to hold the core file. See the Sizing Local Dump Targets section for more information.

Installing KDUMP

Verify the kexec-tools package is installed:

# rpm -q kexec-tools

If it is not installed, proceed to install it via yum / dnf:

 # yum install kexec-tools

For IBM Power (ppc64) architecture up to RHEL 5.x, and for IBM System z (s390x) architecture up to RHEL 7.x, the capture kernel is provided in a separate package called kernel-kdump which must be installed for kdump to function:

 # yum install kernel-kdump

Note: This package is not necessary (and in fact does not exist) in newer versions of the aforementioned architectures, nor in other architectures.

Extended KDUMP Configurations

If your system or environment requires an extended or non-standard kdump configuration, please refer to the below links:

For RHEL and AWS EC2 Nitro, refer to Trigger a Kernel Panic on AWS EC2 Nitro Instances by NMI method
For RHEL and Nutanix AHV, refer to Trigger a Kernel Panic on Nutanix Instances by SYSRQ method to capture vmcore
For Hyper-V guests, refer to How to configure kdump for a Red Hat Enterprise Linux system running on Microsoft Hyper-V
- To troubleshoot system hangs on Hyper-V guests, please refer to How to panic a hung RHEL Guest on a Hyper V host using an NMI
For Azure guests, refer to How to configure kdump to capture a vmcore for Microsoft Azure virtual machines
- To send an NMI to Azure guests, refer to How to send NMI to an Azure VM
For encountering system hangs on VMware guests, refer to How to capture a vmcore of hung Red Hat Enterprise Linux VMware guest system using vmss2core tool?
- To send an NMI to VMware guests, refer to How to use VMWare ESX command line to force NMI panic on RHEL guest O/S
For Red Hat CoreOS 4 and Red Hat OpenShift Container Platform 4, refer to the below Troubleshooting operating system issues documentation for each minor version:
- 4.12 | 4.13 | 4.14 | 4.15
- 4.16 | 4.17 | 4.18
For encountering system hangs on Red Hat OpenStack Platform instances, refer to How to capture vmcore of a OpenStack instance ?
For KVM and RHEV, refer to How to capture vmcore dump from a KVM guest?
For RHEL 3 and RHEL 4, netdump must be used. Refer to How do I configure netdump on Red Hat Enterprise Linux 3 and 4?
For RHEL 5, refer to The kdump Crash Recovery Service documentation
For Xen guests, xendump must be used. Refer to How do I configure Xendump on Red Hat Enterprise Linux 5?
For RHEL 5 or RHEL 6.0, 6.1, 6.2 running s390 architectures, refer to How to capture memory dump of a z/VM guest?
- To trigger a panic on systems running on s390x architectures, refer to How do I trigger a panic on a s390x system?

Note: Though KVM and RHEV guests are not required to use the aforementioned method, it is an additional option for capturing a vmcore when the virtual guest is unresponsive.

Using the KDUMP Helper tool

Red Hat provides the KDump Helper tool to help you set up kdump in RHEL 5 and later.
You can input a minimum amount of information and the tool will generate an all-in-one script for you to set up kdump with the basic configuration, or you can generate a script to set up kdump with extended configurations for a number of particular scenarios (like system hangs, processes stuck in D-state, CPU Soft Lockups, etc).
When run, the generated script determines an appropriate crashkernel= parameter and adds it to the currently active GRUB menu entry.
You can refer to the KDump Helper Blog post for more information.
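Whichever way the crashkernel= parameter is added, it can be useful to confirm that the running kernel actually booted with it. The following is a minimal sketch: `crashkernel_value` is a hypothetical helper name, and it reports only the first crashkernel= entry it finds, so systems that pass more than one such parameter would need a closer look.

```shell
# Hypothetical helper: print the crashkernel= value found in a kernel command
# line string, or exit non-zero if the parameter is absent.
# The body runs in a subshell so that "set -f" does not leak to the caller.
crashkernel_value() (
    set -f   # disable globbing while the string is split into words
    for word in $1; do
        case "$word" in
            crashkernel=*) echo "${word#crashkernel=}"; exit 0 ;;
        esac
    done
    exit 1
)

# Check the running kernel's command line when it is readable.
if [ -r /proc/cmdline ]; then
    if value=$(crashkernel_value "$(cat /proc/cmdline)"); then
        echo "crashkernel=$value"
    else
        echo "no crashkernel= parameter on the running kernel's command line"
    fi
fi
```

A change to the boot loader configuration only shows up here after the next reboot.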

Sizing Local Dump Targets

The size of the vmcore file, and therefore the amount of disk space necessary to store it, will mainly depend on the following:

How much of the system’s RAM was in use at the moment of the kernel panic
The type of data stored in RAM
The type of compression and the dump level set in the “core_collector” parameter of the /etc/kdump.conf file

In more recent RHEL versions, and with the default compression level discarding pages not related to kernel memory, the average size of a vmcore is relatively small (when compared to total system RAM). You can refer to the latest user statistics in order to estimate the amount of free space to reserve for the dump target.

That being said, the only reliable way to guarantee that a full vmcore is generated is for the dump target to have free space at least equal in size to the physical RAM.
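As a rough pre-check of this worst case, the free space on the dump target can be compared with the amount of RAM. The following is a sketch under stated assumptions: `dump_target_fits_ram` is a hypothetical helper name, `/var/crash` is assumed to be the dump path, and `MemTotal` from `/proc/meminfo` is used as an approximation of physical RAM (it is slightly smaller than the installed amount).

```shell
# Hypothetical helper: succeed when the available space (KiB, first argument)
# is at least the size of RAM (KiB, second argument).
dump_target_fits_ram() {
    [ "$1" -ge "$2" ]
}

# Assumed dump path; adjust to the "path" configured in /etc/kdump.conf.
dump_path=/var/crash

if [ -r /proc/meminfo ] && [ -d "$dump_path" ]; then
    mem_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    # df -P prints one POSIX-format line per filesystem; column 4 is "Available".
    avail_kib=$(df -Pk "$dump_path" | awk 'NR==2 {print $4}')
    if dump_target_fits_ram "$avail_kib" "$mem_kib"; then
        echo "$dump_path has room for a full vmcore"
    else
        echo "$dump_path has less free space than RAM"
    fi
fi
```

This only covers local targets. For NFS or SSH targets, the same comparison would have to be made on the remote server.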

To determine the actual size of a vmcore, and to verify that the desired kdump configuration works, it is recommended to manually crash the system.
Note: Testing requires down time for the intended systems.

KDUMP in Clustered Systems

Cluster nodes can be fenced/rebooted before kdump has time to complete.
In clustered environments it is generally necessary to configure additional time for kdump to complete before fencing.

Please refer to the following for more information on clusters running the Red Hat High Availability, Resilient Storage Add-ons, RHEL Advanced Platform Cluster, or Red Hat Cluster Suite:
How do I configure kdump for use with the RHEL High Availability Add-On?

Testing KDUMP

After configuring kdump, please schedule down time for the relevant systems in order to manually test a system crash and to verify that a full vmcore is generated in the configured dump target.
Warning: These testing procedures will panic your kernel, killing all services on the machine.

We recommend that you first test the kdump configuration by triggering a kernel panic via the SysRq-Trigger.

The SysRq-Facility is a special key combination that, when enabled, allows the user to force a system’s kernel to respond to a specific command. This feature is mostly for troubleshooting kernel-related problems, or to force a response from a system while it is in a non-responsive state (hang).
After confirming that a full vmcore is generated from a SysRq panic, we recommend that you continue testing by issuing a Non-Maskable Interrupt (NMI). This can be triggered by pushing an NMI button.

An NMI is an interrupt that is unable to be ignored by standard operating system mechanisms. It is generally used only for critical hardware errors. This feature can be used to signal an operating system when other standard input mechanisms (keyboard, ssh, network, etc.) have ceased to function.
- Triggering a panic via the NMI button is a more reliable method of obtaining a vmcore from a hung system than the SysRq-Facility trigger, because in some cases the NMI can force the system to respond even when standard keyboard input is no longer accepted.

The preferred testing procedure is described below:

Test the kdump configuration by using the SysRq-Facility to trigger a kernel panic. If kdump works correctly, the system is rebooted and a full vmcore is saved.
If a full vmcore is saved, configure the NMI-related sysctl parameters.
Reboot the system once to make sure the configuration is persistent.
For testing the NMI button, push the button to trigger a kernel panic. If the NMI button works correctly, the system is rebooted and a full vmcore is saved.

Configuring and manually crashing a system:

First, configure the SysRq-Facility to permit all triggers:

# sysctl -w kernel.sysrq=1
   OR
# echo 1 > /proc/sys/kernel/sysrq

You can then trigger the panic by issuing the echo c > /proc/sysrq-trigger command as root.
You can also trigger a SysRq-Facility panic by pressing the <ALT>+<SYSRQ>+C console key combination.

For more information about the SysRq-Facility, please refer to What is the SysRq-Facility and how do I use it?

Confirm a full vmcore is generated, and move on to configure the NMI related parameters.
If only an incomplete vmcore was saved, please refer to the Sizing Local Dump Targets and Diagnostic Steps sections.
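Once the system is back up after the test crash, the saved dump can be located with a short script. The following is a minimal sketch: `latest_vmcore` is a hypothetical helper name, and it assumes a local target under `/var/crash` in which each dump is stored in its own subdirectory as a file named `vmcore`.

```shell
# Hypothetical helper: print the most recently modified file named "vmcore"
# below the given dump directory, or nothing if none exists.
latest_vmcore() {
    find "$1" -type f -name vmcore -exec ls -t {} + 2>/dev/null | head -n 1
}

# Assumed dump path; adjust to the "path" configured in /etc/kdump.conf.
dump_dir=/var/crash

core=$(latest_vmcore "$dump_dir")
if [ -n "$core" ]; then
    ls -lh "$core"
else
    echo "no vmcore found under $dump_dir"
fi
```

The size reported by `ls -lh` also gives a first data point for sizing the dump target.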

To configure the kernel to panic, so that kdump generates a vmcore, when the NMI button is pushed, add the following lines to /etc/sysctl.conf:

 # vim /etc/sysctl.conf
…
    kernel.unknown_nmi_panic = 1
    kernel.panic_on_io_nmi = 1
    kernel.panic_on_unrecovered_nmi = 1

Afterwards, reboot the system once and make sure the NMI configuration persisted.
Then, generate an NMI from the respective platform and verify that a full vmcore has been generated in the dump path.
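Before rebooting, the configuration file can be checked for the three NMI-related entries. The following is a minimal sketch: `missing_nmi_settings` is a hypothetical helper name, and it inspects only the single file it is given, so entries placed in other sysctl configuration files would not be seen.

```shell
# Hypothetical helper: print each NMI-related key that is not set to 1 in the
# given sysctl.conf-style file. No output means all three are configured.
missing_nmi_settings() {
    for key in kernel.unknown_nmi_panic kernel.panic_on_io_nmi kernel.panic_on_unrecovered_nmi; do
        # Match "key = 1" with optional whitespace; commented lines do not match.
        # The dots in the key are treated as regex wildcards, which is harmless here.
        grep -Eq "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*1[[:space:]]*$" "$1" 2>/dev/null \
            || echo "$key"
    done
}

if [ -r /etc/sysctl.conf ]; then
    missing_nmi_settings /etc/sysctl.conf
fi
```

After the reboot, the running values can be confirmed with `sysctl kernel.unknown_nmi_panic kernel.panic_on_io_nmi kernel.panic_on_unrecovered_nmi`.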

For more information on Non-Maskable Interrupts, please refer to An Introduction to Non-Maskable Interrupts (NMIs) and What is an NMI and what can I use it for? at our Knowledge Base.

Please note: The NMI functions depend on the system’s hardware or virtualization platform. If you are unsure how to perform this function, please contact the relevant platform or hardware vendor.

Time required to capture vmcore

Dumping time depends on the options that are used for its configuration. Refer to How to determine the time required for dumping a vmcore file with kdump?

Controlling which events trigger a Kernel Panic

There are several parameters that control the circumstances under which the kernel panics and kdump is therefore activated. Most of these can be enabled via sysctl tunable parameters; the most commonly used ones are listed below.
When configuring a sysctl tunable via a sysctl.conf file, the entry in the file makes the setting persistent across reboots; to apply it immediately, issue the sysctl -p <file path> command via sudo or as the root user (if a file path is not specified, the default is /etc/sysctl.conf).

Note: While it is possible to enable multiple such tunables simultaneously, so as to make sure a vmcore is generated in as many scenarios as possible, please verify beforehand that each tunable is suitable for the expected workload and environment.
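As an alternative to editing the file by hand, a tunable can be set with a small script that avoids duplicate entries. The following is a minimal sketch: `set_tunable` is a hypothetical helper name, and the example runs against a temporary file rather than /etc/sysctl.conf so that it is safe to try.

```shell
# Hypothetical helper: set "key = value" in a sysctl.conf-style file,
# replacing an existing uncommented entry for the key or appending a new one.
set_tunable() {
    st_file=$1 st_key=$2 st_value=$3
    touch "$st_file" || return 1
    st_tmp=$(mktemp) || return 1
    if awk -v k="$st_key" -v v="$st_value" '
        # Take the text before "=" and strip blanks to get the key name.
        { name = $0; sub(/=.*/, "", name); gsub(/[ \t]/, "", name) }
        name == k { if (!seen) print k " = " v; seen = 1; next }
        { print }
        END { if (!seen) print k " = " v }
    ' "$st_file" > "$st_tmp"; then
        cat "$st_tmp" > "$st_file"
        st_status=$?
    else
        st_status=1
    fi
    rm -f "$st_tmp"
    return "$st_status"
}

# Demonstration on a temporary file: the second call replaces the first entry.
conf=$(mktemp)
set_tunable "$conf" kernel.hung_task_panic 0
set_tunable "$conf" kernel.hung_task_panic 1
cat "$conf"
rm -f "$conf"
```

On a real system, the equivalent would be to run `set_tunable /etc/sysctl.conf kernel.hung_task_panic 1` as root and then `sysctl -p` to apply the setting.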

System hangs due to an NMI

Occurs when a Non-Maskable Interrupt is issued, usually due to a hardware fault.
- To configure the kernel to panic when an NMI occurs, add the following to your sysctl.conf file:
```
# vim /etc/sysctl.conf
…
kernel.unknown_nmi_panic = 1
kernel.panic_on_io_nmi = 1
kernel.panic_on_unrecovered_nmi = 1
```
- For more information on configuring the system to panic when an NMI is issued, please refer to How can I configure my system to crash when NMI switch is pushed?

Out of Memory (OOM) Kill event

Occurs when a memory request (page fault or kernel memory allocation) is made while not enough memory is available, causing the system to terminate an active task (usually a non-prioritized process that uses a lot of memory).
- To configure the kernel to panic when an OOM-Kill event occurs, add the following to your sysctl.conf file:
```
# vim /etc/sysctl.conf
…
  vm.panic_on_oom = 1
```
- For more information on configuring the system to panic at OOM-Kill, and other relevant tunables, refer to What are the sysctl tunables for the OOM Killer configuration, available for RHEL6 and later?

CPU Soft Lockup event

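Occurs when a task keeps a CPU busy in kernel mode, without giving other tasks a chance to run, for longer than the allowed threshold (twice the value of the kernel.watchdog_thresh tunable; with the default of 10 seconds, the threshold is 20 seconds).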
- To configure the kernel to panic when a CPU Soft Lockup occurs, add the following to your sysctl.conf file:
```
# vim /etc/sysctl.conf
…
  kernel.softlockup_panic = 1
```
- For more information on CPU Soft Lockups, refer to the What is a CPU soft lockup? article.
- Note: This setting is not recommended for virtual machines, as they are more susceptible to Soft Lockups when the hypervisor is over-committed. For more information, refer to Virtual machine reports a "BUG: soft lockup" (or multiple at the same time).

Hung / Blocked Task event

Occurs when a process is stuck in Uninterruptible Sleep (D-state) for longer than the allowed threshold (the tunable kernel.hung_task_timeout_secs, default is 120 seconds).
- To configure the kernel to panic when a task becomes hung, add the following to your sysctl.conf file:
```
# vim /etc/sysctl.conf
…
  kernel.hung_task_panic = 1
```
- For more information regarding the Hung Task Check mechanism and its relevant tunables, refer to the How do I use hung task check in RHEL? solution.
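To see which of the panic-related tunables from this section are currently active, their running values can be read from /proc/sys. The following is a minimal sketch: `tunable_path` is a hypothetical helper name, and the mapping assumes key names whose dots are only separators, which holds for the tunables listed here.

```shell
# Hypothetical helper: map a sysctl key to its /proc/sys path.
tunable_path() {
    echo "/proc/sys/$(echo "$1" | tr . /)"
}

# Print the running value of each tunable discussed in this section.
for key in kernel.unknown_nmi_panic kernel.panic_on_io_nmi \
           kernel.panic_on_unrecovered_nmi vm.panic_on_oom \
           kernel.softlockup_panic kernel.hung_task_panic \
           kernel.hung_task_timeout_secs kernel.watchdog_thresh; do
    path=$(tunable_path "$key")
    if [ -r "$path" ]; then
        echo "$key = $(cat "$path")"
    else
        echo "$key is not available on this kernel"
    fi
done
```

Values shown here reflect the running kernel only; an entry is persistent only if it also appears in a sysctl configuration file.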

Diagnostic Steps

If you are encountering issues with configuring kdump, or with generating a full vmcore, please refer to the Common KDUMP troubleshooting article.

If these issues persist, or if you are encountering an unexpected behavior, please submit a new Technical Support case.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
