How to troubleshoot kernel crashes, hangs, or reboots with kdump on Red Hat Enterprise Linux


Environment

  • Red Hat Enterprise Linux 5 [Below steps are not applicable for s390 or z/VM RHEL 5 instances]
  • Red Hat Enterprise Linux 6 [RHEL 6 update 3 is required for s390 or z/VM RHEL instances]
  • Red Hat Enterprise Linux 7
  • Red Hat Enterprise Linux 8

Issue

  • How do I configure kexec/kdump on RHEL?
  • How much disk space is required for kdump to dump the vmcore?
  • Need RCA of kernel crash/panic
  • How do I troubleshoot a server crash/reboot?
  • Want root cause of a system reboot
  • How do I capture a vmcore on my server?
  • My server hung or became unresponsive, how to troubleshoot?
  • Problem collecting a core file with kdump on a host
  • How much time is required to capture vmcore?
  • System freezes unexpectedly, how to troubleshoot?

Resolution

For RHEL 3 and RHEL 4, netdump must be used. Refer to How do I configure netdump on Red Hat Enterprise Linux 3 and 4?
For Xen guests, xendump must be used. Refer to How do I configure Xendump on Red Hat Enterprise Linux 5?
For Hyper-V guests, refer to How to configure kdump for a Red Hat Enterprise Linux system running on Microsoft Hyper-V
For KVM and RHEV, refer to How to capture vmcore dump from a KVM guest?
For RHEL 5 or RHEL 6.0, 6.1, 6.2 s390 instances, refer to How to capture memory dump of a z/VM guest?
For Red Hat CoreOS 4.6 or earlier and Red Hat OpenShift Container Platform 4.6 or earlier, refer to How to configure kdump in Red Hat CoreOS?
For Red Hat CoreOS 4.7 and Red Hat OpenShift Container Platform 4.7, refer to How to enable kdump in Red Hat OpenShift Container Platform 4.7?

Note: KVM and RHEV guests are not required to use the method above; it is an additional option for capturing a vmcore when the VM is unresponsive.

Contents

  1. Background / Overview
  2. Prerequisites
  3. Installing kdump
  4. Adding Boot Parameters
  5. Specifying Kdump Location
  6. Dumping Directly to a Device
  7. Dumping to a file on Disk
  8. Dumping to a Network Device using NFS
  9. Dumping to a Network Device using SSH
  10. Dumping to a SAN Device (For RHEL5)
  11. Dumping to a SAN Device (For RHEL6 with blacklist of multipath)
  12. Dumping to a SAN Device (For RHEL6 with multipath device)
  13. Sizing Local Dump Targets
  14. Specifying Page Selection and Compression
  15. Clustered Systems
  16. Testing
  17. Vmcore Capture Time
  18. Controlling when kdump is activated
  19. Comments

Background / Overview

kexec is a fastboot mechanism that allows booting a Linux kernel from the context of an already running kernel without going through the BIOS. Since BIOS checks at startup can be very time consuming (especially on big servers with numerous peripherals), kexec can save a lot of time for developers who need to reboot a machine often for testing purposes. Using kexec for rebooting into a normal kernel is simple, but not within the scope of this article. See the kexec(1) man page.

kdump is a reliable kernel crash-dumping mechanism that utilizes the kexec software. The crash dumps are captured from the context of a freshly booted kernel; not from the context of the crashed kernel. Kdump uses kexec to boot into a second kernel whenever the system crashes. This second kernel, often called a capture kernel, boots with very little memory and captures the dump image.

The first kernel reserves a section of memory that the second kernel uses to boot. Be aware that the memory reserved for the kdump kernel at boot time cannot be used by the standard kernel, which changes the actual minimum memory requirements of Red Hat Enterprise Linux. To compute the actual minimum memory requirements for a system, refer to Red Hat Enterprise Linux technology capabilities and limits for the listed minimum memory requirements and add the amount of memory used by kdump to determine the actual minimum memory requirements.

Using kdump allows booting the capture kernel without going through the BIOS, so the contents of the first kernel's memory are preserved; this is essentially the kernel crash dump.
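Once crashkernel= has been configured and the system rebooted (see the steps in the following sections), the reservation can be confirmed on the running system. A quick sketch (the sysfs path is standard on modern RHEL kernels; the fallbacks keep it harmless elsewhere):

```shell
# Confirm that memory is reserved for the capture kernel on a running system.
grep -o 'crashkernel=[^ ]*' /proc/cmdline || echo "no crashkernel= parameter set"

# Size in bytes of the reserved crash-kernel region; 0 means no reservation.
cat /sys/kernel/kexec_crash_size 2>/dev/null || echo "kexec_crash_size not available"
```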

The following instructions must be followed in order to start capturing kernel cores with kdump.


Prerequisites

  • For dumping cores to a network target, access to a server over NFS or ssh is required.
  • Whether dumping locally or to a network target, a device or directory with enough free disk space is needed to hold the core. See the "Sizing Local Dump Targets" section below for more information on how much space will be needed.
  • For configuring kdump on a system running a Xen kernel, it is required to have a regular kernel of the same version as the running Xen kernel installed on the system. (If the system is 32-bit with more than 4GB of RAM, kernel-pae should be installed alongside kernel-xen instead of kernel.) Note: The kernel need only be installed. You can continue running the Xen kernel, and no reboot is required.


Installing kdump

Verify the kexec-tools package is installed:

# rpm -q kexec-tools

If it is not installed, proceed to install it via yum:

# yum install kexec-tools

On IBM Power (ppc64) and IBM System z (s390x), the capture kernel is provided in a separate package called kernel-kdump which must be installed for kdump to function:

# yum install kernel-kdump

This package is not necessary (and in fact does not exist) on other architectures.


Adding Boot Parameters


Red Hat provides a KDump Helper tool to help set up kdump on RHEL 5/6. Given a minimal amount of information, this tool generates an all-in-one script that sets up kdump with a very basic configuration, or with extended configurations for particular scenarios such as system hangs, processes stuck in the D state, or soft lockups. Running the generated script determines the correct crashkernel= parameter and adds it to the currently active grub menu entry. Read the KDump Helper Blog and leave feedback at KDump Helper App Info. The KDump Helper helps automate the following steps.


The option crashkernel must be added to the kernel command line parameters in order to reserve memory for the kdump kernel:

  • For i386 and x86_64 architectures on RHEL 5, edit /boot/grub/grub.conf and append crashkernel=128M@16M to the end of the kernel line. (Using the @16M offset on RHEL 6 has caused kdump to fail.)
  • For RHEL 6 i386 and x86_64 systems, use crashkernel=128M

It may be possible to use less than 128M, but testing with only 64M has proven unreliable.

For more information regarding the crashkernel parameter, please refer to the version-specific Red Hat Enterprise Linux (RHEL) knowledge-base articles:

The following is an example of /boot/grub/grub.conf with the kdump options added for RHEL 5:

# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE:  You do not have a /boot partition.  This means that
#          all kernel and initrd paths are relative to /, eg.
#          root (hd0,0)
#          kernel /boot/vmlinuz-version ro root=/dev/hda1
#          initrd /boot/initrd-version.img
#boot=/dev/hda
default=0
timeout=5
splashimage=(hd0,0)/boot/grub/splash.xpm.gz
hiddenmenu
title Red Hat Enterprise Linux Client (2.6.17-1.2519.4.21.el5)
        root (hd0,0)
        kernel /boot/vmlinuz-2.6.17-1.2519.4.21.el5 ro root=LABEL=/ rhgb quiet crashkernel=128M@16M
        initrd /boot/initrd-2.6.17-1.2519.4.21.el5.img

Or for RHEL 6:

# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE:  You have a /boot partition.  This means that
#          all kernel and initrd paths are relative to /boot/, eg.
#          root (hd0,0)
#          kernel /vmlinuz-version ro root=/dev/mapper/vg_example-lv_root
#          initrd /initrd-[generic-]version.img
# boot=/dev/vda
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title Red Hat Enterprise Linux Server (2.6.32-71.7.1.el6.x86_64)
       root (hd0,0)
       kernel /vmlinuz-2.6.32-71.7.1.el6.x86_64 ro root=/dev/mapper/vg_example-lv_root rd_LVM_LV=vg_example/lv_root rd_LVM_LV=vg_example/lv_swap rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=128M rhgb quiet
       initrd /initramfs-2.6.32-71.7.1.el6.x86_64.img

If you are using a Xen kernel on RHEL 5, you will need to add the crashkernel parameter at the end of the kernel command line, not the module command line, even though the module line references the vmlinuz Linux kernel.

For RHEL 5 when using a Xen kernel:

# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE:  You do not have a /boot partition.  This means that
#          all kernel and initrd paths are relative to /, eg.
#          root (hd0,0)
#          kernel /boot/vmlinuz-version ro root=/dev/hda1
#          initrd /boot/initrd-version.img
# boot=/dev/hda
default=0
timeout=5
splashimage=(hd0,0)/boot/grub/splash.xpm.gz
hiddenmenu
title Red Hat Enterprise Linux Server (2.6.18-194.17.1.el5xen)
       root (hd0,0)
       kernel /xen.gz-2.6.18-194.17.1.el5 crashkernel=128M@16M
       module /vmlinuz-2.6.18-194.17.1.el5xen ro root=/dev/myvg/rootvol
       module /initrd-2.6.18-194.17.1.el5xen.img

After adding the crashkernel parameter the system must be rebooted for the crashkernel memory to be reserved for use by kdump. This reboot can be performed now or after the below steps to configure kdump have been completed.

Please note: If the kdump service is not configured to start on boot, the crashkernel= memory will not be set aside. To fully configure and bring the service online, crashkernel= must be in place and the command chkconfig kdump on must be executed prior to a reboot.
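On RHEL 6 and later, the grubby utility can append the parameter to every installed kernel entry instead of editing grub.conf by hand. A minimal sketch (the guard keeps it safe on systems without grubby; hand-editing grub.conf remains the documented method for RHEL 5):

```shell
# Append crashkernel= to all installed kernel entries and verify (RHEL 6+).
if command -v grubby >/dev/null 2>&1; then
    grubby --update-kernel=ALL --args="crashkernel=128M"
    result=$(grubby --info=ALL | grep crashkernel)
else
    result="grubby not available; edit /boot/grub/grub.conf by hand"
fi
echo "$result"
```

A reboot is still required afterwards for the reservation to take effect.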


Specifying the Kdump Location

The location of the kdump vmcore can be specified in /etc/kdump.conf. You can dump directly to a device, to a file, or to a location on the network via NFS or SSH. For RHEL 6, if a target location is not specified in the configuration, default values will be used, resulting in cores being saved to /var/crash on the root file system. For information about supported dump targets and supported networking configurations, see the following knowledge-base articles:


Dumping Directly to a Device

Kdump can be configured to dump directly to a device by using the raw directive in /etc/kdump.conf. The syntax to be used is:

raw *<devicename>*

For example:

raw /dev/sda1

This will overwrite any data that was previously on the device.


Dumping to a file on Disk

kdump can be configured to mount a partition and dump to a file on disk. This is done by specifying the filesystem type followed by the device in /etc/kdump.conf. The device may be specified as a device node, a filesystem label, or a filesystem UUID, in the same manner as /etc/fstab. For example:

    ext3 /dev/sda1

will mount `/dev/sda1` as an ext3 device and dump the core to the `/var/crash/` directory (creating it if necessary), whereas:

    ext3 LABEL=/boot

will mount the device that is ext3 with the label `/boot` and use that to dump the core.

The label may need to be set manually for storage devices that have been configured after Red Hat Enterprise Linux has been installed. For example, the following will set the label 'crash' on the storage device '/dev/sdd1':

    e2label /dev/sdd1 crash

To view the label for a storage device, run 'e2label' with the device as the only argument:

    e2label /dev/sdd1

An easy way to find how to specify the device is to look at what you're currently using in /etc/fstab (the filesystem you're dumping to does not need to be persistently mounted via fstab). The default directory in which the core will be dumped is <filesystem>/var/crash/*<date>*/ where *<date>* is the current date at the time of the crash dump. This can be changed by using the path directive in /etc/kdump.conf. For example:

    ext3 UUID=f15759be-89d4-46c4-9e1d-1b67e5b5da82 
    path /usr/local/cores

will dump the vmcore to <filesystem>/usr/local/cores/ instead of the default <filesystem>/var/crash/ location.
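A small sketch of pulling the UUID for a chosen partition and printing the matching kdump.conf directives (the device name /dev/sda1 and the fallback UUID are placeholders; substitute your prepared dump partition):

```shell
# Print the kdump.conf lines for a dump filesystem identified by UUID.
DEV=/dev/sda1
UUID=$(blkid -s UUID -o value "$DEV" 2>/dev/null)
UUID=${UUID:-"f15759be-89d4-46c4-9e1d-1b67e5b5da82"}   # fallback example value
printf 'ext3 UUID=%s\npath /var/crash\n' "$UUID"
```

After appending lines like these to /etc/kdump.conf, restart the kdump service so the change takes effect.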


Dumping to a Network Device using NFS

To configure kdump to dump to an NFS mount, edit /etc/kdump.conf and add a line with the following format:

When using a version of kexec-tools prior to 2.0.0-232:

net *<nfs server>:</nfs/mount>*

When using version 2.0.0-232 or later:

nfs *<nfs server>:</nfs/mount>*

For example:

net nfs.example.com:/export/vmcores

This will dump the vmcore to /export/vmcores/*<hostname>*-*<date>*/ on the server nfs.example.com. The client system must have access to write to this mount point.

Please note that NFSv4 is supported from RHEL 6.3 onwards.

When dumping to a network location over a bonded interface, it may be necessary to define the bonding module options in the kdump.conf file. See kdump doesn't accept module options from ifcfg-* files for more information.
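Before relying on the NFS target for a real crash, it is worth checking that the host can mount and write to it. A sketch (the server and export names are the examples from above; substitute your own):

```shell
# Mount the NFS export temporarily and verify write access.
NFS_TARGET=nfs.example.com:/export/vmcores
MNT=$(mktemp -d)
if mount -t nfs "$NFS_TARGET" "$MNT" 2>/dev/null; then
    if touch "$MNT/.kdump_write_test" 2>/dev/null; then
        echo "write access OK"
        rm -f "$MNT/.kdump_write_test"
    else
        echo "mounted, but not writable -- check export options and permissions"
    fi
    umount "$MNT"
else
    echo "could not mount $NFS_TARGET -- check exports, firewall, and network"
fi
rmdir "$MNT"
```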


Dumping to a Network Device using SSH

SSH has the advantage of encrypting network communications while dumping. For this reason it is the best solution when you are required to dump a vmcore across a publicly accessible network such as the Internet or a corporate WAN.

When using a version of kexec-tools prior to 2.0.0-232:

net *<user>@<ssh server>*

When using version 2.0.0-232 or later:

ssh *<user>@<ssh server>*

For example:

net kdump@crash.example.com

In this case, kdump will use scp to connect to the crash.example.com server as the kdump user. It will copy the vmcore to the /var/crash/*<hostname>*-*<date>*/ directory. The kdump user will need write permission on the remote server. If you want a different path than /var/crash, specify it using the path option in kdump.conf. For example:

path /home/kdump/vmcores
ssh kdump@dumpserver.example.com

Additionally, when first configuring kdump to use SSH, it will attempt to use the mktemp binary on the target system to ensure write permissions in the target path. If your kdump target server is running an operating system without the mktemp binary, you will need to use a different method to save a vmcore to that target.

For SSH you will also need to add the -F option to the makedumpfile command in kdump.conf. This produces a vmcore.flat file instead of the usual vmcore. For example:

core_collector makedumpfile -c --message-level 1 -d 31 -F

To make these changes take effect, run one of the following commands:

In RHEL 6 and earlier:

# service kdump propagate
Generating new ssh keys... done,
kdump@crash.example.com's password:
/root/.ssh/kdump_id_rsa.pub has been added to
~kdump/.ssh/authorized_keys2 on crash.example.com

In RHEL 7 and later (using systemd):

# kdumpctl propagate
Using existing keys...
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@crashtarget's password: 
Number of key(s) added: 1
Now try logging into the machine, with:   "ssh 'root@crashtarget"
and check to make sure that only the key(s) you wanted were added.
/root/.ssh/kdump_id_rsa has been added to ~root/.ssh/authorized_keys on crashtarget

Make sure the free disk space of the partition or network location specified for storing the vmcore is at least as large as the total physical memory of the system.

The default sshkey value is /root/.ssh/kdump_id_rsa; add this default value to /etc/ssh/ssh_config:

IdentityFile ~/.ssh/kdump_id_rsa
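After running propagate, key-based login can be verified non-interactively. A sketch using the example server and user names from above:

```shell
# BatchMode makes ssh fail instead of prompting when key auth is broken.
RESULT=$(ssh -i /root/.ssh/kdump_id_rsa -o BatchMode=yes \
             kdump@crash.example.com true 2>/dev/null && echo ok || echo failed)
echo "kdump key auth: $RESULT"
```

If this reports failed, rerun the propagate step before relying on kdump over SSH.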


Dumping to a SAN Device (For RHEL5)

  1. Get the wwid for the SAN paths:

    # /sbin/scsi_id -g -u -s /block/sd<x>
    
  2. Blacklist this LUN from multipath by editing /etc/multipath.conf:

    blacklist {
      wwid "3600601f0d057000019fc7845f46fe011"  
    }
    
  3. Reload multipath configuration:

    # /etc/init.d/multipathd reload
    
  4. Now create a partition on the LUN, making sure to select the correct one:

    # fdisk -l  
    # /sbin/scsi_id -g -u -s /block/sd<x>
    # fdisk /dev/sd<x>
    
  5. Create a Linux partition on the disk:

    # partprobe /dev/sd<x>
    
  6. Validate the partition is there:

    # fdisk -l 
    
  7. Put an ext3/ext4/xfs filesystem on it:

    # mkfs.ext3 /dev/sd<x>1
    
  8. Now, let's get a udev rule in place:

    # cat 99-crashlun.rules
    KERNEL=="sd*", BUS=="scsi", ENV{ID_SERIAL}=="3600601f0d057000019fc7845f46fe011", SYMLINK+="crashsan%n"
    
  9. Trigger udev in a way that does not affect everything else:

    # echo change > /sys/block/sd<x>/sd<x>1/uevent
    
  10. Validate that the udev rule worked, looking for /dev/crashsan1:

    # ls /dev/
    
  11. Now update /etc/fstab adding the following to the end of the file:

    /dev/crashsan1         /var/crash       ext3    defaults    0 0
    
  12. Validate that the file system will mount automatically:

    # mount -a 
    # mount
    
  13. Edit /etc/kdump.conf accordingly:

    # ext3 /dev/crashsan1
     -OR-
    # ext3 UUID=c992f458-9dbd-4017-bb73-0eba87633035
    
  14. Restart kdump:

    # service kdump restart
    
  15. Make sure sysrq is enabled and test the crash. WARNING! This will crash the system, so do it at a planned time if this is a production system.

    # echo 'c' > /proc/sysrq-trigger
    
  16. Once the system boots back, check to confirm that it worked.

    # tree /var/crash/
    /var/crash/
    |-- 2012-08-03-13:57
    |   `-- vmcore
    `-- lost+found
    
  17. This was validated on RHEL 5:

    # cat /etc/redhat-release 
    Red Hat Enterprise Linux Server release 5.8 (Tikanga)
    
    # uname -a
    Linux somecoolserver.redhat.com 2.6.18-308.el5 #1 SMP Fri Jan 27 17:17:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
    


Dumping to a SAN Device (For RHEL6 with blacklist of multipath)

Note: This is a workaround method and depends on the particular environment; it is provided for reference only.
This method is not supported by Red Hat.

  1. Get the wwid for the SAN paths:

    #/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/sd<X>
    
  2. Blacklist this lun from multipath by editing /etc/multipath.conf:

    blacklist {
      wwid "3600601f0d057000019fc7845f46fe011"  
    }
    
  3. Reload the multipath configuration:

    # /etc/init.d/multipathd reload  
    
  4. Now let's get a partition created on our LUN. Be sure to select the right one:

    # fdisk -l  
    # /lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/sd<X>
    # fdisk /dev/sd<x>
    
  5. Create a Linux partition on the disk:

    # partprobe /dev/sd<x>
    
  6. Validate the partition is there:

    # fdisk -l 
    
  7. Put an ext3/ext4/xfs file system on it:

    # mkfs.ext3 /dev/sd<x>1
    
  8. Comment any unnecessary wwid entries in the following two files using the "#" character:

    • Switch into the multipath configuration directory:

      # cd /etc/multipath
      
    • Edit the wwids file and comment out the unnecessary wwid entries (the following is an example):

      # vi wwids
      {output truncated}
      # /3600144f08c3d8b000000511256f00001/
      
    • Edit the bindings file and do the same (the following is an example):

      # vi bindings
      {output truncated}
      # mpathc 3600144f08c3d8b000000511256f00001
      
  9. Add the multipath configuration to the initial ramdisk (initramfs):

    # dracut --force --add multipath --include /etc/multipath /etc/multipath
    
  10. Now update /etc/fstab adding the following to the end of the file using the UUID:

    • Check the uuid with blkid:

      # blkid
      
    • Ex: /etc/fstab:

      UUID=4262c8fc-23ad-42b2-9c5d-af9c64d5bb92    /var/crash    ext3    defaults        0 0
      
  11. Validate that the filesystem will mount automatically:

    # mount -a 
    # mount
    
  12. Edit /etc/kdump.conf accordingly:

    ext3 UUID=4262c8fc-23ad-42b2-9c5d-af9c64d5bb92
    
  13. Restart kdump and chkconfig it on:

    # service kdump restart
    # chkconfig kdump on
    
  14. Make sure sysrq is enabled and test the crash. WARNING! This will crash the system, so do it at a planned time if this is a production system.

    # echo 'c' > /proc/sysrq-trigger
    
  15. Once the system boots back, check to confirm that it worked:

    # tree /var/crash/
    /var/crash/
    ├── 127.0.0.1-2013-02-12-21:11:03
    │   └── vmcore
    └── lost+found
    

Note: The environment used for validation is shown below.

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.3 (Santiago)

# uname -a
Linux xxxxx 2.6.32-279.22.1.el6.x86_64 #1 SMP Sun Jan 13 09:21:40 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

# rpm -qa | grep kexec
kexec-tools-2.0.0-245.el6.x86_64

# rpm -qa | grep multipath
device-mapper-multipath-0.4.9-56.el6_3.1.x86_64
device-mapper-multipath-libs-0.4.9-56.el6_3.1.x86_64

Dumping to a SAN Device (For RHEL6 with multipath device)

Note: This method is supported by Red Hat, with the following caveat.

This configuration is only valid from kexec-tools-2.0.0-245.el6.x86_64 onwards; with an older kexec-tools package, a multipath device cannot be used for kdump.
# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.3 (Santiago)

# uname -a
Linux xxxxx 2.6.32-279.22.1.el6.x86_64 #1 SMP Sun Jan 13 09:21:40 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

# rpm -qa | grep kexec
kexec-tools-2.0.0-245.el6.x86_64

# rpm -qa | grep multipath
device-mapper-multipath-0.4.9-56.el6_3.1.x86_64
device-mapper-multipath-libs-0.4.9-56.el6_3.1.x86_64

Checking multipath status

# multipath -ll
mpathf (3600144f08c3d8b000000511a51b10002) dm-7 
size=100G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 12:0:0:1 sdk 8:160 active ready running
  |- 13:0:0:1 sdm 8:192 active ready running
  |- 14:0:0:1 sdo 8:224 active ready running
  |- 15:0:0:1 sdq 65:0  active ready running
  `- 16:0:0:1 sds 65:32 active ready running

Now create a partition on the LUN, making sure you have the right one

# fdisk -l  
# fdisk /dev/mapper/mpath<x>

Create a Linux partition on the disk

# partprobe /dev/mapper/mpath<x>
# multipath -r

Validate the partition is there

# fdisk -l 

Put an ext3 file system on it (ext4 should also work)

# mkfs.ext3  /dev/mapper/mpath<x>p1

Now update /etc/fstab, adding the following to the end of the file using the UUID.

Check the UUID with the blkid command.

# blkid
# vi /etc/fstab
  Ex:
        UUID=b2d74f2e-2dbf-4714-9787-ba1c147c4386    /var/crash    ext4    defaults,_netdev    0 0    <--- for iSCSI multipath
        UUID=b2d74f2e-2dbf-4714-9787-ba1c147c4386    /var/crash    ext4    defaults            0 0    <--- for SAN multipath

Validate that the partition will mount automatically

# mount -a 
# mount

Now edit /etc/kdump.conf accordingly

ext3 UUID=b2d74f2e-2dbf-4714-9787-ba1c147c4386

Restart kdump and enable it with chkconfig.

# service kdump restart

# chkconfig kdump on

Make sure sysrq is enabled and test the crash. This will crash the system, so do it at a planned time if this is a production system.

# echo 'c' > /proc/sysrq-trigger

Once the system boots back, check to confirm that it worked

# tree /var/crash/
/var/crash/
├── 127.0.0.1-2013-02-12-21:11:03
│   └── vmcore
└── lost+found


Sizing Local Dump Targets

The size of the core file, and therefore the amount of disk space necessary to store it, will depend on how much RAM is in use and what type of data is stored there. The only sure way to guarantee a successful dump is to have free space on disk at least equal to physical RAM. However, using the core_collector options (see the "Specifying Page Selection and Compression" section below) you can compress the core dump and remove specific types of pages from it. This should save a large amount of space, but again it depends on how the system is being used. The compression ratio achieved using the "-c" option is entirely dependent on the content stored in RAM; some content will compress better than others.

The best thing to do to determine the space requirements is to test the dump under typical system usage by using the "c" SysRq to crash the system and generate a sample core. Dumping to a dedicated dump server via NFS or SSH using the "net" option in kdump.conf (see the "Dumping to a Network Device" sections above) can help eliminate the need for reserved local storage and reduce overall dump storage requirements. Centralized network dump servers reduce overall storage needs through economies of scale, specifically the improbability that all the systems sharing the central dump server will need to use the storage during overlapping periods.
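A rough pre-flight check along these lines compares total RAM with the free space available at the dump target (worst case, an unfiltered and uncompressed vmcore is roughly the size of physical RAM). The target directory here is the default local one; adjust as needed:

```shell
# Compare physical RAM against free space in the dump target directory.
DUMPDIR=/var/crash          # default local target
ram_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
free_kb=$(df -Pk "$DUMPDIR" 2>/dev/null | awk 'NR==2 {print $4}')
echo "RAM: ${ram_kb} kB, free in ${DUMPDIR}: ${free_kb:-unknown} kB"
if [ -n "$free_kb" ] && [ "$free_kb" -lt "$ram_kb" ]; then
    echo "WARNING: ${DUMPDIR} may be too small for an unfiltered vmcore"
fi
```

This is only an upper-bound estimate; with page filtering and compression the actual vmcore is usually much smaller.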


Specifying Page Selection and Compression

On large memory systems, it is advisable both to discard pages that are not needed and to compress the remaining pages. This is done in kdump.conf with the core_collector directive. At this point in time the only fully supported core collector is makedumpfile. The options can be viewed with makedumpfile --help. The -d option specifies which types of pages should be left out. The option is a bit mask, with each page type assigned a value as follows:

zero pages   = 1
cache pages   = 2
cache private = 4
user  pages   = 8
free  pages   = 16

In general, these pages may not contain relevant information. To set all these flags and leave out these pages, use a value of -d 31. However, if there are no size/space/time constraints, use a value of -d 1 to strip zero pages only. The -c option tells makedumpfile to compress the remaining data pages.

# throw out zero pages (containing no data)
# core_collector makedumpfile -d 1 
# compress all pages, but leave them all
# core_collector makedumpfile -c            
# throw out zero pages and compress 
# core_collector makedumpfile -d 1 -c
# throw out all trivial pages and compress (recommended)
core_collector makedumpfile -d 31 -c      
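The -d value is simply the bitwise OR of the page-type flags listed above, so intermediate combinations are possible as well; for example:

```shell
# -d is a bit mask: OR together the page types you want excluded.
ZERO=1; CACHE=2; CACHE_PRIV=4; USER=8; FREE=16
echo $(( ZERO | CACHE | CACHE_PRIV | USER | FREE ))   # all five types -> 31
echo $(( ZERO | FREE ))                               # zero and free pages only -> 17
```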

Note: When making a change to kdump.conf, a service kdump restart is required. If you will be rebooting the system later, this step can be skipped.


Clustered Systems

Cluster nodes can be fenced/rebooted before kdump has time to complete. In clustered environments it is generally necessary to configure additional time for kdump to complete before fencing. For clusters running the Red Hat High Availability or Resilient Storage Add-Ons, RHEL Advanced Platform Cluster, or Red Hat Cluster Suite, please refer to How do I configure kdump for use with the RHEL High Availability Add-On?


Testing

After making the above changes, reboot the system. The 128M of memory (starting 16M into physical memory) is left untouched by the normal system, reserved for the capture kernel. Note that the output of free -m will show 128M less memory than without this parameter, which is expected.

Now that the reserved memory region is set up, turn on the kdump init script and start the service:

#  chkconfig kdump on
#  service kdump start

This will create a /boot/initrd-kdump.img, leaving the system ready to capture a vmcore upon crashing. To test this, force a crash with sysrq:

Warning: This will panic your kernel, killing all services on the machine

# echo c > /proc/sysrq-trigger

(For more information about sysrq, refer to What is the SysRq facility and how do I use it?.)

This causes the kernel to panic, followed by the system restarting into the kdump kernel. When the boot process gets to the point where it starts the kdump service, the vmcore should be copied to the location specified in the /etc/kdump.conf file.

Note that in case of system hang, a vmcore may not be generated automatically. A system crash can be manually activated to produce a vmcore for analysis. You can force a crash by entering the command echo c > /proc/sysrq-trigger or by pressing the console key combination <ALT>+<SYSRQ>+C.
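Both the /proc/sysrq-trigger and keyboard methods require the SysRq facility to be enabled. A sketch of enabling it for the running system and persistently (the sysctl.conf edit follows the RHEL 5/6 convention; RHEL 7+ can use a drop-in under /etc/sysctl.d/ instead):

```shell
# Enable the magic SysRq key now and across reboots.
sysctl -w kernel.sysrq=1 2>/dev/null || echo 1 > /proc/sys/kernel/sysrq
grep -q '^kernel.sysrq' /etc/sysctl.conf 2>/dev/null \
    || echo 'kernel.sysrq = 1' >> /etc/sysctl.conf
```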


Time required to capture vmcore

Dumping time depends on the options that are used for its configuration. Refer to How to determine the time required for dumping a vmcore file with kdump?

Controlling when kdump is activated

There are several parameters that control under which circumstances kdump is activated. kdump can be activated when

  • a system hang is detected through the Non-Maskable Interrupt (NMI) Watchdog mechanism.
    This mechanism is enabled through the nmi_watchdog=1 kernel parameter. Refer to What is NMI and what can I use it for? for details.
  • a hardware NMI button is pressed.
    This mechanism is enabled by setting the sysctl kernel.unknown_nmi_panic=1 .
  • an "unrecovered" NMI has occurred.
    This mechanism is enabled by setting the sysctl kernel.panic_on_unrecovered_nmi=1 . The following kernel warning messages are associated with "unrecovered" NMIs:
Uhhuh. NMI received for unknown reason *hexnumber* on CPU *CPUnumber*.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
  • the out-of-memory killer (oom-killer) would otherwise be triggered.
    This can be configured by setting the sysctl vm.panic_on_oom=1 .

Under many circumstances it is advisable to enable multiple tunables from the above list. For example, for hang events it is advisable to enable kernel.unknown_nmi_panic, kernel.softlockup_panic, and also nmi_watchdog=1. This increases the likelihood that a vmcore will result from an event that an administrator is not directly monitoring at the time. Please note, however, that Red Hat Enterprise Linux systems residing on z/VM hosts do not have the facility to panic when an NMI is sent to them. Please refer to this solution for more information.
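Persisting the tunables mentioned above might look like the following sketch (RHEL 5/6-style /etc/sysctl.conf; enable only the triggers appropriate for your environment, and note that nmi_watchdog=1 is a kernel boot parameter, not a sysctl):

```shell
# Persist the crash-on-hang tunables across reboots.
cat >> /etc/sysctl.conf <<'EOF'
kernel.unknown_nmi_panic = 1
kernel.panic_on_unrecovered_nmi = 1
kernel.softlockup_panic = 1
vm.panic_on_oom = 1
EOF
# Apply immediately; on a restricted system this may have to wait for a reboot.
sysctl -p 2>/dev/null || echo "settings will apply at next boot"
```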

Reducing the size of the vmcore when uploading to Red Hat Support

In most cases a vmcore analysis requires only the critical kernel pages within a vmcore, so the remaining pages can be filtered out to reduce the size of the file for faster uploading to Red Hat support. If the vmcore file is very large and has not already had all non-critical pages filtered out (i.e., kdump did not use makedumpfile -d 31), use the following command to filter out the pages and upload the output file for analysis.

# makedumpfile -c -d 31 <vmcore> <output file>

Keep the original vmcore file saved in case the analysis requires some of the filtered pages; in that case the full vmcore may need to be uploaded.


Comments

Console frame-buffers and X are not properly supported. On a system typically run with something like "vga=791" in the kernel config line or with X running, console video will be garbled when a kernel is booted via kexec. The kdump kernel should still be able to capture a dump, and when the system reboots, video should be restored to normal.

debug_mem_level is a new parameter in RHEL 6.3; it turns on debug/verbose output from the kdump scripts regarding free/used memory at various points of execution. A higher level means more debugging output.

If unable to obtain a kernel dump but the machine can be rebooted, consider checking the system's RAM.

Diagnostic Steps

  • If you are dumping to local storage and use the hpsa storage module, you may run into difficulty capturing a core. In that event, please ensure you are on the latest kexec-tools package.
  • To output a list of configured dump locations, run the following egrep command:

    egrep "path|raw|nfs|ssh|ext4|ext3|ext2|minix|btrfs|xfs|auto" /etc/kdump.conf
    

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

User Comments

XEN-kernel:

title Red Hat Enterprise Linux Server (2.6.18-194.el5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-194.el5 dom0_mem=2048M crashkernel=128M@16M
        module /vmlinuz-2.6.18-194.el5xen ro root=/dev/VolGroupXen03/LogVolXen0301 rhgb quiet
        module /initrd-2.6.18-194.el5xen.img

Keep in mind that using the -d and -c options will marginally increase the amount of time required to gather a core.

This is frequently not the case.  Storing a page on disk or sending it over the network usually takes significantly longer than determining that the page should not be saved.  I am not sure about compression.

Using -d 31 may significantly reduce the amount of time required to gather a core.

To  do selective dumps, you need the corresponding kernel debuginfo.

From /usr/share/doc/kexec-tools-1.102pre/kexec-kdump-howto.txt

A typical setup is 'core_collector makedumpfile -c', but check the output of '/sbin/makedumpfile --help' for a list of all available options (-i and -g don't need to be specified; they're automatically taken care of). Note that use of makedumpfile requires that the kernel-debuginfo package corresponding with your running kernel be installed.

When dumping a vmcore to a network device using NFS, make sure of the following:

1. The "vmcores" subdirectory has the proper permissions on the NFS server.

2. It has the correct export options. I suggest using rw,sync,no_all_squash. If there is a permission issue starting kdump, allow the whole subnet. As an example, it will look like:

cat /etc/exports
/export/vmcores 192.168.1.0/24(rw,sync,no_all_squash)

3. The vmcore file will be saved inside the /export/vmcores/var/crash directory.

Does it also require ulimit -c to be set reasonably large, like "unlimited", for kdump to work?

No, this is not required for kdump.  The ulimit -c value is specific to application cores and does not affect kdump capturing vmcores.

Kdump on a Xen server seems not to work with the standard makedumpfile args ... It has been demonstrated that adding the -E option and removing any other flags is necessary for kdump to succeed. This is not documented and was found by our TAM in a non-published doc - Steve Vik

Yes, it would be nice to patch this into the howto. To start working around this issue I recommend reading "makedumpfile --help".

This is a very helpful doc, thanks!

Add multipath device and multipath device with blacklist in RHEL6.

Once the server is rebooted after a crash, do we need to reboot again into the original kernel, or would it be running on the crash kernel in the production environment?

I would say it will reboot to the kernel it booted from previously.
Did you install a new kernel and reboot into the new one when it crashed?
If so, you can intercept the boot process and switch to the old one if it is a recurring crash.

cat /etc/fstab | grep -i crash
cat /etc/fstab
/dev/rootvg/varcrashlv /var/crash ext3 defaults 1 2

df -h /var/crash
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/rootvg-varcrashlv
8.1G 138M 7.5G 2% /var/crash
Which name do I use in /etc/kdump.conf? I have this at present: ext3 /dev/mapper/rootvg-varcrashlv

For a system with 64 GB of memory, do I need to go beyond crashkernel=128M@16M? Is it possible to do a crash analysis on a Fedora machine, or do I have to use RHEL to get the debug kernel RPMs installed?

This article doesn't really cover doing your own crash analysis, but to analyze a RHEL vmcore on Fedora you would need to extract the necessary file(s) from the matching RHEL kernel-debuginfo RPM and tell the crash program where to find it.
