Kdump Generates Incomplete Vmcore Via NMI Panic Due to "Reboot System on NMI" Setting

Solution Verified

Environment

  • Red Hat Enterprise Linux
  • IBM/Lenovo Systems
  • XClarity Controller (XCC)
  • Integrated Management Module (IMM)

Issue

  • A full vmcore can be produced manually via a SysRq panic, but not via a Non-Maskable Interrupt (NMI) panic.
  • Either no vmcore file or only a vmcore-incomplete file is written when an NMI panic is issued to the system from the XClarity Controller.

Resolution

Disable the "Reboot System on NMI" or "Auto reboot on NMI" setting from the F1 system setup menu, or from XCC or IMM.

From the Home tab of XClarity Controller (UI)

NOTE: This may require a reboot in order to access the machine's system setup menu during boot. Please engage your hardware vendor if assistance is needed accessing the machine's F1 system setup menu.

  1. Go to Quick Actions > Power Action > Select Boot Server to System Setup.

  2. Select Ok when prompted with the "Do you want to boot server to system setup? If the OS fails to restart, try the Restart Server Immediately action." message.

  3. Select F1:System Setup and wait while the system boots to the setup menu.

  4. Select System Settings > Recovery and RAS > System Recovery.

  5. Arrow down to Reboot System on NMI > hit enter and select Disabled.

  6. Esc out to main menu > Select Save Settings.

  7. Once the settings have finished saving, select Exit Setup Utility > Select Y to completely exit the setup utility.

F1 System Setup Menu of Lenovo's XClarity Controller With Setting Disabled
System Recovery Menu With Setting Disabled

From the XClarity Controller (XCC) (Command Line)

From the XCC, the current state of Reboot System on NMI can be checked as follows, without rebooting the system:

  1. SSH to the XClarity Controller.

    $ ssh <USERNAME>@<IP_ADDRESS_OF_XCLARITY_CONTROLLER>
    
  2. Use the native asu (Advanced Settings Utility) command to query the setting.

    system> asu show SystemRecovery.RebootSystemOnNMI
    SystemRecovery.RebootSystemOnNMI=Enabled           <--------- current state of setting
    ok
    

It should be noted that Enable/Enabled is the default setting for the RebootSystemOnNMI parameter for these IBM/Lenovo machines.
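
When the setting has to be checked on several machines, the `asu show` output can be evaluated with a small shell helper. The following is a minimal sketch, not a vendor-supported tool: the function names `nmi_reboot_state` and `nmi_reboot_is_disabled` are hypothetical, and the parsing assumes only the output format shown above (a `RebootSystemOnNMI=<value>` line, possibly followed by annotation text).

```shell
#!/bin/sh
# Hypothetical helper: reads the output of
# `asu show SystemRecovery.RebootSystemOnNMI` on stdin and prints
# the value that follows the '=' sign (for example "Enabled").
nmi_reboot_state() {
    # Accept "On" and "on" in the parameter name, since the spelling
    # varies between the controller outputs quoted in this article.
    sed -n 's/^.*RebootSystem[Oo]nNMI=\([A-Za-z]*\).*$/\1/p' | head -n 1
}

# Hypothetical helper: succeeds (exit 0) only when the setting is off.
nmi_reboot_is_disabled() {
    case "$(nmi_reboot_state)" in
        Disable|Disabled) return 0 ;;   # both spellings appear in this article
        *) return 1 ;;
    esac
}
```

For example, if the controller output was saved to a file named `asu.out` (a hypothetical name), `nmi_reboot_state < asu.out` prints the current value, and `nmi_reboot_is_disabled < asu.out` can be used in an `if` statement.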

Alternatively, from the XCC, the current state of Reboot System on NMI can be changed as follows:

  1. Use the native asu command to change the setting.

    system> asu set SystemRecovery.RebootSystemOnNMI Disabled
    Executed one commands successfully.
    ok
    
  2. Confirm the parameter has been updated.

    system> asu show SystemRecovery.RebootSystemOnNMI
    SystemRecovery.RebootSystemOnNMI=Disabled
    ok
    
  3. This modified UEFI setting should take effect immediately with no reboot needed.

Once the above steps have been completed, an NMI can be sent from the controller with the command below:

system> reset -nmi

NOTE: If Kdump and NMI panic tunables are configured in the RHEL OS, the above command will trigger a kernel panic followed by a vmcore dump.
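
Before sending the NMI, it can help to confirm that the NMI panic tunables are set in the operating system. The sketch below is one hedged way to do that. The function name `nmi_tunables_ok` is hypothetical, and the list of tunables is taken from the Diagnostic Steps section of this article rather than from an authoritative list of requirements. The function reads `sysctl -a` style output on standard input, so it can also be run against saved output.

```shell
#!/bin/sh
# Hypothetical helper: reads "name = value" lines (as printed by
# `sysctl -a`) on stdin and reports every NMI panic tunable that is
# missing or not set to 1. Exit status is 0 only if all are set to 1.
nmi_tunables_ok() {
    awk -F' = ' '
        BEGIN {
            # Tunables shown in the Diagnostic Steps of this article.
            want["kernel.panic_on_io_nmi"] = 1
            want["kernel.panic_on_unrecovered_nmi"] = 1
            want["kernel.unknown_nmi_panic"] = 1
        }
        ($1 in want) { seen[$1] = $2 }
        END {
            bad = 0
            for (name in want) {
                if (seen[name] != 1) {
                    printf "%s is not set to 1\n", name
                    bad = 1
                }
            }
            exit bad
        }
    '
}
```

On the RHEL system this could be combined with a kdump check, for example `sysctl -a 2>/dev/null | nmi_tunables_ok && systemctl is-active kdump`.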


From the Integrated Management Module (IMM)

On IBM/Lenovo systems that use the older Integrated Management Module (IMM), the process is similar, except for the "Enable/Disable" parameter syntax.

On IMM controllers, the parameter value needed is "Disable" (not "Disabled" as needed on XCC controllers).

Below is an example of disabling Reboot System on NMI on an IMM controller:

system> asu set SystemRecovery.RebootSystemOnNMI Disable
UEFI.SystemRecovery.RebootSystemonNMI=Disable
Waiting for command completion status.
Command completed successfully.     
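
Because the accepted value differs between the two controller types, any wrapper script has to pick the right spelling. The following sketch prints the `asu set` command line for a given controller type. The function name `asu_disable_nmi_reboot_cmd` and the `xcc`/`imm` labels are hypothetical, and the value spellings are taken only from the examples in this article.

```shell
#!/bin/sh
# Hypothetical helper: print the controller command that disables
# "Reboot System on NMI" for the given controller type.
#   $1 = "xcc" (XClarity Controller) or "imm" (Integrated Management Module)
asu_disable_nmi_reboot_cmd() {
    case "$1" in
        xcc) value=Disabled ;;   # spelling used in the XCC example
        imm) value=Disable ;;    # spelling used in the IMM example
        *)
            echo "unknown controller type: $1" >&2
            return 2
            ;;
    esac
    printf 'asu set SystemRecovery.RebootSystemOnNMI %s\n' "$value"
}
```

The printed line is meant to be pasted at the controller's `system>` prompt; the sketch does not contact the controller itself.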

The information contained herein is just an example of what the XClarity Controller & Integrated Management Module may look like and is provided for convenience only.

IMPORTANT: This article is not an authoritative source of information on IBM, Lenovo, Lenovo's XClarity Controller, or Lenovo's Integrated Management Module for configuring the "Reboot System on NMI" or "Auto reboot on NMI" setting to work with an NMI-triggered panic. The XClarity Controller, IMM, and IBM/Lenovo products are not shipped by Red Hat.

Please contact your hardware vendor for assistance and product documentation.

Additional documentation can be found using the links below:
ThinkSystem server UEFI Parameter Reference Guide - SR645/SR665 - Chapter 2 - pg.63
UEFI Manual for ThinkSystem Server with Intel Xeon 6 Processors - Chapter 4 - pg.51
ThinkSystem Server with Intel Xeon SP (3rd Gen) - Chapter 3 - pg.51
ThinkSystem server with Intel Xeon SP (1st, 2nd Gen) - Chapter 3 - pg.39
Lenovo System x3550 M4 Installation and Service Guide - Chapter 3 - pg.100
Lenovo System x3550 M5 Installation and Service Guide - Chapter 2 - pg.38
System x3850 X6 and x3950 X6 Installation and Service Guide - Chapter 3 - pg.126
Lenovo Integrated Management Module User Guide

Root Cause

The "Reboot System on NMI" or "Auto reboot on NMI" setting in the F1 system setup menu determines whether the server is rebooted after an NMI is issued. When this setting is enabled (the default state), the reboot interrupts the kdump process while it is collecting and saving the vmcore file.

ThinkSystem server UEFI Parameter Reference Guide
Lenovo System Recovery Default Settings

Diagnostic Steps

  1. Confirm the system information.

    # dmidecode | grep 'System Info' -A2
    System Information
            Manufacturer: Lenovo
            Product Name: ThinkSystem SR850/SR860 V3 Main Board      <---------
    

    or

    # dmidecode | grep 'System Info' -A2
    System Information
            Manufacturer: Lenovo
            Product Name: ThinkSystem SR650 -[xxxxxxxxxx]-           <---------
    

    or

    # dmidecode | grep 'System Info' -A2
    System Information
        Manufacturer: IBM
        Product Name: IBM System x3550 M4 Server -[xxxxxxxxx]-           <---------
    
  2. Confirm the system is configured to panic on NMI.

    # sysctl -a | grep '_nmi'
    kernel.panic_on_io_nmi = 1
    kernel.panic_on_unrecovered_nmi = 1
    kernel.unknown_nmi_panic = 1
    
  3. Confirm the kdump service is enabled and active.

    # systemctl status kdump | head -3
    ● kdump.service - Crash recovery kernel arming
    Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
    Active: active (exited) since Thu 2024-06-06 13:58:15 EDT; 22s ago
    
  4. Locate the dump target location from /etc/kdump.conf.

    # grep -v ^# /etc/kdump.conf
    path /var/crash/
    core_collector makedumpfile -l --message-level 7 -d 31
    
  5. Test kdump:

    [5a] If an NMI is issued to panic the system while the hardware "Reboot System on NMI" setting is enabled, either no vmcore is generated or only a vmcore-incomplete file is generated.

    # tree /var/crash/
    /var/crash/
    └── 127.0.0.1-2024-06-06-14:21:38
        └── vmcore-dmesg.txt               <--------- NO VMCORE File Present
    
    # grep nmi_panic /var/crash/127.0.0.1-2024-06-06-14\:21\:38/vmcore-dmesg.txt
    [ 246.146851] nmi_panic.cold.11+0xc/0xc
    

    or

    # tree /var/crash/
    /var/crash/
    └── 127.0.0.1-2024-06-06-14:32:04
        ├── kexec-dmesg.log
        ├── vmcore-incomplete              <--------- VMCORE-INCOMPLETE File Present
        └── vmcore-dmesg.txt
    
    # grep nmi_panic /var/crash/127.0.0.1-2024-06-06-14\:32\:04/vmcore-dmesg.txt
    [ 246.147256] nmi_panic.cold.11+0xc/0xc
    

    [5b] If SysRq is issued to panic the system while the hardware "Reboot System on NMI" setting is enabled, a vmcore will be generated.

    # echo 1 > /proc/sys/kernel/sysrq
    # echo c > /proc/sysrq-trigger
    
    # tree /var/crash/
    /var/crash/
    └── 127.0.0.1-2024-06-06-14:00:04
        ├── kexec-dmesg.log
        ├── vmcore                         <--------- VMCORE File Present
        └── vmcore-dmesg.txt
    

    [5c] If NMI is issued to panic the system while the hardware "Reboot System on NMI" setting is disabled, a vmcore will be generated.

    # tree /var/crash/
    /var/crash/
    └── 127.0.0.1-2024-06-06-15:09:09
        ├── kexec-dmesg.log
        ├── vmcore                         <--------- VMCORE File Present
        └── vmcore-dmesg.txt
    
    # grep nmi_panic /var/crash/127.0.0.1-2024-06-06-15\:09\:09/vmcore-dmesg.txt
    [ 247.156871] nmi_panic.cold.11+0xc/0xc
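
The three outcomes in step 5 can also be told apart without reading `tree` output by hand. The sketch below classifies each dump directory by the file names seen in the listings above. The function names `classify_dump_dir` and `report_dumps` are hypothetical, and `/var/crash` is only the default because it is the `path` in the example kdump.conf; a different dump target would need to be passed as an argument.

```shell
#!/bin/sh
# Hypothetical helper: classify one kdump output directory by the
# file names shown in the listings of this article.
classify_dump_dir() {
    dir=$1
    if [ -s "$dir/vmcore" ]; then
        echo vmcore                # non-empty vmcore file present
    elif [ -e "$dir/vmcore-incomplete" ]; then
        echo vmcore-incomplete     # dump was interrupted before it finished
    else
        echo none                  # no vmcore file (or an empty one)
    fi
}

# Hypothetical helper: print one "<state> <directory>" line for every
# subdirectory of the dump path (default: /var/crash).
report_dumps() {
    base=${1:-/var/crash}
    for d in "$base"/*/; do
        [ -d "$d" ] || continue    # skip the literal pattern when nothing matches
        printf '%s %s\n' "$(classify_dump_dir "${d%/}")" "${d%/}"
    done
}
```

Running `report_dumps` after each test panic gives one line per dump directory, so a `none` or `vmcore-incomplete` entry for an NMI-triggered panic points back to the hardware setting described in this article.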
    

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
