Kdump Generates Incomplete Vmcore Via NMI Panic Due to "Reboot System on NMI" Setting
Environment
- Red Hat Enterprise Linux
- IBM/Lenovo Systems
- XClarity Controller (XCC)
- Integrated Management Module (IMM)
Issue
- Can only manually produce a full vmcore via a SysRq panic, but not via a Non-Maskable Interrupt (NMI) panic.
- No vmcore file or a vmcore-incomplete file is dumped when I issue an NMI panic to the system from the XClarity Controller.
Resolution
Disable the "Reboot System on NMI" or "Auto reboot on NMI" setting from the F1 system setup menu, or from XCC or IMM.
From the Home tab of XClarity Controller (UI)
NOTE: This may require a reboot in order to access the machine's system setup menu during boot. Please engage your hardware vendor if assistance is needed accessing the machine's F1 system setup menu.
- Go to Quick Actions > Power Action > Select Boot Server to System Setup.
- Select Ok when prompted by the "Do you want to boot server to system setup? If the OS fails to restart, try the Restart Server Immediately action." message.
- Select F1:System Setup and wait while the system boots to the setup menu.
- Select System Settings > Recovery and RAS > System Recovery.
- Arrow down to Reboot System on NMI > hit enter and select Disabled.
- Esc out to main menu > Select Save Settings.
- Once the settings have finished saving, select Exit Setup Utility > Select Y to completely exit the setup utility.
From the XClarity Controller (XCC) (Command Line)
From the XCC, the following is used to check the current state of Reboot System on NMI without requiring a reboot of the system:
- SSH to the XClarity Controller.
    $ ssh <USERNAME>@<IP_ADDRESS_OF_XCLARITY_CONTROLLER>
- Use the native asu (Advanced Settings Utility) command to query the setting.
    system> asu show SystemRecovery.RebootSystemOnNMI
    SystemRecovery.RebootSystemOnNMI=Enabled <--------- current state of setting
    ok
Note that Enable/Enabled is the default setting for the RebootSystemOnNMI parameter on these IBM/Lenovo machines.
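When the query output is saved or pasted into a script, the state can be pulled out mechanically. The following is a minimal sketch, not vendor tooling: the function names asu_nmi_reboot_state and nmi_reboot_needs_disabling are hypothetical, and it assumes the output contains a line of the form shown in this article (SystemRecovery.RebootSystemOnNMI=<value>).

```shell
#!/bin/sh
# Sketch only: parses text captured from an `asu show` session.
# Function names are hypothetical and not part of XCC or IMM.

# Read captured output on stdin and print the value after '='.
# Accepts both "RebootSystemOnNMI" and "RebootSystemonNMI" spellings,
# since both appear in the example outputs in this article.
asu_nmi_reboot_state() {
    sed -n 's/^.*SystemRecovery\.RebootSystem[Oo]nNMI=\([A-Za-z]*\).*$/\1/p' | head -n 1
}

# Return 0 if the captured output (stdin) shows the setting still enabled.
nmi_reboot_needs_disabling() {
    case "$(asu_nmi_reboot_state)" in
        Enable|Enabled) return 0 ;;
        *) return 1 ;;
    esac
}

# Example using the output line shown above.
state=$(printf '%s\n' 'SystemRecovery.RebootSystemOnNMI=Enabled' | asu_nmi_reboot_state)
echo "Reboot System on NMI: ${state:-unknown}"
```

The parser keeps only the alphabetic value, so trailing annotations or prompts on the same line are ignored.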
Alternatively, from the XCC, the following is used to change the current state of Reboot System on NMI:
- Use the native asu command to change the setting.
    system> asu set SystemRecovery.RebootSystemOnNMI Disabled
    Executed one commands successfully.
    ok
- Confirm the parameter has been updated.
    system> asu show SystemRecovery.RebootSystemOnNMI
    SystemRecovery.RebootSystemOnNMI=Disabled
    ok
- This modified UEFI setting should take effect immediately with no reboot needed.
Once the above steps have been completed, an NMI can be sent via the controller with the command below:
system> reset -nmi
NOTE: If Kdump and NMI panic tunables are configured in the RHEL OS, the above command will trigger a kernel panic followed by a vmcore dump.
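Before sending the NMI, it can help to confirm the OS-side tunables mentioned in the note. The sketch below reads `sysctl -a` style text on stdin and reports any of the three tunables listed under Diagnostic Steps that are not set to 1. The function name check_nmi_panic_tunables is hypothetical, and treating all three as required is an assumption taken from the example output in this article; which tunable applies depends on how the hardware delivers the NMI.

```shell
#!/bin/sh
# Sketch only: verifies the NMI panic tunables shown in this article.
# Reads "name = value" lines (as printed by `sysctl -a`) from stdin.
# Prints one line per tunable that is unset or not 1.
# Returns 0 only when all three are 1.
check_nmi_panic_tunables() {
    input=$(cat)
    rc=0
    for key in kernel.panic_on_io_nmi \
               kernel.panic_on_unrecovered_nmi \
               kernel.unknown_nmi_panic; do
        # Split on '=' with optional surrounding spaces; exact match on the name.
        value=$(printf '%s\n' "$input" |
            awk -F' *= *' -v k="$key" '$1 == k { print $2; exit }')
        if [ "$value" != "1" ]; then
            echo "$key is ${value:-unset}"
            rc=1
        fi
    done
    return $rc
}

# Example with sample values. On a live system the input would come from:
#   sysctl -a 2>/dev/null | check_nmi_panic_tunables
if printf '%s\n' \
    'kernel.panic_on_io_nmi = 1' \
    'kernel.panic_on_unrecovered_nmi = 0' \
    'kernel.unknown_nmi_panic = 1' | check_nmi_panic_tunables
then
    echo "NMI panic tunables are all set to 1"
else
    echo "NMI panic tunables need attention"
fi
```

Reading from stdin keeps the check usable against a saved sosreport or pasted output as well as a running system.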
From the Integrated Management Module (IMM)
On IBM/Lenovo systems that use the older Integrated Management Module (IMM), the process is similar, except for the "Enable/Disable" parameter syntax.
On IMM controllers, the parameter value needed is "Disable" (not "Disabled" as needed on XCC controllers).
Below is an example of disabling Reboot System on NMI on an IMM controller:
system> asu set SystemRecovery.RebootSystemOnNMI Disable
UEFI.SystemRecovery.RebootSystemonNMI=Disable
Waiting for command completion status.
Command completed successfully.
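Because the only difference between the two controllers is the value string, a runbook script can select it from the controller type. This is a minimal sketch: the function name asu_disable_command and the type labels "xcc" and "imm" are hypothetical, and the function only prints the command text shown in this article rather than running it.

```shell
#!/bin/sh
# Sketch only: prints the asu command text for a given controller type.
# "xcc" and "imm" are labels chosen for this example, not controller output.
asu_disable_command() {
    case "$1" in
        xcc) value=Disabled ;;   # XCC expects "Disabled"
        imm) value=Disable ;;    # IMM expects "Disable"
        *)
            echo "unknown controller type: $1" >&2
            return 1
            ;;
    esac
    echo "asu set SystemRecovery.RebootSystemOnNMI $value"
}

# Example: command text for an IMM controller.
asu_disable_command imm
```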
The information contained herein is just an example of what the XClarity Controller & Integrated Management Module may look like and is provided for convenience only.
IMPORTANT: This article is not an authoritative source of information on IBM, Lenovo, Lenovo's XClarity Controller, or Lenovo's Integrated Management Module for configuring the "Reboot System on NMI" or "Auto reboot on NMI" setting to work with an NMI-triggered panic. The XClarity Controller, IMM, and IBM/Lenovo products are not shipped by Red Hat.
Please contact your hardware vendor for assistance and product documentation.
Additional documentation can be found in the references below:
ThinkSystem server UEFI Parameter Reference Guide - SR645/SR665 - Chapter 2 - pg.63
UEFI Manual for ThinkSystem Server with Intel Xeon 6 Processors - Chapter 4 - pg.51
ThinkSystem Server with Intel Xeon SP (3rd Gen) - Chapter 3 - pg.51
ThinkSystem server with Intel Xeon SP (1st, 2nd Gen) - Chapter 3 - pg.39
Lenovo System x3550 M4 Installation and Service Guide - Chapter 3 - pg.100
Lenovo System x3550 M5 Installation and Service Guide - Chapter 2 - pg.38
System x3850 X6 and x3950 X6 Installation and Service Guide - Chapter 3 - pg.126
Lenovo Integrated Management Module User Guide
Root Cause
The "Reboot System on NMI" or "Auto reboot on NMI" setting from the F1 system setup menu determines whether or not the system will reboot the server after the NMI signal is issued. With this setting enabled (the default state), the resulting reboot interrupts the kdump process while it is collecting and saving the vmcore file.
Diagnostic Steps
1. Confirm the system information.
    # dmidecode | grep 'System Info' -A2
    System Information
        Manufacturer: Lenovo
        Product Name: ThinkSystem SR850/SR860 V3 Main Board <---------
   or
    # dmidecode | grep 'System Info' -A2
    System Information
        Manufacturer: Lenovo
        Product Name: ThinkSystem SR650 -[xxxxxxxxxx]- <---------
   or
    # dmidecode | grep 'System Info' -A2
    System Information
        Manufacturer: IBM
        Product Name: IBM System x3550 M4 Server -[xxxxxxxxx]- <---------
2. Confirm the system is configured to panic on NMI.
    # sysctl -a | grep '_nmi'
    kernel.panic_on_io_nmi = 1
    kernel.panic_on_unrecovered_nmi = 1
    kernel.unknown_nmi_panic = 1
3. Confirm the kdump service is enabled and active.
    # systemctl status kdump | head -3
    ● kdump.service - Crash recovery kernel arming
       Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
       Active: active (exited) since Thu 2024-06-06 13:58:15 EDT; 22s ago
4. Locate the dump target location from /etc/kdump.conf.
    # grep -v ^# /etc/kdump.conf
    path /var/crash/
    core_collector makedumpfile -l --message-level 7 -d 31
5. Test kdump:
   [5a] If NMI is issued to panic the system while the hardware "Reboot System on NMI" setting is enabled, a vmcore will not be generated, or a vmcore-incomplete file is generated.
    # tree /var/crash/
    /var/crash/
    └── 127.0.0.1-2024-06-06-14:21:38
        └── vmcore-dmesg.txt <--------- NO VMCORE File Present
    # grep nmi_panic /var/crash/127.0.0.1-2024-06-06-14\:21\:38/vmcore-dmesg.txt
    [ 246.146851] nmi_panic.cold.11+0xc/0xc
   or
    # tree /var/crash/
    /var/crash/
    └── 127.0.0.1-2024-06-06-14:32:04
        ├── kexec-dmesg.log
        ├── vmcore-incomplete <--------- VMCORE-INCOMPLETE File Present
        └── vmcore-dmesg.txt
    # grep nmi_panic /var/crash/127.0.0.1-2024-06-06-14\:32\:04/vmcore-dmesg.txt
    [ 246.147256] nmi_panic.cold.11+0xc/0xc
   [5b] If SysRq is issued to panic the system while the hardware "Reboot System on NMI" setting is enabled, a vmcore will be generated.
    # echo 1 > /proc/sys/kernel/sysrq
    # echo c > /proc/sysrq-trigger
    # tree /var/crash/
    /var/crash/
    └── 127.0.0.1-2024-06-06-14:00:04
        ├── kexec-dmesg.log
        ├── vmcore <--------- VMCORE File Present
        └── vmcore-dmesg.txt
   [5c] If NMI is issued to panic the system while the hardware "Reboot System on NMI" setting is disabled, a vmcore will be generated.
    # tree /var/crash/
    /var/crash/
    └── 127.0.0.1-2024-06-06-15:09:09
        ├── kexec-dmesg.log
        ├── vmcore <--------- VMCORE File Present
        └── vmcore-dmesg.txt
    # grep nmi_panic /var/crash/127.0.0.1-2024-06-06-15\:09\:09/vmcore-dmesg.txt
    [ 247.156871] nmi_panic.cold.11+0xc/0xc
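The three outcomes in the kdump test (vmcore present, vmcore-incomplete present, or neither) can be told apart by looking at the files in each dump directory. The following is a minimal sketch with hypothetical function names (classify_crash_dir, report_crash_dirs). It assumes a local dump path such as the /var/crash/ shown above and the file names used in this article; other dump targets may name files differently.

```shell
#!/bin/sh
# Sketch only: classifies kdump output directories by the dump file present.

# Print "complete", "incomplete", or "missing" for one dump directory.
classify_crash_dir() {
    if [ -f "$1/vmcore" ]; then
        echo complete
    elif [ -f "$1/vmcore-incomplete" ]; then
        echo incomplete
    else
        echo missing
    fi
}

# Print one "<status> <directory>" line per subdirectory of the dump path.
# The path defaults to /var/crash when no argument is given.
report_crash_dirs() {
    base=${1:-/var/crash}
    for dir in "$base"/*/; do
        # Skip the unexpanded pattern when the path has no subdirectories.
        [ -d "$dir" ] || continue
        printf '%s %s\n' "$(classify_crash_dir "$dir")" "${dir%/}"
    done
}

# Example: report on the default path, or on a path passed as the first argument.
report_crash_dirs "${1:-/var/crash}"
```

A directory reported as incomplete or missing after an NMI-triggered panic, alongside a complete one after a SysRq panic, matches the behavior described in [5a] and [5b].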
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.