Chapter 7. Keeping kernel panic parameters disabled in virtualized environments
When configuring a virtualized environment in RHEL 9, you should not enable the
softlockup_panic and nmi_watchdog kernel parameters, because a virtualized environment can trigger a spurious soft lockup that should not require a system panic.
The following sections explain the reasons behind this advice by describing:
- What causes a soft lockup.
- Which kernel parameters control a system’s behavior on a soft lockup.
- How soft lockups may be triggered in a virtualized environment.
7.1. What is a soft lockup
A soft lockup is a situation, usually caused by a bug, in which a task executes in kernel space on a CPU without rescheduling. The task also prevents any other task from executing on that particular CPU. As a result, a warning is displayed to the user through the system console. This problem is also referred to as the soft lockup firing.
7.2. Parameters controlling kernel panic
The following kernel parameters can be set to control a system’s behavior when a soft lockup is detected.
softlockup_panic
Controls whether or not the kernel will panic when a soft lockup is detected.
Type     Value  Effect
Integer  0      kernel does not panic on soft lockup
Integer  1      kernel panics on soft lockup
By default, on RHEL 8 this value is 0.
In order to panic, the system needs to detect a hard lockup first. The detection is controlled by the nmi_watchdog parameter.
nmi_watchdog
Controls whether lockup detection mechanisms (watchdogs) are active or not. This parameter is of integer type.
Value  Effect
0      disables lockup detector
1      enables lockup detector
The hard lockup detector monitors each CPU for its ability to respond to interrupts.
watchdog_thresh
Controls the frequency of the watchdog hrtimer, NMI events, and the soft and hard lockup thresholds.
Default threshold  Soft lockup threshold
10 seconds         2 * watchdog_thresh
Setting this parameter to zero disables lockup detection altogether.
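To see how a given system is currently configured, the parameters above can be read from procfs. The following is a minimal sketch, not an official procedure. It assumes the standard sysctl names kernel.softlockup_panic, kernel.nmi_watchdog, and kernel.watchdog_thresh, and that the soft lockup threshold is twice watchdog_thresh. The PROC_DIR variable and the helper function names are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Sketch: report the lockup-related kernel parameters.
# PROC_DIR is a hypothetical override; it defaults to the real procfs location.
PROC_DIR="${PROC_DIR:-/proc/sys/kernel}"

# Print the value of one parameter, or "unavailable" if it cannot be read
# (for example, when the kernel was built without that detector).
read_param() {
    if [ -r "$PROC_DIR/$1" ]; then
        cat "$PROC_DIR/$1"
    else
        echo "unavailable"
    fi
}

# Assumption: the soft lockup threshold is 2 * watchdog_thresh, in seconds.
soft_lockup_threshold() {
    echo $(( 2 * $1 ))
}

for p in softlockup_panic nmi_watchdog watchdog_thresh; do
    echo "$p = $(read_param "$p")"
done

thresh=$(read_param watchdog_thresh)
case "$thresh" in
    ''|*[!0-9]*) echo "soft lockup threshold: unknown" ;;
    0)           echo "lockup detection disabled" ;;
    *)           echo "soft lockup threshold: $(soft_lockup_threshold "$thresh") seconds" ;;
esac
```

Reading the values does not require root privileges. The same values can also be queried with sysctl, for example sysctl kernel.softlockup_panic.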
7.3. Spurious soft lockups in virtualized environments
A soft lockup firing on a physical host, as described in What is a soft lockup, usually represents a kernel or hardware bug. The same phenomenon on a guest operating system in a virtualized environment may be a false warning.
A heavy workload on a host, or high contention over a specific resource such as memory, usually causes a spurious soft lockup firing. This is because the host may schedule out the guest CPU for a period longer than 20 seconds. When the guest CPU is scheduled to run on the host again, it experiences a time jump, which triggers all timers that became due in the meantime. These timers include the watchdog hrtimer, which can consequently report a soft lockup on the guest CPU.
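The arithmetic behind the 20-second figure can be illustrated with a deliberately simplified model. It assumes the default watchdog_thresh of 10 seconds and a soft lockup threshold of 2 * watchdog_thresh. The function name may_report_spurious_lockup is hypothetical, and the real detector is more involved than a single comparison.

```shell
#!/bin/sh
# Simplified model: a guest CPU that was scheduled out for longer than the
# soft lockup threshold (2 * watchdog_thresh) may report a spurious soft
# lockup once it runs again.
may_report_spurious_lockup() {
    paused_seconds=$1
    watchdog_thresh=${2:-10}   # assumption: 10 seconds is the default
    # A threshold of zero means lockup detection is disabled altogether.
    [ "$watchdog_thresh" -gt 0 ] &&
        [ "$paused_seconds" -gt $(( 2 * watchdog_thresh )) ]
}

for paused in 5 20 25; do
    if may_report_spurious_lockup "$paused"; then
        echo "paused ${paused}s: spurious soft lockup possible"
    else
        echo "paused ${paused}s: within threshold"
    fi
done
```

With the default threshold, only the 25-second pause exceeds the 20-second limit. Raising watchdog_thresh widens the window but also delays detection of genuine lockups.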
Because a soft lockup in a virtualized environment may be spurious, you should not enable the kernel parameters that would cause a system panic when a soft lockup is reported on a guest CPU.
To understand soft lockups in guests, it is essential to know that the host schedules the guest as a task, and the guest then schedules its own tasks.
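One way to keep these parameters disabled persistently on a guest is a sysctl drop-in file. The sketch below only prints such a fragment. The function name guest_sysctl_fragment and the file name in the comments are hypothetical, and the keys assume the standard kernel.softlockup_panic and kernel.nmi_watchdog sysctl names.

```shell
#!/bin/sh
# Print a sysctl fragment that keeps the panic-related parameters disabled.
guest_sysctl_fragment() {
    cat <<'EOF'
# Do not panic the guest on a (possibly spurious) soft lockup.
kernel.softlockup_panic = 0
kernel.nmi_watchdog = 0
EOF
}

guest_sysctl_fragment

# To apply the settings as root, one option is:
#   guest_sysctl_fragment > /etc/sysctl.d/99-guest-lockup.conf   # hypothetical file name
#   sysctl --system
```

Printing the fragment rather than writing it directly makes it possible to review the settings before they are installed.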