System panics due to an NMI hard lockup in RHEL 7
Environment
- Red Hat Enterprise Linux (RHEL) 7
- seen on several versions of the RHEL7 kernel (3.10.0-version.el7.x86_64)
- the
/proc/sys/kernel/watchdog_thresh
parameter is set to a higher value than the default - Docker
The issue that is described in detail under Root Cause is specific to the RHEL7 kernel; it does not exist in the RHEL6 kernel or earlier versions of the Red Hat Enterprise Linux kernel.
Issue
- The following panic was seen in a Docker environment.
[83076.933317] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 4
[83077.019826] CPU: 4 PID: 3487 Comm: docker Tainted: GF IO-------------- 3.10.0-210.el7.x86_64 #1
[83077.134427] Hardware name: HP ProLiant DL160 G6 , BIOS O33 08/16/2010
[83077.212602] ffffffff8182aef0 000000008d81c4ac ffff88033fc06c48 ffffffff816038f7
[83077.301624] ffff88033fc06cc8 ffffffff815fd14b 0000000000000010 ffff88033fc06cd8
[83077.390629] ffff88033fc06c78 000000008d81c4ac ffff88033fc06c88 0000000000000004
[83077.479636] Call Trace:
[83077.508892] <NMI> [<ffffffff816038f7>] dump_stack+0x19/0x1b
[83077.577827] [<ffffffff815fd14b>] panic+0xd8/0x1e7
[83077.635184] [<ffffffff8110a7b0>] ? watchdog_enable_all_cpus.part.2+0x40/0x40
[83077.720642] [<ffffffff8110a872>] watchdog_overflow_callback+0xc2/0xd0
[83077.798820] [<ffffffff8114c971>] __perf_event_overflow+0xa1/0x250
[83077.872836] [<ffffffff8114d474>] perf_event_overflow+0x14/0x20
[83077.943724] [<ffffffff8103022d>] intel_pmu_handle_irq+0x1fd/0x410
[83078.017734] [<ffffffff811906a1>] ? unmap_kernel_range_noflush+0x11/0x20
[83078.097992] [<ffffffff81372134>] ? ghes_copy_tofrom_phys+0x124/0x210
[83078.175125] [<ffffffff8160cb4b>] perf_event_nmi_handler+0x2b/0x50
[83078.249134] [<ffffffff8160c299>] nmi_handle.isra.0+0x69/0xb0
[83078.317942] [<ffffffff8160c3b0>] do_nmi+0xd0/0x340
[83078.376337] [<ffffffff8160b6f1>] end_repeat_nmi+0x1e/0x2e
- Kernel Panic under I/O
time dd if=/dev/zero of=/xx/yy/zzz bs=AA count=BBBBBBB
Above command is able to crash the system.
Resolution
Red Hat Enterprise Linux 7.1: Fixed in kernel-3.10.0-229.11.1.el7 and later.
Red Hat Enterprise Linux 7.2: Fixed in kernel-3.10.0-327.el7 and later.
To work around the issue on previous kernels, avoid setting the /proc/sys/kernel/watchdog_thresh
parameter to a higher value than the default if feasible.
Root Cause
The kernel functions that are involved in the scenario leading to the hard lockup panic are:
watchdog_timer_fn
This is a high-resolution timer function that updates a CPU-specific interrupt counter periodically while executing in interrupt context.
watchdog_overflow_callback
This function is invoked from Non-Maskable Interrupt (NMI) context. If a CPU is busy, this function executes periodically and it checks whether watchdog_timer_fn
has incremented the CPU-specific counter during the past interval. If the counter has not been incremented, watchdog_overflow_callback
assumes that the CPU is 'locked up' in a section of kernel code where interrupts are disabled, and a panic occurs unless 'panic on hard lockup' is explicitly disabled via the nmi_watchdog=nopanic
parameter on the kernel command line.
For this watchdog mechanism it is essential that watchdog_timer_fn
is invoked at a significantly higher rate than watchdog_overflow_callback
, so that watchdog_timer_fn
gets a chance to increment the CPU-specific interrupt counter at least once during watchdog_overflow_callback
's interval for example, as shown in the following diagram:
watchdog_timer_fn executes on CPU N approx. at t1, t2, t3, etc.
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|-> time
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16
|---------|---------|---------|---------|---------|---------|-> time
o1 o2 o3 o4 o5 o6 o7
watchdog_overflow_callback executes on CPU N approx. at o1, o2, o3, etc.
The interval for watchdog_timer_fn
and watchdog_overflow_callback
is derived from the /proc/sys/kernel/watchdog_thresh
parameter. The value of this parameter is 10 by default. Due to a flaw in the watchdog code, only the interval for watchdog_timer_fn
is adjusted if the parameter value is changed (for example, via an entry in sysctl.conf
). If the new value is higher than the default, it is possible that watchdog_overflow_callback
will execute at a higher rate than watchdog_timer_fn
on a busy CPU, and thus watchdog_overflow_callback
will not observe an increase of the CPU-specific interrupt counter, as shown in the following diagram.
watchdog_timer_fn executes on CPU N approx. at t1, t2, t3, etc.
|-----------------------|-----------------------|->
t1 t2 t3
no counter increase
|---------|---------|---------|---------|---------|---------|-> time
o1 o2 o3 o4 o5 o6 o7
watchdog_overflow_callback executes on CPU N approx. at o1, o2, o3, etc.
Hence, it is possible that the hard lockup detector yields a false positive: A panic will occur at o2 even though the CPU is actually not 'locked up'.
Diagnostic Steps
The following vmcore analysis shows an example of a false positive that occurred on a machine with 2.40 GHz CPU clock speed, and /proc/sys/kernel/watchdog_thresh
set to the value 60.
- Bit 9 (IF) is set in the
RFLAGS
register which means that interrupts are enabled. The content of theCS
(Code Segment) andSS
(Stack Segment) register indicates that the NMI even occurred while the processor was executing 'user space' code.
crash> bt
PID: 3487 TASK: ffff8802337916c0 CPU: 4 COMMAND: "docker"
#0 [ffff88033fc06b20] machine_kexec at ffffffff8104c4fb
#1 [ffff88033fc06b80] crash_kexec at ffffffff810e1fb2
#2 [ffff88033fc06c50] panic at ffffffff815fd152
#3 [ffff88033fc06cd0] watchdog_overflow_callback at ffffffff8110a872
#4 [ffff88033fc06ce0] __perf_event_overflow at ffffffff8114c971
#5 [ffff88033fc06d58] perf_event_overflow at ffffffff8114d474
#6 [ffff88033fc06d68] intel_pmu_handle_irq at ffffffff8103022d
#7 [ffff88033fc06e60] perf_event_nmi_handler at ffffffff8160cb4b
#8 [ffff88033fc06e80] nmi_handle at ffffffff8160c299
#9 [ffff88033fc06ec8] do_nmi at ffffffff8160c3b0
#10 [ffff88033fc06ef0] end_repeat_nmi at ffffffff8160b6f1
RIP: 000000000058a551 RSP: 000000c20856c5d0 RFLAGS: 00000202
RAX: 000000c20816c670 RBX: 0000000000000019 RCX: 0000000000000000
RDX: 000000c20816c670 RSI: 000000c208045538 RDI: 0000000000000000
RBP: 0000000000000003 R8: 0000000000000017 R9: 0000000000000000
R10: 0000000000000006 R11: 000000c20823b3cb R12: 0000000000000000
R13: 0000000000000186 R14: 0000000000000000 R15: 000000c208356da0
ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
--- <NMI exception stack> ---
- The interval for
watchdog_overflow_callback
is 24000210000 CPU clock cycles (10 seconds).
crash> pd cpu_khz
cpu_khz = $1 = 2400021
crash> px watchdog_ev | grep "\[4\]"
[4]: ffff88033fc0eba0
crash> rd -x 0xffff88033fc0eba0
ffff88033fc0eba0: ffff8803395d0000
crash> pd ((struct perf_event *)0xffff8803395d0000)->hw.sample_period
$2 = 24000210000
/proc/sys/kernel/watchdog_thresh
is set to the value 60, so the resulting interval forwatchdog_timer_fn
is 24000000000 nanoseconds (24 seconds).
crash> pd watchdog_thresh
watchdog_thresh = $3 = 60
crash> pd sample_period
sample_period = $4 = 24000000000
Consequently, watchdog_overflow_callback
executes at a higher rate than watchdog_timer_fn
.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
Comments