System panics due to an NMI hard lockup in RHEL 7

Solution In Progress - Updated -

Environment

  • Red Hat Enterprise Linux (RHEL) 7
  • seen on several versions of the RHEL7 kernel (3.10.0-version.el7.x86_64)
  • the /proc/sys/kernel/watchdog_thresh parameter is set to a higher value than the default
  • Docker

The issue that is described in detail under Root Cause is specific to the RHEL7 kernel; it does not exist in the RHEL6 kernel or earlier versions of the Red Hat Enterprise Linux kernel.

Issue

  • The following panic was seen in a Docker environment.
[83076.933317] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 4
[83077.019826] CPU: 4 PID: 3487 Comm: docker Tainted: GF         IO--------------   3.10.0-210.el7.x86_64 #1
[83077.134427] Hardware name: HP ProLiant DL160 G6  , BIOS O33 08/16/2010
[83077.212602]  ffffffff8182aef0 000000008d81c4ac ffff88033fc06c48 ffffffff816038f7
[83077.301624]  ffff88033fc06cc8 ffffffff815fd14b 0000000000000010 ffff88033fc06cd8
[83077.390629]  ffff88033fc06c78 000000008d81c4ac ffff88033fc06c88 0000000000000004
[83077.479636] Call Trace:
[83077.508892]  <NMI>  [<ffffffff816038f7>] dump_stack+0x19/0x1b
[83077.577827]  [<ffffffff815fd14b>] panic+0xd8/0x1e7
[83077.635184]  [<ffffffff8110a7b0>] ? watchdog_enable_all_cpus.part.2+0x40/0x40
[83077.720642]  [<ffffffff8110a872>] watchdog_overflow_callback+0xc2/0xd0
[83077.798820]  [<ffffffff8114c971>] __perf_event_overflow+0xa1/0x250
[83077.872836]  [<ffffffff8114d474>] perf_event_overflow+0x14/0x20
[83077.943724]  [<ffffffff8103022d>] intel_pmu_handle_irq+0x1fd/0x410
[83078.017734]  [<ffffffff811906a1>] ? unmap_kernel_range_noflush+0x11/0x20
[83078.097992]  [<ffffffff81372134>] ? ghes_copy_tofrom_phys+0x124/0x210
[83078.175125]  [<ffffffff8160cb4b>] perf_event_nmi_handler+0x2b/0x50
[83078.249134]  [<ffffffff8160c299>] nmi_handle.isra.0+0x69/0xb0
[83078.317942]  [<ffffffff8160c3b0>] do_nmi+0xd0/0x340
[83078.376337]  [<ffffffff8160b6f1>] end_repeat_nmi+0x1e/0x2e
  • Kernel Panic under I/O
 time dd if=/dev/zero of=/xx/yy/zzz  bs=AA count=BBBBBBB

Above command is able to crash the system.

Resolution

Red Hat Enterprise Linux 7.1: Fixed in kernel-3.10.0-229.11.1.el7 and later.
Red Hat Enterprise Linux 7.2: Fixed in kernel-3.10.0-327.el7 and later.

To work around the issue on previous kernels, avoid setting the /proc/sys/kernel/watchdog_thresh parameter to a higher value than the default if feasible.

Root Cause

The kernel functions that are involved in the scenario leading to the hard lockup panic are:

watchdog_timer_fn
This is a high-resolution timer function that updates a CPU-specific interrupt counter periodically while executing in interrupt context.

watchdog_overflow_callback
This function is invoked from Non-Maskable Interrupt (NMI) context. If a CPU is busy, this function executes periodically and it checks whether watchdog_timer_fn has incremented the CPU-specific counter during the past interval. If the counter has not been incremented, watchdog_overflow_callback assumes that the CPU is 'locked up' in a section of kernel code where interrupts are disabled, and a panic occurs unless 'panic on hard lockup' is explicitly disabled via the nmi_watchdog=nopanic parameter on the kernel command line.

For this watchdog mechanism it is essential that watchdog_timer_fn is invoked at a significantly higher rate than watchdog_overflow_callback, so that watchdog_timer_fn gets a chance to increment the CPU-specific interrupt counter at least once during watchdog_overflow_callback's interval for example, as shown in the following diagram:

watchdog_timer_fn executes on CPU N approx. at t1, t2, t3, etc.

|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|-> time
t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16

|---------|---------|---------|---------|---------|---------|-> time
o1        o2        o3        o4        o5        o6        o7

watchdog_overflow_callback executes on CPU N approx. at o1, o2, o3, etc.

The interval for watchdog_timer_fn and watchdog_overflow_callback is derived from the /proc/sys/kernel/watchdog_thresh parameter. The value of this parameter is 10 by default. Due to a flaw in the watchdog code, only the interval for watchdog_timer_fn is adjusted if the parameter value is changed (for example, via an entry in sysctl.conf). If the new value is higher than the default, it is possible that watchdog_overflow_callback will execute at a higher rate than watchdog_timer_fn on a busy CPU, and thus watchdog_overflow_callback will not observe an increase of the CPU-specific interrupt counter, as shown in the following diagram.

watchdog_timer_fn executes on CPU N approx. at t1, t2, t3, etc.

|-----------------------|-----------------------|->
t1                      t2                      t3
 no counter increase
|---------|---------|---------|---------|---------|---------|-> time
o1        o2        o3        o4        o5        o6        o7

watchdog_overflow_callback executes on CPU N approx. at o1, o2, o3, etc.

Hence, it is possible that the hard lockup detector yields a false positive: A panic will occur at o2 even though the CPU is actually not 'locked up'.

Diagnostic Steps

The following vmcore analysis shows an example of a false positive that occurred on a machine with 2.40 GHz CPU clock speed, and /proc/sys/kernel/watchdog_thresh set to the value 60.

  • Bit 9 (IF) is set in the RFLAGS register which means that interrupts are enabled. The content of the CS (Code Segment) and SS (Stack Segment) register indicates that the NMI even occurred while the processor was executing 'user space' code.
crash> bt
PID: 3487   TASK: ffff8802337916c0  CPU: 4   COMMAND: "docker"
 #0 [ffff88033fc06b20] machine_kexec at ffffffff8104c4fb
 #1 [ffff88033fc06b80] crash_kexec at ffffffff810e1fb2
 #2 [ffff88033fc06c50] panic at ffffffff815fd152
 #3 [ffff88033fc06cd0] watchdog_overflow_callback at ffffffff8110a872
 #4 [ffff88033fc06ce0] __perf_event_overflow at ffffffff8114c971
 #5 [ffff88033fc06d58] perf_event_overflow at ffffffff8114d474
 #6 [ffff88033fc06d68] intel_pmu_handle_irq at ffffffff8103022d
 #7 [ffff88033fc06e60] perf_event_nmi_handler at ffffffff8160cb4b
 #8 [ffff88033fc06e80] nmi_handle at ffffffff8160c299
 #9 [ffff88033fc06ec8] do_nmi at ffffffff8160c3b0
#10 [ffff88033fc06ef0] end_repeat_nmi at ffffffff8160b6f1
    RIP: 000000000058a551  RSP: 000000c20856c5d0  RFLAGS: 00000202
    RAX: 000000c20816c670  RBX: 0000000000000019  RCX: 0000000000000000
    RDX: 000000c20816c670  RSI: 000000c208045538  RDI: 0000000000000000
    RBP: 0000000000000003   R8: 0000000000000017   R9: 0000000000000000
    R10: 0000000000000006  R11: 000000c20823b3cb  R12: 0000000000000000
    R13: 0000000000000186  R14: 0000000000000000  R15: 000000c208356da0
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
--- <NMI exception stack> ---
  • The interval for watchdog_overflow_callback is 24000210000 CPU clock cycles (10 seconds).
  crash> pd cpu_khz
  cpu_khz = $1 = 2400021

  crash> px watchdog_ev | grep "\[4\]"
    [4]: ffff88033fc0eba0

  crash> rd -x 0xffff88033fc0eba0
  ffff88033fc0eba0:  ffff8803395d0000

  crash> pd ((struct perf_event *)0xffff8803395d0000)->hw.sample_period
  $2 = 24000210000
  • /proc/sys/kernel/watchdog_thresh is set to the value 60, so the resulting interval for watchdog_timer_fn is 24000000000 nanoseconds (24 seconds).
  crash> pd watchdog_thresh
  watchdog_thresh = $3 = 60

  crash> pd sample_period
  sample_period = $4 = 24000000000

Consequently, watchdog_overflow_callback executes at a higher rate than watchdog_timer_fn.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

Comments