Sometimes HPE Proliant servers crash due to a NMI while the system is shutting down.

Solution Verified - Updated -

Issue

Background

The hpwdt driver ships are part of the kernel and is registered as the NMI handler of last resort. When an NMI happens that is not claimed by any other registered NMI handler the hpwdt driver will bring the system down. This is the expected behavior of hpwdt.

This can happen in multiple ways but on HPE Proliant servers primarily this is done by opening /dev/watchdog and then closing it without stopping the watchdog. When this happens the following message is always seen (note that it does not identify the process that closed /dev/watchdog without stopping the watchdog):

[ 4900.993470] hpwdt: Unexpected close, not stopping watchdog!

Although hp-asrd was not running in this case, if the process was, it can also start a countdown timer (by default at 600 seconds) but it does not use /dev/watchdog by default - it talks directly to the hardware. If the hp-asrd was running, it updates the watchdog (by default) every second which means for hp-asrd to cause an ASR would mean it could not run for at least 10 minutes (when shutdown correctly it should stop its countdown timer).

Description

  • Systems are crashing with following panic message:

    [ 5492.146364] Kernel panic - not syncing: An NMI occurred. Depending on your system the reason for the NMI is logged in any one of the following resources:
                   1. Integrated Management Log (IML)
                   2. OA Syslog
                   3. OA Forward Progress Log
                   4. iLO Event Log
    [ 5492.505988] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.0-123.9.2.el7.x86_64 #1
    [ 5492.605615] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 08/02/2014
    [ 5492.692636]  ffffffffa03ae2d8 17844fa82b224426 ffff880fffa06de0 ffffffff815e241b
    [ 5492.789893]  ffff880fffa06e60 ffffffff815db8d9 0000000000000008 ffff880fffa06e70
    [ 5492.887146]  ffff880fffa06e10 17844fa82b224426 ffffffff8101a6a9 ffffc9001cb22072
    [ 5492.984397] Call Trace:
    [ 5493.016447]  <NMI>  [<ffffffff815e241b>] dump_stack+0x19/0x1b
    [ 5493.092031]  [<ffffffff815db8d9>] panic+0xd8/0x1e7
    [ 5493.155010]  [<ffffffff8101a6a9>] ? sched_clock+0x9/0x10
    [ 5493.224869]  [<ffffffffa03ad8ed>] hpwdt_pretimeout+0xdd/0xe0 [hpwdt]
    [ 5432188793.308464]  [<ffffffff815eb0d9>] nmi_handle.isra.0+0x69/0xb0
    [ 5493.384033]  [<ffffffff815eb246>] do_nmi+0x126/0x340
    [ 5493.449296]  [<ffffffff815ea531>] end_repeat_nmi+0x1e/0x2e
    [ 5493.521458]  [<ffffffff8131fa67>] ? intel_idle+0xe7/0x160
    [ 5493.592448]  [<ffffffff8131fa67>] ? intel_idle+0xe7/0x160
    [ 5493.663438]  [<ffffffff8131fa67>] ? intel_idle+0xe7/0x160
    [ 5493.734432]  <<EOE>>  [<ffffffff814835a0>] cpuidle_enter_state+0x40/0xc0
    [ 5493.822634]  [<ffffffff814836e5>] cpuidle_idle_call+0xc5/0x200
    [ 5493.899368]  [<ffffffff8101bcae>] arch_cpu_idle+0xe/0x30
    [ 5493.969241]  [<ffffffff810b47e5>] cpu_startup_entry+0xf5/0x290
    [ 5494.045960]  [<ffffffff815c3cb7>] rest_init+0x77/0x80
    [ 5494.112394]  [<ffffffff81a08fa7>] start_kernel+0x429/0x44a
    [ 5494.184531]  [<ffffffff81a08987>] ? repair_env_string+0x5c/0x5c
    [ 5494.262390]  [<ffffffff81a08120>] ? early_idt_handlers+0x120/0x120
    [ 5494.343686]  [<ffffffff81a085ee>] x86_64_start_reservations+0x2a/0x2c
    [ 5494.428419]  [<ffffffff81a08742>] x86_64_start_kernel+0x152/0x175
    
  • IML log has the following entry:

    An Unrecoverable System Error (NMI) has occurred (System error code 0x0000002B, 0x00000000)
    

Environment

  • Red Hat Enterprise Linux 7
  • kernel-3.10.0-123.13.1.el7.x86_64
  • systemd-208-11.el7_0.5.x86_64
  • HP ProLiant DL380p Gen8
  • HP ProLiant DL380 Gen9
  • hp-asrd or ASR related process is not running.

Subscriber exclusive content

A Red Hat subscription provides unlimited access to our knowledgebase of over 48,000 articles and solutions.

Current Customers and Partners

Log in for full access

Log In