Hard lockup occurs sometimes during reboot on systems with a large number of CPUs (960 CPUs)

Solution Verified - Updated -

Issue

On machines with a large number of CPUs, RHEL sometimes fails to boot and the kernel panics with a hard lockup. A sample call trace is shown below.

[   18.000464] ACPI: Added _OSI(Module Device)
[   18.000467] ACPI: Added _OSI(Processor Device)
[   18.000469] ACPI: Added _OSI(3.0 _SCP Extensions)
[   18.000470] ACPI: Added _OSI(Processor Aggregator Device)
[   18.415690] ACPI: 8 ACPI AML tables successfully acquired and loaded
[   18.460078] ACPI: Dynamic OEM Table Load:
[   19.371682] ACPI: _OSC evaluated successfully for all CPUs
[   19.374741] ACPI: Interpreter enabled
[   19.374781] ACPI: PM: (supports S0 S5)
[   19.374784] ACPI: Using IOAPIC for interrupt routing
[   19.388584] HEST: Enabling Firmware First mode for corrected errors.
[   19.408846] NMI watchdog: Watchdog detected hard LOCKUP on cpu 870 <----
[   19.408846] Modules linked in:
[   19.408846] CPU: 870 PID: 0 Comm: swapper/870 Tainted: G          I       -------  ---  5.14.0-570.12.1.el9_6.x86_64 #1
[   19.408846] Hardware name: Lenovo ThinkSystem SR950 V3/SC57B77397, BIOS EBE120C-7.21 06/20/2025
[   19.408846] RIP: 0010:native_queued_spin_lock_slowpath+0x27b/0x2b0
[   19.408846] Code: c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 05 48 63 d2 48 05 00 48 03 00 48 03 04 d5 40 0c 18 8a 48 89 18 8b 43 08 85 c0 75 09 f3 90 <8b> 43 08 85 c0 74 f7 48 8b 13 48 85 d2 74 83 0f 0d 0a e9 7b ff ff
[   19.408846] RSP: 0000:ff5e89656fb74f58 EFLAGS: 00000046
[   19.408846] RAX: 0000000000000000 RBX: ff47e6d87f6b4800 RCX: 000000000d9c0000
[   19.408846] RDX: 0000000000000186 RSI: 0000000004700100 RDI: ffffffff8b6d8098
[   19.408846] RBP: ffffffff8b6d8098 R08: 0000000484db7ccc R09: 0000000000000000
[   19.408846] R10: 0000000000000000 R11: ff5e89656fb74ff8 R12: 0000000000000000
[   19.408846] R13: 0000000000000366 R14: ffffffff88a54330 R15: ff5e8965000efd9c
[   19.408846] FS:  0000000000000000(0000) GS:ff47e6d87f680000(0000) knlGS:0000000000000000
[   19.408846] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   19.408846] CR2: 0000000000000000 CR3: 00000af6a3e10001 CR4: 0000000000771ef0
[   19.408846] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   19.408846] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[   19.408846] PKRU: 55555554
[   19.408846] Call Trace:
[   19.408846]  <NMI>
[   19.408846]  ? show_trace_log_lvl+0x1c4/0x2df
[   19.408846]  ? show_trace_log_lvl+0x1c4/0x2df
[   19.408846]  ? _raw_spin_lock_irqsave+0x30/0x40
[   19.408846]  ? watchdog_overflow_callback.cold+0x1e/0x70
[   19.408846]  ? __perf_event_overflow+0x112/0x320
[   19.408846]  ? handle_pmi_common+0x128/0x410
[   19.408846]  ? intel_pmu_handle_irq+0x103/0x2a0
[   19.408846]  ? perf_event_nmi_handler+0x28/0x50
[   19.408846]  ? nmi_handle+0x5b/0x120
[   19.408846]  ? default_do_nmi+0x40/0x130
[   19.408846]  ? exc_nmi+0x100/0x180
[   19.408846]  ? end_repeat_nmi+0xf/0x60
[   19.408846]  ? __pfx___mce_disable_bank+0x10/0x10
[   19.408846]  ? native_queued_spin_lock_slowpath+0x27b/0x2b0
[   19.408846]  ? native_queued_spin_lock_slowpath+0x27b/0x2b0
[   19.408848]  ? native_queued_spin_lock_slowpath+0x27b/0x2b0
[   19.408850]  </NMI>
[   19.408851]  <IRQ>
[   19.408852]  _raw_spin_lock_irqsave+0x30/0x40
[   19.408854]  cmci_disable_bank+0x54/0x90
[   19.408857]  __flush_smp_call_function_queue+0x87/0x3d0
[   19.408865]  __sysvec_call_function+0x18/0xc0
[   19.408870]  sysvec_call_function+0x6d/0x90
[   19.408873]  </IRQ>
[   19.408873]  <TASK>
[   19.408874]  asm_sysvec_call_function+0x16/0x20
[   19.408876] RIP: 0010:default_idle+0xb/0x20
[   19.408878] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 eb 07 0f 00 2d c3 6c 2e 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90
[   19.408879] RSP: 0000:ff5e896564557ed0 EFLAGS: 00000202
[   19.408880] RAX: ffffffff89725600 RBX: ff47e55e8e092380 RCX: 0000000000000000
[   19.408881] RDX: 4000000000000000 RSI: ff47e6d87f6a3fe0 RDI: 0000000000013084
[   19.408882] RBP: 0000000000000000 R08: 0000000000013084 R09: 00000000fa83b2da
[   19.408883] R10: 0000000002e4f59e R11: 0000000002d51517 R12: 0000000000000000
[   19.408883] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   19.408884]  ? __pfx_default_idle+0x10/0x10
[   19.408886]  default_idle_call+0x2e/0xd0
[   19.408888]  cpuidle_idle_call+0x125/0x160
[   19.408895]  ? sched_clock_cpu+0x5a/0x190
[   19.408900]  do_idle+0x7b/0xe0
[   19.408902]  cpu_startup_entry+0x26/0x30
[   19.408903]  start_secondary+0x115/0x140
[   19.408906]  secondary_startup_64_no_verify+0x187/0x18b
[   19.408911]  </TASK>
[   19.408913] Kernel panic - not syncing: Hard LOCKUP
[   19.409267] CPU: 870 PID: 0 Comm: swapper/870 Tainted: G          I       -------  ---  5.14.0-570.12.1.el9_6.x86_64 #1
[   19.409267] Hardware name: Lenovo ThinkSystem SR950 V3/SC57B77397, BIOS EBE120C-7.21 06/20/2025
[   19.409267] Call Trace:
[   19.409267]  <NMI>
[   19.409267]  dump_stack_lvl+0x34/0x48
[   19.409267]  panic+0x107/0x2bb
[   19.409267]  nmi_panic.cold+0xc/0xc
[   19.409267]  watchdog_overflow_callback.cold+0x5c/0x70
[   19.409267]  __perf_event_overflow+0x112/0x320
[   19.409267]  handle_pmi_common+0x128/0x410
[   19.409267]  intel_pmu_handle_irq+0x103/0x2a0
[   19.409267]  perf_event_nmi_handler+0x28/0x50
[   19.409267]  nmi_handle+0x5b/0x120
[   19.409267]  default_do_nmi+0x40/0x130
[   19.409267]  exc_nmi+0x100/0x180
[   19.409267]  end_repeat_nmi+0xf/0x60
[   19.409267] RIP: 0010:native_queued_spin_lock_slowpath+0x27b/0x2b0
[   19.409267] Code: c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 05 48 63 d2 48 05 00 48 03 00 48 03 04 d5 40 0c 18 8a 48 89 18 8b 43 08 85 c0 75 09 f3 90 <8b> 43 08 85 c0 74 f7 48 8b 13 48 85 d2 74 83 0f 0d 0a e9 7b ff ff
[   19.409267] RSP: 0000:ff5e89656fb74f58 EFLAGS: 00000046
[   19.409267] RAX: 0000000000000000 RBX: ff47e6d87f6b4800 RCX: 000000000d9c0000
[   19.409267] RDX: 0000000000000186 RSI: 0000000004700100 RDI: ffffffff8b6d8098
[   19.409267] RBP: ffffffff8b6d8098 R08: 0000000484db7ccc R09: 0000000000000000
[   19.409267] R10: 0000000000000000 R11: ff5e89656fb74ff8 R12: 0000000000000000
[   19.409267] R13: 0000000000000366 R14: ffffffff88a54330 R15: ff5e8965000efd9c
[   19.409267]  ? __pfx___mce_disable_bank+0x10/0x10
[   19.409267]  ? native_queued_spin_lock_slowpath+0x27b/0x2b0
[   19.409267]  ? native_queued_spin_lock_slowpath+0x27b/0x2b0
[   19.409267]  </NMI>
[   19.409267]  <IRQ>
[   19.409267]  _raw_spin_lock_irqsave+0x30/0x40
[   19.409267]  cmci_disable_bank+0x54/0x90
[   19.409267]  __flush_smp_call_function_queue+0x87/0x3d0
[   19.409267]  __sysvec_call_function+0x18/0xc0
[   19.409267]  sysvec_call_function+0x6d/0x90
[   19.409267]  </IRQ>
[   19.409267]  <TASK>
[   19.409267]  asm_sysvec_call_function+0x16/0x20
[   19.409267] RIP: 0010:default_idle+0xb/0x20
[   19.409267] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 eb 07 0f 00 2d c3 6c 2e 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90
[   19.409267] RSP: 0000:ff5e896564557ed0 EFLAGS: 00000202
[   19.409267] RAX: ffffffff89725600 RBX: ff47e55e8e092380 RCX: 0000000000000000
[   19.409267] RDX: 4000000000000000 RSI: ff47e6d87f6a3fe0 RDI: 0000000000013084
[   19.409267] RBP: 0000000000000000 R08: 0000000000013084 R09: 00000000fa83b2da
[   19.409267] R10: 0000000002e4f59e R11: 0000000002d51517 R12: 0000000000000000
[   19.409267] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   19.409267]  ? __pfx_default_idle+0x10/0x10
[   19.409267]  default_idle_call+0x2e/0xd0
[   19.409267]  cpuidle_idle_call+0x125/0x160
[   19.409267]  ? sched_clock_cpu+0x5a/0x190
[   19.409267]  do_idle+0x7b/0xe0
[   19.409267]  cpu_startup_entry+0x26/0x30
[   19.409267]  start_secondary+0x115/0x140
[   19.409267]  secondary_startup_64_no_verify+0x187/0x18b
[   19.409267]  </TASK>
[   19.409267] Shutting down cpus with NMI
[   19.409267] ---[ end Kernel panic - not syncing: Hard LOCKUP ]---
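
To check whether a saved console log shows the same signature, the trace above can be summarized with a short script. This is only a minimal sketch: the function name summarize_lockup and the SAMPLE excerpt are illustrative assumptions, not part of any Red Hat tool. It reports the CPU named by the watchdog and the reliable frames in each stack section, skipping entries prefixed with "?" because the kernel marks those as unverified.

```python
import re

# Hypothetical helper for illustration only; not part of any Red Hat tooling.
LOCKUP_RE = re.compile(r"Watchdog detected hard LOCKUP on cpu (\d+)")
TS_RE = re.compile(r"^\[\s*\d+\.\d+\]\s?")  # leading "[   19.408852] " timestamp
FRAME_RE = re.compile(r"^\s*(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_lockup(log_text):
    """Return (cpu, frames) for the first hard lockup report in log_text.

    frames maps a stack section name ("NMI", "IRQ", "TASK") to the list of
    reliable function names seen in that section, in log order.
    """
    cpu = None
    section = None
    frames = {}
    for raw in log_text.splitlines():
        line = TS_RE.sub("", raw)
        if cpu is None:
            match = LOCKUP_RE.search(line)
            if match:
                cpu = int(match.group(1))
            continue
        tag = line.strip()
        if tag in ("<NMI>", "<IRQ>", "<TASK>"):
            section = tag.strip("<>")
            continue
        if tag in ("</NMI>", "</IRQ>", "</TASK>"):
            section = None
            continue
        if tag.startswith("Kernel panic"):
            break  # the panic message repeats the same trace
        frame = FRAME_RE.match(line)
        # Keep only frames without the "?" marker (unverified stack entries).
        if frame and section and not frame.group(1):
            frames.setdefault(section, []).append(frame.group(2))
    return cpu, frames


# Excerpt copied from the trace in this article.
SAMPLE = """\
[   19.408846] NMI watchdog: Watchdog detected hard LOCKUP on cpu 870
[   19.408846]  <NMI>
[   19.408846]  ? nmi_handle+0x5b/0x120
[   19.408850]  </NMI>
[   19.408851]  <IRQ>
[   19.408852]  _raw_spin_lock_irqsave+0x30/0x40
[   19.408854]  cmci_disable_bank+0x54/0x90
[   19.408857]  __flush_smp_call_function_queue+0x87/0x3d0
[   19.408873]  </IRQ>
[   19.408913] Kernel panic - not syncing: Hard LOCKUP
"""

cpu, frames = summarize_lockup(SAMPLE)
print(cpu, frames.get("IRQ"))
```

For the trace in this article, the IRQ section is the part to compare: cmci_disable_bank spinning in _raw_spin_lock_irqsave, reached from an SMP function call while the CPU was idle.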

Environment

  • Red Hat Enterprise Linux 8
  • Red Hat Enterprise Linux 9
  • Lenovo ThinkSystem SR950 V3 with 960 CPU cores
