Hard lockup occurs sometimes during reboot on systems with a large number of CPUs (960 CPUs)
Issue
RHEL occasionally fails to boot and panics with a hard lockup on machines with a large number of CPUs. The call trace from the console log is shown below.
[ 18.000464] ACPI: Added _OSI(Module Device)
[ 18.000467] ACPI: Added _OSI(Processor Device)
[ 18.000469] ACPI: Added _OSI(3.0 _SCP Extensions)
[ 18.000470] ACPI: Added _OSI(Processor Aggregator Device)
[ 18.415690] ACPI: 8 ACPI AML tables successfully acquired and loaded
[ 18.460078] ACPI: Dynamic OEM Table Load:
[ 19.371682] ACPI: _OSC evaluated successfully for all CPUs
[ 19.374741] ACPI: Interpreter enabled
[ 19.374781] ACPI: PM: (supports S0 S5)
[ 19.374784] ACPI: Using IOAPIC for interrupt routing
[ 19.388584] HEST: Enabling Firmware First mode for corrected errors.
[ 19.408846] NMI watchdog: Watchdog detected hard LOCKUP on cpu 870 <----
[ 19.408846] Modules linked in:
[ 19.408846] CPU: 870 PID: 0 Comm: swapper/870 Tainted: G I ------- --- 5.14.0-570.12.1.el9_6.x86_64 #1
[ 19.408846] Hardware name: Lenovo ThinkSystem SR950 V3/SC57B77397, BIOS EBE120C-7.21 06/20/2025
[ 19.408846] RIP: 0010:native_queued_spin_lock_slowpath+0x27b/0x2b0
[ 19.408846] Code: c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 05 48 63 d2 48 05 00 48 03 00 48 03 04 d5 40 0c 18 8a 48 89 18 8b 43 08 85 c0 75 09 f3 90 <8b> 43 08 85 c0 74 f7 48 8b 13 48 85 d2 74 83 0f 0d 0a e9 7b ff ff
[ 19.408846] RSP: 0000:ff5e89656fb74f58 EFLAGS: 00000046
[ 19.408846] RAX: 0000000000000000 RBX: ff47e6d87f6b4800 RCX: 000000000d9c0000
[ 19.408846] RDX: 0000000000000186 RSI: 0000000004700100 RDI: ffffffff8b6d8098
[ 19.408846] RBP: ffffffff8b6d8098 R08: 0000000484db7ccc R09: 0000000000000000
[ 19.408846] R10: 0000000000000000 R11: ff5e89656fb74ff8 R12: 0000000000000000
[ 19.408846] R13: 0000000000000366 R14: ffffffff88a54330 R15: ff5e8965000efd9c
[ 19.408846] FS: 0000000000000000(0000) GS:ff47e6d87f680000(0000) knlGS:0000000000000000
[ 19.408846] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 19.408846] CR2: 0000000000000000 CR3: 00000af6a3e10001 CR4: 0000000000771ef0
[ 19.408846] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 19.408846] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 19.408846] PKRU: 55555554
[ 19.408846] Call Trace:
[ 19.408846] <NMI>
[ 19.408846] ? show_trace_log_lvl+0x1c4/0x2df
[ 19.408846] ? show_trace_log_lvl+0x1c4/0x2df
[ 19.408846] ? _raw_spin_lock_irqsave+0x30/0x40
[ 19.408846] ? watchdog_overflow_callback.cold+0x1e/0x70
[ 19.408846] ? __perf_event_overflow+0x112/0x320
[ 19.408846] ? handle_pmi_common+0x128/0x410
[ 19.408846] ? intel_pmu_handle_irq+0x103/0x2a0
[ 19.408846] ? perf_event_nmi_handler+0x28/0x50
[ 19.408846] ? nmi_handle+0x5b/0x120
[ 19.408846] ? default_do_nmi+0x40/0x130
[ 19.408846] ? exc_nmi+0x100/0x180
[ 19.408846] ? end_repeat_nmi+0xf/0x60
[ 19.408846] ? __pfx___mce_disable_bank+0x10/0x10
[ 19.408846] ? native_queued_spin_lock_slowpath+0x27b/0x2b0
[ 19.408846] ? native_queued_spin_lock_slowpath+0x27b/0x2b0
[ 19.408848] ? native_queued_spin_lock_slowpath+0x27b/0x2b0
[ 19.408850] </NMI>
[ 19.408851] <IRQ>
[ 19.408852] _raw_spin_lock_irqsave+0x30/0x40
[ 19.408854] cmci_disable_bank+0x54/0x90
[ 19.408857] __flush_smp_call_function_queue+0x87/0x3d0
[ 19.408865] __sysvec_call_function+0x18/0xc0
[ 19.408870] sysvec_call_function+0x6d/0x90
[ 19.408873] </IRQ>
[ 19.408873] <TASK>
[ 19.408874] asm_sysvec_call_function+0x16/0x20
[ 19.408876] RIP: 0010:default_idle+0xb/0x20
[ 19.408878] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 eb 07 0f 00 2d c3 6c 2e 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90
[ 19.408879] RSP: 0000:ff5e896564557ed0 EFLAGS: 00000202
[ 19.408880] RAX: ffffffff89725600 RBX: ff47e55e8e092380 RCX: 0000000000000000
[ 19.408881] RDX: 4000000000000000 RSI: ff47e6d87f6a3fe0 RDI: 0000000000013084
[ 19.408882] RBP: 0000000000000000 R08: 0000000000013084 R09: 00000000fa83b2da
[ 19.408883] R10: 0000000002e4f59e R11: 0000000002d51517 R12: 0000000000000000
[ 19.408883] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 19.408884] ? __pfx_default_idle+0x10/0x10
[ 19.408886] default_idle_call+0x2e/0xd0
[ 19.408888] cpuidle_idle_call+0x125/0x160
[ 19.408895] ? sched_clock_cpu+0x5a/0x190
[ 19.408900] do_idle+0x7b/0xe0
[ 19.408902] cpu_startup_entry+0x26/0x30
[ 19.408903] start_secondary+0x115/0x140
[ 19.408906] secondary_startup_64_no_verify+0x187/0x18b
[ 19.408911] </TASK>
[ 19.408913] Kernel panic - not syncing: Hard LOCKUP
[ 19.409267] CPU: 870 PID: 0 Comm: swapper/870 Tainted: G I ------- --- 5.14.0-570.12.1.el9_6.x86_64 #1
[ 19.409267] Hardware name: Lenovo ThinkSystem SR950 V3/SC57B77397, BIOS EBE120C-7.21 06/20/2025
[ 19.409267] Call Trace:
[ 19.409267] <NMI>
[ 19.409267] dump_stack_lvl+0x34/0x48
[ 19.409267] panic+0x107/0x2bb
[ 19.409267] nmi_panic.cold+0xc/0xc
[ 19.409267] watchdog_overflow_callback.cold+0x5c/0x70
[ 19.409267] __perf_event_overflow+0x112/0x320
[ 19.409267] handle_pmi_common+0x128/0x410
[ 19.409267] intel_pmu_handle_irq+0x103/0x2a0
[ 19.409267] perf_event_nmi_handler+0x28/0x50
[ 19.409267] nmi_handle+0x5b/0x120
[ 19.409267] default_do_nmi+0x40/0x130
[ 19.409267] exc_nmi+0x100/0x180
[ 19.409267] end_repeat_nmi+0xf/0x60
[ 19.409267] RIP: 0010:native_queued_spin_lock_slowpath+0x27b/0x2b0
[ 19.409267] Code: c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 05 48 63 d2 48 05 00 48 03 00 48 03 04 d5 40 0c 18 8a 48 89 18 8b 43 08 85 c0 75 09 f3 90 <8b> 43 08 85 c0 74 f7 48 8b 13 48 85 d2 74 83 0f 0d 0a e9 7b ff ff
[ 19.409267] RSP: 0000:ff5e89656fb74f58 EFLAGS: 00000046
[ 19.409267] RAX: 0000000000000000 RBX: ff47e6d87f6b4800 RCX: 000000000d9c0000
[ 19.409267] RDX: 0000000000000186 RSI: 0000000004700100 RDI: ffffffff8b6d8098
[ 19.409267] RBP: ffffffff8b6d8098 R08: 0000000484db7ccc R09: 0000000000000000
[ 19.409267] R10: 0000000000000000 R11: ff5e89656fb74ff8 R12: 0000000000000000
[ 19.409267] R13: 0000000000000366 R14: ffffffff88a54330 R15: ff5e8965000efd9c
[ 19.409267] ? __pfx___mce_disable_bank+0x10/0x10
[ 19.409267] ? native_queued_spin_lock_slowpath+0x27b/0x2b0
[ 19.409267] ? native_queued_spin_lock_slowpath+0x27b/0x2b0
[ 19.409267] </NMI>
[ 19.409267] <IRQ>
[ 19.409267] _raw_spin_lock_irqsave+0x30/0x40
[ 19.409267] cmci_disable_bank+0x54/0x90
[ 19.409267] __flush_smp_call_function_queue+0x87/0x3d0
[ 19.409267] __sysvec_call_function+0x18/0xc0
[ 19.409267] sysvec_call_function+0x6d/0x90
[ 19.409267] </IRQ>
[ 19.409267] <TASK>
[ 19.409267] asm_sysvec_call_function+0x16/0x20
[ 19.409267] RIP: 0010:default_idle+0xb/0x20
[ 19.409267] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 eb 07 0f 00 2d c3 6c 2e 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90
[ 19.409267] RSP: 0000:ff5e896564557ed0 EFLAGS: 00000202
[ 19.409267] RAX: ffffffff89725600 RBX: ff47e55e8e092380 RCX: 0000000000000000
[ 19.409267] RDX: 4000000000000000 RSI: ff47e6d87f6a3fe0 RDI: 0000000000013084
[ 19.409267] RBP: 0000000000000000 R08: 0000000000013084 R09: 00000000fa83b2da
[ 19.409267] R10: 0000000002e4f59e R11: 0000000002d51517 R12: 0000000000000000
[ 19.409267] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 19.409267] ? __pfx_default_idle+0x10/0x10
[ 19.409267] default_idle_call+0x2e/0xd0
[ 19.409267] cpuidle_idle_call+0x125/0x160
[ 19.409267] ? sched_clock_cpu+0x5a/0x190
[ 19.409267] do_idle+0x7b/0xe0
[ 19.409267] cpu_startup_entry+0x26/0x30
[ 19.409267] start_secondary+0x115/0x140
[ 19.409267] secondary_startup_64_no_verify+0x187/0x18b
[ 19.409267] </TASK>
[ 19.409267] Shutting down cpus with NMI
[ 19.409267] ---[ end Kernel panic - not syncing: Hard LOCKUP ]---
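To check whether another console log shows the same signature, a small script can pull out the CPU named in the hard LOCKUP message and the interrupt-context frames that follow it. This is a minimal sketch and is not part of the original article: the function name `find_hard_lockups` and both regular expressions are illustrative assumptions, written to match the line format of the trace above.

```python
import re

# Hypothetical helper for triage; the patterns are assumptions based on the
# log format shown above ("[ secs.usecs] text").
LOCKUP_RE = re.compile(r"Watchdog detected hard LOCKUP on cpu (\d+)")
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(\??)\s*([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def find_hard_lockups(log_text):
    """Return one dict per hard-lockup report: the CPU number and the
    reliable (non-'?') frames of the first <IRQ> section that follows."""
    reports = []
    current = None
    in_irq = False
    for line in log_text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            current = {"cpu": int(m.group(1)), "irq_frames": []}
            reports.append(current)
            in_irq = False
            continue
        if current is None:
            continue
        if "</IRQ>" in line:
            # Only the first IRQ section after the lockup line is collected.
            in_irq = False
            current = None
        elif "<IRQ>" in line:
            in_irq = True
        elif in_irq:
            f = FRAME_RE.match(line)
            if f and not f.group(1):  # skip unreliable "?" frames
                current["irq_frames"].append(f.group(2))
    return reports


if __name__ == "__main__":
    import sys

    for report in find_hard_lockups(sys.stdin.read()):
        print(report["cpu"], " <- ".join(report["irq_frames"]))
```

Run it by piping a saved console log into the script on standard input. For the trace above, the frames of interest are the ones showing `cmci_disable_bank` waiting in `_raw_spin_lock_irqsave`.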
Environment
- Red Hat Enterprise Linux 8
- Red Hat Enterprise Linux 9
- Lenovo ThinkSystem SR950 V3 with 960 CPU cores