"Kernel panic - not syncing: Hard LOCKUP" due to third party module [mmfs]
Environment
- Red Hat Enterprise Linux 8.
- The kernel has the following third-party modules loaded:
  - mmfslinux
  - mmfs26
Issue
- The kernel panics with the following messages:
crash> log
[7257073.323983] Kernel panic - not syncing: Hard LOCKUP
[7257073.323984] CPU: 41 PID: 1173723 Comm: opera_node_expo Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.22.1.el8_10.x86_64 #1
[7257073.323984] Hardware name: Dell Inc. PowerEdge R740xd/0923K0, BIOS 2.19.1 06/04/2023
[7257073.323985] Call Trace:
[7257073.323985] <NMI>
[7257073.323986] dump_stack+0x41/0x60
[7257073.323986] panic+0xe7/0x2ac
[7257073.323986] nmi_panic.cold.11+0xc/0xc
[7257073.323987] watchdog_overflow_callback.cold.7+0x5c/0x70
[7257073.323988] __perf_event_overflow+0x52/0x100
[7257073.323988] handle_pmi_common+0x200/0x2d0
[7257073.323989] ? __set_pte_vaddr+0x32/0x50
[7257073.323990] ? __native_set_fixmap+0x24/0x40
[7257073.323990] ? ghes_copy_tofrom_phys+0xf9/0x250
[7257073.323991] intel_pmu_handle_irq+0x119/0x450
[7257073.323991] perf_event_nmi_handler+0x2d/0x50
[7257073.323992] nmi_handle+0x63/0x110
[7257073.323992] default_do_nmi+0x49/0x110
[7257073.323993] do_nmi+0x19c/0x210
[7257073.323993] end_repeat_nmi+0x16/0x69
[7257073.323994] RIP: 0010:native_queued_spin_lock_slowpath+0x5f/0x1c0
[7257073.323995] Code: 71 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 4b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 cc cc cc cc 8b 37 81
[7257073.323996] RSP: 0018:ffffa4d0a303f9b0 EFLAGS: 00000002
[7257073.323996] RAX: 0000000000000173 RBX: ffff98d55469c000 RCX: 00000000390ae9d9
[7257073.323997] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff98cebbeaf184
[7257073.323998] RBP: ffff98cebbeaf180 R08: 0000000000000000 R09: 0000000000000000
[7257073.323998] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000017
[7257073.323999] R13: 0000000001fee9e5 R14: 00000045613fc8ab R15: 0000000000000001
[7257073.323999] ? native_queued_spin_lock_slowpath+0x5f/0x1c0
[7257073.324000] ? native_queued_spin_lock_slowpath+0x5f/0x1c0
[7257073.324000] </NMI>
[7257073.324001] _raw_spin_lock_irq+0x25/0x2c
[7257073.324001] task_numa_fault+0x35a/0xa90
[7257073.324002] do_numa_page+0x23d/0x260
[7257073.324002] __handle_mm_fault+0x552/0x6d0
[7257073.324003] handle_mm_fault+0xca/0x2a0
[7257073.324003] __do_page_fault+0x1e4/0x440
[7257073.324003] do_page_fault+0x37/0x12d
[7257073.324004] page_fault+0x1e/0x30
[..]
[7257073.324018] RIP: 0033:0x472f63
[7257073.324019] Code: 24 20 c3 cc cc cc cc 48 8b 7c 24 08 8b 74 24 10 8b 54 24 14 4c 8b 54 24 18 4c 8b 44 24 20 44 8b 4c 24 28 b8 ca 00 00 00 0f 05 <89> 44 24 30 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[7257073.324020] RSP: 002b:00007f84c67fbc40 EFLAGS: 00000286 ORIG_RAX: 00000000000000ca
[7257073.324021] RAX: fffffffffffffe00 RBX: 000000c0008d2400 RCX: 0000000000472f63
[7257073.324022] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000000c0008d2548
[7257073.324022] RBP: 00007f84c67fbc88 R08: 0000000000000000 R09: 0000000000000000
[7257073.324023] R10: 0000000000000000 R11: 0000000000000286 R12: 0000000000000003
[7257073.324024] R13: 000000c000a8d500 R14: 0000000000000050 R15: 0000000000000200
Resolution
- mmfs is a third-party module. Engage the module vendor to investigate the issue further.
Workaround:
- Blacklist the mmfs26 and mmfslinux modules and check whether the issue is still reproducible. See: How do I prevent a kernel module from loading automatically?
Root Cause
- The spinlock appears to be corrupted: the SLAB object containing it seems to have been overwritten with the ASCII string oprfs, which comes from the third-party modules.
Diagnostic Steps
- Backtrace of the panic task:
crash> bt
PID: 1173723 TASK: ffff98d55469c000 CPU: 41 COMMAND: "opera_node_expo"
#0 [fffffe1e38bcaa18] machine_kexec at ffffffff9c66f383
#1 [fffffe1e38bcaa70] __crash_kexec at ffffffff9c7bacba
#2 [fffffe1e38bcab30] panic at ffffffff9c6fa74f
#3 [fffffe1e38bcabb8] watchdog_overflow_callback.cold.7 at ffffffff9c7f4809
#4 [fffffe1e38bcabc8] __perf_event_overflow at ffffffff9c888752
#5 [fffffe1e38bcabf8] handle_pmi_common at ffffffff9c613ca0
#6 [fffffe1e38bcade0] intel_pmu_handle_irq at ffffffff9c613e89
#7 [fffffe1e38bcae38] perf_event_nmi_handler at ffffffff9c60771d
#8 [fffffe1e38bcae50] nmi_handle at ffffffff9c62e163
#9 [fffffe1e38bcaea8] default_do_nmi at ffffffff9d013079
#10 [fffffe1e38bcaec8] do_nmi at ffffffff9c62e6cc
#11 [fffffe1e38bcaef0] end_repeat_nmi at ffffffff9d201678
[exception RIP: native_queued_spin_lock_slowpath+95]
RIP: ffffffff9c75facf RSP: ffffa4d0a303f9b0 RFLAGS: 00000002
RAX: 0000000000000173 RBX: ffff98d55469c000 RCX: 00000000390ae9d9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff98cebbeaf184
RBP: ffff98cebbeaf180 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000017
R13: 0000000001fee9e5 R14: 00000045613fc8ab R15: 0000000000000001
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#12 [ffffa4d0a303f9b0] native_queued_spin_lock_slowpath at ffffffff9c75facf
#13 [ffffa4d0a303f9b0] _raw_spin_lock_irq at ffffffff9d0275b5
#14 [ffffa4d0a303f9b8] task_numa_fault at ffffffff9c73fbda
#15 [ffffa4d0a303fa50] do_numa_page at ffffffff9c8e39cd
#16 [ffffa4d0a303fa90] __handle_mm_fault at ffffffff9c8e9272
#17 [ffffa4d0a303fb48] handle_mm_fault at ffffffff9c8e94ba
#18 [ffffa4d0a303fb80] __do_page_fault at ffffffff9c682b14
#19 [ffffa4d0a303fbd0] do_page_fault at ffffffff9c682da7
#20 [ffffa4d0a303fc00] page_fault at ffffffff9d2011fe
[exception RIP: __get_user_8+33]
RIP: ffffffff9d010141 RSP: ffffa4d0a303fcb8 RFLAGS: 00050206
RAX: 00007f84c67fc9e7 RBX: ffff98d55469c000 RCX: 00000000000002d0
RDX: ffffffffffffffff RSI: 0000000000000001 RDI: ffff98d55469c000
RBP: ffff98d55469cbbc R8: 0000000000dc0000 R9: ffff98ce87f11d90
R10: ffff98b2e17eb400 R11: 0000000000000600 R12: ffff98d55469ca98
R13: 0000000000000000 R14: 00007f84c67fc9e0 R15: ffff98d55469c000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#21 [ffffa4d0a303fcb8] futex_cleanup at ffffffff9c7ac5b6
#22 [ffffa4d0a303fd30] futex_exit_release at ffffffff9c7aeccd
#23 [ffffa4d0a303fd50] exit_mm_release at ffffffff9c6f7242
#24 [ffffa4d0a303fd68] do_exit at ffffffff9c6ff486
#25 [ffffa4d0a303fdd8] do_group_exit at ffffffff9c6ffe4a
#26 [ffffa4d0a303fe00] get_signal at ffffffff9c70d0b1
#27 [ffffa4d0a303fe58] do_signal at ffffffff9c628e46
#28 [ffffa4d0a303ff20] exit_to_usermode_loop at ffffffff9c604dc9
#29 [ffffa4d0a303ff38] do_syscall_64 at ffffffff9c6055d5
#30 [ffffa4d0a303ff50] entry_SYSCALL_64_after_hwframe at ffffffff9d20012e
RIP: 0000000000472f63 RSP: 00007f84c67fbc40 RFLAGS: 00000286
RAX: fffffffffffffe00 RBX: 000000c0008d2400 RCX: 0000000000472f63
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000000c0008d2548
RBP: 00007f84c67fbc88 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000286 R12: 0000000000000003
R13: 000000c000a8d500 R14: 0000000000000050 R15: 0000000000000200
ORIG_RAX: 00000000000000ca CS: 0033 SS: 002b
- Here, the task is stuck exiting. It tries to acquire the following spinlock:
crash> task_struct.numa_group ffff98d55469c000
numa_group = 0xffff98cebbeaf180,
crash> numa_group.lock 0xffff98cebbeaf180
lock = {
{
rlock = {
raw_lock = {
{
val = {
counter = 0x173
},
{
locked = 0x73,
pending = 0x1
},
{
locked_pending = 0x173,
tail = 0x0
}
}
}
}
}
},
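The fields above can be cross-checked by splitting the 32-bit lock word by hand. The following is a minimal Python sketch, assuming the little-endian layout implied by the crash output (locked in the low byte, pending in the next byte, tail in the upper 16 bits). The name decode_qspinlock is hypothetical and is not part of crash or the kernel.

```python
def decode_qspinlock(val):
    """Split a 32-bit qspinlock word into the fields crash displays.

    Assumed layout (matches the union printed by crash above):
      bits  0-7   locked
      bits  8-15  pending
      bits 16-31  tail
    """
    return {
        "locked": val & 0xFF,
        "pending": (val >> 8) & 0xFF,
        "tail": (val >> 16) & 0xFFFF,
    }


# counter = 0x173 is taken from the numa_group.lock output above
fields = decode_qspinlock(0x173)
for name, value in fields.items():
    print(f"{name:8s} = {value:#x}")

# Show the locked byte as a character to spot ASCII debris
print("locked byte as ASCII:", repr(chr(fields["locked"])))
```

In the upstream qspinlock implementation a held lock normally carries a locked byte of 1 (an assumption about this kernel build, not something the vmcore states). A locked byte of 0x73, which is ASCII 's', therefore points to a stray write rather than a real lock owner, and a waiter spinning on that byte would likely never see it clear.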
- The spinlock looks corrupted, and the SLAB object that contains it appears to have been overwritten with an ASCII string:
crash> kmem 0xffff98cebbeaf180
CACHE OBJSIZE ALLOCATED TOTAL SLABS SSIZE NAME
ffff989f40004c40 128 16723 33824 1057 4k kmalloc-128
SLAB MEMORY NODE TOTAL ALLOCATED FREE
fffffa76c1efabc0 ffff98cebbeaf000 1 32 15 17
FREE / [ALLOCATED]
[ffff98cebbeaf180]
crash> rd ffff98cebbeaf180 16
ffff98cebbeaf180: 000001736672706f 0011e8db00000001 oprfs...........
ffff98cebbeaf190: 0000000000000001 0000000000000000 ................
ffff98cebbeaf1a0: 0000000000000000 000000000000003f ........?.......
ffff98cebbeaf1b0: 0000000000000000 ffff98cebbeaf1e0 ................
ffff98cebbeaf1c0: 0000000000000000 000000000000000a ................
ffff98cebbeaf1d0: 0000000000000000 0000000000000035 ........5.......
ffff98cebbeaf1e0: 0000000000000000 0000000000000013 ................
ffff98cebbeaf1f0: 0000000000000000 0000000000000069 ........i.......
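To see how the ASCII string lines up with the lock word, the first row of the rd output can be unpacked byte by byte. This is a rough Python sketch, assuming little-endian 64-bit words as on x86_64. The helper names words_to_bytes and printable are hypothetical, and the lock offset of 4 is inferred from RDI (ffff98cebbeaf184) relative to the object start (ffff98cebbeaf180).

```python
import struct


def words_to_bytes(words):
    """Convert 64-bit words, as printed by 'rd', back to raw little-endian bytes."""
    return b"".join(struct.pack("<Q", w) for w in words)


def printable(raw):
    """Render bytes the way the rd ASCII column does: '.' for non-printable bytes."""
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


# First row of 'rd ffff98cebbeaf180 16' from the vmcore above
row = [0x000001736672706F, 0x0011E8DB00000001]
raw = words_to_bytes(row)

print(raw[:8].hex(" "))
print(printable(raw))

# The lock word sits at offset 4 in the object (RDI - object start, assumed)
lock_val = struct.unpack_from("<I", raw, 4)[0]
print(f"lock word at +4 = {lock_val:#x}")
```

Under these assumptions the bytes at offset 4 are 73 01 00 00, which read back as 0x173. That is the same value as numa_group.lock, so the trailing 's' of oprfs lands in the locked byte.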
- The oprfs string is most likely a result of GPFS activity:
crash> search -c oprfs
ffff989f47e172c3: oprfs_nsd_273................6.......................H.i
ffff989f65bc7350: oprfs::151::R:SGPFS-ssooprfs:0:::::::::::::::::::.SSOOPR
ffff989f65bc7367: oprfs:0:::::::::::::::::::.SSOOPRH.OCE.sl73caessop06:40_
ffff989f65bc73b1: oprfs:1:%2Foprapp%2Fprod%2Fgpfs:.SSOOPRH.OCE.sl73caessop
ffff989f65bc7401: oprfs:2:.dev..= /dev/SGPFS-ssooprfs.SSOOPRH.OCE.sl73caes
ffff989f65bc741f: oprfs.SSOOPRH.OCE.sl73caessop06:40_SG_ETCFS:SGPFS-ssoopr
ffff989f65bc7454: oprfs:3:.vfs..= mmfs.SSOOPRH.OCE.sl73caessop06:40_SG_ETC
ffff989f65bc7498: oprfs:4:.nodename.= -.SSOOPRH.OCE.sl73caessop06:40_SG_ET
ffff989f65bc74dd: oprfs:5:.mount..= mmfs.SSOOPRH.OCE.sl73caessop06:40_SG_E
ffff989f65bc7523: oprfs:6:.type..= mmfs.SSOOPRH.OCE.sl73caessop06:40_SG_ET
ffff989f65bc7568: oprfs:7:.account..= false.SSOOPRH.OCE.sl73caessop06:50_S
ffff989f65bc75b1: oprfs::rw::::context=system_u%3Aobject_r%3Acontainer_fil
ffff989f72708733: oprfs_nsd_87.&...............................r.......r..
ffff989f80432a73: oprfs_nsd_84.&..........................................
ffff989f8d59a7a3: oprfs_nsd_129................6..........................
ffff989fac0d1425: oprfs:4:.nodename.= -.SSOOPRH.OCE.sl73caessop06:40_SG_ET
ffff989fac0d146a: oprfs:5:.mount..= mmfs.SSOOPRH.OCE.sl73caessop06:40_SG_E
ffff989fac0d14b0: oprfs:6:.type..= mmfs.SSOOPRH.OCE.sl73caessop06:40_SG_ET
ffff989fac0d14f5: oprfs:7:.account..= false.SSOOPRH.OCE.sl73caessop06:50_S
ffff989fac0d153e: oprfs::rw::::context=system_u%3Aobject_r%3Acontainer_fil
ffff989fb014f371: oprfs........RKM.conf.CHG3505105.....%..veesoceanagpfspr
ffff989fb43affa3: oprfs_nsd_8..&.........................5.............(.h
ffff989fb4a5b753: oprfs_nsd_158................6..........................
ffff989fb6418afa: oprfs::151::R:SGPFS-ssooprfs:0:::::::::::::::::::.SSOOPR
ffff989fb6418b11: oprfs:0:::::::::::::::::::.SSOOPRH.OCE.sl73caessop06:40_
ffff989fb6418b5b: oprfs:1:%2Foprapp%2Fprod%2Fgpfs:.SSOOPRH.OCE.sl73caessop
ffff989fb6418bab: oprfs:2:.dev..= /dev/SGPFS-ssooprfs.SSOOPRH.OCE.sl73caes
ffff989fb6418bc9: oprfs.SSOOPRH.OCE.sl73caessop06:40_SG_ETCFS:SGPFS-ssoopr
ffff989fb6418bfe: oprfs:3:.vfs..= mmfs.SSOOPRH.OCE.sl73caessop06:40_SG_ETC
ffff989fb6418c42: oprfs:4:.nodename.= -.SSOOPRH.OCE.sl73caessop06:40_SG_ET
ffff989fb6418c87: oprfs:5:.mount..= mmfs.SSOOPRH.OCE.sl73caessop06:40_SG_E
ffff989fb6418ccd: oprfs:6:.type..= mmfs.SSOOPRH.OCE.sl73caessop06:40_SG_ET
ffff989fb6418d12: oprfs:7:.account..= false.SSOOPRH.OCE.sl73caessop06:50_S
ffff989fb6418d5b: oprfs::rw::::context=system_u%3Aobject_r%3Acontainer_fil
ffff989fc9398f33: oprfs_nsd_198................6..........................
- The GPFS kernel modules are loaded:
crash> mod -t
NAME TAINTS
tracedev OE
mmfslinux OE
mmfs26 OE