Soft lockups and RCU sched CPU stalls are detected where many CPUs are looping in a spinlock. A hard lockup is encountered and then the kernel crashes in the end.
Issue
- Soft lockups and RCU sched CPU stalls are detected where many CPUs are looping in a spinlock. A hard lockup is encountered and then the kernel crashes in the end.
soft lockups:
...
[56376.979162] NMI watchdog: BUG: soft lockup - CPU#26 stuck for 23s! [ptlrpcd_00_00:12056]
[56376.979182] Modules linked in: nfsv3 nfs_acl mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx4_en(OE) mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mlx4_core(OE) vfat fat ext4 mbcache jbd2 amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass pcspkr snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer sg snd ipmi_si soundcore ipmi_devintf ipmi_msghandler pcc_cpufreq k10temp i2c_piix4 i2c_designware_platform pinctrl_amd i2c_designware_core acpi_cpufreq binfmt_misc ip_tables xfs libcrc32c raid1 sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul
[56376.979190] crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel ast aesni_intel lrw gf128mul glue_helper ablk_helper cryptd drm_kms_helper ahci syscopyarea sysfillrect sysimgblt fb_sys_fops mlx5_core(OE) ttm libahci mlx_compat(OE) drm igb mlxfw(OE) libata devlink dca drm_panel_orientation_quirks i2c_algo_bit ptp pps_core nfit libnvdimm sunrpc dm_mirror dm_region_hash dm_log dm_mod
[56376.979192] CPU: 26 PID: 12056 Comm: ptlrpcd_00_00 Kdump: loaded Tainted: G OEL ------------ 3.10.0-1062.18.1.el7.x86_64 #1
[56376.979192] Hardware name: GIGABYTE H262-Z61-00/MZ62-HD0-00, BIOS R05 08/27/2019
[56376.979193] task: ffff9aac9fcd20e0 ti: ffff9aac7a4fc000 task.ti: ffff9aac7a4fc000
[56376.979195] RIP: 0010:[<ffffffffa5f17682>] [<ffffffffa5f17682>] native_queued_spin_lock_slowpath+0x122/0x200
[56376.979196] RSP: 0018:ffff9aac7a4ffb68 EFLAGS: 00000246
[56376.979197] RAX: 0000000000000000 RBX: ffff9a7d2875d800 RCX: 0000000000d10000
[56376.979197] RDX: ffff9aaca769b880 RSI: 0000000001d10000 RDI: ffff9a8ccb95ab80
[56376.979198] RBP: ffff9aac7a4ffb68 R08: ffff9a8ced69b880 R09: 0000000000000000
[56376.979198] R10: 0000000000000000 R11: 000000000000000f R12: ffff9a7d287a6780
[56376.979199] R13: 000000000000000a R14: ffffffffc143ee2b R15: 0000000000000003
[56376.979200] FS: 0000000000000000(0000) GS:ffff9a8ced680000(0000) knlGS:0000000000000000
[56376.979201] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[56376.979201] CR2: 00002b3a697cc090 CR3: 0000000e7fdfe000 CR4: 0000000000340fe0
[56376.979202] Call Trace:
[56376.979204] [<ffffffffa65754ee>] queued_spin_lock_slowpath+0xb/0xf
[56376.979206] [<ffffffffa6583b20>] _raw_spin_lock+0x20/0x30
[56376.979218] [<ffffffffc0be9788>] cfs_percpt_lock+0x58/0x110 [libcfs]
[56376.979228] [<ffffffffc1405788>] LNetMDUnlink+0x78/0x180 [lnet]
[56376.979252] [<ffffffffc14ab96f>] ptlrpc_unregister_reply+0xbf/0x790 [ptlrpc]
[56376.979254] [<ffffffffa5eae692>] ? del_timer_sync+0x52/0x60
[56376.979269] [<ffffffffc14b05de>] ptlrpc_expire_one_request+0xee/0x520 [ptlrpc]
[56376.979284] [<ffffffffc14b0abf>] ptlrpc_expired_set+0xaf/0x1a0 [ptlrpc]
[56376.979302] [<ffffffffc14e00ec>] ptlrpcd+0x28c/0x560 [ptlrpc]
[56376.979303] [<ffffffffa5edb600>] ? wake_up_state+0x20/0x20
[56376.979319] [<ffffffffc14dfe60>] ? ptlrpcd_check+0x590/0x590 [ptlrpc]
[56376.979320] [<ffffffffa5ec6321>] kthread+0xd1/0xe0
[56376.979321] [<ffffffffa5ec6250>] ? insert_kthread_work+0x40/0x40
[56376.979323] [<ffffffffa658dd24>] ret_from_fork_nospec_begin+0xe/0x21
[56376.979324] [<ffffffffa5ec6250>] ? insert_kthread_work+0x40/0x40
[56376.979335] Code: 13 48 c1 ea 0d 48 98 83 e2 30 48 81 c2 80 b8 01 00 48 03 14 c5 60 e8 b4 a6 4c 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 <41> 8b 40 08 85 c0 74 f6 4d 8b 08 4d 85 c9 74 04 41 0f 18 09 8b
...
RCU sched CPU stalls:
...
[56377.012166] INFO: rcu_sched self-detected stall on CPU {
[56377.012167] INFO: rcu_sched self-detected stall on CPU
[56377.012168] INFO: rcu_sched self-detected stall on CPU
[56377.012169] INFO: rcu_sched self-detected stall on CPU
[56377.012170] 47
[56377.012171] {
[56377.012171] {
[56377.012172] {
[56377.012173] }
[56377.012174] 49
[56377.012174] 56
[56377.012175] 58
[56377.012176] }
[56377.012176] }
[56377.012192] (t=60000 jiffies g=4828650 c=4828649 q=225869)
[56377.012193] Task dump for CPU 33:
[56377.012195] } (t=60000 jiffies g=4828650 c=4828649 q=225869)
[56377.012196] (t=60000 jiffies g=4828650 c=4828649 q=225869)
[56377.012199] (t=60000 jiffies g=4828650 c=4828649 q=225869)
[56377.012201] ptlrpcd_01_07 R running task 0 12095 2 0x00000088
[56377.012202] Call Trace:
[56377.012235] [<ffffffffc14ab96f>] ? ptlrpc_unregister_reply+0xbf/0x790 [ptlrpc]
[56377.012250] [<ffffffffc14b05de>] ? ptlrpc_expire_one_request+0xee/0x520 [ptlrpc]
[56377.012264] [<ffffffffc14b0abf>] ? ptlrpc_expired_set+0xaf/0x1a0 [ptlrpc]
[56377.012281] [<ffffffffc14e00ec>] ? ptlrpcd+0x28c/0x560 [ptlrpc]
[56377.012282] [<ffffffffa5edb600>] ? wake_up_state+0x20/0x20
[56377.012297] [<ffffffffc14dfe60>] ? ptlrpcd_check+0x590/0x590 [ptlrpc]
[56377.012299] [<ffffffffa5ec6321>] ? kthread+0xd1/0xe0
[56377.012300] [<ffffffffa5ec6250>] ? insert_kthread_work+0x40/0x40
[56377.012301] [<ffffffffa658dd24>] ? ret_from_fork_nospec_begin+0xe/0x21
[56377.012302] [<ffffffffa5ec6250>] ? insert_kthread_work+0x40/0x40
...
Finally the kernel crashes due to a hard lockup:
...
[56395.512442] Kernel panic - not syncing: Hard LOCKUP
[56395.512443] CPU: 62 PID: 12108 Comm: ptlrpcd_01_20 Kdump: loaded Tainted: G OEL ------------ 3.10.0-1062.18.1.el7.x86_64 #1
[56395.512444] Hardware name: GIGABYTE H262-Z61-00/MZ62-HD0-00, BIOS R05 08/27/2019
[56395.512444] Call Trace:
[56395.512449] <NMI> [<ffffffffa657b416>] dump_stack+0x19/0x1b
[56395.512450] [<ffffffffa6574a0b>] panic+0xe8/0x21f
[56395.512454] [<ffffffffa5e2f958>] ? show_regs+0x58/0x290
[56395.512457] [<ffffffffa5e9b80f>] nmi_panic+0x3f/0x40
[56395.512461] [<ffffffffa5f4edd1>] watchdog_overflow_callback+0x121/0x140
[56395.512463] [<ffffffffa5fa84c7>] __perf_event_overflow+0x57/0x100
[56395.512465] [<ffffffffa5fb1c64>] perf_event_overflow+0x14/0x20
[56395.512469] [<ffffffffa5e05525>] x86_pmu_handle_irq+0x125/0x180
[56395.512473] [<ffffffffa6183db8>] ? ioremap_page_range+0x2e8/0x480
[56395.512475] [<ffffffffa60030c4>] ? vunmap_page_range+0x234/0x470
[56395.512476] [<ffffffffa6003311>] ? unmap_kernel_range_noflush+0x11/0x20
[56395.512478] [<ffffffffa624acee>] ? ghes_copy_tofrom_phys+0x10e/0x210
[56395.512479] [<ffffffffa624ae90>] ? ghes_read_estatus+0xa0/0x190
[56395.512480] [<ffffffffa5e0668a>] amd_pmu_handle_irq+0x3a/0x80
[56395.512482] [<ffffffffa6584031>] perf_event_nmi_handler+0x31/0x50
[56395.512484] [<ffffffffa658593c>] nmi_handle.isra.0+0x8c/0x150
[56395.512486] [<ffffffffa6585c18>] do_nmi+0x218/0x460
[56395.512487] [<ffffffffa6584d9c>] end_repeat_nmi+0x1e/0x81
[56395.512488] [<ffffffffa5f17682>] ? native_queued_spin_lock_slowpath+0x122/0x200
[56395.512489] [<ffffffffa5f17682>] ? native_queued_spin_lock_slowpath+0x122/0x200
[56395.512490] [<ffffffffa5f17682>] ? native_queued_spin_lock_slowpath+0x122/0x200
[56395.512492] <EOE> <IRQ> [<ffffffffa65754ee>] queued_spin_lock_slowpath+0xb/0xf
[56395.512494] [<ffffffffa6583ba7>] _raw_spin_lock_irqsave+0x37/0x40
[56395.512496] [<ffffffffa5f581ab>] rcu_dump_cpu_stacks+0x4b/0xd0
[56395.512498] [<ffffffffa5f5b8b2>] rcu_check_callbacks+0x442/0x730
[56395.512501] [<ffffffffa5f10190>] ? tick_sched_do_timer+0x50/0x50
[56395.512503] [<ffffffffa5eaf666>] update_process_times+0x46/0x80
[56395.512504] [<ffffffffa5f0ff00>] tick_sched_handle+0x30/0x70
[56395.512505] [<ffffffffa5f101c9>] tick_sched_timer+0x39/0x80
[56395.512507] [<ffffffffa5eca71e>] __hrtimer_run_queues+0x10e/0x270
[56395.512509] [<ffffffffa5ecac7f>] hrtimer_interrupt+0xaf/0x1d0
[56395.512513] [<ffffffffa5e5c61b>] local_apic_timer_interrupt+0x3b/0x60
[56395.512516] [<ffffffffa65929d3>] smp_apic_timer_interrupt+0x43/0x60
[56395.512517] [<ffffffffa658eefa>] apic_timer_interrupt+0x16a/0x170
[56395.512518] <EOI> [<ffffffffa5f17682>] ? native_queued_spin_lock_slowpath+0x122/0x200
[56395.512520] [<ffffffffa65754ee>] queued_spin_lock_slowpath+0xb/0xf
[56395.512521] [<ffffffffa6583b20>] _raw_spin_lock+0x20/0x30
[56395.512526] [<ffffffffc0be9788>] cfs_percpt_lock+0x58/0x110 [libcfs]
[56395.512532] [<ffffffffc1405788>] LNetMDUnlink+0x78/0x180 [lnet]
[56395.512547] [<ffffffffc14ab96f>] ptlrpc_unregister_reply+0xbf/0x790 [ptlrpc]
[56395.512562] [<ffffffffc14b05de>] ptlrpc_expire_one_request+0xee/0x520 [ptlrpc]
[56395.512576] [<ffffffffc14b0abf>] ptlrpc_expired_set+0xaf/0x1a0 [ptlrpc]
[56395.512591] [<ffffffffc14e00ec>] ptlrpcd+0x28c/0x560 [ptlrpc]
[56395.512592] [<ffffffffa5edb600>] ? wake_up_state+0x20/0x20
[56395.512607] [<ffffffffc14dfe60>] ? ptlrpcd_check+0x590/0x590 [ptlrpc]
[56395.512608] [<ffffffffa5ec6321>] kthread+0xd1/0xe0
[56395.512609] [<ffffffffa5ec6250>] ? insert_kthread_work+0x40/0x40
[56395.512610] [<ffffffffa658dd24>] ret_from_fork_nospec_begin+0xe/0x21
[56395.512612] [<ffffffffa5ec6250>] ? insert_kthread_work+0x40/0x40
...
Environment
- Red Hat Enterprise Linux 7.7 (kernel-3.10.0-1062.18.1.el7)
- lustre, libcfs and lnet are installed and loaded.
- libcfs is an API used throughout Lustre and LNet and provides primitives for process/memory management and debugging.
Subscriber exclusive content
A Red Hat subscription provides unlimited access to our knowledgebase, tools, and much more.