Soft lockups and RCU sched CPU stalls are detected where many CPUs are looping in a spinlock. A hard lockup is encountered and then the kernel crashes in the end.

Solution Unverified - Updated -

Issue

  • Soft lockups and RCU sched CPU stalls are detected where many CPUs are looping in a spinlock. A hard lockup is encountered and then the kernel crashes in the end.
soft lockups:
    ...
[56376.979162] NMI watchdog: BUG: soft lockup - CPU#26 stuck for 23s! [ptlrpcd_00_00:12056]
[56376.979182] Modules linked in: nfsv3 nfs_acl mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx4_en(OE) mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mlx4_core(OE) vfat fat ext4 mbcache jbd2 amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass pcspkr snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer sg snd ipmi_si soundcore ipmi_devintf ipmi_msghandler pcc_cpufreq k10temp i2c_piix4 i2c_designware_platform pinctrl_amd i2c_designware_core acpi_cpufreq binfmt_misc ip_tables xfs libcrc32c raid1 sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul
[56376.979190]  crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel ast aesni_intel lrw gf128mul glue_helper ablk_helper cryptd drm_kms_helper ahci syscopyarea sysfillrect sysimgblt fb_sys_fops mlx5_core(OE) ttm libahci mlx_compat(OE) drm igb mlxfw(OE) libata devlink dca drm_panel_orientation_quirks i2c_algo_bit ptp pps_core nfit libnvdimm sunrpc dm_mirror dm_region_hash dm_log dm_mod
[56376.979192] CPU: 26 PID: 12056 Comm: ptlrpcd_00_00 Kdump: loaded Tainted: G           OEL ------------   3.10.0-1062.18.1.el7.x86_64 #1
[56376.979192] Hardware name: GIGABYTE H262-Z61-00/MZ62-HD0-00, BIOS R05 08/27/2019
[56376.979193] task: ffff9aac9fcd20e0 ti: ffff9aac7a4fc000 task.ti: ffff9aac7a4fc000
[56376.979195] RIP: 0010:[<ffffffffa5f17682>]  [<ffffffffa5f17682>] native_queued_spin_lock_slowpath+0x122/0x200
[56376.979196] RSP: 0018:ffff9aac7a4ffb68  EFLAGS: 00000246
[56376.979197] RAX: 0000000000000000 RBX: ffff9a7d2875d800 RCX: 0000000000d10000
[56376.979197] RDX: ffff9aaca769b880 RSI: 0000000001d10000 RDI: ffff9a8ccb95ab80
[56376.979198] RBP: ffff9aac7a4ffb68 R08: ffff9a8ced69b880 R09: 0000000000000000
[56376.979198] R10: 0000000000000000 R11: 000000000000000f R12: ffff9a7d287a6780
[56376.979199] R13: 000000000000000a R14: ffffffffc143ee2b R15: 0000000000000003
[56376.979200] FS:  0000000000000000(0000) GS:ffff9a8ced680000(0000) knlGS:0000000000000000
[56376.979201] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[56376.979201] CR2: 00002b3a697cc090 CR3: 0000000e7fdfe000 CR4: 0000000000340fe0
[56376.979202] Call Trace:
[56376.979204]  [<ffffffffa65754ee>] queued_spin_lock_slowpath+0xb/0xf
[56376.979206]  [<ffffffffa6583b20>] _raw_spin_lock+0x20/0x30
[56376.979218]  [<ffffffffc0be9788>] cfs_percpt_lock+0x58/0x110 [libcfs]
[56376.979228]  [<ffffffffc1405788>] LNetMDUnlink+0x78/0x180 [lnet]
[56376.979252]  [<ffffffffc14ab96f>] ptlrpc_unregister_reply+0xbf/0x790 [ptlrpc]
[56376.979254]  [<ffffffffa5eae692>] ? del_timer_sync+0x52/0x60
[56376.979269]  [<ffffffffc14b05de>] ptlrpc_expire_one_request+0xee/0x520 [ptlrpc]
[56376.979284]  [<ffffffffc14b0abf>] ptlrpc_expired_set+0xaf/0x1a0 [ptlrpc]
[56376.979302]  [<ffffffffc14e00ec>] ptlrpcd+0x28c/0x560 [ptlrpc]
[56376.979303]  [<ffffffffa5edb600>] ? wake_up_state+0x20/0x20
[56376.979319]  [<ffffffffc14dfe60>] ? ptlrpcd_check+0x590/0x590 [ptlrpc]
[56376.979320]  [<ffffffffa5ec6321>] kthread+0xd1/0xe0
[56376.979321]  [<ffffffffa5ec6250>] ? insert_kthread_work+0x40/0x40
[56376.979323]  [<ffffffffa658dd24>] ret_from_fork_nospec_begin+0xe/0x21
[56376.979324]  [<ffffffffa5ec6250>] ? insert_kthread_work+0x40/0x40
[56376.979335] Code: 13 48 c1 ea 0d 48 98 83 e2 30 48 81 c2 80 b8 01 00 48 03 14 c5 60 e8 b4 a6 4c 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 <41> 8b 40 08 85 c0 74 f6 4d 8b 08 4d 85 c9 74 04 41 0f 18 09 8b 
    ...
RCU sched CPU stalls:
    ...
[56377.012166] INFO: rcu_sched self-detected stall on CPU {
[56377.012167] INFO: rcu_sched self-detected stall on CPU
[56377.012168] INFO: rcu_sched self-detected stall on CPU
[56377.012169] INFO: rcu_sched self-detected stall on CPU
[56377.012170]  47
[56377.012171]  {
[56377.012171]  {
[56377.012172]  {
[56377.012173] } 
[56377.012174]  49
[56377.012174]  56
[56377.012175]  58
[56377.012176] } 
[56377.012176] } 
[56377.012192]  (t=60000 jiffies g=4828650 c=4828649 q=225869)
[56377.012193] Task dump for CPU 33:
[56377.012195] }  (t=60000 jiffies g=4828650 c=4828649 q=225869)
[56377.012196]  (t=60000 jiffies g=4828650 c=4828649 q=225869)
[56377.012199]  (t=60000 jiffies g=4828650 c=4828649 q=225869)
[56377.012201] ptlrpcd_01_07   R  running task        0 12095      2 0x00000088
[56377.012202] Call Trace:
[56377.012235]  [<ffffffffc14ab96f>] ? ptlrpc_unregister_reply+0xbf/0x790 [ptlrpc]
[56377.012250]  [<ffffffffc14b05de>] ? ptlrpc_expire_one_request+0xee/0x520 [ptlrpc]
[56377.012264]  [<ffffffffc14b0abf>] ? ptlrpc_expired_set+0xaf/0x1a0 [ptlrpc]
[56377.012281]  [<ffffffffc14e00ec>] ? ptlrpcd+0x28c/0x560 [ptlrpc]
[56377.012282]  [<ffffffffa5edb600>] ? wake_up_state+0x20/0x20
[56377.012297]  [<ffffffffc14dfe60>] ? ptlrpcd_check+0x590/0x590 [ptlrpc]
[56377.012299]  [<ffffffffa5ec6321>] ? kthread+0xd1/0xe0
[56377.012300]  [<ffffffffa5ec6250>] ? insert_kthread_work+0x40/0x40
[56377.012301]  [<ffffffffa658dd24>] ? ret_from_fork_nospec_begin+0xe/0x21
[56377.012302]  [<ffffffffa5ec6250>] ? insert_kthread_work+0x40/0x40
    ...
Finally the kernel crashes due to a hard lockup:
    ...
[56395.512442] Kernel panic - not syncing: Hard LOCKUP
[56395.512443] CPU: 62 PID: 12108 Comm: ptlrpcd_01_20 Kdump: loaded Tainted: G           OEL ------------   3.10.0-1062.18.1.el7.x86_64 #1
[56395.512444] Hardware name: GIGABYTE H262-Z61-00/MZ62-HD0-00, BIOS R05 08/27/2019
[56395.512444] Call Trace:
[56395.512449]  <NMI>  [<ffffffffa657b416>] dump_stack+0x19/0x1b
[56395.512450]  [<ffffffffa6574a0b>] panic+0xe8/0x21f
[56395.512454]  [<ffffffffa5e2f958>] ? show_regs+0x58/0x290
[56395.512457]  [<ffffffffa5e9b80f>] nmi_panic+0x3f/0x40
[56395.512461]  [<ffffffffa5f4edd1>] watchdog_overflow_callback+0x121/0x140
[56395.512463]  [<ffffffffa5fa84c7>] __perf_event_overflow+0x57/0x100
[56395.512465]  [<ffffffffa5fb1c64>] perf_event_overflow+0x14/0x20
[56395.512469]  [<ffffffffa5e05525>] x86_pmu_handle_irq+0x125/0x180
[56395.512473]  [<ffffffffa6183db8>] ? ioremap_page_range+0x2e8/0x480
[56395.512475]  [<ffffffffa60030c4>] ? vunmap_page_range+0x234/0x470
[56395.512476]  [<ffffffffa6003311>] ? unmap_kernel_range_noflush+0x11/0x20
[56395.512478]  [<ffffffffa624acee>] ? ghes_copy_tofrom_phys+0x10e/0x210
[56395.512479]  [<ffffffffa624ae90>] ? ghes_read_estatus+0xa0/0x190
[56395.512480]  [<ffffffffa5e0668a>] amd_pmu_handle_irq+0x3a/0x80
[56395.512482]  [<ffffffffa6584031>] perf_event_nmi_handler+0x31/0x50
[56395.512484]  [<ffffffffa658593c>] nmi_handle.isra.0+0x8c/0x150
[56395.512486]  [<ffffffffa6585c18>] do_nmi+0x218/0x460
[56395.512487]  [<ffffffffa6584d9c>] end_repeat_nmi+0x1e/0x81
[56395.512488]  [<ffffffffa5f17682>] ? native_queued_spin_lock_slowpath+0x122/0x200
[56395.512489]  [<ffffffffa5f17682>] ? native_queued_spin_lock_slowpath+0x122/0x200
[56395.512490]  [<ffffffffa5f17682>] ? native_queued_spin_lock_slowpath+0x122/0x200
[56395.512492]  <EOE>  <IRQ>  [<ffffffffa65754ee>] queued_spin_lock_slowpath+0xb/0xf
[56395.512494]  [<ffffffffa6583ba7>] _raw_spin_lock_irqsave+0x37/0x40
[56395.512496]  [<ffffffffa5f581ab>] rcu_dump_cpu_stacks+0x4b/0xd0
[56395.512498]  [<ffffffffa5f5b8b2>] rcu_check_callbacks+0x442/0x730
[56395.512501]  [<ffffffffa5f10190>] ? tick_sched_do_timer+0x50/0x50
[56395.512503]  [<ffffffffa5eaf666>] update_process_times+0x46/0x80
[56395.512504]  [<ffffffffa5f0ff00>] tick_sched_handle+0x30/0x70
[56395.512505]  [<ffffffffa5f101c9>] tick_sched_timer+0x39/0x80
[56395.512507]  [<ffffffffa5eca71e>] __hrtimer_run_queues+0x10e/0x270
[56395.512509]  [<ffffffffa5ecac7f>] hrtimer_interrupt+0xaf/0x1d0
[56395.512513]  [<ffffffffa5e5c61b>] local_apic_timer_interrupt+0x3b/0x60
[56395.512516]  [<ffffffffa65929d3>] smp_apic_timer_interrupt+0x43/0x60
[56395.512517]  [<ffffffffa658eefa>] apic_timer_interrupt+0x16a/0x170
[56395.512518]  <EOI>  [<ffffffffa5f17682>] ? native_queued_spin_lock_slowpath+0x122/0x200
[56395.512520]  [<ffffffffa65754ee>] queued_spin_lock_slowpath+0xb/0xf
[56395.512521]  [<ffffffffa6583b20>] _raw_spin_lock+0x20/0x30
[56395.512526]  [<ffffffffc0be9788>] cfs_percpt_lock+0x58/0x110 [libcfs]
[56395.512532]  [<ffffffffc1405788>] LNetMDUnlink+0x78/0x180 [lnet]
[56395.512547]  [<ffffffffc14ab96f>] ptlrpc_unregister_reply+0xbf/0x790 [ptlrpc]
[56395.512562]  [<ffffffffc14b05de>] ptlrpc_expire_one_request+0xee/0x520 [ptlrpc]
[56395.512576]  [<ffffffffc14b0abf>] ptlrpc_expired_set+0xaf/0x1a0 [ptlrpc]
[56395.512591]  [<ffffffffc14e00ec>] ptlrpcd+0x28c/0x560 [ptlrpc]
[56395.512592]  [<ffffffffa5edb600>] ? wake_up_state+0x20/0x20
[56395.512607]  [<ffffffffc14dfe60>] ? ptlrpcd_check+0x590/0x590 [ptlrpc]
[56395.512608]  [<ffffffffa5ec6321>] kthread+0xd1/0xe0
[56395.512609]  [<ffffffffa5ec6250>] ? insert_kthread_work+0x40/0x40
[56395.512610]  [<ffffffffa658dd24>] ret_from_fork_nospec_begin+0xe/0x21
[56395.512612]  [<ffffffffa5ec6250>] ? insert_kthread_work+0x40/0x40
    ...

Environment

  • Red Hat Enterprise Linux 7.7 (kernel-3.10.0-1062.18.1.el7)
  • lustre, libcfs and lnet are installed and loaded.
  • libcfs is an API used throughout Lustre and LNet and provides primitives for process/memory management and debugging.

Subscriber exclusive content

A Red Hat subscription provides unlimited access to our knowledgebase, tools, and much more.

Current Customers and Partners

Log in for full access

Log In

New to Red Hat?

Learn more about Red Hat subscriptions

Using a Red Hat product through a public cloud?

How to access this content