RHEL8 system with the kernel 4.18.0-425.10.1.el8_7 or higher fails to boot and hangs with soft lockup and rcu_sched CPU stall
Issue
- A RHEL 8.7 system fails to boot and hangs with the following soft lockup and rcu_sched CPU stall events after a kernel upgrade from version 4.18.0-425.3.1.el8 to 4.18.0-425.10.1.el8_7 or higher.
[ 72.146997] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:3:1452]
[ 72.154433] Modules linked in: mgag200(+) i2c_algo_bit mlx5_core(O+) drm_shmem_helper drm_kms_helper syscopyarea mlxfw(O) sysfillrect pci_hyperv_intf sysimgblt psample fb_sys_fops mlxdevm(O) smartpqi(+) drm scsi_transport_sas mlx_compat(O) crc32c_intel tg3 tls dm_mirror dm_region_hash dm_log dm_mod fuse
[ 72.181450] CPU: 0 PID: 1452 Comm: kworker/0:3 Tainted: GOL ----------- 4.18.0-425.19.2.el8_7.x86_64 #1
[ 72.192381] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 03/16/2023
[ 72.200954] Workqueue: events work_for_cpu_fn
[ 72.205334] RIP: 0010:native_queued_spin_lock_slowpath+0x5f/0x1c0
[ 72.211464] Code: 71 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 4b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 e9 4e b6 aa 00 8b 37 81
[ 72.230342] RSP: 0018:ffffad140f83bbb8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 72.237954] RAX: 0000000000000101 RBX: 0000000000000010 RCX: 0000000000000000
[ 72.245127] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9f494ebec28c
[ 72.252300] RBP: ffffad140f83bca0 R08: 00000000006000c0 R09: ffff9f49c3522f80
[ 72.259474] R10: ffff9f49c3522fe0 R11: ffffad140f83bd30 R12: ffff9f494ebec1a0
[ 72.266649] R13: ffff9f49c3522f80 R14: ffff9f494ebec28c R15: ffffffffc05a3380
[ 72.273824] FS: 0000000000000000(0000) GS:ffff9f77bf400000(0000) knlGS:0000000000000000
[ 72.281960] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 72.287735] CR2: 00007fdc8d3c08d0 CR3: 00000057f6e10002 CR4: 00000000007706f0
[ 72.294908] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 72.302081] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 72.309254] PKRU: 55555554
[ 72.311973] Call Trace:
[ 72.314430] _raw_spin_lock+0x1e/0x30
[ 72.318112] cmd_exec+0x1c0/0x9d0 [mlx5_core]
[ 72.322552] mlx5_cmd_do+0x1e/0x40 [mlx5_core]
[ 72.327077] mlx5_cmd_exec+0x17/0x30 [mlx5_core]
[ 72.331775] mlx5_core_set_issi+0x66/0x120 [mlx5_core]
[ 72.336996] mlx5_function_setup+0x137/0x620 [mlx5_core]
[ 72.342390] ? proc_register+0xcf/0x140
[ 72.346246] mlx5_init_one+0x34/0x110 [mlx5_core]
[ 72.351028] probe_one+0x234/0x300 [mlx5_core]
[ 72.355551] local_pci_probe+0x42/0x80
[ 72.359320] work_for_cpu_fn+0x16/0x20
[ 72.363088] process_one_work+0x1a7/0x360
[ 72.367118] ? create_worker+0x1a0/0x1a0
[ 72.371061] worker_thread+0x1ce/0x390
[ 72.374828] ? create_worker+0x1a0/0x1a0
[ 72.378771] kthread+0x10b/0x130
[ 72.382016] ? set_kthread_struct+0x50/0x50
[ 72.386220] ret_from_fork+0x1f/0x40
[ 79.860997] rcu: INFO: rcu_sched self-detected stall on CPU
[ 79.866598] rcu: 0-....: (59516 ticks this GP) idle=2de/1/0x4000000000000002 softirq=783/783 fqs=12084
[ 79.876135] (t=60016 jiffies g=1373 q=2880)
[ 79.880426] NMI backtrace for cpu 0
[ 79.883932] CPU: 0 PID: 1452 Comm: kworker/0:3 Tainted: GOL ----------- 4.18.0-425.19.2.el8_7.x86_64 #1
[ 79.894863] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 03/16/2023
[ 79.903434] Workqueue: events work_for_cpu_fn
[ 79.907814] Call Trace:
[ 79.910272] <IRQ>
[ 79.912292] dump_stack+0x41/0x60
[ 79.915626] nmi_cpu_backtrace.cold.8+0x13/0x4f
[ 79.920180] ? lapic_can_unplug_cpu.cold.30+0x43/0x43
[ 79.925263] nmi_trigger_cpumask_backtrace+0xe9/0xee
[ 79.930254] rcu_dump_cpu_stacks+0xc8/0xfc
[ 79.934375] rcu_sched_clock_irq.cold.101+0xde/0x215
[ 79.939367] ? tick_sched_do_timer+0x50/0x50
[ 79.943661] ? tick_sched_do_timer+0x50/0x50
[ 79.947951] update_process_times+0x55/0x80
[ 79.952158] tick_sched_handle+0x22/0x60
[ 79.956101] tick_sched_timer+0x37/0x80
[ 79.959957] __hrtimer_run_queues+0x101/0x280
[ 79.964338] hrtimer_interrupt+0x100/0x220
[ 79.968456] ? sched_clock+0x5/0x10
[ 79.971964] smp_apic_timer_interrupt+0x6a/0x130
[ 79.976606] apic_timer_interrupt+0xf/0x20
[ 79.980722] </IRQ>
[ 79.982828] RIP: 0010:native_queued_spin_lock_slowpath+0x5f/0x1c0
[ 79.988956] Code: 71 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 4b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 e9 4e b6 aa 00 8b 37 81
[ 80.007835] RSP: 0018:ffffad140f83bbb8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 80.015446] RAX: 0000000000000101 RBX: 0000000000000010 RCX: 0000000000000000
[ 80.022619] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9f494ebec28c
[ 80.029794] RBP: ffffad140f83bca0 R08: 00000000006000c0 R09: ffff9f49c3522f80
[ 80.036967] R10: ffff9f49c3522fe0 R11: ffffad140f83bd30 R12: ffff9f494ebec1a0
[ 80.044140] R13: ffff9f49c3522f80 R14: ffff9f494ebec28c R15: ffffffffc05a3380
[ 80.051314] _raw_spin_lock+0x1e/0x30
[ 80.054995] cmd_exec+0x1c0/0x9d0 [mlx5_core]
[ 80.059434] mlx5_cmd_do+0x1e/0x40 [mlx5_core]
[ 80.063957] mlx5_cmd_exec+0x17/0x30 [mlx5_core]
[ 80.068654] mlx5_core_set_issi+0x66/0x120 [mlx5_core]
[ 80.073874] mlx5_function_setup+0x137/0x620 [mlx5_core]
[ 80.079271] ? proc_register+0xcf/0x140
[ 80.083125] mlx5_init_one+0x34/0x110 [mlx5_core]
[ 80.087909] probe_one+0x234/0x300 [mlx5_core]
[ 80.092431] local_pci_probe+0x42/0x80
[ 80.096199] work_for_cpu_fn+0x16/0x20
[ 80.099968] process_one_work+0x1a7/0x360
[ 80.103997] ? create_worker+0x1a0/0x1a0
[ 80.107937] worker_thread+0x1ce/0x390
[ 80.111704] ? create_worker+0x1a0/0x1a0
[ 80.115646] kthread+0x10b/0x130
[ 80.118891] ? set_kthread_struct+0x50/0x50
[ 80.123095] ret_from_fork+0x1f/0x40
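The "Modules linked in:" line of the trace above already shows which loaded modules are out-of-tree: the kernel appends `O` in parentheses to such modules, and `+` to modules that are still being loaded. As a minimal sketch (the function names `parse_modules` and `out_of_tree` are hypothetical helpers, not part of any Red Hat tool), the line can be parsed like this to list the out-of-tree modules when triaging a similar log:

```python
import re


def parse_modules(line):
    """Return {module_name: flags} from a 'Modules linked in:' log line.

    Flags are the characters the kernel prints in parentheses after a
    module name, e.g. 'O' (out-of-tree) or '+' (module is being loaded).
    """
    _, _, tail = line.partition("Modules linked in:")
    modules = {}
    for token in tail.split():
        match = re.fullmatch(r"([\w-]+)(?:\(([^)]*)\))?", token)
        if match:
            modules[match.group(1)] = match.group(2) or ""
    return modules


def out_of_tree(modules):
    """Return the sorted names of modules carrying the 'O' flag."""
    return sorted(name for name, flags in modules.items() if "O" in flags)


# Shortened copy of the "Modules linked in:" line from the trace above.
line = (
    "[ 72.154433] Modules linked in: mgag200(+) i2c_algo_bit mlx5_core(O+) "
    "drm_shmem_helper mlxfw(O) mlxdevm(O) smartpqi(+) mlx_compat(O) tg3"
)
mods = parse_modules(line)
print(out_of_tree(mods))
```

For this trace the out-of-tree modules all belong to the Mellanox driver stack, which matches the `[mlx5_core]` frames in both call traces.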
Environment
- Red Hat Enterprise Linux release 8.7 (Ootpa)
- kernel-4.18.0-425.10.1.el8_7 or higher
- Mellanox Technologies/HPE:
  - Out-of-tree (O) kernel module: [mlx5_core]
  - Third-party RPM: kmod-mlnx-ofa_kernel-5.8-OFED.5.8.1.1.2.1.rhel8u7.x86_64
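To check whether a given kernel falls in the version range listed above, the numeric parts of the release string can be compared. The following is a rough sketch only: `numeric_release` and `in_listed_range` are hypothetical names, and the comparison is a simplification that ignores the `.el8_7` suffix rather than implementing full RPM version ordering. Falling in this range does not by itself mean a system is affected; the environment above also includes the out-of-tree mlx5_core module.

```python
import re

# First kernel release named in the Environment list above.
LISTED_FROM = "4.18.0-425.10.1.el8_7"


def numeric_release(kernel):
    """Extract the numeric version and release fields as a tuple of ints.

    '4.18.0-425.19.2.el8_7.x86_64' -> (4, 18, 0, 425, 19, 2)
    Everything from the dist tag ('.el8...') onward is ignored.
    """
    match = re.match(r"(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)", kernel)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {kernel!r}")
    return tuple(int(p) for part in match.groups() for p in part.split("."))


def in_listed_range(kernel):
    """True if the kernel is at or above the first listed release."""
    return numeric_release(kernel) >= numeric_release(LISTED_FROM)


for release in (
    "4.18.0-425.3.1.el8",            # kernel before the upgrade
    "4.18.0-425.10.1.el8_7",         # first listed release
    "4.18.0-425.19.2.el8_7.x86_64",  # kernel shown in the trace
):
    print(release, in_listed_range(release))
```

On a running system, the string to test could be taken from `platform.release()` in the Python standard library.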