RHEL 8 system with kernel 4.18.0-425.10.1.el8_7 or later fails to boot and hangs with soft lockup and rcu_sched CPU stall

Solution Verified

Issue

  • After a kernel upgrade from 4.18.0-425.3.1.el8 to 4.18.0-425.10.1.el8_7 or later, a RHEL 8.7 system fails to boot and hangs, logging the following soft lockup and rcu_sched CPU stall messages:
[   72.146997] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:3:1452]
[   72.154433] Modules linked in: mgag200(+) i2c_algo_bit mlx5_core(O+) drm_shmem_helper drm_kms_helper syscopyarea mlxfw(O) sysfillrect pci_hyperv_intf sysimgblt psample fb_sys_fops mlxdevm(O) smartpqi(+) drm scsi_transport_sas mlx_compat(O) crc32c_intel tg3 tls dm_mirror dm_region_hash dm_log dm_mod fuse
[   72.181450] CPU: 0 PID: 1452 Comm: kworker/0:3 Tainted: GOL ----------- 4.18.0-425.19.2.el8_7.x86_64 #1
[   72.192381] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 03/16/2023
[   72.200954] Workqueue: events work_for_cpu_fn
[   72.205334] RIP: 0010:native_queued_spin_lock_slowpath+0x5f/0x1c0
[   72.211464] Code: 71 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 4b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 e9 4e b6 aa 00 8b 37 81
[   72.230342] RSP: 0018:ffffad140f83bbb8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[   72.237954] RAX: 0000000000000101 RBX: 0000000000000010 RCX: 0000000000000000
[   72.245127] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9f494ebec28c
[   72.252300] RBP: ffffad140f83bca0 R08: 00000000006000c0 R09: ffff9f49c3522f80
[   72.259474] R10: ffff9f49c3522fe0 R11: ffffad140f83bd30 R12: ffff9f494ebec1a0
[   72.266649] R13: ffff9f49c3522f80 R14: ffff9f494ebec28c R15: ffffffffc05a3380
[   72.273824] FS:  0000000000000000(0000) GS:ffff9f77bf400000(0000) knlGS:0000000000000000
[   72.281960] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   72.287735] CR2: 00007fdc8d3c08d0 CR3: 00000057f6e10002 CR4: 00000000007706f0
[   72.294908] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   72.302081] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   72.309254] PKRU: 55555554
[   72.311973] Call Trace:
[   72.314430]  _raw_spin_lock+0x1e/0x30
[   72.318112]  cmd_exec+0x1c0/0x9d0 [mlx5_core]
[   72.322552]  mlx5_cmd_do+0x1e/0x40 [mlx5_core]
[   72.327077]  mlx5_cmd_exec+0x17/0x30 [mlx5_core]
[   72.331775]  mlx5_core_set_issi+0x66/0x120 [mlx5_core]
[   72.336996]  mlx5_function_setup+0x137/0x620 [mlx5_core]
[   72.342390]  ? proc_register+0xcf/0x140
[   72.346246]  mlx5_init_one+0x34/0x110 [mlx5_core]
[   72.351028]  probe_one+0x234/0x300 [mlx5_core]
[   72.355551]  local_pci_probe+0x42/0x80
[   72.359320]  work_for_cpu_fn+0x16/0x20
[   72.363088]  process_one_work+0x1a7/0x360
[   72.367118]  ? create_worker+0x1a0/0x1a0
[   72.371061]  worker_thread+0x1ce/0x390
[   72.374828]  ? create_worker+0x1a0/0x1a0
[   72.378771]  kthread+0x10b/0x130
[   72.382016]  ? set_kthread_struct+0x50/0x50
[   72.386220]  ret_from_fork+0x1f/0x40
[   79.860997] rcu: INFO: rcu_sched self-detected stall on CPU
[   79.866598] rcu: 0-....: (59516 ticks this GP) idle=2de/1/0x4000000000000002 softirq=783/783 fqs=12084 
[   79.876135] (t=60016 jiffies g=1373 q=2880)
[   79.880426] NMI backtrace for cpu 0
[   79.883932] CPU: 0 PID: 1452 Comm: kworker/0:3 Tainted: GOL ----------- 4.18.0-425.19.2.el8_7.x86_64 #1
[   79.894863] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 03/16/2023
[   79.903434] Workqueue: events work_for_cpu_fn
[   79.907814] Call Trace:
[   79.910272]  <IRQ>
[   79.912292]  dump_stack+0x41/0x60
[   79.915626]  nmi_cpu_backtrace.cold.8+0x13/0x4f
[   79.920180]  ? lapic_can_unplug_cpu.cold.30+0x43/0x43
[   79.925263]  nmi_trigger_cpumask_backtrace+0xe9/0xee
[   79.930254]  rcu_dump_cpu_stacks+0xc8/0xfc
[   79.934375]  rcu_sched_clock_irq.cold.101+0xde/0x215
[   79.939367]  ? tick_sched_do_timer+0x50/0x50
[   79.943661]  ? tick_sched_do_timer+0x50/0x50
[   79.947951]  update_process_times+0x55/0x80
[   79.952158]  tick_sched_handle+0x22/0x60
[   79.956101]  tick_sched_timer+0x37/0x80
[   79.959957]  __hrtimer_run_queues+0x101/0x280
[   79.964338]  hrtimer_interrupt+0x100/0x220
[   79.968456]  ? sched_clock+0x5/0x10
[   79.971964]  smp_apic_timer_interrupt+0x6a/0x130
[   79.976606]  apic_timer_interrupt+0xf/0x20
[   79.980722]  </IRQ>
[   79.982828] RIP: 0010:native_queued_spin_lock_slowpath+0x5f/0x1c0
[   79.988956] Code: 71 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 4b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 e9 4e b6 aa 00 8b 37 81
[   80.007835] RSP: 0018:ffffad140f83bbb8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[   80.015446] RAX: 0000000000000101 RBX: 0000000000000010 RCX: 0000000000000000
[   80.022619] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9f494ebec28c
[   80.029794] RBP: ffffad140f83bca0 R08: 00000000006000c0 R09: ffff9f49c3522f80
[   80.036967] R10: ffff9f49c3522fe0 R11: ffffad140f83bd30 R12: ffff9f494ebec1a0
[   80.044140] R13: ffff9f49c3522f80 R14: ffff9f494ebec28c R15: ffffffffc05a3380
[   80.051314]  _raw_spin_lock+0x1e/0x30
[   80.054995]  cmd_exec+0x1c0/0x9d0 [mlx5_core]
[   80.059434]  mlx5_cmd_do+0x1e/0x40 [mlx5_core]
[   80.063957]  mlx5_cmd_exec+0x17/0x30 [mlx5_core]
[   80.068654]  mlx5_core_set_issi+0x66/0x120 [mlx5_core]
[   80.073874]  mlx5_function_setup+0x137/0x620 [mlx5_core]
[   80.079271]  ? proc_register+0xcf/0x140
[   80.083125]  mlx5_init_one+0x34/0x110 [mlx5_core]
[   80.087909]  probe_one+0x234/0x300 [mlx5_core]
[   80.092431]  local_pci_probe+0x42/0x80
[   80.096199]  work_for_cpu_fn+0x16/0x20
[   80.099968]  process_one_work+0x1a7/0x360
[   80.103997]  ? create_worker+0x1a0/0x1a0
[   80.107937]  worker_thread+0x1ce/0x390
[   80.111704]  ? create_worker+0x1a0/0x1a0
[   80.115646]  kthread+0x10b/0x130
[   80.118891]  ? set_kthread_struct+0x50/0x50
[   80.123095]  ret_from_fork+0x1f/0x40
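
When triaging a log like the one above, it can help to pull out which loaded modules carry the out-of-tree (O) flag and which modules actually appear in the call trace. The following is a minimal sketch of that idea, not a Red Hat tool: the function names and the shortened sample text are hypothetical, and the parsing assumes the usual "Modules linked in:" and "symbol+offset/size [module]" line shapes seen in this log.

```python
import re


def flagged_modules(log_text):
    """Map each module on a 'Modules linked in:' line to its flag string.

    'mlx5_core(O+)' becomes {'mlx5_core': 'O+'}; a module without
    parentheses maps to an empty string.
    """
    modules = {}
    for line in log_text.splitlines():
        _, marker, rest = line.partition("Modules linked in:")
        if not marker:
            continue
        for token in rest.split():
            match = re.fullmatch(r"([\w-]+)(?:\(([^)]*)\))?", token)
            if match:
                modules[match.group(1)] = match.group(2) or ""
    return modules


def out_of_tree(modules):
    """Return the sorted names of modules whose flags include 'O'."""
    return sorted(name for name, flags in modules.items() if "O" in flags)


def modules_in_trace(log_text):
    """Return the sorted module names shown as '[name]' at the end of trace lines."""
    return sorted(set(re.findall(r"\[(\w+)\]\s*$", log_text, flags=re.MULTILINE)))


# Shortened, hypothetical excerpt in the same shape as the log above.
SAMPLE = """\
[   72.154433] Modules linked in: mgag200(+) mlx5_core(O+) mlxfw(O) tg3 fuse
[   72.318112]  cmd_exec+0x1c0/0x9d0 [mlx5_core]
[   72.355551]  local_pci_probe+0x42/0x80
"""

if __name__ == "__main__":
    mods = flagged_modules(SAMPLE)
    print("out-of-tree:", out_of_tree(mods))
    print("in call trace:", modules_in_trace(SAMPLE))
```

A module that is both flagged (O) and present in the stalled call trace, as mlx5_core is here, is a reasonable first suspect, though the overlap alone does not prove the cause.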

Environment

  • Red Hat Enterprise Linux release 8.7 (Ootpa)
  • kernel-4.18.0-425.10.1.el8_7 or higher
  • Mellanox Technologies/HPE: Out-of-tree (O) kernel module: [mlx5_core]
  • Third-party RPM: kmod-mlnx-ofa_kernel-5.8-OFED.5.8.1.1.2.1.rhel8u7.x86_64
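
To check whether a given kernel release falls at or above the first affected version listed above, a rough comparison can be sketched as follows. This is an illustrative simplification: it compares only the numeric fields before ".el" and is not RPM's own version-comparison algorithm. The helper names are hypothetical, and because this article states only a lower bound ("or higher"), the check does not know about any later fixed kernel.

```python
import re


def kernel_key(release):
    """Return the numeric fields that precede '.el' as a tuple of ints.

    '4.18.0-425.10.1.el8_7' becomes (4, 18, 0, 425, 10, 1).
    """
    prefix = release.split(".el", 1)[0]
    return tuple(int(field) for field in re.findall(r"\d+", prefix))


# First affected kernel according to the Environment section.
AFFECTED_FROM = kernel_key("4.18.0-425.10.1.el8_7")


def at_or_above_affected(release):
    """True if the release sorts at or above the first affected kernel."""
    return kernel_key(release) >= AFFECTED_FROM


if __name__ == "__main__":
    for release in ("4.18.0-425.3.1.el8",
                    "4.18.0-425.10.1.el8_7",
                    "4.18.0-425.19.2.el8_7.x86_64"):
        print(release, at_or_above_affected(release))
```

On a live system the release string could come from `platform.release()` in the Python standard library. A matching kernel version is only one of the conditions listed in this section; the out-of-tree mlx5_core module must also be in use.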
