stall-warning deadlock due to non-release of rcu_node ->lock spinlock
Issue
- The server hangs frequently. The kernel log shows an RCU CPU stall message followed by "BUG: scheduling while atomic", blocked task messages, and a soft lockup.
[12091.893384] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[12091.893394] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-13):
[12091.893406] (detected by 8, t=60002 jiffies, g=2614757, q=113003)
[12091.893412] rcu: All QSes seen, last rcu_preempt kthread activity 1 (4306759294-4306759293), jiffies_till_next_fqs=3, root ->qsmask 0x1
[12091.896390] BUG: scheduling while atomic: swapper/8/0/0x00000003
[12091.896396] Modules linked in: ...
[12091.896532] Preemption disabled at:
[12091.896533] [<ffffffffb6654fff>] start_secondary+0x5f/0x1e0
[12091.896551]
[12091.896554] CPU: 8 PID: 0 Comm: swapper/8 Kdump: loaded Tainted: G OE --------- - - 4.18.0-372.32.1.rt7.189.el8_6.x86_64 #1
[12091.896563] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 05/27/2024
[12091.896566] Call Trace:
[12091.896574] dump_stack+0x41/0x60
[12091.896589] ? start_secondary+0x5f/0x1e0
[12091.896597] __schedule_bug.cold.104+0x87/0x94
[12091.896610] __schedule+0x593/0x9b0
[12091.896623] schedule_idle+0x1c/0x40
[12091.896630] do_idle+0x1db/0x320
[12091.896645] cpu_startup_entry+0x46/0x50
[12091.896655] start_secondary+0x19f/0x1e0
[12091.896665] secondary_startup_64_no_verify+0xc2/0xcb
[12166.005897] INFO: task rcub/1:16 blocked for more than 120 seconds.
[12166.005909] Tainted: G W OE --------- - - 4.18.0-372.32.1.rt7.189.el8_6.x86_64 #1
[12166.005916] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12166.005920] task:rcub/1 state:D stack: 0 pid: 16 ppid: 2 flags:0x80004000
[12166.005936] Call Trace:
[12166.005947] __schedule+0x38b/0x9b0
[12166.005971] schedule+0x3d/0xf0
[12166.005980] rt_mutex_slowlock_block.isra.18+0x9c/0x180
[12166.005992] rt_mutex_slowlock.constprop.21+0xd4/0x140
[12166.006009] rcu_boost_kthread+0xf6/0x480
[12166.006024] ? kfree_rcu_shrink_scan+0x250/0x250
[12166.006035] kthread+0x151/0x170
[12166.006045] ? set_kthread_struct+0x50/0x50
[12166.006055] ret_from_fork+0x1f/0x40
...
[12166.006286] INFO: task tcpdump:199038 blocked for more than 120 seconds.
...
[12166.006566] INFO: task tcpdump:200266 blocked for more than 120 seconds.
...
[12267.933782] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [oid_oam_manager:187913]
...
[12267.933841] CPU: 1 PID: 187913 Comm: oid_oam_manager Kdump: loaded Tainted: G W OE --------- - - 4.18.0-372.32.1.rt7.189.el8_6.x86_64 #1
...
[12267.933846] RIP: 0010:smp_call_function_single+0xbc/0x1a0
...
[12267.933873] Call Trace:
[12267.933875] ? flush_tlb_func_common.constprop.8+0x2d0/0x2d0
[12267.933880] ? flush_tlb_func_common.constprop.8+0x2d0/0x2d0
[12267.933884] flush_tlb_mm_range+0x132/0x190
[12267.933888] ptep_clear_flush+0x58/0x70
[12267.933892] wp_page_copy+0x27a/0x580
[12267.933897] do_wp_page+0xef/0x450
[12267.933900] __handle_mm_fault+0x6d9/0x9b0
[12267.933905] handle_mm_fault+0xd1/0x1f0
[12267.933908] do_user_addr_fault+0x196/0x4d0
[12267.933912] do_page_fault+0x54/0x1c0
[12267.933915] ? page_fault+0x8/0x30
[12267.933919] page_fault+0x1e/0x30
[12267.933923] RIP: 0033:0x7efcc32b517a
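The message sequence in the log above can be looked for in a saved kernel log. The following is a minimal sketch, not a diagnostic tool: the function name `find_signatures` and the signature labels are hypothetical, and the patterns are copied only from the messages shown in this article, so other wordings of the same warnings would not match.

```python
import re

# Patterns taken from the log excerpt in this article (labels are hypothetical).
SIGNATURES = {
    "rcu_stall": re.compile(r"rcu: INFO: rcu_preempt detected stalls"),
    "sched_atomic": re.compile(r"BUG: scheduling while atomic"),
    "hung_task": re.compile(r"INFO: task \S+ blocked for more than \d+ seconds"),
    "soft_lockup": re.compile(r"watchdog: BUG: soft lockup"),
}


def find_signatures(log_text):
    """Return the signature labels in the order they first appear in the log."""
    seen = []
    for line in log_text.splitlines():
        for name, pattern in SIGNATURES.items():
            if name not in seen and pattern.search(line):
                seen.append(name)
    return seen
```

A log that yields all four labels, starting with `rcu_stall`, resembles the sequence described in this article; a match alone does not prove the same root cause.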
Environment
- Red Hat Enterprise Linux for Real Time 8.6.z versions are affected
- Red Hat Enterprise Linux for Real Time 8.7.z versions with a kernel older than kernel-rt-4.18.0-425.13.1.rt7.223.el8_7 are affected
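To check a running kernel against the list above, the release string can be compared with the 8.7.z kernel-rt version named there. This is a rough sketch under stated assumptions: the function names are hypothetical, the comparison looks only at the numeric fields and is not RPM's own version comparison, and a `False` result means "not in the list above", not "known to be unaffected".

```python
import re

# First 8.7.z kernel-rt release that the Environment list does not mark as affected.
FIXED_8_7 = "4.18.0-425.13.1.rt7.223.el8_7"


def release_key(release):
    """Turn a kernel release string into a tuple of integers for comparison."""
    # Keep only the part before the ".el8_N" dist tag, then collect its numbers.
    core = re.split(r"\.el\d", release)[0]
    return tuple(int(n) for n in re.findall(r"\d+", core))


def is_listed_as_affected(release):
    """Return True if the release falls under the Environment list in this article."""
    if ".rt" not in release:
        # Assumption: only kernel-rt (Real Time) builds are covered by the list.
        return False
    if ".el8_6" in release:
        return True
    if ".el8_7" in release:
        return release_key(release) < release_key(FIXED_8_7)
    return False
```

On a live system the input would typically come from `platform.release()` or the output of `uname -r`, for example `is_listed_as_affected("4.18.0-372.32.1.rt7.189.el8_6.x86_64")` for the kernel shown in the log above.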