stall-warning deadlock due to non-release of rcu_node ->lock spinlock

Solution Verified

Issue

  • The server hangs frequently. The kernel log shows an RCU CPU stall warning, followed by "BUG: scheduling while atomic", hung-task ("blocked for more than 120 seconds") messages, and a soft lockup:
[12091.893384] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[12091.893394] rcu:     Tasks blocked on level-1 rcu_node (CPUs 0-13):
[12091.893406]  (detected by 8, t=60002 jiffies, g=2614757, q=113003)
[12091.893412] rcu: All QSes seen, last rcu_preempt kthread activity 1 (4306759294-4306759293), jiffies_till_next_fqs=3, root ->qsmask 0x1
[12091.896390] BUG: scheduling while atomic: swapper/8/0/0x00000003
[12091.896396] Modules linked in: ...
[12091.896532] Preemption disabled at:
[12091.896533] [<ffffffffb6654fff>] start_secondary+0x5f/0x1e0
[12091.896551] 
[12091.896554] CPU: 8 PID: 0 Comm: swapper/8 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-372.32.1.rt7.189.el8_6.x86_64 #1
[12091.896563] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 05/27/2024
[12091.896566] Call Trace:
[12091.896574]  dump_stack+0x41/0x60
[12091.896589]  ? start_secondary+0x5f/0x1e0
[12091.896597]  __schedule_bug.cold.104+0x87/0x94
[12091.896610]  __schedule+0x593/0x9b0
[12091.896623]  schedule_idle+0x1c/0x40
[12091.896630]  do_idle+0x1db/0x320
[12091.896645]  cpu_startup_entry+0x46/0x50
[12091.896655]  start_secondary+0x19f/0x1e0
[12091.896665]  secondary_startup_64_no_verify+0xc2/0xcb
[12166.005897] INFO: task rcub/1:16 blocked for more than 120 seconds.
[12166.005909]       Tainted: G        W  OE    --------- -  - 4.18.0-372.32.1.rt7.189.el8_6.x86_64 #1
[12166.005916] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12166.005920] task:rcub/1          state:D stack:    0 pid:   16 ppid:     2 flags:0x80004000
[12166.005936] Call Trace:
[12166.005947]  __schedule+0x38b/0x9b0
[12166.005971]  schedule+0x3d/0xf0
[12166.005980]  rt_mutex_slowlock_block.isra.18+0x9c/0x180
[12166.005992]  rt_mutex_slowlock.constprop.21+0xd4/0x140
[12166.006009]  rcu_boost_kthread+0xf6/0x480
[12166.006024]  ? kfree_rcu_shrink_scan+0x250/0x250
[12166.006035]  kthread+0x151/0x170
[12166.006045]  ? set_kthread_struct+0x50/0x50
[12166.006055]  ret_from_fork+0x1f/0x40
        ...
[12166.006286] INFO: task tcpdump:199038 blocked for more than 120 seconds.
        ...
[12166.006566] INFO: task tcpdump:200266 blocked for more than 120 seconds.
        ...
[12267.933782] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [oid_oam_manager:187913]
        ...
[12267.933841] CPU: 1 PID: 187913 Comm: oid_oam_manager Kdump: loaded Tainted: G        W  OE    --------- -  - 4.18.0-372.32.1.rt7.189.el8_6.x86_64 #1
        ...
[12267.933846] RIP: 0010:smp_call_function_single+0xbc/0x1a0
        ...
[12267.933873] Call Trace:
[12267.933875]  ? flush_tlb_func_common.constprop.8+0x2d0/0x2d0
[12267.933880]  ? flush_tlb_func_common.constprop.8+0x2d0/0x2d0
[12267.933884]  flush_tlb_mm_range+0x132/0x190
[12267.933888]  ptep_clear_flush+0x58/0x70
[12267.933892]  wp_page_copy+0x27a/0x580
[12267.933897]  do_wp_page+0xef/0x450
[12267.933900]  __handle_mm_fault+0x6d9/0x9b0
[12267.933905]  handle_mm_fault+0xd1/0x1f0
[12267.933908]  do_user_addr_fault+0x196/0x4d0
[12267.933912]  do_page_fault+0x54/0x1c0
[12267.933915]  ? page_fault+0x8/0x30
[12267.933919]  page_fault+0x1e/0x30
[12267.933923] RIP: 0033:0x7efcc32b517a
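
The message sequence above can be searched for in a saved kernel log. The following is a minimal sketch, assuming the log text has already been captured (for example from dmesg output); the function names and the signature table are illustrative and not part of any Red Hat tool. The patterns are copied from the messages shown in this article. Finding all four only indicates a similar symptom pattern and does not by itself confirm this specific issue.

```python
import re

# Signatures copied from the log excerpt in this article.
# The dictionary keys are illustrative names, not kernel terminology.
SIGNATURES = {
    "rcu_stall": re.compile(r"rcu: INFO: rcu_preempt detected stalls on CPUs/tasks"),
    "sched_atomic": re.compile(r"BUG: scheduling while atomic"),
    "hung_task": re.compile(r"INFO: task \S+ blocked for more than \d+ seconds"),
    "soft_lockup": re.compile(r"watchdog: BUG: soft lockup"),
}

# Leading dmesg timestamp such as "[12091.893384]".
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def find_symptoms(log_text):
    """Map each signature found to the timestamp of its first occurrence.

    The timestamp is None when the matching line has no "[seconds]" prefix.
    """
    found = {}
    for line in log_text.splitlines():
        m = TIMESTAMP.match(line)
        ts = float(m.group(1)) if m else None
        for name, pattern in SIGNATURES.items():
            if name not in found and pattern.search(line):
                found[name] = ts
    return found


def matches_pattern(log_text):
    """Return True when all four symptoms from this article appear in the log."""
    return set(find_symptoms(log_text)) == set(SIGNATURES)
```

For example, passing the contents of a saved dmesg file to find_symptoms() returns the first timestamp of each message type, which makes it easy to see whether the RCU stall preceded the hung-task and soft lockup messages as it does in the excerpt above.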

Environment

  • Red Hat Enterprise Linux for Real Time 8.6.z versions are affected
  • Red Hat Enterprise Linux for Real Time 8.7.z versions older than kernel-rt-4.18.0-425.13.1.rt7.223.el8_7 are affected
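
The affected-version list above can be checked against a kernel release string such as the output of uname -r. This is a rough sketch under stated assumptions: the classify() helper and its return strings are hypothetical, the comparison is a simplified numeric comparison rather than full RPM version ordering, and only release strings carrying an el8_6 or el8_7 suffix are evaluated. Anything else is reported as "unknown" because this article does not cover it.

```python
import re

# Numeric fields of kernel-rt-4.18.0-425.13.1.rt7.223.el8_7, the first
# 8.7.z build this article does not list as affected.
FIXED_8_7 = ((425, 13, 1), 7, 223)

# Matches RT release strings such as 4.18.0-372.32.1.rt7.189.el8_6.x86_64.
_RELEASE = re.compile(r"^4\.18\.0-(\d+(?:\.\d+)*)\.rt(\d+)\.(\d+)\.el8_(\d+)")


def classify(release):
    """Classify a kernel-rt release string against the list in this article.

    Returns "affected", "not listed as affected", or "unknown".
    Simplified numeric comparison; not a full RPM version comparison.
    """
    if release.startswith("kernel-rt-"):
        release = release[len("kernel-rt-"):]
    m = _RELEASE.match(release)
    if not m:
        return "unknown"
    build = tuple(int(part) for part in m.group(1).split("."))
    key = (build, int(m.group(2)), int(m.group(3)))
    minor = int(m.group(4))
    if minor == 6:
        # The article lists 8.6.z versions as affected without a version bound.
        return "affected"
    if minor == 7:
        return "affected" if key < FIXED_8_7 else "not listed as affected"
    return "unknown"
```

With the kernel from the log excerpt, classify("4.18.0-372.32.1.rt7.189.el8_6.x86_64") returns "affected".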
