Sleeping or scheduling after sched_cpu_dying() leads to "scheduling while atomic" and BUG at kernel/cpu.c:907!

Solution Verified

Issue

Rarely, during CPU offline operations, the following kernel BUG reports might be triggered:

[44369.425382] CPU29: shutdown
[44369.455042] psci: Retrying again to check for CPU kill
[44369.458362] psci: CPU29 killed.
[44369.555289] BUG: scheduling while atomic: migration/30/210/0x00000001
[44369.561701] Modules linked in: ...
[44369.561778] CPU: 30 PID: 210 Comm: migration/30 Tainted: G        W  OE    --------- -  - 4.18.0-193.6.3.el8_2.aarch64 #1
[44369.561781] Hardware name: ...
[44369.561783] Call trace:
[44369.561791]  dump_backtrace+0x0/0x188
[44369.561794]  show_stack+0x24/0x30
[44369.561801]  dump_stack+0x90/0xb4
[44369.561806]  __schedule_bug+0x70/0x80
[44369.561810]  __schedule+0x698/0x778
[44369.561813]  schedule+0x38/0xb8
[44369.561816]  schedule_timeout+0x264/0x388
[44369.561819]  wait_for_common+0x1a8/0x210
[44369.561822]  wait_for_completion+0x28/0x38
[44369.561826]  flush_work+0x118/0x238
[44369.561829]  __cancel_work_timer+0x12c/0x1a0
[44369.561832]  cancel_delayed_work_sync+0x24/0x30
[44369.561834]  sched_cpu_dying+0x414/0x470
[44369.561839]  cpuhp_invoke_callback+0xa8/0x618
[44369.561842]  take_cpu_down+0x84/0xe0
[44369.561846]  multi_cpu_stop+0x98/0x148
[44369.561849]  cpu_stopper_thread+0xb4/0x170
[44369.561851]  smpboot_thread_fn+0x154/0x1d8
[44369.561853]  kthread+0x130/0x138
[44369.561857]  ret_from_fork+0x10/0x18
[44369.561867] ------------[ cut here ]------------
[44369.561870] kernel BUG at kernel/cpu.c:907!
[44369.571209] Internal error: Oops - BUG: 0 [#1] SMP
[44369.575981] [INFO] xos TOF 3002 - file:tof_dump.c function:tof_dump_handler line:77 dump ICC registers
[44369.602959] Modules linked in: ...
[44369.677592] CPU: 30 PID: 0 Comm: swapper/30 Tainted: G        W  OE    --------- -  - 4.18.0-193.6.3.el8_2.aarch64 #1
[44369.688150] Hardware name: ...
[44369.695337] pstate: 20c00009 (nzCv daif +PAN +UAO)
[44369.700108] pc : cpuhp_report_idle_dead+0x88/0x90
[44369.704789] lr : do_idle+0x16c/0x290
[44369.708346] sp : ffff000016faff10
[44369.711646] x29: ffff000016faff10 x28: 0000000000000000 
[44369.716933] x27: 0000000000000000 x26: 0000000000000000 
[44369.722219] x25: 0000000000000000 x24: ffff0000115dc00c 
[44369.727505] x23: ffff00001128c178 x22: ffff0000115d3708 
[44369.732791] x21: ffff80897fc32130 x20: 000080896ea30000 
[44369.738077] x19: ffff000011202130 x18: 0000000000000060 
[44369.743363] x17: 0000000000000000 x16: 0000000000000000 
[44369.748649] x15: ffffffffffffffff x14: ffff0000115d3708 
[44369.753934] x13: 0000000000014214 x12: ffff000011fc6000 
[44369.759220] x11: ffff00001160f000 x10: 0000000000000d10 
[44369.764506] x9 : ffff000016fafe80 x8 : ffff80894066d970 
[44369.769792] x7 : 00000000000033b0 x6 : 00000409ff409211 
[44369.775078] x5 : 00ffffffffffffff x4 : 0031a4b5a764bd4d 
[44369.780364] x3 : 0000000000000018 x2 : ffff0000115db778 
[44369.785649] x1 : 0000000000000000 x0 : 0000000000000058 
[44369.790937] Process swapper/30 (pid: 0, stack limit = 0x0000000087d5da54)
[44369.797693] Call trace:
[44369.800129]  cpuhp_report_idle_dead+0x88/0x90
[44369.804465]  do_idle+0x16c/0x290
[44369.807677]  cpu_startup_entry+0x28/0x30
[44369.811581]  secondary_start_kernel+0x124/0x138
[44369.816090] Code: a94153f3 f94013f5 a8c37bfd d65f03c0 (d4210000) 
[44369.822154] ---[ end trace 9fa940c2df45e9bf ]---
[44369.831908] Kernel panic - not syncing: Fatal exception
[44369.837109] SMP: stopping secondary CPUs
[44370.904513] SMP: failed to stop secondary CPUs 0-1,31-59
[44370.909808] Kernel Offset: disabled
[44370.913282] CPU features: 0x0002,2ae08a38
[44370.917272] Memory Limit: none
[44370.921289] Starting crashdump kernel...

Another example, from a debug kernel (4.18.0-193.6.3.el8_2.aarch64+debug):

[  133.355772] BUG: sleeping function called from invalid context at kernel/workqueue.c:2959
[  133.368477] in_atomic(): 1, irqs_disabled(): 128, pid: 23, name: migration/2
[  133.369988] no locks held by migration/2/23.
[  133.370890] irq event stamp: 1366
[  133.371839] hardirqs last  enabled at (1365): [<ffff20003fb5ecc0>] _raw_spin_unlock_irq+0x38/0xc0
[  133.373713] hardirqs last disabled at (1366): [<ffff20003e4fbbc4>] multi_cpu_stop+0x244/0x338
[  133.375494] softirqs last  enabled at (0): [<ffff20003e263408>] copy_process.isra.1.part.2+0x1188/0x52b0
[  133.377834] softirqs last disabled at (0): [<0000000000000000>] 0x0
[  133.379404] CPU: 2 PID: 23 Comm: migration/2 Kdump: loaded Not tainted 4.18.0-193.6.3.el8_2.aarch64+debug #1
[  133.381443] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[  133.382970] Call trace:
[  133.383495]  dump_backtrace+0x0/0x308
[  133.384266]  show_stack+0x24/0x30
[  133.384966]  dump_stack+0xe0/0x11c
[  133.385686]  ___might_sleep+0x2c8/0x428
[  133.386520]  __might_sleep+0xa0/0x190
[  133.387306]  flush_work+0xc8/0x928
[  133.388026]  __cancel_work_timer+0x228/0x348
[  133.388930]  cancel_delayed_work_sync+0x24/0x30
[  133.389888]  sched_cpu_dying+0x6c8/0xa10
[  133.390723]  cpuhp_invoke_callback+0x23c/0x3030
[  133.391678]  take_cpu_down+0x124/0x1f8
[  133.392470]  multi_cpu_stop+0x180/0x338
[  133.393283]  cpu_stopper_thread+0x1cc/0x3f0
[  133.394165]  smpboot_thread_fn+0x3bc/0xa10
[  133.395029]  kthread+0x2c8/0x350
[  133.395714]  ret_from_fork+0x10/0x18

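In both traces, the common factor is that the kernel tries to sleep or schedule from within sched_cpu_dying(): sched_cpu_dying() calls cancel_delayed_work_sync(), which can sleep in flush_work(), while the migration/N stopper thread is running in atomic context (in_atomic(): 1, interrupts disabled).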
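To check whether a captured kernel log shows the same signature, the traces can be scanned for an atomic-context BUG report whose call trace contains sched_cpu_dying. The following is a minimal sketch, not a supported tool: the function name find_matching_reports and the sample excerpt are illustrative, and it assumes the log uses the dmesg-style "[timestamp] text" layout shown above.

```python
import re

# Strip the leading "[ 1234.567890] " timestamp that dmesg-style logs carry.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s?")
# A stack frame line looks like " flush_work+0x118/0x238".
FRAME = re.compile(r"^\s*([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$")
# The two report headers seen in this article.
HEADERS = (
    "BUG: scheduling while atomic",
    "BUG: sleeping function called from invalid context",
)


def find_matching_reports(log_text):
    """Return BUG header lines whose call trace includes sched_cpu_dying."""
    reports = []    # list of (header line, [function names in the trace])
    current = None  # report currently being collected, if any
    for raw in log_text.splitlines():
        line = TIMESTAMP.sub("", raw)
        if line.startswith(HEADERS):
            current = (line, [])
            reports.append(current)
        elif "cut here" in line:
            current = None  # a separate oops starts here; stop collecting
        elif current is not None:
            match = FRAME.match(line)
            if match:
                current[1].append(match.group(1))
    return [header for header, frames in reports if "sched_cpu_dying" in frames]


# Shortened excerpt of the first trace in this article (hypothetical input).
SAMPLE = """\
[44369.555289] BUG: scheduling while atomic: migration/30/210/0x00000001
[44369.561783] Call trace:
[44369.561826]  flush_work+0x118/0x238
[44369.561832]  cancel_delayed_work_sync+0x24/0x30
[44369.561834]  sched_cpu_dying+0x414/0x470
[44369.561867] ------------[ cut here ]------------
[44369.561870] kernel BUG at kernel/cpu.c:907!
[44369.800129]  cpuhp_report_idle_dead+0x88/0x90
"""

print(find_matching_reports(SAMPLE))
```

The sketch only matches on function names in the trace, so it reports a candidate for this issue rather than confirming the root cause.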

Environment

  • Red Hat Enterprise Linux 8 (the traces above are from kernel 4.18.0-193.6.3.el8_2 on aarch64)
