Sleeping or scheduling after sched_cpu_dying() leads to "scheduling while atomic" and BUG at kernel/cpu.c:907!

Solution Verified - Updated -

Issue

In rare cases, while a CPU is being taken offline, the following BUG_ON()s might trigger:

[44369.425382] CPU29: shutdown
[44369.455042] psci: Retrying again to check for CPU kill
[44369.458362] psci: CPU29 killed.
[44369.555289] BUG: scheduling while atomic: migration/30/210/0x00000001
[44369.561701] Modules linked in: ...
[44369.561778] CPU: 30 PID: 210 Comm: migration/30 Tainted: G        W  OE    --------- -  - 4.18.0-193.6.3.el8_2.aarch64 #1
[44369.561781] Hardware name: ...
[44369.561783] Call trace:
[44369.561791]  dump_backtrace+0x0/0x188
[44369.561794]  show_stack+0x24/0x30
[44369.561801]  dump_stack+0x90/0xb4
[44369.561806]  __schedule_bug+0x70/0x80
[44369.561810]  __schedule+0x698/0x778
[44369.561813]  schedule+0x38/0xb8
[44369.561816]  schedule_timeout+0x264/0x388
[44369.561819]  wait_for_common+0x1a8/0x210
[44369.561822]  wait_for_completion+0x28/0x38
[44369.561826]  flush_work+0x118/0x238
[44369.561829]  __cancel_work_timer+0x12c/0x1a0
[44369.561832]  cancel_delayed_work_sync+0x24/0x30
[44369.561834]  sched_cpu_dying+0x414/0x470
[44369.561839]  cpuhp_invoke_callback+0xa8/0x618
[44369.561842]  take_cpu_down+0x84/0xe0
[44369.561846]  multi_cpu_stop+0x98/0x148
[44369.561849]  cpu_stopper_thread+0xb4/0x170
[44369.561851]  smpboot_thread_fn+0x154/0x1d8
[44369.561853]  kthread+0x130/0x138
[44369.561857]  ret_from_fork+0x10/0x18
[44369.561867] ------------[ cut here ]------------
[44369.561870] kernel BUG at kernel/cpu.c:907!
[44369.571209] Internal error: Oops - BUG: 0 [#1] SMP
[44369.575981] [INFO] xos TOF 3002 - file:tof_dump.c function:tof_dump_handler line:77 dump ICC registers
[44369.602959] Modules linked in: ...
[44369.677592] CPU: 30 PID: 0 Comm: swapper/30 Tainted: G        W  OE    --------- -  - 4.18.0-193.6.3.el8_2.aarch64 #1
[44369.688150] Hardware name: ...
[44369.695337] pstate: 20c00009 (nzCv daif +PAN +UAO)
[44369.700108] pc : cpuhp_report_idle_dead+0x88/0x90
[44369.704789] lr : do_idle+0x16c/0x290
[44369.708346] sp : ffff000016faff10
[44369.711646] x29: ffff000016faff10 x28: 0000000000000000 
[44369.716933] x27: 0000000000000000 x26: 0000000000000000 
[44369.722219] x25: 0000000000000000 x24: ffff0000115dc00c 
[44369.727505] x23: ffff00001128c178 x22: ffff0000115d3708 
[44369.732791] x21: ffff80897fc32130 x20: 000080896ea30000 
[44369.738077] x19: ffff000011202130 x18: 0000000000000060 
[44369.743363] x17: 0000000000000000 x16: 0000000000000000 
[44369.748649] x15: ffffffffffffffff x14: ffff0000115d3708 
[44369.753934] x13: 0000000000014214 x12: ffff000011fc6000 
[44369.759220] x11: ffff00001160f000 x10: 0000000000000d10 
[44369.764506] x9 : ffff000016fafe80 x8 : ffff80894066d970 
[44369.769792] x7 : 00000000000033b0 x6 : 00000409ff409211 
[44369.775078] x5 : 00ffffffffffffff x4 : 0031a4b5a764bd4d 
[44369.780364] x3 : 0000000000000018 x2 : ffff0000115db778 
[44369.785649] x1 : 0000000000000000 x0 : 0000000000000058 
[44369.790937] Process swapper/30 (pid: 0, stack limit = 0x0000000087d5da54)
[44369.797693] Call trace:
[44369.800129]  cpuhp_report_idle_dead+0x88/0x90
[44369.804465]  do_idle+0x16c/0x290
[44369.807677]  cpu_startup_entry+0x28/0x30
[44369.811581]  secondary_start_kernel+0x124/0x138
[44369.816090] Code: a94153f3 f94013f5 a8c37bfd d65f03c0 (d4210000) 
[44369.822154] ---[ end trace 9fa940c2df45e9bf ]---
[44369.831908] Kernel panic - not syncing: Fatal exception
[44369.837109] SMP: stopping secondary CPUs
[44370.904513] SMP: failed to stop secondary CPUs 0-1,31-59
[44370.909808] Kernel Offset: disabled
[44370.913282] CPU features: 0x0002,2ae08a38
[44370.917272] Memory Limit: none
[44370.921289] Starting crashdump kernel...

Another example, from a debug kernel (4.18.0-193.6.3.el8_2.aarch64+debug):

[  133.355772] BUG: sleeping function called from invalid context at kernel/workqueue.c:2959
[  133.368477] in_atomic(): 1, irqs_disabled(): 128, pid: 23, name: migration/2
[  133.369988] no locks held by migration/2/23.
[  133.370890] irq event stamp: 1366
[  133.371839] hardirqs last  enabled at (1365): [<ffff20003fb5ecc0>] _raw_spin_unlock_irq+0x38/0xc0
[  133.373713] hardirqs last disabled at (1366): [<ffff20003e4fbbc4>] multi_cpu_stop+0x244/0x338
[  133.375494] softirqs last  enabled at (0): [<ffff20003e263408>] copy_process.isra.1.part.2+0x1188/0x52b0
[  133.377834] softirqs last disabled at (0): [<0000000000000000>] 0x0
[  133.379404] CPU: 2 PID: 23 Comm: migration/2 Kdump: loaded Not tainted 4.18.0-193.6.3.el8_2.aarch64+debug #1
[  133.381443] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[  133.382970] Call trace:
[  133.383495]  dump_backtrace+0x0/0x308
[  133.384266]  show_stack+0x24/0x30
[  133.384966]  dump_stack+0xe0/0x11c
[  133.385686]  ___might_sleep+0x2c8/0x428
[  133.386520]  __might_sleep+0xa0/0x190
[  133.387306]  flush_work+0xc8/0x928
[  133.388026]  __cancel_work_timer+0x228/0x348
[  133.388930]  cancel_delayed_work_sync+0x24/0x30
[  133.389888]  sched_cpu_dying+0x6c8/0xa10
[  133.390723]  cpuhp_invoke_callback+0x23c/0x3030
[  133.391678]  take_cpu_down+0x124/0x1f8
[  133.392470]  multi_cpu_stop+0x180/0x338
[  133.393283]  cpu_stopper_thread+0x1cc/0x3f0
[  133.394165]  smpboot_thread_fn+0x3bc/0xa10
[  133.395029]  kthread+0x2c8/0x350
[  133.395714]  ret_from_fork+0x10/0x18

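What both traces have in common is that sched_cpu_dying() calls cancel_delayed_work_sync(), which sleeps in flush_work(). At that point the migration thread is running under multi_cpu_stop() in atomic context with interrupts disabled, so it must not sleep or schedule.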
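To check whether a saved console log or vmcore-dmesg shows the same signature, the traces above can be matched mechanically. The following is a minimal sketch, not a Red Hat tool: the function name find_matching_reports and the constants are hypothetical, and it assumes the log uses the "[timestamp]  symbol+0xoff/0xsize" frame format shown in the examples above.

```python
import re

# Messages that open a report of sleeping/scheduling in atomic context.
BUG_MARKERS = (
    "BUG: scheduling while atomic",
    "BUG: sleeping function called from invalid context",
)

# Matches a call-trace frame such as "[44369.561834]  sched_cpu_dying+0x414/0x470"
# and captures the bare symbol name.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$"
)


def find_matching_reports(log_text):
    """Return the BUG header lines whose call trace includes sched_cpu_dying."""
    matches = []
    header = None      # BUG line of the report currently being read
    in_trace = False   # True once "Call trace:" was seen for that report
    for line in log_text.splitlines():
        if any(marker in line for marker in BUG_MARKERS):
            header, in_trace = line.strip(), False
            continue
        if header is None:
            continue
        if "Call trace:" in line:
            in_trace = True
            continue
        if in_trace:
            frame = FRAME_RE.match(line)
            if frame is None:
                # The first non-frame line ends this trace.
                header, in_trace = None, False
            elif frame.group(1) == "sched_cpu_dying":
                matches.append(header)
                header, in_trace = None, False
    return matches


if __name__ == "__main__":
    # Shortened excerpt of the first trace in this article.
    sample = "\n".join([
        "[44369.555289] BUG: scheduling while atomic: migration/30/210/0x00000001",
        "[44369.561783] Call trace:",
        "[44369.561832]  cancel_delayed_work_sync+0x24/0x30",
        "[44369.561834]  sched_cpu_dying+0x414/0x470",
        "[44369.561839]  cpuhp_invoke_callback+0xa8/0x618",
    ])
    for report in find_matching_reports(sample):
        print(report)
```

A match only indicates that the log has the same shape as the traces in this article; it does not confirm the root cause by itself.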

Environment

  • Red Hat Enterprise Linux release 8
