Sleeping or scheduling after sched_cpu_dying() leads to "scheduling while atomic" and BUG at kernel/cpu.c:907!
Issue
On rare occasions, a CPU offline operation triggers a "scheduling while atomic" message followed by a kernel BUG:
[44369.425382] CPU29: shutdown
[44369.455042] psci: Retrying again to check for CPU kill
[44369.458362] psci: CPU29 killed.
[44369.555289] BUG: scheduling while atomic: migration/30/210/0x00000001
[44369.561701] Modules linked in: ...
[44369.561778] CPU: 30 PID: 210 Comm: migration/30 Tainted: G W OE --------- - - 4.18.0-193.6.3.el8_2.aarch64 #1
[44369.561781] Hardware name: ...
[44369.561783] Call trace:
[44369.561791] dump_backtrace+0x0/0x188
[44369.561794] show_stack+0x24/0x30
[44369.561801] dump_stack+0x90/0xb4
[44369.561806] __schedule_bug+0x70/0x80
[44369.561810] __schedule+0x698/0x778
[44369.561813] schedule+0x38/0xb8
[44369.561816] schedule_timeout+0x264/0x388
[44369.561819] wait_for_common+0x1a8/0x210
[44369.561822] wait_for_completion+0x28/0x38
[44369.561826] flush_work+0x118/0x238
[44369.561829] __cancel_work_timer+0x12c/0x1a0
[44369.561832] cancel_delayed_work_sync+0x24/0x30
[44369.561834] sched_cpu_dying+0x414/0x470
[44369.561839] cpuhp_invoke_callback+0xa8/0x618
[44369.561842] take_cpu_down+0x84/0xe0
[44369.561846] multi_cpu_stop+0x98/0x148
[44369.561849] cpu_stopper_thread+0xb4/0x170
[44369.561851] smpboot_thread_fn+0x154/0x1d8
[44369.561853] kthread+0x130/0x138
[44369.561857] ret_from_fork+0x10/0x18
[44369.561867] ------------[ cut here ]------------
[44369.561870] kernel BUG at kernel/cpu.c:907!
[44369.571209] Internal error: Oops - BUG: 0 [#1] SMP
[44369.575981] [INFO] xos TOF 3002 - file:tof_dump.c function:tof_dump_handler line:77 dump ICC registers
[44369.602959] Modules linked in: ...
[44369.677592] CPU: 30 PID: 0 Comm: swapper/30 Tainted: G W OE --------- - - 4.18.0-193.6.3.el8_2.aarch64 #1
[44369.688150] Hardware name: ...
[44369.695337] pstate: 20c00009 (nzCv daif +PAN +UAO)
[44369.700108] pc : cpuhp_report_idle_dead+0x88/0x90
[44369.704789] lr : do_idle+0x16c/0x290
[44369.708346] sp : ffff000016faff10
[44369.711646] x29: ffff000016faff10 x28: 0000000000000000
[44369.716933] x27: 0000000000000000 x26: 0000000000000000
[44369.722219] x25: 0000000000000000 x24: ffff0000115dc00c
[44369.727505] x23: ffff00001128c178 x22: ffff0000115d3708
[44369.732791] x21: ffff80897fc32130 x20: 000080896ea30000
[44369.738077] x19: ffff000011202130 x18: 0000000000000060
[44369.743363] x17: 0000000000000000 x16: 0000000000000000
[44369.748649] x15: ffffffffffffffff x14: ffff0000115d3708
[44369.753934] x13: 0000000000014214 x12: ffff000011fc6000
[44369.759220] x11: ffff00001160f000 x10: 0000000000000d10
[44369.764506] x9 : ffff000016fafe80 x8 : ffff80894066d970
[44369.769792] x7 : 00000000000033b0 x6 : 00000409ff409211
[44369.775078] x5 : 00ffffffffffffff x4 : 0031a4b5a764bd4d
[44369.780364] x3 : 0000000000000018 x2 : ffff0000115db778
[44369.785649] x1 : 0000000000000000 x0 : 0000000000000058
[44369.790937] Process swapper/30 (pid: 0, stack limit = 0x0000000087d5da54)
[44369.797693] Call trace:
[44369.800129] cpuhp_report_idle_dead+0x88/0x90
[44369.804465] do_idle+0x16c/0x290
[44369.807677] cpu_startup_entry+0x28/0x30
[44369.811581] secondary_start_kernel+0x124/0x138
[44369.816090] Code: a94153f3 f94013f5 a8c37bfd d65f03c0 (d4210000)
[44369.822154] ---[ end trace 9fa940c2df45e9bf ]---
[44369.831908] Kernel panic - not syncing: Fatal exception
[44369.837109] SMP: stopping secondary CPUs
[44370.904513] SMP: failed to stop secondary CPUs 0-1,31-59
[44370.909808] Kernel Offset: disabled
[44370.913282] CPU features: 0x0002,2ae08a38
[44370.917272] Memory Limit: none
[44370.921289] Starting crashdump kernel...
Another example, from a debug kernel:
[ 133.355772] BUG: sleeping function called from invalid context at kernel/workqueue.c:2959
[ 133.368477] in_atomic(): 1, irqs_disabled(): 128, pid: 23, name: migration/2
[ 133.369988] no locks held by migration/2/23.
[ 133.370890] irq event stamp: 1366
[ 133.371839] hardirqs last enabled at (1365): [<ffff20003fb5ecc0>] _raw_spin_unlock_irq+0x38/0xc0
[ 133.373713] hardirqs last disabled at (1366): [<ffff20003e4fbbc4>] multi_cpu_stop+0x244/0x338
[ 133.375494] softirqs last enabled at (0): [<ffff20003e263408>] copy_process.isra.1.part.2+0x1188/0x52b0
[ 133.377834] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ 133.379404] CPU: 2 PID: 23 Comm: migration/2 Kdump: loaded Not tainted 4.18.0-193.6.3.el8_2.aarch64+debug #1
[ 133.381443] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 133.382970] Call trace:
[ 133.383495] dump_backtrace+0x0/0x308
[ 133.384266] show_stack+0x24/0x30
[ 133.384966] dump_stack+0xe0/0x11c
[ 133.385686] ___might_sleep+0x2c8/0x428
[ 133.386520] __might_sleep+0xa0/0x190
[ 133.387306] flush_work+0xc8/0x928
[ 133.388026] __cancel_work_timer+0x228/0x348
[ 133.388930] cancel_delayed_work_sync+0x24/0x30
[ 133.389888] sched_cpu_dying+0x6c8/0xa10
[ 133.390723] cpuhp_invoke_callback+0x23c/0x3030
[ 133.391678] take_cpu_down+0x124/0x1f8
[ 133.392470] multi_cpu_stop+0x180/0x338
[ 133.393283] cpu_stopper_thread+0x1cc/0x3f0
[ 133.394165] smpboot_thread_fn+0x3bc/0xa10
[ 133.395029] kthread+0x2c8/0x350
[ 133.395714] ret_from_fork+0x10/0x18
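What both traces have in common is that sched_cpu_dying() calls cancel_delayed_work_sync(), which can sleep in flush_work(). At that point the migration (CPU stopper) thread is running in atomic context under multi_cpu_stop(), so sleeping or scheduling is not allowed.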
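To check whether a captured console log or vmcore-dmesg shows the same signature, the call traces can be scanned for a sleep or schedule frame that sits beneath sched_cpu_dying(). The following is a minimal sketch, not a supported tool. The helper names (extract_call_traces, sleeps_under_sched_cpu_dying) and the set of marker frames are assumptions made for illustration, and the parser assumes the one-frame-per-line aarch64 trace format shown above.

```python
import re

# Matches the "[44369.561783] " timestamp prefix printed by the kernel.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")
# Matches a stack frame such as "flush_work+0x118/0x238".
FRAME = re.compile(r"^([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+$")
# Frames that indicate an attempt to sleep or schedule (illustrative choice).
SLEEP_MARKERS = {"schedule", "__schedule", "__might_sleep", "___might_sleep"}


def extract_call_traces(log_text):
    """Return a list of traces; each trace lists symbols, innermost first."""
    traces = []
    current = None
    for raw in log_text.splitlines():
        line = TIMESTAMP.sub("", raw.strip())
        if line == "Call trace:":
            current = []
            traces.append(current)
            continue
        match = FRAME.match(line)
        if current is not None and match:
            current.append(match.group(1))
        else:
            # Any non-frame line ends the trace being collected.
            current = None
    return traces


def sleeps_under_sched_cpu_dying(trace):
    """True if a sleep/schedule frame was reached from sched_cpu_dying()."""
    if "sched_cpu_dying" not in trace:
        return False
    # Frames listed before sched_cpu_dying were called (directly or
    # indirectly) by it, because the innermost frame is printed first.
    called = trace[: trace.index("sched_cpu_dying")]
    return any(frame in SLEEP_MARKERS for frame in called)
```

Passing the contents of a saved log to extract_call_traces() and filtering the result with sleeps_under_sched_cpu_dying() gives the traces that match this issue. Unrelated traces, such as the cpuhp_report_idle_dead one above, are ignored.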
Environment
- Red Hat Enterprise Linux release 8