System crashed with "Kernel panic - not syncing: RCU Stall" on the Real Time kernel
Environment
- Red Hat Enterprise Linux 7 for Real Time only
- Specifically, kernel versions kernel-rt-3.10.0-514.1.1.rt56.422.el7 and above
- Note: the issue does not pertain to Red Hat Enterprise Linux 8 for Real Time or any other version of Red Hat Enterprise Linux
Issue
- Kernel panic with the following message on the console:
Kernel panic - not syncing: RCU Stall
PID: 425416 TASK: ffff8b257ca8d3c0 CPU: 40 COMMAND: XXXXXXXXXXXXXXXXX
#0 [ffff8b1d9df09e40] crash_nmi_callback at ffffffff83245e57
#1 [ffff8b1d9df09e50] nmi_handle at ffffffff8395f8e4
#2 [ffff8b1d9df09ea8] do_nmi at ffffffff8395fafd
#3 [ffff8b1d9df09ef0] end_repeat_nmi at ffffffff8395ed39
[exception RIP: generic_exec_single+0xfa]
RIP: ffffffff83306d6a RSP: ffff8b24eb39fb20 RFLAGS: 00000202
RAX: ffff8b0e2df73c00 RBX: ffff8b24eb39fb20 RCX: ffff8b0e2df73c30
RDX: ffff8b359be1cb40 RSI: ffff8b24eb39fb20 RDI: ffff8b24eb39fb20
RBP: ffff8b24eb39fb68 R8: 0000000000000001 R9: 000000055076903e
R10: fffff70944138980 R11: 0000000000000000 R12: 0000000000000014
R13: 0000000000000001 R14: 0000000000000001 R15: ffff8b24eb39fbf8
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
--- <NMI exception stack> ---
#4 [ffff8b24eb39fb20] generic_exec_single at ffffffff83306d6a
#5 [ffff8b24eb39fb70] smp_call_function_single at ffffffff83306e92
#6 [ffff8b24eb39fba8] smp_call_function_many at ffffffff833074ca
#7 [ffff8b24eb39fbf0] native_flush_tlb_others at ffffffff8326abe8
[...]
Resolution
- Update to kernel-rt-3.10.0-1160.9.1.rt56.1143.el7 from Errata RHSA-2020:5441 or later.
Root Cause
- To allow for symmetric multiprocessing (SMP), or in other words, multitasking across multiple CPUs, the CPUs sometimes must synchronize state with one another, such as flushing stale TLB entries on remote CPUs (the `native_flush_tlb_others()` frame in the backtrace above is one example). This is often carried out via the `call_single_data` infrastructure in the kernel.
- Effectively, if, for example, CPU X needs CPU Y to sync up its state with what CPU X has:
  - CPU X bundles the function to perform the syncing into a `call_single_data` (csd) structure
  - CPU X queues up the new csd structure onto CPU Y's `call_single_queue`
  - CPU X then sends an inter-processor interrupt (IPI) to CPU Y
  - Oftentimes, CPU X will then wait for the work to finish in order to make sure it can safely proceed
  - CPU Y receives the interrupt and works through the csd's in its `call_single_queue`
  - CPU Y then acknowledges the csd completion
  - The acknowledgment of completion wakes up CPU X and allows execution to progress
- A deadlock condition can occur within the above noted steps at the following lines of code:
kernel/smp.c:
146 static int generic_exec_single(int cpu, struct call_single_data *csd,
147                                smp_call_func_t func, void *info, int wait)
148 {
[...]
190     if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
191         arch_send_call_function_single_ipi(cpu);
- CPU X is queuing csd work to CPU Y, but in between queuing the csd and sending the IPI, CPU X can be preempted, and the process performing the queuing is scheduled off the CPU and put back into that CPU's runqueue.
- Then another process on CPU X queues up csd work for CPU Y, but, because csd work is already queued for CPU Y, the if statement at line 190 above evaluates to false, so line 191 is not executed (thus no IPI is sent to CPU Y).
- CPU X then waits indefinitely in `csd_lock_wait()` a few lines of code later.
Diagnostic Steps
Please note that the addresses and values shown in the analysis below will likely differ from those observed in your particular environment.
1. Ensure kdump is set up and configured for the system to generate vmcores should the issue occur. The crash tool will also need to be installed and set up to perform analysis.
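As a rough setup sketch (standard RHEL 7 package names; the `crashkernel=` boot parameter and the dump target in `/etc/kdump.conf` must be sized and configured for your particular system):

```shell
# Illustrative kdump/crash setup on RHEL 7; verify the crashkernel=
# boot parameter and /etc/kdump.conf dump target before relying on it.
yum install -y kexec-tools crash
systemctl enable --now kdump
systemctl status kdump   # oneshot service: "active (exited)" when ready
```

Matching kernel debuginfo packages (e.g. via `debuginfo-install kernel-rt`) are also needed for crash to analyze the resulting vmcore.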
2. Check the kernel ring buffer to see precisely how the system ended up panicking.
crash> log
[...]
[32801109.627094] INFO: rcu_preempt self-detected stall on CPU { 0} (t=60000 jiffies g=3805569957 c=3805569956 q=39067)
[32801109.627096] Task dump for CPU 0:
[32801109.627099] XXXXXXXXXXXXXXXX R running task 0 260624 260617 0x00080088
[32801109.627100] Call Trace:
[32801109.627113] <IRQ> [<ffffffffaaaca6f6>] sched_show_task+0xb6/0x120
[32801109.627118] [<ffffffffaaacea39>] dump_cpu_task+0x39/0x70
[32801109.627125] [<ffffffffaab49cc0>] rcu_dump_cpu_stacks+0x90/0xd0
[32801109.627130] [<ffffffffaab4e6d6>] rcu_check_callbacks+0x476/0x860
[32801109.627136] [<ffffffffaaa9d26b>] update_process_times+0x4b/0x80
[32801109.627142] [<ffffffffaaaff470>] tick_sched_handle+0x30/0x70
[32801109.627146] [<ffffffffaaaff869>] tick_sched_timer+0x39/0x80
[32801109.627151] [<ffffffffaaab959b>] __run_hrtimer+0xab/0x2b0
[32801109.627154] [<ffffffffaaaff830>] ? tick_sched_do_timer+0x50/0x50
[32801109.627157] [<ffffffffaaaba490>] hrtimer_interrupt+0x130/0x330
[32801109.627162] [<ffffffffaaa4a55b>] local_apic_timer_interrupt+0x3b/0x60
[32801109.627169] [<ffffffffab16b563>] smp_apic_timer_interrupt+0x43/0x60
[32801109.627172] [<ffffffffab168072>] apic_timer_interrupt+0x162/0x170
[32801109.627177] <EOI> [<ffffffffaab06d6a>] ? generic_exec_single+0xfa/0x1b0
[32801109.627182] [<ffffffffaaa6aa20>] ? leave_mm+0x120/0x120
[32801109.627184] [<ffffffffaaa6aa20>] ? leave_mm+0x120/0x120
[32801109.627186] [<ffffffffaab06e92>] smp_call_function_single+0x72/0xb0
[32801109.627189] [<ffffffffaaa6aa20>] ? leave_mm+0x120/0x120
[32801109.627192] [<ffffffffaab074ca>] smp_call_function_many+0x23a/0x280
[32801109.627195] [<ffffffffaaa6abe8>] native_flush_tlb_others+0xb8/0xc0
[32801109.627198] [<ffffffffaaa6add5>] flush_tlb_page+0x65/0xf0
[32801109.627203] [<ffffffffaabf55e5>] ptep_clear_flush+0x45/0x70
[32801109.627208] [<ffffffffaabde782>] wp_page_copy.isra.81+0x332/0x590
[...]
[32801109.627240] Kernel panic - not syncing: RCU Stall
- The output above indicates CPU 0 stalled out and caught itself stalling.
3. Given the system panicked because CPU 0 stalled, check the backtrace of CPU 0 to determine precisely where the kernel stalled:
crash> bt
PID: 260624 TASK: ffff92a0156ce480 CPU: 0 COMMAND: "runc:[2:INIT]"
#0 [ffff92c43da03c68] machine_kexec at ffffffffaaa53b14
#1 [ffff92c43da03cc8] __crash_kexec at ffffffffaab13162
#2 [ffff92c43da03d98] panic at ffffffffab14fa1f
#3 [ffff92c43da03e18] rcu_check_callbacks at ffffffffaab4eab7
#4 [ffff92c43da03e80] update_process_times at ffffffffaaa9d26b
#5 [ffff92c43da03ea8] tick_sched_handle at ffffffffaaaff470
#6 [ffff92c43da03ec8] tick_sched_timer at ffffffffaaaff869
#7 [ffff92c43da03ef0] __run_hrtimer at ffffffffaaab959b
#8 [ffff92c43da03f30] hrtimer_interrupt at ffffffffaaaba490
#9 [ffff92c43da03fc0] local_apic_timer_interrupt at ffffffffaaa4a55b
#10 [ffff92c43da03fd8] smp_apic_timer_interrupt at ffffffffab16b563
#11 [ffff92c43da03ff0] apic_timer_interrupt at ffffffffab168072
--- <IRQ stack> ---
#12 [ffff92c414a1fa78] apic_timer_interrupt at ffffffffab168072
[exception RIP: generic_exec_single+0xfa] <---
RIP: ffffffffaab06d6a RSP: ffff92c414a1fb20 RFLAGS: 00000202
RAX: ffff92c427d67c00 RBX: ffffd984000001c1 RCX: ffff92c427d67c30
RDX: ffff92f43c31cb40 RSI: ffff92c414a1fb20 RDI: ffff92c414a1fb20
RBP: ffff92c414a1fb68 R8: 0000000000000001 R9: 0000000004ea5120
R10: ffffd9844f8fbc80 R11: 0000000000000000 R12: ffff92c4fffdb108
R13: 0000000000000000 R14: 0000000000000000 R15: ffff92c4fffdb100
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0000
#13 [ffff92c414a1fb70] smp_call_function_single at ffffffffaab06e92
#14 [ffff92c414a1fba8] smp_call_function_many at ffffffffaab074ca
#15 [ffff92c414a1fbf0] native_flush_tlb_others at ffffffffaaa6abe8
#16 [ffff92c414a1fc40] flush_tlb_page at ffffffffaaa6add5
#17 [ffff92c414a1fc68] ptep_clear_flush at ffffffffaabf55e5
#18 [ffff92c414a1fca0] wp_page_copy at ffffffffaabde782
#19 [ffff92c414a1fd10] do_wp_page at ffffffffaabe0974
#20 [ffff92c414a1fdb8] handle_mm_fault at ffffffffaabe4645
#21 [ffff92c414a1feb0] __do_page_fault at ffffffffab161f93
#22 [ffff92c414a1ff20] do_page_fault at ffffffffab162325
#23 [ffff92c414a1ff50] page_fault at ffffffffab15e718
RIP: 000055860f47bcb3 RSP: 000000c42004ff18 RFLAGS: 00010246
RAX: 000000c420072000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 000000c420072060 RSI: 0000000000000068 RDI: 000000c420072010
RBP: 000000c42004ff48 R8: 0000000000000001 R9: 0000000000000001
R10: 000000c420072000 R11: 0000000000000001 R12: 0000000000000040
R13: 0000000000000040 R14: 0000000000000011 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
- In the above backtrace, the stall occurred within the function `generic_exec_single()`. Everything above the '--- <IRQ stack> ---' marker is the kernel handling the timer interrupt and subsequent panic and can thus be ignored.
4. Looking into the code, the only location within `generic_exec_single()` where a CPU could wait indefinitely is the following:
kernel/smp.c:
146 static int generic_exec_single(int cpu, struct call_single_data *csd,
147 smp_call_func_t func, void *info, int wait)
[...]
190 if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
191 arch_send_call_function_single_ipi(cpu);
192
193 if (wait)
194 csd_lock_wait(csd); <---
- Above, if `wait` is set, the calling process must wait in `csd_lock_wait()` for the csd work to complete.
5. Since the CPU is waiting for the csd to complete after having queued the work, that implies another CPU has leftover objects in its call_single_queue. The following command gathers the call_single_queue for each CPU and prints any work in each queue:
crash> p call_single_queue | awk '/ffff/ {print "list " $NF}' > csd_queues
crash> < csd_queues
crash> list ffff92c43da1cb40
ffff92c43da1cb40
crash> list ffff92c43da5cb40
ffff92c43da5cb40
crash> list ffff92c43da9cb40
ffff92c43da9cb40
crash> list ffff92c43dadcb40
ffff92c43dadcb40
crash> list ffff92c43db1cb40
ffff92c43db1cb40
[...]
crash> list ffff92c43e3dcb40
ffff92c43e3dcb40
crash> list ffff92f43c31cb40 <--- CPU 60's call_single_queue
ffff92f43c31cb40
ffff92d19b2a7990
ffff92c414a1fb20
ffff92c427d67c30
crash> list ffff92f43c35cb40
ffff92f43c35cb40
crash> list ffff92f43c39cb40
ffff92f43c39cb40
[...]
- Note, an empty list in the above output is denoted by a single address that matches the address of the respective CPU's `call_single_queue`. Above, CPU 60 has more than one entry.
6. With this, if the deadlock occurred, then another process should exist somewhere on the system waiting to get back onto a CPU after being preempted earlier. The stack for that process should indicate it was working through a similar series of functions to queue csd work, and should also show that preemption occurred. The following command prints the backtrace for every process currently running on a CPU or queued up to run on one:
crash> foreach RU bt
[...]
PID: 260617 TASK: ffff92996b72d3c0 CPU: 0 COMMAND: XXXXXXXXXXXXXXXX
#0 [ffff92c427d67ab8] __schedule at ffffffffab15a914
#1 [ffff92c427d67b48] preempt_schedule_irq at ffffffffab15b3d1
#2 [ffff92c427d67b70] retint_kernel at ffffffffab15e5e7
#3 [ffff92c427d67c08] physflat_send_IPI_mask at ffffffffaaa5228e
#4 [ffff92c427d67c18] native_send_call_func_single_ipi at ffffffffaaa46cd2
#5 [ffff92c427d67c28] generic_exec_single at ffffffffaab06daf
#6 [ffff92c427d67c80] smp_call_function_single at ffffffffaab06e92
#7 [ffff92c427d67cb8] task_function_call at ffffffffaab9915c
#8 [ffff92c427d67cf8] perf_cgroup_attach at ffffffffaab9a114
#9 [ffff92c427d67d20] cgroup_attach_task at ffffffffaab1c345
#10 [ffff92c427d67dd8] attach_task_by_pid at ffffffffaab1ca28
#11 [ffff92c427d67e28] cgroup_procs_write at ffffffffaab1ca96
#12 [ffff92c427d67e38] cgroup_file_write at ffffffffaab1b1f7
#13 [ffff92c427d67ec8] vfs_write at ffffffffaac2e5d0
#14 [ffff92c427d67f08] sys_write at ffffffffaac2f3ef
#15 [ffff92c427d67f50] tracesys at ffffffffab1672ed (via system_call)
RIP: 00005638d3fda110 RSP: 000000c42014a560 RFLAGS: 00000202
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: ffffffffffffffff
RDX: 0000000000000006 RSI: 000000c42014a7f0 RDI: 0000000000000006
RBP: 000000c42014a5b8 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 000000000000007b
R13: 000000000000007a R14: 0000000000000200 R15: 0000000000000001
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
[...]
- Above, the backtrace shows an attempt to queue up a csd via `generic_exec_single()` (frame #5), but the process was preempted (frames #1 and #2) before it could send the IPI, and it is thus waiting to get back onto the CPU to finish the work. Note that this process (PID 260617) was running on CPU 0, the same CPU on which PID 260624 is now spinning in `csd_lock_wait()`; since the spinning task never yields, the preempted task never gets back onto the CPU to send the IPI, completing the deadlock.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.