System crashed with "Kernel panic - not syncing: RCU Stall" on the Real Time kernel


Environment

  • Red Hat Enterprise Linux 7 for Real Time only
    • Specifically kernel versions kernel-rt-3.10.0-514.1.1.rt56.422.el7 and above
    • Note: this issue does not pertain to Red Hat Enterprise Linux 8 for Real Time or to any other version of Red Hat Enterprise Linux

Issue

  • Kernel panic with the following message on the console:
Kernel panic - not syncing: RCU Stall
PID: 425416  TASK: ffff8b257ca8d3c0  CPU: 40  COMMAND: XXXXXXXXXXXXXXXXX
 #0 [ffff8b1d9df09e40] crash_nmi_callback at ffffffff83245e57
 #1 [ffff8b1d9df09e50] nmi_handle at ffffffff8395f8e4
 #2 [ffff8b1d9df09ea8] do_nmi at ffffffff8395fafd
 #3 [ffff8b1d9df09ef0] end_repeat_nmi at ffffffff8395ed39
    [exception RIP: generic_exec_single+0xfa]
    RIP: ffffffff83306d6a  RSP: ffff8b24eb39fb20  RFLAGS: 00000202
    RAX: ffff8b0e2df73c00  RBX: ffff8b24eb39fb20  RCX: ffff8b0e2df73c30
    RDX: ffff8b359be1cb40  RSI: ffff8b24eb39fb20  RDI: ffff8b24eb39fb20
    RBP: ffff8b24eb39fb68   R8: 0000000000000001   R9: 000000055076903e
    R10: fffff70944138980  R11: 0000000000000000  R12: 0000000000000014
    R13: 0000000000000001  R14: 0000000000000001  R15: ffff8b24eb39fbf8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #4 [ffff8b24eb39fb20] generic_exec_single at ffffffff83306d6a
 #5 [ffff8b24eb39fb70] smp_call_function_single at ffffffff83306e92
 #6 [ffff8b24eb39fba8] smp_call_function_many at ffffffff833074ca
 #7 [ffff8b24eb39fbf0] native_flush_tlb_others at ffffffff8326abe8
[...]

Resolution

Root Cause

  • To allow for symmetric multiprocessing (SMP), or in other words, multitasking across multiple CPUs, the CPUs sometimes must synchronize, such as syncing up
    their L1-L3 caches. This is often carried out via the call_single_data infrastructure in the kernel.
  • For example, if CPU X needs CPU Y to sync up its cache with what CPU X has in its cache:

    1. CPU X bundles the function that performs the syncing into a call_single_data (csd) structure
    2. CPU X queues the new csd structure onto CPU Y's call_single_queue
    3. CPU X then sends an inter-processor interrupt (IPI) to CPU Y
    4. Oftentimes, CPU X will then wait for the work to finish in order to make sure it can safely proceed
    5. CPU Y receives the interrupt and processes the csd entries in its call_single_queue
    6. CPU Y then acknowledges completion of the csd
    7. The acknowledgment of completion wakes up CPU X and allows its execution to progress
  • A deadlock condition can occur within the steps above at the following lines of code:

    kernel/smp.c:
    146 static int generic_exec_single(int cpu, struct call_single_data *csd,
    147                                smp_call_func_t func, void *info, int wait)
    148 {
    [...]
    190         if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
    191                 arch_send_call_function_single_ipi(cpu);
    
  • CPU X is queuing csd work to CPU Y, but in between queuing the csd (line 190) and sending the IPI (line 191), CPU X can be preempted: the process performing
    the queuing is scheduled off the CPU and put back into that CPU's runqueue.

  • Then another process on CPU X queues up csd work for CPU Y. Because csd work is already queued for CPU Y, the if statement at line 190 above evaluates to
    false, so line 191 is not executed and no IPI is sent to CPU Y.
  • That second process then waits indefinitely in csd_lock_wait() a few lines of code later: the IPI it relies on was never sent, and the preempted process that
    should send it cannot get back onto the CPU.

Diagnostic Steps

Please note: the addresses and values shown in this analysis will likely differ from those observed in your particular environment.
1. Ensure kdump is set up and configured so the system generates a vmcore should the issue occur. The crash tool will also need to be installed and set up to perform the analysis.
2. Check the kernel ring buffer to see precisely how the system ended up panicking.

    crash> log 
    [...]
    [32801109.627094] INFO: rcu_preempt self-detected stall on CPU { 0}  (t=60000 jiffies g=3805569957 c=3805569956 q=39067)
    [32801109.627096] Task dump for CPU 0:
    [32801109.627099] XXXXXXXXXXXXXXXX   R  running task        0 260624 260617 0x00080088
    [32801109.627100] Call Trace:
    [32801109.627113]  <IRQ>  [<ffffffffaaaca6f6>] sched_show_task+0xb6/0x120
    [32801109.627118]  [<ffffffffaaacea39>] dump_cpu_task+0x39/0x70
    [32801109.627125]  [<ffffffffaab49cc0>] rcu_dump_cpu_stacks+0x90/0xd0
    [32801109.627130]  [<ffffffffaab4e6d6>] rcu_check_callbacks+0x476/0x860
    [32801109.627136]  [<ffffffffaaa9d26b>] update_process_times+0x4b/0x80
    [32801109.627142]  [<ffffffffaaaff470>] tick_sched_handle+0x30/0x70
    [32801109.627146]  [<ffffffffaaaff869>] tick_sched_timer+0x39/0x80
    [32801109.627151]  [<ffffffffaaab959b>] __run_hrtimer+0xab/0x2b0
    [32801109.627154]  [<ffffffffaaaff830>] ? tick_sched_do_timer+0x50/0x50
    [32801109.627157]  [<ffffffffaaaba490>] hrtimer_interrupt+0x130/0x330
    [32801109.627162]  [<ffffffffaaa4a55b>] local_apic_timer_interrupt+0x3b/0x60
    [32801109.627169]  [<ffffffffab16b563>] smp_apic_timer_interrupt+0x43/0x60
    [32801109.627172]  [<ffffffffab168072>] apic_timer_interrupt+0x162/0x170
    [32801109.627177]  <EOI>  [<ffffffffaab06d6a>] ? generic_exec_single+0xfa/0x1b0
    [32801109.627182]  [<ffffffffaaa6aa20>] ? leave_mm+0x120/0x120
    [32801109.627184]  [<ffffffffaaa6aa20>] ? leave_mm+0x120/0x120
    [32801109.627186]  [<ffffffffaab06e92>] smp_call_function_single+0x72/0xb0
    [32801109.627189]  [<ffffffffaaa6aa20>] ? leave_mm+0x120/0x120
    [32801109.627192]  [<ffffffffaab074ca>] smp_call_function_many+0x23a/0x280
    [32801109.627195]  [<ffffffffaaa6abe8>] native_flush_tlb_others+0xb8/0xc0
    [32801109.627198]  [<ffffffffaaa6add5>] flush_tlb_page+0x65/0xf0
    [32801109.627203]  [<ffffffffaabf55e5>] ptep_clear_flush+0x45/0x70
    [32801109.627208]  [<ffffffffaabde782>] wp_page_copy.isra.81+0x332/0x590
    [...]
    [32801109.627240] Kernel panic - not syncing: RCU Stall

- The output above indicates CPU 0 stalled and detected its own stall ("rcu_preempt self-detected stall on CPU { 0}").

3. Given the system panicked because CPU 0 stalled, check CPU 0's backtrace to determine precisely where the kernel stalled:

    crash> bt
    PID: 260624  TASK: ffff92a0156ce480  CPU: 0   COMMAND: "runc:[2:INIT]"
     #0 [ffff92c43da03c68] machine_kexec at ffffffffaaa53b14
     #1 [ffff92c43da03cc8] __crash_kexec at ffffffffaab13162
     #2 [ffff92c43da03d98] panic at ffffffffab14fa1f
     #3 [ffff92c43da03e18] rcu_check_callbacks at ffffffffaab4eab7
     #4 [ffff92c43da03e80] update_process_times at ffffffffaaa9d26b
     #5 [ffff92c43da03ea8] tick_sched_handle at ffffffffaaaff470
     #6 [ffff92c43da03ec8] tick_sched_timer at ffffffffaaaff869
     #7 [ffff92c43da03ef0] __run_hrtimer at ffffffffaaab959b
     #8 [ffff92c43da03f30] hrtimer_interrupt at ffffffffaaaba490
     #9 [ffff92c43da03fc0] local_apic_timer_interrupt at ffffffffaaa4a55b
    #10 [ffff92c43da03fd8] smp_apic_timer_interrupt at ffffffffab16b563
    #11 [ffff92c43da03ff0] apic_timer_interrupt at ffffffffab168072
    --- <IRQ stack> ---
    #12 [ffff92c414a1fa78] apic_timer_interrupt at ffffffffab168072
        [exception RIP: generic_exec_single+0xfa]                           <---
        RIP: ffffffffaab06d6a  RSP: ffff92c414a1fb20  RFLAGS: 00000202
        RAX: ffff92c427d67c00  RBX: ffffd984000001c1  RCX: ffff92c427d67c30
        RDX: ffff92f43c31cb40  RSI: ffff92c414a1fb20  RDI: ffff92c414a1fb20
        RBP: ffff92c414a1fb68   R8: 0000000000000001   R9: 0000000004ea5120
        R10: ffffd9844f8fbc80  R11: 0000000000000000  R12: ffff92c4fffdb108
        R13: 0000000000000000  R14: 0000000000000000  R15: ffff92c4fffdb100
        ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0000
    #13 [ffff92c414a1fb70] smp_call_function_single at ffffffffaab06e92
    #14 [ffff92c414a1fba8] smp_call_function_many at ffffffffaab074ca
    #15 [ffff92c414a1fbf0] native_flush_tlb_others at ffffffffaaa6abe8
    #16 [ffff92c414a1fc40] flush_tlb_page at ffffffffaaa6add5
    #17 [ffff92c414a1fc68] ptep_clear_flush at ffffffffaabf55e5
    #18 [ffff92c414a1fca0] wp_page_copy at ffffffffaabde782
    #19 [ffff92c414a1fd10] do_wp_page at ffffffffaabe0974
    #20 [ffff92c414a1fdb8] handle_mm_fault at ffffffffaabe4645
    #21 [ffff92c414a1feb0] __do_page_fault at ffffffffab161f93
    #22 [ffff92c414a1ff20] do_page_fault at ffffffffab162325
    #23 [ffff92c414a1ff50] page_fault at ffffffffab15e718
        RIP: 000055860f47bcb3  RSP: 000000c42004ff18  RFLAGS: 00010246
        RAX: 000000c420072000  RBX: 0000000000000000  RCX: 0000000000000000
        RDX: 000000c420072060  RSI: 0000000000000068  RDI: 000000c420072010
        RBP: 000000c42004ff48   R8: 0000000000000001   R9: 0000000000000001
        R10: 000000c420072000  R11: 0000000000000001  R12: 0000000000000040
        R13: 0000000000000040  R14: 0000000000000011  R15: 0000000000000000
        ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

- In the above backtrace, the stall occurred within the function `generic_exec_single()`. Everything above the '--- <IRQ stack> ---' marker is the kernel detecting the stall and handling the panic, and can thus be ignored.

4. Looking into the code, the only location within generic_exec_single() where a CPU could wait indefinitely is the following:

    kernel/smp.c:
    146 static int generic_exec_single(int cpu, struct call_single_data *csd,
    147                                smp_call_func_t func, void *info, int wait)
    [...]
    190         if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
    191                 arch_send_call_function_single_ipi(cpu);
    192 
    193         if (wait)
    194                 csd_lock_wait(csd);        <---

- Above, if the caller requested to wait (`wait`), the process waits in csd_lock_wait() until the csd work completes.

5. Since the CPU is waiting for its csd work to complete after queuing it, some CPU must have leftover entries in its call_single_queue. The following commands gather the call_single_queue address for each CPU and print any pending work in each queue:

    crash> p call_single_queue | awk '/ffff/ {print "list " $NF}' > csd_queues
    crash> < csd_queues
    crash> list ffff92c43da1cb40
    ffff92c43da1cb40
    crash> list ffff92c43da5cb40
    ffff92c43da5cb40
    crash> list ffff92c43da9cb40
    ffff92c43da9cb40
    crash> list ffff92c43dadcb40
    ffff92c43dadcb40
    crash> list ffff92c43db1cb40
    ffff92c43db1cb40
    [...]
    crash> list ffff92c43e3dcb40
    ffff92c43e3dcb40
    crash> list ffff92f43c31cb40         <--- CPU 60's call_single_queue 
    ffff92f43c31cb40
    ffff92d19b2a7990
    ffff92c414a1fb20
    ffff92c427d67c30
    crash> list ffff92f43c35cb40
    ffff92f43c35cb40
    crash> list ffff92f43c39cb40
    ffff92f43c39cb40
    [...]

- Note: an empty list in the above output is denoted by a single address matching that CPU's `call_single_queue` head. Above, CPU 60's queue has pending entries.
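The awk filter in the command above simply turns each per-CPU queue address printed by `p call_single_queue` into a crash `list` command. The sample input below is an illustrative stand-in for crash's per-cpu output, not output from this vmcore:

```shell
# Illustrative sample of 'crash> p call_single_queue' per-cpu output:
cat > sample_output <<'EOF'
PER-CPU DATA TYPE:
  struct llist_head call_single_queue;
PER-CPU ADDRESSES:
  [0]: ffff92c43da1cb40
  [1]: ffff92c43da5cb40
  [60]: ffff92f43c31cb40
EOF

# The same awk filter used in the diagnostic step: every line containing
# 'ffff' becomes a 'list <address>' command for crash to replay.
awk '/ffff/ {print "list " $NF}' sample_output
```

Redirecting that output to a file and sourcing it back into crash (`crash> < csd_queues`) replays one `list` command per CPU, as shown above.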

6. If the deadlock described above occurred, then another process should exist somewhere on the system that was preempted earlier and is waiting to get back onto a CPU. That process's stack should show a similar series of csd-queuing functions, along with evidence of the preemption. The following command prints the backtrace of every process currently running on, or queued to run on, a CPU:

    crash> foreach RU bt 
    [...]
    PID: 260617  TASK: ffff92996b72d3c0  CPU: 0   COMMAND: XXXXXXXXXXXXXXXX
     #0 [ffff92c427d67ab8] __schedule at ffffffffab15a914
     #1 [ffff92c427d67b48] preempt_schedule_irq at ffffffffab15b3d1
     #2 [ffff92c427d67b70] retint_kernel at ffffffffab15e5e7
     #3 [ffff92c427d67c08] physflat_send_IPI_mask at ffffffffaaa5228e
     #4 [ffff92c427d67c18] native_send_call_func_single_ipi at ffffffffaaa46cd2
     #5 [ffff92c427d67c28] generic_exec_single at ffffffffaab06daf
     #6 [ffff92c427d67c80] smp_call_function_single at ffffffffaab06e92
     #7 [ffff92c427d67cb8] task_function_call at ffffffffaab9915c
     #8 [ffff92c427d67cf8] perf_cgroup_attach at ffffffffaab9a114
     #9 [ffff92c427d67d20] cgroup_attach_task at ffffffffaab1c345
    #10 [ffff92c427d67dd8] attach_task_by_pid at ffffffffaab1ca28
    #11 [ffff92c427d67e28] cgroup_procs_write at ffffffffaab1ca96
    #12 [ffff92c427d67e38] cgroup_file_write at ffffffffaab1b1f7
    #13 [ffff92c427d67ec8] vfs_write at ffffffffaac2e5d0
    #14 [ffff92c427d67f08] sys_write at ffffffffaac2f3ef
    #15 [ffff92c427d67f50] tracesys at ffffffffab1672ed (via system_call)
        RIP: 00005638d3fda110  RSP: 000000c42014a560  RFLAGS: 00000202
        RAX: ffffffffffffffda  RBX: 0000000000000000  RCX: ffffffffffffffff
        RDX: 0000000000000006  RSI: 000000c42014a7f0  RDI: 0000000000000006
        RBP: 000000c42014a5b8   R8: 0000000000000000   R9: 0000000000000000
        R10: 0000000000000000  R11: 0000000000000202  R12: 000000000000007b
        R13: 000000000000007a  R14: 0000000000000200  R15: 0000000000000001
        ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
    [...]

- Above, the backtrace shows an attempt to queue a csd via `generic_exec_single()` (frame #5) that was preempted (frames #1 and #2); the process is therefore waiting to get back onto the CPU to finish sending the IPI.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
