Red Hat Enterprise Linux crashing shortly after BUG: Bad rss-counter state and BUG: non-zero pgtables_bytes on freeing mm

Solution Unverified - Updated -

Environment

  • Red Hat Enterprise Linux 8.3 and below

    • Specifically kernel version below kernel-4.18.0-305.el8 or kernel-rt-4.18.0-305.rt7.72.el8
    • Observed more commonly on Red Hat Enterprise Linux for Real Time 8.3, however, the issue can also effect non-Real Time kernels as well.

Issue

  • Red Hat Enterprise Linux or Red Hat Enterprise Linux for Real Time crashed with the following logs printed right before the crash

    [302301.489968] BUG: Bad rss-counter state mm:00000000e0ea23fa idx:0 val:12865
    [302301.513574] BUG: Bad rss-counter state mm:00000000e0ea23fa idx:1 val:46392
    [302301.514404] BUG: Bad rss-counter state mm:00000000e0ea23fa idx:0 val:12865
    [302301.514405] BUG: Bad rss-counter state mm:00000000e0ea23fa idx:1 val:46392
    [302301.514405] BUG: non-zero pgtables_bytes on freeing mm: 1814528
    [...]
    [302301.526553] kernel BUG at <LOCATION>
    [302301.526563] invalid opcode: 0000 [#1] PREEMPT_RT SMP NOPTI
    <STACK TRACE>
    

Resolution

  • Update to at least kernel version kernel-4.18.0-305.el8 or kernel-rt-4.18.0-305.rt7.72.el8 or later.

Root Cause

  • A race condition exists in the code path handling the initial execution of an application where the mm_struct can be corrupt.
  • When a process is created in the kernel, before the application/binary is loaded in most instances, the kernel initially duplicates the parent process and then overwrites structures in the duplicated process in order to create a child process. One of these structures is an mm_struct.
  • The mm_struct is the main entity in every process that contains and manages all the memory-specific aspects associated with a specific process. For example, if a process calls malloc() to allocate memory, that process' mm_struct is updated to contain the newly malloc-ed area of memory as well as the increased memory usage.
  • During process creation, for most instances, the duplicated process' mm_struct was updated in the following area of code;

    fs/exec.c
    1006 static int exec_mmap(struct mm_struct *mm)
    [...]
    1030         task_lock(tsk);
    1031         active_mm = tsk->active_mm;
    1032         membarrier_exec_mmap(mm);
    1033         tsk->mm = mm;
    1034         tsk->active_mm = mm;
    1035         activate_mm(active_mm, mm);
    1036         tsk->mm->vmacache_seqnum = 0;
    1037         vmacache_flush(tsk);
    1038         task_unlock(tsk);
    [...]
    
    • In the above code block, the new mm_struct was assigned to the duplicated process while the original mm_struct duplicated from the parent was unassigned and stored in order to properly remove it from the duplicated process.
  • A race condition is possible in the above code block, wherein a process waiting for execution could preempt the current process, overwrite the mm_struct, and result in a double free of the original mm_struct and a missed freeing of the new mm_struct;

    PROCESS A                              PROCESS B
    
    old_mm = current->mm;
    active_mm = current->active_mm;
    *** preempt *** -------------------->  schedule()
                                             prev->active_mm = NULL;
                                             mmdrop(prev active_mm);
                                           ...
                    <--------------------  schedule()
    current->mm = mm;
    current->active_mm = mm;
    if (!old_mm)
        mmdrop(active_mm);
    
  • An upstream fix was backported to Red Hat Enterprise Linux 8.4 kernel kernel-4.18.0-305.el8 and kernel-rt-4.18.0-305.rt7.72.el8;

    $ git show 8c7905e30cdd6053d5bbd1ed027f176dc2369802
    [...]
    diff --git a/fs/exec.c b/fs/exec.c
    index e76be45db81f..7a46a4484e61 100644
    --- a/fs/exec.c
    +++ b/fs/exec.c
    @@ -1028,11 +1028,24 @@ static int exec_mmap(struct mm_struct *mm)
                    }
            }
            task_lock(tsk);
    -       active_mm = tsk->active_mm;
            membarrier_exec_mmap(mm);
    -       tsk->mm = mm;
    +
    +       local_irq_disable();
    +       active_mm = tsk->active_mm;
            tsk->active_mm = mm;
    +       tsk->mm = mm;
    +       /*
    +        * This prevents preemption while active_mm is being loaded and
    +        * it and mm are being updated, which could cause problems for
    +        * lazy tlb mm refcounting when these are updated by context
    +        * switches. Not all architectures can handle irqs off over
    +        * activate_mm yet.
    +        */
    +       if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
    +               local_irq_enable();
            activate_mm(active_mm, mm);
    +       if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
    +               local_irq_enable();
            tsk->mm->vmacache_seqnum = 0;
            vmacache_flush(tsk);
            task_unlock(tsk);
    
    • The backport more or less disabled interrupts and thus preemption throughout the code path to close the window for the race condition.

Diagnostic Steps

  1. Setup kdump to capture vmcores for the system if this is not yet done so.
  2. Setup crash to be able to view the contents of the vmcore (similar to gdb with an application core).
  3. Should the system crash, check the console logs and the kernel ring buffer in the vmcore (what is printed with dmesg and log in crash) for logs similar to the following;

    crash> log
    [...]
    [302301.489968] BUG: Bad rss-counter state mm:00000000e0ea23fa idx:0 val:12865
    [302301.513574] BUG: Bad rss-counter state mm:00000000e0ea23fa idx:1 val:46392
    [302301.514404] BUG: Bad rss-counter state mm:00000000e0ea23fa idx:0 val:12865
    [302301.514405] BUG: Bad rss-counter state mm:00000000e0ea23fa idx:1 val:46392
    [302301.514405] BUG: non-zero pgtables_bytes on freeing mm: 1814528
    [...]
    
    • Note due to the issue being caused by a race condition, the corruption will not always produce the same errors.
    • For example, one instance of investigation did not include the non-zero pgtables_bytes log line:

      crash> log
      [...]
      [84047.522934] BUG: Bad rss-counter state mm:000000004ae89fb2 idx:0 val:12759
      [84047.522942] WARNING: CPU: 12 PID: 120 at kernel/rcu/tree.c:2156 rcu_core+0x5d8/0x5f0
      [...]
      
    • In many scenarios, warnings of various kinds can be (but are not guaranteed) to be observed. For example;

      [...]
      [84047.522934] BUG: Bad rss-counter state mm:000000004ae89fb2 idx:0 val:12759                  <---
      [84047.522942] WARNING: CPU: 12 PID: 120 at kernel/rcu/tree.c:2156 rcu_core+0x5d8/0x5f0
      [84047.522943] Modules linked in: <list of modules> 
      [84047.522965] CPU: 12 PID: 120 Comm: rcuc/12 Kdump: loaded Tainted: G        W  OE    --------- -  - 4.18.0-240.12.1.rt7.66.el8_3.dt1.x86_64 #1
      [84047.522966] Hardware name: Dell Inc. PowerEdge R640/0HG0J8, BIOS 2.11.2 004/21/2021
      [84047.522967] RIP: 0010:rcu_core+0x5d8/0x5f0
      [...]
      [84047.522974] Call Trace:
      [84047.522978]  ? rcu_cpu_kthread+0x54/0x310
      [84047.522979]  ? rcu_cpu_kthread+0x7d/0x310
      [84047.522981]  ? smpboot_register_percpu_thread_cpumask+0x130/0x130
      [84047.522982]  rcu_cpu_kthread+0xaf/0x310
      [84047.522983]  ? smpboot_register_percpu_thread_cpumask+0x130/0x130
      [84047.522984]  smpboot_thread_fn+0x1d6/0x2c0
      [84047.522986]  kthread+0x112/0x130
      [84047.522987]  ? kthread_flush_work_fn+0x10/0x10
      [84047.522991]  ret_from_fork+0x1f/0x40
      [84047.522992] ---[ end trace 0000000000000003 ]---
      [84047.523056] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)             <--- 
      [84047.523057] BUG: unable to handle kernel paging request at ffff9763170a2590
      [84047.523058] PGD 619001067 P4D 619001067 PUD a23e29063 PMD a27255063 PTE 8000000a170a2163
      [84047.523060] Oops: 0011 [#1] PREEMPT_RT SMP NOPTI
      [84047.523061] CPU: 14 PID: 137 Comm: rcuc/14 Kdump: loaded Tainted: G        W  OE    --------- -  - 4.18.0-240.12.1.rt7.66.el8_3.dt1.x86_64 #1
      [84047.523061] Hardware name: Dell Inc. PowerEdge R640/0HG0J8, BIOS 2.11.2 004/21/2021
      [...]
      [84047.523069] Call Trace:
      [84047.523073]  ? rcu_core+0x399/0x5f0
      [84047.523075]  ? rcu_cpu_kthread+0x54/0x310
      [84047.523076]  ? rcu_cpu_kthread+0x7d/0x310
      [84047.523077]  ? smpboot_register_percpu_thread_cpumask+0x130/0x130
      [84047.523078]  ? rcu_cpu_kthread+0xaf/0x310
      [84047.523079]  ? smpboot_register_percpu_thread_cpumask+0x130/0x130
      [84047.523080]  ? smpboot_thread_fn+0x1d6/0x2c0
      [84047.523081]  ? kthread+0x112/0x130
      [84047.523082]  ? kthread_flush_work_fn+0x10/0x10
      [84047.523084]  ? ret_from_fork+0x1f/0x40
      [...]
      
    • Kernel warnings in rcu-related code paths are not necessary but possible with this issue.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.