Red Hat Enterprise Linux crashing shortly after BUG: Bad rss-counter state and BUG: non-zero pgtables_bytes on freeing mm
Environment
-
Red Hat Enterprise Linux 8.3 and below
- Specifically kernel version below
kernel-4.18.0-305.el8
orkernel-rt-4.18.0-305.rt7.72.el8
- Observed more commonly on Red Hat Enterprise Linux for Real Time 8.3, however, the issue can also effect non-Real Time kernels as well.
- Specifically kernel version below
Issue
-
Red Hat Enterprise Linux or Red Hat Enterprise Linux for Real Time crashed with the following logs printed right before the crash
[302301.489968] BUG: Bad rss-counter state mm:00000000e0ea23fa idx:0 val:12865 [302301.513574] BUG: Bad rss-counter state mm:00000000e0ea23fa idx:1 val:46392 [302301.514404] BUG: Bad rss-counter state mm:00000000e0ea23fa idx:0 val:12865 [302301.514405] BUG: Bad rss-counter state mm:00000000e0ea23fa idx:1 val:46392 [302301.514405] BUG: non-zero pgtables_bytes on freeing mm: 1814528 [...] [302301.526553] kernel BUG at <LOCATION> [302301.526563] invalid opcode: 0000 [#1] PREEMPT_RT SMP NOPTI <STACK TRACE>
Resolution
- Update to at least kernel version
kernel-4.18.0-305.el8
orkernel-rt-4.18.0-305.rt7.72.el8
or later.
Root Cause
- A race condition exists in the code path handling the initial execution of an application where the mm_struct can be corrupt.
- When a process is created in the kernel, before the application/binary is loaded in most instances, the kernel initially duplicates the parent process and then overwrites structures in the duplicated process in order to create a child process. One of these structures is an
mm_struct
. - The
mm_struct
is the main entity in every process that contains and manages all the memory-specific aspects associated with a specific process. For example, if a process callsmalloc()
to allocate memory, that process'mm_struct
is updated to contain the newlymalloc
-ed area of memory as well as the increased memory usage. -
During process creation, for most instances, the duplicated process'
mm_struct
was updated in the following area of code;fs/exec.c 1006 static int exec_mmap(struct mm_struct *mm) [...] 1030 task_lock(tsk); 1031 active_mm = tsk->active_mm; 1032 membarrier_exec_mmap(mm); 1033 tsk->mm = mm; 1034 tsk->active_mm = mm; 1035 activate_mm(active_mm, mm); 1036 tsk->mm->vmacache_seqnum = 0; 1037 vmacache_flush(tsk); 1038 task_unlock(tsk); [...]
- In the above code block, the new
mm_struct
was assigned to the duplicated process while the originalmm_struct
duplicated from the parent was unassigned and stored in order to properly remove it from the duplicated process.
- In the above code block, the new
-
A race condition is possible in the above code block, wherein a process waiting for execution could preempt the current process, overwrite the
mm_struct
, and result in a double free of the originalmm_struct
and a missed freeing of the newmm_struct
;PROCESS A PROCESS B old_mm = current->mm; active_mm = current->active_mm; *** preempt *** --------------------> schedule() prev->active_mm = NULL; mmdrop(prev active_mm); ... <-------------------- schedule() current->mm = mm; current->active_mm = mm; if (!old_mm) mmdrop(active_mm);
-
An upstream fix was backported to Red Hat Enterprise Linux 8.4 kernel
kernel-4.18.0-305.el8
andkernel-rt-4.18.0-305.rt7.72.el8
;$ git show 8c7905e30cdd6053d5bbd1ed027f176dc2369802 [...] diff --git a/fs/exec.c b/fs/exec.c index e76be45db81f..7a46a4484e61 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1028,11 +1028,24 @@ static int exec_mmap(struct mm_struct *mm) } } task_lock(tsk); - active_mm = tsk->active_mm; membarrier_exec_mmap(mm); - tsk->mm = mm; + + local_irq_disable(); + active_mm = tsk->active_mm; tsk->active_mm = mm; + tsk->mm = mm; + /* + * This prevents preemption while active_mm is being loaded and + * it and mm are being updated, which could cause problems for + * lazy tlb mm refcounting when these are updated by context + * switches. Not all architectures can handle irqs off over + * activate_mm yet. + */ + if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM)) + local_irq_enable(); activate_mm(active_mm, mm); + if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM)) + local_irq_enable(); tsk->mm->vmacache_seqnum = 0; vmacache_flush(tsk); task_unlock(tsk);
- The backport more or less disabled interrupts and thus preemption throughout the code path to close the window for the race condition.
Diagnostic Steps
- Setup
kdump
to capture vmcores for the system if this is not yet done so. - Setup
crash
to be able to view the contents of the vmcore (similar togdb
with an application core). -
Should the system crash, check the console logs and the kernel ring buffer in the vmcore (what is printed with
dmesg
andlog
in crash) for logs similar to the following;crash> log [...] [302301.489968] BUG: Bad rss-counter state mm:00000000e0ea23fa idx:0 val:12865 [302301.513574] BUG: Bad rss-counter state mm:00000000e0ea23fa idx:1 val:46392 [302301.514404] BUG: Bad rss-counter state mm:00000000e0ea23fa idx:0 val:12865 [302301.514405] BUG: Bad rss-counter state mm:00000000e0ea23fa idx:1 val:46392 [302301.514405] BUG: non-zero pgtables_bytes on freeing mm: 1814528 [...]
- Note due to the issue being caused by a race condition, the corruption will not always produce the same errors.
-
For example, one instance of investigation did not include the
non-zero pgtables_bytes
log line:crash> log [...] [84047.522934] BUG: Bad rss-counter state mm:000000004ae89fb2 idx:0 val:12759 [84047.522942] WARNING: CPU: 12 PID: 120 at kernel/rcu/tree.c:2156 rcu_core+0x5d8/0x5f0 [...]
-
In many scenarios, warnings of various kinds can be (but are not guaranteed) to be observed. For example;
[...] [84047.522934] BUG: Bad rss-counter state mm:000000004ae89fb2 idx:0 val:12759 <--- [84047.522942] WARNING: CPU: 12 PID: 120 at kernel/rcu/tree.c:2156 rcu_core+0x5d8/0x5f0 [84047.522943] Modules linked in: <list of modules> [84047.522965] CPU: 12 PID: 120 Comm: rcuc/12 Kdump: loaded Tainted: G W OE --------- - - 4.18.0-240.12.1.rt7.66.el8_3.dt1.x86_64 #1 [84047.522966] Hardware name: Dell Inc. PowerEdge R640/0HG0J8, BIOS 2.11.2 004/21/2021 [84047.522967] RIP: 0010:rcu_core+0x5d8/0x5f0 [...] [84047.522974] Call Trace: [84047.522978] ? rcu_cpu_kthread+0x54/0x310 [84047.522979] ? rcu_cpu_kthread+0x7d/0x310 [84047.522981] ? smpboot_register_percpu_thread_cpumask+0x130/0x130 [84047.522982] rcu_cpu_kthread+0xaf/0x310 [84047.522983] ? smpboot_register_percpu_thread_cpumask+0x130/0x130 [84047.522984] smpboot_thread_fn+0x1d6/0x2c0 [84047.522986] kthread+0x112/0x130 [84047.522987] ? kthread_flush_work_fn+0x10/0x10 [84047.522991] ret_from_fork+0x1f/0x40 [84047.522992] ---[ end trace 0000000000000003 ]--- [84047.523056] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) <--- [84047.523057] BUG: unable to handle kernel paging request at ffff9763170a2590 [84047.523058] PGD 619001067 P4D 619001067 PUD a23e29063 PMD a27255063 PTE 8000000a170a2163 [84047.523060] Oops: 0011 [#1] PREEMPT_RT SMP NOPTI [84047.523061] CPU: 14 PID: 137 Comm: rcuc/14 Kdump: loaded Tainted: G W OE --------- - - 4.18.0-240.12.1.rt7.66.el8_3.dt1.x86_64 #1 [84047.523061] Hardware name: Dell Inc. PowerEdge R640/0HG0J8, BIOS 2.11.2 004/21/2021 [...] [84047.523069] Call Trace: [84047.523073] ? rcu_core+0x399/0x5f0 [84047.523075] ? rcu_cpu_kthread+0x54/0x310 [84047.523076] ? rcu_cpu_kthread+0x7d/0x310 [84047.523077] ? smpboot_register_percpu_thread_cpumask+0x130/0x130 [84047.523078] ? rcu_cpu_kthread+0xaf/0x310 [84047.523079] ? smpboot_register_percpu_thread_cpumask+0x130/0x130 [84047.523080] ? smpboot_thread_fn+0x1d6/0x2c0 [84047.523081] ? kthread+0x112/0x130 [84047.523082] ? kthread_flush_work_fn+0x10/0x10 [84047.523084] ? ret_from_fork+0x1f/0x40 [...]
-
Kernel warnings in
rcu
-related code paths are not necessary but possible with this issue.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
Comments