Kernel Panic in functions like gup_pgd_range() after a corrected "Hardware Error" is reported
Red Hat Lightspeed can detect this issue
Environment
- Dell PowerEdge Hardware
- Red Hat Enterprise Linux 8
  - Specifically kernel versions below kernel-4.18.0-348.el8
- Red Hat Enterprise Linux 8.4 AUS
  - Specifically kernel versions below kernel-4.18.0-305.91.1.el8_4
- Hugepages must be configured
Issue
- The kernel panics after a corrected memory "Hardware Error".
Resolution
Red Hat Enterprise Linux 8
- The issue has been resolved with kernel-4.18.0-348.el8 via Errata.
- The issue was tracked at private Bugzilla 1984173.
Red Hat Enterprise Linux 8.4 AUS
- The issue has been resolved with kernel-4.18.0-305.91.1.el8_4 via Errata RHSA-2023:3461.
- The issue was tracked at private Bugzilla 2188306.
Workaround
A tested workaround is to boot the server with the ghes.disable=y kernel command line option to temporarily disable the hardware error reporting code causing the crashes.
Please reference the following article for more information on how to change the kernel command line options:
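As a sketch only (assuming a grub2-based RHEL 8 host; verify against the referenced article and your own boot configuration), the option can be added persistently with grubby:

```shell
# Append ghes.disable=y to the command line of all installed kernels.
grubby --update-kernel=ALL --args="ghes.disable=y"

# Confirm the argument is now present, then reboot for it to take effect.
grubby --info=ALL | grep ghes.disable
```

Note that disabling GHES also disables APEI hardware error reporting, so this should only be used until a fixed kernel can be installed.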
Root Cause
A race condition exists in the hardware error detection and handling code paths provided by the ghes kernel module. When a hardware error is detected for some unit of memory (a page), the ghes module migrates the contents of that page elsewhere to preserve them.
The race condition occurs when a hugepage is migrated as a result of that error handling. ghes hands the remaining migration work off to a kworker thread. While the kworker thread performs the migration, another process (e.g. KVM) that happens to be using the same memory attempts to access the "migrated" page. The migration requires modifications to the hierarchical page management structures (in this case, the Page Upper Directory, or PUD, and the Page Middle Directory, or PMD). The migration should take a lock to prevent concurrent access to the migrating pages, but does not, because the kernel did not consider the PUD entry for the page in question eligible for migration. As a result, the kworker thread migrates the page of memory out from under the second process, which hits a "General Protection Fault" and panics the kernel. The number and size of hugepages on the system and the frequency of hardware errors influence the likelihood of hitting the bug: more hugepages, larger hugepages, or more frequent hardware errors all increase the odds.
The fix changes the code to assume the PUD entry is eligible for migration, so that the entry is locked against concurrent access while the migration occurs.
Note: The issue may be triggered by hardware errors regardless of whether the error is corrected or uncorrected. The panic is triggered by a combination of how Dell hardware reports hardware errors and a bug in the Linux kernel; it is not directly caused by, nor does it indicate, a defect in Dell hardware. Similarly, while a single or infrequent corrected hardware error is generally considered safe to ignore, substantial quantities of corrected hardware errors in a very short period of time (hundreds within a single second, for example) should at least be reviewed by the hardware vendor; such activity can introduce substantial jitter to latency-sensitive workloads while the OS handles the errors.
Diagnostic Steps
Prerequisites
- Deploy kdump in Order to Collect a vmcore:
  - Vmcore analysis is required to determine if you are being impacted by this issue. This first requires that a vmcore is dumped successfully.
  - If the kexec-tools package is absent or the kdump service is inactive, please reference the following article to install, enable, start, and configure kdump:
    How to troubleshoot kernel crashes, hangs, or reboots with kdump on Red Hat Enterprise Linux
- Prepare crash Environment for vmcore Analysis:
  - Ensure that you have the crash package installed, and if necessary install the package:
    # yum install crash
  - Ensure the necessary debuginfo package is installed. See the following article for more information:
    How can I download or install debuginfo packages for RHEL systems?
Vmcore Analysis
- Here is the backtrace of the failing process. It is handling a userspace page fault for a guest virtual machine (VM):
  PID: 20191    TASK: ffff9842c2350000  CPU: 56   COMMAND: "CPU 3/KVM"
   #0 [ffffaaab26b3f768] machine_kexec at ffffffffb2c6090e
   #1 [ffffaaab26b3f7c0] __crash_kexec at ffffffffb2d8f0bd
   #2 [ffffaaab26b3f888] crash_kexec at ffffffffb2d8ffad
   #3 [ffffaaab26b3f8a0] oops_end at ffffffffb2c2435d
   #4 [ffffaaab26b3f8c0] general_protection at ffffffffb36010ce
      [exception RIP: gup_pgd_range+0x24c]
      RIP: ffffffffb2e95acc  RSP: ffffaaab26b3f970  RFLAGS: 00010086
      RAX: 00007f1148cdffff  RBX: 000f98136ffff230  RCX: ffff981580000230
      RDX: 000fffffffffffff  RSI: 00007f1148ce0000  RDI: effffffdeffffe02
      RBP: ffffaaab26b3fa4c   R8: ffffaaab26b3fa4c   R9: ffffaaab26b3fbbb
      R10: 00112a630d68d462  R11: 0000000000000000  R12: 0000000000000001
      R13: 0000000000000000  R14: 00007f1148cdf000  R15: 00007f1148cdf000
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #5 [ffffaaab26b3fa38] internal_get_user_pages_fast at ffffffffb2e97d1e
   #6 [ffffaaab26b3fa80] __get_user_pages_fast at ffffffffb2e97e38
   #7 [ffffaaab26b3fa88] __gfn_to_pfn_memslot at ffffffffc09efee4 [kvm]
   #8 [ffffaaab26b3faf0] try_async_pf at ffffffffc0a2e821 [kvm]
   #9 [ffffaaab26b3fb68] direct_page_fault at ffffffffc0a39b4a [kvm]
  #10 [ffffaaab26b3fc30] kvm_mmu_page_fault at ffffffffc0a3a3f9 [kvm]
  #11 [ffffaaab26b3fd18] vcpu_enter_guest at ffffffffc0a0d37c [kvm]
  #12 [ffffaaab26b3fdb8] kvm_arch_vcpu_ioctl_run at ffffffffc0a101ea [kvm]
  #13 [ffffaaab26b3fde8] kvm_vcpu_ioctl at ffffffffc09ed71a [kvm]
  #14 [ffffaaab26b3fe80] do_vfs_ioctl at ffffffffb2f2d234
  #15 [ffffaaab26b3fef8] ksys_ioctl at ffffffffb2f2d870
  #16 [ffffaaab26b3ff30] __x64_sys_ioctl at ffffffffb2f2d8b6
  #17 [ffffaaab26b3ff38] do_syscall_64 at ffffffffb2c0420b
  #18 [ffffaaab26b3ff50] entry_SYSCALL_64_after_hwframe at ffffffffb36000ad
      RIP: 00007f1334a0c62b  RSP: 00007f12537fd628  RFLAGS: 00000246
      RAX: ffffffffffffffda  RBX: 000055c92fb0b410  RCX: 00007f1334a0c62b
      RDX: 0000000000000000  RSI: 000000000000ae80  RDI: 0000000000000060
      RBP: 0000000000000000   R8: 000055c92c88fdd8   R9: 000000000000002c
      R10: 0000000000000001  R11: 0000000000000246  R12: 0000000000000001
      R13: 000055c92c8b2020  R14: 0000000000000000  R15: 00007f1338379000
      ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
- A corrected hardware error is seen in the logs just prior to the panic:
  crash> log
  [...cut...]
  [327450.877805] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
  [327450.877807] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
  [327450.877808] {1}[Hardware Error]: event severity: corrected
  [327450.877809] {1}[Hardware Error]: Error 0, type: corrected
  [327450.877810] {1}[Hardware Error]: fru_text: A1
  [327450.877810] {1}[Hardware Error]: section_type: memory error
  [327450.877811] {1}[Hardware Error]: error_status: 0x0000000000000400
  [327450.877812] {1}[Hardware Error]: physical_address: 0x00000010af886040
  [327450.877814] {1}[Hardware Error]: node: 0 card: 0 module: 0 rank: 1 bank: 2 device: 4 row: 44193 column: 928
  [327450.877815] {1}[Hardware Error]: error_type: 2, single-bit ECC
  [327450.877816] {1}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
  [327450.877828] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
  [327450.877828] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
  [327450.877828] {2}[Hardware Error]: event severity: corrected
  [327450.877829] {2}[Hardware Error]: Error 0, type: corrected
  [327450.877831] {2}[Hardware Error]: section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
  [327450.877832] {2}[Hardware Error]: section length: 0x38
  [327450.877834] {2}[Hardware Error]: 00000000: 01010001 00000000 af886000 00000010 .........`......
  [327450.877836] {2}[Hardware Error]: 00000010: 00001000 00000000 af886fff 00000010 .........o......
  [327450.877837] {2}[Hardware Error]: 00000020: 00000080 00000000 00000000 00000000 ................
  [327450.877838] {2}[Hardware Error]: 00000030: 00000000 00000000 ........
  [327450.886746] general protection fault: 0000 [#1] SMP NOPTI
  [327450.892235] CPU: 56 PID: 20191 Comm: CPU 3/KVM Kdump: loaded Tainted: G I --------- -  - 4.18.0-305.34.2.el8_4.x86_64 #1
  [327450.904392] Hardware name: Dell Inc. PowerEdge R640/0H28RR, BIOS 2.15.1 06/15/2022
  [327450.912055] RIP: 0010:gup_pgd_range+0x24c/0xc50
  [327450.916673] Code: 89 03 00 00 48 81 e3 00 00 00 c0 48 21 d8 48 03 0d a9 13 ee 00 4c 89 74 24 10 48 8d 1c 01 49 8d 46 ff 4d 89 fe 48 89 44 24 58 <4c> 8b 23 4d 8d ae 00 00 20 00 49 81 e5 00 00 e0 ff 49 8d 45 ff 4c
  [327450.935505] RSP: 0018:ffffaaab26b3f970 EFLAGS: 00010086
  [327450.940817] RAX: 00007f1148cdffff RBX: 000f98136ffff230 RCX: ffff981580000230
  [327450.948036] RDX: 000fffffffffffff RSI: 00007f1148ce0000 RDI: effffffdeffffe02
  [327450.955256] RBP: ffffaaab26b3fa4c R08: ffffaaab26b3fa4c R09: ffffaaab26b3fbbb
  [327450.962476] R10: 00112a630d68d462 R11: 0000000000000000 R12: 0000000000000001
  [327450.969696] R13: 0000000000000000 R14: 00007f1148cdf000 R15: 00007f1148cdf000
  [327450.976916] FS:  00007f12537fe700(0000) GS:ffff984500d00000(0000) knlGS:0000000000000000
  [327450.985087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [327450.990921] CR2: 000055e478b5dfc0 CR3: 0000002d48056001 CR4: 00000000007726e0
  [327450.998138] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [327451.005358] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [327451.012578] PKRU: 55555554
- The hardware error occurs on a 1 GB page starting at address 0x1080000000:

  crash> ptov 0x00000010af886040
  VIRTUAL           PHYSICAL
  ffff98262f886040  10af886040

  crash> vtop ffff98262f886040
  VIRTUAL           PHYSICAL
  ffff98262f886040  10af886040

  PGD DIRECTORY: ffffffffb4210000
  PAGE DIRECTORY: 5c73c01067
    PUD: 5c73c014c0 => 80000010800001e3
   PAGE: 1080000000  (1GB)

        PTE         PHYSICAL  FLAGS
  80000010800001e3  1080000000  (PRESENT|RW|ACCESSED|DIRTY|PSE|GLOBAL|NX)

        PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
  ffffd02382be2180  10af886000                0        0  1 17ffffc0400000 hwpoison
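The 1 GB page base can be double-checked with plain shell arithmetic: masking the low 30 bits off the reported physical address should land on the PAGE value that vtop printed.

```shell
# Mask the failing physical address (from the hardware error log) down to a
# 1 GiB boundary; the result is the base of the hugepage that took the error.
err_addr=0x00000010af886040
page_base=$(( err_addr & ~((1 << 30) - 1) ))
printf '0x%x\n' "$page_base"   # 0x1080000000
```

The result, 0x1080000000, matches the PAGE line in the vtop output above.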
- The address on which we panic is held in %rbx and placed on the stack:

  crash> bt -FFls
  ffffaaab26b3f978: ffff9842c9ef2228 00007f1148ce0000
  ffffaaab26b3f988: ffffaaab26b3fa4c 00007f1148ce0000
  ffffaaab26b3f998: 0000000000000007 00007f1148ce0000
  ffffaaab26b3f9a8: ffffaaab26b3f9f8 ffff9842c80567f0
  ffffaaab26b3f9b8: 00007f1148ce0000 00007f1100080005
  ffffaaab26b3f9c8: 00007f1148cdffff 00007f1148cdffff
  ffffaaab26b3f9d8: 84607eb05b91ba00 00007f1148cdffff
  ffffaaab26b3f9e8: 0000000126b3faa0 00007f1148cdffff
  ffffaaab26b3f9f8: 0000002d49ef2067 84607eb05b91ba00
  ffffaaab26b3fa08: 00007f1148cdf000 0000000000080005              <- %rbx %rbp addr
  ffffaaab26b3fa18: 0000000000000001 ffffaaab26b3fab0              <- %r12 %r13 page (struct)
  ffffaaab26b3fa28: 0000000000000206 [ffff9842edf34c10:kmalloc-2k] <- %r14 %r15 struct kvm_memory_slot
  ffffaaab26b3fa38: internal_get_user_pages_fast+0xce              <- We are in gup_pgd_range %rip
- The address maps back to the following physical page:

  crash> vtop 00007f1148cdf000
  VIRTUAL       PHYSICAL
  7f1148cdf000  f88cdf000

     PGD: 2d480567f0 => 2d49ef2067
     PUD: 2d49ef2228 => 8000000f80000887
    PAGE: f80000000  (1GB)

        PTE         PHYSICAL  FLAGS
  8000000f80000887  f80000000  (PRESENT|RW|USER|PSE|NX)

        VMA           START       END     FLAGS FILE
  ffff9842c2911878 7f0a40000000 7f1240000000 2c4600fb /dev/hugepages/libvirt/qemu/7-instance-00000630/qemu_back_mem._objects_ram-node0.CTZ2nX

        PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
  ffffd0237e2337c0  f88cdf000                0        0  0 17ffffc0000000
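The PGD and PUD entry addresses that vtop reports can be reproduced by hand: with x86-64 4-level paging each level uses a 9-bit index, so the entry offset inside a table is the index times 8 bytes. A small sketch, using the table bases taken from the output above (the CR3 page, 0x2d48056000, and the PUD table page, 0x2d49ef2000, from the PGD entry value 2d49ef2067):

```shell
vaddr=0x00007f1148cdf000
pgd_index=$(( (vaddr >> 39) & 0x1ff ))   # index into the page global directory
pud_index=$(( (vaddr >> 30) & 0x1ff ))   # index into the page upper directory

pgd_entry=$(( 0x2d48056000 + pgd_index * 8 ))
pud_entry=$(( 0x2d49ef2000 + pud_index * 8 ))
printf 'PGD entry: 0x%x\n' "$pgd_entry"   # 0x2d480567f0
printf 'PUD entry: 0x%x\n' "$pud_entry"   # 0x2d49ef2228
```

The results line up with the PGD: 2d480567f0 and PUD: 2d49ef2228 lines in the vtop output, confirming which PUD slot we need to inspect next.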
- Looking at the PUD page which holds the 1 GB page's page table entry (PTE), we see this entry is out of sequence and likely held the physical page on which the hardware error occurred, 1080000000:

  2d49ef2140: 0000002d41a92067 80000017800008e7  g .A-...........
  2d49ef2150: 80000017400008e7 80000017000008e7  ...@............
  2d49ef2160: 80000016c00008e7 80000016800008e7  ................
  2d49ef2170: 80000016400008e7 80000016000008e7  ...@............
  2d49ef2180: 80000015c00008e7 80000015800008e7  ................
  2d49ef2190: 80000015400008e7 80000015000008e7  ...@............
  2d49ef21a0: 80000014c00008e7 80000014800008e7  ................
  2d49ef21b0: 80000014400008e7 80000014000008e7  ...@............
  2d49ef21c0: 80000013c00008e7 80000013800008e7  ................
  2d49ef21d0: 80000013400008e7 80000013000008e7  ...@............
  2d49ef21e0: 80000012c00008e7 80000012800008e7  ................
  2d49ef21f0: 80000012400008e7 80000012000008e7  ...@............
  2d49ef2200: 80000011c00008e7 80000011800008e7  ................
  2d49ef2210: 80000011400008e7 80000011000008e7  ...@............
  2d49ef2220: 80000010c00008e7 8000000f80000887  ................ <<<-------
  2d49ef2230: 80000010400008e7 80000010000008e7  ...@............
  2d49ef2240: 8000000fc00008e7 0000002d725f2067  ........g _r-...
  2d49ef2250: 0000002d48174067 0000002d42820067  g@.H-...g..B-...
  2d49ef2260: 0000002d49e6d067 0000000000000000  g..I-...........
- The page which took a hardware error is being migrated without locking. Since it is a 1 GB page, there is a larger window for the page to be touched during migration, causing a panic, as the data in the page may be fluctuating at this time.
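The out-of-sequence slot can be confirmed arithmetically. The neighbouring PUD entries map 1 GiB pages whose bases descend by 0x40000000 per slot, so the flagged slot should have held the base of the hardware-error page. A sketch using the entry values read from the dump above (the mask keeps the physical-address bits, 30-51, of a 1 GiB PSE entry):

```shell
prev=0x80000010c00008e7   # healthy entry at 2d49ef2220
curr=0x8000000f80000887   # out-of-sequence entry at 2d49ef2228
mask=0x000fffffc0000000   # physical-address bits of a 1 GiB PSE entry

expected=$(( (prev & mask) - (1 << 30) ))   # next base in the descending run
actual=$(( curr & mask ))
printf 'expected 1GB page base: 0x%x\n' "$expected"   # 0x1080000000, the hwpoisoned page
printf 'actual   1GB page base: 0x%x\n' "$actual"     # 0xf80000000
```

The expected base is exactly the hwpoisoned page from the hardware error, while the entry actually present points at the page the faulting process was mapped to, which is consistent with the unlocked migration swapping the entry out from under the reader.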
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.