Kernel Panic in functions like gup_pgd_range() after a corrected "Hardware Error" is reported

Solution Verified

Environment

  • Dell PowerEdge Hardware
  • Red Hat Enterprise Linux 8

    • Specifically, kernel versions earlier than kernel-4.18.0-348.el8 (or earlier than kernel-4.18.0-305.91.1.el8_4 on Red Hat Enterprise Linux 8.4 AUS)
    • Hugepages must be configured
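
The version condition above can be checked mechanically. The following is a minimal sketch, not a supported tool: the function names are illustrative, the two fixed builds are the ones listed in the Resolution section, and it assumes that any build in the 305 series belongs to the 8.4 stream while every other RHEL 8 build is compared against 348.

```python
import re

# Fixed builds taken from the Resolution section of this article.
FIXED_RHEL8 = (348,)              # kernel-4.18.0-348.el8
FIXED_RHEL84_AUS = (305, 91, 1)   # kernel-4.18.0-305.91.1.el8_4


def build_tuple(release):
    """Return the numeric build fields of a 4.18.0-<build>.el8* release string."""
    match = re.search(r"4\.18\.0-([0-9.]+?)\.el8", release)
    if match is None:
        raise ValueError(f"not a RHEL 8 kernel release: {release!r}")
    return tuple(int(field) for field in match.group(1).split("."))


def predates_fix(release):
    """True if the release is older than the fixed build for its stream."""
    build = build_tuple(release)
    # Assumption: a leading 305 means the 8.4 stream; anything else is
    # compared against the first fixed RHEL 8 build.
    fixed = FIXED_RHEL84_AUS if build[0] == 305 else FIXED_RHEL8
    return build < fixed
```

On a running system the input could come from `platform.release()`. The kernel in the vmcore analysed below, 4.18.0-305.34.2.el8_4.x86_64, is reported as predating the fix.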

Issue

  • The kernel panics after a corrected memory "Hardware Error".

Resolution

Red Hat Enterprise Linux 8

  • The issue has been resolved with kernel-4.18.0-348.el8 via Errata .
  • The issue was tracked at private Bugzilla 1984173.

Red Hat Enterprise Linux 8.4 AUS

  • The issue has been resolved with kernel-4.18.0-305.91.1.el8_4 via Errata RHSA-2023:3461.
  • The issue was tracked at private Bugzilla 2188306.

Workaround

A tested workaround is to boot the server with the ghes.disable=y kernel command line option to temporarily disable the hardware error reporting code causing the crashes.
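
After rebooting, it may be useful to confirm that the option actually reached the kernel. The sketch below is illustrative only: the function name is hypothetical, and it recognises just the spellings `y`, `Y` and `1`, although the kernel may accept others.

```python
def ghes_disabled(cmdline):
    """Return True if the command line carries ghes.disable set to y, Y or 1."""
    for token in cmdline.split():
        name, _, value = token.partition("=")
        if name == "ghes.disable" and value in ("y", "Y", "1"):
            return True
    return False
```

The string to test would normally be the contents of /proc/cmdline on the booted system.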

Please reference the following article for more information on how to change the kernel command line options:

Root Cause

A race condition exists in the hardware error detection and handling code paths provided by the ghes kernel module. When a hardware error is detected in a unit of memory (a page), the ghes module attempts to migrate the contents of that page elsewhere in order to preserve them.

The race condition occurs when the page being migrated as a result of this error handling is a hugepage. ghes hands off the remaining work of the migration to a kworker thread. While the kworker thread is migrating the page, another process (e.g. KVM) that happens to be using the same memory attempts to access the page being migrated. The migration requires modifications to the hierarchical page management structures (in this case, the Page Upper Directory, or PUD, and the Page Middle Directory, or PMD). The migration should be locked to prevent concurrent access to the migrating page, but it is not, because the kernel did not consider the PUD entry for the page in question to be eligible for migration. As a result, the kworker thread and the other process collide: the kworker thread migrates the page out from under the second process, a "General Protection Fault" occurs, and the kernel panics.

The number and size of hugepages on the system and the frequency of hardware errors influence the likelihood of hitting the bug: more hugepages, larger hugepages, or more frequent hardware errors each make it more likely.

The fix changes the code to assume that the PUD entry is eligible for migration, so that the entry is locked against concurrent access while the migration occurs.
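
The effect of the missing lock can be pictured with a deliberately simplified model. This is a toy replay of one fixed interleaving, not kernel code: the slot dictionary, the `MIGRATING` marker and the function name are all hypothetical, and the two page addresses are borrowed from the vmcore analysed under Diagnostic Steps purely as labels.

```python
MIGRATING = "migration-in-progress"   # hypothetical marker, not a kernel value


def simulate(take_lock):
    """Replay one fixed interleaving of a migrator and a reader on one slot."""
    slot = {"entry": "1 GB page at 0x1080000000", "locked": False}
    seen = []

    def reader():
        # A reader that honours the lock waits instead of using the entry.
        if slot["locked"]:
            return "reader waits"
        return slot["entry"]

    # Migrator, step 1: optionally lock, then mark the entry as in flight.
    slot["locked"] = take_lock
    slot["entry"] = MIGRATING
    seen.append(reader())          # the other process arrives mid-migration
    # Migrator, step 2: install the replacement page and unlock.
    slot["entry"] = "1 GB page at 0xf80000000"
    slot["locked"] = False
    seen.append(reader())          # access after the migration has completed
    return seen
```

Without the lock, the first access observes the in-flight entry and would go on to use it as if it were valid; with the lock, the reader only ever sees a settled entry.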

Note: The issue can be triggered by a hardware error regardless of whether the error is corrected or uncorrected. It results from a combination of how Dell hardware handles hardware errors and a bug in the Linux kernel; it is not directly caused by, and does not indicate, a defect in Dell hardware. Similarly, while a single or infrequent corrected hardware error is generally considered safe to ignore, substantial quantities of corrected hardware errors in a very short period of time (hundreds within a single second, for example) should at least be reviewed by the hardware vendor. Such activity can also introduce substantial jitter to latency-sensitive workloads, because the OS must handle each error.
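
To judge whether corrected errors are arriving in bursts, the kernel log can be bucketed by second. The sketch below is an illustration under stated assumptions: it expects lines in the timestamped format shown in the vmcore log under Diagnostic Steps, the function names are hypothetical, and the default threshold of 100 is only a stand-in for "hundreds within a single second".

```python
import re
from collections import Counter

# Leading "[seconds.microseconds]" timestamp of a kernel log line.
TIMESTAMP = re.compile(r"^\[\s*(\d+)\.\d+\]")


def corrected_errors_per_second(lines):
    """Count 'event severity: corrected' hardware error lines per whole second."""
    counts = Counter()
    for line in lines:
        if "[Hardware Error]: event severity: corrected" not in line:
            continue
        match = TIMESTAMP.match(line.strip())
        if match:
            counts[int(match.group(1))] += 1
    return counts


def busy_seconds(lines, threshold=100):
    """Return the seconds in which at least `threshold` corrected errors were logged."""
    counts = corrected_errors_per_second(lines)
    return sorted(sec for sec, n in counts.items() if n >= threshold)
```

The timestamps are seconds since boot, so the result identifies bursts relative to uptime rather than wall-clock time.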

Diagnostic Steps

Pre-requisites

  1. Deploy kdump in Order to Collect a vmcore:

  2. Prepare crash Environment for vmcore Analysis
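
Before waiting for the next panic, it may be worth confirming that a crash kernel is actually loaded, since no vmcore is written without one. A minimal sketch, assuming the kernel exposes /sys/kernel/kexec_crash_loaded (which reads 1 when a crash kernel is loaded); the function names are illustrative:

```python
def parse_crash_loaded(text):
    """Interpret the contents of kexec_crash_loaded: '1' means loaded."""
    return text.strip() == "1"


def crash_kernel_loaded(path="/sys/kernel/kexec_crash_loaded"):
    """Read the sysfs flag and report whether a crash kernel is loaded."""
    with open(path) as handle:
        return parse_crash_loaded(handle.read())
```

This only shows that a crash kernel is loaded, not that the dump target is reachable or has enough space.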

Vmcore Analysis

  1. Here is the backtrace of the failing process. It is handling a userspace page fault for a guest virtual machine (VM):

    PID: 20191    TASK: ffff9842c2350000  CPU: 56   COMMAND: "CPU 3/KVM"
     #0 [ffffaaab26b3f768] machine_kexec at ffffffffb2c6090e
     #1 [ffffaaab26b3f7c0] __crash_kexec at ffffffffb2d8f0bd
     #2 [ffffaaab26b3f888] crash_kexec at ffffffffb2d8ffad
     #3 [ffffaaab26b3f8a0] oops_end at ffffffffb2c2435d
     #4 [ffffaaab26b3f8c0] general_protection at ffffffffb36010ce
        [exception RIP: gup_pgd_range+0x24c]
        RIP: ffffffffb2e95acc  RSP: ffffaaab26b3f970  RFLAGS: 00010086
        RAX: 00007f1148cdffff  RBX: 000f98136ffff230  RCX: ffff981580000230
        RDX: 000fffffffffffff  RSI: 00007f1148ce0000  RDI: effffffdeffffe02
        RBP: ffffaaab26b3fa4c   R8: ffffaaab26b3fa4c   R9: ffffaaab26b3fbbb
        R10: 00112a630d68d462  R11: 0000000000000000  R12: 0000000000000001
        R13: 0000000000000000  R14: 00007f1148cdf000  R15: 00007f1148cdf000
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
     #5 [ffffaaab26b3fa38] internal_get_user_pages_fast at ffffffffb2e97d1e
     #6 [ffffaaab26b3fa80] __get_user_pages_fast at ffffffffb2e97e38
     #7 [ffffaaab26b3fa88] __gfn_to_pfn_memslot at ffffffffc09efee4 [kvm]
     #8 [ffffaaab26b3faf0] try_async_pf at ffffffffc0a2e821 [kvm]
     #9 [ffffaaab26b3fb68] direct_page_fault at ffffffffc0a39b4a [kvm]
    #10 [ffffaaab26b3fc30] kvm_mmu_page_fault at ffffffffc0a3a3f9 [kvm]
    #11 [ffffaaab26b3fd18] vcpu_enter_guest at ffffffffc0a0d37c [kvm]
    #12 [ffffaaab26b3fdb8] kvm_arch_vcpu_ioctl_run at ffffffffc0a101ea [kvm]
    #13 [ffffaaab26b3fde8] kvm_vcpu_ioctl at ffffffffc09ed71a [kvm]
    #14 [ffffaaab26b3fe80] do_vfs_ioctl at ffffffffb2f2d234
    #15 [ffffaaab26b3fef8] ksys_ioctl at ffffffffb2f2d870
    #16 [ffffaaab26b3ff30] __x64_sys_ioctl at ffffffffb2f2d8b6
    #17 [ffffaaab26b3ff38] do_syscall_64 at ffffffffb2c0420b
    #18 [ffffaaab26b3ff50] entry_SYSCALL_64_after_hwframe at ffffffffb36000ad
        RIP: 00007f1334a0c62b  RSP: 00007f12537fd628  RFLAGS: 00000246
        RAX: ffffffffffffffda  RBX: 000055c92fb0b410  RCX: 00007f1334a0c62b
        RDX: 0000000000000000  RSI: 000000000000ae80  RDI: 0000000000000060
        RBP: 0000000000000000   R8: 000055c92c88fdd8   R9: 000000000000002c
        R10: 0000000000000001  R11: 0000000000000246  R12: 0000000000000001
        R13: 000055c92c8b2020  R14: 0000000000000000  R15: 00007f1338379000
        ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
    
  2. A corrected hardware error is seen in the logs just prior to the panic:

    crash> log
    [...cut...]
    [327450.877805] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
    [327450.877807] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    [327450.877808] {1}[Hardware Error]: event severity: corrected
    [327450.877809] {1}[Hardware Error]:  Error 0, type: corrected
    [327450.877810] {1}[Hardware Error]:  fru_text: A1
    [327450.877810] {1}[Hardware Error]:   section_type: memory error
    [327450.877811] {1}[Hardware Error]:   error_status: 0x0000000000000400
    [327450.877812] {1}[Hardware Error]:   physical_address: 0x00000010af886040
    [327450.877814] {1}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 1 bank: 2 device: 4 row: 44193 column: 928 
    [327450.877815] {1}[Hardware Error]:   error_type: 2, single-bit ECC
    [327450.877816] {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000 
    [327450.877828] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
    [327450.877828] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
    [327450.877828] {2}[Hardware Error]: event severity: corrected
    [327450.877829] {2}[Hardware Error]:  Error 0, type: corrected
    [327450.877831] {2}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
    [327450.877832] {2}[Hardware Error]:   section length: 0x38
    [327450.877834] {2}[Hardware Error]:   00000000: 01010001 00000000 af886000 00000010  .........`......
    [327450.877836] {2}[Hardware Error]:   00000010: 00001000 00000000 af886fff 00000010  .........o......
    [327450.877837] {2}[Hardware Error]:   00000020: 00000080 00000000 00000000 00000000  ................
    [327450.877838] {2}[Hardware Error]:   00000030: 00000000 00000000                    ........
    [327450.886746] general protection fault: 0000 [#1] SMP NOPTI
    [327450.892235] CPU: 56 PID: 20191 Comm: CPU 3/KVM Kdump: loaded Tainted: G          I      --------- -  - 4.18.0-305.34.2.el8_4.x86_64 #1
    [327450.904392] Hardware name: Dell Inc. PowerEdge R640/0H28RR, BIOS 2.15.1 06/15/2022
    [327450.912055] RIP: 0010:gup_pgd_range+0x24c/0xc50
    [327450.916673] Code: 89 03 00 00 48 81 e3 00 00 00 c0 48 21 d8 48 03 0d a9 13 ee 00 4c 89 74 24 10 48 8d 1c 01 49 8d 46 ff 4d 89 fe 48 89 44 24 58 <4c> 8b 23 4d 8d ae 00 00 20 00 49 81 e5 00 00 e0 ff 49 8d 45 ff 4c
    [327450.935505] RSP: 0018:ffffaaab26b3f970 EFLAGS: 00010086
    [327450.940817] RAX: 00007f1148cdffff RBX: 000f98136ffff230 RCX: ffff981580000230
    [327450.948036] RDX: 000fffffffffffff RSI: 00007f1148ce0000 RDI: effffffdeffffe02
    [327450.955256] RBP: ffffaaab26b3fa4c R08: ffffaaab26b3fa4c R09: ffffaaab26b3fbbb
    [327450.962476] R10: 00112a630d68d462 R11: 0000000000000000 R12: 0000000000000001
    [327450.969696] R13: 0000000000000000 R14: 00007f1148cdf000 R15: 00007f1148cdf000
    [327450.976916] FS:  00007f12537fe700(0000) GS:ffff984500d00000(0000) knlGS:0000000000000000
    [327450.985087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [327450.990921] CR2: 000055e478b5dfc0 CR3: 0000002d48056001 CR4: 00000000007726e0
    [327450.998138] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [327451.005358] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [327451.012578] PKRU: 55555554
    
  3. The hardware error occurs on a 1 GB page starting at address 0x1080000000:

    crash> ptov 0x00000010af886040
    VIRTUAL           PHYSICAL        
    ffff98262f886040  10af886040 
    
    crash> vtop ffff98262f886040
    VIRTUAL           PHYSICAL        
    ffff98262f886040  10af886040      
    
    PGD DIRECTORY: ffffffffb4210000
    PAGE DIRECTORY: 5c73c01067
       PUD: 5c73c014c0 => 80000010800001e3
      PAGE: 1080000000  (1GB)
    
          PTE          PHYSICAL   FLAGS
    80000010800001e3  1080000000  (PRESENT|RW|ACCESSED|DIRTY|PSE|GLOBAL|NX)
    
          PAGE         PHYSICAL      MAPPING       INDEX CNT FLAGS
    ffffd02382be2180 10af886000                0        0  1 17ffffc0400000 hwpoison
    
  4. The address on which we panic is held in %rbx and placed on the stack:

    crash> bt -FFls
    ffffaaab26b3f978: ffff9842c9ef2228 00007f1148ce0000 
    ffffaaab26b3f988: ffffaaab26b3fa4c 00007f1148ce0000 
    ffffaaab26b3f998: 0000000000000007 00007f1148ce0000 
    ffffaaab26b3f9a8: ffffaaab26b3f9f8 ffff9842c80567f0
    ffffaaab26b3f9b8: 00007f1148ce0000 00007f1100080005
    ffffaaab26b3f9c8: 00007f1148cdffff 00007f1148cdffff
    ffffaaab26b3f9d8: 84607eb05b91ba00 00007f1148cdffff 
    ffffaaab26b3f9e8: 0000000126b3faa0 00007f1148cdffff 
    ffffaaab26b3f9f8: 0000002d49ef2067 84607eb05b91ba00 
    
    ffffaaab26b3fa08: 00007f1148cdf000 0000000000080005
                            %rbx              %rbp
                            addr 
    
    ffffaaab26b3fa18: 0000000000000001 ffffaaab26b3fab0 
                             %r12           %r13
                                         page (struct)
    
    ffffaaab26b3fa28: 0000000000000206 [ffff9842edf34c10:kmalloc-2k] 
                             %r14               %r15
                                        struct kvm_memory_slot
    
    ffffaaab26b3fa38: internal_get_user_pages_fast+0xce   We are in gup_pgd_range
                                %rip
    
  5. The address maps back to the following physical page:

    crash> vtop 00007f1148cdf000
    VIRTUAL     PHYSICAL        
    7f1148cdf000  f88cdf000       
    
       PGD: 2d480567f0 => 2d49ef2067
       PUD: 2d49ef2228 => 8000000f80000887
      PAGE: f80000000  (1GB)
    
          PTE         PHYSICAL   FLAGS
    8000000f80000887  f80000000  (PRESENT|RW|USER|PSE|NX)
    
          VMA           START       END     FLAGS FILE
    ffff9842c2911878 7f0a40000000 7f1240000000 2c4600fb /dev/hugepages/libvirt/qemu/7-instance-00000630/qemu_back_mem._objects_ram-node0.CTZ2nX
    
          PAGE         PHYSICAL      MAPPING       INDEX CNT FLAGS
    ffffd0237e2337c0  f88cdf000                0        0  0 17ffffc0000000
    
  6. Looking at the PUD page that holds the page table entry (PTE) for the 1 GB page, we see that this entry is out of sequence; its slot likely held the physical page on which the hardware error occurred, 1080000000:

    2d49ef2140:  0000002d41a92067 80000017800008e7   g .A-...........
    2d49ef2150:  80000017400008e7 80000017000008e7   ...@............
    2d49ef2160:  80000016c00008e7 80000016800008e7   ................
    2d49ef2170:  80000016400008e7 80000016000008e7   ...@............
    2d49ef2180:  80000015c00008e7 80000015800008e7   ................
    2d49ef2190:  80000015400008e7 80000015000008e7   ...@............
    2d49ef21a0:  80000014c00008e7 80000014800008e7   ................
    2d49ef21b0:  80000014400008e7 80000014000008e7   ...@............
    2d49ef21c0:  80000013c00008e7 80000013800008e7   ................
    2d49ef21d0:  80000013400008e7 80000013000008e7   ...@............
    2d49ef21e0:  80000012c00008e7 80000012800008e7   ................
    2d49ef21f0:  80000012400008e7 80000012000008e7   ...@............
    2d49ef2200:  80000011c00008e7 80000011800008e7   ................
    2d49ef2210:  80000011400008e7 80000011000008e7   ...@............
    2d49ef2220:  80000010c00008e7 8000000f80000887   ................  <<<-------
    2d49ef2230:  80000010400008e7 80000010000008e7   ...@............
    2d49ef2240:  8000000fc00008e7 0000002d725f2067   ........g _r-...
    2d49ef2250:  0000002d48174067 0000002d42820067   g@.H-...g..B-...
    2d49ef2260:  0000002d49e6d067 0000000000000000   g..I-...........
    
  7. The page that took the hardware error is being migrated without locking. Because
    it is a 1 GB page, there is a larger window during which the page can be touched
    while the unlocked migration is in progress. Such an access causes a panic,
    because the page and its page table entry may still be changing at that time.
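
The address arithmetic behind steps 3 to 6 can be reproduced outside of crash. The following is a sketch under stated assumptions: the helper names are hypothetical, 4-level paging with 48-bit canonical addresses is assumed, and the PFN mask simply keeps physical-address bits 12 to 51 of an entry. The input values in the usage note are copied from the output above.

```python
GIB = 1 << 30                      # size of a 1 GB hugepage
PFN_MASK = 0x000FFFFFFFFFF000      # physical-address bits 12..51 of an entry


def hugepage_base(phys_addr):
    """Round a physical address down to its 1 GB page boundary."""
    return phys_addr & ~(GIB - 1)


def is_canonical(vaddr):
    """True if bits 63..47 are all equal (48-bit x86-64 canonical form)."""
    top = vaddr >> 47
    return top == 0 or top == 0x1FFFF


def entry_page(entry):
    """Physical base of the 1 GB page that a huge PUD entry maps."""
    return hugepage_base(entry & PFN_MASK)


def out_of_sequence(entries):
    """Find entries that break a run of pages descending in 1 GB steps.

    Returns (index, actual_page, expected_page) for every entry whose two
    neighbours are exactly 2 GB apart while the entry is not their midpoint.
    """
    pages = [entry_page(e) for e in entries]
    odd = []
    for i in range(1, len(pages) - 1):
        expected = pages[i - 1] - GIB
        if pages[i + 1] == expected - GIB and pages[i] != expected:
            odd.append((i, pages[i], expected))
    return odd
```

With the values above, `hugepage_base(0x10af886040)` gives 0x1080000000, matching step 3. `is_canonical(0x000f98136ffff230)` is false for the RBX value in the register dump; a memory access through a non-canonical address raises a general protection fault on x86-64, which is consistent with the reported fault. Feeding the entries around 2d49ef2228 to `out_of_sequence()` flags the entry mapping f80000000 and reports 1080000000 as the page the sequence would have held there.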

