Kernel panic with "BUG: unable to handle kernel NULL pointer dereference at 00000000000002d4"

Solution Verified

Environment

  • Red Hat Enterprise Linux 7
  • 3rd party module vxdmp loaded

Issue

  • The kernel panics with the panic string "BUG: unable to handle kernel NULL pointer dereference at 00000000000002d4" at RIP bdevname+0x1a.

Resolution

  • Red Hat neither ships nor supports this module. Engage the vendor of the vxdmp module for further investigation.
  • As a workaround, if the system does not depend on storage devices provided through vxdmp, you may temporarily disable the module. For information on how to prevent a module from loading at boot, refer to the following knowledge base article:
    How do I prevent a kernel module from loading automatically?

Root Cause

The system panicked because the third-party vxdmp module passed invalid data to bdevname() from its get_dip_from_device() function.

Diagnostic Steps

Pre-requisites

  1. Deploy kdump in Order to Collect a vmcore:

  2. Prepare crash Environment for vmcore Analysis:

Vmcore Analysis

Example: 1

  1. System Information:

    crash> sys |grep -eREL -ePAN -eLOAD
    LOAD AVERAGE: 1.65, 1.27, 1.16
    RELEASE: 3.10.0-1160.88.1.el7.x86_64
    PANIC: "BUG: unable to handle kernel NULL pointer dereference at 00000000000002d4"
    
    crash> sys -i | head -5
    DMI_BIOS_VENDOR: HP
    DMI_BIOS_VERSION: P70
    DMI_BIOS_DATE: 05/24/2019
    DMI_SYS_VENDOR: HP
    DMI_PRODUCT_NAME: ProLiant DL380p Gen8
    
  2. The backtrace of the panicking task shows the function where the panic occurred, as indicated by the RIP. Here the panic occurred in bdevname(), which was called by the vxdmp module's get_dip_from_device() function with invalid data:

    crash> bt
    PID: 6610     TASK: ffff935d306d2100  CPU: 3    COMMAND: "dmpdaemon"
    #0 [ffff935d3091f9a0] machine_kexec at ffffffffac869514
    #1 [ffff935d3091fa00] __crash_kexec at ffffffffac929e82
    #2 [ffff935d3091fad0] crash_kexec at ffffffffac929f78
    #3 [ffff935d3091fae8] oops_end at ffffffffacfbc818
    #4 [ffff935d3091fb10] no_context at ffffffffac87974c
    #5 [ffff935d3091fb60] __bad_area_nosemaphore at ffffffffac879a2a
    #6 [ffff935d3091fbb0] bad_area_nosemaphore at ffffffffac879b54
    #7 [ffff935d3091fbc0] __do_page_fault at ffffffffacfbf8d0
    #8 [ffff935d3091fc30] do_page_fault at ffffffffacfbfb05
    #9 [ffff935d3091fc60] page_fault at ffffffffacfbb7b8
    [exception RIP: bdevname+0x1a]
    RIP: ffffffffacb8216a  RSP: ffff935d3091fd18  RFLAGS: 00010286
    RAX: 0000000000000000  RBX: ffff93534d83e700  RCX: 0000000000000003
    RDX: ffff93534d83e700  RSI: ffff93534d83e700  RDI: ffff935533028400
    RBP: ffff935d3091fd18   R8: fdb5d31615bcf001   R9: ffffffffc09f58e7
    R10: ffff934ebfc03b00  R11: ffffcf0258360f80  R12: 0000000000800070
    R13: 0000000000000000  R14: ffffffffc0715700  R15: ffff935d3076fc00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    #10 [ffff935d3091fd20] get_dip_from_device at ffffffffc09f67bc [vxdmp]
    #11 [ffff935d3091fd50] dmp_node_to_dip at ffffffffc09f6820 [vxdmp]
    #12 [ffff935d3091fd60] dmp_check_nonscsi at ffffffffc0a30459 [vxdmp]
    #13 [ffff935d3091fd88] dmp_check_path_alive at ffffffffc0a30fcb [vxdmp]
    #14 [ffff935d3091fdc8] dmp_check_disabled_policy at ffffffffc0a3164a [vxdmp]
    #15 [ffff935d3091fe60] dmp_initiate_restore at ffffffffc0a31b13 [vxdmp]
    #16 [ffff935d3091fe90] dmp_daemons_loop at ffffffffc0a3fc4c [vxdmp]
    #17 [ffff935d3091fec8] kthread at ffffffffac8cb621
    
    crash> dis -r bdevname+0x1a
    0xffffffffacb82150 <bdevname>:  data16 data16 data16 xchg %ax,%ax
    0xffffffffacb82155 <bdevname+0x5>:      push   %rbp
    0xffffffffacb82156 <bdevname+0x6>:      mov    0x88(%rdi),%rax
    0xffffffffacb8215d <bdevname+0xd>:      mov    %rsi,%rdx
    0xffffffffacb82160 <bdevname+0x10>:     mov    0x98(%rdi),%rdi
    0xffffffffacb82167 <bdevname+0x17>:     mov    %rsp,%rbp
    0xffffffffacb8216a <bdevname+0x1a>:     mov    0x2d4(%rax),%esi
    
  3. Third party module vxdmp loaded on the server:

    crash> mod -t | grep vxdmp
    vxdmp                    POE
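The letters printed by mod -t are kernel taint flags attached to the module. The following is a minimal Python sketch for decoding them, offered only as an illustration. The helper name decode_mod_taint and the flag table are assumptions introduced here; the table covers only the common module taint letters (P, F, O, E) as documented for the upstream kernel and is not exhaustive.

```python
# Hypothetical helper: decode the taint letters shown by `crash> mod -t`.
# The table lists only common module-related taint letters; it is not exhaustive.
TAINT_FLAGS = {
    "P": "proprietary (non-GPL) module",
    "F": "module was force-loaded",
    "O": "out-of-tree (externally built) module",
    "E": "unsigned module",
}

def decode_mod_taint(line):
    """Split one `mod -t` output line into (module name, list of flag meanings)."""
    name, flags = line.split()
    meanings = [TAINT_FLAGS.get(ch, f"unknown flag {ch!r}") for ch in flags]
    return name, meanings

# Line copied from the vmcore output above.
name, meanings = decode_mod_taint("vxdmp                    POE")
print(name)
for meaning in meanings:
    print("  -", meaning)
```

For the output above, the flags indicate that vxdmp is a proprietary, out-of-tree, unsigned module, which is consistent with it being a third-party module that Red Hat does not ship.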
    

Example: 2

This example traces the argument passed in %rdi to bdevname(). get_dip_from_device() called bdevname() with a value in %rdi that points to invalid data.

  1. Backtrace of the panicking task:

    crash> bt
    PID: 4639     TASK: ffff99985bc11080  CPU: 2    COMMAND: "dmpdaemon"
    #0 [ffff999022aef9d0] machine_kexec at ffffffffbb0663d4
    #1 [ffff999022aefa30] __crash_kexec at ffffffffbb122ae2
    #2 [ffff999022aefb00] crash_kexec at ffffffffbb122bd0
    #3 [ffff999022aefb18] oops_end at ffffffffbb791798
    #4 [ffff999022aefb40] no_context at ffffffffbb075d14
    #5 [ffff999022aefb90] __bad_area_nosemaphore at ffffffffbb075fe2
    #6 [ffff999022aefbe0] bad_area_nosemaphore at ffffffffbb076104
    #7 [ffff999022aefbf0] __do_page_fault at ffffffffbb794750
    #8 [ffff999022aefc60] do_page_fault at ffffffffbb794975
    #9 [ffff999022aefc90] page_fault at ffffffffbb790778
    [exception RIP: bdevname+26]
    RIP: ffffffffbb36cdba  RSP: ffff999022aefd48  RFLAGS: 00010286
    RAX: 0000000000000000  RBX: ffff998ff4ed4680  RCX: ffff999022aeffd8
    RDX: ffff998ff4ed4680  RSI: ffff998ff4ed4680  RDI: ffff9998595d5c00
    RBP: ffff999022aefd48   R8: fdf48b7e745a5004   R9: ffffffffc1051897
    R10: ffff9989bfc03b00  R11: ffffe516ded3b500  R12: 00000000041000e0
    R13: ffff99905c9f1380  R14: ffff998ff4ed48c0  R15: ffff999054a50400
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    #10 [ffff999022aefd50] get_dip_from_device at ffffffffc1052752 [vxdmp]
    #11 [ffff999022aefd78] dmp_node_to_dip at ffffffffc10527b0 [vxdmp]
    #12 [ffff999022aefd88] dmp_check_nonscsi at ffffffffc108c389 [vxdmp]
    #13 [ffff999022aefdb0] dmp_probe_required at ffffffffc108c401 [vxdmp]
    #14 [ffff999022aefdc8] dmp_check_disabled_policy at ffffffffc108d390 [vxdmp]
    #15 [ffff999022aefe60] dmp_initiate_restore at ffffffffc108da43 [vxdmp]
    #16 [ffff999022aefe90] dmp_daemons_loop at ffffffffc109bb5c [vxdmp]
    #17 [ffff999022aefec8] kthread at ffffffffbb0c5f91
    
  2. The disassembly of the function shows that the kernel panicked while executing the marked instruction:

    crash> dis -rl ffffffffbb36cdba
    /usr/src/debug/kernel-3.10.0-1160.76.1.el7/linux-3.10.0-1160.76.1.el7.x86_64/block/partition-generic.c: 47
    0xffffffffbb36cda0 <bdevname>:  data16 data16 data16 xchg %ax,%ax
    0xffffffffbb36cda5 <bdevname+5>:        push   %rbp
    /usr/src/debug/kernel-3.10.0-1160.76.1.el7/linux-3.10.0-1160.76.1.el7.x86_64/block/partition-generic.c: 48
    0xffffffffbb36cda6 <bdevname+6>:        mov    0x88(%rdi),%rax
    /usr/src/debug/kernel-3.10.0-1160.76.1.el7/linux-3.10.0-1160.76.1.el7.x86_64/block/partition-generic.c: 47
    0xffffffffbb36cdad <bdevname+13>:       mov    %rsi,%rdx
    /usr/src/debug/kernel-3.10.0-1160.76.1.el7/linux-3.10.0-1160.76.1.el7.x86_64/block/partition-generic.c: 48
    0xffffffffbb36cdb0 <bdevname+16>:       mov    0x98(%rdi),%rdi
    /usr/src/debug/kernel-3.10.0-1160.76.1.el7/linux-3.10.0-1160.76.1.el7.x86_64/block/partition-generic.c: 47
    0xffffffffbb36cdb7 <bdevname+23>:       mov    %rsp,%rbp
    /usr/src/debug/kernel-3.10.0-1160.76.1.el7/linux-3.10.0-1160.76.1.el7.x86_64/block/partition-generic.c: 48
    0xffffffffbb36cdba <bdevname+26>:       mov    0x2d4(%rax),%esi          <<<----
    

    The value of the %rax register is NULL inside bdevname(), so the instruction tries to read address 0x2d4:

    crash> bt | grep RAX
    RAX: 0000000000000000  RBX: ffff998ff4ed4680  RCX: ffff999022aeffd8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    
    crash> px (0x2d4+0x0000000000000000)
    $1 = 0x2d4
    

    Inside bdevname(), %rax is loaded from offset 0x88 of the address held in %rdi:

    crash> dis -r ffffffffbb36cdba
    0xffffffffbb36cda0 <bdevname>:  data16 data16 data16 xchg %ax,%ax
    0xffffffffbb36cda5 <bdevname+5>:        push   %rbp
    0xffffffffbb36cda6 <bdevname+6>:        mov    0x88(%rdi),%rax    <<<----
    0xffffffffbb36cdad <bdevname+13>:       mov    %rsi,%rdx
    0xffffffffbb36cdb0 <bdevname+16>:       mov    0x98(%rdi),%rdi
    0xffffffffbb36cdb7 <bdevname+23>:       mov    %rsp,%rbp
    0xffffffffbb36cdba <bdevname+26>:       mov    0x2d4(%rax),%esi
    

    In get_dip_from_device(), the return value of bdget() in %rax is copied to %r13 and, just before the call to bdevname(), to %rdi. Neither %rax nor %r13 is modified in between, so %rdi = %rax = %r13 at the time of the call:

    crash> dis -r ffffffffc1052752 | tail
    0xffffffffc1052730 <get_dip_from_device+48>:    call   0xffffffffbb28ec10 <bdget>
    0xffffffffc1052735 <get_dip_from_device+53>:    test   %rax,%rax
    0xffffffffc1052738 <get_dip_from_device+56>:    mov    %rax,%r13     <<<---- rax = r13
    0xffffffffc105273b <get_dip_from_device+59>:    je     0xffffffffc1052780 <get_dip_from_device+128>
    0xffffffffc105273d <get_dip_from_device+61>:    cmpq   $0x0,0x98(%rax)
    0xffffffffc1052745 <get_dip_from_device+69>:    je     0xffffffffc1052780 <get_dip_from_device+128>
    0xffffffffc1052747 <get_dip_from_device+71>:    mov    %rbx,%rsi
    0xffffffffc105274a <get_dip_from_device+74>:    mov    %rax,%rdi     <<<---- rdi = rax = r13, set just before calling bdevname()
    0xffffffffc105274d <get_dip_from_device+77>:    call   0xffffffffbb36cda0 <bdevname>
    0xffffffffc1052752 <get_dip_from_device+82>:    mov    %r13,%rdi
    

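    RDI in the exception frame no longer holds the argument because bdevname+16 had already overwritten it. The bdevname() instructions executed before the panic do not modify %r13, so R13 = ffff99905c9f1380 is the value that was passed in %rdi: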

    crash> bt | grep R13
    R13: ffff99905c9f1380  R14: ffff998ff4ed48c0  R15: ffff999054a50400
    

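    At the call to bdevname(), %rax = %rdi = 0xffff99905c9f1380. The address is readable, but the 8 bytes at offset 0x88 from it are zero, so bdevname() loads NULL into %rax and then faults at address 0x2d4. Note that get_dip_from_device() tests the field at offset 0x98 for NULL (get_dip_from_device+61) before the call, but not the field at offset 0x88. The data passed to bdevname() therefore appears invalid, which the module vendor needs to verify: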

    crash> dis -r ffffffffbb36cdba
    0xffffffffbb36cda0 <bdevname>:  data16 data16 data16 xchg %ax,%ax
    0xffffffffbb36cda5 <bdevname+5>:        push   %rbp
    0xffffffffbb36cda6 <bdevname+6>:        mov    0x88(%rdi),%rax    /* 0x0000000000000000  */
    0xffffffffbb36cdad <bdevname+13>:       mov    %rsi,%rdx
    0xffffffffbb36cdb0 <bdevname+16>:       mov    0x98(%rdi),%rdi
    0xffffffffbb36cdb7 <bdevname+23>:       mov    %rsp,%rbp
    0xffffffffbb36cdba <bdevname+26>:       mov    0x2d4(%rax),%esi    /* 0x00000000000002d4  */
    
    crash> px (0x88+0xffff99905c9f1380)
    $3 = 0xffff99905c9f1408
    
    crash> rd 0xffff99905c9f1408
    ffff99905c9f1408:  0000000000000000                    
    
    crash> px (0x2d4+0x0000000000000000)
    $4 = 0x2d4
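The register data flow traced in Example 2 can be replayed as a small Python sketch. This is an illustration under stated assumptions, not a real emulator: memory is modelled as a plain dictionary, the helper names load64 and replay are hypothetical, and the value stored at offset 0x98 is inferred from RDI in the exception frame rather than read with rd. All other values are copied from the vmcore output above.

```python
# Illustrative replay of the instructions leading to the panic in Example 2.
MASK64 = (1 << 64) - 1

BDEV = 0xFFFF99905C9F1380  # value of R13 in the exception frame

# Known memory contents (address -> 8-byte value).
memory = {
    BDEV + 0x88: 0x0,                 # from `rd 0xffff99905c9f1408`
    BDEV + 0x98: 0xFFFF9998595D5C00,  # assumption: inferred from RDI in `bt`
}

def load64(addr):
    """Return the 8-byte value at addr; raise if this sketch has no data for it."""
    if addr not in memory:
        raise MemoryError(f"no data for address {addr:#018x}")
    return memory[addr]

def replay():
    regs = {}
    # get_dip_from_device+56: mov %rax,%r13  (bdget() result saved in %r13)
    regs["rax"] = BDEV
    regs["r13"] = regs["rax"]
    # get_dip_from_device+74: mov %rax,%rdi  (first argument to bdevname)
    regs["rdi"] = regs["rax"]
    arg_rdi = regs["rdi"]
    # bdevname+6:  mov 0x88(%rdi),%rax
    regs["rax"] = load64((regs["rdi"] + 0x88) & MASK64)
    # bdevname+16: mov 0x98(%rdi),%rdi  (overwrites the original argument)
    regs["rdi"] = load64((regs["rdi"] + 0x98) & MASK64)
    # bdevname+26: mov 0x2d4(%rax),%esi  (this read faults)
    fault_addr = (regs["rax"] + 0x2D4) & MASK64
    return regs, arg_rdi, fault_addr

regs, arg_rdi, fault_addr = replay()
print(f"argument in rdi  : {arg_rdi:#018x}")
print(f"rax at bdevname+26: {regs['rax']:#018x}")
print(f"faulting address : {fault_addr:#018x}")
```

The computed faulting address matches the address in the panic string. In the upstream 3.10 source, bdevname() dereferences bdev->bd_part and bdev->bd_disk, so offsets 0x88 and 0x98 likely correspond to those fields; this mapping is an assumption and can be checked in crash with struct block_device -o.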
    

