System crashes with RIP : remove_vma+0x66/0x90


Environment

  • Red Hat Enterprise Linux 6

Issue

  • System crashes with the following messages:
BUG: unable to handle kernel paging request at 000000000249935b
IP: [<ffffffff8115d5a6>] remove_vma+0x66/0x90

Resolution

  • Contact the hardware vendor for further troubleshooting. The crash was identified as a CPU malfunction, and replacing the faulty CPU(s) stopped further system panics and crashes.

Root Cause

  • The system crashed because of a faulty CPU. The access at %rip (the instruction pointer) returned a bad value.

Diagnostic Steps

  • Kernel Ring Buffer:
crash> log

BUG: unable to handle kernel paging request at 000000000249935b
IP: [<ffffffff8115d5a6>] remove_vma+0x66/0x90
PGD 1024385067 PUD 10257a6067 PMD 101f050067 PTE 80000020121f6065
Oops: 0003 [#1] SMP 
last sysfs file: /sys/devices/pci0000:40/0000:40:03.2/0000:44:00.0/host9/rport-9:0-1/fc_remote_ports/rport-9:0-1/node_name
CPU 2 
Modules linked in: mpt3sas mpt2sas scsi_transport_sas raid_class mptctl mptbase dell_rbu autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc bonding ipv6 dm_multipath ipmi_devintf microcode power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support dcdbas cdc_ether usbnet mii joydev sg shpchp lpfc scsi_dh_emc scsi_transport_fc scsi_tgt igb dca i2c_algo_bit i2c_core ptp pps_core sb_edac edac_core lpc_ich mfd_core ext4 jbd2 mbcache sd_mod crc_t10dif be2net ahci megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]

Pid: 13899, comm: python Not tainted 2.6.32-642.6.1.el6.x86_64 #1 Dell Inc. PowerEdge R720/0C4Y3R
RIP: 0010:[<ffffffff8115d5a6>]  [<ffffffff8115d5a6>] remove_vma+0x66/0x90
RSP: 0018:ffff88101bb0fed8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 0000000000100073 RDI: ffff88202fcf0300
RBP: ffff88101bb0fee8 R08: 00007f6bc7822000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000206 R12: ffff88101b683570
R13: ffff88101ba88318 R14: ffff88101b683570 R15: 00007f6bc7822000
FS:  00007f6bb6bfd700(0000) GS:ffff880061c20000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000249935b CR3: 000000101b8c1000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process python (pid: 13899, threadinfo ffff88101bb0c000, task ffff88101b680ab0)
Stack:
 0000000000000000 ffff881025ef36c0 ffff88101bb0ff48 ffffffff8115fb57
<d> 00007f6ba80bfe60 ffff88101b683570 ffff88101ba88330 ffff881025ef36c8
<d> 0000000000001000 ffff881025ef3728 ffff881025ef36c0 00007f6bc7821000
Call Trace:
 [<ffffffff8115fb57>] do_munmap+0x317/0x3b0
 [<ffffffff8115fc46>] sys_munmap+0x56/0x80
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Code: 48 85 ff 74 0d e8 4b e1 03 00 41 f6 44 24 31 10 75 33 49 8b bc 24 b0 00 00 00 48 85 ff 74 05 e8 11 78 01 00 48 8b 3d 4a 21 cf 00 <4c> 89 a6 e8 92 39 02 00 48 89 d8 5b 41 5c c9 c3 66 2e 0f 1f 84 
RIP  [<ffffffff8115d5a6>] remove_vma+0x66/0x90
 RSP <ffff88101bb0fed8>
CR2: 000000000249935b
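
The `Oops: 0003` value and the PTE printed in the log above can be decoded by hand. The following is a minimal sketch, assuming the standard x86 page-fault error-code bits and x86-64 PTE flag bits; the helper names are illustrative and are not part of the crash utility.

```python
# x86 page-fault error code bits: (bit, meaning when set, meaning when clear).
PF_BITS = [
    (0, "protection violation (page present)", "page not present"),
    (1, "write access", "read access"),
    (2, "user mode", "kernel mode"),
    (3, "reserved bit set in a paging entry", None),
    (4, "instruction fetch", None),
]

# x86-64 page-table entry flag bits.
PTE_FLAGS = {0: "PRESENT", 1: "RW", 2: "USER", 3: "PWT", 4: "PCD",
             5: "ACCESSED", 6: "DIRTY", 7: "PSE/PAT", 8: "GLOBAL", 63: "NX"}


def decode_pf_error(code):
    """Return the human-readable meaning of each error-code bit."""
    meanings = []
    for bit, when_set, when_clear in PF_BITS:
        if (code >> bit) & 1:
            meanings.append(when_set)
        elif when_clear is not None:
            meanings.append(when_clear)
    return meanings


def decode_pte(pte):
    """Return the names of the flag bits set in a page-table entry."""
    return [name for bit, name in sorted(PTE_FLAGS.items()) if (pte >> bit) & 1]


oops_error = decode_pf_error(0x0003)         # from "Oops: 0003"
pte_flags = decode_pte(0x80000020121F6065)   # from "PTE 80000020121f6065"
print(oops_error)
print(pte_flags)
```

Under these assumptions the error code describes a kernel-mode write to a page that is present, and the PTE for the CR2 address does not have the RW flag set, which is consistent with a write-protection fault rather than a missing mapping.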
  • Backtraces:
crash> bt
PID: 13899  TASK: ffff88101b680ab0  CPU: 2   COMMAND: "python"
 #0 [ffff88101bb0fac0] machine_kexec at ffffffff8103fdcb
 #1 [ffff88101bb0fb20] crash_kexec at ffffffff810d1dc2
 #2 [ffff88101bb0fbf0] oops_end at ffffffff8154d0d0
 #3 [ffff88101bb0fc20] no_context at ffffffff810518cb
 #4 [ffff88101bb0fc70] __bad_area_nosemaphore at ffffffff81051b55
 #5 [ffff88101bb0fcc0] bad_area_nosemaphore at ffffffff81051c23
 #6 [ffff88101bb0fcd0] __do_page_fault at ffffffff8105231c
 #7 [ffff88101bb0fdf0] do_page_fault at ffffffff8154f05e
 #8 [ffff88101bb0fe20] page_fault at ffffffff8154c365
    [exception RIP: remove_vma+102]
    RIP: ffffffff8115d5a6  RSP: ffff88101bb0fed8  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: ffffffffffffffff
    RDX: 0000000000000000  RSI: 0000000000100073  RDI: ffff88202fcf0300
    RBP: ffff88101bb0fee8   R8: 00007f6bc7822000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000206  R12: ffff88101b683570
    R13: ffff88101ba88318  R14: ffff88101b683570  R15: 00007f6bc7822000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff88101bb0fef0] do_munmap at ffffffff8115fb57
#10 [ffff88101bb0ff50] sys_munmap at ffffffff8115fc46
#11 [ffff88101bb0ff80] system_call_fastpath at ffffffff8100b0d2
    RIP: 0000003c4bae54b7  RSP: 00007f6bb6bfb670  RFLAGS: 00010202
    RAX: 000000000000000b  RBX: ffffffff8100b0d2  RCX: 0000000000000002
    RDX: 0000000000000000  RSI: 0000000000001000  RDI: 00007f6bc7821000
    RBP: 0000000000000000   R8: 00007f6bb6bfd700   R9: 0000000002453f30
    R10: 0000000002969780  R11: 0000000000000206  R12: 00007f6ba80bfe60
    R13: 0000000002453f30  R14: 0000000000000000  R15: 00007f6ba80bfe60
    ORIG_RAX: 000000000000000b  CS: 0033  SS: 002b
  • Disassembly up to the exception RIP:
crash> dis -lr remove_vma+102|tail
/usr/src/debug/kernel-2.6.32-642.6.1.el6/linux-2.6.32-642.6.1.el6.x86_64/mm/mmap.c: 268
0xffffffff8115d58d <remove_vma+77>: mov    0xb0(%r12),%rdi
/usr/src/debug/kernel-2.6.32-642.6.1.el6/linux-2.6.32-642.6.1.el6.x86_64/include/linux/mempolicy.h: 121
0xffffffff8115d595 <remove_vma+85>: test   %rdi,%rdi
0xffffffff8115d598 <remove_vma+88>: je     0xffffffff8115d59f <remove_vma+95>
/usr/src/debug/kernel-2.6.32-642.6.1.el6/linux-2.6.32-642.6.1.el6.x86_64/include/linux/mempolicy.h: 122
0xffffffff8115d59a <remove_vma+90>: callq  0xffffffff81174db0 <__mpol_put>
/usr/src/debug/kernel-2.6.32-642.6.1.el6/linux-2.6.32-642.6.1.el6.x86_64/mm/mmap.c: 269
0xffffffff8115d59f <remove_vma+95>: mov    0xcf214a(%rip),%rdi        # 0xffffffff81e4f6f0 <vm_area_cachep>
0xffffffff8115d5a6 <remove_vma+102>:    mov    %r12,%rsi
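
One way to cross-check the disassembly is to compare it with the bytes on the `Code:` line of the oops, where `<4c>` marks the byte at the faulting RIP. The sketch below is an inference from the dump, not part of the original analysis: it assumes standard x86-64 instruction encoding, takes `4c 89 e6` as the encoding of `mov %r12,%rsi`, and treats the bytes the oops printed (`4c 89 a6 e8 92 39 02`) as a `mov` to memory at a 32-bit displacement from RSI.

```python
# Bytes from the "Code:" line of the oops, starting at the <4c> marker.
oops_bytes = bytes.fromhex("4c89a6e8923902")
# Encoding of `mov %r12,%rsi`, the instruction crash shows at remove_vma+102.
expected_mov = bytes.fromhex("4c89e6")

rsi = 0x0000000000100073  # RSI from the oops register dump
cr2 = 0x000000000249935B  # faulting linear address from the oops


def decode_modrm(byte):
    """Split a ModRM byte into its (mod, reg, rm) fields."""
    return byte >> 6, (byte >> 3) & 7, byte & 7


mod, reg, rm = decode_modrm(oops_bytes[2])
# With mod == 0b10 the four bytes after ModRM are a little-endian disp32.
disp32 = int.from_bytes(oops_bytes[3:7], "little", signed=True)
effective = (rsi + disp32) & 0xFFFFFFFFFFFFFFFF
# Bits that differ between the printed ModRM byte and the expected one.
flipped = oops_bytes[2] ^ expected_mov[2]

print(f"mod={mod:02b} reg={reg:03b} rm={rm:03b}")
print(f"disp32={disp32:#x} effective={effective:#x} cr2={cr2:#x}")
print(f"differing bits in ModRM: {flipped:#04x}")
```

Read this way, the ModRM byte printed by the oops differs from the expected one in a single bit, and RSI plus the resulting displacement equals the CR2 value. That would fit a corrupted instruction byte as seen by the CPU, in line with the hardware fault described in the Root Cause section.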
  • CR2 holds the linear address that caused the page fault (it is printed at the end of the oops in the kernel ring buffer):
 CR2: 000000000249935b
  • Checking the CR2 address: it does not match any known object, and the corresponding page was excluded from the vmcore, so its contents cannot be read:
crash> ptov 000000000249935b
VIRTUAL           PHYSICAL        
ffff88000249935b  249935b         
crash> rd ffff88000249935b       <<--- check the virtual address
rd: page excluded: kernel virtual address: ffff88000249935b  type: "64-bit KVADDR"
crash> rd -p 249935b             <<--- check the physical address
rd: page excluded: physical address: 249935b  type: "64-bit PHYSADDR"
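
The `ptov` step above treats its argument as a physical address and returns the matching address in the kernel direct map. A minimal sketch of that arithmetic, assuming a direct-map base (`PAGE_OFFSET`) of `0xffff880000000000` for this kernel; the constant and function names are illustrative:

```python
PAGE_OFFSET = 0xFFFF880000000000  # assumption: direct-map base on this kernel
PAGE_MASK = ~0xFFF                # 4 KiB pages


def ptov(phys):
    """Direct-map virtual address for a physical address."""
    return PAGE_OFFSET + phys


cr2 = 0x249935B
pte = 0x80000020121F6065  # PTE for the CR2 address, from the oops header

direct_map_va = ptov(cr2)
# Physical frame that the faulting process's page table maps at CR2.
frame_from_pte = pte & 0x000FFFFFFFFFF000
# Physical frame examined when CR2 itself is interpreted as a physical address.
frame_if_physical = cr2 & PAGE_MASK

print(hex(direct_map_va))
print(hex(frame_from_pte), hex(frame_if_physical))
```

Because CR2 is a linear address in the context of the faulting process, the frame taken from the PTE and the frame reached through `ptov` are different pages, so the `rd` results above describe physical page `0x2499000` rather than the page mapped at CR2.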
  • CR2 does not match the RIP or any other register value. Checking the mapping of the RIP itself:
crash> vtop ffffffff8115d5a6
VIRTUAL           PHYSICAL        
ffffffff8115d5a6  115d5a6         

PML4 DIRECTORY: ffffffff81a8d000
PAGE DIRECTORY: 1a8f067
   PUD: 1a8fff0 => 1a93063
   PMD: 1a93040 => 10001e1
  PAGE: 1000000  (2MB)

  PTE    PHYSICAL  FLAGS
10001e1   1000000  (PRESENT|ACCESSED|DIRTY|PSE|GLOBAL)

      PAGE         PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffea000003cc58    115d000                0        0  1 20000000000400 reserved
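
The numbers in the `vtop` output above can be reproduced with simple arithmetic. This sketch assumes a kernel text mapping base (`__START_KERNEL_map`) of `0xffffffff80000000` with a physical base of zero, and a vmemmap base of `0xffffea0000000000`; these constants are assumptions about this kernel's layout, not values printed in the dump.

```python
START_KERNEL_MAP = 0xFFFFFFFF80000000  # assumption: kernel text mapping base
VMEMMAP_BASE = 0xFFFFEA0000000000      # assumption: start of the struct page array

rip = 0xFFFFFFFF8115D5A6
page_struct = 0xFFFFEA000003CC58       # PAGE column of the vtop output

phys = rip - START_KERNEL_MAP                  # physical address of the RIP
large_page_base = phys & ~((2 << 20) - 1)      # base of the 2 MB mapping
small_page_base = phys & ~0xFFF                # base of the 4 KiB frame
pfn = phys >> 12                               # page frame number
# Size of one struct page implied by the vmemmap offset of this frame.
struct_page_size, remainder = divmod(page_struct - VMEMMAP_BASE, pfn)

print(hex(phys), hex(large_page_base), hex(small_page_base))
print(pfn, struct_page_size, remainder)
```

The results agree with the PHYSICAL, PAGE (2MB) and per-page PHYSICAL columns, which suggests the RIP is an ordinary, correctly mapped kernel text address.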
  • Backtrace of CPU 1, which also took a page fault, in this case at an invalid RIP:
 crash> bt -c1
      PID: 0      TASK: ffff881029611520  CPU: 1   COMMAND: "swapper"
       #0 [ffff881078806e90] crash_nmi_callback at ffffffff810366e6
       - - - - - - - - - - - - 8< - - - - - - - - - - 
       #5 [ffff881078806f50] nmi at ffffffff8154c653
          [exception RIP: oops_begin+0x74]
          RIP: ffffffff8154d184  RSP: ffff881078803bd8  RFLAGS: 00000097
          RAX: 0000000000000001  RBX: 0000000000000046  RCX: 0000000000000000
          RDX: 0000000000000001  RSI: 0000000000000010  RDI: ffff881078803df8
          RBP: ffff881078803be8   R8: ffff880000000000   R9: 00003ffffffff000
          R10: ffffc00000000fff  R11: 0000000000000001  R12: ffff88102022cab0
          R13: 0000000000000011  R14: ffff881029611520  R15: 0000000000030001
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      --- <NMI exception stack> ---
       #6 [ffff881078803bd8] oops_begin at ffffffff8154d184
       #7 [ffff881078803bf0] no_context at ffffffff8105184c
       #8 [ffff881078803c40] __bad_area_nosemaphore at ffffffff81051b55
       #9 [ffff881078803c90] bad_area_nosemaphore at ffffffff81051c23
      #10 [ffff881078803ca0] __do_page_fault at ffffffff810523c0
      #11 [ffff881078803dc0] do_page_fault at ffffffff8154f05e
      #12 [ffff881078803df0] page_fault at ffffffff8154c365
          [exception RIP: unknown or invalid address]  <<<<<<<<<<<<<<<<<<<<<<<
          RIP: ffff88102022cab0  RSP: ffff881078803ea0  RFLAGS: 00010046
          RAX: 0000000000000000  RBX: ffff88101b0cbd08  RCX: 0000000000000000
          RDX: 0000000000000000  RSI: ffff88101b0cbd68  RDI: ffff88101b0cbd08
          RBP: ffff881078803ee8   R8: 7fffffffffffffff   R9: 0000000000000001
          R10: 00000518930f6948  R11: 0000000000000001  R12: ffff88101b0cbd68
          R13: ffff88101b0cbf38  R14: ffff881078803f28  R15: ffff88102022cab0
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      #13 [ffff881078803ea0] __run_hrtimer at ffffffff810ab12e
      #14 [ffff881078803ef0] hrtimer_interrupt at ffffffff810ab4ce
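
To see why crash flags the CPU 1 exception RIP as invalid, it helps to classify addresses by region. The table below is a sketch of the x86_64 virtual memory layout documented for kernels of this generation; the boundaries are assumptions and should be checked against the kernel's own memory-map documentation.

```python
# (first address, last address, description), assumed layout for this kernel.
REGIONS = [
    (0x0000000000000000, 0x00007FFFFFFFFFFF, "user space"),
    (0xFFFF880000000000, 0xFFFFC7FFFFFFFFFF, "direct map of physical memory"),
    (0xFFFFC90000000000, 0xFFFFE8FFFFFFFFFF, "vmalloc/ioremap"),
    (0xFFFFEA0000000000, 0xFFFFEAFFFFFFFFFF, "vmemmap"),
    (0xFFFFFFFF80000000, 0xFFFFFFFF9FFFFFFF, "kernel text"),
    (0xFFFFFFFFA0000000, 0xFFFFFFFFFFEFFFFF, "modules"),
]


def classify(addr):
    """Return the name of the region containing addr, or 'unmapped hole'."""
    for first, last, name in REGIONS:
        if first <= addr <= last:
            return name
    return "unmapped hole"


cpu1_rip = classify(0xFFFF88102022CAB0)  # exception RIP on CPU 1
cpu2_rip = classify(0xFFFFFFFF8115D5A6)  # remove_vma+0x66 on CPU 2
cpu2_cr2 = classify(0x000000000249935B)  # CR2 from the oops
print(cpu1_rip, cpu2_rip, cpu2_cr2, sep="\n")
```

With this layout the CPU 1 RIP falls in the direct map, which holds data rather than kernel code, and it carries the same value as R15 in that exception frame. The CPU 2 RIP is a kernel text address, and the CR2 value lies in the user-space range.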
  • Kernel Source code:

253 /*
254  * Close a vm structure and free it, returning the next.
255  */
256 static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
257 {
258         struct vm_area_struct *next = vma->vm_next;
259
260         might_sleep();
261         if (vma->vm_ops && vma->vm_ops->close)
262                 vma->vm_ops->close(vma);
263         if (vma->vm_file) {
264                 fput(vma->vm_file);
265                 if (vma->vm_flags & VM_EXECUTABLE)
266                         removed_exe_file_vma(vma->vm_mm);
267         }
268         mpol_put(vma_policy(vma));
269         kmem_cache_free(vm_area_cachep, vma);   <<--------- panicked here
270         return next;
271 }

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
