System crashes with RIP: remove_vma+0x66/0x90
Environment
- Red Hat Enterprise Linux 6
Issue
- System crashes with the following messages:
BUG: unable to handle kernel paging request at 000000000249935b
IP: [<ffffffff8115d5a6>] remove_vma+0x66/0x90
Resolution
- Contact the hardware vendor for further troubleshooting. The problem was identified as a CPU malfunction, and replacing the faulty CPU(s) stopped further panics and crashes.
Root Cause
- The system crashed because of a faulty CPU: the access at %rip (the instruction pointer) returned a bad value.
Diagnostic Steps
- Kernel Ring Buffer:
crash> log
BUG: unable to handle kernel paging request at 000000000249935b
IP: [<ffffffff8115d5a6>] remove_vma+0x66/0x90
PGD 1024385067 PUD 10257a6067 PMD 101f050067 PTE 80000020121f6065
Oops: 0003 [#1] SMP
last sysfs file: /sys/devices/pci0000:40/0000:40:03.2/0000:44:00.0/host9/rport-9:0-1/fc_remote_ports/rport-9:0-1/node_name
CPU 2
Modules linked in: mpt3sas mpt2sas scsi_transport_sas raid_class mptctl mptbase dell_rbu autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc bonding ipv6 dm_multipath ipmi_devintf microcode power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support dcdbas cdc_ether usbnet mii joydev sg shpchp lpfc scsi_dh_emc scsi_transport_fc scsi_tgt igb dca i2c_algo_bit i2c_core ptp pps_core sb_edac edac_core lpc_ich mfd_core ext4 jbd2 mbcache sd_mod crc_t10dif be2net ahci megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
Pid: 13899, comm: python Not tainted 2.6.32-642.6.1.el6.x86_64 #1 Dell Inc. PowerEdge R720/0C4Y3R
RIP: 0010:[<ffffffff8115d5a6>] [<ffffffff8115d5a6>] remove_vma+0x66/0x90
RSP: 0018:ffff88101bb0fed8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 0000000000100073 RDI: ffff88202fcf0300
RBP: ffff88101bb0fee8 R08: 00007f6bc7822000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000206 R12: ffff88101b683570
R13: ffff88101ba88318 R14: ffff88101b683570 R15: 00007f6bc7822000
FS: 00007f6bb6bfd700(0000) GS:ffff880061c20000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000249935b CR3: 000000101b8c1000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process python (pid: 13899, threadinfo ffff88101bb0c000, task ffff88101b680ab0)
Stack:
0000000000000000 ffff881025ef36c0 ffff88101bb0ff48 ffffffff8115fb57
<d> 00007f6ba80bfe60 ffff88101b683570 ffff88101ba88330 ffff881025ef36c8
<d> 0000000000001000 ffff881025ef3728 ffff881025ef36c0 00007f6bc7821000
Call Trace:
[<ffffffff8115fb57>] do_munmap+0x317/0x3b0
[<ffffffff8115fc46>] sys_munmap+0x56/0x80
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Code: 48 85 ff 74 0d e8 4b e1 03 00 41 f6 44 24 31 10 75 33 49 8b bc 24 b0 00 00 00 48 85 ff 74 05 e8 11 78 01 00 48 8b 3d 4a 21 cf 00 <4c> 89 a6 e8 92 39 02 00 48 89 d8 5b 41 5c c9 c3 66 2e 0f 1f 84
RIP [<ffffffff8115d5a6>] remove_vma+0x66/0x90
RSP <ffff88101bb0fed8>
CR2: 000000000249935b
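The `Oops: 0003` error code and the `PTE` value in the log can be decoded bit by bit. The following is a minimal sketch, assuming the standard x86-64 page-fault error-code and page-table-entry bit layout; the helper names are hypothetical and used only for illustration.

```python
# Sketch: decode the x86 page-fault error code and the PTE flag bits
# printed in the oops. Helper names are hypothetical.

ERROR_CODE_BITS = {
    0: ("protection violation (page was present)", "page not present"),
    1: ("write access", "read access"),
    2: ("user mode", "kernel mode"),
    3: ("reserved bit set in a paging entry", None),
    4: ("instruction fetch", None),
}

PTE_FLAGS = {
    0: "PRESENT", 1: "RW", 2: "USER", 3: "PWT", 4: "PCD",
    5: "ACCESSED", 6: "DIRTY", 7: "PAT", 8: "GLOBAL", 63: "NX",
}


def decode_error_code(code):
    """Return a description for each meaningful bit of the error code."""
    out = []
    for bit, (if_set, if_clear) in ERROR_CODE_BITS.items():
        text = if_set if code & (1 << bit) else if_clear
        if text:
            out.append(text)
    return out


def decode_pte_flags(pte):
    """Return the names of the flag bits set in a 4 KB page-table entry."""
    return [name for bit, name in PTE_FLAGS.items() if pte & (1 << bit)]


print(decode_error_code(0x0003))
print(decode_pte_flags(0x80000020121F6065))
```

Read this way, the oops describes a kernel-mode write to a page that is present but whose PTE does not have the RW bit set.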
- Backtraces:
crash> bt
PID: 13899 TASK: ffff88101b680ab0 CPU: 2 COMMAND: "python"
#0 [ffff88101bb0fac0] machine_kexec at ffffffff8103fdcb
#1 [ffff88101bb0fb20] crash_kexec at ffffffff810d1dc2
#2 [ffff88101bb0fbf0] oops_end at ffffffff8154d0d0
#3 [ffff88101bb0fc20] no_context at ffffffff810518cb
#4 [ffff88101bb0fc70] __bad_area_nosemaphore at ffffffff81051b55
#5 [ffff88101bb0fcc0] bad_area_nosemaphore at ffffffff81051c23
#6 [ffff88101bb0fcd0] __do_page_fault at ffffffff8105231c
#7 [ffff88101bb0fdf0] do_page_fault at ffffffff8154f05e
#8 [ffff88101bb0fe20] page_fault at ffffffff8154c365
[exception RIP: remove_vma+102]
RIP: ffffffff8115d5a6 RSP: ffff88101bb0fed8 RFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 0000000000100073 RDI: ffff88202fcf0300
RBP: ffff88101bb0fee8 R8: 00007f6bc7822000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000206 R12: ffff88101b683570
R13: ffff88101ba88318 R14: ffff88101b683570 R15: 00007f6bc7822000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff88101bb0fef0] do_munmap at ffffffff8115fb57
#10 [ffff88101bb0ff50] sys_munmap at ffffffff8115fc46
#11 [ffff88101bb0ff80] system_call_fastpath at ffffffff8100b0d2
RIP: 0000003c4bae54b7 RSP: 00007f6bb6bfb670 RFLAGS: 00010202
RAX: 000000000000000b RBX: ffffffff8100b0d2 RCX: 0000000000000002
RDX: 0000000000000000 RSI: 0000000000001000 RDI: 00007f6bc7821000
RBP: 0000000000000000 R8: 00007f6bb6bfd700 R9: 0000000002453f30
R10: 0000000002969780 R11: 0000000000000206 R12: 00007f6ba80bfe60
R13: 0000000002453f30 R14: 0000000000000000 R15: 00007f6ba80bfe60
ORIG_RAX: 000000000000000b CS: 0033 SS: 002b
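The user-mode register frame at the bottom of the backtrace identifies the system call and its arguments. The sketch below assumes the x86_64 system call convention (number in RAX, first two arguments in RDI and RSI); the syscall table is a small hand-written subset and `parse_symbol_offset` is a hypothetical helper.

```python
# Sketch: interpret the user-mode frame and the "symbol+offset/size" notation.

# Values copied from frame #11 of the backtrace.
regs = {"RAX": 0xB, "RDI": 0x7F6BC7821000, "RSI": 0x1000}

# Subset of the x86_64 syscall table, enough for this trace.
SYSCALL_NAMES = {9: "mmap", 10: "mprotect", 11: "munmap"}

name = SYSCALL_NAMES[regs["RAX"]]
addr, length = regs["RDI"], regs["RSI"]
end = addr + length
print(f"{name}(addr={addr:#x}, length={length:#x}), range ends at {end:#x}")


def parse_symbol_offset(text):
    """Split 'remove_vma+0x66/0x90' into (symbol, offset, function size)."""
    symbol, rest = text.split("+")
    offset, size = (int(part, 16) for part in rest.split("/"))
    return symbol, offset, size


symbol, offset, size = parse_symbol_offset("remove_vma+0x66/0x90")
rip = 0xFFFFFFFF8115D5A6
print(f"{symbol} starts at {rip - offset:#x}; offset {offset:#x} is {offset} decimal")
```

The computed end of the range, 0x7f6bc7822000, is the value held in R8 and R15 in the exception frame, and offset 0x66 is the `remove_vma+102` that crash prints.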
- Disassembly up to the exception RIP:
crash> dis -lr remove_vma+102|tail
/usr/src/debug/kernel-2.6.32-642.6.1.el6/linux-2.6.32-642.6.1.el6.x86_64/mm/mmap.c: 268
0xffffffff8115d58d <remove_vma+77>: mov 0xb0(%r12),%rdi
/usr/src/debug/kernel-2.6.32-642.6.1.el6/linux-2.6.32-642.6.1.el6.x86_64/include/linux/mempolicy.h: 121
0xffffffff8115d595 <remove_vma+85>: test %rdi,%rdi
0xffffffff8115d598 <remove_vma+88>: je 0xffffffff8115d59f <remove_vma+95>
/usr/src/debug/kernel-2.6.32-642.6.1.el6/linux-2.6.32-642.6.1.el6.x86_64/include/linux/mempolicy.h: 122
0xffffffff8115d59a <remove_vma+90>: callq 0xffffffff81174db0 <__mpol_put>
/usr/src/debug/kernel-2.6.32-642.6.1.el6/linux-2.6.32-642.6.1.el6.x86_64/mm/mmap.c: 269
0xffffffff8115d59f <remove_vma+95>: mov 0xcf214a(%rip),%rdi # 0xffffffff81e4f6f0 <vm_area_cachep>
0xffffffff8115d5a6 <remove_vma+102>: mov %r12,%rsi
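As a possible cross-check, the bytes marked at RIP in the `Code:` line of the oops can be compared with the encoding of the `mov %r12,%rsi` instruction that the disassembly shows at `remove_vma+102`. This is a sketch under the assumption that the `Code:` bytes are what the oops handler read back at RIP when the fault occurred; the variable names are illustrative.

```python
# Sketch: compare the oops "Code:" bytes at RIP with the expected encoding.

code_at_rip = bytes.fromhex("4c 89 a6 e8 92 39 02")  # from <4c> onward in Code:
expected = bytes.fromhex("4c 89 e6")                 # mov %r12,%rsi

# Which bits of the ModRM byte differ between the two encodings?
diff = code_at_rip[2] ^ expected[2]
print(f"ModRM differs by {diff:#04x} ({bin(diff).count('1')} bit)")

# Decode the ModRM byte that was reported in the oops.
modrm = code_at_rip[2]
mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
print(f"mod={mod} reg={reg} rm={rm}")  # mod=2: memory operand, 32-bit displacement

# With mod=2 and rm=6 the destination is disp32(%rsi), i.e. a memory store.
disp32 = int.from_bytes(code_at_rip[3:7], "little", signed=True)
rsi = 0x100073  # RSI from the register dump
effective_address = (rsi + disp32) & 0xFFFFFFFFFFFFFFFF
print(f"store target = {effective_address:#x}")
```

Under that assumption, the ModRM byte differs from the expected one in a single bit, which turns the register-to-register move into a store to `0x23992e8(%rsi)`. The resulting target address equals the CR2 value in the log, which fits a transient corruption of the instruction stream rather than a software bug.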
- CR2 contains the linear address that caused the page fault (CR2 can be found at the tail of the log/ring buffer):
CR2: 000000000249935b
- The CR2 address matches nothing else in the dump; based on the page tables captured in the core, it is simply the page-fault address. The page itself was excluded from the vmcore:
crash> ptov 000000000249935b
VIRTUAL PHYSICAL
ffff88000249935b 249935b
crash> rd ffff88000249935b      <<--------- check the virtual address
rd: page excluded: kernel virtual address: ffff88000249935b type: "64-bit KVADDR"
crash> rd -p 249935b            <<--------- check the physical address
rd: page excluded: physical address: 249935b type: "64-bit PHYSADDR"
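The `ptov` result above is plain arithmetic on the kernel's direct mapping of physical memory. A minimal sketch, assuming the direct map starts at 0xffff880000000000 (the value implied by the `ptov` output) and a 47-bit user address space; the function names mirror the crash command but are otherwise hypothetical.

```python
# Sketch: what "ptov" computes, plus a check of which half of the address
# space the CR2 value falls into.

PAGE_OFFSET = 0xFFFF880000000000      # assumed start of the direct mapping
USER_SPACE_END = 0x0000800000000000   # first non-user address on x86_64


def ptov(phys):
    """Physical address -> kernel virtual address in the direct mapping."""
    return PAGE_OFFSET + phys


def is_user_address(addr):
    """True if the linear address lies in the user half of the address space."""
    return addr < USER_SPACE_END


cr2 = 0x249935B
print(f"ptov({cr2:#x}) = {ptov(cr2):#x}")
print(f"CR2 in user-space range: {is_user_address(cr2)}")
```

Note that `ptov` interprets its argument as a physical address, whereas CR2 holds a linear (virtual) address; the value here lies in the user-space range, and both reads fail because the page is not present in the vmcore.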
- CR2 does not equal the RIP or any register value. Checking the RIP itself:
crash> vtop ffffffff8115d5a6
VIRTUAL PHYSICAL
ffffffff8115d5a6 115d5a6
PML4 DIRECTORY: ffffffff81a8d000
PAGE DIRECTORY: 1a8f067
PUD: 1a8fff0 => 1a93063
PMD: 1a93040 => 10001e1
PAGE: 1000000 (2MB)
PTE PHYSICAL FLAGS
10001e1 1000000 (PRESENT|ACCESSED|DIRTY|PSE|GLOBAL)
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffea000003cc58 115d000 0 0 1 20000000000400 reserved
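The `vtop` output can be reproduced by hand from the virtual address. The sketch below assumes standard x86-64 4-level paging with a 2 MB page at the PMD level and reuses the table entries printed by `vtop`; the helper name is hypothetical.

```python
# Sketch: reproduce the vtop walk for the RIP using the entries shown above.

ADDR_MASK = 0x000FFFFFFFFFF000  # physical-address bits of a paging entry


def table_indices(vaddr):
    """Split a virtual address into its 9-bit page-table indices."""
    return {
        "pml4": (vaddr >> 39) & 0x1FF,
        "pud": (vaddr >> 30) & 0x1FF,
        "pmd": (vaddr >> 21) & 0x1FF,
    }


vaddr = 0xFFFFFFFF8115D5A6
idx = table_indices(vaddr)

pml4_entry = 0x1A8F067  # "PAGE DIRECTORY" value from vtop
pud_entry = 0x1A93063   # value read at the PUD slot
pmd_entry = 0x10001E1   # value read at the PMD slot (PSE set: 2 MB page)

# Each entry is 8 bytes, so the slot address is table base + index * 8.
pud_slot = (pml4_entry & ADDR_MASK) + idx["pud"] * 8
pmd_slot = (pud_entry & ADDR_MASK) + idx["pmd"] * 8

# For a 2 MB page the low 21 bits of the virtual address are the offset.
page_base = pmd_entry & ADDR_MASK & ~0x1FFFFF
phys = page_base + (vaddr & 0x1FFFFF)

print(f"PUD slot {pud_slot:#x}, PMD slot {pmd_slot:#x}, physical {phys:#x}")
```

The slot addresses and the final physical address agree with the `PUD`, `PMD` and `PHYSICAL` columns printed by crash, so the kernel text mapping for the RIP is intact.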
- Stack of CPU #1, which also shows a page fault:
crash> bt -c1
PID: 0 TASK: ffff881029611520 CPU: 1 COMMAND: "swapper"
#0 [ffff881078806e90] crash_nmi_callback at ffffffff810366e6
- - - - - - - - - - - - 8< - - - - - - - - - -
#5 [ffff881078806f50] nmi at ffffffff8154c653
[exception RIP: oops_begin+0x74]
RIP: ffffffff8154d184 RSP: ffff881078803bd8 RFLAGS: 00000097
RAX: 0000000000000001 RBX: 0000000000000046 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000010 RDI: ffff881078803df8
RBP: ffff881078803be8 R8: ffff880000000000 R9: 00003ffffffff000
R10: ffffc00000000fff R11: 0000000000000001 R12: ffff88102022cab0
R13: 0000000000000011 R14: ffff881029611520 R15: 0000000000030001
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff881078803bd8] oops_begin at ffffffff8154d184
#7 [ffff881078803bf0] no_context at ffffffff8105184c
#8 [ffff881078803c40] __bad_area_nosemaphore at ffffffff81051b55
#9 [ffff881078803c90] bad_area_nosemaphore at ffffffff81051c23
#10 [ffff881078803ca0] __do_page_fault at ffffffff810523c0
#11 [ffff881078803dc0] do_page_fault at ffffffff8154f05e
#12 [ffff881078803df0] page_fault at ffffffff8154c365
[exception RIP: unknown or invalid address] <<<<<<<<<<<<<<<<<<<<<<<
RIP: ffff88102022cab0 RSP: ffff881078803ea0 RFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff88101b0cbd08 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88101b0cbd68 RDI: ffff88101b0cbd08
RBP: ffff881078803ee8 R8: 7fffffffffffffff R9: 0000000000000001
R10: 00000518930f6948 R11: 0000000000000001 R12: ffff88101b0cbd68
R13: ffff88101b0cbf38 R14: ffff881078803f28 R15: ffff88102022cab0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#13 [ffff881078803ea0] __run_hrtimer at ffffffff810ab12e
#14 [ffff881078803ef0] hrtimer_interrupt at ffffffff810ab4ce
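To see why crash reports CPU #1's exception RIP as "unknown or invalid address", it helps to check which region of the kernel address space the value falls into. This is a rough sketch; the region boundaries are assumptions taken from the x86_64 memory layout documented for kernels of this generation and should be verified against the running kernel.

```python
# Sketch: classify addresses from the dump by virtual address region.
# Region boundaries are assumptions for a 2.6.32-era x86_64 kernel.

REGIONS = [
    (0x0000000000000000, 0x00007FFFFFFFFFFF, "user space"),
    (0xFFFF880000000000, 0xFFFFC7FFFFFFFFFF, "direct mapping of physical memory"),
    (0xFFFFC90000000000, 0xFFFFE8FFFFFFFFFF, "vmalloc/ioremap space"),
    (0xFFFFEA0000000000, 0xFFFFEAFFFFFFFFFF, "virtual memory map (struct page)"),
    (0xFFFFFFFF80000000, 0xFFFFFFFF9FFFFFFF, "kernel text mapping"),
    (0xFFFFFFFFA0000000, 0xFFFFFFFFFFEFFFFF, "module mapping space"),
]


def classify(addr):
    """Return the name of the region containing addr, or 'unmapped hole'."""
    for start, end, name in REGIONS:
        if start <= addr <= end:
            return name
    return "unmapped hole"


for label, addr in [
    ("CPU 2 RIP", 0xFFFFFFFF8115D5A6),
    ("CPU 1 RIP", 0xFFFF88102022CAB0),
    ("CR2", 0x249935B),
]:
    print(f"{label:10s} {addr:#018x} -> {classify(addr)}")
```

With these boundaries, CPU #1's RIP lies in the direct mapping, that is, in data rather than in kernel or module text, and it is the same value held in R15 in that frame.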
- Kernel source code:
253 /*
254 * Close a vm structure and free it, returning the next.
255 */
256 static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
257 {
258 struct vm_area_struct *next = vma->vm_next;
259
260 might_sleep();
261 if (vma->vm_ops && vma->vm_ops->close)
262 vma->vm_ops->close(vma);
263 if (vma->vm_file) {
264 fput(vma->vm_file);
265 if (vma->vm_flags & VM_EXECUTABLE)
266 removed_exe_file_vma(vma->vm_mm);
267 }
268 mpol_put(vma_policy(vma));
269 kmem_cache_free(vm_area_cachep, vma); <<--------- panicked here
270 return next;
271 }
