Server panicked in enqueue_task+0x38 dereferencing a bad address
Environment
- Red Hat Enterprise Linux 5.8
Issue
- server getting often rebooted
- CI amrndhl1059 is ofen rebooted, since server is running with redhat 5.8 with oracle ASM and running on VMware with XTREMIO disk.
- System panicked with:
Unable to handle kernel paging request at ffff80f91c1234d0 RIP:
[<ffffffff8008e81a>] enqueue_task+0x38/0x56
PGD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:11.0/0000:02:01.0/irq
CPU 4
Modules linked in: oracleacfs(PU) oracleadvm(PU) oracleoks(PU) oracleasm(U) nfs nfs_acl autofs4 lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi vsock(U) vmmemctl(U) acpiphp dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac lp sg floppy i2c_piix4 parport_pc parport i2c_core shpchp vmci(U) e1000 pcspkr tpm_tis tpm serio_raw tpm_bios ide_cd cdrom vmxnet(U) pvscsi(U) vmxnet3(U) dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod cciss ata_piix libata mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Tainted: P ---- 2.6.18-308.16.1.el5 #1
RIP: 0010:[<ffffffff8008e81a>] [<ffffffff8008e81a>] enqueue_task+0x38/0x56
RSP: 0018:ffff81011f94fe20 EFLAGS: 00010087
RAX: 00000001b31a6895 RBX: ffff81091c122d78 RCX: 00000000000003e8
RDX: ffff80f91c1234c8 RSI: ffff81091c122d78 RDI: ffff81191975c078
RBP: ffff81011f94fe30 R08: 000000006624e484 R09: ffff8111000f7e48
R10: 0000000000000001 R11: 00000000ffffffff R12: ffff81191975c040
R13: ffff81191975c040 R14: ffff81091c122d78 R15: ffff81091c122420
FS: 0000000000000000(0000) GS:ffff81091c19d340(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffff80f91c1234d0 CR3: 000000023ba65000 CR4: 00000000000006a0
Process swapper (pid: 0, threadinfo ffff8111000f6000, task ffff81091c19e080)
Stack: 000aab4ec032fc12 ffff81091c122420 ffff81011f94fe60 ffffffff8008e88e
000aab4ec032fc12 0000000000000004 ffff81191975c040 ffff81191975c040
ffff81011f94ff00 ffffffff80046c8f 0000000000000080 000000000000000f
Call Trace:
<IRQ> [<ffffffff8008e88e>] __activate_task+0x56/0x6d
[<ffffffff80046c8f>] try_to_wake_up+0x401/0x484
[<ffffffff8008fa94>] rebalance_tick+0x183/0x3cd
[<ffffffff8009a411>] process_timeout+0x0/0x5
[<ffffffff8009a1f7>] run_timer_softirq+0x18d/0x23a
[<ffffffff80012592>] __do_softirq+0x89/0x133
[<ffffffff8005e2fc>] call_softirq+0x1c/0x28
[<ffffffff8006d646>] do_softirq+0x2c/0x7d
[<ffffffff8006be03>] default_idle+0x0/0x50
[<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
<EOI> [<ffffffff8006be2c>] default_idle+0x29/0x50
[<ffffffff800490ad>] cpu_idle+0x95/0xb8
[<ffffffff80078af0>] start_secondary+0x479/0x488
Code: 48 8b 72 08 e8 3e c5 0c 00 41 8b 44 24 28 0f ab 43 08 ff 03
RIP [<ffffffff8008e81a>] enqueue_task+0x38/0x56
RSP <ffff81011f94fe20>
Resolution
Unsure if this is a software bug or a hardware failure.
Root Cause
It looks like a register value has been corrupted, possibly due to a hardware error.
Diagnostic Steps
Looking at the vmcore:
KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.18-308.16.1.el5/vmlinux
DUMPFILE: /cores/retrace/tasks/564810281/crash/vmcore [PARTIAL DUMP]
CPUS: 16
DATE: Fri Jul 25 16:58:55 2014
UPTIME: 34 days, 18:45:52
LOAD AVERAGE: 2.61, 3.01, 3.56
TASKS: 674
NODENAME: <nodename>
RELEASE: 2.6.18-308.16.1.el5
VERSION: #1 SMP Tue Sep 18 07:21:07 EDT 2012
MACHINE: x86_64 (2699 Mhz)
MEMORY: 126.2 GB
PANIC: "Oops: 0000 [1] SMP " (check log for details)
Unable to handle kernel paging request at ffff80f91c1234d0 RIP:
[<ffffffff8008e81a>] enqueue_task+0x38/0x56
PGD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:11.0/0000:02:01.0/irq
CPU 4
Modules linked in: oracleacfs(PU) oracleadvm(PU) oracleoks(PU) oracleasm(U) nfs nfs_acl autofs4 lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi vsock(U) vmmemctl(U) acpiphp dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac lp sg floppy i2c_piix4 parport_pc parport i2c_core shpchp vmci(U) e1000 pcspkr tpm_tis tpm serio_raw tpm_bios ide_cd cdrom vmxnet(U) pvscsi(U) vmxnet3(U) dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod cciss ata_piix libata mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Tainted: P ---- 2.6.18-308.16.1.el5 #1
RIP: 0010:[<ffffffff8008e81a>] [<ffffffff8008e81a>] enqueue_task+0x38/0x56
RSP: 0018:ffff81011f94fe20 EFLAGS: 00010087
RAX: 00000001b31a6895 RBX: ffff81091c122d78 RCX: 00000000000003e8
RDX: ffff80f91c1234c8 RSI: ffff81091c122d78 RDI: ffff81191975c078
RBP: ffff81011f94fe30 R08: 000000006624e484 R09: ffff8111000f7e48
R10: 0000000000000001 R11: 00000000ffffffff R12: ffff81191975c040
R13: ffff81191975c040 R14: ffff81091c122d78 R15: ffff81091c122420
FS: 0000000000000000(0000) GS:ffff81091c19d340(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffff80f91c1234d0 CR3: 000000023ba65000 CR4: 00000000000006a0
Process swapper (pid: 0, threadinfo ffff8111000f6000, task ffff81091c19e080)
Stack: 000aab4ec032fc12 ffff81091c122420 ffff81011f94fe60 ffffffff8008e88e
000aab4ec032fc12 0000000000000004 ffff81191975c040 ffff81191975c040
ffff81011f94ff00 ffffffff80046c8f 0000000000000080 000000000000000f
Call Trace:
<IRQ> [<ffffffff8008e88e>] __activate_task+0x56/0x6d
[<ffffffff80046c8f>] try_to_wake_up+0x401/0x484
[<ffffffff8008fa94>] rebalance_tick+0x183/0x3cd
[<ffffffff8009a411>] process_timeout+0x0/0x5
[<ffffffff8009a1f7>] run_timer_softirq+0x18d/0x23a
[<ffffffff80012592>] __do_softirq+0x89/0x133
[<ffffffff8005e2fc>] call_softirq+0x1c/0x28
[<ffffffff8006d646>] do_softirq+0x2c/0x7d
[<ffffffff8006be03>] default_idle+0x0/0x50
[<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
<EOI> [<ffffffff8006be2c>] default_idle+0x29/0x50
[<ffffffff800490ad>] cpu_idle+0x95/0xb8
[<ffffffff80078af0>] start_secondary+0x479/0x488
Code: 48 8b 72 08 e8 3e c5 0c 00 41 8b 44 24 28 0f ab 43 08 ff 03
RIP [<ffffffff8008e81a>] enqueue_task+0x38/0x56
RSP <ffff81011f94fe20>
crash> bt
PID: 0 TASK: ffff81091c19e080 CPU: 4 COMMAND: "swapper"
#0 [ffff81011f94fb80] crash_kexec at ffffffff800b09a0
#1 [ffff81011f94fc40] __die at ffffffff80065137
#2 [ffff81011f94fc80] do_page_fault at ffffffff80067484
#3 [ffff81011f94fd70] error_exit at ffffffff8005dde9
[exception RIP: enqueue_task+56]
RIP: ffffffff8008e81a RSP: ffff81011f94fe20 RFLAGS: 00010087
RAX: 00000001b31a6895 RBX: ffff81091c122d78 RCX: 00000000000003e8
RDX: ffff80f91c1234c8 RSI: ffff81091c122d78 RDI: ffff81191975c078
RBP: ffff81011f94fe30 R8: 000000006624e484 R9: ffff8111000f7e48
R10: 0000000000000001 R11: 00000000ffffffff R12: ffff81191975c040
R13: ffff81191975c040 R14: ffff81091c122d78 R15: ffff81091c122420
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#4 [ffff81011f94fe38] __activate_task at ffffffff8008e88e
#5 [ffff81011f94fe68] try_to_wake_up at ffffffff80046c8f
#6 [ffff81011f94ff08] run_timer_softirq at ffffffff8009a1f7
#7 [ffff81011f94ff58] __do_softirq at ffffffff80012592
#8 [ffff81011f94ff88] call_softirq at ffffffff8005e2fc
#9 [ffff81011f94ffa0] do_softirq at ffffffff8006d646
#10 [ffff81011f94ffb0] apic_timer_interrupt at ffffffff8005dc8e
--- <IRQ stack> ---
#11 [ffff8111000f7e48] apic_timer_interrupt at ffffffff8005dc8e
[exception RIP: default_idle+41]
RIP: ffffffff8006be2c RSP: ffff8111000f7ef0 RFLAGS: 00000246
RAX: 0000000000000000 RBX: 00000000000000ff RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff8031b738
RBP: ffff81091c19e270 R8: ffff8111000f6000 R9: 0000000000000038
R10: ffff81203fd64158 R11: 00000000ffffffff R12: 0000000000a7511b
R13: 000aab4ebf8b3e37 R14: ffff811061835820 R15: ffff81091c19e080
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#12 [ffff8111000f7ef0] cpu_idle at ffffffff800490ad
crash> dis -rl ffffffff8008e81a
/usr/src/debug/kernel-2.6.18/linux-2.6.18-308.16.1.el5.x86_64/kernel/sched.c: 684
0xffffffff8008e7e2 <enqueue_task>: push %rbp
0xffffffff8008e7e3 <enqueue_task+1>: mov %rsp,%rbp
0xffffffff8008e7e6 <enqueue_task+4>: push %r12
0xffffffff8008e7e8 <enqueue_task+6>: mov %rdi,%r12
0xffffffff8008e7eb <enqueue_task+9>: push %rbx
/usr/src/debug/kernel-2.6.18/linux-2.6.18-308.16.1.el5.x86_64/kernel/sched.c: 624
0xffffffff8008e7ec <enqueue_task+10>: cmpq $0x0,0xd0(%rdi)
/usr/src/debug/kernel-2.6.18/linux-2.6.18-308.16.1.el5.x86_64/kernel/sched.c: 684
0xffffffff8008e7f4 <enqueue_task+18>: mov %rsi,%rbx
/usr/src/debug/kernel-2.6.18/linux-2.6.18-308.16.1.el5.x86_64/kernel/sched.c: 624
0xffffffff8008e7f7 <enqueue_task+21>: jne 0xffffffff8008e807 <enqueue_task+37>
/usr/src/debug/kernel-2.6.18/linux-2.6.18-308.16.1.el5.x86_64/kernel/sched.c: 625
0xffffffff8008e7f9 <enqueue_task+23>: mov 0x3d1a80(%rip),%rax # 0xffffffff80460280
0xffffffff8008e800 <enqueue_task+30>: mov %rax,0xd0(%rdi)
/usr/src/debug/kernel-2.6.18/linux-2.6.18-308.16.1.el5.x86_64/kernel/sched.c: 686
0xffffffff8008e807 <enqueue_task+37>: movslq 0x28(%r12),%rdx // task_struct.prio
0xffffffff8008e80c <enqueue_task+42>: lea 0x38(%r12),%rdi // &p->run_list
0xffffffff8008e811 <enqueue_task+47>: shl $0x4,%rdx // prio >> 4
0xffffffff8008e815 <enqueue_task+51>: lea 0x20(%rdx,%rbx,1),%rdx // array->queue + p->prio
/usr/src/debug/kernel-2.6.18/linux-2.6.18-308.16.1.el5.x86_64/include/linux/list.h: 83
0xffffffff8008e81a <enqueue_task+56>: mov 0x8(%rdx),%rsi
crash> whatis enqueue_task
void enqueue_task(struct task_struct *, struct prio_array *);
The task_struct from %rdi is now in %r12.
The prio_array is still in %rsi and also %rbx.
crash> task_struct.prio ffff81191975c040
prio = 115
crash> pd 115*2*2*2*2
$1 = 1840
crash> px 0xffff81091c122d78+1840
$2 = 0xffff81091c1234a8
crash> px 0xffff81091c1234a8+0x20
$3 = 0xffff81091c1234c8
The address we should have in %rdx is 0xffff81091c1234c8 but instead we have 0xffff80f91c1234c8.
crash> px 0xffff81091c1234c8-0xffff80f91c1234c8
$4 = 0x1000000000
We have a single bit difference and this looks like a hardware/mm fault.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
Welcome! Check out the Getting Started with Red Hat page for quick tours and guides for common tasks.
