Server panicked in enqueue_task+0x38 dereferencing a bad address

Solution In Progress - Updated -

Environment

  • Red Hat Enterprise Linux 5.8

Issue

  • server getting often rebooted
  • CI amrndhl1059 is ofen rebooted, since server is running with redhat 5.8 with oracle ASM and running on VMware with XTREMIO disk.
  • System panicked with:
Unable to handle kernel paging request at ffff80f91c1234d0 RIP: 
 [<ffffffff8008e81a>] enqueue_task+0x38/0x56
PGD 0 
Oops: 0000 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:11.0/0000:02:01.0/irq
CPU 4 
Modules linked in: oracleacfs(PU) oracleadvm(PU) oracleoks(PU) oracleasm(U) nfs nfs_acl autofs4 lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi vsock(U) vmmemctl(U) acpiphp dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac lp sg floppy i2c_piix4 parport_pc parport i2c_core shpchp vmci(U) e1000 pcspkr tpm_tis tpm serio_raw tpm_bios ide_cd cdrom vmxnet(U) pvscsi(U) vmxnet3(U) dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod cciss ata_piix libata mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Tainted: P     ---- 2.6.18-308.16.1.el5 #1
RIP: 0010:[<ffffffff8008e81a>]  [<ffffffff8008e81a>] enqueue_task+0x38/0x56
RSP: 0018:ffff81011f94fe20  EFLAGS: 00010087
RAX: 00000001b31a6895 RBX: ffff81091c122d78 RCX: 00000000000003e8
RDX: ffff80f91c1234c8 RSI: ffff81091c122d78 RDI: ffff81191975c078
RBP: ffff81011f94fe30 R08: 000000006624e484 R09: ffff8111000f7e48
R10: 0000000000000001 R11: 00000000ffffffff R12: ffff81191975c040
R13: ffff81191975c040 R14: ffff81091c122d78 R15: ffff81091c122420
FS:  0000000000000000(0000) GS:ffff81091c19d340(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffff80f91c1234d0 CR3: 000000023ba65000 CR4: 00000000000006a0
Process swapper (pid: 0, threadinfo ffff8111000f6000, task ffff81091c19e080)
Stack:  000aab4ec032fc12 ffff81091c122420 ffff81011f94fe60 ffffffff8008e88e
 000aab4ec032fc12 0000000000000004 ffff81191975c040 ffff81191975c040
 ffff81011f94ff00 ffffffff80046c8f 0000000000000080 000000000000000f
Call Trace:
 <IRQ>  [<ffffffff8008e88e>] __activate_task+0x56/0x6d
 [<ffffffff80046c8f>] try_to_wake_up+0x401/0x484
 [<ffffffff8008fa94>] rebalance_tick+0x183/0x3cd
 [<ffffffff8009a411>] process_timeout+0x0/0x5
 [<ffffffff8009a1f7>] run_timer_softirq+0x18d/0x23a
 [<ffffffff80012592>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006d646>] do_softirq+0x2c/0x7d
 [<ffffffff8006be03>] default_idle+0x0/0x50
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff8006be2c>] default_idle+0x29/0x50
 [<ffffffff800490ad>] cpu_idle+0x95/0xb8
 [<ffffffff80078af0>] start_secondary+0x479/0x488


Code: 48 8b 72 08 e8 3e c5 0c 00 41 8b 44 24 28 0f ab 43 08 ff 03 
RIP  [<ffffffff8008e81a>] enqueue_task+0x38/0x56
 RSP <ffff81011f94fe20>

Resolution

Unsure if this is a software bug or a hardware failure.

Root Cause

It looks like a register value has been corrupted, possibly due to a hardware error.

Diagnostic Steps

Looking at the vmcore:

      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.18-308.16.1.el5/vmlinux
    DUMPFILE: /cores/retrace/tasks/564810281/crash/vmcore  [PARTIAL DUMP]
        CPUS: 16
        DATE: Fri Jul 25 16:58:55 2014
      UPTIME: 34 days, 18:45:52
LOAD AVERAGE: 2.61, 3.01, 3.56
       TASKS: 674
    NODENAME: <nodename>
     RELEASE: 2.6.18-308.16.1.el5
     VERSION: #1 SMP Tue Sep 18 07:21:07 EDT 2012
     MACHINE: x86_64  (2699 Mhz)
      MEMORY: 126.2 GB
       PANIC: "Oops: 0000 [1] SMP " (check log for details)

Unable to handle kernel paging request at ffff80f91c1234d0 RIP: 
 [<ffffffff8008e81a>] enqueue_task+0x38/0x56
PGD 0 
Oops: 0000 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:11.0/0000:02:01.0/irq
CPU 4 
Modules linked in: oracleacfs(PU) oracleadvm(PU) oracleoks(PU) oracleasm(U) nfs nfs_acl autofs4 lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi vsock(U) vmmemctl(U) acpiphp dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac lp sg floppy i2c_piix4 parport_pc parport i2c_core shpchp vmci(U) e1000 pcspkr tpm_tis tpm serio_raw tpm_bios ide_cd cdrom vmxnet(U) pvscsi(U) vmxnet3(U) dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod cciss ata_piix libata mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Tainted: P     ---- 2.6.18-308.16.1.el5 #1
RIP: 0010:[<ffffffff8008e81a>]  [<ffffffff8008e81a>] enqueue_task+0x38/0x56
RSP: 0018:ffff81011f94fe20  EFLAGS: 00010087
RAX: 00000001b31a6895 RBX: ffff81091c122d78 RCX: 00000000000003e8
RDX: ffff80f91c1234c8 RSI: ffff81091c122d78 RDI: ffff81191975c078
RBP: ffff81011f94fe30 R08: 000000006624e484 R09: ffff8111000f7e48
R10: 0000000000000001 R11: 00000000ffffffff R12: ffff81191975c040
R13: ffff81191975c040 R14: ffff81091c122d78 R15: ffff81091c122420
FS:  0000000000000000(0000) GS:ffff81091c19d340(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffff80f91c1234d0 CR3: 000000023ba65000 CR4: 00000000000006a0
Process swapper (pid: 0, threadinfo ffff8111000f6000, task ffff81091c19e080)
Stack:  000aab4ec032fc12 ffff81091c122420 ffff81011f94fe60 ffffffff8008e88e
 000aab4ec032fc12 0000000000000004 ffff81191975c040 ffff81191975c040
 ffff81011f94ff00 ffffffff80046c8f 0000000000000080 000000000000000f
Call Trace:
 <IRQ>  [<ffffffff8008e88e>] __activate_task+0x56/0x6d
 [<ffffffff80046c8f>] try_to_wake_up+0x401/0x484
 [<ffffffff8008fa94>] rebalance_tick+0x183/0x3cd
 [<ffffffff8009a411>] process_timeout+0x0/0x5
 [<ffffffff8009a1f7>] run_timer_softirq+0x18d/0x23a
 [<ffffffff80012592>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006d646>] do_softirq+0x2c/0x7d
 [<ffffffff8006be03>] default_idle+0x0/0x50
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff8006be2c>] default_idle+0x29/0x50
 [<ffffffff800490ad>] cpu_idle+0x95/0xb8
 [<ffffffff80078af0>] start_secondary+0x479/0x488


Code: 48 8b 72 08 e8 3e c5 0c 00 41 8b 44 24 28 0f ab 43 08 ff 03 
RIP  [<ffffffff8008e81a>] enqueue_task+0x38/0x56
 RSP <ffff81011f94fe20>


crash> bt
PID: 0      TASK: ffff81091c19e080  CPU: 4   COMMAND: "swapper"
 #0 [ffff81011f94fb80] crash_kexec at ffffffff800b09a0
 #1 [ffff81011f94fc40] __die at ffffffff80065137
 #2 [ffff81011f94fc80] do_page_fault at ffffffff80067484
 #3 [ffff81011f94fd70] error_exit at ffffffff8005dde9
    [exception RIP: enqueue_task+56]
    RIP: ffffffff8008e81a  RSP: ffff81011f94fe20  RFLAGS: 00010087
    RAX: 00000001b31a6895  RBX: ffff81091c122d78  RCX: 00000000000003e8
    RDX: ffff80f91c1234c8  RSI: ffff81091c122d78  RDI: ffff81191975c078
    RBP: ffff81011f94fe30   R8: 000000006624e484   R9: ffff8111000f7e48
    R10: 0000000000000001  R11: 00000000ffffffff  R12: ffff81191975c040
    R13: ffff81191975c040  R14: ffff81091c122d78  R15: ffff81091c122420
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff81011f94fe38] __activate_task at ffffffff8008e88e
 #5 [ffff81011f94fe68] try_to_wake_up at ffffffff80046c8f
 #6 [ffff81011f94ff08] run_timer_softirq at ffffffff8009a1f7
 #7 [ffff81011f94ff58] __do_softirq at ffffffff80012592
 #8 [ffff81011f94ff88] call_softirq at ffffffff8005e2fc
 #9 [ffff81011f94ffa0] do_softirq at ffffffff8006d646
#10 [ffff81011f94ffb0] apic_timer_interrupt at ffffffff8005dc8e
--- <IRQ stack> ---
#11 [ffff8111000f7e48] apic_timer_interrupt at ffffffff8005dc8e
    [exception RIP: default_idle+41]
    RIP: ffffffff8006be2c  RSP: ffff8111000f7ef0  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: 00000000000000ff  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000001  RDI: ffffffff8031b738
    RBP: ffff81091c19e270   R8: ffff8111000f6000   R9: 0000000000000038
    R10: ffff81203fd64158  R11: 00000000ffffffff  R12: 0000000000a7511b
    R13: 000aab4ebf8b3e37  R14: ffff811061835820  R15: ffff81091c19e080
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#12 [ffff8111000f7ef0] cpu_idle at ffffffff800490ad

crash> dis -rl ffffffff8008e81a
/usr/src/debug/kernel-2.6.18/linux-2.6.18-308.16.1.el5.x86_64/kernel/sched.c: 684
0xffffffff8008e7e2 <enqueue_task>:      push   %rbp
0xffffffff8008e7e3 <enqueue_task+1>:    mov    %rsp,%rbp
0xffffffff8008e7e6 <enqueue_task+4>:    push   %r12
0xffffffff8008e7e8 <enqueue_task+6>:    mov    %rdi,%r12
0xffffffff8008e7eb <enqueue_task+9>:    push   %rbx
/usr/src/debug/kernel-2.6.18/linux-2.6.18-308.16.1.el5.x86_64/kernel/sched.c: 624
0xffffffff8008e7ec <enqueue_task+10>:   cmpq   $0x0,0xd0(%rdi)
/usr/src/debug/kernel-2.6.18/linux-2.6.18-308.16.1.el5.x86_64/kernel/sched.c: 684
0xffffffff8008e7f4 <enqueue_task+18>:   mov    %rsi,%rbx
/usr/src/debug/kernel-2.6.18/linux-2.6.18-308.16.1.el5.x86_64/kernel/sched.c: 624
0xffffffff8008e7f7 <enqueue_task+21>:   jne    0xffffffff8008e807 <enqueue_task+37>
/usr/src/debug/kernel-2.6.18/linux-2.6.18-308.16.1.el5.x86_64/kernel/sched.c: 625
0xffffffff8008e7f9 <enqueue_task+23>:   mov    0x3d1a80(%rip),%rax        # 0xffffffff80460280
0xffffffff8008e800 <enqueue_task+30>:   mov    %rax,0xd0(%rdi)
/usr/src/debug/kernel-2.6.18/linux-2.6.18-308.16.1.el5.x86_64/kernel/sched.c: 686
0xffffffff8008e807 <enqueue_task+37>:   movslq 0x28(%r12),%rdx         // task_struct.prio
0xffffffff8008e80c <enqueue_task+42>:   lea    0x38(%r12),%rdi         // &p->run_list
0xffffffff8008e811 <enqueue_task+47>:   shl    $0x4,%rdx               // prio >> 4
0xffffffff8008e815 <enqueue_task+51>:   lea    0x20(%rdx,%rbx,1),%rdx  // array->queue + p->prio
/usr/src/debug/kernel-2.6.18/linux-2.6.18-308.16.1.el5.x86_64/include/linux/list.h: 83
0xffffffff8008e81a <enqueue_task+56>:   mov    0x8(%rdx),%rsi

crash> whatis enqueue_task
void enqueue_task(struct task_struct *, struct prio_array *);

The task_struct from %rdi is now in %r12.
The prio_array is still in %rsi and also %rbx.

crash> task_struct.prio ffff81191975c040
  prio = 115

crash> pd 115*2*2*2*2
$1 = 1840

crash> px 0xffff81091c122d78+1840
$2 = 0xffff81091c1234a8

crash> px 0xffff81091c1234a8+0x20
$3 = 0xffff81091c1234c8

The address we should have in %rdx is 0xffff81091c1234c8 but instead we have 0xffff80f91c1234c8.

crash> px 0xffff81091c1234c8-0xffff80f91c1234c8
$4 = 0x1000000000

We have a single bit difference and this looks like a hardware/mm fault.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

Close

Welcome! Check out the Getting Started with Red Hat page for quick tours and guides for common tasks.