RHEL5: Kernel crash because of 2 corrupted 64-bit pointers 0xffff000000000000 followed by 0x000000000000ffff

Solution Verified - Updated -

Issue

  • Kernel crashes occur frequently in a common kernel code path, often with RIP: __d_rehash+0x18/0x20 or RIP: __d_lookup+0xdb/0xff.
  • Processes such as oracle, sshd, and HP's cmanicd crash in either __d_rehash or __d_lookup.
  • The last processes to run include kipmi0.
  • Examination of a crash dump shows that the kernel's dentry_hashtable has been corrupted with two non-NULL 64-bit values that are not valid kernel pointers. The corrupted values are usually the same on every panic, 0xffff000000000000 followed by 0x000000000000ffff, and are always at the same address.
  • On the same hardware, exactly the same memory locations of dentry_hashtable are found corrupted across different RHEL5 kernel versions, from RHEL5.4 up to RHEL5.8.
  • The same or very similar issue was originally reported in Red Hat Bugzilla 603620.
  • Here is a sample __d_rehash oops:
`Unable to handle kernel paging request at 00000000bf3c48e5 RIP:
[<ffffffff8003a3f8>] __d_rehash+0x18/0x20
...
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table ipv6 xfrm_nalgo crypto_api mptctl loop dm_mirror dm_log dm_multipath scsi_dh dm_mod video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport ngrstio(PU) ipmi_poweroff ipmi_devintf ipmi_si ipmi_msghandler i2c_dev sd_mod sg usb_storage ixgbe shpchp pcspkr mptsas ehci_hcd uhci_hcd mptscsih mptbase scsi_transport_sas scsi_mod igb 8021q dca i2c_i801 i2c_core

Pid: 16249, comm: sshd Tainted: P      2.6.18-194.el5 #1
RIP: 0010:[<ffffffff8003a3f8>]  [<ffffffff8003a3f8>] __d_rehash+0x18/0x20
RSP: 0018:ffff81043fbf7ec0  EFLAGS: 00010206
RAX: 00000000bf3c48dd RBX: ffff810431d75228 RCX: 0000000000000016
RDX: ffff810431d75240 RSI: ffff810001c8e708 RDI: ffff810431d75228
RBP: ffff81043316fcc0 R08: 00000000ffffffff R09: 0000000000000020
R10: 0000000000000000 R11: ffffffff80128780 R12: ffff810431ec1a80
R13: ffff810431ec1ad0 R14: 000000000000000c R15: 00002b08f76db299
FS:  00002b08f89df470(0000) GS:ffffffff803cb000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000bf3c48e5 CR3: 0000000436ed4000 CR4: 00000000000006e0

Process sshd (pid: 16249, threadinfo ffff81043fbf6000, task ffff810403ffd080)
Stack:  ffffffff80042367 ffff810431d75228 ffffffff8022718e 343535303632315b
000000000000005d ffff81043316fcc0 ffff81043fbf7f40 0000000900133c0a
ffff81043fbf7ed8 0000000000000004 0000000000000004 0000000000000004

Call Trace:
[<ffffffff80042367>] d_rehash+0x21/0x34
[<ffffffff8022718e>] sock_attach_fd+0x8f/0xfd
[<ffffffff8004d2be>] sock_map_fd+0x2a/0x59
[<ffffffff802272e9>] sys_socket+0x1f/0x36
[<ffffffff8005e116>] system_call+0x7e/0x83

Code: 48 89 50 08 48 89 16 c3 0f ca 45 89 c0 66 c1 c1 08 89 d2 4c
RIP  [<ffffffff8003a3f8>] __d_rehash+0x18/0x20`
  • Here is a sample __d_lookup oops:
Pid: 5211, comm: cmanicd Not tainted 2.6.18-308.el5 #1
RIP: 0010:[<ffffffff80009885>]  [<ffffffff80009885>] __d_lookup+0xdb/0xff
RSP: 0018:ffff81021963bc88  EFLAGS: 00010286
RAX: ffff81000904c900 RBX: 0000000000000101 RCX: 0000000000000014
RDX: 00000000000f8b60 RSI: ffff81021963bd28 RDI: ffff810211e97d20
RBP: ffff000000000000 R08: ffff81000001b600 R09: 0000000000000000
R10: ffff81021cb618c0 R11: ffffffff8012d790 R12: ffff810211e97d20
R13: ffff81021963bd28 R14: 0000000000038028 R15: 0000000000000002
FS:  0000000042852940(0063) GS:ffff81022ff18640(0000) knlGS:00000000f7efe8d0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaae2a2000 CR3: 0000000219baf000 CR4: 00000000000006e0
Process cmanicd (pid: 5211, threadinfo ffff81021963a000, task ffff81021a2b9860)
Stack:  ffff81020c098018 0000000000000101 0000000000000000 ffff81021c765798
 0000000000000000 ffff81021963bea8 ffff81021963bd28 ffffffff8000cfb0
 ffff81021963bda8 ffff81021963bd38 ffff810107aef280 0000000000000101
Call Trace:
 [<ffffffff8000cfb0>] do_lookup+0x2c/0x227
 [<ffffffff80009c53>] __link_path_walk+0x3aa/0xf39
 [<ffffffff8000eb31>] link_path_walk+0x45/0xb8
 [<ffffffff8000ce04>] do_path_lookup+0x294/0x310
 [<ffffffff8002384b>] __path_lookup_intent_open+0x56/0x97
 [<ffffffff8001b120>] open_namei+0x73/0x6c0
 [<ffffffff80027607>] do_filp_open+0x1c/0x38
 [<ffffffff80019fd3>] do_sys_open+0x44/0xbe
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Code: 48 8b 45 00 0f 18 08 48 8d 5d e8 44 39 73 30 75 e6 e9 70 ff
RIP  [<ffffffff80009885>] __d_lookup+0xdb/0xff
 RSP <ffff81021963bc88>
  • A blade server that otherwise runs without problems starts to panic regularly when SSH activity occurs.
  • Panics can also occur when files are created or removed.
  • When crash+kdump is activated, the blades no longer panic.
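
The pointer observations above can be checked with a few lines of arithmetic. The following is a minimal sketch, not a Red Hat tool: the function names (in_kernel_half, is_canonical, find_signature) and the idea of scanning a raw little-endian memory image are assumptions made for illustration. It assumes x86_64 with 48-bit virtual addressing, where kernel addresses lie in the upper canonical half starting at 0xffff800000000000.

```python
import struct

# Corruption signature described in the Issue section:
# two adjacent 64-bit words in dentry_hashtable.
SIGNATURE = (0xFFFF000000000000, 0x000000000000FFFF)

KERNEL_HALF_START = 0xFFFF800000000000  # upper canonical half (48-bit VA)


def is_canonical(addr: int) -> bool:
    """Bits 63..47 of a canonical x86_64 address are all 0 or all 1."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


def in_kernel_half(addr: int) -> bool:
    """True if the value lies in the upper canonical half."""
    return KERNEL_HALF_START <= addr < 2**64


def find_signature(buf: bytes, base: int = 0) -> list:
    """Return base+offset of each 8-byte-aligned SIGNATURE hit.

    buf is assumed to be a raw little-endian memory image
    (hypothetical input, e.g. bytes read from a dump).
    """
    pattern = struct.pack("<QQ", *SIGNATURE)
    hits = []
    for off in range(0, len(buf) - len(pattern) + 1, 8):
        if buf[off:off + len(pattern)] == pattern:
            hits.append(base + off)
    return hits


if __name__ == "__main__":
    for value in SIGNATURE:
        print(hex(value), "canonical:", is_canonical(value),
              "kernel half:", in_kernel_half(value))
    # Sample __d_rehash oops: the faulting address CR2 is RAX + 8.
    print(hex(0x00000000BF3C48DD + 8))
```

Neither signature value passes in_kernel_half, and 0xffff000000000000 is not even canonical, which matches RBP in the sample __d_lookup oops. Laid out in little-endian memory, the two words form a 16-byte region that is zero except for four 0xff bytes at offsets 6 to 9. This describes the shape of the corruption only; it does not identify what wrote it.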

Environment

  • Red Hat Enterprise Linux 5.4 - 5.8
  • This issue has been seen on different hardware platforms
  • Hardware: Alcatel-Lucent Atcav2 Platform
  • Hardware: Dell PowerEdge M910
  • Hardware: HP ProLiant BL465c G6 (AMD Opteron)
    • BIOS: HP Version: A13, Release Date: 05/02/2011
  • Hardware: HP ProLiant DL380 G7 (Intel Xeon)
    • System BIOS p67 12/1/2010
    • ILO 1.16
    • Smart Array p10i 3.52
  • Intel Westmere Xeons in the E7 family
    • Intel(R) Xeon(R) CPU E7- 8837 @ 2.67GHz
