RHEL5: Kernel crash because of 2 corrupted 64-bit pointers 0xffff000000000000 followed by 0x000000000000ffff
Issue
- The kernel crashes frequently in a common code path, typically with RIP: __d_rehash+0x18/0x20 or RIP: __d_lookup+0xdb/0xff
- Processes such as oracle, sshd, and HP's cmanicd crash in either __d_rehash or __d_lookup.
- The last processes that ran include kipmi0
- Examination of a crash dump shows that the kernel's dentry_hashtable has been corrupted with two non-NULL 64-bit values that are not valid kernel pointers. The corrupted values are usually the same on every panic, 0xffff000000000000 followed by 0x000000000000ffff, and they are always at the same address.
- On the same hardware, the exact same memory locations of dentry_hashtable are found corrupted across different RHEL5 kernel versions, from RHEL5.4 up to RHEL5.8.
- The same or very similar issue was originally reported in Red Hat Bugzilla 603620.
- Here is a sample __d_rehash oops:
`Unable to handle kernel paging request at 00000000bf3c48e5 RIP:
[<ffffffff8003a3f8>] __d_rehash+0x18/0x20
...
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table ipv6 xfrm_nalgo crypto_api mptctl loop dm_mirror dm_log dm_multipath scsi_dh dm_mod video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport ngrstio(PU) ipmi_poweroff ipmi_devintf ipmi_si ipmi_msghandler i2c_dev sd_mod sg usb_storage ixgbe shpchp pcspkr mptsas ehci_hcd uhci_hcd mptscsih mptbase scsi_transport_sas scsi_mod igb 8021q dca i2c_i801 i2c_core
Pid: 16249, comm: sshd Tainted: P 2.6.18-194.el5 #1
RIP: 0010:[<ffffffff8003a3f8>] [<ffffffff8003a3f8>] __d_rehash+0x18/0x20
RSP: 0018:ffff81043fbf7ec0 EFLAGS: 00010206
RAX: 00000000bf3c48dd RBX: ffff810431d75228 RCX: 0000000000000016
RDX: ffff810431d75240 RSI: ffff810001c8e708 RDI: ffff810431d75228
RBP: ffff81043316fcc0 R08: 00000000ffffffff R09: 0000000000000020
R10: 0000000000000000 R11: ffffffff80128780 R12: ffff810431ec1a80
R13: ffff810431ec1ad0 R14: 000000000000000c R15: 00002b08f76db299
FS: 00002b08f89df470(0000) GS:ffffffff803cb000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000bf3c48e5 CR3: 0000000436ed4000 CR4: 00000000000006e0
Process sshd (pid: 16249, threadinfo ffff81043fbf6000, task ffff810403ffd080)
Stack: ffffffff80042367 ffff810431d75228 ffffffff8022718e 343535303632315b
000000000000005d ffff81043316fcc0 ffff81043fbf7f40 0000000900133c0a
ffff81043fbf7ed8 0000000000000004 0000000000000004 0000000000000004
Call Trace:
[<ffffffff80042367>] d_rehash+0x21/0x34
[<ffffffff8022718e>] sock_attach_fd+0x8f/0xfd
[<ffffffff8004d2be>] sock_map_fd+0x2a/0x59
[<ffffffff802272e9>] sys_socket+0x1f/0x36
[<ffffffff8005e116>] system_call+0x7e/0x83
Code: 48 89 50 08 48 89 16 c3 0f ca 45 89 c0 66 c1 c1 08 89 d2 4c
RIP [<ffffffff8003a3f8>] __d_rehash+0x18/0x20`
- Here is a sample __d_lookup oops:
`Pid: 5211, comm: cmanicd Not tainted 2.6.18-308.el5 #1
RIP: 0010:[<ffffffff80009885>] [<ffffffff80009885>] __d_lookup+0xdb/0xff
RSP: 0018:ffff81021963bc88 EFLAGS: 00010286
RAX: ffff81000904c900 RBX: 0000000000000101 RCX: 0000000000000014
RDX: 00000000000f8b60 RSI: ffff81021963bd28 RDI: ffff810211e97d20
RBP: ffff000000000000 R08: ffff81000001b600 R09: 0000000000000000
R10: ffff81021cb618c0 R11: ffffffff8012d790 R12: ffff810211e97d20
R13: ffff81021963bd28 R14: 0000000000038028 R15: 0000000000000002
FS: 0000000042852940(0063) GS:ffff81022ff18640(0000) knlGS:00000000f7efe8d0
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaae2a2000 CR3: 0000000219baf000 CR4: 00000000000006e0
Process cmanicd (pid: 5211, threadinfo ffff81021963a000, task ffff81021a2b9860)
Stack: ffff81020c098018 0000000000000101 0000000000000000 ffff81021c765798
0000000000000000 ffff81021963bea8 ffff81021963bd28 ffffffff8000cfb0
ffff81021963bda8 ffff81021963bd38 ffff810107aef280 0000000000000101
Call Trace:
[<ffffffff8000cfb0>] do_lookup+0x2c/0x227
[<ffffffff80009c53>] __link_path_walk+0x3aa/0xf39
[<ffffffff8000eb31>] link_path_walk+0x45/0xb8
[<ffffffff8000ce04>] do_path_lookup+0x294/0x310
[<ffffffff8002384b>] __path_lookup_intent_open+0x56/0x97
[<ffffffff8001b120>] open_namei+0x73/0x6c0
[<ffffffff80027607>] do_filp_open+0x1c/0x38
[<ffffffff80019fd3>] do_sys_open+0x44/0xbe
[<ffffffff8005d28d>] tracesys+0xd5/0xe0
Code: 48 8b 45 00 0f 18 08 48 8d 5d e8 44 39 73 30 75 e6 e9 70 ff
RIP [<ffffffff80009885>] __d_lookup+0xdb/0xff
RSP <ffff81021963bc88>`
- A blade server otherwise running without problem starts experiencing regular kernel panics when SSH activity occurs.
- Panic can also occur when creating or removing files.
- When crash+kdump is enabled, the blades no longer panic.
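The pointer observations in this section can be checked mechanically. The following is a minimal sketch in Python, not part of any Red Hat tool. It assumes x86_64 with 48-bit virtual addresses, where a canonical address has bits 63 through 47 all equal. The function names and the sample list of hash table words are hypothetical and used only for illustration. The register values are copied from the two oopses above.

```python
# Sketch only: classify 64-bit words taken from a crash dump.
# Assumption: x86_64 with 48-bit virtual addressing (canonical form
# means bits 63..47 are all zeros or all ones).

CANONICAL_SHIFT = 47
ALL_ONES_TOP = 0x1FFFF  # the 17 bits 63..47, all set

# The two corrupted values this article reports in dentry_hashtable.
SIGNATURE = (0xFFFF000000000000, 0x000000000000FFFF)


def is_canonical(addr):
    """Return True if bits 63..47 of addr are all equal."""
    top = (addr & 0xFFFFFFFFFFFFFFFF) >> CANONICAL_SHIFT
    return top == 0 or top == ALL_ONES_TOP


def is_kernel_half(addr):
    """Return True if addr is canonical and in the upper (kernel) half."""
    return is_canonical(addr) and (addr >> 63) & 1 == 1


def find_signature(words):
    """Return every index i where words[i], words[i + 1] match SIGNATURE."""
    return [i for i in range(len(words) - 1)
            if (words[i], words[i + 1]) == SIGNATURE]


if __name__ == "__main__":
    # Hypothetical slice of hash table words; only the middle pair
    # comes from this article.
    sample = [0x0, 0xFFFF810431D75228,
              0xFFFF000000000000, 0x000000000000FFFF, 0x0]
    print("signature found at indexes:", find_signature(sample))

    for value in SIGNATURE:
        print(hex(value),
              "canonical:", is_canonical(value),
              "kernel half:", is_kernel_half(value))

    # First oops: the faulting address in CR2 is RAX + 8.
    rax, cr2 = 0x00000000BF3C48DD, 0x00000000BF3C48E5
    print("CR2 == RAX + 8:", cr2 == rax + 8)
```

Neither corrupted value passes as a kernel pointer under these assumptions: 0xffff000000000000 is non-canonical, and 0x000000000000ffff is canonical but lies in the lower half of the address space. In the __d_lookup oops, RBP holds 0xffff000000000000 and the first bytes on the Code line (48 8b 45 00) decode to a load through RBP. In the __d_rehash oops, CR2 is exactly RAX + 8, and the first Code bytes (48 89 50 08) decode to a store at offset 8 from RAX.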
Environment
- Red Hat Enterprise Linux 5.4 - 5.8
- This issue has been seen on different hardware platforms
- Hardware: Alcatel-Lucent Atcav2 Platform
- Hardware: Dell PowerEdge M910
- Hardware: HP ProLiant BL465c G6 (AMD Opteron)
- BIOS: HP Version: A13, Release Date: 05/02/2011
- Hardware: HP ProLiant DL380 G7 (Intel Xeon)
- System BIOS p67 12/1/2010
- ILO 1.16
- Smart Array p10i 3.52
- Intel Westmere Xeons in the E7 family
- Intel(R) Xeon(R) CPU E7- 8837 @ 2.67GHz