Kernel panics at boot while discovering the SAN LUNs using HDLM

Environment

Red Hat Enterprise Linux 5

Issue

The server fails to boot and reports a kernel panic while discovering the SAN LUNs using HDLM.
The following error messages are displayed:

Unable to handle kernel NULL pointer dereference at 0000000000000058 RIP:
[<ffffffff88075135>] :scsi_mod:scsi_device_lookup+0x11/0x62
PGD 81a8cd067 PUD 81e31e067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /block/sddlmaa/removable
CPU 2
Modules linked in: ipmi_si mpt2sas mptctl ipmi_devintf ipmi_msghandler dell_rbu gab(PU) llt(PU) nfs fscache nfs_acl lockdd
Pid: 10199, comm: dlmcfgmgr Tainted: P      2.6.18-164.el5 #1
RIP: 0010:[<ffffffff88075135>]  [<ffffffff88075135>] :scsi_mod:scsi_device_lookup+0x11/0x62
RSP: 0018:ffff81081dcc9b08  EFLAGS: 00010246
RAX: 0000ffff00000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000004000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000100 R12: 0000000000000000
R13: 0000000000000000 R14: ffff81083c05e4d0 R15: 000000000000004c
FS:  00002b90f51ac260(0000) GS:ffff81011cf1eec0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000058 CR3: 000000081d1d6000 CR4: 00000000000006e0
Process dlmcfgmgr (pid: 10199, threadinfo ffff81081dcc8000, task ffff810817942040)
Stack:  0000000000000000 0000000000000000 ffff81081dcc9c4c ffff81082ca30f20
ffff81083c05e4d0 ffffffff883a0b3c ffff81083c4c0061 0000000000000015
ffff81083c4c33c0 0000000000000006 ffff81083c4c33c0 ffffffff883a0bf9
Call Trace:
[<ffffffff883a0b3c>] :sddlmfdrv:DlmfdrvGetSdDevFromProcPartition+0x6d/0x121
[<ffffffff883a0bf9>] :sddlmfdrv:DlmfdrvDummyDevInProcPartition+0x9/0x14
[<ffffffff883ac129>] :sddlmfdrv:DlmfdrvApiIoctl+0xf5d/0x1d20
[<ffffffff883ad001>] :sddlmfdrv:DlmfdrvStartLogSend+0x115/0x156
[<ffffffff883ad502>] :sddlmfdrv:HSPLog_Main+0x4c0/0x528
[<ffffffff883aa2e7>] :sddlmfdrv:DlmfdrvIoctl+0x1a13/0x23ba
[<ffffffff8000ce81>] do_lookup+0x65/0x1e6
[<ffffffff8000d3a4>] dput+0x2c/0x114
[<ffffffff80143e24>] blkdev_driver_ioctl+0x5d/0x72
[<ffffffff80144475>] blkdev_ioctl+0x63c/0x697
[<ffffffff80063ad5>] mutex_lock+0xd/0x1d
[<ffffffff800e37f2>] do_open+0x250/0x30f
[<ffffffff800e3b05>] blkdev_open+0x0/0x4f
[<ffffffff800e3b28>] blkdev_open+0x23/0x4f
[<ffffffff8001e8b4>] __dentry_open+0x101/0x1dc
[<ffffffff800e2ec5>] block_ioctl+0x1b/0x1f
[<ffffffff800420a5>] do_ioctl+0x21/0x6b
[<ffffffff800302ce>] vfs_ioctl+0x457/0x4b9
[<ffffffff800b61b0>] audit_syscall_entry+0x180/0x1b3
[<ffffffff8004c766>] sys_ioctl+0x59/0x78
[<ffffffff8005d28d>] tracesys+0xd5/0xe0
Code: 48 8b 7f 58 89 cb e8 98 f9 fe f7 89 d9 44 89 e2 44 89 ee 48
RIP  [<ffffffff88075135>] :scsi_mod:scsi_device_lookup+0x11/0x62
RSP <ffff81081dcc9b08>
CR2: 0000000000000058
<0>Kernel panic - not syncing: Fatal exception

Resolution

The issue was resolved by re-installing the HDLM multipathing software.
If the issue persists after re-installation, contact Hitachi, the HDLM software vendor.

Root Cause

The call trace shows that the kernel crashed while dereferencing a NULL pointer at offset 0x11 (17 bytes) into the scsi_device_lookup() function, which was called from a chain of functions implemented in the "sddlmfdrv" kernel module:

Unable to handle kernel NULL pointer dereference at 0000000000000058 RIP:
[<ffffffff88075135>] :scsi_mod:scsi_device_lookup+0x11/0x62
Call Trace:
[<ffffffff883a0b3c>] :sddlmfdrv:DlmfdrvGetSdDevFromProcPartition+0x6d/0x121
[<ffffffff883a0bf9>] :sddlmfdrv:DlmfdrvDummyDevInProcPartition+0x9/0x14
[<ffffffff883ac129>] :sddlmfdrv:DlmfdrvApiIoctl+0xf5d/0x1d20
[<ffffffff883ad001>] :sddlmfdrv:DlmfdrvStartLogSend+0x115/0x156
[<ffffffff883ad502>] :sddlmfdrv:HSPLog_Main+0x4c0/0x528
[<ffffffff883aa2e7>] :sddlmfdrv:DlmfdrvIoctl+0x1a13/0x23ba

crash> dis -r scsi_device_lookup+0x11
0xffffffff88075124 <scsi_device_lookup>:        push   %r14
0xffffffff88075126 <scsi_device_lookup+0x2>:    push   %r13
0xffffffff88075128 <scsi_device_lookup+0x4>:    mov    %esi,%r13d
0xffffffff8807512b <scsi_device_lookup+0x7>:    push   %r12
0xffffffff8807512d <scsi_device_lookup+0x9>:    mov    %edx,%r12d
0xffffffff88075130 <scsi_device_lookup+0xc>:    push   %rbp
0xffffffff88075131 <scsi_device_lookup+0xd>:    mov    %rdi,%rbp
0xffffffff88075134 <scsi_device_lookup+0x10>:   push   %rbx
0xffffffff88075135 <scsi_device_lookup+0x11>:   mov    0x58(%rdi),%rdi
                                          RDI is 0x0 here ---^

RDI holds the pointer to the "struct Scsi_Host" passed as the first argument to the scsi_device_lookup() function.
The "host_lock" member is at offset 0x58, and reading it is the first operation in the C code of scsi_device_lookup():

struct scsi_device *scsi_device_lookup(struct Scsi_Host *shost, uint channel, uint id, uint lun)
{
    struct scsi_device *sdev;
    unsigned long flags;

    spin_lock_irqsave(shost->host_lock, flags);
                              ^--- the crash is here, shost is 0x0
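
The fault address in CR2 can be cross-checked against the disassembly: "mov 0x58(%rdi),%rdi" reads from RDI plus the displacement 0x58. The following is a minimal Python sketch of that arithmetic, using the register values from the oops above. The helper name effective_address is illustrative only and is not part of any tool.

```python
# Values copied from the oops output above.
RDI = 0x0000000000000000   # first argument register: the "shost" pointer
CR2 = 0x0000000000000058   # faulting address recorded by the CPU
DISPLACEMENT = 0x58        # from "mov 0x58(%rdi),%rdi"


def effective_address(base: int, displacement: int) -> int:
    """Address accessed by a base+displacement operand (64-bit wraparound)."""
    return (base + displacement) & 0xFFFFFFFFFFFFFFFF


fault = effective_address(RDI, DISPLACEMENT)
print(f"computed fault address: {fault:#018x}")
print(f"matches CR2: {fault == CR2}")
```

A match between the computed address and CR2 is consistent with the base pointer (shost) being NULL at the time of the access.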

An invalid "struct Scsi_Host *shost" value (0x0) was passed down by the chain of functions implemented in the "sddlmfdrv" kernel module.
This module is part of Hitachi HDLM (Hitachi Dynamic Link Manager). It is not shipped by Red Hat and is therefore not supported by Red Hat.
Contact the third-party software vendor to resolve this issue.
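
To see which modules own the frames in a call trace, the trace lines can be parsed mechanically. The following is a minimal Python sketch, assuming the frame format shown in this oops ("[<address>] :module:symbol+offset/size", with no module prefix for core kernel symbols). The names FRAME_RE, parse_frames and modules_in_trace are hypothetical helpers written for this illustration.

```python
import re

# A subset of the frames from the Call Trace above.
CALL_TRACE = """\
[<ffffffff883a0b3c>] :sddlmfdrv:DlmfdrvGetSdDevFromProcPartition+0x6d/0x121
[<ffffffff883a0bf9>] :sddlmfdrv:DlmfdrvDummyDevInProcPartition+0x9/0x14
[<ffffffff883ac129>] :sddlmfdrv:DlmfdrvApiIoctl+0xf5d/0x1d20
[<ffffffff883ad001>] :sddlmfdrv:DlmfdrvStartLogSend+0x115/0x156
[<ffffffff883ad502>] :sddlmfdrv:HSPLog_Main+0x4c0/0x528
[<ffffffff883aa2e7>] :sddlmfdrv:DlmfdrvIoctl+0x1a13/0x23ba
[<ffffffff8000ce81>] do_lookup+0x65/0x1e6
[<ffffffff8004c766>] sys_ioctl+0x59/0x78
"""

# Assumed frame layout: "[<addr>] :module:symbol+off/size" or "[<addr>] symbol+off/size".
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?::(?P<module>\w+):)?(?P<symbol>\w+)\+"
)


def parse_frames(text):
    """Return (module, symbol) pairs; core kernel frames are labelled 'kernel'."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("module") or "kernel", match.group("symbol")))
    return frames


def modules_in_trace(text):
    """Return loadable modules that appear in the trace, in first-seen order."""
    seen = []
    for module, _symbol in parse_frames(text):
        if module != "kernel" and module not in seen:
            seen.append(module)
    return seen


print(modules_in_trace(CALL_TRACE))
```

For the frames listed here, the only loadable module reported is "sddlmfdrv", which matches the conclusion above that the callers of scsi_device_lookup() belong to HDLM.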

Diagnostic Steps

Capture a "vmcore" file of the crash using kdump, then examine the kernel log stored in it (for example, with the "log" command of the crash utility).

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.