RHEL6.3.z: kernel panic in lpfc driver due to corrupt stack, RIP list_del, called from scsi_error_handler
Issue
- The kernel crashes in the Emulex lpfc driver during a SCSI abort / error recovery scenario.
- The system panics with the following messages:
Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffffa0079f8d
lpfc 0000:02:00.3: 1:(0):0748 abort handler timed out waiting for abort to complete: ret 0x2003, ID 0, LUN 0
lpfc 0000:02:00.3: 1:(0):0748 abort handler timed out waiting for abort to complete: ret 0x2003, ID 0, LUN 0
lpfc 0000:02:00.3: 1:(0):0748 abort handler timed out waiting for abort to complete: ret 0x2003, ID 0, LUN 0
lpfc 0000:02:00.3: 1:(0):0748 abort handler timed out waiting for abort to complete: ret 0x2003, ID 0, LUN 1
lpfc 0000:02:00.3: 1:(0):0748 abort handler timed out waiting for abort to complete: ret 0x2003, ID 0, LUN 0
lpfc 0000:02:00.3: 1:(0):0748 abort handler timed out waiting for abort to complete: ret 0x2003, ID 0, LUN 1
lpfc 0000:02:00.3: 1:(0):0748 abort handler timed out waiting for abort to complete: ret 0x2003, ID 0, LUN 0
lpfc 0000:02:00.3: 1:(0):0748 abort handler timed out waiting for abort to complete: ret 0x2003, ID 0, LUN 1
lpfc 0000:02:00.3: 1:(0):0748 abort handler timed out waiting for abort to complete: ret 0x2003, ID 0, LUN 2
[various blocked task messages...]
lpfc 0000:02:00.3: 1:0338 IOCB wait timeout error - no wake response Data x3c
lpfc 0000:02:00.3: 1:(0):0727 TMF FCP_LUN_RESET to TGT 0 LUN 0 failed (0, 0) iocb_flag x204
lpfc 0000:02:00.3: 1:(0):0713 SCSI layer issued Device Reset (0, 0) return x2007
lpfc 0000:02:00.3: 1:(0):0724 I/O flush failure for context LUN : cnt x5
lpfc 0000:02:00.3: 1:0338 IOCB wait timeout error - no wake response Data x3c
lpfc 0000:02:00.3: 1:(0):0727 TMF FCP_LUN_RESET to TGT 0 LUN 1 failed (0, 0) iocb_flag x204
lpfc 0000:02:00.3: 1:(0):0713 SCSI layer issued Device Reset (0, 1) return x2007
lpfc 0000:02:00.3: 1:(0):0724 I/O flush failure for context LUN : cnt x3
lpfc 0000:02:00.3: 1:0338 IOCB wait timeout error - no wake response Data x3c
lpfc 0000:02:00.3: 1:(0):0727 TMF FCP_LUN_RESET to TGT 0 LUN 2 failed (0, 0) iocb_flag x204
lpfc 0000:02:00.3: 1:(0):0713 SCSI layer issued Device Reset (0, 2) return x2007
lpfc 0000:02:00.3: 1:(0):0724 I/O flush failure for context LUN : cnt x1
lpfc 0000:02:00.3: 1:0338 IOCB wait timeout error - no wake response Data x3c
lpfc 0000:02:00.3: 1:(0):0727 TMF FCP_TARGET_RESET to TGT 0 LUN 0 failed (0, 0) iocb_flag x204
lpfc 0000:02:00.3: 1:(0):0723 SCSI layer issued Target Reset (0, 0) return x2007
lpfc 0000:02:00.3: 1:(0):0724 I/O flush failure for context TGT : cnt x9
lpfc 0000:02:00.3: 1:0338 IOCB wait timeout error - no wake response Data x3c
lpfc 0000:02:00.3: 1:(0):0727 TMF FCP_TARGET_RESET to TGT 0 LUN 0 failed (0, 0) iocb_flag x204
lpfc 0000:02:00.3: 1:(0):0700 Bus Reset on target 0 failed
lpfc 0000:02:00.3: 1:(0):0724 I/O flush failure for context HOST : cnt x9
lpfc 0000:02:00.3: 1:(0):0714 SCSI layer issued Bus Reset Data: x2003
general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:06:00.3/host4/rport-4:0-2/target4:0:0/4:0:0:2/timeout
CPU 16
Modules linked in: iptable_filter ip_tables mptctl mptbase bonding 8021q garp stp llc ipv6 microcode power_meter sg be2net(U) serio_raw iTCO_wdt iTCO_vendor_support hpilo hpwdt i7core_edac edac_core shpchp ext4 mbcache jbd2 dm_round_robin sd_mod crc_t10dif lpfc scsi_transport_fc scsi_tgt hpsa(U) dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 668, comm: scsi_eh_2 Not tainted 2.6.32-279.22.1.el6.x86_64 #1 HP ProLiant BL460c G7
RIP: 0010:[<ffffffff81279e90>] [<ffffffff81279e90>] list_del+0x10/0xa0
RSP: 0018:ffff8817e8c07ad0 EFLAGS: 00010282
RAX: dead000000200200 RBX: ffff8817e9b1de00 RCX: 0000000000000035
RDX: 000000000000000d RSI: ffff8817e9666200 RDI: ffff8817e9b1de00
RBP: ffff8817e8c07ae0 R08: ffff8817e8c07b00 R09: 0000000000000000
R10: ffff880028404180 R11: 0000000000000000 R12: ffff8817e8c07b00
R13: 0000000000000000 R14: 000000000000000d R15: 000000000000000e
FS: 0000000000000000(0000) GS:ffff880c36700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fc347a8d600 CR3: 0000000bec06e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process scsi_eh_2 (pid: 668, threadinfo ffff8817e8c06000, task ffff8817ea430080)
Stack:
ffff8817e8c07b00 ffff880be88b8000 ffff8817e8c07b40 ffffffffa00c3616
<d> ffff8817e9b1de00 0000000000000400 ffff8817e96e4800 ffff8817e9666000
<d> ffff8817e8c07b40 ffff880be88b8000 ffff8817e8e40c00 0000000000000000
Call Trace:
[<ffffffffa00c3616>] lpfc_sli4_repost_scsi_sgl_list+0x66/0x160 [lpfc]
[<ffffffffa008cfe1>] lpfc_sli4_hba_setup+0xdd1/0x1d70 [lpfc]
[<ffffffffa00780e3>] ? lpfc_sli_release_iocbq+0x53/0x70 [lpfc]
[<ffffffffa0094887>] ? lpfc_fabric_abort_hba+0x97/0xb0 [lpfc]
[<ffffffffa00809b0>] ? lpfc_sli_abort_iocb_ring+0xd0/0xf0 [lpfc]
[<ffffffffa00b2c13>] ? lpfc_hba_down_post_s4+0x1b3/0x1c0 [lpfc]
[<ffffffffa00afed8>] lpfc_online+0x178/0x1f0 [lpfc]
[<ffffffffa00c154b>] lpfc_host_reset_handler+0x4b/0xb0 [lpfc]
[<ffffffff81359952>] scsi_try_host_reset+0x42/0x120
[<ffffffff8135b30e>] scsi_eh_ready_devs+0x57e/0x840
[<ffffffff8135bce3>] scsi_error_handler+0x503/0x6e0
[<ffffffff8135b7e0>] ? scsi_error_handler+0x0/0x6e0
[<ffffffff81090876>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffff810907e0>] ? kthread+0x0/0xa0
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Code: 89 95 fc fe ff ff e9 ab fd ff ff 4c 8b ad e8 fe ff ff e9 db fd ff ff 90 90 90 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 47 08 <4c> 8b 00 4c 39 c7 75 39 48 8b 03 4c 8b 40 08 4c 39 c3 75 4c 48
RIP [<ffffffff81279e90>] list_del+0x10/0xa0
RSP <ffff8817e8c07ad0>
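One detail in the register dump above is worth checking mechanically: RAX holds dead000000200200 at the faulting instruction in list_del. On x86_64 kernels this value normally corresponds to LIST_POISON2, the marker that list_del writes into the prev pointer of an entry it has already removed, so it may indicate that the entry was deleted from a list twice. That reading is an interpretation and is not stated in this article. The following is a minimal sketch, assuming the usual x86_64 poison constants; the function name find_list_poison is hypothetical.

```python
import re

# Assumed x86_64 list poison values (include/linux/poison.h with a
# POISON_POINTER_DELTA of 0xdead000000000000). These constants are an
# assumption; they are not quoted anywhere in this article.
LIST_POISON1 = 0xDEAD000000100100  # written to entry->next by list_del
LIST_POISON2 = 0xDEAD000000200200  # written to entry->prev by list_del

# Matches general-purpose register fields such as "RAX: dead000000200200".
# Fields like "RIP: 0010:[<...>]" and "RSP: 0018:..." do not match because
# the value is not a bare 16-digit hex number.
REG_RE = re.compile(r"\b(R[A-Z0-9]{1,2}):\s*([0-9a-f]{16})\b")


def find_list_poison(oops_text):
    """Return {register: poison_name} for registers holding a list poison value."""
    hits = {}
    for name, value in REG_RE.findall(oops_text):
        number = int(value, 16)
        if number == LIST_POISON1:
            hits[name] = "LIST_POISON1"
        elif number == LIST_POISON2:
            hits[name] = "LIST_POISON2"
    return hits


sample = """RAX: dead000000200200 RBX: ffff8817e9b1de00 RCX: 0000000000000035
RDX: 000000000000000d RSI: ffff8817e9666200 RDI: ffff8817e9b1de00"""
print(find_list_poison(sample))
```

Run against the two register lines copied from the oops, the sketch reports RAX as holding LIST_POISON2. It only flags suspicious values; confirming a double deletion would still require vmcore analysis.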
Environment
- Red Hat Enterprise Linux 6.3.z kernels
- Seen on kernel 2.6.32-279.22.1.el6
- Other kernels, 2.6.32-279.14.1.el6 or later, may be affected
- The 6.3 GA kernel is not affected
- Emulex lpfc driver
- Neither RHEL 6.4 nor RHEL 5.6 kernels are believed to be vulnerable to this panic. In testing, the panic reproduced reliably on the RHEL 6.3.z kernel 2.6.32-279.22.1.el6 but could not be reproduced on a RHEL 5.6 or RHEL 6.4 kernel.
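The kernel range in the Environment list can be turned into a quick check against a kernel release string. The following is a minimal sketch, assuming release strings of the form printed by `uname -r` (for example 2.6.32-279.22.1.el6.x86_64); the function name may_be_affected is hypothetical. It returns True only for the range the article names, 2.6.32-279.14.1.el6 or later within the 6.3.z stream. The article does not state the status of earlier 6.3.z kernels, so the sketch treats them as not known to be affected.

```python
import platform
import re

# 6.3.z kernels look like 2.6.32-279.<a>.<b>.el6; the 6.3 GA kernel
# (2.6.32-279.el6) has no z-stream fields and therefore does not match.
ZSTREAM_RE = re.compile(r"^2\.6\.32-279\.(\d+)\.(\d+)\.el6")

# Lowest z-stream release the article lists as possibly affected.
FIRST_SUSPECT = (14, 1)


def may_be_affected(release):
    """Return True if the release falls in the range this article names."""
    match = ZSTREAM_RE.match(release)
    if match is None:
        # Not a RHEL 6.3.z kernel (covers 6.3 GA, RHEL 6.4 and RHEL 5.x).
        return False
    zstream = (int(match.group(1)), int(match.group(2)))
    return zstream >= FIRST_SUSPECT


if __name__ == "__main__":
    running = platform.release()  # same string as `uname -r`
    print(running, "may be affected:", may_be_affected(running))
```

A True result only means the kernel is in the reported range; the panic also requires the Emulex lpfc driver to be in use and an abort / reset recovery sequence like the one logged above.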