Red Hat Insights can detect this issue
- Red Hat Enterprise Linux 6, 7
- Cisco UCSB-B200-M3
- Cisco FCoE HBAs
The system has two Cisco fnic HBAs, and there was a connectivity issue to the sub-paths through one of the HBAs. Due to this, the following error messages were logged:
    rport-1:0-0: blocked FC remote port time out: removing rport
    rport-1:0-1: blocked FC remote port time out: removing target and saving binding
After the above error messages, the system got stuck with many processes waiting for completion of I/O on LVM volumes/filesystems created over SAN devices.
There was no issue in the sub-paths connected through the other HBA. Please provide a root cause analysis (RCA) of why the I/O was not failed over to the remaining paths.
Cisco UCS B200 M3 LUN, I/O failover problem
Check if the pci=nomsi kernel option is enabled. This option forces the kernel to use the legacy (INTx) interrupt mode for the Cisco FCoE HBAs instead of the recommended MSI-X interrupt mode.
    $ cat /boot/grub/grub.conf
    title Red Hat Enterprise Linux (2.6.32-504.23.4.el6.x86_64)
            root (hd0,0)
            kernel /vmlinuz-2.6.32-504.23.4.el6.x86_64 ro root=/dev/mapper/vgsystem-lvroot pci=nomsi rd_NO_LUKS LANG=en_US.UTF-8 rd_LVM_LV=vgsystem/lvswap rd_NO_MD rd_LVM_LV=vgsystem/lvroot crashkernel=auto rd_NO_DM transparent_hugepage=never
            initrd /initramfs-2.6.32-504.23.4.el6.x86_64.img
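On a running system, the active kernel command line can also be checked directly. The sketch below greps a sample command line for the option; the `cmdline` variable is only an illustration mirroring the grub.conf kernel line above, and on a live system you would read `/proc/cmdline` instead:

```shell
# Minimal sketch: detect pci=nomsi on the kernel command line.
# The sample string below is illustrative; on a live system use:
#   grep -w 'pci=nomsi' /proc/cmdline
cmdline='ro root=/dev/mapper/vgsystem-lvroot pci=nomsi crashkernel=auto'
if printf '%s\n' "$cmdline" | grep -qw 'pci=nomsi'; then
    echo 'pci=nomsi is enabled: legacy INTx interrupt mode is forced'
else
    echo 'pci=nomsi is not set'
fi
```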
If the pci=nomsi kernel option is enabled, remove it and reboot the system so that it can use the recommended MSI-X interrupt mode for the Cisco fnic HBAs.
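As an illustration (not the exact commands from this solution), removing the option amounts to deleting the `pci=nomsi` token from the kernel line. The sed expression below is shown against a sample line; on a real RHEL 6 system you would back up and edit /boot/grub/grub.conf, then reboot:

```shell
# Minimal sketch: strip the pci=nomsi token from a grub.conf kernel line.
# The kernel_line variable is a shortened sample for illustration only;
# on a real system, back up and edit /boot/grub/grub.conf, then reboot.
kernel_line='kernel /vmlinuz-2.6.32-504.23.4.el6.x86_64 ro root=/dev/mapper/vgsystem-lvroot pci=nomsi crashkernel=auto'
printf '%s\n' "$kernel_line" | sed 's/ pci=nomsi//'
```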
The system was using the legacy (INTx) interrupt mode for the Cisco FCoE HBAs. This caused issues in interrupt processing on the HBA, and error recovery was stalled.
The kernel line in the /boot/grub/grub.conf file had the pci=nomsi option enabled, which forced the kernel to use legacy interrupt mode for all PCI devices, including the FCoE HBAs, and disabled MSI-X mode.
The vmcore collected at the time of the hang shows that the following sub-paths to the SAN devices were in a failed state:
    mpathb dm-0 HITACHI OPEN-V
    size=3145728M features='1 queue_if_no_path' hwhandler=None
    +- policy='round-robin'
       `- 1:0:0:1 sdb 8:16 [scsi_device: 0xffff88204da8b800 sdev_state: SDEV_RUNNING]
       `- 2:0:0:1 sdd 8:48 [scsi_device: 0xffff88204e483000 sdev_state: SDEV_TRANSPORT_OFFLINE] <---
    mpatha (SCSI_ID) dm-1 HITACHI OPEN-V
    size=51200M features='1 queue_if_no_path' hwhandler=None
    +- policy='round-robin'
       `- 1:0:0:0 sda 8:0 [scsi_device: 0xffff8840537ba800 sdev_state: SDEV_RUNNING]
       `- 2:0:0:0 sdc 8:32 [scsi_device: 0xffff88405379a800 sdev_state: SDEV_CANCEL] <---
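On a live system the equivalent path states can be checked with `multipath -ll`. As a small sketch, the sub-path states from the output above can be tallied to show how many paths were still running; the `paths` variable inlines sample fields from that output for illustration:

```shell
# Minimal sketch: tally sub-path states from the multipath/vmcore output above.
# The paths variable is sample data; on a live system, inspect `multipath -ll`.
paths='1:0:0:1 sdb SDEV_RUNNING
2:0:0:1 sdd SDEV_TRANSPORT_OFFLINE
1:0:0:0 sda SDEV_RUNNING
2:0:0:0 sdc SDEV_CANCEL'
# Count occurrences of each sdev_state (third field).
printf '%s\n' "$paths" | awk '{count[$3]++} END {for (s in count) print s, count[s]}'
```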
The above paths to the SAN devices had failed due to connectivity issues with the SAN devices; I/O commands issued to them were failing with DID_NO_CONNECT and DID_TIME_OUT errors. The presence of the "blocked FC remote port time out: removing rport" error in the logs also shows that the remote storage FC port was lost:
    sd 2:0:0:1: rejecting I/O to offline device
    sd 2:0:0:1: rejecting I/O to offline device
    sd 2:0:0:1: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
    sd 2:0:0:1: [sdd] Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
    sd 2:0:0:1: [sdd] CDB: Write(10): 2a 00 92 e4 f5 80 00 03 70 00
    end_request: I/O error, dev sdd, sector 2464478592
    sd 2:0:0:1: [sdd] Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
    sd 2:0:0:1: [sdd] CDB: Read(10): 28 00 18 dc ec b8 00 00 08 00
    [...]
    console [netcon0] enabled
    netconsole: network logging started
    rport-2:0-0: blocked FC remote port time out: removing rport
    rport-2:0-1: blocked FC remote port time out: removing target and saving binding
    sd 2:0:0:0: rejecting I/O to offline device
But there was still one sub-path in the active, running state, so dm-multipath should have been able to fail over I/O to the remaining paths. On further examination of the multipathd process, it was observed to be stuck in the UN (uninterruptible) state due to the blocked fc_wq_2 workqueue on the affected HBA:
    [0 00:51:16.912] [UN]  PID: 47312  TASK: ffff882265fd4ab0  CPU: 36  COMMAND: "multipathd"
    [0 00:51:16.973] [UN]  PID: 1291   TASK: ffff88204da74ab0  CPU: 13  COMMAND: "fc_wq_2"
The vmcore also confirms that the affected HBA was in a blocked state:

    crash> scsishow --check
    WARNING: host 0xffff88204d998000 is blocked!
             HBA driver refusing all commands with SCSI_MLQUEUE_HOST_BUSY?
On further review of the HBA statistics, the hardware vendor found that the HBA was not processing interrupts as expected, due to which it got blocked. The issues with interrupt processing were caused by the use of the legacy (INTx) interrupt mode instead of the recommended MSI-X interrupt mode for the Cisco FCoE HBAs. The logs in the vmcore confirmed that the Cisco vnic and fnic devices were using the legacy interrupt mode, which is not recommended:
    enic 0000:06:00.0: vNIC resources used: wq 1 rq 1 cq 2 intr 3 intr mode legacy PCI INTx
    enic 0000:06:00.0: (unregistered net_device): enic: INTR mode is not MSIX, Not initializing adaptive coalescing <----------
    [...]
    enic 0000:06:00.1: vNIC resources used: wq 1 rq 1 cq 2 intr 3 intr mode legacy PCI INTx
    enic 0000:06:00.1: (unregistered net_device): enic: INTR mode is not MSIX, Not initializing adaptive coalescing <----------
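After a reboot without pci=nomsi, the same driver probe messages can be used to verify the interrupt mode. The sketch below classifies a sample log line taken from the output above; it assumes the driver reports "MSI-X" in the corresponding message when MSI-X is in use, and on a live system you would pipe `dmesg | grep 'intr mode'` into the same awk:

```shell
# Minimal sketch: classify the enic/fnic interrupt mode from a driver log line.
# The log variable is a sample line from the vmcore logs above; on a live
# system use:  dmesg | grep 'intr mode'
log='enic 0000:06:00.0: vNIC resources used: wq 1 rq 1 cq 2 intr 3 intr mode legacy PCI INTx'
printf '%s\n' "$log" | awk '/intr mode MSI-X/  {print "MSI-X (recommended)"}
                            /legacy PCI INTx/  {print "legacy INTx (not recommended)"}'
```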
There were also the following "IRQ handler type mismatch" errors, which confirm issues with interrupt negotiation:
    IRQ handler type mismatch for IRQ 0
    current handler: timer
    Pid: 2845, comm: work_for_cpu Not tainted 2.6.32-504.23.4.el6.x86_64 #1
    Call Trace:
     [<ffffffff810ebd62>] ? __setup_irq+0x382/0x3c0
     [<ffffffffa0163fd0>] ? enic_isr_legacy+0x0/0x130 [enic]
     [<ffffffff810ec563>] ? request_threaded_irq+0x133/0x230
     [<ffffffffa016718c>] ? enic_probe+0x5fc/0xdd0 [enic]
     [<ffffffff810985a0>] ? do_work_for_cpu+0x0/0x30
     [<ffffffff812af417>] ? local_pci_probe+0x17/0x20
     [<ffffffff810985b8>] ? do_work_for_cpu+0x18/0x30
     [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
     [<ffffffff8100c28a>] ? child_rip+0xa/0x20
     [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
     [<ffffffff8100c280>] ? child_rip+0x0/0x20
    enic 0000:06:00.1: eth1: Unable to request irq.
    enic: probe of 0000:06:00.1 failed with error -16
    enic 0000:06:00.2: enabling device (0000 -> 0002)
The kernel options used during the boot process show that, at the time of the issue, the system was using the pci=nomsi option, which had disabled the use of MSI-X interrupt mode:
    Command line: ro root=/dev/mapper/vgsystem-lvroot pci=nomsi rd_NO_LUKS LANG=en_US.UTF-8 rd_LVM_LV=vgsystem/lvswap rd_NO_MD rd_LVM_LV=vgsystem/lvroot crashkernel=auto rd_NO_DM transparent_hugepage=never
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.