RHEL7: Adding a LUN path to a NetApp SAN causes all disk IO through an HBA port to stall

Issue

The following sequence of errors was observed:
- Adding a new path to an existing LUN stalled disk IO, and the SCSI layer began aborting the timed-out requests. The aborts failed, which escalated to device resets. The device resets also failed, which caused fnic_reset to be called. The fnic_reset recovered all paths, including the newly added path.

# The new LUN path arrives:
May 22 11:50:51 server1 kernel: scsi 1:0:3:100: alua: supports implicit TPGS
May 22 11:50:51 server1 kernel: scsi 1:0:3:100: alua: port group 3e9 rel port 0
May 22 11:50:51 server1 kernel: scsi 1:0:3:100: alua: rtpg failed with 8000002
May 22 11:50:51 server1 kernel: scsi 1:0:3:100: alua: port group 3e9 state N non-preferred supports TolUsNA
May 22 11:50:51 server1 kernel: scsi 1:0:3:100: alua: Attached
May 22 11:50:51 server1 kernel: sd 1:0:3:100: Attached scsi generic sg4 type 0
May 22 11:50:51 server1 kernel: sd 1:0:3:100: [sde] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
May 22 11:50:51 server1 kernel: sd 1:0:3:100: [sde] 4096-byte physical blocks
May 22 11:50:51 server1 kernel: sd 1:0:3:100: [sde] Write Protect is off
May 22 11:50:51 server1 kernel: sd 1:0:3:100: [sde] Mode Sense: c7 00 00 08
May 22 11:50:51 server1 kernel: sd 1:0:3:100: [sde] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
May 22 11:50:51 server1 kernel: sd 1:0:3:100: [sde] Attached SCSI disk
May 22 11:50:51 server1 multipathd[4836]: sde: add path (uevent)
May 22 11:50:51 server1 multipathd[4836]: 3600a0980383043626d244b2f73445230: load table [0 20971520 multipath 4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handler 0 2 1 service-time 0 2 1 8:16 1 8:32 1 service-time 0 2 1 8:48 1 8:64 1]
May 22 11:50:51 server1 multipathd[4836]: sde [8:64]: path added to devmap 3600a0980383043626d244b2f73445230
May 22 11:50:51 server1 kernel: sd 1:0:0:100: alua: port group 3e8 state A non-preferred supports TolUsNA
May 22 11:50:51 server1 kernel: sd 1:0:1:100: alua: port group 3e8 state A non-preferred supports TolUsNA
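
The `load table` line from multipathd above packs the whole map definition into one string, which is hard to read by eye. The following is a minimal sketch, assuming the usual dm-multipath table layout (start, length, target, a counted feature list, a counted hardware handler list, the number of path groups and the initial group, then for each group a selector with its arguments followed by the paths). The function name `parse_multipath_table` and the keys of the returned dictionary are illustrative only and are not part of any multipath tool.

```python
def parse_multipath_table(table):
    """Split a dm-multipath table string into features and path groups.

    Assumes the standard layout; this is an illustrative helper, not a
    multipath-tools API.
    """
    tok = table.split()
    length, target = int(tok[1]), tok[2]
    if target != "multipath":
        raise ValueError("not a multipath table")

    i = 3
    n_feat = int(tok[i]); i += 1            # counted list of feature words
    features = tok[i:i + n_feat]; i += n_feat
    n_hw = int(tok[i]); i += 1              # counted list of hw handler words
    hw_handler = tok[i:i + n_hw]; i += n_hw
    n_groups, initial_group = int(tok[i]), int(tok[i + 1]); i += 2

    groups = []
    for _ in range(n_groups):
        selector, n_sel_args = tok[i], int(tok[i + 1]); i += 2
        i += n_sel_args                     # skip selector arguments
        n_paths, n_path_args = int(tok[i]), int(tok[i + 1]); i += 2
        paths = []
        for _ in range(n_paths):
            paths.append(tok[i])            # major:minor of the path device
            i += 1 + n_path_args            # skip per-path arguments
        groups.append({"selector": selector, "paths": paths})

    return {
        "sectors": length,
        "size_gib": length * 512 / 2**30,   # table length is in 512-byte sectors
        "features": features,
        "hw_handler": hw_handler,
        "initial_group": initial_group,
        "groups": groups,
    }
```

Applied to the table string logged at 11:50:51, this yields two service-time path groups, `8:16` and `8:32` in the first and `8:48` and `8:64` in the second. The last of these is the newly added `sde [8:64]`. The computed size is 10.0 GiB, which matches the size the kernel reported for `sde`.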

# Disk IO and multipath commands become unresponsive, and SCSI aborts appear in the logs; these stop only once the fnic_reset completes:

May 22 11:51:42 server1 kernel: scsi host1: Abort Cmd called FCID 0x101c2, LUN 0x64 TAG 0 flags 3
May 22 11:51:44 server1 kernel: scsi host1: abts cmpl recd. id 0 status FCPIO_TIMEOUT
May 22 11:51:44 server1 kernel: scsi host1: Returning from abort cmd type 2 FAILED
May 22 11:51:44 server1 kernel: scsi host1: Abort Cmd called FCID 0x101c2, LUN 0x64 TAG 1 flags 3
May 22 11:51:46 server1 kernel: scsi host1: abts cmpl recd. id 1 status FCPIO_TIMEOUT
May 22 11:51:46 server1 kernel: scsi host1: Returning from abort cmd type 2 FAILED
May 22 11:52:01 server1 kernel: scsi host1: Abort Cmd called FCID 0x101c2, LUN 0x64 TAG 2 flags 3
May 22 11:52:03 server1 kernel: scsi host1: abts cmpl recd. id 2 status FCPIO_TIMEOUT
May 22 11:52:03 server1 kernel: scsi host1: Returning from abort cmd type 2 FAILED
May 22 11:52:03 server1 kernel: scsi host1: Device reset called FCID 0x101c2, LUN 0x64 sc 0xffff885efca08380
May 22 11:52:03 server1 kernel: scsi host1: TAG 0
May 22 11:52:13 server1 kernel: scsi host1: Abort and terminate issued on Device reset tag 0x0 sc 0xffff885efca08380 
May 22 11:52:13 server1 kernel: scsi host1: Terminate pending dev reset cmpl recd. id 0 status FCPIO_ABORTED
May 22 11:52:13 server1 kernel: scsi host1: dev reset abts cmpl recd. id 60000000 status FCPIO_SUCCESS
May 22 11:52:13 server1 kernel: scsi host1: Device reset completed - failed
May 22 11:52:13 server1 kernel: scsi host1: Returning from device reset FAILED
May 22 11:52:13 server1 kernel: scsi host1: fnic_reset called
May 22 11:52:13 server1 kernel: scsi host1: update_mac 00:25:b5:11:c0:13
May 22 11:52:13 server1 kernel: scsi host1: Issued fw reset
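
To check whether another system shows the same abort, device reset, and fnic_reset escalation, the messages above can be reduced to a short timeline. The following is a minimal sketch that matches only the message strings shown in this article. The names `STAGES` and `escalation_timeline` are hypothetical, and the patterns may need adjusting for other fnic driver versions.

```python
import re

# Patterns are taken from the log lines in this article; stage names are
# illustrative labels, not driver terminology.
STAGES = [
    ("abort", re.compile(r"Abort Cmd called FCID (0x[0-9a-fA-F]+), LUN (0x[0-9a-fA-F]+)")),
    ("abort_failed", re.compile(r"Returning from abort cmd type \d+ FAILED")),
    ("device_reset", re.compile(r"Device reset called FCID (0x[0-9a-fA-F]+), LUN (0x[0-9a-fA-F]+)")),
    ("device_reset_failed", re.compile(r"Returning from device reset FAILED")),
    ("host_reset", re.compile(r"fnic_reset called")),
]


def escalation_timeline(lines):
    """Return a list of (time, stage, lun) tuples for recognised messages.

    lun is the decimal LUN number when the message carries one, else None.
    Assumes the syslog prefix "Mon DD HH:MM:SS host ..." used in the article.
    """
    events = []
    for line in lines:
        for stage, pattern in STAGES:
            match = pattern.search(line)
            if match:
                timestamp = line.split()[2]
                # Messages with an FCID and LUN capture two groups.
                lun = int(match.group(2), 16) if match.lastindex == 2 else None
                events.append((timestamp, stage, lun))
                break
    return events
```

In the excerpt above, the hexadecimal `LUN 0x64` decodes to 100, which is the LUN number in the SCSI address `1:0:3:100` of the newly added path.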

Environment

  • Red Hat Enterprise Linux 7
    • kernel - 3.10.0-693.el7
  • Cisco UCS M4 blade servers with VIC1340 eth/fc cards
  • NetApp target adapter: Emulex LPe16000 (LPe16002) rev. 13
  • NetApp: 9.1P2
  • HBA AH403A, firmware: 2.03X6 SLI-3 (U3D2.03X6)
