RHEL7: Adding a LUN path to a NetApp SAN causes all disk IO through an HBA port to stall

Solution Verified

Issue

The following sequence of errors was observed:
- Adding a new path to an existing LUN caused disk IO to stall, and the SCSI layer started aborting timed-out requests.
- The aborts failed, which led to device resets.
- The device resets also failed, which caused fnic_reset to be called. This recovered all paths, including the newly added path.

# The new LUN path arrives:
May 22 11:50:51 server1 kernel: scsi 1:0:3:100: alua: supports implicit TPGS
May 22 11:50:51 server1 kernel: scsi 1:0:3:100: alua: port group 3e9 rel port 0
May 22 11:50:51 server1 kernel: scsi 1:0:3:100: alua: rtpg failed with 8000002
May 22 11:50:51 server1 kernel: scsi 1:0:3:100: alua: port group 3e9 state N non-preferred supports TolUsNA
May 22 11:50:51 server1 kernel: scsi 1:0:3:100: alua: Attached
May 22 11:50:51 server1 kernel: sd 1:0:3:100: Attached scsi generic sg4 type 0
May 22 11:50:51 server1 kernel: sd 1:0:3:100: [sde] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
May 22 11:50:51 server1 kernel: sd 1:0:3:100: [sde] 4096-byte physical blocks
May 22 11:50:51 server1 kernel: sd 1:0:3:100: [sde] Write Protect is off
May 22 11:50:51 server1 kernel: sd 1:0:3:100: [sde] Mode Sense: c7 00 00 08
May 22 11:50:51 server1 kernel: sd 1:0:3:100: [sde] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
May 22 11:50:51 server1 kernel: sd 1:0:3:100: [sde] Attached SCSI disk
May 22 11:50:51 server1 multipathd[4836]: sde: add path (uevent)
May 22 11:50:51 server1 multipathd[4836]: 3600a0980383043626d244b2f73445230: load table [0 20971520 multipath 4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handler 0 2 1 service-time 0 2 1 8:16 1 8:32 1 service-time 0 2 1 8:48 1 8:64 1]
May 22 11:50:51 server1 multipathd[4836]: sde [8:64]: path added to devmap 3600a0980383043626d244b2f73445230
May 22 11:50:51 server1 kernel: sd 1:0:0:100: alua: port group 3e8 state A non-preferred supports TolUsNA
May 22 11:50:51 server1 kernel: sd 1:0:1:100: alua: port group 3e8 state A non-preferred supports TolUsNA
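The "load table" line logged by multipathd above can be read field by field to confirm that the new path (sde, 8:64) joined the map. The following is a minimal Python sketch, assuming the usual device-mapper multipath table layout (feature count and features, hardware handler count, path group count and next path group, then one selector block per group). The function name parse_multipath_table and the dictionary keys are hypothetical and chosen only for illustration.

```python
def parse_multipath_table(table: str) -> dict:
    """Split a dm-multipath table string into features and path groups.

    Assumes the layout: start length "multipath" <n_features> <features...>
    <n_hw_handler_args> <args...> <n_groups> <next_group>, followed by
    <selector> <n_selector_args> <args...> <n_paths> <n_path_args>
    and then <dev> <path args...> for each path, once per group.
    """
    tokens = table.split()
    pos = 0

    def take(count: int = 1) -> list:
        # Consume the next `count` tokens, failing loudly on a short table.
        nonlocal pos
        chunk = tokens[pos:pos + count]
        if len(chunk) != count:
            raise ValueError("table ended unexpectedly")
        pos += count
        return chunk

    start, length, target = take(3)
    if target != "multipath":
        raise ValueError("not a multipath table")

    features = take(int(take()[0]))
    hw_handler = take(int(take()[0]))
    num_groups = int(take()[0])
    next_group = int(take()[0])

    groups = []
    for _ in range(num_groups):
        selector = take()[0]
        selector_args = take(int(take()[0]))
        num_paths = int(take()[0])
        num_path_args = int(take()[0])
        paths = []
        for _ in range(num_paths):
            dev = take()[0]
            paths.append({"dev": dev, "args": take(num_path_args)})
        groups.append({"selector": selector,
                       "selector_args": selector_args,
                       "paths": paths})

    return {"start": int(start), "sectors": int(length),
            "features": features, "hw_handler": hw_handler,
            "next_group": next_group, "groups": groups}


# Table string copied from the multipathd log line above.
TABLE = ("0 20971520 multipath 4 queue_if_no_path pg_init_retries 50 "
         "retain_attached_hw_handler 0 2 1 service-time 0 2 1 8:16 1 8:32 1 "
         "service-time 0 2 1 8:48 1 8:64 1")

parsed = parse_multipath_table(TABLE)
for index, group in enumerate(parsed["groups"], start=1):
    devs = [path["dev"] for path in group["paths"]]
    print(f"group {index}: {group['selector']} {devs}")
# Table lengths are in 512-byte sectors, matching the sd line for sde.
print(parsed["sectors"] * 512, "bytes")
```

Read this way, the table holds two path groups of two paths each, with 8:64 (sde) in the second group, and queue_if_no_path is set among the features. With queue_if_no_path, IO is queued rather than failed when no path is usable.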

# Disk IO and multipath commands become unresponsive, and SCSI aborts appear in the logs; they stop only once fnic_reset completes:

May 22 11:51:42 server1 kernel: scsi host1: Abort Cmd called FCID 0x101c2, LUN 0x64 TAG 0 flags 3
May 22 11:51:44 server1 kernel: scsi host1: abts cmpl recd. id 0 status FCPIO_TIMEOUT
May 22 11:51:44 server1 kernel: scsi host1: Returning from abort cmd type 2 FAILED
May 22 11:51:44 server1 kernel: scsi host1: Abort Cmd called FCID 0x101c2, LUN 0x64 TAG 1 flags 3
May 22 11:51:46 server1 kernel: scsi host1: abts cmpl recd. id 1 status FCPIO_TIMEOUT
May 22 11:51:46 server1 kernel: scsi host1: Returning from abort cmd type 2 FAILED
May 22 11:52:01 server1 kernel: scsi host1: Abort Cmd called FCID 0x101c2, LUN 0x64 TAG 2 flags 3
May 22 11:52:03 server1 kernel: scsi host1: abts cmpl recd. id 2 status FCPIO_TIMEOUT
May 22 11:52:03 server1 kernel: scsi host1: Returning from abort cmd type 2 FAILED
May 22 11:52:03 server1 kernel: scsi host1: Device reset called FCID 0x101c2, LUN 0x64 sc 0xffff885efca08380
May 22 11:52:03 server1 kernel: scsi host1: TAG 0
May 22 11:52:13 server1 kernel: scsi host1: Abort and terminate issued on Device reset tag 0x0 sc 0xffff885efca08380 
May 22 11:52:13 server1 kernel: scsi host1: Terminate pending dev reset cmpl recd. id 0 status FCPIO_ABORTED
May 22 11:52:13 server1 kernel: scsi host1: dev reset abts cmpl recd. id 60000000 status FCPIO_SUCCESS
May 22 11:52:13 server1 kernel: scsi host1: Device reset completed - failed
May 22 11:52:13 server1 kernel: scsi host1: Returning from device reset FAILED
May 22 11:52:13 server1 kernel: scsi host1: fnic_reset called
May 22 11:52:13 server1 kernel: scsi host1: update_mac 00:25:b5:11:c0:13
May 22 11:52:13 server1 kernel: scsi host1: Issued fw reset
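To check whether another host shows the same escalation (failed aborts, then a failed device reset, then fnic_reset), the kernel messages can be scanned for the first occurrence of each stage. The following is a rough Python sketch, not a supported tool. The stage names and the function escalation_timeline are hypothetical, and the patterns are taken only from the fnic messages shown above, so other driver versions may word them differently.

```python
import re
from datetime import datetime

# Stage name -> pattern, based only on the fnic messages in this article.
STAGES = [
    ("abort_issued", re.compile(r"Abort Cmd called")),
    ("abort_failed", re.compile(r"Returning from abort cmd type \d+ FAILED")),
    ("device_reset_failed", re.compile(r"Returning from device reset FAILED")),
    ("host_reset", re.compile(r"fnic_reset called")),
]

# The first HH:MM:SS on a syslog line is the message timestamp.
TIME_RE = re.compile(r"\b(\d{2}:\d{2}:\d{2})\b")


def escalation_timeline(lines):
    """Return {stage: time of first occurrence}, in the order stages appear."""
    first_seen = {}
    for line in lines:
        match = TIME_RE.search(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        for name, pattern in STAGES:
            if name not in first_seen and pattern.search(line):
                first_seen[name] = stamp
    return first_seen


# Lines copied from the log excerpt above.
SAMPLE = [
    "May 22 11:51:42 server1 kernel: scsi host1: Abort Cmd called FCID 0x101c2, LUN 0x64 TAG 0 flags 3",
    "May 22 11:51:44 server1 kernel: scsi host1: Returning from abort cmd type 2 FAILED",
    "May 22 11:52:13 server1 kernel: scsi host1: Returning from device reset FAILED",
    "May 22 11:52:13 server1 kernel: scsi host1: fnic_reset called",
]

timeline = escalation_timeline(SAMPLE)
for stage, stamp in timeline.items():
    print(stamp.strftime("%H:%M:%S"), stage)

stall = timeline["host_reset"] - timeline["abort_issued"]
print("seconds from first abort to fnic_reset:", int(stall.total_seconds()))
```

The sketch compares only the time of day, so it assumes all lines come from the same day. In the excerpt above, fnic_reset is called 31 seconds after the first abort.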

Environment

  • Red Hat Enterprise Linux 7
    • kernel - 3.10.0-693.el7
  • Cisco UCS M4 blade servers with VIC1340 eth/fc cards
  • NetApp target adapter: Emulex LPe16000 (LPe16002) rev. 13
  • NetApp: 9.1P2
  • HBA AH403A, firmware: 2.03X6 SLI-3 (U3D2.03X6)
