device-mapper-multipath on RHEL5 experiences excessive delay in detecting a lost path from a storage failure that produces no RSCN or loop/link error

Solution Verified - Updated -

Issue

  • When a path fails, multipath takes more than 5 minutes to switch to another path
  • With an RDAC-based storage array, mpath_prio_rdac priority callouts may take 300 seconds to fail when the storage is unresponsive, delaying path failover.
  • One of two redundant Fibre Channel switches failed and de-zoned the LUNs presented through that switch
  • Although there was a remaining active path, the application timed out waiting for I/O from the voting disks while multipath waited on the SCSI layer to fail the path
  • Any failure on the fabric that does not produce a Registered State Change Notification (RSCN) or a loop/link error takes at least 300 seconds to time out at the SCSI layer, leaving the multipath map unresponsive for that long.
  • Servers may reboot under load

Environment

  • Red Hat Enterprise Linux (RHEL) 5
  • device-mapper-multipath configured (automatically or through /etc/multipath.conf) to use the tur or readsector0 path checker or the mpath_prio_rdac priority callout.
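
To check whether an explicit configuration matches the environment described above, the settings in /etc/multipath.conf can be scanned for the affected path checkers and priority callout. The following is a minimal Python sketch, not a Red Hat tool: the function name find_affected_settings is hypothetical, and the parsing assumes one keyword per line with optional double quotes around the value. It only sees settings written in the file, so a device that gets tur, readsector0, or mpath_prio_rdac from the built-in defaults will not be reported.

```python
import re

# Settings named in the Environment section of this article.
AFFECTED_CHECKERS = {"tur", "readsector0"}
AFFECTED_PRIO_CALLOUT = "mpath_prio_rdac"


def find_affected_settings(conf_text):
    """Return (line_number, setting) pairs that match the affected settings.

    conf_text is the content of a multipath.conf file as a string.
    Hypothetical helper for illustration; it does not validate the file.
    """
    hits = []
    for number, raw in enumerate(conf_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        match = re.match(r"^(path_checker|prio_callout)\s+(.*)$", line)
        if not match:
            continue
        keyword = match.group(1)
        value = match.group(2).strip().strip('"')
        if keyword == "path_checker" and value in AFFECTED_CHECKERS:
            hits.append((number, "path_checker " + value))
        elif keyword == "prio_callout" and AFFECTED_PRIO_CALLOUT in value:
            hits.append((number, "prio_callout " + value))
    return hits


if __name__ == "__main__":
    # Assumes the default configuration file location.
    with open("/etc/multipath.conf") as handle:
        for number, setting in find_affected_settings(handle.read()):
            print("line %d: %s" % (number, setting))
```

An empty result does not rule out the issue, because the path checker or priority callout may have been selected automatically rather than through the file.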
