device-mapper-multipath on RHEL5 experiences excessive delay in detecting a lost path from a storage failure that produces no RSCN or loop/link error

Solution Verified - Updated -

Issue

  • When a path fails, multipath takes more than 5 minutes to switch to another path.
  • With an RDAC-based storage array, mpath_prio_rdac priority callouts may take 300 seconds to fail when the storage is unresponsive, delaying path failover.
  • One of two redundant Fibre Channel switches failed and de-zoned the LUNs presented through that switch.
  • Although an active path remained, the application timed out waiting for I/O from the voting disks while multipath waited for the SCSI layer to fail the path.
  • Any fabric failure that does not produce a Registered State Change Notification (RSCN) or a loop/link error takes at least 300 seconds to time out at the SCSI layer, leaving the multipath map unresponsive for that long.
  • Servers may reboot under load.

Environment

  • Red Hat Enterprise Linux (RHEL) 5
  • device-mapper-multipath configured (automatically or through /etc/multipath.conf) to use the tur or readsector0 path checker or the mpath_prio_rdac priority callout.
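
For orientation, a configuration matching this environment might look roughly like the fragment below. This is a minimal sketch, not the resolution to the issue and not a recommended configuration: placing both settings in the defaults section, the /sbin location of the callout, and the /dev/%n argument are assumptions based on common RHEL 5 usage. On many systems these values come from built-in device defaults and never appear in the file at all.

```
# /etc/multipath.conf (illustrative fragment only)
defaults {
        # One of the two path checkers named above; readsector0 is the other.
        path_checker    tur

        # Priority callout used with RDAC-based arrays.
        # The binary path and the %n device argument are assumptions.
        prio_callout    "/sbin/mpath_prio_rdac /dev/%n"
}
```

A system is affected if either the explicit file or the automatically selected device defaults resolve to one of these checkers or to this callout.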
