device-mapper-multipath on RHEL5 experiences excessive delay in detecting a lost path from a storage failure that produces no RSCN or loop/link error
Issue
- When a path fails, multipath takes over 5 minutes to switch to another path.
- With an RDAC-based storage array, mpath_prio_rdac priority callouts may take 300 seconds to fail when the storage is unresponsive, delaying path failover.
- One of two redundant Fibre Channel switches failed and de-zoned the LUNs presented through that switch.
- Although an active path remained, the application timed out waiting for I/O from the voting disks while multipath waited for the SCSI layer to fail the path.
- Any fabric failure that does not produce a Registered State Change Notification (RSCN) or a loop/link error takes at least 300 seconds to time out at the SCSI layer, leaving the multipath map unresponsive for that long.
- Servers may reboot under load.
Environment
- Red Hat Enterprise Linux (RHEL) 5
- device-mapper-multipath configured (automatically or through /etc/multipath.conf) to use the tur or readsector0 path checker or the mpath_prio_rdac priority callout.
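To check whether an explicit configuration matches the environment described above, the following is a minimal sketch in Python that scans the text of a multipath.conf-style file for the path_checker and prio_callout keywords. The function name find_affected_settings and the SAMPLE text are hypothetical and used only for illustration; the parsing is deliberately simplistic and assumes one keyword per line with "#" comments.

```python
import re

# Checkers and callout named in the Environment section of this article.
AFFECTED_CHECKERS = {"tur", "readsector0"}
AFFECTED_CALLOUT = "mpath_prio_rdac"


def find_affected_settings(conf_text):
    """Return (keyword, value) pairs that match the affected settings.

    Hypothetical helper: it only inspects explicit settings in the text
    it is given and knows nothing about multipath's built-in defaults.
    """
    hits = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        m = re.match(r'^(path_checker|prio_callout)\s+"?([^"]*)"?\s*$', line)
        if not m:
            continue
        key, value = m.group(1), m.group(2).strip()
        if key == "path_checker" and value in AFFECTED_CHECKERS:
            hits.append((key, value))
        elif key == "prio_callout" and AFFECTED_CALLOUT in value:
            hits.append((key, value))
    return hits


# Illustrative configuration text, not taken from a real system.
SAMPLE = '''
defaults {
    path_checker    readsector0   # example value
}
devices {
    device {
        prio_callout    "/sbin/mpath_prio_rdac /dev/%n"
    }
}
'''

if __name__ == "__main__":
    for key, value in find_affected_settings(SAMPLE):
        print(f"{key}: {value}")
```

Because the checker or callout can also be selected automatically, an empty result from a scan of /etc/multipath.conf does not by itself rule out this environment.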