multipath is not detecting path failures fast enough, which results in application failures and system reboots

Environment

  • Red Hat Enterprise Linux 6
  • Red Hat Enterprise Linux 7
  • device-mapper-multipath

Issue

Several of our servers have devices that go offline, and multipath does not switch over to the alternate paths in a timely fashion. As a result, production applications become unavailable and, in some cases, the servers reboot.

Resolution

  • Add the following parameters to the defaults{} section in /etc/multipath.conf:
defaults {
        polling_interval 5
        fast_io_fail_tmo 5
        dev_loss_tmo 10
        checker_timeout 15
}
  • Reload multipathd:
# service multipathd reload
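
After the reload, it can be useful to confirm that the running daemon picked up the new values. The following is a minimal sketch, not part of the original solution: filter_timeouts is a hypothetical helper name, and it assumes the configuration dump prints one "keyword value" pair per line. The dump also includes the built-in devices{} entries, so the same keyword may appear several times.

```shell
# filter_timeouts (hypothetical helper): reads a multipathd configuration
# dump on stdin and prints only the lines that set the four timing
# parameters discussed in this solution.
filter_timeouts() {
    grep -E '^[[:space:]]*(polling_interval|fast_io_fail_tmo|dev_loss_tmo|checker_timeout)[[:space:]]'
}

# Usage on a live system (as root, with multipathd running):
#   multipathd -k'show config' | filter_timeouts
```

Lines from the defaults section appear first in the dump; any later devices{} entries that repeat a keyword take precedence for the matching hardware.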

Root Cause

  • There are multiple parameters that affect error detection and failover times:
    dev_loss_tmo
    fast_io_fail_tmo
    checker_timeout
    polling_interval

  • dev_loss_tmo (rport) sets the extended link timeout: after a link-down event, in-flight I/O is held for this many seconds before the driver gives up waiting for the port to come back. The default is 30-35 seconds, so in-flight I/O can be held that long before it is failed. When the timeout expires, the rport is put into the offline (down) state.

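  • fast_io_fail_tmo (rport) controls how long I/O is queued and held while the rport is in the blocked state. Once it expires, outstanding I/O on the blocked rport is failed, which allows multipath to retry it on another path.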

  • checker_timeout specifies the timeout, in seconds, to use for path checkers that issue SCSI commands with an explicit timeout. The default is taken from /sys/block/sd/device/timeout

  • polling_interval - interval between path checks in seconds.
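
Because fast_io_fail_tmo is meant to fail I/O before dev_loss_tmo removes the remote port, it should be the smaller of the two values. The sketch below checks a pair of values before they go into multipath.conf. The function name check_fc_timeouts and its messages are hypothetical, and it only handles the numeric values and "off" shown in this solution.

```shell
# check_fc_timeouts FAST_IO_FAIL_TMO DEV_LOSS_TMO   (hypothetical helper)
# Returns 0 when fast_io_fail_tmo is a number smaller than dev_loss_tmo,
# 1 when the pair is usable but not suited to fast failover,
# 2 when a value is not recognized.
check_fc_timeouts() {
    fast=$1
    loss=$2
    case $loss in
        ''|*[!0-9]*) echo "unrecognized dev_loss_tmo: $loss"; return 2 ;;
    esac
    case $fast in
        off) echo "fast_io_fail_tmo is off: I/O is held until dev_loss_tmo ($loss) expires"
             return 1 ;;
        ''|*[!0-9]*) echo "unrecognized fast_io_fail_tmo: $fast"; return 2 ;;
    esac
    if [ "$fast" -lt "$loss" ]; then
        echo "ok: fast_io_fail_tmo=$fast < dev_loss_tmo=$loss"
    else
        echo "fast_io_fail_tmo=$fast should be smaller than dev_loss_tmo=$loss"
        return 1
    fi
}

# Example: the values from the Resolution section.
#   check_fc_timeouts 5 10
```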

Diagnostic Steps

  • Default:
# for f in /sys/class/fc_remote_ports/rport-*/fast_io_fail_tmo; do d=$(dirname $f); echo $(basename $d):$(cat $d/node_name):$(cat $f); done
rport-8:0-0:0x50014380113622b0:off
rport-8:0-1:0x50014380113622b0:off
rport-8:0-2:0x50014380113622b0:off
rport-8:0-3:0x50014380113622b0:off

# for f in /sys/class/fc_remote_ports/rport-*/dev_loss_tmo; do d=$(dirname $f); echo $(basename $d):$(cat $d/node_name):$(cat $f); done
rport-8:0-0:0x50014380113622b0:30
rport-8:0-1:0x50014380113622b0:30
rport-8:0-2:0x50014380113622b0:30
rport-8:0-3:0x50014380113622b0:30
  • Add the following parameters to /etc/multipath.conf in the defaults section:
defaults {
        user_friendly_names yes
        fast_io_fail_tmo 5
        dev_loss_tmo 10
        no_path_retry fail
}
  • Reload multipathd:
# service multipathd reload
  • Modified:
# for f in /sys/class/fc_remote_ports/rport-*/fast_io_fail_tmo; do d=$(dirname $f); echo $(basename $d):$(cat $d/node_name):$(cat $f); done
rport-8:0-0:0x50014380113622b0:5
rport-8:0-1:0x50014380113622b0:5
rport-8:0-2:0x50014380113622b0:5
rport-8:0-3:0x50014380113622b0:5

# for f in /sys/class/fc_remote_ports/rport-*/dev_loss_tmo; do d=$(dirname $f); echo $(basename $d):$(cat $d/node_name):$(cat $f); done
rport-8:0-0:0x50014380113622b0:10
rport-8:0-1:0x50014380113622b0:10
rport-8:0-2:0x50014380113622b0:10
rport-8:0-3:0x50014380113622b0:10
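
The two loops above can also be combined so that both timeouts appear on one line per remote port, which makes before/after comparisons easier. This is a sketch only: list_rport_timeouts is a hypothetical name, and the optional directory argument is an addition so the function can be run against a copy of the sysfs tree.

```shell
# list_rport_timeouts [SYSFS_DIR]   (hypothetical helper)
# Prints "rport:node_name:fast_io_fail_tmo:dev_loss_tmo" for each FC remote
# port. SYSFS_DIR defaults to /sys/class/fc_remote_ports.
list_rport_timeouts() {
    base=${1:-/sys/class/fc_remote_ports}
    for d in "$base"/rport-*; do
        [ -d "$d" ] || continue    # glob did not match: no FC remote ports
        printf '%s:%s:%s:%s\n' \
            "$(basename "$d")" \
            "$(cat "$d/node_name")" \
            "$(cat "$d/fast_io_fail_tmo")" \
            "$(cat "$d/dev_loss_tmo")"
    done
}

# Usage:
#   list_rport_timeouts
```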
  • In addition, the following multipath.conf parameters control map behavior when all paths are lost.
- no_path_retry:
Specifies the number of retries before queueing is disabled, fail for immediate failure (no queueing),
or queue to never stop queueing. The default config entry is fail. However, a devices{} stanza will override this.

- flush_on_last_del:
If set to yes, multipathd will disable queueing when the last path to a device
has been deleted. The default is no.
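
Since a devices{} stanza can override no_path_retry, it is worth checking which maps are actually queueing. The sketch below assumes the `multipath -ll` layout seen on RHEL 6 and 7, where each map starts with a header line containing its dm-N name and is followed by a line with features='...'. The function name maps_queueing is hypothetical.

```shell
# maps_queueing (hypothetical helper): reads `multipath -ll` output on stdin
# and prints the name of every map whose features string contains
# queue_if_no_path, i.e. maps that queue I/O when no path is available.
maps_queueing() {
    awk '
        # Map header line: first field is the map name (alias or WWID).
        /^[^ |`]/ && / dm-[0-9]+ / { name = $1 }
        # Features line belonging to the most recent header.
        /features=/ && /queue_if_no_path/ { print name }
    '
}

# Usage on a live system (as root):
#   multipath -ll | maps_queueing
```

A map listed here will hold I/O rather than fail it when its last path goes down, for as long as its no_path_retry setting allows.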
