Multipath is not detecting path failures fast enough which results in application failure and system reboots
Environment
- Red Hat Enterprise Linux (RHEL) 6, 7, 8
- device-mapper-multipath
- Fibre Channel SAN storage
- Exceptions:
Issue
Several of our servers have devices that go offline, and multipath does not switch over to the alternate paths in a timely fashion. As a result, production applications become unavailable and, in some cases, the servers reboot.
Resolution
Add the following parameters to the defaults{} section in /etc/multipath.conf:
defaults {
polling_interval 5
fast_io_fail_tmo 5
dev_loss_tmo 10
checker_timeout 15
}
Reload the multipathd service:
# service multipathd reload
On RHEL 7 and 8, systemctl reload multipathd is equivalent.
If the system boots from SAN, also rebuild the initramfs so that the updated multipath.conf settings are present at boot time.
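After the reload, it can be worth confirming that the new values are actually in effect. The sketch below is one possible way to do that and is not part of the original solution: defaults_value is a hypothetical helper name, and it assumes the configuration text has one parameter per line inside the defaults section, which is the usual layout.

```shell
#!/bin/bash
# defaults_value NAME
# Hypothetical helper: read multipath configuration text on stdin and print
# the value of parameter NAME from the defaults section only.
# Assumption: one "name value" pair per line inside "defaults { ... }".
defaults_value() {
    awk -v key="$1" '
        /^[[:space:]]*defaults[[:space:]]*[{]/ { in_defaults = 1; next }
        in_defaults && /^[[:space:]]*[}]/      { in_defaults = 0 }
        in_defaults && $1 == key {
            gsub(/"/, "", $2)   # strip quotes in case the value is quoted
            print $2
            exit
        }
    '
}

# Example against the file on disk:
#   defaults_value fast_io_fail_tmo < /etc/multipath.conf
# Example against the running daemon (RHEL 7 and 8 syntax; on RHEL 6 use
# multipathd -k"show config" instead):
#   multipathd show config | defaults_value dev_loss_tmo
```

Restricting the match to the defaults section matters because the devices section can carry its own values for the same parameter names.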
See the following articles for additional details on shortening failover time to surviving paths in a Fibre Channel environment:
- How to set dev_loss_tmo and fast_io_fail_tmo persistently, using a udev rule
- Is there a way to limit multipath failover times in order to avoid Oracle RAC cluster evictions?
To lengthen timeout failure to help prevent filesystems entering read-only mode:
Root Cause
There are multiple parameters that affect error detection and failover times:
- dev_loss_tmo (rport): the extended link timeout. It is the number of seconds the SCSI layer waits after a link-down event on a remote port before removing that port from the system.
- fast_io_fail_tmo (rport): how long I/O is queued and held while the rport is in the blocked state before it is failed.
- checker_timeout: the timeout, in seconds, to use for path checkers that issue SCSI commands with an explicit timeout. The default is taken from /sys/block/sd<x>/device/timeout.
- polling_interval: the interval between path checks, in seconds.
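Because fast_io_fail_tmo and dev_loss_tmo interact, a quick sanity check on a planned pair of values can catch mistakes before a reload. The multipath.conf man page states that fast_io_fail_tmo should be smaller than dev_loss_tmo, and the sketch below encodes only that rule. check_timeouts is a hypothetical helper name, and the special values off and infinity are treated as keywords, not numbers.

```shell
#!/bin/bash
# check_timeouts FAST_IO_FAIL_TMO DEV_LOSS_TMO
# Hypothetical helper: return 0 when the pair looks consistent, 1 otherwise.
check_timeouts() {
    ct_fast=$1
    ct_dev=$2

    # "off" disables fast I/O failure entirely, so there is nothing to compare.
    if [ "$ct_fast" = "off" ]; then
        echo "WARN: fast_io_fail_tmo is off"
        return 1
    fi

    # "infinity" is larger than any numeric fast_io_fail_tmo.
    if [ "$ct_dev" = "infinity" ]; then
        echo "OK: fast_io_fail_tmo=$ct_fast dev_loss_tmo=infinity"
        return 0
    fi

    # Numeric comparison; a non-numeric argument falls through to WARN.
    if [ "$ct_fast" -lt "$ct_dev" ] 2>/dev/null; then
        echo "OK: fast_io_fail_tmo=$ct_fast dev_loss_tmo=$ct_dev"
        return 0
    fi

    echo "WARN: fast_io_fail_tmo ($ct_fast) is not smaller than dev_loss_tmo ($ct_dev)"
    return 1
}

# Example with the values from the Resolution section:
#   check_timeouts 5 10
```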
Diagnostic Steps
- Default:
  # for f in /sys/class/fc_remote_ports/rport-*/fast_io_fail_tmo; do d=$(dirname $f); echo $(basename $d):$(cat $d/node_name):$(cat $f); done
  rport-8:0-0:0x50014380113622b0:off
  rport-8:0-1:0x50014380113622b0:off
  rport-8:0-2:0x50014380113622b0:off
  rport-8:0-3:0x50014380113622b0:off
  # for f in /sys/class/fc_remote_ports/rport-*/dev_loss_tmo; do d=$(dirname $f); echo $(basename $d):$(cat $d/node_name):$(cat $f); done
  rport-8:0-0:0x50014380113622b0:30
  rport-8:0-1:0x50014380113622b0:30
  rport-8:0-2:0x50014380113622b0:30
  rport-8:0-3:0x50014380113622b0:30
- Add the following parameters to the defaults section in /etc/multipath.conf:
  defaults {
      user_friendly_names yes
      fast_io_fail_tmo 5
      dev_loss_tmo 10
      no_path_retry fail
  }
- Reload multipathd:
  # service multipathd reload
- Modified:
  # for f in /sys/class/fc_remote_ports/rport-*/fast_io_fail_tmo; do d=$(dirname $f); echo $(basename $d):$(cat $d/node_name):$(cat $f); done
  rport-8:0-0:0x50014380113622b0:5
  rport-8:0-1:0x50014380113622b0:5
  rport-8:0-2:0x50014380113622b0:5
  rport-8:0-3:0x50014380113622b0:5
  # for f in /sys/class/fc_remote_ports/rport-*/dev_loss_tmo; do d=$(dirname $f); echo $(basename $d):$(cat $d/node_name):$(cat $f); done
  rport-8:0-0:0x50014380113622b0:10
  rport-8:0-1:0x50014380113622b0:10
  rport-8:0-2:0x50014380113622b0:10
  rport-8:0-3:0x50014380113622b0:10
- In addition, specific multipath.conf parameters control map behavior when all paths are lost:
  - no_path_retry: the number of retries before queueing is disabled, or fail for immediate failure (no queueing), or queue to never stop queueing. The default config entry is fail. However, a setting in the devices{} stanza overrides this.
  - flush_on_last_del: if set to yes, multipathd disables queueing when the last path to a device has been deleted. The default is no.
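The before and after rport checks in the steps above can be wrapped in a small function so that both values are shown side by side for each remote port. This is a sketch, not the article's own tooling: list_rport_timeouts is a hypothetical name, and the optional directory argument exists only so the function can be pointed at a copy of the sysfs tree.

```shell
#!/bin/bash
# list_rport_timeouts [DIR]
# Hypothetical helper: print fast_io_fail_tmo and dev_loss_tmo for every
# FC remote port found under DIR (default: /sys/class/fc_remote_ports).
list_rport_timeouts() {
    lrt_root=${1:-/sys/class/fc_remote_ports}
    for lrt_dir in "$lrt_root"/rport-*; do
        # Skip the unexpanded glob (no rports) and unreadable entries.
        [ -r "$lrt_dir/fast_io_fail_tmo" ] || continue
        [ -r "$lrt_dir/dev_loss_tmo" ] || continue
        printf '%s fast_io_fail_tmo=%s dev_loss_tmo=%s\n' \
            "$(basename "$lrt_dir")" \
            "$(cat "$lrt_dir/fast_io_fail_tmo")" \
            "$(cat "$lrt_dir/dev_loss_tmo")"
    done
}

# Example: run before and after the multipathd reload and compare.
#   list_rport_timeouts
```

Running it once before and once after the reload gives the same comparison as the Default and Modified listings, with one line per rport.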