RHV: Allow modifying sanlock timeouts with respect to IBM 2145 multi-site stretched storage.

Environment

  • Red Hat Virtualization 4.x
  • IBM 2145 storage.

Issue

  • Allow modifying sanlock timeouts with respect to IBM 2145 multi-site stretched storage.

Resolution

  • For RHV <= 4.4.3:
    It is advised to tune multipath to initiate failover from the SAN faster, so that I/O results are returned to sanlock sooner.
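
    As an illustration only, a multipath drop-in along the following lines fails queued I/O back to sanlock well within the default 80-second renewal timeout when all paths are down; the file name and the value of 4 retries (20 seconds with the default 5-second polling_interval) are assumptions to adapt to the environment, not tested recommendations:

    $ cat /etc/multipath/conf.d/FooIO-fast-failover.conf
    # Hypothetical example for faster failover on FooIO storage.

    overrides {
        # Queue I/O for 20 seconds when all paths fail
        # (no_path_retry * polling_interval == 4 * 5 == 20).
        no_path_retry 4
    }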

  • For RHV >= 4.4.4:
    vdsm-4.40.39 or later supports configuring the sanlock I/O timeout.

    Note that while setting custom parameters is supported, values other than the defaults have not yet been tested by Red Hat at scale in multi-site configurations.
    Before applying custom sanlock timeout tunings, please liaise with Red Hat Customer Support by opening a support case on the Customer Portal.

    How are multipath timeouts related to sanlock timeouts?

    For best results, you need to keep multipath and sanlock timeouts synchronized.

    If multipath uses a shorter timeout, an HA VM with a storage lease may pause before the lease expires. When the VM pauses, libvirt releases the storage lease, so when the lease expires sanlock will not terminate the HA VM. This delays starting the HA VM on another host.

    If multipath uses a longer timeout, I/O to storage continues to block even after the storage leases on that storage have expired. Processes may block on storage in the uninterruptible (D) state, which delays or fails vdsm API calls and internal flows.

    In the worst case, a process holding a storage lease cannot be terminated by sanlock within 60 seconds after the lease has expired, and the host watchdog reboots the host.

    Here are some possible combinations:

    effective timeout               80    120    160
    sanlock:io_timeout              10     15     20
    multipath/no_path_retry [1]     16     24     32

    [1] Using a 5-second polling_interval.
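
    The combinations above assume a 5-second polling_interval. As a quick sketch, the value currently in effect on a host can be read from the running multipathd configuration:

    $ multipathd show config | grep polling_interval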

    Configuring vdsm

    To configure sanlock to use a longer I/O timeout, we need to configure vdsm, since vdsm manages sanlock.

    For each host, install this vdsm configuration drop-in file:

    $ cat /etc/vdsm/vdsm.conf.d/99-FooIO.conf
    # Configuration for FooIO storage.
    
    [sanlock]
    # Set renewal timeout to 120 seconds
    # (8 * io_timeout == 120).
    io_timeout = 15
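
    Since sanlock applies io_timeout when a lockspace is acquired, the new value should take effect only the next time vdsm acquires its host id on the storage. A sketch of one way to do that, assuming the host can be taken out of service first:

    # With the host in Maintenance in the Administration Portal,
    # restart vdsm so the drop-in is read, then activate the host
    # again so the lockspace is re-acquired with the new timeout.
    $ systemctl restart vdsmd

    # Optionally confirm the lockspaces after activation:
    $ sanlock client status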
    

    Configuring multipath

    When using a longer sanlock:io_timeout in vdsm, we need to update multipath to use a larger no_path_retry value.

    For each host, install this multipath configuration drop-in file:

    $ cat /etc/multipath/conf.d/FooIO.conf
    # Configuration for FooIO storage.
    
    overrides {
        # Queue I/O for 120 seconds when all paths fail
        # (no_path_retry * polling_interval == 120).
        no_path_retry 24
    }
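
    To make multipathd pick up the drop-in without a reboot (a sketch of the usual flow; confirm it is appropriate for the host first), reload the configuration and verify the value now in effect:

    $ multipathd reconfigure
    $ multipathd show config | grep no_path_retry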
    

Root Cause

  • In multi-site stretched storage, when one controller shuts down, I/O failover can take quite some time, sometimes over 100 seconds, because the controllers to fail over to are in different geographic regions and may be connected through different fibre network providers.
  • Since the time to fail over (100 seconds or more) is greater than the sanlock lease renewal timeout (80 seconds with the default io_timeout of 10 seconds), sanlock starts killing the processes holding leases and may eventually reset the host.

Diagnostic Steps

  1. Sanlock read requests time out after one path of the LUN from the IBM 2145 storage fails.
May 30 00:01:08 host-1 multipathd: 360050768108200890800000000000c33: remaining active paths: 7
May 30 00:01:09 host-1 multipathd: checker failed path 8:64 in map 360050768108200890800000000000c33
May 30 00:01:09 host-1 multipathd: 360050768108200890800000000000c33: remaining active paths: 6

[...]

May 30 00:01:12 host-1 sanlock[2857]: 2020-05-30 00:01:12 2536547 [3880]: s4 delta_renew read timeout 10 sec offset 0 /dev/99a01e03-584c-4d97-9259-5bce901104f6/ids
May 30 00:01:12 host-1 sanlock[2857]: 2020-05-30 00:01:12 2536547 [3880]: s4 renewal error -202 delta_length 10 last_success 2536517
  2. Read AIO timeouts are reported by sanlock.
2020-05-30 00:01:12 2536547 [3880]: 99a01e03 aio timeout RD 0x7fe89c0008c0:0x7fe89c0008d0:0x7fe8adbd2000 ioto 10 to_count 1
2020-05-30 00:01:12 2536547 [3880]: s4 delta_renew read timeout 10 sec offset 0 /dev/99a01e03-584c-4d97-9259-5bce901104f6/ids
  3. The host fails to renew the delta lease for its host id on the sanlock lockspace.
May 30 00:01:12 host-1 sanlock[2857]: 2020-05-30 00:01:12 2536547 [3880]: s4 renewal error -202 delta_length 10 last_success 2536517
May 30 00:01:33 host-1 sanlock[2857]: 2020-05-30 00:01:33 2536568 [3880]: s4 renewal error -202 delta_length 20 last_success 2536517
May 30 00:02:25 host-1 sanlock[2857]: 2020-05-30 00:02:25 2536620 [3880]: s4 renewal error -202 delta_length 10 last_success 2536589
May 30 00:02:45 host-1 sanlock[2857]: 2020-05-30 00:02:45 2536640 [3880]: s4 renewal error -202 delta_length 20 last_success 2536589
May 30 00:03:06 host-1 sanlock[2857]: 2020-05-30 00:03:06 2536661 [3880]: s4 renewal error -202 delta_length 20 last_success 2536589
May 30 00:03:14 host-1 sanlock[2857]: 2020-05-30 00:03:14 2536669 [2857]: s4 check_our_lease failed 80
  4. Eventually the watchdog fires and resets the host.
May 30 00:03:04 host-1 wdmd[2882]: test warning now 2536659 ping 2536649 close 2007488 renewal 2536589 expire 2536669 client 2857 sanlock_99a01e03-584c-4d97-9259-5bce901104f6:6
May 30 00:03:04 host-1 wdmd[2882]: /dev/watchdog0 closed unclean

May 30 00:03:34 host-1 wdmd[2882]: test failed rem 30 now 2536689 ping 2536649 close 2536659 renewal 2536589 expire 2536669 client 2857 sanlock_99a01e03-584c-4d97-9259-5bce901104f6:6
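
A quick way to collect the relevant messages from a host showing this pattern (a sketch; on RHEL 8 based hosts the same entries can also be read with journalctl) is:

$ grep -E 'sanlock.*(delta_renew|renewal error|check_our_lease)|wdmd' /var/log/messages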
