NFSv4 server failover/reboot locks up VMs sensitive to I/O delays

Environment

Red Hat Enterprise Linux (RHEL) 7
NFSv4 server

Issue

When a redundant NFSv4 node is rebooted for maintenance, Input/Output (I/O) operations are blocked for 60 seconds (by default).
This delay is well above the I/O tolerance of many operating systems and leaves some virtual
machines (VMs) in an unrecoverable soft lockup state.
Rebooting the affected VMs is the only way to recover them, which is unacceptable.
The issue occurs with VMs that use the NFSv4 protocol and are sensitive to I/O delays or disruptions.

Resolution

Work with your storage vendor to obtain the appropriate resolution steps.
For servers running RHEL, refer to the KCS article "NFSv4 server restart causes long pause in NFS client".

Root Cause

Applications tolerate blocked I/O for only 20 seconds; if the failover takes longer, the applications fail.

Diagnostic Steps

Create a 1G file on your NFS share:

cd /var/lib/nova/mnt/50ebb5ce<LONG UUID> 9bee79
dd if=/dev/zero of=ioping1.tmp bs=1024k count=1026

Determine if the VM is using locks by running the command:
```
lslocks
```
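On a busy host the full `lslocks` listing can be long. As a minimal sketch, the helper below narrows it to locks held on files under one directory, such as the NFS mount point. The function name `locks_under` is hypothetical, and the filter assumes that paths contain no spaces and that `lslocks` can resolve the path of each lock.

```shell
# Hypothetical helper: show only the locks whose PATH lies under a directory.
# Assumes paths contain no spaces, so PATH is the last field on each line.
locks_under() {
    dir="$1"
    lslocks --noheadings --notruncate --output COMMAND,PID,TYPE,MODE,PATH |
        awk -v dir="$dir" 'index($NF, dir) == 1'
}
```

For example, `locks_under /var/lib/nova/mnt` prints nothing when no process holds a lock on a file below that directory.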

Invoke ioping on the 1G file and capture measurements:

For VMs not using locks:

ioping -Y -D -G -WWW -S 1g -s 10m -i 0.5 -k ioping1.tmp

For VMs using locks:

flock -x -e ioping2.tmp ioping -Y -D -G -WWW -S 1g -s 10m -i 0.5 -k ioping2.tmp
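A later step in this section greps a file named ioping2.log, but the article does not show how that log was written. One possible way, sketched here as an assumption and not as the original procedure, is to pipe the ioping output through `tee`. The wrapper name `run_ioping` and its arguments are hypothetical; the ioping and flock options are the ones used above.

```shell
# Hypothetical wrapper: run one of the ioping commands above and keep a copy
# of everything it prints.
#   $1 = target file, $2 = log file, $3 = "lock" to wrap the run in flock
run_ioping() {
    target="$1"; log="$2"; mode="${3:-nolock}"
    if [ "$mode" = "lock" ]; then
        flock -x -e "$target" \
            ioping -Y -D -G -WWW -S 1g -s 10m -i 0.5 -k "$target" 2>&1 | tee "$log"
    else
        ioping -Y -D -G -WWW -S 1g -s 10m -i 0.5 -k "$target" 2>&1 | tee "$log"
    fi
}
```

For example, `run_ioping ioping1.tmp ioping1.log` covers the case without locks, and `run_ioping ioping2.tmp ioping2.log lock` covers the case with locks.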

Run ioping over a stable network connection to the filer. A successful run returns a long stream of output such as the following.
Note that a stable 10 Gb network was used for this test.

10 MiB >>> ioping1.tmp (nfs4.example.com:/mypath): request=1 time=17.0 ms (warmup)
10 MiB <<< ioping1.tmp (nfs4.example.com:/mypath): request=2 time=14.1 ms
10 MiB >>> ioping1.tmp (nfs4.example.com:/mypath): request=3 time=16.1 ms
10 MiB <<< ioping1.tmp (nfs4.example.com:/mypath): request=4 time=39.0 ms

A node reboot (or software update) results in the following measurements:

10 MiB <<< ioping1.tmp (nfs4.example.com:/mypath): request=2004 time=60.0 s (slow)

and:

[stack@tenlab2-director 03201885]$ grep -v ms ioping2.log
10 MiB >>> ioping1.tmp (nfs4.example.com:/mypath): request=1957 time=59.5 s (slow)
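The `grep -v ms` filter above keeps every measurement that is not reported in milliseconds, whatever its length. To list only the pauses longer than a chosen limit, such as the 20 seconds given under Root Cause, a small awk filter can convert each `time=` value to seconds first. This is a minimal sketch: the function name `slow_requests` is hypothetical, and it assumes ioping reports times in ns, us, ms, s, or min.

```shell
# Hypothetical helper: print ioping log lines whose latency exceeds a limit.
#   $1 = limit in seconds, $2 = ioping log file
# Assumes each measurement appears as "time=<value> <unit>".
slow_requests() {
    limit="$1"; log="$2"
    awk -v limit="$limit" '
        match($0, /time=[0-9.]+ (ns|us|ms|s|min)/) {
            # Strip the leading "time=" and split into value and unit.
            split(substr($0, RSTART + 5, RLENGTH - 5), t, " ")
            secs = t[1] + 0
            if      (t[2] == "ns")  secs /= 1000000000
            else if (t[2] == "us")  secs /= 1000000
            else if (t[2] == "ms")  secs /= 1000
            else if (t[2] == "min") secs *= 60
            if (secs > limit + 0) print
        }' "$log"
}
```

For example, `slow_requests 20 ioping2.log` lists only the requests that were blocked for more than 20 seconds.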

When the process is broken down into two steps, the I/O pauses are shorter, and VMs sensitive to I/O delays or disruptions
survive without a soft lockup:

10 MiB <<< ioping1.tmp (nfs4.example.com:/mypath): request=1172 time=6.08 s (slow)
10 MiB <<< ioping1.tmp (nfs4.example.com:/mypath): request=1574 time=12.4 s (slow)

and:

10 MiB <<< ioping1.tmp (nfs4.example.com:/mypath): request=6106 time=2.14 s
10 MiB >>> ioping1.tmp (nfs4.example.com:/mypath): request=6139 time=6.71 s
10 MiB <<< ioping1.tmp (nfs4.example.com:/mypath): request=6302 time=13.4 s

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
