NFSv4 server restart causes a long pause on the NFS client when opening a file under the mount point


Environment

  • Red Hat Enterprise Linux 5, 6, 7, 8, 9
  • NFSv4.0

Issue

  • An NFSv4 server restart causes a long pause on the NFS client when trying to cat a text file under the mount point.
  • Set up a simple NFS export on a RHEL server:
/tmp    *(rw,no_root_squash,fsid=0)
  • Mount that export on another RHEL 6 server:
# mount -t nfs4 x.x.x.x:/ /mnt/tmp
  • Run cat on a text file under the mount. If the NFS service on the exporting server is then restarted, there is a long pause the next time the same file is read with cat. It appears that the NFS client fails to renew its session and is forced to wait out the 90-second grace period.

Resolution

To decrease the grace period, follow the steps below for your type of environment.

Red Hat Enterprise Linux 5 or later

  • Stop the NFS service, write the new value to the kernel parameter files shown below, then start the service again.

  • Note: The lease time should be set to the same value as the grace period. Because the lease time governs the interval at which the client and server exchange file-lock state requests, lowering it can increase network traffic.

# service nfs stop
# echo 10 > /proc/sys/fs/nfs/nlm_grace_period
# echo 10 > /proc/fs/nfsd/nfsv4gracetime
# echo 10 > /proc/fs/nfsd/nfsv4leasetime
# service nfs start
  • The value 10 (meaning 10 seconds) is only an example.
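
After the service is started again, it can be useful to confirm that the grace and lease values match, as the note above recommends. The following is a minimal sketch, not part of the original procedure: check_nfsd_timers is a hypothetical helper name, and its optional directory argument is an assumption added only so the function can be tried against a scratch directory instead of /proc/fs/nfsd.

```shell
# Hypothetical helper: print the NFSv4 grace and lease times and
# return 0 only when they are equal.
# $1 (optional): directory holding the two files, default /proc/fs/nfsd.
check_nfsd_timers() {
    dir=${1:-/proc/fs/nfsd}
    grace=$(cat "$dir/nfsv4gracetime") || return 2
    lease=$(cat "$dir/nfsv4leasetime") || return 2
    echo "grace=$grace lease=$lease"
    [ "$grace" -eq "$lease" ]
}
```

On the NFS server, running check_nfsd_timers with no argument reads the live values; a non-zero exit status indicates that the two values differ or could not be read.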

Red Hat Cluster Suite and clusters based on Red Hat Enterprise Linux 6 with rgmanager

  • For the NFSv4 lease and grace times to take effect on failover, the grace time must be applied when the NFS service starts.
  • Edit /etc/sysconfig/nfs and add the following option:
NFSD_V4_GRACE=10
  • Note: Although this is the correct configuration for changing the NFSv4 grace time at service start, two open bugs currently report that the option has no effect when applied (tracked via Bugzilla #1063087 and #1063088). The cause is that nfsd enforces a grace time greater than or equal to the lease time and will adjust the value accordingly over time, while the init script that starts nfsd writes only to nfsv4gracetime, not to nfsv4leasetime.

  • To work around this behavior, manually change the /etc/rc.d/init.d/nfs init script from:

        # Set v4 grace period if requested
        [ -n "$NFSD_V4_GRACE" ] && {
                echo "$NFSD_V4_GRACE" > /proc/fs/nfsd/nfsv4gracetime
        }

to:

        # Set v4 grace period if requested
        [ -n "$NFSD_V4_GRACE" ] && {
                echo "$NFSD_V4_GRACE" > /proc/fs/nfsd/nfsv4leasetime
                echo "$NFSD_V4_GRACE" > /proc/fs/nfsd/nfsv4gracetime
        }

Clusters based on Red Hat Enterprise Linux 6 with pacemaker and Red Hat Enterprise Linux 7, 8, 9

  • In pacemaker-based clusters, the ocf:heartbeat:nfsserver resource is used. Any custom parameters that would otherwise go in /etc/sysconfig/nfs, such as the grace time, should be set in the configuration of the nfsserver resource itself, because in pacemaker-based clusters the NFSD configuration file /etc/sysconfig/nfs is generated dynamically from the cluster-wide configuration (see CIB, the Cluster Information Base).
# pcs resource update your_nfs_server nfsd_args="--grace-time 10"
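
Since the lease time should match the grace period, the argument string passed through nfsd_args may need to carry both values. The following is a rough sketch under assumptions: nfsd_timer_args is a hypothetical helper name, and it assumes that the installed rpc.nfsd accepts --lease-time alongside --grace-time (check the rpc.nfsd man page for your nfs-utils version before relying on it).

```shell
# Hypothetical helper: build an nfsd_args value that sets the grace
# time and the lease time to the same number of seconds.
# Assumes rpc.nfsd supports both --grace-time and --lease-time.
nfsd_timer_args() {
    secs=$1
    case $secs in
        ''|*[!0-9]*)
            echo "seconds must be a non-negative integer" >&2
            return 1
            ;;
    esac
    printf '%s\n' "--grace-time $secs --lease-time $secs"
}

# Possible use with the resource name from the example above:
#   pcs resource update your_nfs_server nfsd_args="$(nfsd_timer_args 10)"
```

The helper only prints the argument string; the pcs command is left as a comment so that nothing is changed in the cluster until it is run deliberately.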
From the description of the nfsserver resource:

# pcs resource describe nfsserver
Assumed agent name 'ocf:heartbeat:nfsserver' (deduced from 'nfsserver')
ocf:heartbeat:nfsserver - Manages an NFS server

Nfsserver helps to manage the Linux nfs server as a failover-able resource in Linux-HA.
It depends on Linux specific NFS implementation details, so is considered not portable to other platforms yet.

Resource options:
  nfs_init_script: The default init script shipped with the Linux distro. The nfsserver resource agent offloads the start/stop/monitor work to the init script because the procedure to start/stop/monitor nfsserver varies on different Linux distro. In the event that this
                   option is not set, this agent will attempt to use an init script at this location, /etc/init.d/nfs, or detect a systemd unit-file to use in the event that no init script is detected.
  nfs_no_notify: Do not send reboot notifications to NFSv3 clients during server startup.
  nfs_notify_foreground: Keeps the sm-notify attached to its controlling terminal and running in the foreground.
  nfs_smnotify_retry_time: Specifies the length of sm-notify retry time, in minutes, to continue retrying notifications to unresponsive hosts. If this option is not specified, sm-notify attempts to send notifications for 15 minutes. Specifying a value of 0 causes sm-notify
                           to continue sending notifications to unresponsive peers until it is manually killed.
  nfs_ip: Comma separated list of floating IP addresses used to access the nfs service
  nfsd_args: Specifies what arguments to pass to the nfs daemon on startup. View the rpc.nfsd man page for information on what arguments are available. Note that setting this value will override all settings placed in the local /etc/sysconfig/nfs file.
  lockd_udp_port: The udp port lockd should listen on. Note that setting this value will override all settings placed in the local /etc/sysconfig/nfs file.
  lockd_tcp_port: The tcp port lockd should listen on. Note that setting this value will override all settings placed in the local /etc/sysconfig/nfs file.
  statd_outgoing_port: The source port number sm-notify uses when sending reboot notifications. Note that setting this value will override all settings placed in the local /etc/sysconfig/nfs file.
  statd_port: The port number used for RPC listener sockets. Note that setting this value will override all settings placed in the local /etc/sysconfig/nfs file.
  mountd_port: The port number used for rpc.mountd listener sockets. Note that setting this value will override all settings placed in the local /etc/sysconfig/nfs file.
  rquotad_port: The port number used for rpc.rquotad. Note that setting this value will override all settings placed in the local /etc/sysconfig/nfs file.
  nfs_shared_infodir: The nfsserver resource agent will save nfs related information in this specific directory. And this directory must be able to fail-over before nfsserver itself.
  rpcpipefs_dir: The mount point for the sunrpc file system. Default is /var/lib/nfs/rpc_pipefs. This script will mount (bind) nfs_shared_infodir on /var/lib/nfs/ (cannot be changed), and this script will mount the sunrpc file system on /var/lib/nfs/rpc_pipefs (default, can
                 be changed by this parameter). If you want to move only rpc_pipefs/ (e.g. to keep rpc_pipefs/ local) from default, please set this value.

Default operations:
  start: interval=0s timeout=40
  stop: interval=0s timeout=20s
  monitor: interval=10 timeout=20s

Root Cause

  • The delay is caused by the 90-second grace period.
  • The purpose of the grace period is to give clients enough time to notice that the server has rebooted and to reclaim their existing locks without the danger of another client stealing them. It is a strongly recommended safeguard against data corruption in any mailbox, database, log file, or other data that relies on those locks. The NFSv4 RFC says:

    During the grace period, the server must reject READ and WRITE operations and non-reclaim locking requests (i.e., other LOCK and OPEN operations) with an error of NFS4ERR_GRACE.

  • After an NFS4ERR_GRACE reply, the client retries at intervals of 0.1*2^n seconds (capped at 15 seconds). Because a retry does not necessarily coincide with the end of the grace period, clients may need to wait for more than 90 seconds.

  • NFSv4.1 defines the RECLAIM_COMPLETE operation, with which a client notifies the server that it has finished reclaiming. Once all NFS clients have sent RECLAIM_COMPLETE, the server no longer delays its responses. The NFSv4.1 RFC says:

    A RECLAIM_COMPLETE operation is used to indicate that the client has reclaimed all of the locking state that it will recover, when it is recovering state due to either a server restart or the transfer of a file system to another server.
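
The effect of the 0.1*2^n retry interval can be illustrated with a small calculation. The sketch below is a simplified model, not the client's actual code: it assumes the first request is rejected at the instant the grace period begins, that each retry is answered immediately, and that the 15-second maximum is applied as a simple cap. grace_wait_ms is a hypothetical helper name.

```shell
# Hypothetical helper: model how long a client keeps retrying (in ms)
# before a retry lands after the end of the grace period.
# $1: grace period in milliseconds.
grace_wait_ms() {
    grace_ms=$1
    cap_ms=15000     # "max: 15" seconds
    delay_ms=100     # 0.1 * 2^0 seconds
    elapsed_ms=0
    while [ "$elapsed_ms" -lt "$grace_ms" ]; do
        elapsed_ms=$((elapsed_ms + delay_ms))   # wait, then retry
        delay_ms=$((delay_ms * 2))              # double the next interval
        if [ "$delay_ms" -gt "$cap_ms" ]; then
            delay_ms=$cap_ms
        fi
    done
    echo "$elapsed_ms"
}

grace_wait_ms 90000   # prints 100500
grace_wait_ms 10000   # prints 12700
```

Under these assumptions, a 90-second grace period produces a pause of about 100.5 seconds, and a 10-second grace period produces a pause of about 12.7 seconds, because the first retry that succeeds is the first one sent after the grace period has ended.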

Diagnostic Steps

  • Capture a tcpdump on the NFS client using the command:
 # tcpdump -s0 -i INTERFACE host NFS.SERVER.IP -w /tmp/tcpdump.pcap 
  • Where 'INTERFACE' is the ethernet interface that communicates with the NFS server.

  • Open the capture file in Wireshark and look for NFS4ERR_GRACE replies to outgoing OPEN calls:

 55 2014-11-08 15:43:21.569853    10.12.13.14 -> 10.12.13.25    NFS 330  V4 Call OPEN DH: 0x1178f166/foo
 56 2014-11-08 15:43:21.569895    10.12.13.14 -> 10.12.13.25   NFS 122  V4 Reply (Call In 55) OPEN Status: NFS4ERR_GRACE
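
When a graphical Wireshark session is not convenient, the same replies can be counted from decoded text output. This is a minimal sketch that assumes the decoded lines contain the string "Status: NFS4ERR_GRACE" as in the sample above; count_grace_replies is a hypothetical helper name, and the tshark invocation in the comment assumes tshark is installed.

```shell
# Hypothetical helper: count NFS4ERR_GRACE replies in decoded packet
# summaries read from standard input.
count_grace_replies() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c 'Status: NFS4ERR_GRACE' || true
}

# Possible use against the capture taken above:
#   tshark -r /tmp/tcpdump.pcap -Y nfs | count_grace_replies
```

A count that keeps growing for roughly the length of the grace period after the server restart is consistent with the behavior described in this article.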

