RHEL mount hangs: nfs: server [...] not responding, still trying

Environment

  • Red Hat Enterprise Linux 5, 6, 7, 8, 9
  • NFS Client (nfs-utils package)

Issue

  • NFS shares hang with the following error(s) in /var/log/messages:

    kernel: nfs: server <servername> not responding, still trying
    
    kernel: nfs: server <servername> not responding, timed out
    

Resolution

The resolution for this issue will vary depending on whether the root cause is:

  • Problem between the NFS Client and Server
  • Problem on the NFS Server
  • Problem on the NFS Client

Investigation will be required on both NFS Client and NFS Server.

For non-Red Hat NFS Clients or Servers, engage the vendor of the non-Red Hat system. Investigate connectivity issues such as network link down, network packet loss, system hang, NFS client/server hang or slowness, storage hang or slowness.

The team responsible for the network between NFS Client and NFS Server should be engaged to investigate connectivity and capacity issues.

Root Cause

Explanation of the Message

  • If the NFS client does not receive a response from the NFS server, the "server ... not responding, still trying" message may appear in syslog.
  • Each message indicates that one NFS/RPC request (for example, one NFS WRITE) has been sent retrans times and has timed out each time. With the default options (timeo=600, expressed in tenths of a second, and retrans=2), the message is printed after 60 + 120 = 180 seconds. For more information, see the retrans and timeo options in the NFS manual page (man nfs).
  • NOTE: Setting the timeo NFS mount option much lower than the default of 600 may increase the likelihood and frequency of this message. For example, setting timeo=5 with the default retrans=2 causes this message to be printed if the NFS server takes longer than 0.5 + 1.0 = 1.5 seconds to respond to any NFS request. Under a heavy NFS workload, it is not unusual for an NFS server to take longer than 1.5 seconds to respond to one or more NFS requests.
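
The arithmetic in the two examples above can be sketched as a small shell function. This is a minimal sketch, not a Red Hat tool: the function name nfs_msg_delay_ds is hypothetical, and it assumes the linear backoff implied by the 0.5 + 1.0 example (each attempt waits one more timeo interval than the previous one). Check man nfs for the behaviour of your transport and kernel version.

```shell
# Hypothetical helper: tenths of a second that pass before the
# "not responding" message is logged, assuming linear backoff.
# $1 = timeo (tenths of a second), $2 = retrans
# The body runs in a subshell "( ... )" so its variables stay local.
nfs_msg_delay_ds() (
    timeo=$1
    retrans=$2
    total=0
    i=1
    while [ "$i" -le "$retrans" ]; do
        # Attempt number i waits i * timeo before it is considered timed out.
        total=$((total + i * timeo))
        i=$((i + 1))
    done
    echo "$total"
)

nfs_msg_delay_ds 600 2   # defaults: prints 1800 (180 seconds)
nfs_msg_delay_ds 5 2     # timeo=5:  prints 15 (1.5 seconds)
```

The result is in tenths of a second, the same unit as timeo, which avoids fractional arithmetic in the shell.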

Categories of Root Causes

There are 3 possible categories of root causes:

  • Problem between the NFS Client and Server
  • Problem on the NFS Server
  • Problem on the NFS Client

Within each category, there are specific instances given below.

Problem between the NFS Client and NFS Server

For example, overloaded, mis-configured, or malfunctioning switches, firewalls, or networks may cause NFS requests to get dropped or mangled between the NFS Client and NFS Server.

Some specific instances have been:

Problem on the NFS Server

For example, the NFS server is overloaded or contains a hardware or software bug which causes it to drop NFS requests.

Some specific instances have been:

  • Non-Red Hat NFS Server: A problem with the disk configuration at storage pool level. NFS Server vendor: "Specifically, we think that the lack of free space in the pool plus the somewhat random nature of the files to access makes auto-tiering fail on relocation operations."
  • Non-Red Hat NFS Server: A TCP performance issue when certain conditions were met, fixed by a specific patch
  • Non-Red Hat NFS Server: A configuration issue caused data to be sent through the wrong network interface
  • Red Hat NFS Server: Thread count may be too low on the NFS server. For more information on this, see "How do I increase the number of threads created by the NFS daemon in RHEL 4, 5 and 6?"
  • Red Hat NFS Server: Three different bugs, and when all were present, a complete DoS of the NFS Server occurred: https://access.redhat.com/solutions/544553
  • RHEL7 NFS client or server under heavy load with certain NICs and jumbo frames may silently drop packets due to default / too low min_free_kbytes setting: https://access.redhat.com/solutions/4085851

Problem on the NFS Client

For example, a networking misconfiguration on the NFS Client, a NIC driver or firmware bug that causes NFS requests to be dropped, or an NFS Client firewall that does not allow NFS traffic in or out.

Some specific instances have been:

Diagnostic Steps

Initial steps to rule out common problems

  • First, identify the timeframe of the problem. The beginning of the incident is the timestamp on the not responding, still trying message, adjusted backwards for the timeo and retrans values (see the Root Cause section about timeo and retrans).

The end of the incident is when you see an nfs: server ... OK message.

If there is no OK message then the problem is ongoing (the NFS Server still has not responded).

If there are multiple not responding messages, there may be multiple timeframes or you may need to adjust further.

For example:

# grep server.example.com /proc/mounts 
server.example.com:/export /mnt nfs4 rw,relatime,vers=4,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,minorversion=0,local_lock=none,addr=y.y.y.y 0 0
# grep "server.example.com" /var/log/messages
Sep 29 22:54:39 client kernel: nfs: server server.example.com  not responding, still trying
Sep 29 22:54:49 client kernel: nfs: server server.example.com OK

Since the default mount options are being used, the problem began 180 seconds before the not responding message. The problem ended when the 'OK' message was seen:

Sep 29 22:51:39 - Problem BEGIN:  adjusted start time of the problem based on 'timeo' and 'retrans'
Sep 29 22:54:39 - 'not responding, still trying' seen
Sep 29 22:54:49 - Problem END: 'OK' seen

The timeframe of the problem has now been determined.
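
The backwards adjustment shown above can also be done with the date command. This is a minimal sketch under stated assumptions: the function name incident_start is hypothetical, it requires GNU date (as shipped with RHEL), and because syslog timestamps carry no year, date assumes the current year.

```shell
# Hypothetical helper: subtract the timeo/retrans delay from the timestamp
# of the "not responding" message to estimate when the problem began.
# $1 = syslog timestamp (no year), $2 = delay in seconds
# The body runs in a subshell "( ... )" so its variables stay local.
incident_start() (
    # Convert the log timestamp to seconds since the epoch (GNU date -d).
    msg_epoch=$(LC_ALL=C date -d "$1" +%s) || exit 1
    # Step back by the delay and print in the same format syslog uses.
    LC_ALL=C date -d "@$((msg_epoch - $2))" '+%b %e %H:%M:%S'
)

incident_start "Sep 29 22:54:39" 180   # prints: Sep 29 22:51:39
```

Run it in the same time zone as the system that wrote the log so the result lines up with other timestamps from that system.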

  • On the NFS Server, check any logs for signs of performance issues during the timeframe(s) identified. For non-Red Hat NFS servers, engage your NFS Server vendor and give them the timeframe of the problem to investigate.

  • On the NFS Client and NFS Server, check if there are problems with the network interface and/or network. For example:

    • Look for dropped packets in ip -s link and/or ethtool output. For one such possibility, see: System dropping packets due to rx_fw_discards
    • The xsos tool can also be used to look for packet loss on network interfaces, see: How can I install xsos command in Red Hat Enterprise Linux?
    • Check the MTU settings on the NFS Client, the NFS Server, and throughout the path from the NFS Client to NFS Server. All systems must have the same MTU configured.
    • Look for evidence of packet loss outside the system by running netstat -s on RHEL. Counters under the TcpExt heading such as retransmits or congestion may indicate external packet loss. However, note these are system-wide TCP counters which have incremented since system boot, so errors may be related to other TCP connections and not the NFS connection.
    • If bonding is being used, and the NFS transport is TCP, check for an incorrect bonding mode, as described in What is the best bonding mode for TCP traffic such as NFS, ISCSI, CIFS, etc?
  • Identify any other NFS Client accessing the same NFS Server, especially any identical NFS Client (mounting the same exports, with the same mount options, the same Red Hat version, etc). Do any similar NFS Clients log the not responding messages in the same timeframe? If other NFS Clients are affected at the same time, this points to either an NFS Server issue or a networking/connectivity issue between the NFS Client and NFS Server.

  • Identify any network equipment such as routers, switches, or firewalls between the NFS Client and NFS Server. If possible, examine any logs or monitoring statistics (for example, Cacti or rrdtool) from these devices for the timeframe of the incident.
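
The interface checks in the list above can be partly scripted. The following is a minimal sketch, not an official Red Hat tool: the function name nic_counters and its optional second argument are hypothetical, and it assumes the standard Linux sysfs layout under /sys/class/net. It prints the configured MTU and the kernel's drop and error counters for one interface.

```shell
# Hypothetical helper: print MTU and drop/error counters for one interface.
# $1 = interface name, $2 = sysfs directory (defaults to /sys/class/net)
# The body runs in a subshell "( ... )" so its variables stay local.
nic_counters() (
    dir=${2:-/sys/class/net}/$1
    [ -d "$dir" ] || { echo "no such interface: $1" >&2; exit 1; }
    echo "mtu=$(cat "$dir/mtu")"
    # Counters accumulate from boot; compare two samples, not one value.
    for c in rx_dropped tx_dropped rx_errors tx_errors; do
        echo "$c=$(cat "$dir/statistics/$c")"
    done
)
```

For example, run nic_counters eth0 on both the NFS Client and the NFS Server before and after an incident, and compare the mtu values between the two systems. Driver-specific counters are not included here; those appear in ethtool -S output.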

Troubleshooting with packet captures

The goal of these steps is to isolate the problem into one of 3 categories:

  • Problem between the NFS Client and NFS Server
  • Problem on the NFS Server
  • Problem on the NFS Client

Once the problem is isolated, further troubleshooting is required to fix the problem, and is beyond the scope of this solution.

The most direct means of troubleshooting this issue requires at least packet captures from both the NFS Client and NFS Server perspectives. In some scenarios, it may be possible to diagnose with a packet capture on just one side, such as the NFS Client, but both sides are highly recommended.

NOTE: Any tcpdump capture should only contain packets involving the problematic NFS server. If using tcpdump, you can accomplish this by using the 'host' pcap-filter and providing the NFS server name or IP address from the "not responding" message. Failing to filter the packet capture to only the problematic NFS server is very likely to result in delays in root cause analysis. Example:

# grep "not responding" /var/log/messages
Sep 29 22:54:39 client kernel: nfs: server server.example.com  not responding, still trying
# tcpdump -i eth0 -s 0 -w /tmp/tcpdump.pcap host server.example.com

Gathering packet captures with tcpdump (Red Hat NFS Client or Server)

For generic steps on gathering a packet capture on any Red Hat NFS Server or NFS Client, see How to capture network packets with tcpdump?

For a simple way to gather a packet capture using the tcpdump tool on a RHEL NFS Client, use the tcpdump-watch.sh script on the following solution: https://access.redhat.com/articles/4330981#intermittent.

The script takes a single parameter, the NFS Server name or IP address, and watches /var/log/messages for the nfs: server ... not responding, still trying messages. When it sees the message, the tcpdump is stopped.

Please note: the default tcpdump arguments in the tcpdump-watch.sh script may work for many environments, but some environments may need slight changes. For example, if there are large NFS READs and WRITEs in the initial packet capture and/or a lot of packets are dropped by the tcpdump process, reduce the size of each captured packet to ~512 bytes with the "snaplen" parameter (-s 512). In addition, if the packet capture collects more than NFS traffic between the NFS Client and NFS Server, you may need to add one or more pcap-filters such as port 2049 to capture only traffic to/from the NFS port. For more information on pcap-filters, see the manual page man pcap-filter.
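
The two adjustments described above (a smaller snaplen and an added port filter) can be combined in one tcpdump invocation. This is a minimal sketch and is not taken from tcpdump-watch.sh: the wrapper name capture_nfs is hypothetical, and the interface name and output path are simply the ones used in the earlier example.

```shell
# Hypothetical wrapper around tcpdump for a reduced-size, NFS-only capture.
# $1 = NFS server name or IP address
# $2 = interface (default eth0)
# $3 = snaplen in bytes (default 512)
capture_nfs() {
    # "host ... and port 2049" keeps only traffic between this system and
    # the NFS server on the NFS port; -s limits the bytes saved per packet.
    tcpdump -i "${2:-eth0}" -s "${3:-512}" -w /tmp/tcpdump.pcap \
        host "$1" and port 2049
}
```

For example, capture_nfs server.example.com starts the capture with the defaults. Note that the port 2049 filter excludes other traffic to the same server, so drop it if you need to see everything exchanged with that host.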

Gathering packet captures on an NFS Server (non-Red Hat NFS Server)

Contact your NFS Server vendor for official steps on gathering a packet capture from the NFS Server, or use a port mirror to capture traffic from the NFS Server perspective.

Analysis of packet captures

Take the timeframe of the problem calculated in the initial steps and use Wireshark or tshark to inspect the packet capture files. Be sure to use the correct TZ shell variable when running Wireshark or tshark so the timestamps on the packets will line up with the timeframe of the problem. For more information on the TZ variable, see the section titled "Timestamps in packet traces and matching other event timestamps" in NFS packet trace analysis tips and tricks. Examine the packet captures for signs of network problems, such as retransmits/duplicates, TCP/IP handshake problems, delays in NFS RPC replies, etc.
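
As a starting point for the inspection described above, tshark can list only the packets that Wireshark's TCP analysis flags as retransmissions or duplicate ACKs. This is a minimal sketch, assuming tshark is installed: the names loss_filter and show_loss are hypothetical, and the filter covers only two of the signs mentioned above.

```shell
# Display filter matching common signs of packet loss on a TCP connection.
loss_filter='tcp.analysis.retransmission || tcp.analysis.duplicate_ack'

# Hypothetical helper.
# $1 = capture file
# $2 = time zone of the system whose log timestamps you are comparing
#      against (for example UTC)
show_loss() {
    # -t ad prints absolute date and time, so packets can be matched
    # against the timeframe calculated from /var/log/messages.
    TZ="$2" tshark -r "$1" -t ad -Y "$loss_filter"
}
```

For example, show_loss /tmp/tcpdump.pcap UTC prints the flagged packets with timestamps rendered in UTC. An empty result does not rule out a network problem; delays in NFS RPC replies, for instance, are not matched by this filter.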

For a few examples of common scenarios which may be seen in a tcpdump gathered on the NFS Client, see NFS client tcpdump analysis: 3 common failure scenarios.

Troubleshooting with vmcores

In general, a vmcore (copy of kernel memory created by causing a kernel panic) is not required to investigate a connectivity issue such as this.

Red Hat may request a vmcore from an NFS Client or NFS Server at a later date if it is believed there is a specific bug within RHEL, but a vmcore is not an initial or common troubleshooting step for this sort of issue.
