RHEL mount hangs: nfs: server [...] not responding, still trying
Environment
- Red Hat Enterprise Linux 5, 6, 7, 8, 9
- NFS Client (nfs-utils package)
Issue
- NFS shares hang with the following error(s) in /var/log/messages:
kernel: nfs: server <servername> not responding, still trying
kernel: nfs: server <servername> not responding, timed out
Resolution
The resolution for this issue will vary depending on whether the root cause is:
- Problem between the NFS Client and Server
- Problem on the NFS Server
- Problem on the NFS Client
Investigation will be required on both NFS Client and NFS Server.
For non-Red Hat NFS Clients or Servers, engage the vendor of the non-Red Hat system. Investigate connectivity issues such as network link down, network packet loss, system hang, NFS client/server hang or slowness, storage hang or slowness.
The team responsible for the network between NFS Client and NFS Server should be engaged to investigate connectivity and capacity issues.
Root Cause
Explanation of the Message
- If the NFS client does not receive a response from the NFS server, the "server ... not responding, still trying" message may appear in syslog.
- Each message indicates that one NFS/RPC request (for example, one NFS WRITE) has been sent retrans times and timed out each time. With the default options of retrans and timeo, this message will be printed after 180 seconds. For more information, see the retrans and timeo options in the NFS manual page ('man nfs').
- NOTE: A very low value for the timeo NFS mount option, which is much less than the default of 600, may increase the likelihood and frequency of this message. The timeo value is expressed in tenths of a second. For example, setting timeo=5 with the default retrans=2 will cause this message to be printed if the NFS server takes longer than 0.5 + 1.0 = 1.5 seconds to respond to any NFS request. Under a heavy NFS workload, it is not unusual for an NFS server to take longer than 1.5 seconds to respond to one or more NFS requests. For more information on timeo and retrans, see the NFS manual page ('man nfs').
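The arithmetic behind both figures (180 seconds and 1.5 seconds) can be sketched as a small shell function. This is only an illustration: it assumes a TCP mount with the linear backoff described in 'man nfs', where each retransmission extends the wait by another timeo and a single wait is capped at 600 seconds. The function name nfs_msg_delay is hypothetical and is not part of nfs-utils.

```shell
#!/bin/bash
# Hypothetical helper (not part of nfs-utils): estimate how long an
# unanswered request waits before "not responding, still trying" is logged.
# Assumes a TCP mount with linear backoff: attempt i waits i * timeo
# deciseconds, and a single wait never exceeds 6000 deciseconds (600 s).
nfs_msg_delay() {
    local timeo=${1:-600} retrans=${2:-2}
    local total=0 step i
    for ((i = 1; i <= retrans; i++)); do
        step=$((i * timeo))
        if ((step > 6000)); then step=6000; fi
        total=$((total + step))
    done
    # Print seconds with one decimal place (timeo is in tenths of a second).
    printf '%d.%d\n' $((total / 10)) $((total % 10))
}

nfs_msg_delay 600 2   # default options: prints 180.0
nfs_msg_delay 5 2     # the timeo=5 example: prints 1.5
```

The timeo and retrans values actually in effect for a mount can be read from /proc/mounts and passed in as the two arguments.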
Categories of Root Causes
There are 3 possible categories of root causes:
- Problem between the NFS Client and Server
- Problem on the NFS Server
- Problem on the NFS Client
Within each category, there are specific instances given below.
Problem between the NFS Client and NFS Server
For example, overloaded, mis-configured, or malfunctioning switches, firewalls, or networks may cause NFS requests to get dropped or mangled between the NFS Client and NFS Server.
Some specific instances have been:
- A damaged security appliance mangling packets between the NFS Client and NFS Server: https://access.redhat.com/solutions/1122483
- The port-channel aka EtherChannel aka bonding configuration on the switch was incorrect: https://access.redhat.com/solutions/190183
- A second system on the network had duplicated the IP address of the NFS Server
- The switch was dropping TCP SYN,ACK packets: https://access.redhat.com/solutions/1262663
- Issue was with a Riverbed WAN optimizer device
- Cisco ASA between NFS Server and NFS Clients could not handle wrap of TCP Sequence number: https://access.redhat.com/solutions/2778561
A problem on the NFS Server
For example, the NFS server is overloaded or contains a hardware or software bug which causes it to drop NFS requests.
Some specific instances have been:
- Non-Red Hat NFS Server: A problem with the disk configuration at storage pool level. NFS Server vendor: "Specifically, we think that the lack of free space in the pool plus the somewhat random nature of the files to access makes auto-tiering fail on relocation operations."
- Non-Red Hat NFS Server: A TCP performance issue when certain conditions were met, fixed by a specific patch
- Non-Red Hat NFS Server: A configuration issue caused data to be sent through the wrong network interface
- Red Hat NFS Server: Thread count may be too low on the NFS server. For more information on this, see "How do I increase the number of threads created by the NFS daemon in RHEL 4, 5 and 6?"
- Red Hat NFS Server: Three different bugs, and when all were present, a complete DoS of the NFS Server occurred: https://access.redhat.com/solutions/544553
- RHEL7 NFS client or server under heavy load with certain NICs and jumbo frames may silently drop packets due to default / too low min_free_kbytes setting: https://access.redhat.com/solutions/4085851
A problem on the NFS Client
For example, a networking misconfiguration on the NFS Client, a NIC driver or firmware bug causing NFS requests to be dropped, or an NFS Client firewall not allowing NFS traffic in or out.
Some specific instances have been:
- An incorrect MTU (network) setting on the client causing timeouts (and a watchdog reboot)
- Jumbo packets (MTU=9000) selected on one system, but not across the rest of the network
- An incorrect/inefficient bonding mode is in use: "What is the best bonding mode for TCP traffic such as NFS, ISCSI, CIFS, etc?"
- The net.ipv4.tcp_frto setting may trigger this issue: https://access.redhat.com/solutions/1531943
- An NFS client kernel regression that caused the RPC layer to become non-functional. For more information, see "RHEL6.7.z: NFS client with kernels 2.6.32-573.10.2.el6 or above hangs with 'not responding, still trying' messages and running processes in _spin_lock"
- Possible regression in RHEL6.9 kernels involving an NFS client's sunrpc TCP port re-use logic as detailed in https://access.redhat.com/solutions/3018371
- RHEL7.6: NFSv3 client hangs after the 5 minute idle timer drops the TCP connection and a subsequent TCP 3-way handshake fails due to a duplicate SYN or unexpected RST from the NFS client, as described in https://access.redhat.com/solutions/3765711
- RHEL7 NFS client or server under heavy load with certain NICs and jumbo frames may silently drop packets due to default / too low min_free_kbytes setting: https://access.redhat.com/solutions/4085851
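An MTU mismatch like the ones listed above can often be spotted with a ping that forbids fragmentation and is sized to the expected MTU. The following is a minimal sketch, assuming IPv4 and the iputils ping shipped with RHEL. The function name icmp_payload_for_mtu, the interface name eth0, and the use of server.example.com are illustrative only.

```shell
#!/bin/bash
# Hypothetical helper: largest ICMP echo payload that fits in a single
# IPv4 packet of the given MTU (20-byte IP header + 8-byte ICMP header).
icmp_payload_for_mtu() {
    local mtu=$1
    echo $((mtu - 28))
}

# Show the MTU configured on the local interface (eth0 is an assumption):
#   ip link show dev eth0
# Send 3 pings with "don't fragment" set. Errors or missing replies at the
# jumbo size, while smaller sizes succeed, suggest a smaller MTU on the path:
#   ping -M do -c 3 -s "$(icmp_payload_for_mtu 9000)" server.example.com

icmp_payload_for_mtu 9000   # prints 8972
icmp_payload_for_mtu 1500   # prints 1472
```

Running the same check from both the NFS Client and the NFS Server helps show whether the mismatch is on one host or somewhere along the path.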
Diagnostic Steps
Initial steps to rule out common problems
- First, identify the timeframe of the problem. The beginning of the incident is the timestamp on the "not responding, still trying" message, adjusted backwards for the timeo and retrans values (see the Root Cause section about timeo and retrans).
The end of the incident is when you see an "nfs: server ... OK" message.
If there is no OK message then the problem is ongoing (the NFS Server still has not responded).
If there are multiple "not responding" messages, there may be multiple timeframes or you may need to adjust further.
For example:
# grep server.example.com /proc/mounts
server.example.com:/export /mnt nfs4 rw,relatime,vers=4,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,minorversion=0,local_lock=none,addr=y.y.y.y 0 0
# grep "server.example.com" /var/log/messages
Sep 29 22:54:39 client kernel: nfs: server server.example.com not responding, still trying
Sep 29 22:54:49 client kernel: nfs: server server.example.com OK
Since the default mount options are being used, the problem began 180 seconds before the "not responding" message. The problem ended when the 'OK' message was seen:
Sep 29 22:51:39 - Problem BEGIN: adjusted start time of the problem based on 'timeo' and 'retrans'
Sep 29 22:54:39 - 'not responding, still trying' seen
Sep 29 22:54:49 - Problem END: 'OK' seen
The timeframe of the problem has now been determined.
- On the NFS Server, check any logs for signs of performance issues during the timeframe(s) identified. For non-Red Hat NFS servers, engage your NFS Server vendor and give them the timeframe of the problem to investigate.
- On the NFS Client and NFS Server, check if there are problems with the network interface and/or network. For example:
  - Look for dropped packets in ip -s link and/or ethtool output. For one such possibility, see: "System dropping packets due to rx_fw_discards"
  - The xsos tool can also be used to look for packet loss on network interfaces, see: "How can I install xsos command in Red Hat Enterprise Linux?"
  - Check the MTU settings on the NFS Client, the NFS Server, and throughout the path from the NFS Client to NFS Server. All systems must have the same MTU configured.
  - Look for evidence of packet loss outside the system by running netstat -s on RHEL. Counters under the TcpExt heading such as retransmits or congestion may indicate external packet loss. However, note these are system-wide TCP counters which have incremented since system boot, so errors may be related to other TCP connections and not the NFS connection.
  - If bonding is being used, and the NFS transport is TCP, check for an incorrect bonding mode, as described in "What is the best bonding mode for TCP traffic such as NFS, ISCSI, CIFS, etc?"
- Identify any other NFS Client accessing the same NFS Server, especially any identical NFS Client (mounting same exports, same mount options, same Red Hat version, etc). Do any similar NFS Clients experience the "not responding" messages in the same timeframe? If other NFS Clients are affected, this lends credence to either an NFS Server issue or a networking/connectivity issue between the NFS Client and NFS Server.
- Identify any network equipment such as routers, switches, or firewalls between the NFS Client and NFS Server. If possible, examine any logs or monitoring statistics (e.g. Cacti, rrdtool) from these devices at the timeframe of the incident.
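The first and third steps above are mechanical enough to sketch in shell. The helpers below are hypothetical (their names are not part of any RHEL package) and assume GNU date and awk as shipped with RHEL. The interface name eth0 is also an assumption.

```shell
#!/bin/bash
# Hypothetical helpers for the timeframe and interface-counter steps.

# Print the value of one mount option (e.g. timeo) from a /proc/mounts line.
mount_opt() {
    local line=$1 name=$2
    printf '%s\n' "$line" | tr ', ' '\n\n' | sed -n "s/^${name}=//p" | head -n 1
}

# Subtract a delay in whole seconds from a syslog timestamp. Syslog omits
# the year, so GNU date assumes the current year.
incident_start() {
    local stamp=$1 delay=$2 epoch
    epoch=$(date -d "$stamp" +%s) || return 1
    date -d "@$((epoch - delay))" '+%b %e %H:%M:%S'
}

# Keep only non-zero drop/discard/error counters from "name: value" lines,
# the layout printed by 'ethtool -S <interface>'.
nonzero_drop_counters() {
    awk -F: 'tolower($1) ~ /drop|discard|err/ && $2 + 0 > 0 {
        gsub(/^[ \t]+/, ""); print
    }'
}

# Example usage with the mount line and log timestamp shown earlier:
#   line=$(grep server.example.com /proc/mounts)
#   mount_opt "$line" timeo
#   mount_opt "$line" retrans
#   incident_start "Sep 29 22:54:39" 180
#   ethtool -S eth0 | nonzero_drop_counters
#   ip -s link show dev eth0
```

Counter names differ between NIC drivers, so the pattern in nonzero_drop_counters is a starting point rather than a complete list.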
Troubleshooting with packet captures
The goal of these steps is to isolate the problem into one of 3 categories:
- Problem between the NFS Client and NFS Server
- Problem on the NFS Server
- Problem on the NFS Client
Once the problem is isolated, further troubleshooting is required to fix the problem, and is beyond the scope of this solution.
The most direct means of troubleshooting this issue requires at least packet captures from both the NFS Client and NFS Server perspectives. In some scenarios, it may be possible to diagnose with a packet capture on just one side, such as the NFS Client, but both sides are highly recommended.
NOTE: Any tcpdump capture should only contain packets involving the problematic NFS server. If using tcpdump, you can accomplish this by using the 'host' pcap-filter and providing the NFS server name or IP address from the "not responding" message. Failing to filter the packet capture to only the problematic NFS server is very likely to result in delays in root cause analysis. Example:
# grep "not responding" /var/log/messages
Sep 29 22:54:39 client kernel: nfs: server server.example.com not responding, still trying
# tcpdump -i eth0 -s 0 -w /tmp/tcpdump.pcap host server.example.com
Gathering packet captures with tcpdump (Red Hat NFS Client or Server)
For generic steps on gathering a packet capture on any Red Hat NFS Server or NFS Client, see How to capture network packets with tcpdump?
For a simple way to gather a packet capture using the tcpdump tool on a RHEL NFS Client, use the tcpdump-watch.sh script from the following article: https://access.redhat.com/articles/4330981#intermittent.
The script takes a single parameter, the NFS Server name or IP address, and watches /var/log/messages for the "nfs: server ... not responding, still trying" messages. When it sees the message, the tcpdump is stopped.
Please note: the default tcpdump arguments in the tcpdump-watch.sh script may work for many environments, but some environments may need slight changes. For example, if there are large NFS READs and WRITEs in the initial packet capture and/or a lot of packets are dropped by the tcpdump process, reduce the size of each captured packet to ~512 bytes with the "snaplen" parameter (-s 512). In addition, if the packet capture collects more than NFS traffic between the NFS Client and NFS Server, you may need to add one or more pcap-filters such as port 2049 to capture only traffic to/from the NFS port. For more information on pcap-filters, see the manual page man pcap-filter.
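The two adjustments described above (a smaller snaplen and a port filter) can be combined with the 'host' filter from the earlier example. The following is a sketch only. The function name nfs_capture_filter, the interface eth0, and the output path are assumptions, and it is not taken from the tcpdump-watch.sh script.

```shell
#!/bin/bash
# Hypothetical helper: build a pcap-filter expression limited to traffic
# on the NFS port (2049) to or from one server.
nfs_capture_filter() {
    local server=$1
    printf 'host %s and port 2049\n' "$server"
}

# Example (eth0 and the output path are assumptions):
#   tcpdump -i eth0 -s 512 -w /tmp/tcpdump.pcap \
#       $(nfs_capture_filter server.example.com)

nfs_capture_filter server.example.com   # prints: host server.example.com and port 2049
```

Note that filtering on port 2049 alone excludes the NFSv3 side protocols such as MOUNT and NLM, which use other ports. If those are of interest, keep only the 'host' filter.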
Gathering packet captures on an NFS Server (non-Red Hat NFS Server)
Contact your NFS Server vendor for official steps on gathering a packet capture from the NFS Server, or use a port mirror to capture traffic from the NFS Server perspective.
- For NetApp filers, contact NetApp for official recommendations for your filer and environment. You may be able to use the pktt command as described by "How do I capture a packet trace of NFS operations on a NetApp filer?".
- For EMC Isilon filers, please contact EMC for official recommendations for your filer and environment. You may be able to use the isi_netlogger command or the web interface as described by "How do I capture a packet trace of NFS operations on a EMC Isilon filer?".
Analysis of packet captures
Take the timeframe of the problem calculated in the initial steps and use Wireshark or tshark to inspect the packet capture files. Be sure to use the correct TZ shell variable when running Wireshark or tshark so the timestamps on the packets will line up with the timeframe of the problem. For more information on the TZ variable, see the section titled "Timestamps in packet traces and matching other event timestamps" in "NFS packet trace analysis tips and tricks". Examine the packet captures for signs of network problems, such as retransmits/duplicates, TCP/IP handshake problems, delays in NFS RPC replies, etc.
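As a starting point, the capture can be narrowed to TCP retransmissions inside the incident window. The sketch below builds a Wireshark display filter for that purpose. The function name retrans_in_window is hypothetical, and the year, time zone, and capture path in the usage example are assumptions, since syslog timestamps carry neither a year nor a zone.

```shell
#!/bin/bash
# Hypothetical helper: build a Wireshark display filter that keeps TCP
# retransmissions inside a time window. Times use the
# "Mon DD, YYYY HH:MM:SS" form accepted for frame.time comparisons.
retrans_in_window() {
    local start=$1 end=$2
    printf 'tcp.analysis.retransmission && frame.time >= "%s" && frame.time <= "%s"\n' \
        "$start" "$end"
}

# Example (the year, time zone and capture path are assumptions):
#   TZ=America/New_York tshark -r /tmp/tcpdump.pcap -t ad \
#       -Y "$(retrans_in_window 'Sep 29, 2023 22:51:39' 'Sep 29, 2023 22:54:49')"
```

The -Y option is the display-filter flag in current tshark releases. Older versions, such as those on early RHEL releases, may need -R instead.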
For a few examples of common scenarios which may be seen in a tcpdump gathered on the NFS Client, please see NFS client tcpdump analysis: 3 common failure scenarios
Troubleshooting with vmcores
In general, a vmcore (copy of kernel memory created by causing a kernel panic) is not required to investigate a connectivity issue such as this.
Red Hat may request a vmcore from an NFS Client or NFS Server at a later date if it is believed there is a specific bug within RHEL, but a vmcore is not an initial or common troubleshooting step for this sort of issue.