RHEL7: NFS4.0 client protocol loop of LOCK / NFS4ERR_OLD_STATEID (10024) seen with EMC Isilon NFS server
Issue
- We run WebDAV over NFS shares on Apache 2.4 on RHEL 7. Occasionally, applications connecting to Apache receive TCP resets and cannot complete POST and PUT requests.
- Restarting Apache clears the problem.
- WebDAV writes fail, while reads continue to work normally. A tcpdump capture suggests that Apache cannot write to the NFS share fast enough to drain the full TCP buffer. We see this regularly, and it does not always lead to the failure. When the failure does occur, it starts the same way, with the TCP buffer filling up before all the data is written to the NFS share. Then writes to the share stop entirely, and only restarting Apache allows them to resume. From that point on, the application (the client) keeps receiving TCP window full messages when it checks the buffer size. Eventually it stops waiting and resets the connection, which is when the application fails.
- On recommendation, we upgraded to the following release (kernel and prerequisite RPMs):
  - Red Hat Enterprise Linux Server release 7.6 (Maipo)
- We saw more issues after the upgrade and had to revert to 7.5.
Environment
- Red Hat Enterprise Linux 7 (NFS client)
- seen on kernel-3.10.0-862.14.1.el7
- seen on kernel-3.10.0-957.5.1.el7
- seen with EMC Isilon (NFS server)
- NFS4.0
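The loop named in the title shows up in a decoded packet capture as a long run of LOCK replies that carry status NFS4ERR_OLD_STATEID (10024). The following is a minimal, hypothetical detector, not a Red Hat tool. It assumes the capture has already been decoded into (opcode, status) pairs for a single client's replies, in capture order. The opcode and error values come from the NFSv4.0 specification (RFC 7530); the function names and the threshold of 10 are illustrative assumptions.

```python
from typing import Iterable, Tuple

# Values from the NFSv4.0 specification (RFC 7530).
OP_LOCK = 12
NFS4ERR_OLD_STATEID = 10024


def longest_old_stateid_lock_run(replies: Iterable[Tuple[int, int]]) -> int:
    """Return the length of the longest run of consecutive LOCK replies
    that failed with NFS4ERR_OLD_STATEID.

    `replies` is a hypothetical, pre-decoded sequence of (opcode, status)
    pairs from one client's replies, in capture order.
    """
    longest = 0
    current = 0
    for opcode, status in replies:
        if opcode != OP_LOCK:
            # Results for other operations (e.g. PUTFH in the same
            # COMPOUND) neither extend nor break the run.
            continue
        if status == NFS4ERR_OLD_STATEID:
            current += 1
            longest = max(longest, current)
        else:
            # A LOCK reply with any other status ends the run.
            current = 0
    return longest


def looks_like_lock_loop(replies: Iterable[Tuple[int, int]],
                         threshold: int = 10) -> bool:
    """Heuristic: flag the trace if the run reaches `threshold` (arbitrary)."""
    return longest_old_stateid_lock_run(replies) >= threshold


if __name__ == "__main__":
    # One successful LOCK followed by twelve OLD_STATEID failures.
    trace = [(OP_LOCK, 0)] + [(OP_LOCK, NFS4ERR_OLD_STATEID)] * 12
    print(longest_old_stateid_lock_run(trace))  # 12
    print(looks_like_lock_loop(trace))  # True
```

Counting only LOCK replies keeps the heuristic independent of how the client packs other operations into each COMPOUND. The threshold is a tuning knob, because a few OLD_STATEID replies in isolation can be part of normal stateid recovery.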