RHEL 6 NFS clients become unresponsive due to invalid rpc_xprt->snd_task pointer
Issue
- We have seen several (10 or so) cases since our 6.3 upgrade where an NFS mount on a client becomes unresponsive. "df" hangs, and any command that accesses the mount hangs.
- New mounts to the same cluster through new nodes succeed. New mount requests to the same cluster through the stuck server node hang.
- Multiple servers randomly but regularly lose access to 9 NFS shares. The NFS server is an EMC filer. The shares are mounted over TCP; UDP shares are not affected, and the same TCP shares on other, not yet affected servers work fine. During the issue the NFS server is reachable on the network and correctly answers "showmount -e" requests.
- NFS clients become unresponsive, and a "failed to lock transport" message appears for each RPC request made on the stuck RPC transport (if verbose RPC debugging is enabled).
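When RPC debugging is enabled, a transport that has become stuck tends to show up as one pointer value repeated across many "failed to lock transport" messages. The following is a minimal sketch, not a Red Hat tool, that groups those messages by transport pointer. It assumes the messages look like "RPC: <task id> failed to lock transport <pointer>"; that exact format, the helper names count_lock_failures and suspect_transports, and the threshold default are all assumptions for illustration.

```python
import re
from collections import Counter

# ASSUMED message shape: "RPC: <task id> failed to lock transport <pointer>".
# The pointer may be printed with or without a 0x prefix; adjust if your
# kernel's debug output differs.
LOCK_FAIL = re.compile(r"RPC:\s+(\d+) failed to lock transport (0x[0-9a-f]+|[0-9a-f]+)")


def count_lock_failures(lines):
    """Count 'failed to lock transport' messages per transport pointer."""
    counts = Counter()
    for line in lines:
        match = LOCK_FAIL.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


def suspect_transports(lines, threshold=3):
    """Return (pointer, count) pairs with at least `threshold` failures, most frequent first."""
    return [
        (xprt, n)
        for xprt, n in count_lock_failures(lines).most_common()
        if n >= threshold
    ]
```

For example, passing the lines of the kernel log (such as the output of "dmesg") to suspect_transports lists the transports that repeatedly failed to lock; a single pointer dominating the list is consistent with the stuck-transport symptom described above.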
Environment
- Originally seen on Red Hat Enterprise Linux 6.3 (NFS client)
- Originally seen on kernel-2.6.32-279.14.1.el6; other kernels between 2.6.32-220.el6 and 2.6.32-358.2.1.el6 are believed to be affected
- Mostly seen with the NFSv3 protocol; however, because the bug is in the SUNRPC layer, other protocols (NFSv4, NFSv4.1, NFSACL) are susceptible as well.
- Seen with Isilon NFS server clusters and EMC NFS servers, as well as NetApp filers