RHEL6.6: NFS4 client hangs with repeated 'Callback slot table overflowed' and constantly running rpciod and NFS4 state manager thread

Solution In Progress - Updated -

Issue

Something slightly different happened the other day, and I'm not sure whether it is related to this issue. Our application owner called me about I/O errors on one of the NFS file systems mounted on nfs-client, the client system that is part of this case. The messages file contained thousands of these entries:

Apr  9 17:05:30 nfs-client kernel: Callback slot table overflowed
Apr  9 17:05:30 nfs-client kernel: Callback slot table overflowed
...
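
To gauge how quickly these messages arrive, the log can be tallied per second. The following is a minimal sketch, assuming the syslog line format shown above; the function name count_overflows is hypothetical and not part of any Red Hat tooling.

```python
import re
from collections import Counter

# Matches a syslog line carrying the kernel message quoted above and
# captures its "Mon DD HH:MM:SS" timestamp.
OVERFLOW_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel: "
    r"Callback slot table overflowed"
)


def count_overflows(lines):
    """Return a Counter mapping each timestamp to its number of overflow messages."""
    counts = Counter()
    for line in lines:
        match = OVERFLOW_RE.match(line)
        if match:
            # Collapse the padded day field ("Apr  9") to single spaces.
            counts[" ".join(match.group(1).split())] += 1
    return counts
```

Passing an open handle on the messages file to count_overflows and printing the most common entries gives a quick view of when the bursts occurred.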

The job I had writing to this NFS file system showed a 35-second gap, which matches the grace time:

Sat Apr  9 17:04:57 CDT 2016
Sat Apr  9 17:04:58 CDT 2016
Sat Apr  9 17:04:59 CDT 2016
Sat Apr  9 17:05:34 CDT 2016
Sat Apr  9 17:05:35 CDT 2016
Sat Apr  9 17:05:36 CDT 2016
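
The 35-second figure follows from the two timestamps on either side of the stall (17:04:59 and 17:05:34). A minimal sketch of finding such gaps automatically, assuming the job writes one "date"-style line per second as shown above; the names parse_date_line and find_gaps and the 5-second threshold are illustrative assumptions.

```python
from datetime import datetime


def parse_date_line(line):
    """Parse a line such as 'Sat Apr  9 17:04:57 CDT 2016'."""
    fields = line.split()
    # Drop the timezone abbreviation (e.g. CDT): strptime's %Z support
    # depends on the platform, and the gap does not depend on the zone.
    del fields[4]
    return datetime.strptime(" ".join(fields), "%a %b %d %H:%M:%S %Y")


def find_gaps(lines, threshold_seconds=5):
    """Return (previous, current, seconds) for each gap above the threshold."""
    stamps = [parse_date_line(line) for line in lines if line.strip()]
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        delta = (cur - prev).total_seconds()
        if delta > threshold_seconds:
            gaps.append((prev, cur, delta))
    return gaps
```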

Output of "top" showed unusually high activity from the rpciod processes, plus a process whose name is truncated to "10.20.0.32-mana" (the NFS4 state manager thread mentioned in the title):

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1864 root      20   0     0    0    0 R  5.0  0.0 902:11.26 rpciod/0
 1867 root      20   0     0    0    0 S  3.3  0.0 201:17.00 rpciod/3
65088 root      22   2  105m 2412  220 D  3.3  0.0   0:00.10 bash
 1866 root      20   0     0    0    0 S  1.7  0.0 893:51.26 rpciod/2
47912 mzadmin   21   1 5707m 556m  11m S  1.7  3.5  48761:47 java
65103 root      20   0 17124 1228  868 R  1.7  0.0   0:00.09 top
65105 root      20   0     0    0    0 D  1.7  0.0   0:00.01 10.20.0.32-mana

I was also able to confirm the I/O errors, which were not consistent (see the two writes to /share02 below: the first one worked, the second one did not):

[root@nfs-client ~]# touch /share02/hi
[root@nfs-client ~]# rm /share02/hi
rm: remove regular empty file `/share02/hi'? y
[root@nfs-client ~]# touch /share03/hi
touch: cannot touch `/share03/hi': Input/output error
[root@nfs-client ~]# touch /share01/hi
[root@nfs-client ~]# rm /share01/hi
rm: remove regular empty file `/share01/hi'? y
[root@nfs-client ~]# cd /share02
[root@nfs-client share02]# touch hithere
touch: cannot touch `hithere': Input/output error
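
Because the failures are intermittent, a repeated create-and-remove probe per mount point can help record when each share fails. This is a minimal sketch only: probe_write is a hypothetical helper, and on a hung NFS mount the call may block rather than return an error.

```python
import os
import tempfile


def probe_write(directory):
    """Create and remove a scratch file in directory.

    Returns None on success, or the OSError that was raised
    (for example errno EIO, reported by touch as 'Input/output error').
    """
    try:
        fd, path = tempfile.mkstemp(prefix="nfs_probe_", dir=directory)
        os.close(fd)
        os.unlink(path)
        return None
    except OSError as err:
        return err
```

Calling probe_write on /share01, /share02 and /share03 in a loop and logging the result with a timestamp would show whether the errors line up with the "Callback slot table overflowed" bursts.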

I was able to unmount these NFS file systems normally, after which the RPC and "mana" processes went away. I then rebooted nfs-client to make sure it was clean.
The reboot of nfs-client occurred at Apr 9 18:12. After that I saw soft lockup messages on the NFS server, as described in "RHEL7.2: NFS4 server repeated soft lockups due to laundromat_main kworker process stuck in __destroy_client".

Environment

  • Red Hat Enterprise Linux 6.6 (NFS client)
    • kernel-2.6.32-504.8.1.el6.x86_64
  • NFS4.1
  • RHEL7.2 NFS server cluster
    • kernel-3.10.0-327.10.1.el7
