RHEL6.6: NFS4 client hangs with repeated 'Callback slot table overflowed' and constantly running rpciod and NFS4 state manager thread
Issue
Something slightly different happened the other day, and I'm not sure whether it's related to this issue. Our application owner called me about I/O errors on one of the NFS file systems mounted on nfs-client, the NFS client system in this case. The messages file contained thousands of these entries:
Apr 9 17:05:30 nfs-client kernel: Callback slot table overflowed
Apr 9 17:05:30 nfs-client kernel: Callback slot table overflowed
...
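To see when the overflow messages started and how densely they were logged, the entries can be bucketed per minute. The following is a minimal sketch, assuming the traditional syslog line prefix shown above (month, day, time, host); the function name `count_overflows_per_minute` is hypothetical and used only for illustration.

```python
from collections import Counter

MESSAGE = "Callback slot table overflowed"


def count_overflows_per_minute(lines):
    """Count MESSAGE occurrences per (month, day, HH:MM) bucket."""
    counts = Counter()
    for line in lines:
        if MESSAGE not in line:
            continue
        parts = line.split()
        # Assumed prefix: "Apr 9 17:05:30 nfs-client kernel: ..."
        if len(parts) < 3:
            continue
        month, day, clock = parts[0], parts[1], parts[2]
        counts[(month, day, clock[:5])] += 1  # clock[:5] is "HH:MM"
    return counts


sample = [
    "Apr 9 17:05:30 nfs-client kernel: Callback slot table overflowed",
    "Apr 9 17:05:30 nfs-client kernel: Callback slot table overflowed",
]

for bucket, n in sorted(count_overflows_per_minute(sample).items()):
    print(" ".join(bucket), n)
```

On a real system the same function could be fed the lines of the messages file (typically /var/log/messages on RHEL, which is an assumption about this host's syslog configuration).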
A job I had writing to this NFS file system showed a 35-second gap, which matches the grace time:
Sat Apr 9 17:04:57 CDT 2016
Sat Apr 9 17:04:58 CDT 2016
Sat Apr 9 17:04:59 CDT 2016
Sat Apr 9 17:05:34 CDT 2016
Sat Apr 9 17:05:35 CDT 2016
Sat Apr 9 17:05:36 CDT 2016
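The gap can be found mechanically by comparing consecutive timestamps. This is a minimal sketch, assuming the job writes one `date`-style line per second in an English locale; the helper names `parse_date_line` and `find_gaps` are hypothetical. The timezone token is dropped before parsing, so a gap that spans a daylight-saving change would be misreported.

```python
from datetime import datetime


def parse_date_line(line):
    """Parse a line such as 'Sat Apr 9 17:04:57 CDT 2016', ignoring the zone."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError("unexpected timestamp format: %r" % line)
    # Drop the timezone abbreviation (5th token); %Z parsing is platform dependent.
    without_zone = " ".join(parts[:4] + parts[5:])
    return datetime.strptime(without_zone, "%a %b %d %H:%M:%S %Y")


def find_gaps(lines, threshold_seconds=1.0):
    """Return (previous, current, seconds) for each gap above the threshold."""
    stamps = [parse_date_line(line) for line in lines if line.strip()]
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        delta = (cur - prev).total_seconds()
        if delta > threshold_seconds:
            gaps.append((prev, cur, delta))
    return gaps


sample = [
    "Sat Apr 9 17:04:57 CDT 2016",
    "Sat Apr 9 17:04:58 CDT 2016",
    "Sat Apr 9 17:04:59 CDT 2016",
    "Sat Apr 9 17:05:34 CDT 2016",
    "Sat Apr 9 17:05:35 CDT 2016",
    "Sat Apr 9 17:05:36 CDT 2016",
]

for prev, cur, seconds in find_gaps(sample):
    print(prev.time(), "->", cur.time(), int(seconds), "seconds")
```

For the six lines above, the only interval longer than one second is the one between 17:04:59 and 17:05:34.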
Output of "top" showed unusually high activity from the rpciod threads, and a "mana" process (the NFS4 state manager thread, whose name is truncated in the COMMAND column):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1864 root 20 0 0 0 0 R 5.0 0.0 902:11.26 rpciod/0
1867 root 20 0 0 0 0 S 3.3 0.0 201:17.00 rpciod/3
65088 root 22 2 105m 2412 220 D 3.3 0.0 0:00.10 bash
1866 root 20 0 0 0 0 S 1.7 0.0 893:51.26 rpciod/2
47912 mzadmin 21 1 5707m 556m 11m S 1.7 3.5 48761:47 java
65103 root 20 0 17124 1228 868 R 1.7 0.0 0:00.09 top
65105 root 20 0 0 0 0 D 1.7 0.0 0:00.01 10.20.0.32-mana
I was also able to confirm the I/O errors, which were intermittent (of the two writes to /share02 below, the first succeeded and the second failed):
[root@nfs-client ~]# touch /share02/hi
[root@nfs-client ~]# rm /share02/hi
rm: remove regular empty file `/share02/hi'? y
[root@nfs-client ~]# touch /share03/hi
touch: cannot touch `/share03/hi': Input/output error
[root@nfs-client ~]# touch /share01/hi
[root@nfs-client ~]# rm /share01/hi
rm: remove regular empty file `/share01/hi'? y
[root@nfs-client ~]# cd /share02
[root@nfs-client share02]# touch hithere
touch: cannot touch `hithere': Input/output error
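Because the errors were intermittent, repeating the same create-and-remove check on each mount point gives a clearer picture than a single touch. The following is a minimal sketch of such a probe; the function name `probe_write` and the probe file name are hypothetical, and the demonstration runs against a temporary directory rather than a real NFS mount.

```python
import errno
import os
import tempfile


def probe_write(directory, name="nfs_write_probe.tmp"):
    """Create and remove a file in directory.

    Returns None on success, or the symbolic errno name (for example 'EIO'
    for 'Input/output error') on failure.
    """
    path = os.path.join(directory, name)
    try:
        # O_EXCL avoids clobbering an existing file with the same name.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        os.close(fd)
        os.unlink(path)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None


# Demonstration on a local temporary directory (not an NFS mount).
with tempfile.TemporaryDirectory() as scratch:
    print(scratch, probe_write(scratch) or "ok")
```

On the affected client the probe would be called in a loop over /share01, /share02 and /share03, several times each, since a single success does not rule out the problem.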
I was able to unmount these NFS file systems normally, and the RPC and "mana" processes went away. I rebooted nfs-client just to make sure it was clean.
The reboot of nfs-client occurred at Apr 9 18:12. I then saw soft lockup messages on the NFS server, as described in "RHEL7.2: NFS4 server repeated soft lockups due to laundromat_main kworker process stuck in __destroy_client".
Environment
- Red Hat Enterprise Linux 6.6 (NFS client)
- kernel-2.6.32-504.8.1.el6.x86_64
- NFS4.1
- RHEL7.2 NFS server cluster
- kernel-3.10.0-327.10.1.el7