RHEL 6.7: NFS client hangs and sudo does not complete; single NFS4 WRITE stuck, possible NFS4 protocol loop with NetApp NFS server
Environment
- Red Hat Enterprise Linux 6
- NFS Client
- kernel-2.6.32-573.3.1.el6
- NetApp Release 8.1.4P6 7-Mode
- NFS Server
- NFS4
Issue
- sudo access hangs; we noticed a lot of I/O and hangs while performing sudo. We also saw a lot of defunct processes on the server. This hang was experienced on only one virtual server, and this is the first occurrence.
- One task has taken a page fault, holds the mm rw_semaphore as a result, and is waiting inside io_schedule for a write to an NFS4 share to complete. This task has been waiting at the same place for many hours, so presumably the NFS4 WRITE is not completing.
crash> ps -m 15927
[ 1 07:40:36.122] [UN] PID: 15927 TASK: ffff8805354daab0 CPU: 1 COMMAND: "foobar"
PID: 15927 TASK: ffff8805354daab0 CPU: 1 COMMAND: "foobar"
#0 [ffff880531323998] schedule at ffffffff81538300
#1 [ffff880531323a70] io_schedule at ffffffff81538b43
#2 [ffff880531323a90] nfs_wait_bit_uninterruptible at ffffffffa02c6bde [nfs]
#3 [ffff880531323aa0] __wait_on_bit at ffffffff8153960f
#4 [ffff880531323af0] out_of_line_wait_on_bit at ffffffff815396b8
#5 [ffff880531323b60] nfs_wait_on_request at ffffffffa02c6bbf [nfs]
#6 [ffff880531323b70] nfs_updatepage at ffffffffa02cd6a2 [nfs]
#7 [ffff880531323be0] nfs_vm_page_mkwrite at ffffffffa02bb73e [nfs]
#8 [ffff880531323c10] do_wp_page at ffffffff81151bd0
#9 [ffff880531323cb0] handle_pte_fault at ffffffff811526ad
#10 [ffff880531323d90] handle_mm_fault at ffffffff81153199
#11 [ffff880531323e00] __do_page_fault at ffffffff8104f156
#12 [ffff880531323f20] do_page_fault at ffffffff8153e95e
#13 [ffff880531323f50] page_fault at ffffffff8153bd05
- Many other tasks in the system are blocked inside rwsem_down_failed_common, including many 'ps' tasks. Any process forking a 'ps' task ends up blocked in the same place, waiting on the same semaphore.
PID: 31932 TASK: ffff88037ac53520 CPU: 1 COMMAND: "ps"
#0 [ffff8803a6ed3b80] schedule at ffffffff81538300
#1 [ffff8803a6ed3c58] rwsem_down_failed_common at ffffffff8153b475
#2 [ffff8803a6ed3cb8] rwsem_down_read_failed at ffffffff8153b606
#3 [ffff8803a6ed3cf8] call_rwsem_down_read_failed at ffffffff8129d4b4
#4 [ffff8803a6ed3d60] __access_remote_vm at ffffffff811537e1
#5 [ffff8803a6ed3e00] access_process_vm at ffffffff811539eb
#6 [ffff8803a6ed3e50] proc_pid_cmdline at ffffffff811fee6d
#7 [ffff8803a6ed3ea0] proc_info_read at ffffffff811fff4d
#8 [ffff8803a6ed3ef0] vfs_read at ffffffff81192215
#9 [ffff8803a6ed3f30] sys_read at ffffffff81192561
#10 [ffff8803a6ed3f80] system_call_fastpath at ffffffff8100b0d2
Resolution
- There is currently no resolution to this problem, as there was insufficient data to determine the full root cause.
- Next steps include gathering a tcpdump leading up to the start of the problem; if that is not possible, gather a tcpdump after the problem has occurred (see the example capture command after this list).
- On the NFS client, use the steps in https://access.redhat.com/articles/770453
- On the NFS server (NetApp), please contact NetApp support, though the following document can be used as a starting point: https://access.redhat.com/solutions/425893
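For reference, a rotating packet capture along the lines of the following, run on the NFS client, can catch the traffic leading up to the hang. This is a minimal sketch: the interface name, filer address, and output path are placeholders and should be adjusted for the environment.
# capture full frames of NFS traffic to the filer in a ring buffer of 20 x 100 MB files
tcpdump -i eth0 -s 0 -C 100 -W 20 -w /var/tmp/nfs-hang.pcap host 192.0.2.10 and port 2049
The ring-buffer options keep the capture bounded in size, so it can be left running until the hang occurs.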
Root Cause
- In short, the sudo processes hung because they forked a chain of processes that eventually led to a 'ps' process, which got stuck for the same reason every other process in the system was hung: waiting on a single semaphore. The semaphore was held by a process which took a page fault that required an NFSv4 WRITE to be sent. This NFS WRITE appears to be stuck, most likely being sent and replied to repeatedly with a retryable NFS4 error code (such as NFS4ERR_BAD_STATEID), i.e. an NFS4 protocol loop. Unfortunately we cannot tell which error code is coming back from the NFS server, so the ultimate problem may be either an NFS client or an NFS server bug (see the rpcdebug sketch after this list for one way to capture the error code if the problem recurs). The TCP connection looks fine, with the state being TCP_CONNECTED, and various timestamps and statistics indicating the NFS server is responding. Indirectly, other processes either issue a sync or end up waiting on this same semaphore, so they cannot complete and the machine eventually appears hung.
- It seems odd to see so many 'ps' processes in the system, so there could be some configuration or package which forks a ps subprocess for each userspace process. The source of the 'ps' processes is not fully known. However, many of them appear to be forked from a process that is part of the BMC Patrol Agent system monitoring suite.
- The root cause of the NFS4 WRITE being stuck is believed to be a NFS4 protocol loop, as described in https://access.redhat.com/articles/770453. Unfortunately the specific root cause was not identified. This could be either an NFS client or NFS server bug.
- The NFS server release, NetApp Release 8.1.4P6 7-Mode, rules out at least one NetApp bug which has been seen to produce NFS client hangs with WRITEs stuck and NFS4ERR_BAD_STATEID error code:
http://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=814268 and https://access.redhat.com/solutions/752643
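If the problem recurs, the error status coming back for the stuck WRITE can often be seen on the client, without a vmcore, by briefly enabling RPC/NFS debug logging. A minimal sketch using the standard rpcdebug interface (the logging is very verbose, so enable it only for a short window):
# enable NFS and SUNRPC debug messages; output goes to the kernel log (dmesg, /var/log/messages)
rpcdebug -m nfs -s all
rpcdebug -m rpc -s call xprt trans
# ... wait for or reproduce the stuck WRITE, then disable the logging again
rpcdebug -m nfs -c all
rpcdebug -m rpc -c call xprt trans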
Diagnostic Steps
- Configure kdump and gather a vmcore, then upload to Red Hat for analysis.
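On RHEL 6, a minimal sketch of that setup is below; the crashkernel= size is only an example and should be chosen per the kdump documentation for the system's memory size.
# reserve memory for the capture kernel; a reboot is required for this to take effect
grubby --update-kernel=ALL --args="crashkernel=128M"
# enable and start the kdump service
chkconfig kdump on
service kdump start
# when the hang is reproduced, force a crash dump from a responsive console
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger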
Detailed vmcore analysis
- In the vmcore, the oldest blocked task had taken a page fault and was waiting on NFS (a command sketch for locating such tasks follows the backtrace below).
crash> bt
PID: 15927 TASK: ffff8805354daab0 CPU: 1 COMMAND: "foobar"
#0 [ffff880531323998] schedule at ffffffff81538300
#1 [ffff880531323a70] io_schedule at ffffffff81538b43
#2 [ffff880531323a90] nfs_wait_bit_uninterruptible at ffffffffa02c6bde [nfs]
#3 [ffff880531323aa0] __wait_on_bit at ffffffff8153960f
#4 [ffff880531323af0] out_of_line_wait_on_bit at ffffffff815396b8
#5 [ffff880531323b60] nfs_wait_on_request at ffffffffa02c6bbf [nfs]
#6 [ffff880531323b70] nfs_updatepage at ffffffffa02cd6a2 [nfs]
#7 [ffff880531323be0] nfs_vm_page_mkwrite at ffffffffa02bb73e [nfs]
#8 [ffff880531323c10] do_wp_page at ffffffff81151bd0
#9 [ffff880531323cb0] handle_pte_fault at ffffffff811526ad
#10 [ffff880531323d90] handle_mm_fault at ffffffff81153199
#11 [ffff880531323e00] __do_page_fault at ffffffff8104f156
#12 [ffff880531323f20] do_page_fault at ffffffff8153e95e
#13 [ffff880531323f50] page_fault at ffffffff8153bd05
RIP: 00007f8dc1322d90 RSP: 0000000086a816d0 RFLAGS: 00210202
RAX: 0000000000000000 RBX: 0000000000000052 RCX: 00007f8dc2e063ed
RDX: 000000007eaa08a8 RSI: 0000000000000004 RDI: 0000000000000000
RBP: 0000000086673600 R8: 0000000000000004 R9: 000000007eaa08a8
R10: 00007f8e04050ad2 R11: 00007f8e04050aa9 R12: 00007f8e04050ab9
R13: 00007f8e4b498ce0 R14: 00007f8e4e8ba800 R15: 0000000000000007
ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
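Tasks like this can be located in a vmcore by listing tasks with their last-run timestamps and filtering on the uninterruptible (UN) state; the entries that have not run for the longest time are the longest-blocked tasks and are the ones to examine with bt. For example:
crash> ps -m | grep UN
    (pick the task with the largest elapsed time, then dump its stack)
crash> bt 15927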
- During the page fault, the mmap_sem semaphore is taken, and held by this process until the NFS WRITE completes.
- If we pull this semaphore off of the stack, we see there are many waiters on it.
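As a sketch of that step (the semaphore addresses below are illustrative, not taken from this vmcore): the rw_semaphore address can be read from the raw stack data of the faulting task, and crash's struct and list commands can then walk its wait_list.
crash> bt -f 15927
    (locate the mm_struct/rw_semaphore address in the page-fault frames)
crash> struct rw_semaphore ffff8805349e01a0
    (shows the count, wait_lock and wait_list fields)
crash> list -H ffff8805349e01a8 -o rwsem_waiter.list -s rwsem_waiter.task
    (walks the wait_list and prints the task blocked in each rwsem_waiter)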
- The report mentioned sudo hanging, and there are indeed sudo processes in the vmcore. Note that these processes are in state 'IN' (interruptible sleep); their backtraces below show they are waiting in select().
crash> foreach sudo ps -m
[ 0 03:27:53.489] [IN] PID: 7772 TASK: ffff8802fa43aab0 CPU: 1 COMMAND: "sudo"
[ 0 04:51:39.633] [IN] PID: 27386 TASK: ffff880280775520 CPU: 1 COMMAND: "sudo"
[ 0 05:45:38.337] [IN] PID: 29449 TASK: ffff880386d45520 CPU: 2 COMMAND: "sudo"
crash> foreach sudo bt
PID: 7772 TASK: ffff8802fa43aab0 CPU: 1 COMMAND: "sudo"
#0 [ffff8803520677f8] schedule at ffffffff81538300
#1 [ffff8803520678d0] schedule_hrtimeout_range at ffffffff8153a90d
#2 [ffff880352067970] poll_schedule_timeout at ffffffff811a9519
#3 [ffff880352067990] do_select at ffffffff811aa7ab
#4 [ffff880352067d70] core_sys_select at ffffffff811aaa3a
#5 [ffff880352067f10] sys_select at ffffffff811aadc7
#6 [ffff880352067f80] system_call_fastpath at ffffffff8100b0d2
RIP: 00007fc304366403 RSP: 00007ffd582edd78 RFLAGS: 00010202
RAX: 0000000000000017 RBX: ffffffff8100b0d2 RCX: 000000000000009e
RDX: 00007fc305cadbb0 RSI: 00007fc305ce1550 RDI: 000000000000000e
RBP: 00007ffd582ee2e0 R8: 0000000000000000 R9: 0101010101010101
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fc305925140 R14: 00007ffd582ee180 R15: 00007fc305cadbb0
ORIG_RAX: 0000000000000017 CS: 0033 SS: 002b
PID: 27386 TASK: ffff880280775520 CPU: 1 COMMAND: "sudo"
#0 [ffff88026488f7f8] schedule at ffffffff81538300
#1 [ffff88026488f8d0] schedule_hrtimeout_range at ffffffff8153a90d
#2 [ffff88026488f970] poll_schedule_timeout at ffffffff811a9519
#3 [ffff88026488f990] do_select at ffffffff811aa7ab
#4 [ffff88026488fd70] core_sys_select at ffffffff811aaa3a
#5 [ffff88026488ff10] sys_select at ffffffff811aadc7
#6 [ffff88026488ff80] system_call_fastpath at ffffffff8100b0d2
RIP: 00007f1ffb8cc403 RSP: 00007fffd32b0518 RFLAGS: 00010206
RAX: 0000000000000017 RBX: ffffffff8100b0d2 RCX: 000000000000009e
RDX: 00007f1ffe5d6aa0 RSI: 00007f1ffe5d4470 RDI: 000000000000000e
RBP: 00007fffd32b0a80 R8: 0000000000000000 R9: 0101010101010101
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f1ffe218140 R14: 00007fffd32b0920 R15: 00007f1ffe5d6aa0
ORIG_RAX: 0000000000000017 CS: 0033 SS: 002b
PID: 29449 TASK: ffff880386d45520 CPU: 2 COMMAND: "sudo"
#0 [ffff88037ad4f7f8] schedule at ffffffff81538300
#1 [ffff88037ad4f8d0] schedule_hrtimeout_range at ffffffff8153a90d
#2 [ffff88037ad4f970] poll_schedule_timeout at ffffffff811a9519
#3 [ffff88037ad4f990] do_select at ffffffff811aa7ab
#4 [ffff88037ad4fd70] core_sys_select at ffffffff811aaa3a
#5 [ffff88037ad4ff10] sys_select at ffffffff811aadc7
#6 [ffff88037ad4ff80] system_call_fastpath at ffffffff8100b0d2
RIP: 00007f4d39b12403 RSP: 00007ffca20cc8f8 RFLAGS: 00010202
RAX: 0000000000000017 RBX: ffffffff8100b0d2 RCX: 000000000000009e
RDX: 00007f4d3b0aa450 RSI: 00007f4d3b0b0800 RDI: 000000000000000e
RBP: 00007ffca20cce60 R8: 0000000000000000 R9: 0101010101010101
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f4d3acee140 R14: 00007ffca20ccd00 R15: 00007f4d3b0aa450
ORIG_RAX: 0000000000000017 CS: 0033 SS: 002b
- If we pick one of the sudo processes, look at its timestamp, and assume that sudo forks other processes, we can quickly figure out the likely tie without getting mired in vmcore details, simply by looking at the last_ran times, which are very close to, but after, the time at which sudo blocked. Looking at the process hierarchy of the last process in the chain, pid 31932, the 'sudo' got stuck because it forked a 'ps' process, which in turn got stuck on the semaphore held by the process that took the page fault and is waiting on that one NFS WRITE.
crash> ps -m | grep "0 05:45:"
...
[ 0 05:45:34.831] [IN] PID: 1610 TASK: ffff8803a6df4ab0 CPU: 1 COMMAND: "crond"
[ 0 05:45:38.437] [UN] PID: 31932 TASK: ffff88037ac53520 CPU: 1 COMMAND: "ps"
[ 0 05:45:38.513] [IN] PID: 31937 TASK: ffff880386f52040 CPU: 1 COMMAND: "grep"
[ 0 05:45:38.504] [IN] PID: 31938 TASK: ffff880376f58ab0 CPU: 0 COMMAND: "awk"
[ 0 05:45:38.514] [IN] PID: 31934 TASK: ffff880119bd5520 CPU: 1 COMMAND: "grep"
[ 0 05:45:38.504] [IN] PID: 31936 TASK: ffff880386f53520 CPU: 0 COMMAND: "xargs"
[ 0 05:45:38.514] [IN] PID: 31935 TASK: ffff88037adb2040 CPU: 1 COMMAND: "awk"
[ 0 05:45:38.337] [IN] PID: 29449 TASK: ffff880386d45520 CPU: 2 COMMAND: "sudo"
crash> ps -p 31932
PID: 0 TASK: ffffffff81a95020 CPU: 0 COMMAND: "swapper"
PID: 1 TASK: ffff880537a4b520 CPU: 2 COMMAND: "init"
PID: 6740 TASK: ffff88052f036ab0 CPU: 3 COMMAND: "sshd"
PID: 16434 TASK: ffff8804be48e040 CPU: 1 COMMAND: "sshd"
PID: 22330 TASK: ffff8803a6cd2040 CPU: 2 COMMAND: "ksh"
PID: 29449 TASK: ffff880386d45520 CPU: 2 COMMAND: "sudo"
PID: 31914 TASK: ffff880386dbcab0 CPU: 3 COMMAND: "bash"
PID: 31931 TASK: ffff8804c28caab0 CPU: 0 COMMAND: "bash"
PID: 31932 TASK: ffff88037ac53520 CPU: 1 COMMAND: "ps"
crash> bt 31932
PID: 31932 TASK: ffff88037ac53520 CPU: 1 COMMAND: "ps"
#0 [ffff8803a6ed3b80] schedule at ffffffff81538300
#1 [ffff8803a6ed3c58] rwsem_down_failed_common at ffffffff8153b475
#2 [ffff8803a6ed3cb8] rwsem_down_read_failed at ffffffff8153b606
#3 [ffff8803a6ed3cf8] call_rwsem_down_read_failed at ffffffff8129d4b4
#4 [ffff8803a6ed3d60] __access_remote_vm at ffffffff811537e1
#5 [ffff8803a6ed3e00] access_process_vm at ffffffff811539eb
#6 [ffff8803a6ed3e50] proc_pid_cmdline at ffffffff811fee6d
#7 [ffff8803a6ed3ea0] proc_info_read at ffffffff811fff4d
#8 [ffff8803a6ed3ef0] vfs_read at ffffffff81192215
#9 [ffff8803a6ed3f30] sys_read at ffffffff81192561
#10 [ffff8803a6ed3f80] system_call_fastpath at ffffffff8100b0d2
RIP: 0000003eda4db570 RSP: 00007ffe940f54b0 RFLAGS: 00000202
- If we pull the semaphore off of the stack, we see it is the same semaphore held by the process which took the page fault and is waiting on NFS to complete. Thus sudo is hung because the NFS WRITE is not completing, which is believed to be due to an NFS4 protocol loop (see the cross-check sketch below).
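That cross-check can be made the same way, by comparing the rw_semaphore address visible in the raw stack data of the two tasks, for example:
crash> bt -f 31932
    (the rw_semaphore address appears in the frames around rwsem_down_read_failed)
crash> bt -f 15927
    (the same address appears in the page-fault frames of the task holding the semaphore for write)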
