RHEL6.7: NFS client hangs and unable to sudo, single NFS4 WRITE hanging, possible NFS4 protocol loop with NetApp NFS server

Solution In Progress - Updated -

Environment

  • Red Hat Enterprise Linux 6
    • NFS Client
    • kernel-2.6.32-573.3.1.el6
  • NetApp Release 8.1.4P6 7-Mode
    • NFS Server
  • NFS4

Issue

  • sudo access hangs, we noticed lot of I/O and hands while performing sudo. We also saw lot of defunct process running on the server. This hang experienced only on one virtual server, and this is the first occurrence
  • One task has taken a page fault, holds a mm rw_semaphore as a result, and is waiting inside io_schedule, waiting on the completion of a write to an NFS4 share. This task has been waiting for many hours at the same place, so presumably the NFS4 WRITE is not completing.
crash> ps -m 15927
[ 1 07:40:36.122] [UN]  PID: 15927  TASK: ffff8805354daab0  CPU: 1   COMMAND: "foobar"
PID: 15927  TASK: ffff8805354daab0  CPU: 1   COMMAND: "foobar"
 #0 [ffff880531323998] schedule at ffffffff81538300
 #1 [ffff880531323a70] io_schedule at ffffffff81538b43
 #2 [ffff880531323a90] nfs_wait_bit_uninterruptible at ffffffffa02c6bde [nfs]
 #3 [ffff880531323aa0] __wait_on_bit at ffffffff8153960f
 #4 [ffff880531323af0] out_of_line_wait_on_bit at ffffffff815396b8
 #5 [ffff880531323b60] nfs_wait_on_request at ffffffffa02c6bbf [nfs]
 #6 [ffff880531323b70] nfs_updatepage at ffffffffa02cd6a2 [nfs]
 #7 [ffff880531323be0] nfs_vm_page_mkwrite at ffffffffa02bb73e [nfs]
 #8 [ffff880531323c10] do_wp_page at ffffffff81151bd0
 #9 [ffff880531323cb0] handle_pte_fault at ffffffff811526ad
#10 [ffff880531323d90] handle_mm_fault at ffffffff81153199
#11 [ffff880531323e00] __do_page_fault at ffffffff8104f156
#12 [ffff880531323f20] do_page_fault at ffffffff8153e95e
#13 [ffff880531323f50] page_fault at ffffffff8153bd05
  • Many other tasks in the system are blocked inside rwsem_down_failed_common, including many 'ps' tasks. Any pid forking a 'ps' task ends up blocking in the same place due to waiting on the semaphore.
PID: 31932  TASK: ffff88037ac53520  CPU: 1   COMMAND: "ps"
 #0 [ffff8803a6ed3b80] schedule at ffffffff81538300
 #1 [ffff8803a6ed3c58] rwsem_down_failed_common at ffffffff8153b475
 #2 [ffff8803a6ed3cb8] rwsem_down_read_failed at ffffffff8153b606
 #3 [ffff8803a6ed3cf8] call_rwsem_down_read_failed at ffffffff8129d4b4
 #4 [ffff8803a6ed3d60] __access_remote_vm at ffffffff811537e1
 #5 [ffff8803a6ed3e00] access_process_vm at ffffffff811539eb
 #6 [ffff8803a6ed3e50] proc_pid_cmdline at ffffffff811fee6d
 #7 [ffff8803a6ed3ea0] proc_info_read at ffffffff811fff4d
 #8 [ffff8803a6ed3ef0] vfs_read at ffffffff81192215
 #9 [ffff8803a6ed3f30] sys_read at ffffffff81192561
#10 [ffff8803a6ed3f80] system_call_fastpath at ffffffff8100b0d2

Resolution

  • There is currently no resolution to this problem as there was insufficient data to fully root cause.
  • Next steps include gathering a tcpdump leading up to the start of the problem. If not possible, gather a tcpdump after the problem has occurred.

Root Cause

  • In short, the sudo processes got hung due to forking a series of processes, which eventually led to a 'ps' process, which got stuck for the same reason every other process in the system was hung - waiting on a single semaphore. The semaphore was held by one process which took a page fault, which required an NFSv4 write to be sent. This NFS write appears to be stuck, and looks to be most likely being sent and replied to with a retryable NFS4 error code (such as NFS4ERR_BAD_STATEID), most likely due to a NFS4 protocol loop. Unfortunately we cannot tell which error code is coming back from the NFS server, and the ultimate problem may be either an NFS client or NFS server bug, we cannot tell. The TCP connection looks fine, with the state being TCP_CONNECTED, and various time stamps and statistics indicating the NFS server is responding. Indirectly other processes either issue a sync or end up waiting on this same semaphore, and so cannot complete and the machine eventually appears hung.
  • It seems odd to see so many 'ps' processes in the system, so there could be some configuration or package which forks a ps subprocess for each userspace process. The source of the 'ps' processes is not fully known. However, many of them appear to be forked from a process that is part of the BMC Patrol Agent system monitoring suite.
  • The root cause of the NFS4 WRITE being stuck is believed to be a NFS4 protocol loop, as described in https://access.redhat.com/articles/770453. Unfortunately the specific root cause was not identified. This could be either an NFS client or NFS server bug.

Diagnostic Steps

  • Configure kdump and gather a vmcore, then upload to Red Hat for analysis.

Detailed vmcore analysis

  • In the vmcore there is an oldest blocked task had taken a page fault and was waiting on NFS
crash> bt
PID: 15927  TASK: ffff8805354daab0  CPU: 1   COMMAND: "foobar"
 #0 [ffff880531323998] schedule at ffffffff81538300
 #1 [ffff880531323a70] io_schedule at ffffffff81538b43
 #2 [ffff880531323a90] nfs_wait_bit_uninterruptible at ffffffffa02c6bde [nfs]
 #3 [ffff880531323aa0] __wait_on_bit at ffffffff8153960f
 #4 [ffff880531323af0] out_of_line_wait_on_bit at ffffffff815396b8
 #5 [ffff880531323b60] nfs_wait_on_request at ffffffffa02c6bbf [nfs]
 #6 [ffff880531323b70] nfs_updatepage at ffffffffa02cd6a2 [nfs]
 #7 [ffff880531323be0] nfs_vm_page_mkwrite at ffffffffa02bb73e [nfs]
 #8 [ffff880531323c10] do_wp_page at ffffffff81151bd0
 #9 [ffff880531323cb0] handle_pte_fault at ffffffff811526ad
#10 [ffff880531323d90] handle_mm_fault at ffffffff81153199
#11 [ffff880531323e00] __do_page_fault at ffffffff8104f156
#12 [ffff880531323f20] do_page_fault at ffffffff8153e95e
#13 [ffff880531323f50] page_fault at ffffffff8153bd05
    RIP: 00007f8dc1322d90  RSP: 0000000086a816d0  RFLAGS: 00210202
    RAX: 0000000000000000  RBX: 0000000000000052  RCX: 00007f8dc2e063ed
    RDX: 000000007eaa08a8  RSI: 0000000000000004  RDI: 0000000000000000
    RBP: 0000000086673600   R8: 0000000000000004   R9: 000000007eaa08a8
    R10: 00007f8e04050ad2  R11: 00007f8e04050aa9  R12: 00007f8e04050ab9
    R13: 00007f8e4b498ce0  R14: 00007f8e4e8ba800  R15: 0000000000000007
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
  • During the page fault, the mmap_sem semaphore is taken, and held by this process until the NFS WRITE completes.
  • If we pull this semaphore off of the stack, we see there are many waiters on it.

  • The report mentioned sudo hanging, and there are sudo processes in the vmcore. And note these processes have state 'IN'.

crash> foreach sudo ps -m
[ 0 03:27:53.489] [IN]  PID: 7772   TASK: ffff8802fa43aab0  CPU: 1   COMMAND: "sudo"
[ 0 04:51:39.633] [IN]  PID: 27386  TASK: ffff880280775520  CPU: 1   COMMAND: "sudo"
[ 0 05:45:38.337] [IN]  PID: 29449  TASK: ffff880386d45520  CPU: 2   COMMAND: "sudo"
crash> foreach sudo bt
PID: 7772   TASK: ffff8802fa43aab0  CPU: 1   COMMAND: "sudo"
 #0 [ffff8803520677f8] schedule at ffffffff81538300
 #1 [ffff8803520678d0] schedule_hrtimeout_range at ffffffff8153a90d
 #2 [ffff880352067970] poll_schedule_timeout at ffffffff811a9519
 #3 [ffff880352067990] do_select at ffffffff811aa7ab
 #4 [ffff880352067d70] core_sys_select at ffffffff811aaa3a
 #5 [ffff880352067f10] sys_select at ffffffff811aadc7
 #6 [ffff880352067f80] system_call_fastpath at ffffffff8100b0d2
    RIP: 00007fc304366403  RSP: 00007ffd582edd78  RFLAGS: 00010202
    RAX: 0000000000000017  RBX: ffffffff8100b0d2  RCX: 000000000000009e
    RDX: 00007fc305cadbb0  RSI: 00007fc305ce1550  RDI: 000000000000000e
    RBP: 00007ffd582ee2e0   R8: 0000000000000000   R9: 0101010101010101
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000000
    R13: 00007fc305925140  R14: 00007ffd582ee180  R15: 00007fc305cadbb0
    ORIG_RAX: 0000000000000017  CS: 0033  SS: 002b

PID: 27386  TASK: ffff880280775520  CPU: 1   COMMAND: "sudo"
 #0 [ffff88026488f7f8] schedule at ffffffff81538300
 #1 [ffff88026488f8d0] schedule_hrtimeout_range at ffffffff8153a90d
 #2 [ffff88026488f970] poll_schedule_timeout at ffffffff811a9519
 #3 [ffff88026488f990] do_select at ffffffff811aa7ab
 #4 [ffff88026488fd70] core_sys_select at ffffffff811aaa3a
 #5 [ffff88026488ff10] sys_select at ffffffff811aadc7
 #6 [ffff88026488ff80] system_call_fastpath at ffffffff8100b0d2
    RIP: 00007f1ffb8cc403  RSP: 00007fffd32b0518  RFLAGS: 00010206
    RAX: 0000000000000017  RBX: ffffffff8100b0d2  RCX: 000000000000009e
    RDX: 00007f1ffe5d6aa0  RSI: 00007f1ffe5d4470  RDI: 000000000000000e
    RBP: 00007fffd32b0a80   R8: 0000000000000000   R9: 0101010101010101
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000000
    R13: 00007f1ffe218140  R14: 00007fffd32b0920  R15: 00007f1ffe5d6aa0
    ORIG_RAX: 0000000000000017  CS: 0033  SS: 002b

PID: 29449  TASK: ffff880386d45520  CPU: 2   COMMAND: "sudo"
 #0 [ffff88037ad4f7f8] schedule at ffffffff81538300
 #1 [ffff88037ad4f8d0] schedule_hrtimeout_range at ffffffff8153a90d
 #2 [ffff88037ad4f970] poll_schedule_timeout at ffffffff811a9519
 #3 [ffff88037ad4f990] do_select at ffffffff811aa7ab
 #4 [ffff88037ad4fd70] core_sys_select at ffffffff811aaa3a
 #5 [ffff88037ad4ff10] sys_select at ffffffff811aadc7
 #6 [ffff88037ad4ff80] system_call_fastpath at ffffffff8100b0d2
    RIP: 00007f4d39b12403  RSP: 00007ffca20cc8f8  RFLAGS: 00010202
    RAX: 0000000000000017  RBX: ffffffff8100b0d2  RCX: 000000000000009e
    RDX: 00007f4d3b0aa450  RSI: 00007f4d3b0b0800  RDI: 000000000000000e
    RBP: 00007ffca20cce60   R8: 0000000000000000   R9: 0101010101010101
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000000
    R13: 00007f4d3acee140  R14: 00007ffca20ccd00  R15: 00007f4d3b0aa450
    ORIG_RAX: 0000000000000017  CS: 0033  SS: 002b

If we pick one of the sudo processes, and take a look at the timestamp, and assume the sudo process forks other processes, we quickly figure out the likely tie without getting mired in a lot of vmcore details, but just looking at the last_ran times which are very close, but after the sudo blocking time. In the end, looking at the process hierarchy of the last process in the chain, pid 31932, the 'sudo' get stuck due to forking a 'ps' process, which gets stuck due to the semaphore held by the process which took the page fault, and got stuck on that one NFS write.

crash> ps -m | grep "0 05:45:"
...
[ 0 05:45:34.831] [IN]  PID: 1610   TASK: ffff8803a6df4ab0  CPU: 1   COMMAND: "crond"
[ 0 05:45:38.437] [UN]  PID: 31932  TASK: ffff88037ac53520  CPU: 1   COMMAND: "ps"
[ 0 05:45:38.513] [IN]  PID: 31937  TASK: ffff880386f52040  CPU: 1   COMMAND: "grep"
[ 0 05:45:38.504] [IN]  PID: 31938  TASK: ffff880376f58ab0  CPU: 0   COMMAND: "awk"
[ 0 05:45:38.514] [IN]  PID: 31934  TASK: ffff880119bd5520  CPU: 1   COMMAND: "grep"
[ 0 05:45:38.504] [IN]  PID: 31936  TASK: ffff880386f53520  CPU: 0   COMMAND: "xargs"
[ 0 05:45:38.514] [IN]  PID: 31935  TASK: ffff88037adb2040  CPU: 1   COMMAND: "awk"
[ 0 05:45:38.337] [IN]  PID: 29449  TASK: ffff880386d45520  CPU: 2   COMMAND: "sudo"

crash> ps -p 31932
PID: 0      TASK: ffffffff81a95020  CPU: 0   COMMAND: "swapper"
 PID: 1      TASK: ffff880537a4b520  CPU: 2   COMMAND: "init"
  PID: 6740   TASK: ffff88052f036ab0  CPU: 3   COMMAND: "sshd"
   PID: 16434  TASK: ffff8804be48e040  CPU: 1   COMMAND: "sshd"
    PID: 22330  TASK: ffff8803a6cd2040  CPU: 2   COMMAND: "ksh"
     PID: 29449  TASK: ffff880386d45520  CPU: 2   COMMAND: "sudo"
      PID: 31914  TASK: ffff880386dbcab0  CPU: 3   COMMAND: "bash"
       PID: 31931  TASK: ffff8804c28caab0  CPU: 0   COMMAND: "bash"
        PID: 31932  TASK: ffff88037ac53520  CPU: 1   COMMAND: "ps"

crash> bt 31932
PID: 31932  TASK: ffff88037ac53520  CPU: 1   COMMAND: "ps"
 #0 [ffff8803a6ed3b80] schedule at ffffffff81538300
 #1 [ffff8803a6ed3c58] rwsem_down_failed_common at ffffffff8153b475
 #2 [ffff8803a6ed3cb8] rwsem_down_read_failed at ffffffff8153b606
 #3 [ffff8803a6ed3cf8] call_rwsem_down_read_failed at ffffffff8129d4b4
 #4 [ffff8803a6ed3d60] __access_remote_vm at ffffffff811537e1
 #5 [ffff8803a6ed3e00] access_process_vm at ffffffff811539eb
 #6 [ffff8803a6ed3e50] proc_pid_cmdline at ffffffff811fee6d
 #7 [ffff8803a6ed3ea0] proc_info_read at ffffffff811fff4d
 #8 [ffff8803a6ed3ef0] vfs_read at ffffffff81192215
 #9 [ffff8803a6ed3f30] sys_read at ffffffff81192561
#10 [ffff8803a6ed3f80] system_call_fastpath at ffffffff8100b0d2
    RIP: 0000003eda4db570  RSP: 00007ffe940f54b0  RFLAGS: 00000202
  • If we pull the semaphore off of the stack, we see it is the same semaphore which is held by the process which has taken the page fault, and is waiting on NFS to complete. Thus sudo is hung due to the NFS WRITE not completing, and this is believed to be due to an NFS4 protocol loop.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

Close

Welcome! Check out the Getting Started with Red Hat page for quick tours and guides for common tasks.