NFS4.0: High Load but CPU idle - leading to a complete outage of the system opening files on an NFS4 mount

Environment

  • Red Hat Enterprise Linux 6 (NFS client)
  • Red Hat Enterprise Linux 7 (NFS client)
  • Any NFS server
    • seen with NetApp
  • NFS4.0
  • Any application where the same userid opens / closes a lot of files on NFS4 in a short time
    • seen with httpd (apache) web server
    • seen with java processes opening and closing files serially

Issue

  • The load grows very quickly from about 5 to greater than 4000 (sometimes up to 4500). Several webserver processes run on the system, one of which serves our online service. This specific webserver appears to be causing the problem: the system itself remains responsive, and the webservers of other applications seem unaffected, but the online service website becomes very slow, up to a complete outage.
  • The problem happened twice within the last 7 days, both times leading to a complete outage of our online service. Each time it lasted about 3 hours and then seemed to disappear, and we don't really know why. If the problem shows up again it is extremely urgent.
  • An application with one userid opens / closes a lot of files on NFS4 in a short period of time.
  • Many processes are blocked waiting for NFS4 / RPC tasks to complete, and the RPC tasks are blocked in the Seqid_waitqueue.

Resolution

Red Hat Enterprise Linux 6 (NFS client)

  • There is currently no resolution to this problem. The bug filed for RHEL6 is closed as WONT_FIX.

Red Hat Enterprise Linux 7 (NFS client)

  • This issue has been resolved with the errata RHSA-2016-2574 for the package `kernel-3.10.0-514.el7` or later.
    • Originally tracked in private Red Hat Bug 1287789 - NFSv4.1 parallel OPEN feature.
    • Remount the NFS share as an NFSv4.1 share to take advantage of parallel OPEN calls.
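As an illustration, a hypothetical /etc/fstab entry requesting NFSv4.1 could look like the following (the server name, export path, and mountpoint are placeholders); after editing, unmount and remount the share:

```
# /etc/fstab -- placeholder names; vers=4.1 enables parallel OPEN calls on RHEL 7
nfs-server:/export/data  /mnt/data  nfs  vers=4.1  0 0
```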

NFS server or network

  • Address any underlying problem leading to increased latency for NFS4 OPEN or CLOSE operations.

Workarounds

  • Use NFSv3 (verified)
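If NFSv3 is acceptable in the environment, a hypothetical /etc/fstab entry forcing it could look like this (the server name, export path, and mountpoint are placeholders):

```
# /etc/fstab -- placeholder names; vers=3 avoids the NFS4.0 open_owner seqid bottleneck
nfs-server:/export/data  /mnt/data  nfs  vers=3  0 0
```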

Root Cause

  • The Linux NFS4.0 client implementation has a limitation in that an "open_owner" is mapped to a userid. This creates a bottleneck if one user opens and closes a lot of files in a short period of time: each OPEN / CLOSE operation has to wait for a sequence id, which essentially serializes each OPEN / CLOSE request. If an NFS server's response time for OPEN / CLOSE requests increases due to some secondary load or complication, this NFS4 client limitation can become pronounced and, in some cases, leave the machine unresponsive.
  • The NFS4.1 protocol addresses the limitation of serialized OPENs per open_owner. For more information, see RFC 5661, Section 9.10.
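As a back-of-the-envelope illustration of the serialization cost (the operation count and average latency below are taken from the tcpdump summary in the Diagnostic Steps section; your numbers will differ):

```shell
# With NFS4.0, OPEN calls for one open_owner are serialized, so the wall time
# they consume is roughly n_ops * average per-op latency.
n_ops=5070              # OPEN calls observed in the example capture
avg_latency_us=9056     # average OPEN latency: 0.009056 s, in microseconds
total_us=$((n_ops * avg_latency_us))
echo "serialized OPEN time: ~$((total_us / 1000000)) seconds"
```

Roughly 45 seconds of that capture window is spent in serialized OPENs alone for a single open_owner; NFSv4.1 allows these OPENs to proceed in parallel instead.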

Diagnostic Steps

General steps

  • Gather a tcpdump on the NFS4 client at the time of the problem.
  • Configure kdump and gather a vmcore at the time of the problem.
  • Use the attached 'monitor-seqid-waitqueue.sh' script, or something similar, to determine whether a lot of RPC tasks are stuck waiting on the Seqid_waitqueue.
    • WARNING: Running the attached script has been seen to increase latency of RPC tasks. Thus, it is only recommended for a short time as a diagnostic, and should not be run constantly in a production environment.

Analysis of tcpdump

  • A tcpdump should show a high number of NFS4 OPEN and CLOSE requests. Use either tshark / wireshark to count the NFS4 operations, or the attached pcap_nfsstat.sh script.
NFSv4 Call    Count     Min       Avg       Max       Total
ACCESS              98  0.000150  0.005951  0.151759  0.583243
CLOSE             2712  0.000464  0.004367  0.182066  11.844358
GETATTR             45  0.000146  0.003879  0.051450  0.174542
LOOKUP            6492  0.000343  0.005689  0.201797  36.930133
OPEN              5070  0.000640  0.009056  0.202305  45.914946
READ                 1  0.000581  0.000581  0.000581  0.000581
RENEW                2  0.000146  0.000348  0.000551  0.000697
summary          14424  0.000000  0.006617  0.202305  95.448500
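A minimal sketch of producing such per-operation counts with tshark plus standard text tools (the nfs.main_opcode field name is an assumption; verify it against your tshark/wireshark version, e.g. with `tshark -G fields`):

```shell
# count_ops: tally one operation/opcode per input line, most frequent first --
# the kind of single-column output 'tshark -T fields' produces.
count_ops() {
    sort | uniq -c | sort -rn
}

# Hypothetical usage against a capture file (field name is an assumption):
#   tshark -r /tmp/nfs-client.pcap -Y nfs.main_opcode -T fields -e nfs.main_opcode | count_ops
```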

Analysis of vmcore

  • A vmcore will show a high load average, with most of the load average the result of tasks in 'UN' state:
crash> sys | grep LOAD
LOAD AVERAGE: 4462.05, 3818.55, 2068.32
crash> ps -S
  RU: 9
  IN: 723
  UN: 4454
  • If you look at the blocked tasks in 'UN' state, most / all of them are waiting for an rpc completion.
crash> foreach UN bt | awk '/#1 / { print $3,$5 }' | sort | uniq -c | sort -nr
   4454 rpc_wait_bit_killable ffffffffa01a7422
  • Most of the blocked processes are in either an open or close system call. For example:
crash> foreach UN bt
PID: 13292  TASK: ffff8803c6837520  CPU: 3   COMMAND: "httpd"
 #0 [ffff8803de797b48] schedule at ffffffff815299d0
 #1 [ffff8803de797c20] rpc_wait_bit_killable at ffffffffa01a7422 [sunrpc]
 #2 [ffff8803de797c30] __wait_on_bit at ffffffff8152acdf
 #3 [ffff8803de797c80] out_of_line_wait_on_bit at ffffffff8152ad88
 #4 [ffff8803de797cf0] __rpc_wait_for_completion_task at ffffffffa01a73dd [sunrpc]
 #5 [ffff8803de797d00] nfs4_do_close at ffffffffa026469d [nfs]
 #6 [ffff8803de797db0] __nfs4_close at ffffffffa02719c5 [nfs]
 #7 [ffff8803de797e00] nfs4_close_sync at ffffffffa0271a98 [nfs]
 #8 [ffff8803de797e10] nfs4_close_context at ffffffffa025eaf0 [nfs]
 #9 [ffff8803de797e20] __put_nfs_open_context at ffffffffa0242ad3 [nfs]
#10 [ffff8803de797e70] nfs_release at ffffffffa0242bb3 [nfs]
#11 [ffff8803de797ea0] nfs_file_release at ffffffffa0240fa1 [nfs]
#12 [ffff8803de797ec0] __fput at ffffffff8118fa45
#13 [ffff8803de797f10] fput at ffffffff8118fb85
#14 [ffff8803de797f20] filp_close at ffffffff8118addd
#15 [ffff8803de797f50] sys_close at ffffffff8118aeb5
#16 [ffff8803de797f80] system_call_fastpath at ffffffff8100b0f2
...
PID: 6181   TASK: ffff880437037520  CPU: 3   COMMAND: "httpd"
 #0 [ffff8804371d97d8] schedule at ffffffff815299d0
 #1 [ffff8804371d98b0] rpc_wait_bit_killable at ffffffffa01a7422 [sunrpc]
 #2 [ffff8804371d98c0] __wait_on_bit at ffffffff8152acdf
 #3 [ffff8804371d9910] out_of_line_wait_on_bit at ffffffff8152ad88
 #4 [ffff8804371d9980] __rpc_wait_for_completion_task at ffffffffa01a73dd [sunrpc]
 #5 [ffff8804371d9990] nfs4_run_open_task at ffffffffa025c6bf [nfs]
 #6 [ffff8804371d9a20] nfs4_do_open at ffffffffa025e297 [nfs]
 #7 [ffff8804371d9b30] nfs4_atomic_open at ffffffffa025e7fe [nfs]
 #8 [ffff8804371d9b40] nfs_open_revalidate at ffffffffa023fc57 [nfs]
 #9 [ffff8804371d9bf0] do_lookup at ffffffff8119dd36
#10 [ffff8804371d9c50] __link_path_walk at ffffffff8119eb04
#11 [ffff8804371d9d30] path_walk at ffffffff8119f61a
#12 [ffff8804371d9d70] filename_lookup at ffffffff8119f82b
#13 [ffff8804371d9db0] do_filp_open at ffffffff811a0d04
#14 [ffff8804371d9f20] do_sys_open at ffffffff8118af77
#15 [ffff8804371d9f70] sys_open at ffffffff8118b080
#16 [ffff8804371d9f80] system_call_fastpath at ffffffff8100b0f2
  • If you look at the outstanding NFS / RPC requests in the system, almost all of them have the following fields, which means these RPC tasks have not been sent to the NFS server but are waiting in the Seqid_waitqueue, the serialization mechanism for NFS4 OPEN / CLOSE requests:
    • tk_waitqueue.name == "Seqid_waitqueue"
    • tk_msg.rpc_proc->p_name == "OPEN" or "CLOSE"
    • tk_action == rpc_prepare_task
  • Example
  • First, select a task blocked in rpc_wait_bit_killable that is performing either an NFS4 OPEN or CLOSE:
PID: 6181   TASK: ffff880437037520  CPU: 3   COMMAND: "httpd"
 #0 [ffff8804371d97d8] schedule at ffffffff815299d0
 #1 [ffff8804371d98b0] rpc_wait_bit_killable at ffffffffa01a7422 [sunrpc]
 #2 [ffff8804371d98c0] __wait_on_bit at ffffffff8152acdf
 #3 [ffff8804371d9910] out_of_line_wait_on_bit at ffffffff8152ad88
 #4 [ffff8804371d9980] __rpc_wait_for_completion_task at ffffffffa01a73dd [sunrpc]
 #5 [ffff8804371d9990] nfs4_run_open_task at ffffffffa025c6bf [nfs]
 #6 [ffff8804371d9a20] nfs4_do_open at ffffffffa025e297 [nfs]
 #7 [ffff8804371d9b30] nfs4_atomic_open at ffffffffa025e7fe [nfs]
 #8 [ffff8804371d9b40] nfs_open_revalidate at ffffffffa023fc57 [nfs]
 #9 [ffff8804371d9bf0] do_lookup at ffffffff8119dd36
#10 [ffff8804371d9c50] __link_path_walk at ffffffff8119eb04
#11 [ffff8804371d9d30] path_walk at ffffffff8119f61a
#12 [ffff8804371d9d70] filename_lookup at ffffffff8119f82b
#13 [ffff8804371d9db0] do_filp_open at ffffffff811a0d04
#14 [ffff8804371d9f20] do_sys_open at ffffffff8118af77
#15 [ffff8804371d9f70] sys_open at ffffffff8118b080
#16 [ffff8804371d9f80] system_call_fastpath at ffffffff8100b0f2
  • Next, obtain the rpc_task pointer from the stack.
crash> print __rpc_wait_for_completion_task
$6 = {int (struct rpc_task *, int (*)(void *))} 0xffffffffa01a73b0 <__rpc_wait_for_completion_task>

crash> dis -r ffffffffa025c6bf | tail
0xffffffffa025c69c <nfs4_run_open_task+252>:    mov    %al,0x358(%rbx)
0xffffffffa025c6a2 <nfs4_run_open_task+258>:    mov    %r12,%rdi
0xffffffffa025c6a5 <nfs4_run_open_task+261>:    callq  0xffffffffa019e3f0 <rpc_run_task>
0xffffffffa025c6aa <nfs4_run_open_task+266>:    cmp    $0xfffffffffffff000,%rax
0xffffffffa025c6b0 <nfs4_run_open_task+272>:    mov    %rax,%r12  <---- rpc_task *
0xffffffffa025c6b3 <nfs4_run_open_task+275>:    ja     0xffffffffa025c6f4 <nfs4_run_open_task+340>
0xffffffffa025c6b5 <nfs4_run_open_task+277>:    xor    %esi,%esi
0xffffffffa025c6b7 <nfs4_run_open_task+279>:    mov    %rax,%rdi  <---- rpc_task *
0xffffffffa025c6ba <nfs4_run_open_task+282>:    callq  0xffffffffa01a73b0 <__rpc_wait_for_completion_task>
0xffffffffa025c6bf <nfs4_run_open_task+287>:    test   %eax,%eax
crash> dis __rpc_wait_for_completion_task | grep r12
crash> dis out_of_line_wait_on_bit | head
0xffffffff8152ad10 <out_of_line_wait_on_bit>:   push   %rbp
0xffffffff8152ad11 <out_of_line_wait_on_bit+1>: mov    %rsp,%rbp  <--- new value of %rbp based on %rsp
0xffffffff8152ad14 <out_of_line_wait_on_bit+4>: sub    $0x60,%rsp
0xffffffff8152ad18 <out_of_line_wait_on_bit+8>: mov    %rbx,-0x20(%rbp)
0xffffffff8152ad1c <out_of_line_wait_on_bit+12>:        mov    %r12,-0x18(%rbp)  <--- rpc_task stored relative to %rbp
0xffffffff8152ad20 <out_of_line_wait_on_bit+16>:        mov    %r13,-0x10(%rbp)
0xffffffff8152ad24 <out_of_line_wait_on_bit+20>:        mov    %r14,-0x8(%rbp)
0xffffffff8152ad28 <out_of_line_wait_on_bit+24>:        nopl   0x0(%rax,%rax,1)
0xffffffff8152ad2d <out_of_line_wait_on_bit+29>:        mov    %rdi,%r12
0xffffffff8152ad30 <out_of_line_wait_on_bit+32>:        mov    %esi,%ebx
crash> bt -f | grep __rpc_wait_for_completion_task
 #4 [ffff8804371d9980] __rpc_wait_for_completion_task at ffffffffa01a73dd [sunrpc]
crash> px (0xffff8804371d9980 - 8)
$4 = 0xffff8804371d9978
crash> px (0xffff8804371d9978 - 0x18)
$5 = 0xffff8804371d9960
crash> rd 0xffff8804371d9960
ffff8804371d9960:  ffff8803ca5b3a80                    .:[.....
  • Once the rpc_task address is obtained, verify it with kmem and then inspect the relevant fields:
crash> kmem 0xffff8803ca5b3a80
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff880435762dc0 rpc_tasks                256       4462      4755    317     4k
SLAB              MEMORY            TOTAL  ALLOCATED  FREE
ffff8803ca5b3000  ffff8803ca5b3080     15         15     0
FREE / [ALLOCATED]
  [ffff8803ca5b3a80]

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffea000d443f28 3ca5b3000                0        0  1 40000000000080 slab

crash> rpc_task.tk_waitqueue,tk_msg,tk_action 0xffff8803ca5b3a80
  tk_waitqueue = 0xffff880437062298
  tk_msg = {
    rpc_proc = 0xffffffffa028e140 <nfs4_procedures+224>, 
    rpc_argp = 0xffff88040114b808, 
    rpc_resp = 0xffff88040114b880, 
    rpc_cred = 0xffff880430fecd40
  }
  tk_action = 0xffffffffa01a5ae0 <rpc_prepare_task>
crash> rpc_procinfo.p_name 0xffffffffa028e140
  p_name = 0xffffffffa02832b5 "OPEN"
crash> rpc_wait_queue.name 0xffff880437062298
  name = 0xffffffffa02834e8 "Seqid_waitqueue"

Attachments

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.