NFS4.0: High load but idle CPU, leading to a complete outage of a system opening files on an NFS4 mount
Environment
- Red Hat Enterprise Linux 6 (NFS client)
- Red Hat Enterprise Linux 7 (NFS client)
- Any NFS server
  - seen with NetApp
- NFS4.0
- Any application where the same userid opens / closes a lot of files on NFS4 in a short time
  - seen with the httpd (Apache) web server
  - seen with Java processes opening and closing files serially
Issue
- The load average grows very quickly, from about 5 to more than 4000 (sometimes up to 4500). Several web server processes run on the system, one of which serves our online service. That specific web server appears to be causing the problem, because the system itself remains responsive and the web servers of other applications show no trouble, yet the online service website becomes very slow, up to a complete outage.
- The problem has happened twice within the last 7 days, both times leading to a complete outage of our online service. Each time it lasted about 3 hours and then seemed to disappear, and we do not really know why. If the problem shows up again, it is extremely urgent.
- An application using a single userid opens / closes a lot of files on NFS4 in a short period of time.
- Many processes are blocked waiting for NFS4 / RPC tasks to complete, and the RPC tasks are blocked in the Seqid_waitqueue.
Resolution
Red Hat Enterprise Linux 6 (NFS client)
- There is currently no resolution to this problem. The bug filed for RHEL 6 is closed as WONTFIX.
Red Hat Enterprise Linux 7 (NFS client)
- This issue has been resolved with the erratum RHSA-2016:2574 for the package `kernel-3.10.0-514.el7` or later.
- Originally tracked in private Red Hat Bug 1287789 - NFSv4.1 parallel OPEN feature.
- Remount the NFS share as an NFSv4.1 share to take advantage of parallel OPEN calls.
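- For example (a minimal sketch; the server name nfsserver, export /export, and mount point /mnt/data are placeholders for your environment):
# Update to a fixed kernel (kernel-3.10.0-514.el7 or later) and reboot
yum update kernel
reboot
# Changing the NFS version requires a full unmount and mount, not "mount -o remount"
umount /mnt/data
mount -t nfs -o vers=4.1 nfsserver:/export /mnt/data
# Or persist the option in /etc/fstab:
#   nfsserver:/export  /mnt/data  nfs  vers=4.1  0 0
# Verify the negotiated NFS version
nfsstat -m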
NFS server or network
- Address any underlying problem leading to increased latency for NFS4 OPEN or CLOSE operations.
Workarounds
- Use NFSv3 (verified)
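- A minimal sketch of the NFSv3 workaround, using the same placeholder server, export, and mount point as above:
umount /mnt/data
mount -t nfs -o vers=3 nfsserver:/export /mnt/data
# Or in /etc/fstab:
#   nfsserver:/export  /mnt/data  nfs  vers=3  0 0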
Root Cause
- There is a limitation in the Linux NFS4.0 client implementation: an "open_owner" is mapped to a userid. This becomes a bottleneck if one user opens and closes a lot of files in a short period of time, because each OPEN / CLOSE operation has to wait for a sequence id, which essentially serializes every OPEN / CLOSE request. If an NFS server's response time for OPEN / CLOSE requests increases due to some secondary load or complication, this NFS4 client limitation can become pronounced and, in some cases, leave the machine unresponsive (a reproduction sketch follows this list).
- The NFS4.1 protocol addresses the limitation of serialized OPENs per open_owner. For more information, see RFC 5661, Section 9.10.
- The use of parallel OPENs by an NFS client that opens and closes files in quick succession can lead to NFS server side errors, but the client should transparently recover with no issues. The server side errors are not inherently problematic; confirm with the NFS vendor to be certain.
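- The following is a minimal sketch that reproduces the access pattern described above, assuming a hypothetical NFSv4.0 mount at /mnt/data. All workers run under the same userid and therefore share one open_owner, so their OPEN / CLOSE requests are serialized on the sequence id even though the workers run in parallel:
# Create one small test file per worker (hypothetical path)
for i in $(seq 1 50); do echo x > /mnt/data/testfile.$i; done
# 50 parallel workers under a single userid, each performing 1000 open/close cycles
for i in $(seq 1 50); do
    ( for j in $(seq 1 1000); do cat /mnt/data/testfile.$i > /dev/null; done ) &
done
wait
- On an affected NFS4.0 client this pattern drives the load average up while the CPUs stay largely idle, matching the symptom described in the Issue section.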
Diagnostic Steps
General steps
- Gather a tcpdump on the NFS4 client at the time of the problem (an example capture command is sketched after this list).
- Configure kdump and gather a vmcore at the time of the problem.
- Use the attached 'monitor-seqid-waitqueue.sh' script, or something similar (a rough substitute is sketched after this list), to determine whether a lot of RPC tasks are stuck waiting on the Seqid_waitqueue.
- WARNING: Running the attached script has been seen to increase latency of RPC tasks. Thus, it is only recommended for a short time as a diagnostic, and should not be run constantly in a production environment.
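- Hedged examples for the first and third steps above. The interface name eth0 and the capture path are placeholders, and the Seqid check is only a rough substitute for the attached monitor-seqid-waitqueue.sh script (it counts tasks blocked in rpc_wait_bit_killable via /proc/<pid>/stack), so the same latency warning applies:
# Capture NFS traffic on the client while the problem is occurring (port 2049 is NFS)
tcpdump -i eth0 -s 0 -w /tmp/nfs-client.pcap port 2049
# Rough count of tasks blocked waiting for an RPC completion (run as root, only briefly)
grep -l rpc_wait_bit_killable /proc/[0-9]*/stack 2>/dev/null | wc -l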
Analysis of tcpdump
- A tcpdump should show a high number of NFS4 OPEN and CLOSE requests. Use either tshark / wireshark to count the NFS4 operations, or the attached pcap_nfsstat.sh script (a tshark sketch follows the sample output below).
NFSv4 Call Count Min Avg Max Total
ACCESS 98 0.000150 0.005951 0.151759 0.583243
CLOSE 2712 0.000464 0.004367 0.182066 11.844358
GETATTR 45 0.000146 0.003879 0.051450 0.174542
LOOKUP 6492 0.000343 0.005689 0.201797 36.930133
OPEN 5070 0.000640 0.009056 0.202305 45.914946
READ 1 0.000581 0.000581 0.000581 0.000581
RENEW 2 0.000146 0.000348 0.000551 0.000697
summary 14424 0.000000 0.006617 0.202305 95.448500
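- If the attached pcap_nfsstat.sh script is not available, a rough per-operation count can also be produced with tshark. A sketch, assuming the capture file from the earlier step; field and option names can vary between Wireshark versions (older builds use -R instead of -Y):
# Count NFSv4 operations by opcode (OPEN = 18, CLOSE = 4, LOOKUP = 15, ACCESS = 3)
tshark -r /tmp/nfs-client.pcap -Y nfs.opcode -T fields -e nfs.opcode \
    | tr ',' '\n' | sort -n | uniq -c | sort -rn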
Analysis of vmcore
- A vmcore will show a high load average, with most of the load average the result of tasks in 'UN' state:
crash> sys | grep LOAD
LOAD AVERAGE: 4462.05, 3818.55, 2068.32
crash> ps -S
RU: 9
IN: 723
UN: 4454
- If you look at the blocked tasks in 'UN' state, most / all of them are waiting for an rpc completion.
crash> foreach UN bt | awk '/#1 / { print $3,$5 }' | sort | uniq -c | sort -nr
4454 rpc_wait_bit_killable ffffffffa01a7422
- Most of the blocked processes are in either an open or close system call. For example:
crash> foreach UN bt
PID: 13292 TASK: ffff8803c6837520 CPU: 3 COMMAND: "httpd"
#0 [ffff8803de797b48] schedule at ffffffff815299d0
#1 [ffff8803de797c20] rpc_wait_bit_killable at ffffffffa01a7422 [sunrpc]
#2 [ffff8803de797c30] __wait_on_bit at ffffffff8152acdf
#3 [ffff8803de797c80] out_of_line_wait_on_bit at ffffffff8152ad88
#4 [ffff8803de797cf0] __rpc_wait_for_completion_task at ffffffffa01a73dd [sunrpc]
#5 [ffff8803de797d00] nfs4_do_close at ffffffffa026469d [nfs]
#6 [ffff8803de797db0] __nfs4_close at ffffffffa02719c5 [nfs]
#7 [ffff8803de797e00] nfs4_close_sync at ffffffffa0271a98 [nfs]
#8 [ffff8803de797e10] nfs4_close_context at ffffffffa025eaf0 [nfs]
#9 [ffff8803de797e20] __put_nfs_open_context at ffffffffa0242ad3 [nfs]
#10 [ffff8803de797e70] nfs_release at ffffffffa0242bb3 [nfs]
#11 [ffff8803de797ea0] nfs_file_release at ffffffffa0240fa1 [nfs]
#12 [ffff8803de797ec0] __fput at ffffffff8118fa45
#13 [ffff8803de797f10] fput at ffffffff8118fb85
#14 [ffff8803de797f20] filp_close at ffffffff8118addd
#15 [ffff8803de797f50] sys_close at ffffffff8118aeb5
#16 [ffff8803de797f80] system_call_fastpath at ffffffff8100b0f2
...
PID: 6181 TASK: ffff880437037520 CPU: 3 COMMAND: "httpd"
#0 [ffff8804371d97d8] schedule at ffffffff815299d0
#1 [ffff8804371d98b0] rpc_wait_bit_killable at ffffffffa01a7422 [sunrpc]
#2 [ffff8804371d98c0] __wait_on_bit at ffffffff8152acdf
#3 [ffff8804371d9910] out_of_line_wait_on_bit at ffffffff8152ad88
#4 [ffff8804371d9980] __rpc_wait_for_completion_task at ffffffffa01a73dd [sunrpc]
#5 [ffff8804371d9990] nfs4_run_open_task at ffffffffa025c6bf [nfs]
#6 [ffff8804371d9a20] nfs4_do_open at ffffffffa025e297 [nfs]
#7 [ffff8804371d9b30] nfs4_atomic_open at ffffffffa025e7fe [nfs]
#8 [ffff8804371d9b40] nfs_open_revalidate at ffffffffa023fc57 [nfs]
#9 [ffff8804371d9bf0] do_lookup at ffffffff8119dd36
#10 [ffff8804371d9c50] __link_path_walk at ffffffff8119eb04
#11 [ffff8804371d9d30] path_walk at ffffffff8119f61a
#12 [ffff8804371d9d70] filename_lookup at ffffffff8119f82b
#13 [ffff8804371d9db0] do_filp_open at ffffffff811a0d04
#14 [ffff8804371d9f20] do_sys_open at ffffffff8118af77
#15 [ffff8804371d9f70] sys_open at ffffffff8118b080
#16 [ffff8804371d9f80] system_call_fastpath at ffffffff8100b0f2
- If you look at the outstanding NFS / RPC requests in the system, almost all of them have the following fields, which means these RPC tasks have not been sent to the NFS server but are waiting in the Seqid_waitqueue, the serialization mechanism for NFS4 OPEN / CLOSE requests:
- tk_waitqueue.name == 'Seqid_waitqueue',
- tk_msg.rpc_proc->p_name == CLOSE or OPEN
- tk_action == rpc_prepare_task
- Example
- First, select a task blocked in rpc_wait_bit_killable that is performing either an NFS4 OPEN or CLOSE:
PID: 6181 TASK: ffff880437037520 CPU: 3 COMMAND: "httpd"
#0 [ffff8804371d97d8] schedule at ffffffff815299d0
#1 [ffff8804371d98b0] rpc_wait_bit_killable at ffffffffa01a7422 [sunrpc]
#2 [ffff8804371d98c0] __wait_on_bit at ffffffff8152acdf
#3 [ffff8804371d9910] out_of_line_wait_on_bit at ffffffff8152ad88
#4 [ffff8804371d9980] __rpc_wait_for_completion_task at ffffffffa01a73dd [sunrpc]
#5 [ffff8804371d9990] nfs4_run_open_task at ffffffffa025c6bf [nfs]
#6 [ffff8804371d9a20] nfs4_do_open at ffffffffa025e297 [nfs]
#7 [ffff8804371d9b30] nfs4_atomic_open at ffffffffa025e7fe [nfs]
#8 [ffff8804371d9b40] nfs_open_revalidate at ffffffffa023fc57 [nfs]
#9 [ffff8804371d9bf0] do_lookup at ffffffff8119dd36
#10 [ffff8804371d9c50] __link_path_walk at ffffffff8119eb04
#11 [ffff8804371d9d30] path_walk at ffffffff8119f61a
#12 [ffff8804371d9d70] filename_lookup at ffffffff8119f82b
#13 [ffff8804371d9db0] do_filp_open at ffffffff811a0d04
#14 [ffff8804371d9f20] do_sys_open at ffffffff8118af77
#15 [ffff8804371d9f70] sys_open at ffffffff8118b080
#16 [ffff8804371d9f80] system_call_fastpath at ffffffff8100b0f2
- Now obtain the rpc_task from the stack.
crash> print __rpc_wait_for_completion_task
$6 = {int (struct rpc_task *, int (*)(void *))} 0xffffffffa01a73b0 <__rpc_wait_for_completion_task>
crash> dis -r ffffffffa025c6bf | tail
0xffffffffa025c69c <nfs4_run_open_task+252>: mov %al,0x358(%rbx)
0xffffffffa025c6a2 <nfs4_run_open_task+258>: mov %r12,%rdi
0xffffffffa025c6a5 <nfs4_run_open_task+261>: callq 0xffffffffa019e3f0 <rpc_run_task>
0xffffffffa025c6aa <nfs4_run_open_task+266>: cmp $0xfffffffffffff000,%rax
0xffffffffa025c6b0 <nfs4_run_open_task+272>: mov %rax,%r12 <---- rpc_task *
0xffffffffa025c6b3 <nfs4_run_open_task+275>: ja 0xffffffffa025c6f4 <nfs4_run_open_task+340>
0xffffffffa025c6b5 <nfs4_run_open_task+277>: xor %esi,%esi
0xffffffffa025c6b7 <nfs4_run_open_task+279>: mov %rax,%rdi <---- rpc_task *
0xffffffffa025c6ba <nfs4_run_open_task+282>: callq 0xffffffffa01a73b0 <__rpc_wait_for_completion_task>
0xffffffffa025c6bf <nfs4_run_open_task+287>: test %eax,%eax
crash> dis __rpc_wait_for_completion_task | grep r12
crash> dis out_of_line_wait_on_bit | head
0xffffffff8152ad10 <out_of_line_wait_on_bit>: push %rbp
0xffffffff8152ad11 <out_of_line_wait_on_bit+1>: mov %rsp,%rbp <--- new value of %rbp based on %rsp
0xffffffff8152ad14 <out_of_line_wait_on_bit+4>: sub $0x60,%rsp
0xffffffff8152ad18 <out_of_line_wait_on_bit+8>: mov %rbx,-0x20(%rbp)
0xffffffff8152ad1c <out_of_line_wait_on_bit+12>: mov %r12,-0x18(%rbp) <--- rpc_task stored relative to %rbp
0xffffffff8152ad20 <out_of_line_wait_on_bit+16>: mov %r13,-0x10(%rbp)
0xffffffff8152ad24 <out_of_line_wait_on_bit+20>: mov %r14,-0x8(%rbp)
0xffffffff8152ad28 <out_of_line_wait_on_bit+24>: nopl 0x0(%rax,%rax,1)
0xffffffff8152ad2d <out_of_line_wait_on_bit+29>: mov %rdi,%r12
0xffffffff8152ad30 <out_of_line_wait_on_bit+32>: mov %esi,%ebx
crash> bt -f | grep __rpc_wait_for_completion_task
#4 [ffff8804371d9980] __rpc_wait_for_completion_task at ffffffffa01a73dd [sunrpc]
crash> px (0xffff8804371d9980 - 8)
$4 = 0xffff8804371d9978
crash> px (0xffff8804371d9978 - 0x18)
$5 = 0xffff8804371d9960
crash> rd 0xffff8804371d9960
ffff8804371d9960: ffff8803ca5b3a80 .:[.....
- Once obtained, verify the address with kmem and examine the relevant rpc_task fields:
crash> kmem 0xffff8803ca5b3a80
CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE
ffff880435762dc0 rpc_tasks 256 4462 4755 317 4k
SLAB MEMORY TOTAL ALLOCATED FREE
ffff8803ca5b3000 ffff8803ca5b3080 15 15 0
FREE / [ALLOCATED]
[ffff8803ca5b3a80]
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffea000d443f28 3ca5b3000 0 0 1 40000000000080 slab
crash> rpc_task.tk_waitqueue,tk_msg,tk_action 0xffff8803ca5b3a80
tk_waitqueue = 0xffff880437062298
tk_msg = {
rpc_proc = 0xffffffffa028e140 <nfs4_procedures+224>,
rpc_argp = 0xffff88040114b808,
rpc_resp = 0xffff88040114b880,
rpc_cred = 0xffff880430fecd40
}
tk_action = 0xffffffffa01a5ae0 <rpc_prepare_task>
crash> rpc_procinfo.p_name 0xffffffffa028e140
p_name = 0xffffffffa02832b5 "OPEN"
crash> rpc_wait_queue.name 0xffff880437062298
name = 0xffffffffa02834e8 "Seqid_waitqueue"
Attachments