NFS Client hang experienced with IPoverIB setup

Solution In Progress - Updated -

Issue

  • NFS client hangs on an IP over InfiniBand (IPoIB) setup with Mellanox OFED 4.4, 4.5, and 4.6
  • NFS tasks hang while the TCP socket to the server is in the CLOSE_WAIT state
crash> ps -S
  RU: 29
  IN: 1343
  WA: 1
  UN: 11

All blocked tasks on the system were waiting on NFS (iperf tests over TCP on the same setup completed successfully without hanging).

crash> ps -m | grep UN
[0 00:00:00.000] [UN]  PID: 147171  TASK: ffff898304098010  CPU: 18  COMMAND: "ssh"
[0 00:00:03.077] [UN]  PID: 148379  TASK: ffff89bcb54757e0  CPU: 4   COMMAND: "df"
[0 00:00:26.490] [UN]  PID: 148336  TASK: ffff89a174f00010  CPU: 3   COMMAND: "bash"
[0 00:00:51.093] [UN]  PID: 148196  TASK: ffff8995363f0010  CPU: 2   COMMAND: "ls"
[0 00:08:27.989] [UN]  PID: 146168  TASK: ffff89bb48da34c0  CPU: 2   COMMAND: "dd"
[0 00:08:42.673] [UN]  PID: 145899  TASK: ffff89bb48da4650  CPU: 6   COMMAND: "dd"
[0 00:08:42.959] [UN]  PID: 147211  TASK: ffff89a442e857e0  CPU: 5   COMMAND: "df"
[0 00:08:44.780] [UN]  PID: 147075  TASK: ffff899e57af91a0  CPU: 27  COMMAND: "agetit"
[0 00:08:45.143] [UN]  PID: 146620  TASK: ffff89c1b7ecb4c0  CPU: 22  COMMAND: "agetit"
[0 00:08:45.107] [UN]  PID: 147112  TASK: ffff89c0838dc650  CPU: 13  COMMAND: "chk_file"
[0 00:08:46.742] [UN]  PID: 147007  TASK: ffff89c27fed34c0  CPU: 1   COMMAND: "dd"          << Oldest D state task 
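
The oldest blocked task can also be picked out of saved `ps -m` output mechanically. The following is a minimal Python sketch, assuming the output has been saved as text in the format shown above; the function name `oldest_blocked_task` and the `SAMPLE` constant are illustrative and not part of the crash utility. The bracketed timestamp is treated as the time since the task last ran.

```python
import re

# Matches one line of crash "ps -m" output, for example:
# [0 00:08:46.742] [UN]  PID: 147007  TASK: ffff89c27fed34c0  CPU: 1   COMMAND: "dd"
PS_M_LINE = re.compile(
    r'\[(\d+) (\d+):(\d+):(\d+)\.(\d+)\]\s+\[(\w+)\]\s+PID:\s+(\d+)\s+'
    r'TASK:\s+([0-9a-f]+)\s+CPU:\s+(\d+)\s+COMMAND:\s+"([^"]*)"'
)

# Three lines copied from the output above, used only as sample input.
SAMPLE = '''
[0 00:00:03.077] [UN]  PID: 148379  TASK: ffff89bcb54757e0  CPU: 4   COMMAND: "df"
[0 00:08:42.673] [UN]  PID: 145899  TASK: ffff89bb48da4650  CPU: 6   COMMAND: "dd"
[0 00:08:46.742] [UN]  PID: 147007  TASK: ffff89c27fed34c0  CPU: 1   COMMAND: "dd"
'''


def oldest_blocked_task(ps_m_output, state="UN"):
    """Return (elapsed_ms, pid, command) for the longest-waiting task in
    the given state, or None if no line matches."""
    best = None
    for line in ps_m_output.splitlines():
        m = PS_M_LINE.search(line)
        if not m or m.group(6) != state:
            continue
        days, hours, minutes, seconds, millis = (int(m.group(i)) for i in range(1, 6))
        # Convert "days hh:mm:ss.msec" to integer milliseconds.
        elapsed_ms = (((days * 24 + hours) * 60 + minutes) * 60 + seconds) * 1000 + millis
        candidate = (elapsed_ms, int(m.group(7)), m.group(10))
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best


if __name__ == "__main__":
    print(oldest_blocked_task(SAMPLE))
```

For the sample lines this selects PID 147007 ("dd"), the task marked as the oldest D state task above.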

== Oldest hung task 
crash> bt
PID: 147007  TASK: ffff89c27fed34c0  CPU: 1   COMMAND: "dd"
 #0 [ffff89c1bcf4fb78] __schedule at ffffffffa8968972
 #1 [ffff89c1bcf4fc00] schedule at ffffffffa8968e19
 #2 [ffff89c1bcf4fc10] rpc_wait_bit_killable at ffffffffc0510f14 [sunrpc]
 #3 [ffff89c1bcf4fc30] __wait_on_bit at ffffffffa8966a97
 #4 [ffff89c1bcf4fc70] out_of_line_wait_on_bit at ffffffffa8966c01
 #5 [ffff89c1bcf4fce8] __rpc_wait_for_completion_task at ffffffffc0510eed [sunrpc]
 #6 [ffff89c1bcf4fcf8] nfs4_do_close at ffffffffc1470527 [nfsv4]
 #7 [ffff89c1bcf4fda8] __nfs4_close at ffffffffc148123d [nfsv4]
 #8 [ffff89c1bcf4fde8] nfs4_close_sync at ffffffffc14822f8 [nfsv4]
 #9 [ffff89c1bcf4fdf8] nfs4_close_context at ffffffffc1463e7d [nfsv4]
#10 [ffff89c1bcf4fe08] __put_nfs_open_context at ffffffffc06632df [nfs]
#11 [ffff89c1bcf4fe48] nfs_file_clear_open_context at ffffffffc0665514 [nfs]
#12 [ffff89c1bcf4fe78] nfs_file_release at ffffffffc066101b [nfs]
#13 [ffff89c1bcf4fe98] __fput at ffffffffa8443b4c
#14 [ffff89c1bcf4fee0] ____fput at ffffffffa8443dae
#15 [ffff89c1bcf4fef0] task_work_run at ffffffffa82be88b
#16 [ffff89c1bcf4ff30] do_notify_resume at ffffffffa822bc65
#17 [ffff89c1bcf4ff50] int_signal at ffffffffa8976134
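
The CLOSE_WAIT symptom can also be checked on a live client (as opposed to a vmcore) by reading `/proc/net/tcp`. The following is a minimal Python sketch under stated assumptions: the NFS server listens on the default port 2049, the connection is IPv4 (IPv6 sockets are listed in `/proc/net/tcp6`), and the helper name `close_wait_to_nfs` is hypothetical. In `/proc/net/tcp`, addresses and ports are hexadecimal and state `08` is CLOSE_WAIT.

```python
NFS_PORT = 2049        # assumption: the server uses the default NFS port
TCP_CLOSE_WAIT = 0x08  # CLOSE_WAIT state code as shown in /proc/net/tcp


def close_wait_to_nfs(proc_net_tcp_text, port=NFS_PORT):
    """Return (local, remote) hex address pairs for sockets that are in
    CLOSE_WAIT and whose remote port equals `port`."""
    hits = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        remote_port = int(remote.rsplit(":", 1)[1], 16)
        if int(state, 16) == TCP_CLOSE_WAIT and remote_port == port:
            hits.append((local, remote))
    return hits


if __name__ == "__main__":
    with open("/proc/net/tcp") as f:
        for local, remote in close_wait_to_nfs(f.read()):
            print(local, "->", remote)
```

A non-empty result while NFS tasks are in D state would be consistent with the symptom described in this article; an empty result does not rule it out.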

Environment

  • RHEL 7.4
  • NFSv3, NFSv4
  • MLNX_OFED_LINUX-4.4-2.0.7.0 (OFED-4.4-2.0.7)
  • MLNX_OFED_LINUX-4.5-1.0.1.0 (OFED-4.5-1.0.1)
  • MLNX_OFED_LINUX-4.6-1.0.1.1 (OFED-4.6-1.0.1)
  • RDMA is also in use in the setup but is not directly related to the issue.
  • GPFS
