NFS client hang with IP over InfiniBand (IPoIB) setup

Solution In Progress - Updated -

Issue

  • NFS client hangs on an IP over InfiniBand (IPoIB) setup with Mellanox OFED 4.4, 4.5, and 4.6
  • NFS tasks hang while the TCP socket is in the CLOSE_WAIT state
crash> ps -S
  RU: 29
  IN: 1343
  WA: 1
  UN: 11

All blocked tasks on the system were NFS-related (iperf tests over TCP completed successfully without hanging).

crash> ps -m | grep UN
[0 00:00:00.000] [UN]  PID: 147171  TASK: ffff898304098010  CPU: 18  COMMAND: "ssh"
[0 00:00:03.077] [UN]  PID: 148379  TASK: ffff89bcb54757e0  CPU: 4   COMMAND: "df"
[0 00:00:26.490] [UN]  PID: 148336  TASK: ffff89a174f00010  CPU: 3   COMMAND: "bash"
[0 00:00:51.093] [UN]  PID: 148196  TASK: ffff8995363f0010  CPU: 2   COMMAND: "ls"
[0 00:08:27.989] [UN]  PID: 146168  TASK: ffff89bb48da34c0  CPU: 2   COMMAND: "dd"
[0 00:08:42.673] [UN]  PID: 145899  TASK: ffff89bb48da4650  CPU: 6   COMMAND: "dd"
[0 00:08:42.959] [UN]  PID: 147211  TASK: ffff89a442e857e0  CPU: 5   COMMAND: "df"
[0 00:08:44.780] [UN]  PID: 147075  TASK: ffff899e57af91a0  CPU: 27  COMMAND: "agetit"
[0 00:08:45.143] [UN]  PID: 146620  TASK: ffff89c1b7ecb4c0  CPU: 22  COMMAND: "agetit"
[0 00:08:45.107] [UN]  PID: 147112  TASK: ffff89c0838dc650  CPU: 13  COMMAND: "chk_file"
[0 00:08:46.742] [UN]  PID: 147007  TASK: ffff89c27fed34c0  CPU: 1   COMMAND: "dd"          << Oldest D state task 
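
When the `ps -m` listing is long, picking out the oldest uninterruptible task by eye is error-prone. The following is a minimal sketch, assuming the output has been saved as text and that the bracketed timestamp has the form `[days hh:mm:ss.ms]` as in the listing above. The names `parse_ps_m` and `oldest_blocked` are hypothetical helpers and are not part of the crash utility.

```python
import re

# Matches one line of `crash> ps -m` output, for example:
# [0 00:08:46.742] [UN]  PID: 147007  TASK: ffff89c27fed34c0  CPU: 1   COMMAND: "dd"
PS_M_LINE = re.compile(
    r'\[\s*(\d+) (\d+):(\d+):(\d+)\.(\d+)\]\s+\[(\w+)\]\s+PID:\s+(\d+)\s+'
    r'TASK:\s+([0-9a-f]+)\s+CPU:\s+(\d+)\s+COMMAND:\s+"([^"]*)"'
)


def parse_ps_m(text):
    """Return a list of (seconds_since_last_run, state, pid, command) tuples."""
    tasks = []
    for line in text.splitlines():
        m = PS_M_LINE.search(line)
        if not m:
            continue  # skip prompts, annotations, and blank lines
        days, hours, minutes, seconds, millis = (int(m.group(i)) for i in range(1, 6))
        idle = ((days * 24 + hours) * 60 + minutes) * 60 + seconds + millis / 1000.0
        tasks.append((idle, m.group(6), int(m.group(7)), m.group(10)))
    return tasks


def oldest_blocked(text):
    """Return the UN (uninterruptible) task that has waited longest, or None."""
    blocked = [t for t in parse_ps_m(text) if t[1] == "UN"]
    return max(blocked, key=lambda t: t[0], default=None)


# Three lines copied from the listing above.
SAMPLE = '''
[0 00:00:03.077] [UN]  PID: 148379  TASK: ffff89bcb54757e0  CPU: 4   COMMAND: "df"
[0 00:08:45.107] [UN]  PID: 147112  TASK: ffff89c0838dc650  CPU: 13  COMMAND: "chk_file"
[0 00:08:46.742] [UN]  PID: 147007  TASK: ffff89c27fed34c0  CPU: 1   COMMAND: "dd"
'''

if __name__ == "__main__":
    print(oldest_blocked(SAMPLE))
```

Selecting by the parsed timestamp instead of by line position avoids relying on the sort order of the listing, which is not strictly ordered in the output above (see PIDs 146620 and 147112).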

== Oldest hung task 
crash> bt
PID: 147007  TASK: ffff89c27fed34c0  CPU: 1   COMMAND: "dd"
 #0 [ffff89c1bcf4fb78] __schedule at ffffffffa8968972
 #1 [ffff89c1bcf4fc00] schedule at ffffffffa8968e19
 #2 [ffff89c1bcf4fc10] rpc_wait_bit_killable at ffffffffc0510f14 [sunrpc]
 #3 [ffff89c1bcf4fc30] __wait_on_bit at ffffffffa8966a97
 #4 [ffff89c1bcf4fc70] out_of_line_wait_on_bit at ffffffffa8966c01
 #5 [ffff89c1bcf4fce8] __rpc_wait_for_completion_task at ffffffffc0510eed [sunrpc]
 #6 [ffff89c1bcf4fcf8] nfs4_do_close at ffffffffc1470527 [nfsv4]
 #7 [ffff89c1bcf4fda8] __nfs4_close at ffffffffc148123d [nfsv4]
 #8 [ffff89c1bcf4fde8] nfs4_close_sync at ffffffffc14822f8 [nfsv4]
 #9 [ffff89c1bcf4fdf8] nfs4_close_context at ffffffffc1463e7d [nfsv4]
#10 [ffff89c1bcf4fe08] __put_nfs_open_context at ffffffffc06632df [nfs]
#11 [ffff89c1bcf4fe48] nfs_file_clear_open_context at ffffffffc0665514 [nfs]
#12 [ffff89c1bcf4fe78] nfs_file_release at ffffffffc066101b [nfs]
#13 [ffff89c1bcf4fe98] __fput at ffffffffa8443b4c
#14 [ffff89c1bcf4fee0] ____fput at ffffffffa8443dae
#15 [ffff89c1bcf4fef0] task_work_run at ffffffffa82be88b
#16 [ffff89c1bcf4ff30] do_notify_resume at ffffffffa822bc65
#17 [ffff89c1bcf4ff50] int_signal at ffffffffa8976134

Environment

  • RHEL 7.4
  • NFSv3, NFSv4
  • MLNX_OFED_LINUX-4.4-2.0.7.0 (OFED-4.4-2.0.7)
  • MLNX_OFED_LINUX-4.5-1.0.1.0 (OFED-4.5-1.0.1)
  • MLNX_OFED_LINUX-4.6-1.0.1.1 (OFED-4.6-1.0.1)
  • RDMA is also present in the setup but is not directly related to the issue
  • GPFS
