RHEL7: NFS4 client hangs with NFS4 WRITE sent with NFS4ERR_STALE_STATEID (10023) error code

Solution In Progress

Issue

An NFS file system became unavailable on the NFSv4 client.
The client unexpectedly logged a bad sequence-id error:

Dec 23 02:00:37 foo kernel: NFS: v4 server nfs.example.com  returned a bad sequence-id error!

Three minutes later, hung task messages for a "tee" command began:

Dec 23 02:03:27 foo kernel: INFO: task tee:28212 blocked for more than 120 seconds.
Dec 23 02:03:27 foo kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 23 02:03:27 foo kernel: tee             D ffff881032d04500     0 28212  28197 0x00000000
Dec 23 02:03:27 foo kernel: ffff8808528e3bf0 0000000000000086 ffff88083521c500 ffff8808528e3fd8
Dec 23 02:03:27 foo kernel: ffff8808528e3fd8 ffff8808528e3fd8 ffff88083521c500 ffff88105f094780
Dec 23 02:03:27 foo kernel: 0000000000000000 7fffffffffffffff ffffffff81168960 ffff8808528e3d50
Dec 23 02:03:27 foo kernel: Call Trace:
Dec 23 02:03:27 foo kernel: [<ffffffff81168960>] ? wait_on_page_read+0x60/0x60
Dec 23 02:03:27 foo kernel: [<ffffffff8163ae29>] schedule+0x29/0x70
Dec 23 02:03:27 foo kernel: [<ffffffff81638b19>] schedule_timeout+0x209/0x2d0
Dec 23 02:03:27 foo kernel: [<ffffffff8101c829>] ? read_tsc+0x9/0x10
Dec 23 02:03:27 foo kernel: [<ffffffff81168960>] ? wait_on_page_read+0x60/0x60
Dec 23 02:03:27 foo kernel: [<ffffffff8163a45e>] io_schedule_timeout+0xae/0x130
Dec 23 02:03:27 foo kernel: [<ffffffff8163a4f8>] io_schedule+0x18/0x20
Dec 23 02:03:27 foo kernel: [<ffffffff8116896e>] sleep_on_page+0xe/0x20
Dec 23 02:03:27 foo kernel: [<ffffffff81638ca0>] __wait_on_bit+0x60/0x90
Dec 23 02:03:27 foo kernel: [<ffffffff811686f6>] wait_on_page_bit+0x86/0xb0
Dec 23 02:03:27 foo kernel: [<ffffffff810a6b40>] ? wake_atomic_t_function+0x40/0x40
Dec 23 02:03:27 foo kernel: [<ffffffff81168831>] filemap_fdatawait_range+0x111/0x1b0
Dec 23 02:03:27 foo kernel: [<ffffffff8117598e>] ? do_writepages+0x1e/0x40
Dec 23 02:03:27 foo kernel: [<ffffffff8116a735>] ? __filemap_fdatawrite_range+0x65/0x80
Dec 23 02:03:27 foo kernel: [<ffffffff8116a85f>] filemap_write_and_wait_range+0x3f/0x70
Dec 23 02:03:27 foo kernel: [<ffffffffa0a221ef>] nfs4_file_fsync+0x5f/0xa0 [nfsv4]
Dec 23 02:03:27 foo kernel: [<ffffffff8120f7cb>] vfs_fsync+0x2b/0x40
Dec 23 02:03:27 foo kernel: [<ffffffffa09c9f0a>] nfs_file_flush+0x7a/0xb0 [nfs]
Dec 23 02:03:27 foo kernel: [<ffffffff811dc274>] filp_close+0x34/0x80
Dec 23 02:03:27 foo kernel: [<ffffffff811fcbc8>] __close_fd+0x78/0xa0
Dec 23 02:03:27 foo kernel: [<ffffffff811dd983>] SyS_close+0x23/0x50
Dec 23 02:03:27 foo kernel: [<ffffffff81645e89>] system_call_fastpath+0x16/0x1b
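
When triaging a log like the one above, it can help to pull out only the hung-task reports. The following is a minimal Python sketch and is not part of the original report; the function name find_hung_tasks and the regular expression are illustrative assumptions based solely on the message format shown above.

```python
import re

# Pattern derived from the hung-task message format shown in the log above.
# This is an assumption for illustration, not an official kernel log grammar.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return a list of (command, pid, seconds) for each hung-task message."""
    found = []
    for line in lines:
        m = HUNG_TASK_RE.search(line)
        if m:
            found.append(
                (m.group("comm"), int(m.group("pid")), int(m.group("secs")))
            )
    return found


# Sample lines copied from the log excerpt above.
sample = [
    "Dec 23 02:03:27 foo kernel: INFO: task tee:28212 blocked for more than 120 seconds.",
    "Dec 23 02:03:27 foo kernel: [<ffffffff8163ae29>] schedule+0x29/0x70",
]
print(find_hung_tasks(sample))  # [('tee', 28212, 120)]
```

In practice the lines would come from a saved copy of the system log rather than a hard-coded list.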

Some time later, an "ls" command was run against the NFS mount, but it did not return.
The NFS service was then restarted on the NFS server, but this did not resolve the problem:

Dec 23 03:16:34 nfs.example.com systemd: Stopping NFS server and services...
Dec 23 03:16:34 nfs.example.com kernel: nfsd: last server has exited, flushing export cache
Dec 23 03:16:34 nfs.example.com systemd: Stopping NFSv4 ID-name mapping service...
Dec 23 03:16:34 nfs.example.com systemd: Started Kernel Module supporting RPCSEC_GSS.
Dec 23 03:16:34 nfs.example.com systemd: Started RPC security service for NFS server.
Dec 23 03:16:34 nfs.example.com systemd: Started RPC security service for NFS client and server.
Dec 23 03:16:34 nfs.example.com systemd: Stopping NFS Mount Daemon...
Dec 23 03:16:34 nfs.example.com rpc.mountd[2426]: Caught signal 15, un-registering and exiting.
Dec 23 03:16:34 nfs.example.com systemd: Starting NFSv4 ID-name mapping service...
Dec 23 03:16:34 nfs.example.com systemd: Starting NFS Mount Daemon...
Dec 23 03:16:34 nfs.example.com systemd: Started NFSv4 ID-name mapping service.
Dec 23 03:16:34 nfs.example.com rpc.mountd[50572]: Version 1.3.0 starting
Dec 23 03:16:34 nfs.example.com systemd: Started NFS Mount Daemon.
Dec 23 03:16:34 nfs.example.com systemd: Starting NFS server and services...
Dec 23 03:16:34 nfs.example.com kernel: NFSD: starting 90-second grace period (net ffffffff81a25e00)
Dec 23 03:16:34 nfs.example.com systemd: Started NFS server and services.
Dec 23 03:16:34 nfs.example.com systemd: Starting Notify NFS peers of a restart...
Dec 23 03:16:34 nfs.example.com sm-notify[50713]: Version 1.3.0 starting
Dec 23 03:16:34 nfs.example.com sm-notify[50713]: Already notifying clients; Exiting!
Dec 23 03:16:34 nfs.example.com systemd: Started Notify NFS peers of a restart.
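
The server log above reports a 90-second grace period starting at the restart. As a rough aid for lining up client and server timelines, the sketch below computes when that grace period would end, assuming the logged duration is accurate. The function name grace_period_end and the placeholder year are assumptions made for illustration, since syslog timestamps in this format carry no year.

```python
import re
from datetime import datetime, timedelta

# Pattern derived from the NFSD message shown in the server log above.
GRACE_RE = re.compile(r"NFSD: starting (\d+)-second grace period")


def grace_period_end(line, year=2000):
    """Return the datetime at which the logged grace period ends, or None.

    `year` is an assumed placeholder because the syslog timestamp omits it.
    """
    m = GRACE_RE.search(line)
    if not m:
        return None
    # The first 15 characters hold the timestamp, e.g. "Dec 23 03:16:34".
    stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    return stamp + timedelta(seconds=int(m.group(1)))


# Line copied from the server log excerpt above.
line = (
    "Dec 23 03:16:34 nfs.example.com kernel: "
    "NFSD: starting 90-second grace period (net ffffffff81a25e00)"
)
print(grace_period_end(line).strftime("%b %d %H:%M:%S"))  # Dec 23 03:18:04
```

The month abbreviation is parsed with the current locale, so this sketch assumes an English or C locale.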

Environment

  • Red Hat Enterprise Linux 7.2
    • seen on kernel-3.10.0-327.13.1.el7
    • nfs-utils-1.3.0-0.21.el7
