RHEL7: NFS4 client hangs with NFS4 WRITE sent with NFS4ERR_STALE_STATEID (10023) error code
Issue
An NFS file system became unavailable on the NFSv4 client.
The client first logged a bad sequence-id error:
Dec 23 02:00:37 foo kernel: NFS: v4 server nfs.example.com returned a bad sequence-id error!
About three minutes later, hung task messages for a "tee" command began:
Dec 23 02:03:27 foo kernel: INFO: task tee:28212 blocked for more than 120 seconds.
Dec 23 02:03:27 foo kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 23 02:03:27 foo kernel: tee D ffff881032d04500 0 28212 28197 0x00000000
Dec 23 02:03:27 foo kernel: ffff8808528e3bf0 0000000000000086 ffff88083521c500 ffff8808528e3fd8
Dec 23 02:03:27 foo kernel: ffff8808528e3fd8 ffff8808528e3fd8 ffff88083521c500 ffff88105f094780
Dec 23 02:03:27 foo kernel: 0000000000000000 7fffffffffffffff ffffffff81168960 ffff8808528e3d50
Dec 23 02:03:27 foo kernel: Call Trace:
Dec 23 02:03:27 foo kernel: [<ffffffff81168960>] ? wait_on_page_read+0x60/0x60
Dec 23 02:03:27 foo kernel: [<ffffffff8163ae29>] schedule+0x29/0x70
Dec 23 02:03:27 foo kernel: [<ffffffff81638b19>] schedule_timeout+0x209/0x2d0
Dec 23 02:03:27 foo kernel: [<ffffffff8101c829>] ? read_tsc+0x9/0x10
Dec 23 02:03:27 foo kernel: [<ffffffff81168960>] ? wait_on_page_read+0x60/0x60
Dec 23 02:03:27 foo kernel: [<ffffffff8163a45e>] io_schedule_timeout+0xae/0x130
Dec 23 02:03:27 foo kernel: [<ffffffff8163a4f8>] io_schedule+0x18/0x20
Dec 23 02:03:27 foo kernel: [<ffffffff8116896e>] sleep_on_page+0xe/0x20
Dec 23 02:03:27 foo kernel: [<ffffffff81638ca0>] __wait_on_bit+0x60/0x90
Dec 23 02:03:27 foo kernel: [<ffffffff811686f6>] wait_on_page_bit+0x86/0xb0
Dec 23 02:03:27 foo kernel: [<ffffffff810a6b40>] ? wake_atomic_t_function+0x40/0x40
Dec 23 02:03:27 foo kernel: [<ffffffff81168831>] filemap_fdatawait_range+0x111/0x1b0
Dec 23 02:03:27 foo kernel: [<ffffffff8117598e>] ? do_writepages+0x1e/0x40
Dec 23 02:03:27 foo kernel: [<ffffffff8116a735>] ? __filemap_fdatawrite_range+0x65/0x80
Dec 23 02:03:27 foo kernel: [<ffffffff8116a85f>] filemap_write_and_wait_range+0x3f/0x70
Dec 23 02:03:27 foo kernel: [<ffffffffa0a221ef>] nfs4_file_fsync+0x5f/0xa0 [nfsv4]
Dec 23 02:03:27 foo kernel: [<ffffffff8120f7cb>] vfs_fsync+0x2b/0x40
Dec 23 02:03:27 foo kernel: [<ffffffffa09c9f0a>] nfs_file_flush+0x7a/0xb0 [nfs]
Dec 23 02:03:27 foo kernel: [<ffffffff811dc274>] filp_close+0x34/0x80
Dec 23 02:03:27 foo kernel: [<ffffffff811fcbc8>] __close_fd+0x78/0xa0
Dec 23 02:03:27 foo kernel: [<ffffffff811dd983>] SyS_close+0x23/0x50
Dec 23 02:03:27 foo kernel: [<ffffffff81645e89>] system_call_fastpath+0x16/0x1b
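To check whether a saved client log shows the same two symptoms, the messages above can be matched with a short script. This is a minimal sketch, not a Red Hat tool: the function names `parse_timestamp` and `find_symptoms` and the `year` parameter are hypothetical, and the patterns are copied only from the two messages quoted in this article, so other kernel versions may word them differently. Syslog timestamps carry no year, so a placeholder year is assumed; it only matters if the log spans a year boundary.

```python
import re
from datetime import datetime

# Patterns taken from the two client messages quoted in this article.
BAD_SEQID = re.compile(r"NFS: v4 server (\S+) returned a bad sequence-id error!")
HUNG_TASK = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds\.")


def parse_timestamp(line, year=2000):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp.

    The year is an assumption, because syslog lines do not record it.
    """
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def find_symptoms(lines, year=2000):
    """Return (timestamp, kind, detail) tuples for matching log lines."""
    events = []
    for line in lines:
        match = BAD_SEQID.search(line)
        if match:
            # detail is the NFS server name reported by the client
            events.append((parse_timestamp(line, year), "bad_seqid", match.group(1)))
            continue
        match = HUNG_TASK.search(line)
        if match:
            # detail is "command:pid" of the blocked task
            detail = f"{match.group(1)}:{match.group(2)}"
            events.append((parse_timestamp(line, year), "hung_task", detail))
    return events


sample = [
    "Dec 23 02:00:37 foo kernel: NFS: v4 server nfs.example.com returned a bad sequence-id error!",
    "Dec 23 02:03:27 foo kernel: INFO: task tee:28212 blocked for more than 120 seconds.",
]
events = find_symptoms(sample)
gap = events[1][0] - events[0][0]
print(gap.total_seconds())  # 170.0
```

On the two sample lines the gap is 170 seconds, which is the "about three minutes" between the bad sequence-id error and the first hung task message. To scan a real log, pass an open file object instead of `sample`.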
Some time later, an "ls" command run against the NFS mount did not return.
The NFS service was then restarted on the NFS server, but this did not resolve the problem:
Dec 23 03:16:34 nfs.example.com systemd: Stopping NFS server and services...
Dec 23 03:16:34 nfs.example.com kernel: nfsd: last server has exited, flushing export cache
Dec 23 03:16:34 nfs.example.com systemd: Stopping NFSv4 ID-name mapping service...
Dec 23 03:16:34 nfs.example.com systemd: Started Kernel Module supporting RPCSEC_GSS.
Dec 23 03:16:34 nfs.example.com systemd: Started RPC security service for NFS server.
Dec 23 03:16:34 nfs.example.com systemd: Started RPC security service for NFS client and server.
Dec 23 03:16:34 nfs.example.com systemd: Stopping NFS Mount Daemon...
Dec 23 03:16:34 nfs.example.com rpc.mountd[2426]: Caught signal 15, un-registering and exiting.
Dec 23 03:16:34 nfs.example.com systemd: Starting NFSv4 ID-name mapping service...
Dec 23 03:16:34 nfs.example.com systemd: Starting NFS Mount Daemon...
Dec 23 03:16:34 nfs.example.com systemd: Started NFSv4 ID-name mapping service.
Dec 23 03:16:34 nfs.example.com rpc.mountd[50572]: Version 1.3.0 starting
Dec 23 03:16:34 nfs.example.com systemd: Started NFS Mount Daemon.
Dec 23 03:16:34 nfs.example.com systemd: Starting NFS server and services...
Dec 23 03:16:34 nfs.example.com kernel: NFSD: starting 90-second grace period (net ffffffff81a25e00)
Dec 23 03:16:34 nfs.example.com systemd: Started NFS server and services.
Dec 23 03:16:34 nfs.example.com systemd: Starting Notify NFS peers of a restart...
Dec 23 03:16:34 nfs.example.com sm-notify[50713]: Version 1.3.0 starting
Dec 23 03:16:34 nfs.example.com sm-notify[50713]: Already notifying clients; Exiting!
Dec 23 03:16:34 nfs.example.com systemd: Started Notify NFS peers of a restart.
Environment
- Red Hat Enterprise Linux 7.2
- seen on kernel-3.10.0-327.13.1.el7
- nfs-utils-1.3.0-0.21.el7