RHEL6: NFS4 WRITE continuously sent and completing with NFS4ERR_BAD_STATEID (10025) with NetApp due to multiple filehandles for same file
Issue
- A hung task timeout and/or panic occurs. The process that triggers it is closing a file on NFS, flushing dirty pages, and waiting for page writeback to complete.
- The hung task backtrace is similar to the following:
INFO: task foo:4347 blocked for more than 720 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
foo D 0000000000000012 0 4347 798 0x00000080
ffff88101240fc78 0000000000000082 0000000000000000 ffff88020acfd840
ffff88101240fd08 ffffffff8112dd37 ffff88101240fc58 0000000000000282
ffff881012791098 ffff88101240ffd8 000000000000fb88 ffff881012791098
Call Trace:
[<ffffffff8150e3e3>] io_schedule+0x73/0xc0
[<ffffffff81119d3d>] sync_page+0x3d/0x50
[<ffffffff8150ed9f>] __wait_on_bit+0x5f/0x90
[<ffffffff81119f73>] wait_on_page_bit+0x73/0x80
[<ffffffff8111a39b>] wait_on_page_writeback_range+0xfb/0x190
[<ffffffff8111a568>] filemap_write_and_wait_range+0x78/0x90
[<ffffffff811b1ace>] vfs_fsync_range+0x7e/0xe0
[<ffffffff811b1b9d>] vfs_fsync+0x1d/0x20
[<ffffffffa03a6670>] nfs_file_flush+0x70/0xa0 [nfs]
...
- Just before the process blocks, the following message is sometimes logged (error -10026 is NFS4ERR_BAD_SEQID):
nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
- In addition, some time before the problem occurs, one or more bad sequence-id messages may be logged:
NFS: v4 server nfs-server returned a bad sequence-id error!
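To check whether a client's logs show the symptoms listed above, something like the following could be used. This is only a minimal sketch: the patterns are copied from the messages quoted in this section, while the function name `scan_log` and the symptom labels are hypothetical names chosen for illustration. A match indicates the symptom text is present; it does not by itself confirm this specific issue.

```python
import re

# Patterns derived from the kernel messages quoted in the Issue section.
# The dictionary keys are illustrative labels, not official names.
SYMPTOMS = {
    "hung_task": re.compile(
        r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"
    ),
    "reclaim_error": re.compile(
        r"nfs4_reclaim_open_state: unhandled error (-\d+)"
    ),
    "bad_seqid": re.compile(
        r"NFS: v4 server (\S+) returned a bad sequence-id error"
    ),
}


def scan_log(lines):
    """Return a dict mapping each symptom label to its matching log lines."""
    hits = {name: [] for name in SYMPTOMS}
    for line in lines:
        for name, pattern in SYMPTOMS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits
```

The function accepts any iterable of lines, so an open file handle for the system log (commonly /var/log/messages on RHEL 6, assuming default syslog settings) can be passed directly.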
Environment
- Red Hat Enterprise Linux 6 (NFS Client)
- seen on kernels 2.6.32-358.12.1.el6 and 2.6.32-431.5.1.el6
- other kernels likely affected
- NetApp (NFS Server)
- ONTAP 8.1.2P4
- delegations disabled
- NFSv3 and NFSv4 enabled and active on the same NetApp volume