RHEL6: NFS4 WRITE continuously sent and completing with NFS4ERR_BAD_STATEID (10025) with NetApp due to multiple filehandles for same file
Issue
- A hung task timeout and/or kernel panic occurs. The process that triggers it is closing a file on an NFS mount, flushing dirty pages, and waiting for page writeback to complete.
- A hung task backtrace similar to the following is logged:
INFO: task foo:4347 blocked for more than 720 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
foo D 0000000000000012 0 4347 798 0x00000080
ffff88101240fc78 0000000000000082 0000000000000000 ffff88020acfd840
ffff88101240fd08 ffffffff8112dd37 ffff88101240fc58 0000000000000282
ffff881012791098 ffff88101240ffd8 000000000000fb88 ffff881012791098
Call Trace:
[<ffffffff8150e3e3>] io_schedule+0x73/0xc0
[<ffffffff81119d3d>] sync_page+0x3d/0x50
[<ffffffff8150ed9f>] __wait_on_bit+0x5f/0x90
[<ffffffff81119f73>] wait_on_page_bit+0x73/0x80
[<ffffffff8111a39b>] wait_on_page_writeback_range+0xfb/0x190
[<ffffffff8111a568>] filemap_write_and_wait_range+0x78/0x90
[<ffffffff811b1ace>] vfs_fsync_range+0x7e/0xe0
[<ffffffff811b1b9d>] vfs_fsync+0x1d/0x20
[<ffffffffa03a6670>] nfs_file_flush+0x70/0xa0 [nfs]
...
- Just before the process blocks, the following message is sometimes logged (error -10026 is NFS4ERR_BAD_SEQID):
nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
- In addition, some time before the problem occurs, one or more bad sequence-id messages may be logged:
NFS: v4 server nfs-server returned a bad sequence-id error!
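The three log signatures listed above can be searched for together. The following is a minimal sketch in Python, not a Red Hat tool: the function name `scan_log` and the labels `hung_task`, `reclaim_error`, and `bad_seqid` are hypothetical, and the patterns are taken only from the messages quoted in this article, so other kernel versions may word them differently. The numeric codes are the NFSv4 status values for NFS4ERR_BAD_STATEID (10025) and NFS4ERR_BAD_SEQID (10026).

```python
import re

# Patterns built from the messages quoted in this article.
# The labels (dictionary keys) are hypothetical names chosen for this sketch.
SIGNATURES = {
    "hung_task": re.compile(
        r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"),
    "reclaim_error": re.compile(
        r"nfs4_reclaim_open_state: unhandled error (-\d+)"),
    "bad_seqid": re.compile(
        r"NFS: v4 server (\S+) returned a bad sequence-id error"),
}

# NFSv4 status codes relevant to this issue.
NFS4_ERRORS = {10025: "NFS4ERR_BAD_STATEID", 10026: "NFS4ERR_BAD_SEQID"}


def scan_log(lines):
    """Return (line_number, label, detail) for each matching log line."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for label, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if not match:
                continue
            detail = match.groups()
            if label == "reclaim_error":
                # The kernel logs the status as a negative number.
                code = abs(int(match.group(1)))
                detail = (match.group(1), NFS4_ERRORS.get(code, "unknown"))
            hits.append((lineno, label, detail))
    return hits
```

To use it, pass any iterable of log lines, for example an open file handle on a copy of the system log. A `bad_seqid` or `reclaim_error` hit that precedes a `hung_task` hit matches the sequence described in this article, but it does not by itself confirm the root cause.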
Environment
- Red Hat Enterprise Linux 6 (NFS Client)
- Seen on kernels 2.6.32-358.12.1.el6 and 2.6.32-431.5.1.el6
- Other kernels are likely affected
- NetApp (NFS Server)
- Data ONTAP 8.1.2P4
- delegations disabled
- NFSv3 and NFSv4 enabled and active on the same NetApp volume
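One way to check a client against the last condition above is to look for an export that the client itself has mounted with both NFSv3 and NFSv4. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the input is text in the `/proc/mounts` format, and the export string (`server:/path`) is used as a stand-in for the NetApp volume. It cannot see mounts made by other clients, and different export paths may belong to the same volume, so an empty result does not rule out the condition.

```python
from collections import defaultdict


def nfs_major_version(fstype, options):
    """Return the NFS major version for a mount entry, or None if not NFS."""
    if fstype == "nfs4":
        return 4
    if fstype != "nfs":
        return None
    # For type "nfs", the version is given by the vers= or nfsvers= option.
    for opt in options.split(","):
        key, _, value = opt.partition("=")
        if key in ("vers", "nfsvers") and value:
            return int(value.split(".")[0])
    return None


def exports_with_mixed_versions(mounts_text):
    """List exports mounted on this client with both NFSv3 and NFSv4."""
    versions = defaultdict(set)
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        device, _mountpoint, fstype, options = fields[:4]
        major = nfs_major_version(fstype, options)
        if major is not None:
            versions[device].add(major)
    return sorted(dev for dev, seen in versions.items() if {3, 4} <= seen)
```

On a client, the input would typically be the contents of `/proc/mounts`, read as a string and passed to `exports_with_mixed_versions`.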
