RHEL6.4: delayed NFS RENEW response from NetApp filer leads to temporarily expired lease, and repeated NFS4 WRITE with NFS4ERR_BAD_STATEID reply
Issue
- A TIBCO process becomes stuck, goes defunct (zombie), and cannot be killed with any command.
- The stuck process does not release its resources: it keeps holding some ports and does not release its lock on a file that resides on a NetApp NFSv4 share.
- The only way to recover is to reboot the Red Hat Enterprise Linux VM.
- The following sample backtrace is logged when the issue occurs:
INFO: task tibemsd64:2455 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
tibemsd64 D 0000000000000000 0 2455 1 0x00000080
ffff88023ae419c8 0000000000000082 ffff88023ae41948 ffffffffa028ecd0
ffff880237693400 ffff88023ae41978 ffff880238ba9c30 ffff88023a3067e0
ffff88023aa75ab8 ffff88023ae41fd8 000000000000fb88 ffff88023aa75ab8
Call Trace:
[<ffffffffa028ecd0>] ? rpc_execute+0x50/0xa0 [sunrpc]
[<ffffffff810a2431>] ? ktime_get_ts+0xb1/0xf0
[<ffffffff81119e10>] ? sync_page+0x0/0x50
[<ffffffff8150e8c3>] io_schedule+0x73/0xc0
[<ffffffff81119e4d>] sync_page+0x3d/0x50
[<ffffffff8150f12a>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff81119de7>] __lock_page+0x67/0x70
[<ffffffff81096de0>] ? wake_bit_function+0x0/0x50
[<ffffffff81119c1e>] ? find_get_page+0x1e/0xa0
[<ffffffff8111ae90>] find_lock_page+0x50/0x80
[<ffffffff8111af0d>] grab_cache_page_write_begin+0x4d/0xc0
[<ffffffffa0325267>] nfs_write_begin+0x77/0x220 [nfs]
[<ffffffff8111a7b3>] generic_file_buffered_write+0x123/0x2e0
[<ffffffff8111c210>] __generic_file_aio_write+0x260/0x490
[<ffffffff81437b73>] ? sock_recvmsg+0x133/0x160
[<ffffffff8111c4c8>] generic_file_aio_write+0x88/0x100
[<ffffffffa0325f8e>] nfs_file_write+0xde/0x1f0 [nfs]
[<ffffffff8118106a>] do_sync_write+0xfa/0x140
[<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8121bed6>] ? security_file_permission+0x16/0x20
[<ffffffff81181368>] vfs_write+0xb8/0x1a0
[<ffffffff81181c61>] sys_write+0x51/0x90
[<ffffffff810dc685>] ? __audit_syscall_exit+0x265/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
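To check whether other hosts show the same symptom, the hung-task warning and the NFS write path in the trace above can be searched for in saved kernel logs. The following is a minimal sketch, not a diagnostic tool: the function names `find_hung_tasks` and `looks_like_nfs_write_stall` are hypothetical, and the second check is only a heuristic based on the two function names visible in this particular backtrace.

```python
import re

# Matches the kernel hung-task warning, for example:
# "INFO: task tibemsd64:2455 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return a list of (command, pid, seconds) for each hung-task warning."""
    hits = []
    for line in log_text.splitlines():
        match = HUNG_TASK_RE.search(line)
        if match:
            hits.append(
                (match.group("comm"), int(match.group("pid")), int(match.group("secs")))
            )
    return hits


def looks_like_nfs_write_stall(trace_text):
    """Heuristic (assumption): the trace goes through nfs_file_write and is
    waiting for a page lock, as in the sample backtrace."""
    return "nfs_file_write" in trace_text and "__lock_page" in trace_text
```

Both functions take the log text as a string, so they can be run against saved copies of /var/log/messages without touching the affected system.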
Environment
- Red Hat Enterprise Linux 6 (NFS Client)
  - Seen on kernel 2.6.32-358.18.1.el6
- NFS Server
  - NetApp Ontap 8.1.4P2 7-mode
  - NFS4 with (read and write) delegations enabled
  - NFS4 lease time = 30 seconds
    - NOTE: This is the default for the NetApp option nfs.v4.lease_seconds, according to NetApp library ECMM1278346 - Specifying the NFSv4 locking lease period
- Application
  - Seen with TIBCO EMS
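The 30-second lease time matters because a lease lapses if it is not renewed within that period. The arithmetic can be sketched as below. This is a simplified model under stated assumptions, not the client's actual renewal logic: the function names are hypothetical, and the 20-second renewal interval in the usage note is an illustrative value, not one taken from this environment.

```python
LEASE_SECONDS = 30  # NFS4 lease time listed in the Environment section


def lease_expired(seconds_since_renewal, lease_seconds=LEASE_SECONDS):
    """True if more than one full lease period has passed since the
    last successful renewal."""
    return seconds_since_renewal > lease_seconds


def renewal_slack(renew_interval, lease_seconds=LEASE_SECONDS):
    """Seconds a renewal may be delayed before the lease lapses, assuming
    (hypothetically) the client renews every `renew_interval` seconds."""
    return lease_seconds - renew_interval
```

For example, with an assumed renewal interval of 20 seconds, `renewal_slack(20)` gives 10 seconds, so in this model a RENEW that is delayed by more than that would let the lease lapse.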