RHEL7.4: NFS4 client hung due to NFS4 state manager thread stuck inside nfs_reap_expired_delegations infinite loop
Issue
- After updating to a RHEL 7.4 kernel, an NFSv4.1 client hangs with the NFSv4 state manager thread looping inside nfs_reap_expired_delegations. A tcpdump shows a constant stream of TEST_STATEID requests carrying the same stateid, each answered with NFS4ERR_BAD_STATEID.
- With an NFSv4.1 client, after updating to RHEL 7.4, processes hang and/or generate hung task error messages. In addition, top shows that the process using the most CPU is the one named after the IP address of the NFS server
(10.#.#.# is the IP address of the NAS):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12073 root 20 0 0 0 0 D 16.7 0.0 525:23.29 10.#.#.#-manag
- After updating to a RHEL 7.4 kernel, an NFSv4.0 client sees soft lockups in nfs_reap_expired_delegations. The client either hangs and requires a reboot, or panics due to the soft lockup:
[17596.853096] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [10.1.1.xx-ma:11637]
[17596.853853] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache vmw_vsock_vmci_transport vsock sb_edac edac_core coretemp iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev vmw_balloon joydev pcspkr sg parport_pc parport shpchp vmw_vmci i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi vmwgfx drm_kms_helper sd_mod syscopyarea crc_t10dif sysfillrect crct10dif_generic sysimgblt fb_sys_fops ttm drm crct10dif_pclmul ata_piix crct10dif_common crc32c_intel libata serio_raw vmxnet3 vmw_pvscsi i2c_core floppy dm_mirror dm_region_hash dm_log dm_mod
[17596.853900] CPU: 1 PID: 11637 Comm: 172.32.xx.xx-ma Tainted: G L ------------ 3.10.0-693.1.1.el7.x86_64 #1
[17596.853901] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/17/2015
[17596.853903] task: ffff8804242f5ee0 ti: ffff8802cc220000 task.ti: ffff8802cc220000
[17596.853904] RIP: 0010:[<ffffffffc058489a>] [<ffffffffc058489a>] nfs_reap_expired_delegations+0x9a/0x220 [nfsv4]
[17596.853921] RSP: 0018:ffff8802cc223df8 EFLAGS: 00000206
[17596.853922] RAX: 0000000000000004 RBX: ffff88041ce0d000 RCX: 0000000000000003
[17596.853923] RDX: 0000000000000000 RSI: ffff8800b769d848 RDI: ffff8800bb556000
[17596.853924] RBP: ffff8802cc223e58 R08: ffff88041be93540 R09: 0000000000000000
[17596.853925] R10: 0000000000000000 R11: 7fffffffffffffff R12: ffff88041ce0d000
[17596.853926] R13: ffffffffc0584a6d R14: ffff8802cc223d78 R15: ffff8800b769d7c0
[17596.853927] FS: 0000000000000000(0000) GS:ffff88043fc40000(0000) knlGS:0000000000000000
[17596.853928] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[17596.853929] CR2: 00007fd8449a7000 CR3: 00000000019f2000 CR4: 00000000000407e0
[17596.853932] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[17596.853934] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[17596.853935] Stack:
[17596.853936] ffffffffc059d3c0 ffff88041be93540 ffff88041329a000 0000000000000000
[17596.853937] 04cdd20102072112 0000000400000000 00000000f389a2ad ffff88042c49a400
[17596.853939] ffff88042c49a400 ffff88042c49a4c8 ffff88042c49a530 0000000000000000
[17596.853940] Call Trace:
[17596.853949] [<ffffffffc0580c22>] nfs4_state_manager+0x5f2/0x8c0 [nfsv4]
[17596.853955] [<ffffffffc0580ef0>] ? nfs4_state_manager+0x8c0/0x8c0 [nfsv4]
[17596.853961] [<ffffffffc0580f0f>] nfs4_run_state_manager+0x1f/0x40 [nfsv4]
[17596.853964] [<ffffffff810b098f>] kthread+0xcf/0xe0
[17596.853966] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[17596.853970] [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
[17596.853972] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[17596.853972] Code: 24 10 4c 8b 7c 24 10 49 39 df 75 1b e9 e8 00 00 00 49 8b 07 48 89 44 24 10 4c 8b 7c 24 10 49 39 df 0f 84 d2 00 00 00 49 8b 47 48 <a8> 10 75 e2 49 8b 47 48 a8 40 74 da 49 8b be 70 03 00 00 e8 8e
- Another example of this issue triggering soft lockups, on kernel 3.10.0-693.el7.x86_64, with the following call path:
nfs4_state_manager() => nfs_reap_expired_delegations() => nfs_revoke_delegation()
[949664.745423] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [192.168.0.xx-m:17637]
[949664.745429] Modules linked in: binfmt_misc xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ppdev crc32_pclmul sg ghash_clmulni_intel virtio_balloon joydev virtio_rng aesni_intel lrw gf128mul glue_helper ablk_helper cryptd parport_pc i2c_piix4 parport pcspkr nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi virtio_net virtio_console virtio_scsi qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata i2c_core crct10dif_pclmul
[949664.745491] crct10dif_common crc32c_intel serio_raw virtio_pci virtio_ring floppy virtio dm_mirror dm_region_hash dm_log dm_mod
[949664.745501] CPU: 5 PID: 17637 Comm: 192.168.0.xx-m Tainted: G L ------------ 3.10.0-693.el7.x86_64 #1
[949664.745504] Hardware name: Red Hat RHEV Hypervisor, BIOS 1.9.1-5.el7_3.2 04/01/2014
[949664.745506] task: ffff880fb3299fa0 ti: ffff880fe12a8000 task.ti: ffff880fe12a8000
[949664.745508] RIP: 0010:[<ffffffffc0507429>] [<ffffffffc0507429>] nfs_mark_return_delegation.isra.4+0x19/0x20 [nfsv4]
[949664.745535] RSP: 0018:ffff880fe12abdc0 EFLAGS: 00000202
[949664.745537] RAX: ffff880fe8760800 RBX: 00000000f50f06d8 RCX: 000000000000000f
[949664.745539] RDX: 000000000000000f RSI: ffff880411667100 RDI: ffff880fe6a99800
[949664.745540] RBP: ffff880fe12abdc0 R08: 0000000000000000 R09: 0000000000000000
[949664.745541] R10: ffff880fff359c40 R11: ffffea003c3f8980 R12: 0000000000000010
[949664.745543] R13: ffffffffc05074de R14: ffffffffffffff10 R15: ffff880fe6a99800
[949664.745545] FS: 0000000000000000(0000) GS:ffff880fff340000(0000) knlGS:0000000000000000
[949664.745547] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[949664.745548] CR2: 0000000000000004 CR3: 00000000019f2000 CR4: 00000000000006e0
[949664.745555] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[949664.745556] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[949664.745557] Stack:
[949664.745559] ffff880fe12abde8 ffffffffc0507551 ffff880fb2f5aed0 ffff880fe87608c8
[949664.745561] ffff880fe8760800 ffff880fe12abe58 ffffffffc05089ae ffffffffc05213c0
[949664.745564] ffff880ffa8e2180 ffff880411667100 0000000000000000 00592902022a921d
[949664.745566] Call Trace:
[949664.745580] [<ffffffffc0507551>] nfs_revoke_delegation+0x71/0x90 [nfsv4]
[949664.745592] [<ffffffffc05089ae>] nfs_reap_expired_delegations+0x1ae/0x220 [nfsv4]
[949664.745603] [<ffffffffc0504c22>] nfs4_state_manager+0x5f2/0x8c0 [nfsv4]
[949664.745626] [<ffffffffc0504ef0>] ? nfs4_state_manager+0x8c0/0x8c0 [nfsv4]
[949664.745637] [<ffffffffc0504f0f>] nfs4_run_state_manager+0x1f/0x40 [nfsv4]
[949664.745643] [<ffffffff810b098f>] kthread+0xcf/0xe0
[949664.745647] [<ffffffff8108ddeb>] ? do_exit+0x6bb/0xa40
[949664.745649] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[949664.745654] [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
[949664.745656] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[949664.745657] Code: 48 8b 07 f0 80 88 28 01 00 00 20 5d c3 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 f0 80 4e 48 02 48 8b 07 f0 80 88 28 01 00 00 20 <5d> c3 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 41 55 4c 8d af
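To check a saved kernel log for this signature, the sketch below scans the text for soft lockup reports whose register dump or call trace mentions nfs_reap_expired_delegations. It is only an illustration: the function name find_reaper_lockups and the 80-line search window are assumptions made for this example, not part of any Red Hat tool, and the regular expression is based solely on the watchdog line format shown in the logs above.

```python
import re

# Matches the watchdog line as it appears in the logs above, e.g.
# "NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [10.1.1.xx-ma:11637]"
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]"
)
SYMBOL = "nfs_reap_expired_delegations"


def find_reaper_lockups(log_text, window=80):
    """Return (cpu, seconds, comm, pid) for each soft lockup report that
    mentions SYMBOL within the next `window` lines (hypothetical helper)."""
    lines = log_text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        match = SOFT_LOCKUP.search(line)
        if match is None:
            continue
        for follow in lines[i + 1 : i + 1 + window]:
            if SOFT_LOCKUP.search(follow):
                break  # the next report starts here; do not mix traces
            if SYMBOL in follow:
                cpu, secs, comm, pid = match.groups()
                hits.append((int(cpu), int(secs), comm, int(pid)))
                break
    return hits
```

Passing the contents of a saved dmesg or /var/log/messages file as log_text returns one tuple per matching report. A non-empty result only indicates that the log resembles this issue; the kernel version still has to be confirmed separately.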
Environment
- Red Hat Enterprise Linux 7.4 (NFS client)
- Kernels from 3.10.0-693.el7 up to, but not including, 3.10.0-693.5.2.el7
- NFS4 with delegations enabled (both NFSv4.0 and NFSv4.1 are affected)
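The kernel range above can be checked programmatically. The following is a minimal sketch that assumes RHEL-style release strings such as 3.10.0-693.1.1.el7.x86_64; the helper names release_key and is_affected are hypothetical, and the comparison is a simplified numeric ordering, not a full implementation of RPM version comparison.

```python
# Boundaries taken from the Environment list above.
AFFECTED_FIRST = "3.10.0-693.el7"
FIXED_IN = "3.10.0-693.5.2.el7"


def release_key(release):
    """Turn '3.10.0-693.1.1.el7.x86_64' into ((3, 10, 0), (693, 1, 1)).

    Assumes a RHEL-style string: numeric version, '-', numeric build,
    then a '.el<N>' dist tag. Other formats raise ValueError.
    """
    version, _, rest = release.partition("-")
    build = rest.split(".el", 1)[0]  # drop '.el7' and the architecture
    return (
        tuple(int(part) for part in version.split(".")),
        tuple(int(part) for part in build.split(".")),
    )


def is_affected(release):
    """True if `release` lies in [AFFECTED_FIRST, FIXED_IN)."""
    return release_key(AFFECTED_FIRST) <= release_key(release) < release_key(FIXED_IN)
```

On a client, the running kernel can be tested with is_affected(platform.release()) after importing the standard platform module. Both kernels seen in the logs above (3.10.0-693.el7 and 3.10.0-693.1.1.el7) fall inside the range, while 3.10.0-693.5.2.el7 does not.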