NFS over RDMA becomes unresponsive under heavy load in Red Hat Enterprise Linux 6
Issue
- When using NFS over RDMA, the exported NFS share can be mounted on the client and I/O works as long as the load is light. Once the load becomes heavy, the client becomes unresponsive and will eventually produce a kernel oops. At the same time, the server logs messages like the following in /var/log/messages:
May 10 14:33:58 hostname kernel: svcrdma: error fast registering xdr for xprt ffff88861fa83000
svcrdma: Error fast registering memory for xprt ffff888fb6672800

The client side will log something like this in /var/log/messages:
May 10 12:58:42 hostname kernel: rpcrdma: connection to 192.168.230.181:2050 closed (-103)
May 10 12:58:42 hostname kernel: rpcrdma: connection to 192.168.230.181:2050 on mlx4_0, memreg 5 slots 32 ird 16
May 10 12:58:42 hostname kernel: ------------[ cut here ]------------
May 10 12:58:42 hostname kernel: WARNING: at kernel/softirq.c:143 local_bh_enable_ip+0x7b/0xa0() (Tainted: G W ---------------- )
May 10 12:58:42 hostname kernel: Hardware name: ProLiant DL980 G7
May 10 12:58:42 hostname kernel: Modules linked in: xprtrdma nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror dm_region_hash dm_log power_meter hwmon serio_raw iTCO_wdt iTCO_vendor_support hpilo sg i7core_edac edac_core mlx4_ib ib_mad ib_core mlx4_en mlx4_core nx_nic(U) ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod [last unloaded: microcode]
May 10 12:58:42 hostname kernel: Pid: 0, comm: swapper Tainted: G W ---------------- 2.6.32-71.el6.x86_64 #1
May 10 12:58:42 hostname kernel: Call Trace:
May 10 12:58:42 hostname kernel: <IRQ> [<ffffffff8106b857>] warn_slowpath_common+0x87/0xc0
May 10 12:58:42 hostname kernel: [<ffffffff8106b8aa>] warn_slowpath_null+0x1a/0x20
May 10 12:58:42 hostname kernel: [<ffffffff8107431b>] local_bh_enable_ip+0x7b/0xa0
May 10 12:58:42 hostname kernel: [<ffffffff814cac1b>] _spin_unlock_bh+0x1b/0x20
May 10 12:58:42 hostname kernel: [<ffffffffa04530f2>] rpc_wake_up_status+0xa2/0xc0 [sunrpc]
May 10 12:58:42 hostname kernel: [<ffffffffa044e22c>] xprt_wake_pending_tasks+0x2c/0x30 [sunrpc]
May 10 12:58:42 hostname kernel: [<ffffffffa05be42c>] rpcrdma_conn_func+0x9c/0xb0 [xprtrdma]
May 10 12:58:42 hostname kernel: [<ffffffffa05c1540>] rpcrdma_qp_async_error_upcall+0x40/0x80 [xprtrdma]
May 10 12:58:42 hostname kernel: [<ffffffff8105c394>] ? try_to_wake_up+0x284/0x380
May 10 12:58:42 hostname kernel: [<ffffffffa030a0fb>] mlx4_ib_qp_event+0x8b/0x100 [mlx4_ib]
May 10 12:58:42 hostname kernel: [<ffffffffa02d83e4>] mlx4_qp_event+0x74/0xd0 [mlx4_core]
May 10 12:58:42 hostname kernel: [<ffffffff8107d6c0>] ? process_timeout+0x0/0x10
May 10 12:58:42 hostname kernel: [<ffffffffa02cd858>] mlx4_eq_int+0x2a8/0x300 [mlx4_core]
May 10 12:58:42 hostname kernel: [<ffffffff81095233>] ? hrtimer_get_next_event+0xc3/0x100
May 10 12:58:42 hostname kernel: [<ffffffffa02cd954>] mlx4_msi_x_interrupt+0x14/0x20 [mlx4_core]
May 10 12:58:42 hostname kernel: [<ffffffff810d8740>] handle_IRQ_event+0x60/0x170
May 10 12:58:42 hostname kernel: [<ffffffff810dae26>] handle_edge_irq+0xc6/0x160
May 10 12:58:42 hostname kernel: [<ffffffff810a1076>] ? tick_check_idle+0xb6/0xe0
May 10 12:58:42 hostname kernel: [<ffffffff81015fb9>] handle_irq+0x49/0xa0
May 10 12:58:42 hostname kernel: [<ffffffff814cf90c>] do_IRQ+0x6c/0xf0
May 10 12:58:42 hostname kernel: [<ffffffff81013ad3>] ret_from_intr+0x0/0x11
May 10 12:58:42 hostname kernel: <EOI> [<ffffffff8101bc01>] ? mwait_idle+0x71/0xd0
May 10 12:58:42 hostname kernel: [<ffffffff814cd80a>] ? atomic_notifier_call_chain+0x1a/0x20
May 10 12:58:42 hostname kernel: [<ffffffff81011e96>] cpu_idle+0xb6/0x110
May 10 12:58:42 hostname kernel: [<ffffffff814c17c8>] start_secondary+0x1fc/0x23f
May 10 12:58:42 hostname kernel: ---[ end trace a7919e7f17c0a727 ]---
May 10 12:58:42 hostname kernel: rpcrdma: connection to 192.168.230.181:2050 closed (-103)
May 10 12:58:52 hostname kernel: rpcrdma: connection to 192.168.230.181:2050 on mlx4_0, memreg 5 slots 32 ird 16
May 10 12:58:52 hostname kernel: rpcrdma: connection to 192.168.230.181:2050 closed (-103)
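For context, a client-side mount over the RDMA transport typically looks like the sketch below. This is a minimal illustration, not taken from the report: the server name and export path are hypothetical, and the port matches the 2050 seen in the client log above (the transport port is site-configurable).

```shell
# Load the client-side NFS/RDMA transport module (server uses svcrdma).
modprobe xprtrdma

# Mount the export over RDMA. "server" and "/export" are hypothetical
# placeholders; port=2050 mirrors the port shown in the logs above.
mount -t nfs -o rdma,port=2050 server:/export /mnt/nfs-rdma

# Confirm the mount is using the rdma protocol.
grep rdma /proc/mounts
```

The hang described in this issue occurs only once sustained heavy I/O is driven through such a mount; light I/O completes normally.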
Environment
- Red Hat Enterprise Linux 6 x86_64
- HP ProLiant DL980 G7 Servers
- InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)