NFS over RDMA becomes unresponsive under heavy load in Red Hat Enterprise Linux 6

Solution Verified - Updated -

Issue

  • When using NFS over RDMA, the exported NFS share can be mounted on the client and I/O works as long as the load is light.  Under heavy load the client becomes unresponsive and eventually produces a kernel oops.  At the same time, the server logs messages like the following in /var/log/messages:

    May 10 14:33:58 hostname kernel: svcrdma: error fast registering xdr for xprt ffff88861fa83000svcrdma: Error fast registering memory for xprt ffff888fb6672800

    The client logs messages like the following in /var/log/messages:

    May 10 12:58:42 hostname kernel: rpcrdma: connection to 192.168.230.181:2050 closed (-103)
    May 10 12:58:42 hostname kernel: rpcrdma: connection to 192.168.230.181:2050 on mlx4_0, memreg 5 slots 32 ird 16
    May 10 12:58:42 hostname kernel: ------------[ cut here ]------------
    May 10 12:58:42 hostname kernel: WARNING: at kernel/softirq.c:143 local_bh_enable_ip+0x7b/0xa0() (Tainted: G        W  ---------------- )
    May 10 12:58:42 hostname kernel: Hardware name: ProLiant DL980 G7
    May 10 12:58:42 hostname kernel: Modules linked in: xprtrdma nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror dm_region_hash dm_log power_meter hwmon serio_raw iTCO_wdt iTCO_vendor_support hpilo sg i7core_edac edac_core mlx4_ib ib_mad ib_core mlx4_en mlx4_core nx_nic(U) ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod [last unloaded: microcode]
    May 10 12:58:42 hostname kernel: Pid: 0, comm: swapper Tainted: G        W  ----------------  2.6.32-71.el6.x86_64 #1
    May 10 12:58:42 hostname kernel: Call Trace:
    May 10 12:58:42 hostname kernel: <IRQ>  [<ffffffff8106b857>] warn_slowpath_common+0x87/0xc0
    May 10 12:58:42 hostname kernel: [<ffffffff8106b8aa>] warn_slowpath_null+0x1a/0x20
    May 10 12:58:42 hostname kernel: [<ffffffff8107431b>] local_bh_enable_ip+0x7b/0xa0
    May 10 12:58:42 hostname kernel: [<ffffffff814cac1b>] _spin_unlock_bh+0x1b/0x20
    May 10 12:58:42 hostname kernel: [<ffffffffa04530f2>] rpc_wake_up_status+0xa2/0xc0 [sunrpc]
    May 10 12:58:42 hostname kernel: [<ffffffffa044e22c>] xprt_wake_pending_tasks+0x2c/0x30 [sunrpc]
    May 10 12:58:42 hostname kernel: [<ffffffffa05be42c>] rpcrdma_conn_func+0x9c/0xb0 [xprtrdma]
    May 10 12:58:42 hostname kernel: [<ffffffffa05c1540>] rpcrdma_qp_async_error_upcall+0x40/0x80 [xprtrdma]
    May 10 12:58:42 hostname kernel: [<ffffffff8105c394>] ? try_to_wake_up+0x284/0x380
    May 10 12:58:42 hostname kernel: [<ffffffffa030a0fb>] mlx4_ib_qp_event+0x8b/0x100 [mlx4_ib]
    May 10 12:58:42 hostname kernel: [<ffffffffa02d83e4>] mlx4_qp_event+0x74/0xd0 [mlx4_core]
    May 10 12:58:42 hostname kernel: [<ffffffff8107d6c0>] ? process_timeout+0x0/0x10
    May 10 12:58:42 hostname kernel: [<ffffffffa02cd858>] mlx4_eq_int+0x2a8/0x300 [mlx4_core]
    May 10 12:58:42 hostname kernel: [<ffffffff81095233>] ? hrtimer_get_next_event+0xc3/0x100
    May 10 12:58:42 hostname kernel: [<ffffffffa02cd954>] mlx4_msi_x_interrupt+0x14/0x20 [mlx4_core]
    May 10 12:58:42 hostname kernel: [<ffffffff810d8740>] handle_IRQ_event+0x60/0x170
    May 10 12:58:42 hostname kernel: [<ffffffff810dae26>] handle_edge_irq+0xc6/0x160
    May 10 12:58:42 hostname kernel: [<ffffffff810a1076>] ? tick_check_idle+0xb6/0xe0
    May 10 12:58:42 hostname kernel: [<ffffffff81015fb9>] handle_irq+0x49/0xa0
    May 10 12:58:42 hostname kernel: [<ffffffff814cf90c>] do_IRQ+0x6c/0xf0
    May 10 12:58:42 hostname kernel: [<ffffffff81013ad3>] ret_from_intr+0x0/0x11
    May 10 12:58:42 hostname kernel: <EOI>  [<ffffffff8101bc01>] ? mwait_idle+0x71/0xd0
    May 10 12:58:42 hostname kernel: [<ffffffff814cd80a>] ? atomic_notifier_call_chain+0x1a/0x20
    May 10 12:58:42 hostname kernel: [<ffffffff81011e96>] cpu_idle+0xb6/0x110
    May 10 12:58:42 hostname kernel: [<ffffffff814c17c8>] start_secondary+0x1fc/0x23f
    May 10 12:58:42 hostname kernel: ---[ end trace a7919e7f17c0a727 ]---
    May 10 12:58:42 hostname kernel: rpcrdma: connection to 192.168.230.181:2050 closed (-103)
    May 10 12:58:52 hostname kernel: rpcrdma: connection to 192.168.230.181:2050 on mlx4_0, memreg 5 slots 32 ird 16
    May 10 12:58:52 hostname kernel: rpcrdma: connection to 192.168.230.181:2050 closed (-103)
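
To check whether a system shows the same symptoms, the log signatures above can be searched for in /var/log/messages. The following is a minimal sketch, not a Red Hat tool: the function names scan_lines and scan_file are hypothetical, and the patterns are taken only from the excerpts in this article, so other kernel versions may word these messages differently.

```python
import re

# Patterns derived from the log excerpts in this article (assumption:
# the wording matches on the system being checked).
SERVER_FASTREG = re.compile(
    r"svcrdma: error fast registering (?:xdr|memory) for xprt",
    re.IGNORECASE,
)
CLIENT_CLOSED = re.compile(
    r"rpcrdma: connection to (\S+):(\d+) closed \((-?\d+)\)"
)


def scan_lines(lines):
    """Count server- and client-side signatures in an iterable of log lines.

    Returns a (counts, status_codes) tuple, where status_codes is the set of
    numeric codes reported in the client's "closed (...)" messages.
    """
    counts = {"server_fastreg": 0, "client_closed": 0}
    status_codes = set()
    for line in lines:
        # One server line may contain several messages run together,
        # so count every occurrence rather than one per line.
        counts["server_fastreg"] += len(SERVER_FASTREG.findall(line))
        match = CLIENT_CLOSED.search(line)
        if match:
            counts["client_closed"] += 1
            status_codes.add(int(match.group(3)))
    return counts, status_codes


def scan_file(path="/var/log/messages"):
    """Hypothetical helper: scan a log file on disk (usually requires root)."""
    with open(path, errors="replace") as handle:
        return scan_lines(handle)
```

Running scan_file() on the server and on the client gives a quick count of each signature. A nonzero server_fastreg count on the server, together with repeated client_closed entries on the client, matches the pattern described in this issue.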
    

Environment

  • Red Hat Enterprise Linux 6 x86_64
  • HP ProLiant DL980 G7 Servers
  • InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
