System hangs after losing InfiniBand (IB) network connectivity with a third-party mlx5_core driver

Solution In Progress - Updated -

Issue

  • A RHEL system using an InfiniBand (IB) network loses connectivity and becomes unresponsive (hangs).
  • A vmcore is generated (often via NMI) to analyze the hang. The vmcore-dmesg.txt file shows errors from network-dependent applications like Lustre, followed by a NETDEV WATCHDOG timeout warning for the mlx5_core driver.
[6597276.873103] Lustre: 4502:0:(client.c:2321:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
[6597280.009029] LNet: 4441:0:(o2iblnd_cb.c:3456:kiblnd_check_conns()) Timed out tx for 10.10.0.29@o2ib: 0 seconds
[6597280.009062] LNetError: 4441:0:(lib-move.c:3952:lnet_handle_recovery_reply()) peer NI (10.10.0.29@o2ib) recovery failed with -110
[6597280.969006] LNet: 4441:0:(o2iblnd_cb.c:3456:kiblnd_check_conns()) Timed out tx for 10.10.0.37@o2ib: 0 seconds
[6597280.969010] LNet: 4441:0:(o2iblnd_cb.c:3456:kiblnd_check_conns()) Skipped 20 previous similar messages
[6597285.000918] LNet: 4441:0:(o2iblnd_cb.c:3456:kiblnd_check_conns()) Timed out tx for 10.10.3.29@o2ib: 7 seconds
[6597285.000921] LNet: 4441:0:(o2iblnd_cb.c:3456:kiblnd_check_conns()) Skipped 2 previous similar messages
[6597285.000948] LNetError: 4441:0:(lib-move.c:3952:lnet_handle_recovery_reply()) peer NI (10.10.3.29@o2ib) recovery failed with -110
[6597285.014142] LNetError: 4441:0:(lib-move.c:3952:lnet_handle_recovery_reply()) Skipped 4 previous similar messages
[6597286.600886] LNetError: 3697233:0:(lib-move.c:3952:lnet_handle_recovery_reply()) peer NI (10.10.0.35@o2ib) recovery failed with -113
[6597286.614378] LNetError: 3697233:0:(lib-move.c:3952:lnet_handle_recovery_reply()) Skipped 9 previous similar messages
[6597309.128358] ------------[ cut here ]------------
[6597309.128361] NETDEV WATCHDOG: ib0 (mlx5_core): transmit queue 13 timed out
[6597309.128373] WARNING: CPU: 26 PID: 3178942 at net/sched/sch_generic.c:482 dev_watchdog+0x29a/0x2b0
[6597309.128379] Modules linked in: nf_tables nfnetlink mgc(OE) lustre(OE) mdc(OE) fid(OE) lov(OE) osc(OE) lmv(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache uio_pci_generic uio vfio_pci vfio_virqfd vfio_iommu_type1 vfio cuse fuse rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) intel_rapl_msr intel_rapl_common sunrpc i10nm_edac nfit vfat fat libnvdimm x86_pkg_temp_thermal coretemp kvm_intel pmt_telemetry pmt_crashlog intel_sdsi pmt_class kvm ipmi_ssif irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl intel_cstate mei_me intel_uncore pcspkr intel_th_gth idxd isst_if_mbox_pci mei isst_if_mmio intel_th_pci intel_th isst_if_common intel_vsec idxd_bus wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter binfmt_misc knem(OE) xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) raid1 sd_mod t10_pi sg drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt
[6597309.128435]  fb_sys_fops drm_ttm_helper ttm drm mlx5_core(OE) igb ahci libahci crc32c_intel libata mlxfw(OE) psample pci_hyperv_intf mlxdevm(OE) dca mlx_compat(OE) tls i2c_algo_bit pinctrl_emmitsburg xpmem(OE)
[6597309.128448] CPU: 26 PID: 3178942 Comm: LSDYNA.exe Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-477.10.1.el8_8.x86_64 #1
[6597309.128450] Hardware name: Lenovo ThinkSystem SD650 V3/SB27B45784, BIOS USE124L-3.30 10/27/2023
[6597309.128451] RIP: 0010:dev_watchdog+0x29a/0x2b0
[6597309.128453] Code: e4 eb 9a 4c 8b 3c 24 c6 05 13 2d 6d 01 01 4c 89 ff e8 5a 39 fa ff 89 d9 4c 89 fe 48 c7 c7 60 52 5a 84 48 89 c2 e8 a3 80 86 ff <0f> 0b e9 75 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00
[6597309.128454] RSP: 0000:ff3124f6cd080e78 EFLAGS: 00010286
[6597309.128455] RAX: 0000000000000000 RBX: 000000000000000d RCX: 0000000000000027
[6597309.128456] RDX: 0000000000000027 RSI: 00000000ffff7fff RDI: ff181f63bf69e690
[6597309.128457] RBP: 000000000000001a R08: 0000000000000000 R09: c0000000ffff7fff
[6597309.128457] R10: 0000000000000001 R11: ff3124f6cd080c90 R12: ff181f24c808845c
[6597309.128458] R13: 00000000000001f8 R14: ff181f24c8088480 R15: ff181f24c8088000
[6597309.128458] FS:  000014efb3ae6780(0000) GS:ff181f63bf680000(0000) knlGS:0000000000000000
[6597309.128459] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[6597309.128460] CR2: 000000001f118ff0 CR3: 00000021934be003 CR4: 0000000000771ee0
[6597309.128461] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[6597309.128461] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[6597309.128462] PKRU: 55555554
[6597309.128463] Call Trace:
[6597309.128465]  <IRQ>
[6597309.128467]  ? pfifo_fast_enqueue+0x150/0x150
[6597309.128469]  call_timer_fn+0x2e/0x130
[6597309.128473]  run_timer_softirq+0x1d8/0x410
[6597309.128476]  ? sched_clock+0x5/0x10
[6597309.128478]  __do_softirq+0xdc/0x2cf
[6597309.128482]  irq_exit_rcu+0xd5/0xe0
[6597309.128485]  irq_exit+0xa/0x10
[6597309.128486]  smp_apic_timer_interrupt+0x74/0x130
[6597309.128488]  apic_timer_interrupt+0xf/0x20
[6597309.128490]  </IRQ>
[6597309.128490] RIP: 0033:0x14efb17eb1e9
[6597309.128492] Code: dd 55 ae ff eb d9 0f 1f 40 00 0f 1f 80 00 00 00 00 41 54 41 55 41 56 41 57 53 55 56 48 8b 1d 1e 97 e6 00 48 8b 2d f7 96 e4 00 <48> 85 db 0f 84 f5 03 00 00 0f 31 48 c1 e2 20 48 0b c2 49 89 c4 48
[6597309.128493] RSP: 002b:00007ffd90774b20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[6597309.128494] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[6597309.128495] RDX: 0054613b00000002 RSI: 0000000000000000 RDI: 00007f20009ce380
[6597309.128496] RBP: 0000000000000000 R08: 00111751633a4eb1 R09: 00007ffd90774b90
[6597309.128496] R10: 000014efb2675fc8 R11: 0000000000000000 R12: 0000000000000000
[6597309.128497] R13: 0000000000000005 R14: 00007ffd90774ce0 R15: 000000000000072d
[6597309.128498] ---[ end trace 85f5f37a5901abf6 ]---
[6597309.128500] ib0: transmit timeout: latency 10605 msecs
[6597309.128501] ib0: queue stopped 0, tx_head 0, tx_tail 0, global_tx_head 0, global_tx_tail 0
[6597323.464077] LNetError: 3699350:0:(lib-move.c:3952:lnet_handle_recovery_reply()) peer NI (10.10.0.38@o2ib) recovery failed with -113
[6597323.477565] LNetError: 3699350:0:(lib-move.c:3952:lnet_handle_recovery_reply()) Skipped 3 previous similar messages
[6597351.614380] mlx5_core 0000:63:00.0 ib0: Failed to get min RX wqes on Channel[0] RQN[0xc01049] wq cur_sz(0) min_rx_wqes(128)
[6597351.614386] mlx5_core 0000:63:00.0 ib0: RX timeout on channel: 0, ICOSQ: 0x8a3e, RQ: 0xc01049, CQ: 0x1449
[6597351.625344] mlx5_core 0000:63:00.0 ib0: EQ 0x7: Cons = 0x3d362e, irqn = 0x42
[6597351.656381] mlx5_core 0000:63:00.0 ib0: Failed to get min RX wqes on Channel[1] RQN[0xc0104a] wq cur_sz(0) min_rx_wqes(128)
[6597351.656384] mlx5_core 0000:63:00.0 ib0: RX timeout on channel: 1, ICOSQ: 0x8a43, RQ: 0xc0104a, CQ: 0x144e
[6597351.667321] mlx5_core 0000:63:00.0 ib0: EQ 0x8: Cons = 0x76f7e, irqn = 0x43
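
To check whether a given vmcore-dmesg.txt shows the same signature, the NETDEV WATCHDOG lines can be extracted with a short script. The following is a minimal sketch in Python, not a Red Hat tool: the function name find_watchdog_timeouts and the regular expression are illustrative assumptions based only on the message format shown in the log above.

```python
import re

# Matches lines such as:
# [6597309.128361] NETDEV WATCHDOG: ib0 (mlx5_core): transmit queue 13 timed out
# The pattern is an assumption derived from the log excerpt in this article.
WATCHDOG_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+NETDEV WATCHDOG: (?P<dev>\S+) "
    r"\((?P<driver>[^)]+)\): transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_timeouts(lines):
    """Return one dict per NETDEV WATCHDOG message found in the given lines."""
    hits = []
    for line in lines:
        match = WATCHDOG_RE.match(line)
        if match:
            hits.append({
                "timestamp": float(match.group("ts")),
                "device": match.group("dev"),
                "driver": match.group("driver"),
                "queue": int(match.group("queue")),
            })
    return hits


# Two lines copied from the log above; only the first is a watchdog message.
sample = [
    "[6597309.128361] NETDEV WATCHDOG: ib0 (mlx5_core): transmit queue 13 timed out",
    "[6597309.128500] ib0: transmit timeout: latency 10605 msecs",
]
for hit in find_watchdog_timeouts(sample):
    print(hit)
```

For a real file, pass the open file object instead of the sample list, for example find_watchdog_timeouts(open("vmcore-dmesg.txt", errors="replace")). A hit whose driver field is mlx5_core on an ib device matches the pattern described in this article.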

Environment

  • Red Hat Enterprise Linux 8
  • Third-party (out-of-tree) mlx5_core driver
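
Whether the mlx5_core driver in a given log is a third-party build can be read from the "Modules linked in:" line, where the kernel appends taint flags to each module name: O marks an out-of-tree module and E marks an unsigned one. The following Python sketch lists the modules flagged O. The function name out_of_tree_modules and the regular expression are illustrative assumptions, not part of any Red Hat tooling.

```python
import re

# A module entry with taint flags looks like "mlx5_core(OE)".
# Modules without flags (for example "xfs") are in-tree and are not matched.
MODULE_RE = re.compile(r"(?P<name>[A-Za-z0-9_]+)\((?P<flags>[A-Z+\-]+)\)")


def out_of_tree_modules(modules_text):
    """Return the sorted names of modules whose flags include 'O' (out-of-tree)."""
    return sorted({
        match.group("name")
        for match in MODULE_RE.finditer(modules_text)
        if "O" in match.group("flags")
    })


# Shortened excerpt of the "Modules linked in:" line from the log above.
sample = "Modules linked in: nf_tables nfnetlink lustre(OE) xfs mlx5_ib(OE) mlx5_core(OE) igb"
print(out_of_tree_modules(sample))
```

If mlx5_core appears in the result, the system is running an out-of-tree build of the driver, which matches the environment described here.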
