System hangs after losing InfiniBand (IB) network connectivity with a third-party mlx5_core driver
Issue
- A RHEL system using an InfiniBand (IB) network loses connectivity and becomes unresponsive (hangs).
- A vmcore is generated (often via NMI) to analyze the hang. The vmcore-dmesg.txt file shows errors from network-dependent applications such as Lustre, followed by a NETDEV WATCHDOG timeout warning for the mlx5_core driver:
[6597276.873103] Lustre: 4502:0:(client.c:2321:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
[6597280.009029] LNet: 4441:0:(o2iblnd_cb.c:3456:kiblnd_check_conns()) Timed out tx for 10.10.0.29@o2ib: 0 seconds
[6597280.009062] LNetError: 4441:0:(lib-move.c:3952:lnet_handle_recovery_reply()) peer NI (10.10.0.29@o2ib) recovery failed with -110
[6597280.969006] LNet: 4441:0:(o2iblnd_cb.c:3456:kiblnd_check_conns()) Timed out tx for 10.10.0.37@o2ib: 0 seconds
[6597280.969010] LNet: 4441:0:(o2iblnd_cb.c:3456:kiblnd_check_conns()) Skipped 20 previous similar messages
[6597285.000918] LNet: 4441:0:(o2iblnd_cb.c:3456:kiblnd_check_conns()) Timed out tx for 10.10.3.29@o2ib: 7 seconds
[6597285.000921] LNet: 4441:0:(o2iblnd_cb.c:3456:kiblnd_check_conns()) Skipped 2 previous similar messages
[6597285.000948] LNetError: 4441:0:(lib-move.c:3952:lnet_handle_recovery_reply()) peer NI (10.10.3.29@o2ib) recovery failed with -110
[6597285.014142] LNetError: 4441:0:(lib-move.c:3952:lnet_handle_recovery_reply()) Skipped 4 previous similar messages
[6597286.600886] LNetError: 3697233:0:(lib-move.c:3952:lnet_handle_recovery_reply()) peer NI (10.10.0.35@o2ib) recovery failed with -113
[6597286.614378] LNetError: 3697233:0:(lib-move.c:3952:lnet_handle_recovery_reply()) Skipped 9 previous similar messages
[6597309.128358] ------------[ cut here ]------------
[6597309.128361] NETDEV WATCHDOG: ib0 (mlx5_core): transmit queue 13 timed out
[6597309.128373] WARNING: CPU: 26 PID: 3178942 at net/sched/sch_generic.c:482 dev_watchdog+0x29a/0x2b0
[6597309.128379] Modules linked in: nf_tables nfnetlink mgc(OE) lustre(OE) mdc(OE) fid(OE) lov(OE) osc(OE) lmv(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache uio_pci_generic uio vfio_pci vfio_virqfd vfio_iommu_type1 vfio cuse fuse rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) intel_rapl_msr intel_rapl_common sunrpc i10nm_edac nfit vfat fat libnvdimm x86_pkg_temp_thermal coretemp kvm_intel pmt_telemetry pmt_crashlog intel_sdsi pmt_class kvm ipmi_ssif irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl intel_cstate mei_me intel_uncore pcspkr intel_th_gth idxd isst_if_mbox_pci mei isst_if_mmio intel_th_pci intel_th isst_if_common intel_vsec idxd_bus wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter binfmt_misc knem(OE) xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) raid1 sd_mod t10_pi sg drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt
[6597309.128435] fb_sys_fops drm_ttm_helper ttm drm mlx5_core(OE) igb ahci libahci crc32c_intel libata mlxfw(OE) psample pci_hyperv_intf mlxdevm(OE) dca mlx_compat(OE) tls i2c_algo_bit pinctrl_emmitsburg xpmem(OE)
[6597309.128448] CPU: 26 PID: 3178942 Comm: LSDYNA.exe Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.10.1.el8_8.x86_64 #1
[6597309.128450] Hardware name: Lenovo ThinkSystem SD650 V3/SB27B45784, BIOS USE124L-3.30 10/27/2023
[6597309.128451] RIP: 0010:dev_watchdog+0x29a/0x2b0
[6597309.128453] Code: e4 eb 9a 4c 8b 3c 24 c6 05 13 2d 6d 01 01 4c 89 ff e8 5a 39 fa ff 89 d9 4c 89 fe 48 c7 c7 60 52 5a 84 48 89 c2 e8 a3 80 86 ff <0f> 0b e9 75 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00
[6597309.128454] RSP: 0000:ff3124f6cd080e78 EFLAGS: 00010286
[6597309.128455] RAX: 0000000000000000 RBX: 000000000000000d RCX: 0000000000000027
[6597309.128456] RDX: 0000000000000027 RSI: 00000000ffff7fff RDI: ff181f63bf69e690
[6597309.128457] RBP: 000000000000001a R08: 0000000000000000 R09: c0000000ffff7fff
[6597309.128457] R10: 0000000000000001 R11: ff3124f6cd080c90 R12: ff181f24c808845c
[6597309.128458] R13: 00000000000001f8 R14: ff181f24c8088480 R15: ff181f24c8088000
[6597309.128458] FS: 000014efb3ae6780(0000) GS:ff181f63bf680000(0000) knlGS:0000000000000000
[6597309.128459] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[6597309.128460] CR2: 000000001f118ff0 CR3: 00000021934be003 CR4: 0000000000771ee0
[6597309.128461] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[6597309.128461] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[6597309.128462] PKRU: 55555554
[6597309.128463] Call Trace:
[6597309.128465] <IRQ>
[6597309.128467] ? pfifo_fast_enqueue+0x150/0x150
[6597309.128469] call_timer_fn+0x2e/0x130
[6597309.128473] run_timer_softirq+0x1d8/0x410
[6597309.128476] ? sched_clock+0x5/0x10
[6597309.128478] __do_softirq+0xdc/0x2cf
[6597309.128482] irq_exit_rcu+0xd5/0xe0
[6597309.128485] irq_exit+0xa/0x10
[6597309.128486] smp_apic_timer_interrupt+0x74/0x130
[6597309.128488] apic_timer_interrupt+0xf/0x20
[6597309.128490] </IRQ>
[6597309.128490] RIP: 0033:0x14efb17eb1e9
[6597309.128492] Code: dd 55 ae ff eb d9 0f 1f 40 00 0f 1f 80 00 00 00 00 41 54 41 55 41 56 41 57 53 55 56 48 8b 1d 1e 97 e6 00 48 8b 2d f7 96 e4 00 <48> 85 db 0f 84 f5 03 00 00 0f 31 48 c1 e2 20 48 0b c2 49 89 c4 48
[6597309.128493] RSP: 002b:00007ffd90774b20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[6597309.128494] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[6597309.128495] RDX: 0054613b00000002 RSI: 0000000000000000 RDI: 00007f20009ce380
[6597309.128496] RBP: 0000000000000000 R08: 00111751633a4eb1 R09: 00007ffd90774b90
[6597309.128496] R10: 000014efb2675fc8 R11: 0000000000000000 R12: 0000000000000000
[6597309.128497] R13: 0000000000000005 R14: 00007ffd90774ce0 R15: 000000000000072d
[6597309.128498] ---[ end trace 85f5f37a5901abf6 ]---
[6597309.128500] ib0: transmit timeout: latency 10605 msecs
[6597309.128501] ib0: queue stopped 0, tx_head 0, tx_tail 0, global_tx_head 0, global_tx_tail 0
[6597323.464077] LNetError: 3699350:0:(lib-move.c:3952:lnet_handle_recovery_reply()) peer NI (10.10.0.38@o2ib) recovery failed with -113
[6597323.477565] LNetError: 3699350:0:(lib-move.c:3952:lnet_handle_recovery_reply()) Skipped 3 previous similar messages
[6597351.614380] mlx5_core 0000:63:00.0 ib0: Failed to get min RX wqes on Channel[0] RQN[0xc01049] wq cur_sz(0) min_rx_wqes(128)
[6597351.614386] mlx5_core 0000:63:00.0 ib0: RX timeout on channel: 0, ICOSQ: 0x8a3e, RQ: 0xc01049, CQ: 0x1449
[6597351.625344] mlx5_core 0000:63:00.0 ib0: EQ 0x7: Cons = 0x3d362e, irqn = 0x42
[6597351.656381] mlx5_core 0000:63:00.0 ib0: Failed to get min RX wqes on Channel[1] RQN[0xc0104a] wq cur_sz(0) min_rx_wqes(128)
[6597351.656384] mlx5_core 0000:63:00.0 ib0: RX timeout on channel: 1, ICOSQ: 0x8a43, RQ: 0xc0104a, CQ: 0x144e
[6597351.667321] mlx5_core 0000:63:00.0 ib0: EQ 0x8: Cons = 0x76f7e, irqn = 0x43
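To check whether another vmcore-dmesg.txt shows the same signature, the NETDEV WATCHDOG line can be extracted with a small script. The following is a minimal sketch, not a supported tool: the function name find_watchdog_timeouts is hypothetical, and the regular expression assumes the message format seen in the log above.

```python
import re

# Assumed format, taken from the log excerpt above:
#   NETDEV WATCHDOG: <iface> (<driver>): transmit queue <N> timed out
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_timeouts(dmesg_text):
    """Return (interface, driver, queue) for each NETDEV WATCHDOG line found."""
    hits = []
    for line in dmesg_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            hits.append(
                (match.group("iface"), match.group("driver"), int(match.group("queue")))
            )
    return hits


if __name__ == "__main__":
    sample = (
        "[6597309.128361] NETDEV WATCHDOG: ib0 (mlx5_core): "
        "transmit queue 13 timed out"
    )
    print(find_watchdog_timeouts(sample))
```

In practice the function would be given the contents of vmcore-dmesg.txt, for example via open(path).read(), where path is wherever kdump saved the file on the affected system.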
Environment
- Red Hat Enterprise Linux 8
- Third-party (out-of-tree) mlx5_core driver
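One way to confirm that the mlx5_core driver in a given log is out-of-tree is to look at the taint flags in the "Modules linked in:" line, where the kernel marks out-of-tree modules with O and unsigned modules with E, as in mlx5_core(OE). The sketch below is illustrative only; the helper name out_of_tree_modules is hypothetical, and the pattern assumes the flag notation shown in the log above.

```python
import re

# Matches entries such as "mlx5_core(OE)": a module name followed by taint
# flags in parentheses. Modules printed without flags are not matched.
MODULE_FLAGS_RE = re.compile(r"(\w+)\(([A-Z+-]+)\)")


def out_of_tree_modules(modules_text):
    """Return names of modules whose taint flags include 'O' (out-of-tree)."""
    return [
        name
        for name, flags in MODULE_FLAGS_RE.findall(modules_text)
        if "O" in flags
    ]


if __name__ == "__main__":
    sample = (
        "Modules linked in: nf_tables lustre(OE) ib_core(OE) "
        "raid1 mlx5_core(OE) igb"
    )
    modules = out_of_tree_modules(sample)
    print(modules)
    print("mlx5_core is out-of-tree:", "mlx5_core" in modules)
```

Because the module list in a kernel log can wrap onto a second line without the "Modules linked in:" prefix, the helper deliberately scans any text it is given rather than requiring that prefix.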