RHEL8: Mellanox driver [mlx5_core] causing kernel stack overflow.
Issue
- Mellanox driver causing kernel stack overflow

[3560714.847691] CIFS PidTable: buckets 64
[3560714.851549] CIFS BufTable: buckets 64
[3560716.170855] ------------[ cut here ]------------
[3560716.175662] NETDEV WATCHDOG: ens6f0 (mlx5_core): transmit queue 5 timed out
[3560716.182855] WARNING: CPU: 74 PID: 0 at net/sched/sch_generic.c:447 dev_watchdog+0x272/0x280
[3560716.183813] Modules linked in: xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag nvidia_uvm(OE) nfsv3 nfs_acl nfs lockd grace fscache mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) mmfs26(OE) mmfslinux(OE) tracedev(OE) nvidia_peermem(POE) uas usb_storage mpt3sas raid_class scsi_transport_sas xt_CHECKSUM ipt_REJECT nf_reject_ipv4 xt_conntrack ipt_MASQUERADE nf_conntrack_netlink nft_counter xt_addrtype nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables libcrc32c nfnetlink br_netfilter tun bridge stp llc overlay rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) dell_rbu gdrdrv(POE) sunrpc vfat fat intel_rapl_msr dcdbas intel_rapl_common amd64_edac_mod edac_mce_amd amd_energy kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl pcspkr dell_smbios dell_wmi_descriptor wmi_bmof ipmi_ssif mgag200 mlx5_ib(OE)
[3560716.183813] ib_uverbs(OE) ib_core(OE) ccp sp5100_tco k10temp i2c_piix4 acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter binfmt_misc knem(OE) ip_tables ext4 mbcache jbd2 sd_mod sg nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) nouveau mlx5_core(OE) video mxm_wmi mlxfw(OE) i2c_algo_bit pci_hyperv_intf drm_kms_helper tls syscopyarea ahci psample sysfillrect sysimgblt libahci mlxdevm(OE) nvme fb_sys_fops auxiliary(OE) megaraid_sas nvme_core libata crc32c_intel ttm tg3 t10_pi drm mlx_compat(OE) wmi dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: libcfs]
[3560716.183813] CPU: 74 PID: 0 Comm: swapper/74 Kdump: loaded Tainted: P OEL --------- - - 4.18.0-305.25.1.el8_4.x86_64 #1
[3560716.183813] Hardware name: Dell Inc. PowerEdge XE8545/099K88, BIOS 2.6.6 01/13/2022
[3560716.183813] RIP: 0010:dev_watchdog+0x272/0x280
[3560716.183813] Code: 48 85 c0 75 e4 eb 9b 4c 89 f7 c6 05 b4 06 fe 00 01 e8 52 e4 fa ff 89 d9 4c 89 f6 48 c7 c7 40 c1 76 a8 48 89 c2 e8 e7 f2 8e ff <0f> 0b e9 7a ff ff ff 0f 1f 80 00 00 00 00 0f 1f 44 00 00 41 57 41
[3560716.183813] RSP: 0018:ffffa359cd784e88 EFLAGS: 00010282
[3560716.183813] RAX: 0000000000000000 RBX: 0000000000000005 RCX: 0000000000000000
[3560716.183813] RDX: ffff9469dfc267e0 RSI: ffff9469dfc16808 RDI: ffff9469dfc16808
[3560716.183813] RBP: ffff9488fff0045c R08: 0000000000001fda R09: 0000000000000050
[3560716.183813] R10: 0000000000000000 R11: ffffa359cd784d30 R12: 000000000000004a
[3560716.183813] R13: ffff9488fff00480 R14: ffff9488fff00000 R15: 0000000000000400
[3560716.183813] FS: 0000000000000000(0000) GS:ffff9469dfc00000(0000) knlGS:0000000000000000
[3560716.183813] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3560716.183813] CR2: 00007ffe25598ff8 CR3: 000000650f210002 CR4: 0000000000770ee0
[3560716.183813] PKRU: 55555554
[3560716.183813] Call Trace:
[3560716.183813] <IRQ>
[3560716.183813] ? pfifo_fast_enqueue+0x140/0x140
[3560716.183813] call_timer_fn+0x2d/0x130
[3560716.183813] run_timer_softirq+0x1d8/0x410
[3560716.183813] ? __hrtimer_run_queues+0x130/0x280
[3560716.183813] ? ktime_get+0x36/0xa0
[3560716.183813] __do_softirq+0xd7/0x2d6
[3560716.183813] irq_exit+0xf7/0x100
[3560716.183813] smp_apic_timer_interrupt+0x74/0x130
[3560716.183813] apic_timer_interrupt+0xf/0x20
[3560716.183813] </IRQ>
[3560716.183813] RIP: 0010:native_safe_halt+0xe/0x10
[3560716.183813] Code: ff ff 7f c3 65 48 8b 04 25 40 5c 01 00 f0 80 48 02 20 48 8b 00 a8 08 75 c4 eb 80 90 e9 07 00 00 00 0f 00 2d 96 b1 4b 00 fb f4 <c3> 90 e9 07 00 00 00 0f 00 2d 86 b1 4b 00 f4 c3 90 90 0f 1f 44 00
[3560716.183813] RSP: 0018:ffffa359cc797ea0 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
[3560716.183813] RAX: ffffffffa7f4e580 RBX: 000000000000004a RCX: ffff94314dc6b300
[3560716.183813] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000ca6752b28b380
[3560716.183813] RBP: 000000000000004a R08: fffffffffff14239 R09: 0000000000029780
[3560716.183813] R10: 0021761cb6753275 R11: 0000000000000001 R12: ffffffffffffffff
[3560716.183813] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9462064d9800
[3560716.183813] ? __sched_text_end+0x7/0x7
[3560716.183813] default_idle+0xa/0x10
[3560716.183813] default_idle_call+0x40/0xf0
[3560716.183813] do_idle+0x1f4/0x260
[3560716.183813] cpu_startup_entry+0x6f/0x80
[3560716.183813] start_secondary+0x199/0x1e0
[3560716.183813] secondary_startup_64_no_verify+0xc2/0xcb
[3560716.183813] ---[ end trace 8645d071ba046d16 ]---
[3560716.584335] mlx5_core 0000:a1:00.0 ens6f0: TX timeout detected
[3560716.590364] mlx5_core 0000:a1:00.0 ens6f0: TX timeout on queue: 5, SQ: 0x1201, CQ: 0x57c, SQ Cons: 0xd59a SQ Prod: 0xd669, usecs since last trans: 24988000
[3560716.605068] BUG: stack guard page was hit at 000000008665cdd0 (stack is 00000000e8ec2ebd..00000000ac5855c8)
[3560716.606055] kernel stack overflow (page fault): 0000 [#1] SMP NOPTI <--
[3560716.606055] CPU: 85 PID: 417273 Comm: kworker/u192:0 Kdump: loaded Tainted: P W OEL --------- - - 4.18.0-305.25.1.el8_4.x86_64 #1
[3560716.606055] Hardware name: Dell Inc. PowerEdge XE8545/099K88, BIOS 2.6.6 01/13/2022
[3560716.606055] Workqueue: mlx5e mlx5e_tx_timeout_work [mlx5_core]
[3560716.606055] RIP: 0010:mlx5e_tx_reporter_dump_sq+0xe3/0x180 [mlx5_core]
[3560716.606055] Code: c0 75 8a 48 c7 c6 18 fc 79 c0 48 89 ef e8 15 f8 ff ff 85 c0 0f 85 73 ff ff ff 48 89 ea 48 89 e6 48 89 df c7 04 24 06 00 00 00 <41> 8b 84 24 b8 03 00 00 c7 44 24 0c 01 00 00 00 89 44 24 04 e8 94
[3560716.606055] RSP: 0018:ffffa359e0d53c18 EFLAGS: 00010246
[3560716.606055] RAX: 0000000000000000 RBX: ffff9488fff00ac0 RCX: 0000000000000001
[3560716.606055] RDX: ffff943045d97500 RSI: ffffa359e0d53c18 RDI: ffff9488fff00ac0
[3560716.606055] RBP: ffff943045d97500 R08: ffff943045d97500 R09: ffff9430ee247a80
[3560716.606055] R10: 0000000000000100 R11: 0000000000000200 R12: ffffa359e0d53d30
[3560716.606055] R13: 0000000000000000 R14: ffffa359e0d53d40 R15: ffff9488fff00ac0
[3560716.606055] FS: 0000000000000000(0000) GS:ffff9489bfbc0000(0000) knlGS:0000000000000000
[3560716.606055] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3560716.606055] CR2: ffffa359e0d540e8 CR3: 000000650f210005 CR4: 0000000000770ee0
[3560716.606055] PKRU: 55555554
[3560716.606055] Call Trace:
[3560716.606055] mlx5e_tx_reporter_dump+0x49/0x1f0 [mlx5_core]
[3560716.606055] devlink_health_do_dump.part.73+0x5d/0xd0
[3560716.751170] devlink_health_report+0x174/0x1f0
[3560716.751170] mlx5e_reporter_tx_timeout+0xb9/0xf0 [mlx5_core]
[3560716.751170] ? mlx5e_tx_reporter_err_cqe_recover+0x1d0/0x1d0 [mlx5_core]
[3560716.751170] ? mlx5e_health_queue_dump+0xd0/0xd0 [mlx5_core]
[3560716.751170] ? entry_SYSCALL_64_after_hwframe+0xb8/0xca
[3560716.751170] ? __switch_to_asm+0x35/0x70
[3560716.751170] ? __switch_to_asm+0x41/0x70
[3560716.751170] ? __switch_to_asm+0x35/0x70
[3560716.751170] ? __switch_to_asm+0x41/0x70
[3560716.751170] ? __switch_to_asm+0x35/0x70
[3560716.751170] ? __switch_to_asm+0x41/0x70
[3560716.751170] ? __switch_to_asm+0x35/0x70
[3560716.751170] ? __switch_to_asm+0x41/0x70
[3560716.751170] ? __switch_to_asm+0x35/0x70
[3560716.751170] ? __switch_to_asm+0x41/0x70
[3560716.751170] ? __switch_to+0x183/0x480
[3560716.751170] mlx5e_tx_timeout_work+0x8b/0xb0 [mlx5_core]
[3560716.751170] process_one_work+0x1a7/0x360
[3560716.751170] ? create_worker+0x1a0/0x1a0
[3560716.751170] worker_thread+0x30/0x390
[3560716.751170] ? create_worker+0x1a0/0x1a0
[3560716.751170] kthread+0x116/0x130
[3560716.751170] ? kthread_flush_work_fn+0x10/0x10
[3560716.751170] ret_from_fork+0x22/0x40
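
The sequence in the trace above (an mlx5_core TX timeout, then a kernel stack overflow whose trace includes mlx5e_tx_reporter_dump) can be checked for in a saved kernel log. The following is a minimal sketch, not a Red Hat tool: the function name `matches_issue` and the three patterns are assumptions derived only from the message text shown above, and a match indicates a similar signature, not a confirmed diagnosis.

```python
import re

# Patterns copied from the messages in the trace above.
# The names here are hypothetical and exist only for this sketch.
TX_TIMEOUT = re.compile(r"mlx5_core \S+ (\S+): TX timeout detected")
STACK_OVERFLOW = re.compile(r"kernel stack overflow \(page fault\)")
DUMP_FRAME = re.compile(r"mlx5e_tx_reporter_dump")


def matches_issue(log_text):
    """Return True if log_text shows, in order: an mlx5_core TX timeout,
    a kernel stack overflow, and an mlx5e_tx_reporter_dump frame."""
    timeout = TX_TIMEOUT.search(log_text)
    if timeout is None:
        return False
    # Only look for the overflow after the timeout message.
    overflow = STACK_OVERFLOW.search(log_text, timeout.end())
    if overflow is None:
        return False
    # The dump frame must appear after the overflow banner.
    return DUMP_FRAME.search(log_text, overflow.end()) is not None
```

To use it, read a saved log (for example, the output of `dmesg` or the `vmcore-dmesg.txt` file written by kdump) into a string and pass it to `matches_issue`.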
Environment
- Red Hat Enterprise Linux (RHEL) 8
- mlx5_core