RHEL8: Mellanox driver [mlx5_core] causing kernel stack overflow.

Solution Verified

Issue

  • The Mellanox mlx5_core driver triggers a kernel stack overflow while handling a TX timeout (mlx5e_tx_timeout_work). The kernel log shows:

    [3560714.847691] CIFS PidTable: buckets 64
    [3560714.851549] CIFS BufTable: buckets 64
    [3560716.170855] ------------[ cut here ]------------
    [3560716.175662] NETDEV WATCHDOG: ens6f0 (mlx5_core): transmit queue 5 timed out
    [3560716.182855] WARNING: CPU: 74 PID: 0 at net/sched/sch_generic.c:447 dev_watchdog+0x272/0x280
    [3560716.183813] Modules linked in: xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag nvidia_uvm(OE) nfsv3 nfs_acl nfs lockd grace fscache mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) mmfs26(OE) mmfslinux(OE) tracedev(OE) nvidia_peermem(POE) uas usb_storage mpt3sas raid_class scsi_transport_sas xt_CHECKSUM ipt_REJECT nf_reject_ipv4 xt_conntrack ipt_MASQUERADE nf_conntrack_netlink nft_counter xt_addrtype nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables libcrc32c nfnetlink br_netfilter tun bridge stp llc overlay rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) dell_rbu gdrdrv(POE) sunrpc vfat fat intel_rapl_msr dcdbas intel_rapl_common amd64_edac_mod edac_mce_amd amd_energy kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl pcspkr dell_smbios dell_wmi_descriptor wmi_bmof ipmi_ssif mgag200 mlx5_ib(OE)
    [3560716.183813]  ib_uverbs(OE) ib_core(OE) ccp sp5100_tco k10temp i2c_piix4 acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter binfmt_misc knem(OE) ip_tables ext4 mbcache jbd2 sd_mod sg nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) nouveau mlx5_core(OE) video mxm_wmi mlxfw(OE) i2c_algo_bit pci_hyperv_intf drm_kms_helper tls syscopyarea ahci psample sysfillrect sysimgblt libahci mlxdevm(OE) nvme fb_sys_fops auxiliary(OE) megaraid_sas nvme_core libata crc32c_intel ttm tg3 t10_pi drm mlx_compat(OE) wmi dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: libcfs]
    [3560716.183813] CPU: 74 PID: 0 Comm: swapper/74 Kdump: loaded Tainted: P           OEL   --------- -  - 4.18.0-305.25.1.el8_4.x86_64 #1
    [3560716.183813] Hardware name: Dell Inc. PowerEdge XE8545/099K88, BIOS 2.6.6 01/13/2022
    [3560716.183813] RIP: 0010:dev_watchdog+0x272/0x280
    [3560716.183813] Code: 48 85 c0 75 e4 eb 9b 4c 89 f7 c6 05 b4 06 fe 00 01 e8 52 e4 fa ff 89 d9 4c 89 f6 48 c7 c7 40 c1 76 a8 48 89 c2 e8 e7 f2 8e ff <0f> 0b e9 7a ff ff ff 0f 1f 80 00 00 00 00 0f 1f 44 00 00 41 57 41
    [3560716.183813] RSP: 0018:ffffa359cd784e88 EFLAGS: 00010282
    [3560716.183813] RAX: 0000000000000000 RBX: 0000000000000005 RCX: 0000000000000000
    [3560716.183813] RDX: ffff9469dfc267e0 RSI: ffff9469dfc16808 RDI: ffff9469dfc16808
    [3560716.183813] RBP: ffff9488fff0045c R08: 0000000000001fda R09: 0000000000000050
    [3560716.183813] R10: 0000000000000000 R11: ffffa359cd784d30 R12: 000000000000004a
    [3560716.183813] R13: ffff9488fff00480 R14: ffff9488fff00000 R15: 0000000000000400
    [3560716.183813] FS:  0000000000000000(0000) GS:ffff9469dfc00000(0000) knlGS:0000000000000000
    [3560716.183813] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [3560716.183813] CR2: 00007ffe25598ff8 CR3: 000000650f210002 CR4: 0000000000770ee0
    [3560716.183813] PKRU: 55555554
    [3560716.183813] Call Trace:
    [3560716.183813]  <IRQ>
    [3560716.183813]  ? pfifo_fast_enqueue+0x140/0x140
    [3560716.183813]  call_timer_fn+0x2d/0x130
    [3560716.183813]  run_timer_softirq+0x1d8/0x410
    [3560716.183813]  ? __hrtimer_run_queues+0x130/0x280
    [3560716.183813]  ? ktime_get+0x36/0xa0
    [3560716.183813]  __do_softirq+0xd7/0x2d6
    [3560716.183813]  irq_exit+0xf7/0x100
    [3560716.183813]  smp_apic_timer_interrupt+0x74/0x130
    [3560716.183813]  apic_timer_interrupt+0xf/0x20
    [3560716.183813]  </IRQ>
    [3560716.183813] RIP: 0010:native_safe_halt+0xe/0x10
    [3560716.183813] Code: ff ff 7f c3 65 48 8b 04 25 40 5c 01 00 f0 80 48 02 20 48 8b 00 a8 08 75 c4 eb 80 90 e9 07 00 00 00 0f 00 2d 96 b1 4b 00 fb f4 <c3> 90 e9 07 00 00 00 0f 00 2d 86 b1 4b 00 f4 c3 90 90 0f 1f 44 00
    [3560716.183813] RSP: 0018:ffffa359cc797ea0 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
    [3560716.183813] RAX: ffffffffa7f4e580 RBX: 000000000000004a RCX: ffff94314dc6b300
    [3560716.183813] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000ca6752b28b380
    [3560716.183813] RBP: 000000000000004a R08: fffffffffff14239 R09: 0000000000029780
    [3560716.183813] R10: 0021761cb6753275 R11: 0000000000000001 R12: ffffffffffffffff
    [3560716.183813] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9462064d9800
    [3560716.183813]  ? __sched_text_end+0x7/0x7
    [3560716.183813]  default_idle+0xa/0x10
    [3560716.183813]  default_idle_call+0x40/0xf0
    [3560716.183813]  do_idle+0x1f4/0x260
    [3560716.183813]  cpu_startup_entry+0x6f/0x80
    [3560716.183813]  start_secondary+0x199/0x1e0
    [3560716.183813]  secondary_startup_64_no_verify+0xc2/0xcb
    [3560716.183813] ---[ end trace 8645d071ba046d16 ]---
    [3560716.584335] mlx5_core 0000:a1:00.0 ens6f0: TX timeout detected
    [3560716.590364] mlx5_core 0000:a1:00.0 ens6f0: TX timeout on queue: 5, SQ: 0x1201, CQ: 0x57c, SQ Cons: 0xd59a SQ Prod: 0xd669, usecs since last trans: 24988000
    [3560716.605068] BUG: stack guard page was hit at 000000008665cdd0 (stack is 00000000e8ec2ebd..00000000ac5855c8)
    [3560716.606055] kernel stack overflow (page fault): 0000 [#1] SMP NOPTI  <--
    [3560716.606055] CPU: 85 PID: 417273 Comm: kworker/u192:0 Kdump: loaded Tainted: P        W  OEL   --------- -  - 4.18.0-305.25.1.el8_4.x86_64 #1
    [3560716.606055] Hardware name: Dell Inc. PowerEdge XE8545/099K88, BIOS 2.6.6 01/13/2022
    [3560716.606055] Workqueue: mlx5e mlx5e_tx_timeout_work [mlx5_core]
    [3560716.606055] RIP: 0010:mlx5e_tx_reporter_dump_sq+0xe3/0x180 [mlx5_core]
    [3560716.606055] Code: c0 75 8a 48 c7 c6 18 fc 79 c0 48 89 ef e8 15 f8 ff ff 85 c0 0f 85 73 ff ff ff 48 89 ea 48 89 e6 48 89 df c7 04 24 06 00 00 00 <41> 8b 84 24 b8 03 00 00 c7 44 24 0c 01 00 00 00 89 44 24 04 e8 94
    [3560716.606055] RSP: 0018:ffffa359e0d53c18 EFLAGS: 00010246
    [3560716.606055] RAX: 0000000000000000 RBX: ffff9488fff00ac0 RCX: 0000000000000001
    [3560716.606055] RDX: ffff943045d97500 RSI: ffffa359e0d53c18 RDI: ffff9488fff00ac0
    [3560716.606055] RBP: ffff943045d97500 R08: ffff943045d97500 R09: ffff9430ee247a80
    [3560716.606055] R10: 0000000000000100 R11: 0000000000000200 R12: ffffa359e0d53d30
    [3560716.606055] R13: 0000000000000000 R14: ffffa359e0d53d40 R15: ffff9488fff00ac0
    [3560716.606055] FS:  0000000000000000(0000) GS:ffff9489bfbc0000(0000) knlGS:0000000000000000
    [3560716.606055] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [3560716.606055] CR2: ffffa359e0d540e8 CR3: 000000650f210005 CR4: 0000000000770ee0
    [3560716.606055] PKRU: 55555554
    [3560716.606055] Call Trace:
    [3560716.606055]  mlx5e_tx_reporter_dump+0x49/0x1f0 [mlx5_core]
    [3560716.606055]  devlink_health_do_dump.part.73+0x5d/0xd0
    [3560716.751170]  devlink_health_report+0x174/0x1f0
    [3560716.751170]  mlx5e_reporter_tx_timeout+0xb9/0xf0 [mlx5_core]
    [3560716.751170]  ? mlx5e_tx_reporter_err_cqe_recover+0x1d0/0x1d0 [mlx5_core]
    [3560716.751170]  ? mlx5e_health_queue_dump+0xd0/0xd0 [mlx5_core]
    [3560716.751170]  ? entry_SYSCALL_64_after_hwframe+0xb8/0xca
    [3560716.751170]  ? __switch_to_asm+0x35/0x70
    [3560716.751170]  ? __switch_to_asm+0x41/0x70
    [3560716.751170]  ? __switch_to_asm+0x35/0x70
    [3560716.751170]  ? __switch_to_asm+0x41/0x70
    [3560716.751170]  ? __switch_to_asm+0x35/0x70
    [3560716.751170]  ? __switch_to_asm+0x41/0x70
    [3560716.751170]  ? __switch_to_asm+0x35/0x70
    [3560716.751170]  ? __switch_to_asm+0x41/0x70
    [3560716.751170]  ? __switch_to_asm+0x35/0x70
    [3560716.751170]  ? __switch_to_asm+0x41/0x70
    [3560716.751170]  ? __switch_to+0x183/0x480
    [3560716.751170]  mlx5e_tx_timeout_work+0x8b/0xb0 [mlx5_core]
    [3560716.751170]  process_one_work+0x1a7/0x360
    [3560716.751170]  ? create_worker+0x1a0/0x1a0
    [3560716.751170]  worker_thread+0x30/0x390
    [3560716.751170]  ? create_worker+0x1a0/0x1a0
    [3560716.751170]  kthread+0x116/0x130
    [3560716.751170]  ? kthread_flush_work_fn+0x10/0x10
    [3560716.751170]  ret_from_fork+0x22/0x40
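
To check whether another system's kernel log shows the same sequence, the three key lines from the trace above (the NETDEV WATCHDOG timeout on an mlx5_core interface, the "kernel stack overflow" line, and an RIP inside mlx5e_tx_reporter_dump_sq) can be searched for with a short script. This is a minimal sketch, not a Red Hat diagnostic tool: the function name and the rule that all three lines must be present are assumptions made for illustration, and a match only indicates a similar symptom, not a confirmed root cause.

```python
import re

# Patterns copied from the log excerpt in this article. The function name and
# the "all three must appear" rule are illustrative assumptions.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (\S+) \(mlx5_core\): transmit queue (\d+) timed out"
)
OVERFLOW_RE = re.compile(r"kernel stack overflow")
RIP_RE = re.compile(r"RIP: [0-9a-f]+:mlx5e_tx_reporter_dump_sq\+")


def matches_mlx5_stack_overflow(log_text):
    """Return (interface, queue) if the log shows the signature, else None."""
    watchdog = WATCHDOG_RE.search(log_text)
    if watchdog is None:
        return None
    if OVERFLOW_RE.search(log_text) is None:
        return None
    if RIP_RE.search(log_text) is None:
        return None
    # Interface name and TX queue number come from the watchdog line.
    return watchdog.group(1), int(watchdog.group(2))
```

The function takes the log as a single string, for example the contents of a file holding saved kernel messages. For the excerpt above it would report the interface ens6f0 and queue 5.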
    

Environment

  • Red Hat Enterprise Linux (RHEL)
    • 8
  • mlx5_core
