ena driver TX timeouts leading to softirq hangs

Solution Verified

Issue

  • AWS EC2 instances intermittently go into a hung state. Network transmit timeouts start occurring and the NIC hangs:
ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 1, index 492. 13094000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 1, index 493. 13097000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 1, index 494. 13101000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 1, index 495. 13104000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
ena 0000:00:05.0 eth0: The number of lost tx completions is above the threshold (132 > 128). Reset the device
ena 0000:00:05.0 eth0: Trigger reset is on
ena 0000:00:05.0 eth0: tx_timeout: 9
ena 0000:00:05.0 eth0: suspend: 0
ena 0000:00:05.0 eth0: resume: 0
ena 0000:00:05.0 eth0: wd_expired: 0
  • The kernel log (dmesg) shows multiple watchdog: BUG: soft lockup - CPU#1 stuck for 23s! errors, which then cause the ena device to reset and connections to be lost.
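
To confirm that a system shows the same pattern, the messages above can be counted in a saved kernel log. The following is a minimal sketch, not a supported tool: the regular expressions are modeled only on the log lines quoted in this article, and the function name summarize_ena_log and its result keys are hypothetical names chosen for illustration. Other ena driver versions may word these messages differently.

```python
import re

# Patterns modeled on the log lines quoted in this article (an assumption:
# other ena driver versions may phrase these messages differently).
TX_TIMEOUT_RE = re.compile(
    r"ena (?P<pci>\S+) (?P<iface>\S+): Found a Tx that wasn't completed on time, "
    r"qid (?P<qid>\d+), index (?P<index>\d+)\. "
    r"(?P<usecs>\d+) usecs have passed since last napi execution\. "
    r"Missing Tx timeout value (?P<timeout_ms>\d+) msecs"
)
RESET_RE = re.compile(
    r"The number of lost tx completions is above the threshold "
    r"\((?P<lost>\d+) > (?P<threshold>\d+)\)"
)
SOFT_LOCKUP_RE = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!"
)


def summarize_ena_log(text):
    """Count ena Tx timeouts, threshold resets, and soft lockups in log text."""
    summary = {
        "tx_timeouts": 0,          # number of "Found a Tx ..." lines
        "queues": set(),           # Tx queue ids (qid) that reported timeouts
        "max_napi_gap_usecs": 0,   # largest gap since last napi execution
        "resets": 0,               # "lost tx completions ... above the threshold" lines
        "soft_lockup_cpus": set(), # CPUs named in soft lockup messages
    }
    for line in text.splitlines():
        m = TX_TIMEOUT_RE.search(line)
        if m:
            summary["tx_timeouts"] += 1
            summary["queues"].add(int(m.group("qid")))
            summary["max_napi_gap_usecs"] = max(
                summary["max_napi_gap_usecs"], int(m.group("usecs"))
            )
            continue
        if RESET_RE.search(line):
            summary["resets"] += 1
            continue
        m = SOFT_LOCKUP_RE.search(line)
        if m:
            summary["soft_lockup_cpus"].add(int(m.group("cpu")))
    return summary
```

The function takes the log as a string, so it can be fed the contents of a saved dmesg capture. A napi gap well above the 5000 msecs timeout, together with soft lockups on the same system, matches the issue described here.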

Environment

  • Red Hat Enterprise Linux 9.4 or earlier
  • Red Hat Enterprise Linux 8.10 with kernel-4.18.0-553.el8_10 or earlier
  • Amazon (AWS) instances using the ena network driver
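
Whether an interface is actually bound to the ena driver can be checked through sysfs. The sketch below assumes the standard Linux layout, in which /sys/class/net/<interface>/device/driver is a symbolic link to the driver directory. The function names interface_driver and uses_ena are hypothetical, and the sysfs_root parameter exists only so the check can be exercised against a test directory.

```python
import os


def interface_driver(iface, sysfs_root="/sys/class/net"):
    """Return the kernel driver name bound to iface, or None if it cannot be read."""
    link = os.path.join(sysfs_root, iface, "device", "driver")
    try:
        # The link target ends in the driver name, e.g. ".../drivers/ena".
        return os.path.basename(os.readlink(link))
    except OSError:
        # Virtual interfaces such as lo have no device/driver link.
        return None


def uses_ena(iface, sysfs_root="/sys/class/net"):
    """True if iface is bound to the ena driver."""
    return interface_driver(iface, sysfs_root) == "ena"
```

On an affected instance, uses_ena("eth0") would be expected to return True, with eth0 being the interface name shown in the log messages above.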
