ena driver TX timeouts leading to softirq hangs
Issue
- AWS EC2 instances intermittently enter a hung state. Network transmit timeouts start occurring and the NIC hangs:
ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 1, index 492. 13094000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 1, index 493. 13097000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 1, index 494. 13101000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 1, index 495. 13104000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
ena 0000:00:05.0 eth0: The number of lost tx completions is above the threshold (132 > 128). Reset the device
ena 0000:00:05.0 eth0: Trigger reset is on
ena 0000:00:05.0 eth0: tx_timeout: 9
ena 0000:00:05.0 eth0: suspend: 0
ena 0000:00:05.0 eth0: resume: 0
ena 0000:00:05.0 eth0: wd_expired: 0
- The kernel log (dmesg) shows multiple "watchdog: BUG: soft lockup - CPU#1 stuck for 23s!" errors, which then cause ena to restart and connections to be lost.
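The log signatures above can be picked out of saved dmesg output with a short script. The following is a minimal sketch, not a supported tool: the function name summarize_ena_log, the SAMPLE text, and the summary layout are hypothetical, and the patterns are written only against the message formats quoted in this article, so other driver versions may word them differently.

```python
import re

# Patterns modelled on the messages quoted in this article (assumption:
# other ena driver versions may format these lines differently).
MISSING_TX = re.compile(
    r"ena (?P<pci>\S+) (?P<ifname>\S+): Found a Tx that wasn't completed on time, "
    r"qid (?P<qid>\d+), index (?P<index>\d+)\. "
    r"(?P<usecs>\d+) usecs have passed since last napi execution\. "
    r"Missing Tx timeout value (?P<timeout_ms>\d+) msecs"
)
SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!"
)
DEVICE_RESET = re.compile(r"ena \S+ \S+: .*Reset the device")

# Hypothetical sample input assembled from the lines shown above.
SAMPLE = """\
ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 1, index 492. 13094000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 1, index 493. 13097000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
ena 0000:00:05.0 eth0: The number of lost tx completions is above the threshold (132 > 128). Reset the device
watchdog: BUG: soft lockup - CPU#1 stuck for 23s!
"""


def summarize_ena_log(text):
    """Collect missing-Tx events, soft lockups, and device resets from log text."""
    summary = {"missing_tx": [], "soft_lockups": [], "resets": 0}
    for line in text.splitlines():
        m = MISSING_TX.search(line)
        if m:
            summary["missing_tx"].append({
                "ifname": m["ifname"],
                "qid": int(m["qid"]),
                "index": int(m["index"]),
                # Time since the last NAPI run, converted from usecs to msecs.
                "stalled_ms": int(m["usecs"]) // 1000,
                "timeout_ms": int(m["timeout_ms"]),
            })
            continue
        m = SOFT_LOCKUP.search(line)
        if m:
            summary["soft_lockups"].append(
                {"cpu": int(m["cpu"]), "seconds": int(m["secs"])}
            )
            continue
        if DEVICE_RESET.search(line):
            summary["resets"] += 1
    return summary


if __name__ == "__main__":
    print(summarize_ena_log(SAMPLE))
```

The patterns use search rather than a full-line match, so lines that carry a dmesg timestamp prefix are still recognised. Comparing stalled_ms against timeout_ms shows how far past the missing Tx timeout each completion was when the driver reported it.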
Environment
- Red Hat Enterprise Linux 9.4 or earlier
- Red Hat Enterprise Linux 8.10 with kernel-4.18.0-553.el8_10 or earlier
- Amazon (AWS) instances using the ena network driver
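To check whether a RHEL 8 kernel falls within the range listed above, the release string can be compared against the stated threshold. This is a rough sketch under stated assumptions: the helper names release_key and in_listed_el8_range are hypothetical, and the comparison only looks at the numeric fields before the ".el" dist tag, which is a simplification of full RPM version comparison. A result of False means only that the release is outside the listed range; this article does not state which release contains a fix.

```python
import re

# Threshold taken from the Environment list above.
LISTED_EL8_MAX = "4.18.0-553.el8_10"


def release_key(release):
    """Return the numeric fields before the '.el' dist tag as a tuple of ints."""
    head = release.split(".el", 1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", head))


def in_listed_el8_range(release):
    """True if an el8 kernel release is at or below the listed threshold.

    Simplified ordering (assumption): tuples of numeric fields are compared,
    so 4.18.0-553.5.1 sorts after 4.18.0-553.
    """
    return ".el8" in release and release_key(release) <= release_key(LISTED_EL8_MAX)


if __name__ == "__main__":
    for rel in (
        "4.18.0-513.24.1.el8_9.x86_64",
        "4.18.0-553.el8_10.x86_64",
        "4.18.0-553.5.1.el8_10.x86_64",
    ):
        print(rel, in_listed_el8_range(rel))
```

On a running system the release string can be obtained from uname -r or, in Python, from platform.release().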