irqbalance does not balance the IRQ correctly if the underlying network device resets

Solution Verified

Environment

  • Red Hat Enterprise Linux (RHEL) 6.9 and earlier
  • Red Hat Enterprise Linux (RHEL) 7.4 and earlier
  • irqbalance-1.0.7-8
  • Network device reset
    • Common on Amazon AWS with enavf or ena driver
    • Any other NIC driver could produce it with the right conditions

Issue

  • If an interrupt channel (IRQ) disappears and later reappears (as happens frequently on AWS with the ena driver), irqbalance does not balance that IRQ correctly. When the IRQ reappears, its counter is smaller than the value irqbalance recorded earlier, so the computed difference in irq_count overflows.
  • The issue is readily reproducible on Amazon AWS VMs with an enavf device under high network load.
  • The following messages are logged:

    kernel: [4293535.378166] ena 0000:00:03.0: eth0: Transmit time out
    
    kernel: [4293551.684567] ena: ena device version: 0.10
    kernel: [4293551.686344] ena: ena controller version: 0.0.1 implementation version 1
    kernel: [4293553.104073] ena 0000:00:03.0: irq 48 for MSI/MSI-X
    
    
    kernel: [4293553.104916] ena 0000:00:03.0: irq 56 for MSI/MSI-X
    kernel: [4293553.111858] ena 0000:00:03.0: Device reset completed successfully
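
The overflow described above can be illustrated with a minimal sketch. This is not irqbalance's code and not the upstream fix; it assumes a 64-bit unsigned counter, and the function names and sample values are hypothetical. It only shows why subtracting a larger previous count from a smaller current count produces a huge bogus load figure, and one way a reset could be detected.

```python
# Illustration only: emulates unsigned 64-bit arithmetic (an assumption
# about the counter width) to show how a counter that restarts after a
# device reset yields a wrapped-around difference.
MASK64 = (1 << 64) - 1


def naive_delta(prev_count, curr_count):
    """Unsigned 64-bit subtraction; wraps around when curr_count < prev_count."""
    return (curr_count - prev_count) & MASK64


def guarded_delta(prev_count, curr_count):
    """Hypothetical guard: treat a counter that went backwards as a reset
    and count the interrupts seen since the reset."""
    if curr_count < prev_count:
        return curr_count
    return curr_count - prev_count


# Hypothetical samples: count before the reset, count after the IRQ reappears.
prev, curr = 1_000_000, 250
print("naive delta:  ", naive_delta(prev, curr))
print("guarded delta:", guarded_delta(prev, curr))
```

With the naive subtraction the IRQ appears to have fired close to 2^64 times in one interval, which skews any load-based placement decision; the guarded version reports only the interrupts counted since the reset.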
    

Resolution

Root Cause

This issue was fixed in upstream irqbalance with commit 93ed801.

The fix was backported to the following releases:

  • RHEL 7.5 with Red Hat Bug 1536373
  • RHEL 7.4.z with Red Hat Bug 1542450
  • RHEL 6.10 with Red Hat Bug 1536370
  • RHEL 6.9.z with Red Hat Bug 1541290
  • RHEL 6.7.z EUS with Red Hat Bug 1541293

