Why does the Broadcom NetXtreme 5709 NIC stop receiving packets intermittently on RHEL 5.3 and newer?

Environment

  • Red Hat Enterprise Linux 5.3 to 5.5

  • Network Interface Cards (NIC) using the bnx2 driver including:

    • Broadcom Corporation NetXtreme II BCM5709S Gigabit Ethernet

Issue

  • In certain situations under heavy load, the network interface card can stop accepting packets from remote devices.
  • This problem has been reported on Red Hat Enterprise Linux 5.3 (RHEL 5.3) and newer when using a Broadcom NetXtreme 5709 network interface card.

Resolution

  • Red Hat has released kernel-2.6.18-194.3.1.el5, which addresses this issue in RHEL 5. It is available through the following erratum:
    https://access.redhat.com/errata/RHSA-2010:0398
  • The update fixes the following problem, tracked as Bugzilla 587799: in certain circumstances, under heavy load, certain network interface cards using the bnx2 driver and configured to use MSI-X (extended MSI) could stop processing interrupts, after which network connectivity would cease.
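
To check whether a system already runs a kernel at or above the fixed build, the release string reported by uname -r can be compared against 2.6.18-194.3.1.el5. The following is a minimal sketch, not a Red Hat tool: kernel_has_fix is a hypothetical helper name, it handles only the 2.6.18-*.el5 series, and it assumes that later RHEL 5 kernels carry the fix forward.

```shell
# Hypothetical helper: exit 0 when a RHEL 5 kernel release string is at or
# above 2.6.18-194.3.1.el5, exit 1 when it is older, and exit 2 for strings
# outside the 2.6.18-*.el5 series.
# Assumption: kernels newer than the fixed build also contain the fix.
kernel_has_fix() {
    case $1 in
        2.6.18-*.el5*) ;;
        *) return 2 ;;
    esac

    rel=${1#2.6.18-}      # drop the base version, e.g. 194.3.1.el5
    rel=${rel%%.el5*}     # drop the dist suffix,  e.g. 194.3.1

    case $rel in
        ''|*[!0-9.]*) return 2 ;;   # only digits and dots are expected
    esac

    old_ifs=$IFS
    IFS=.
    set -- $rel           # split the release on dots
    IFS=$old_ifs

    major=${1:-0}; minor=${2:-0}; patch=${3:-0}

    # Compare field by field against 194.3.1
    if [ "$major" -ne 194 ]; then
        if [ "$major" -gt 194 ]; then return 0; else return 1; fi
    fi
    if [ "$minor" -ne 3 ]; then
        if [ "$minor" -gt 3 ]; then return 0; else return 1; fi
    fi
    if [ "$patch" -ge 1 ]; then return 0; else return 1; fi
}
```

For example, kernel_has_fix "$(uname -r)" && echo "fixed kernel running" prints the message only when the running kernel is at or above that build.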

If upgrading the kernel is not an option, review the following workarounds:

  • Disable MSI-X in the bnx2 driver. To do this, add the following line to /etc/modprobe.conf (the option is applied the next time the bnx2 module is loaded):
    options bnx2 disable_msi=1
  • Disable MSI (Message Signaled Interrupts) completely by booting with the pci=nomsi boot parameter. This disables MSI on every device that is able to use it.
    Note: MSI-X increases network performance, so disabling it returns performance to the level available before MSI-X was introduced.

  • Disable C-States in the BIOS. Refer to the system vendor's documentation for instructions.
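
After applying one of the MSI workarounds above, the interrupt mode actually in use can be checked in /proc/interrupts. The sketch below is an illustration only: nic_uses_msix is a hypothetical helper name, and it assumes that the kernel labels MSI-X vectors with the string "MSI-X" in that file and that the interface name appears as a whole word on its interrupt lines.

```shell
# Hypothetical helper: reads /proc/interrupts-style text on stdin and
# exits 0 if any line for the given interface mentions MSI-X, else 1.
# Assumption: the kernel labels MSI-X vectors with the string "MSI-X".
nic_uses_msix() {
    # $1: interface name, e.g. eth0
    # -F: match the name literally; -w: match it only as a whole word,
    # so eth0 matches "eth0" and "eth0-1" but not "eth01".
    if grep -F -w -- "$1" | grep -q 'MSI-X'; then
        return 0
    fi
    return 1
}
```

For example, nic_uses_msix eth0 < /proc/interrupts && echo "eth0 still uses MSI-X" prints the message only if MSI-X vectors are still assigned to eth0.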

Root Cause

  • The kernel gets out of sync with the interrupts generated by the network interface card. Interrupts can then no longer be processed, which causes packets to be dropped and, ultimately, connectivity to be lost.
    • When this situation occurs, the rx_fw_discards counter keeps increasing as remote devices unsuccessfully attempt to communicate with the system via the NIC.
  • It has been reported that under certain heavy traffic conditions in MSI-X mode, the bnx2 driver can lose an MSI-X vector, causing all packets in the associated rx/tx ring pair to be dropped. The problem is caused by the chip dropping the kernel's write to unmask the MSI-X vector (when migrating the IRQ, for example). This can be prevented by increasing the GRC timeout value for these register read and write operations.
  • The issue is resolved upstream by the following commit:
    Commit ID: c441b8d2cb2194b05550a558d6d95d8944e56a84

Diagnostic Steps

  • The kernel gets out of sync with the interrupts generated by the network interface card. As a result, interrupts are not processed, packets sent to this network device are dropped and, ultimately, connectivity is lost.
    • When this situation occurs, the rx_fw_discards counter displayed by the ethtool utility keeps increasing as remote devices unsuccessfully attempt to communicate with the system via the NIC.
    • Note that the NIC occasionally drops packets as part of normal operation, which also causes rx_fw_discards to increment. An increasing counter alone does not necessarily indicate that this issue has occurred.

The keys to determining that this specific problem has occurred are:

  1. Confirm that all packets sent to the NIC are dropped by repeatedly running this command:
    # ethtool -S eth0 | grep rx_fw_discards

    (Replace "eth0" with the interface that appears to be having trouble receiving.)

    Each time this command is executed, the value returned should be higher than on the previous run, because remote devices keep attempting to communicate with the NIC in question. The values should increase in a manner similar to this:

         rx_fw_discards: 53843
         rx_fw_discards: 55467
         rx_fw_discards: 57071
         rx_fw_discards: 58791
         rx_fw_discards: 60596
         rx_fw_discards: 62481
         rx_fw_discards: 64285
         rx_fw_discards: 66069
    
  2. Confirm that the number of interrupts processed does not increase on the IRQs assigned to the NIC by repeatedly running this command:
    # grep eth0 /proc/interrupts

    (Replace "eth0" with the name of the interface where trouble is suspected.)

    Run the command while remote devices are attempting to transmit to the failing system. Normally, each interrupt counter listed for that interface (e.g. eth0) increases as packets are received from remote devices. In the situation described here, one or more of the interrupt counters stop incrementing. In severe cases, the counters for all of the interface's interrupts remain constant and the interface receives no packets from any remote device.

  3. Typically there is no syslog or dmesg output to indicate the issue has occurred.
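
The first two checks above can be combined into one sampling script. The following is a minimal sketch rather than a supported tool: the function names (fw_discards, irq_counts, stalled_vectors, check_bnx2_stall) are hypothetical, and the parsing assumes that the per-CPU counts in /proc/interrupts are the purely numeric fields between the IRQ number and the interface name in the last column.

```shell
# Print the rx_fw_discards value from `ethtool -S` output read on stdin.
fw_discards() {
    awk -F': *' '$1 ~ /rx_fw_discards$/ { print $2; exit }'
}

# Read /proc/interrupts-style text on stdin and print "<name> <count>" for
# each line whose last field is the interface ($1) or "<interface>-<n>".
# The count is the sum of the numeric per-CPU columns on that line.
irq_counts() {
    awk -v ifc="$1" '
        {
            name = $NF
            if (name == ifc || index(name, ifc "-") == 1) {
                sum = 0
                for (i = 2; i < NF; i++)
                    if ($i ~ /^[0-9]+$/) sum += $i
                printf "%s %.0f\n", name, sum
            }
        }'
}

# Given two irq_counts snapshots ($1 = before, $2 = after), print the names
# of the vectors whose count did not change between them.
stalled_vectors() {
    { printf '%s\n' "$1"; echo "--"; printf '%s\n' "$2"; } | awk '
        NF == 0      { next }
        $1 == "--"   { second = 1; next }
        !second      { before[$1] = $2; next }
        ($1 in before) && before[$1] == $2 { print $1 }'
}

# Take two samples $2 seconds apart (default 5) for interface $1.
check_bnx2_stall() {
    ifc=$1
    delay=${2:-5}
    d0=$(ethtool -S "$ifc" | fw_discards)
    q0=$(irq_counts "$ifc" < /proc/interrupts)
    sleep "$delay"
    d1=$(ethtool -S "$ifc" | fw_discards)
    q1=$(irq_counts "$ifc" < /proc/interrupts)
    echo "rx_fw_discards: $d0 -> $d1"
    echo "vectors with no new interrupts:"
    stalled_vectors "$q0" "$q1"
}
```

Running check_bnx2_stall eth0 10 as root while remote devices are transmitting to the system prints both discard samples and any stalled vectors. A rising rx_fw_discards value together with one or more vectors that show no new interrupts matches the pattern described in steps 1 and 2. A vector can also be idle simply because no traffic reached its ring, so the result should be read with that in mind.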

