NIC card BCM5720 randomly stops receiving packets


Dear community,

Server: Dell R640
OS: Red Hat Enterprise Linux 7.3 (Maipo)
NIC: Broadcom BCM5720 (4 ports)

Issue:
The NIC randomly stops receiving packets.
There are no error messages in dmesg or syslog.
The only error is the one reported by the Dell server hardware log:

A fatal error was detected on a component at bus 23 device 0 function 1.

Bus 23 in decimal is 0x17 in hex, which according to lspci is actually the NIC:

17:00.1 Ethernet controller: Broadcom Limited NetXtreme BCM5720 Gigabit Ethernet PCIe
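
The Dell message gives the bus, device and function numbers in decimal, while lspci prints them in hex as BB:DD.F. A minimal sketch of the conversion, assuming PCI domain 0000; the function names pci_bdf and show_pci_device are hypothetical helpers, not part of any tool:

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a decimal bus/device/function triple, as printed
# in the Dell hardware log, into the hex BB:DD.F notation used by lspci.
pci_bdf() {
    local bus=$1 dev=$2 fn=$3
    printf '%02x:%02x.%x\n' "$bus" "$dev" "$fn"
}

# Hypothetical wrapper: look the converted address up with lspci
# (assumes the device is in PCI domain 0000).
show_pci_device() {
    lspci -s "$(pci_bdf "$1" "$2" "$3")"
}
```

For example, "show_pci_device 23 0 1" should print the same "17:00.1 Ethernet controller" line shown above.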

The issue is very similar to this Red Hat Verified Solution:
https://access.redhat.com/solutions/26765

As advised by Dell Support, I have applied the latest firmware (20.6.58).
As I don't have a Red Hat support subscription, I am thinking about disabling the MSI-X feature.
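
Before disabling anything, it may help to confirm which interrupt mode the NIC function is actually using. The sketch below assumes the "lspci -vv" capability lines look like "MSI-X: Enable+ Count=5 Masked-" (lspci must run as root, otherwise the capabilities show as "<access denied>"); msi_mode is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `lspci -vv` output on stdin and report whether
# MSI-X, plain MSI, or neither (legacy INTx) is currently enabled.
msi_mode() {
    local out
    out=$(cat)
    if grep -q 'MSI-X: Enable+' <<<"$out"; then
        echo msix
    elif grep -q 'MSI: Enable+' <<<"$out"; then
        echo msi
    else
        echo legacy
    fi
}

# Usage (as root): lspci -vv -s 17:00.1 | msi_mode
#
# One way to turn MSI/MSI-X off is the kernel boot parameter pci=nomsi, but it
# is system-wide: it affects every PCI device (storage controllers included),
# not only the NIC. On RHEL 7 it could be added with, followed by a reboot:
#   grubby --update-kernel=ALL --args="pci=nomsi"
# Check "modinfo tg3" for any per-driver option before going that far.
```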

Let me know your thoughts.

Many thanks.

JJ.

Responses

You could enable debug mode for the network driver module and check the extra logs it produces to diagnose further. This article explains how to enable debugging: https://access.redhat.com/solutions/45950

The knowledgebase solution you've referenced is very old; that issue should have been resolved years ago. Disabling MSI for the card probably won't resolve it, but it's worth a try: the performance impact of disabling MSI on a 1 Gbps NIC on modern hardware probably won't be large.

Given that the hardware logs contain the bus error, and that the NIC apparently still responds to the driver (no watchdog messages are logged in the OS), I would look into BIOS/EFI updates or replacing the NIC or the system board.
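
To look at the same fatal error from the OS side, the PCIe Advanced Error Reporting (AER) status registers of the NIC function can be read with "lspci -vvv -s 17:00.1" as root. This sketch assumes that lspci output format (UESta/CESta lines made of Flag+ / Flag- tokens) and uses a hypothetical helper name, aer_errors. The status bits may already have been cleared by firmware or the driver after recovery, so an empty result does not rule the fault out. The iDRAC System Event Log can also be listed with "ipmitool sel elist" if ipmitool is installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `lspci -vvv` output on stdin and print every AER
# status flag that is set ('+' suffix) on the UESta and CESta lines.
aer_errors() {
    awk '
        /UESta:|CESta:/ {
            kind = ($0 ~ /UESta:/) ? "uncorrectable" : "correctable"
            for (i = 1; i <= NF; i++)
                if ($i ~ /\+$/)
                    print kind ": " substr($i, 1, length($i) - 1)
        }'
}

# Usage (as root): lspci -vvv -s 17:00.1 | aer_errors
```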

The NIC has already been replaced once (with the same model), but the issue remains. I will disable MSI and enable debug mode on the network module.

I am struggling to enable debug mode for the BCM5720 on RHEL 7 by following https://access.redhat.com/solutions/45950. Any suggestions?

There's no need to use the module option; this command should do the same thing:

ethtool -s ethX msglvl 7fff

That just turns on every message option. See man ethtool for the full syntax.

I went through the man page but found no indication of which value to pass. I ran "ethtool -s em4 msglvl 0x7fff", but no debug events appear in dmesg or syslog. The BCM5720 uses the tg3 driver. Any suggestions?

That value is all of the message levels added up.
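
The msglvl section of the ethtool man page lists 15 named message types, each a single bit from 0x0001 (drv) to 0x4000 (wol); OR-ing all of them gives 0x7fff. A small sketch of that arithmetic, with hypothetical helper names msglvl_mask and msglvl_all:

```shell
#!/usr/bin/env bash
# Message-level bits as listed in the ethtool(8) man page.
declare -A MSGLVL=(
    [drv]=0x0001 [probe]=0x0002 [link]=0x0004 [timer]=0x0008
    [ifdown]=0x0010 [ifup]=0x0020 [rx_err]=0x0040 [tx_err]=0x0080
    [tx_queued]=0x0100 [intr]=0x0200 [tx_done]=0x0400 [rx_status]=0x0800
    [pktdata]=0x1000 [hw]=0x2000 [wol]=0x4000
)

# Hypothetical helper: OR together the named message types, print the mask.
msglvl_mask() {
    local mask=0 name
    for name in "$@"; do
        [[ -n ${MSGLVL[$name]:-} ]] || { echo "unknown message type: $name" >&2; return 1; }
        mask=$(( mask | ${MSGLVL[$name]} ))
    done
    printf '0x%04x\n' "$mask"
}

# Mask with every message type enabled.
msglvl_all() {
    msglvl_mask "${!MSGLVL[@]}"
}
```

A narrower mask can then be passed the same way, e.g. "ethtool -s em4 msglvl $(msglvl_mask hw intr link rx_err tx_err)", or by name as "ethtool -s em4 msglvl hw on intr on". Running "ethtool em4" afterwards shows the "Current message level" so the change can be confirmed.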

You'll probably need to wait for the issue to occur again and see if anything is logged. However, the kernel has other methods to detect unresponsive hardware and none of those have triggered, so it seems unlikely you'll get any useful debugging messages.
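
Because the kernel's transmit watchdog (the "NETDEV WATCHDOG ... transmit queue timed out" message) only fires for stuck transmit queues, a receive-only stall can go unreported. A minimal sketch for timestamping the next occurrence by sampling the interface's rx_packets counter in sysfs; counter_stalled and watch_rx are hypothetical names, and it assumes the link normally sees some traffic (ARP, broadcasts) within each interval:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (return 0) if the counter in FILE has not
# increased over INTERVAL seconds, fail (return 1) otherwise.
counter_stalled() {
    local file=$1 interval=$2 before after
    before=$(<"$file")
    sleep "$interval"
    after=$(<"$file")
    (( after <= before ))
}

# Hypothetical watch loop: print a timestamped line whenever RX looks stalled.
watch_rx() {
    local iface=$1 interval=${2:-10}
    local stats=/sys/class/net/$iface/statistics/rx_packets
    while true; do
        if counter_stalled "$stats" "$interval"; then
            echo "$(date '+%F %T') $iface: rx_packets did not increase in ${interval}s"
        fi
    done
}
```

For example, "watch_rx em4 10" can run in a screen session; when it reports a stall, "ethtool -S em4" and "dmesg" can be captured at that moment. Note that the default RHEL 7 rsyslog rules only write messages of priority info and above to /var/log/messages, so debug-level driver messages, if any, are more likely to appear in "dmesg" or "journalctl -k".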

I think it's more likely this is caused by something outside the OS: firmware that needs a BIOS/EFI update, or faulty hardware such as the system board, power supply, CPU, or memory.
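
When trying BIOS/EFI or NIC firmware updates, it helps to record what is actually running before and after each change, and to confirm that the interface being debugged really sits at 0000:17:00.1. A sketch that reads "ethtool -i" (driver, firmware-version and bus-info fields) and "dmidecode -s bios-version"; the helper names are hypothetical, and the exact firmware string format varies by NIC:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one field (e.g. firmware-version, bus-info)
# from `ethtool -i` output read on stdin.
ethtool_info_field() {
    local field=$1
    awk -v f="$field" -F': ' '$1 == f { print $2; exit }'
}

# Hypothetical report of NIC driver/firmware and system BIOS version (as root).
version_report() {
    local iface=$1 info
    info=$(ethtool -i "$iface")
    echo "driver:   $(ethtool_info_field driver <<<"$info")"
    echo "firmware: $(ethtool_info_field firmware-version <<<"$info")"
    echo "bus-info: $(ethtool_info_field bus-info <<<"$info")"
    echo "BIOS:     $(dmidecode -s bios-version)"
}

# Usage (as root): version_report em4
```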
