Serious Networking Weirdness

Yesterday, I patched a RHEL 5 host up from 5.6 to 5.9. After doing so, my networking went wonky.

System:
* HP DL380G6
* on-board Broadcom NetXtreme II quad-port 1Gbps interface
* add-on Mellanox ConnectX dual-port 10Gbps interface
* Two bonded interfaces:
  * bond0: Asymmetrical 10Gbps/1Gbps active/passive pair (working just fine)
  * bond1: Asymmetrical 10Gbps/1Gbps active/passive pair (not quite working)

I'm reasonably sure it's not strictly a driver issue, as both bonds have the same composition (one port off the Mellanox card as the primary link; one port off the Broadcom as the standby link) and one of the bonds works. I've used ifenslave on both bonds to force/change the active link and verify each NIC's functionality (or lack thereof).
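For anyone following along, the active-link swapping I did looks roughly like this (a sketch; the bond/interface names are the ones from this post, and the sysfs path assumes the standard Linux bonding driver):

```shell
# Show which slave is currently carrying traffic for bond1:
cat /sys/class/net/bond1/bonding/active_slave

# Force the standby (Broadcom) port to become the active link:
ifenslave -c bond1 eth3

# ...and switch back to the Mellanox port:
ifenslave -c bond1 eth1
```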

When I attempt to ping out from bond1 to its local LAN segment, I get ICMP "host unreachable" errors. This happens for every target I've tried (its application-partner, a NAS and the segment's default gateway device). When I look at the ARP table entries associated with that interface, it shows the IPs for the pinged systems, but shows the associated MAC entries as "(incomplete)".
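In case it helps anyone reproduce this, the ARP state can be inspected per-interface (interface names here are the ones from this post):

```shell
# List ARP entries learned via bond1; an entry whose MAC column shows
# "(incomplete)" means ARP requests went out but no reply was accepted:
arp -n -i bond1

# Equivalent via iproute, which also shows the neighbor state
# (e.g. INCOMPLETE or FAILED):
ip neigh show dev bond1
```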

I tried breaking bond1 back down to its constituent interfaces (specifically, the Mellanox 10Gbps interface at "eth1"). The ping and ARP results were the same.

I finally installed tcpdump on both application partners. I fired it up on the problematic host (against eth1) and then pinged from the partner. Inbound pings, which had also previously been failing, started getting responses. The afflicted host pinged just fine, right up until I stopped tcpdump against eth1. Weird.

I decided that eth1 was probably (sorta) good, so I recomposed bond1 from the Mellanox (eth1) and Broadcom (eth3) ports it had previously been built from. Then I experimented:
* Started ping on the partner system - got the expected failures.
* Started tcpdump on the bond - pings started working. Stopped tcpdump - pings failed again.
* Changed my tcpdump to reference eth1 - pings started working again.
* Changed my tcpdump to reference eth3 - pings continued to fail.
* Used ifenslave to make eth3 the active link and started tcpdump against eth3 - pings started working.
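One way to confirm whether it's tcpdump's promiscuous-mode side effect (rather than the capture itself) that makes things work: tcpdump's -p flag tells it NOT to put the interface into promiscuous mode. A sketch, with the interface names from this post:

```shell
# Default behavior: tcpdump enables promiscuous mode on the interface
# for the duration of the capture.
tcpdump -i eth1 icmp

# With -p, tcpdump captures WITHOUT promiscuous mode. If pings fail
# while this one is running, promiscuous mode itself is the "fix":
tcpdump -p -i eth1 icmp
```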

Concurrent with the tcpdumps of bond1/eth1/eth3, I watched my ARP tables. While tcpdump was active and the partner system was able to ping the wonky host, my ARP table entries looked normal. Within a couple seconds of turning tcpdump off, the ARP table entries would again change to "(incomplete)".

I feel like I'm really close to figuring out what's wrong, but need a final push. If anyone here has any suggestions to get me over the hump, it'd be greatly appreciated.

[EDIT]
As a temporary workaround, I ended up turning off tcpdump and doing an ifconfig bond1 promisc. Obviously I can't leave things this way, as the security folks will have a fit if they ever scan the box.
[/EDIT]
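For completeness, here's how I'm verifying (and will eventually back out) the workaround - nothing exotic, just the standard tools:

```shell
# Confirm the flag is set (look for PROMISC in the flags field):
ip link show bond1

# The kernel logs promiscuous-mode transitions, which helps correlate
# them with the ARP behavior:
dmesg | grep -i promisc

# To back the workaround out later:
ifconfig bond1 -promisc
```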

Thanks in advance.
