Serious Networking Weirdness
Yesterday, I patched a RHEL 5 host up from 5.6 to 5.9. After doing so, my networking went wonky.
System:
* HP DL380G6
* on-board Broadcom NetXtreme II quad-port 1Gbps interface
* add-on Mellanox ConnectX dual-port 10Gbps interface
* Two bonded interfaces:
  * bond0: Asymmetrical 10Gbps/1Gbps Active/Passive pair (working just fine)
  * bond1: Asymmetrical 10Gbps/1Gbps Active/Passive pair (not quite working)
I'm reasonably sure it's not strictly a driver issue, as both bonds have the same composition (one port off the Mellanox card as the primary link; one port off the Broadcom as the standby link) and one of the bonds works. I've used ifenslave on both bonds to force a change of the active link and verify each NIC's functionality (or lack thereof).
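For anyone retracing the failover checks, here is a minimal sketch of reading back which slave the bonding driver currently treats as active. The function name `active_slave` is made up for illustration; it only assumes the usual "Currently Active Slave:" line in the bonding status file.

```shell
#!/bin/bash
# Print the active slave named in bonding status text read from stdin.
# Hypothetical helper; typical use: active_slave < /proc/net/bonding/bond1
active_slave() {
    awk -F': ' '/^Currently Active Slave/ { print $2 }'
}
```

Running it before and after something like `ifenslave -c bond1 eth3` should confirm whether the active link actually moved.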
When I attempt to ping out from bond1 to its local LAN segment, I get ICMP "host unreachable" errors. This happens for selected targets (its application-partner, a NAS and the segment's default gateway device). When I look at the ARP table entries associated with that interface, it shows the IPs for the pinged systems, but shows the associated MAC entries as "
I tried breaking bond1 back down to its constituent interfaces (specifically, the Mellanox 10Gbps interface at "eth1"). The ping and ARP results were the same.
I finally installed tcpdump on both application partners. I fired it up on the problematic host (against eth1) and then pinged from the partner. Inbound ping, which had also not been working, started to respond. The afflicted host answered pings just fine, right up until I stopped tcpdump against eth1. Weird.
I decided that eth1 was probably (sorta) good, so I recomposed bond1 from the Mellanox (eth1) and the Broadcom (eth3) ports that it had previously been composed of. I then started ping on the partner system, getting the expected failures. I then started tcpdump on the bond and pings started working. I stopped tcpdump and the pings again failed. I changed my tcpdump to reference eth1 and pings started working again. I changed my tcpdump to reference eth3 and pings continued to fail. I used ifenslave to make eth3 the active link and started the tcpdump against eth3, and pings started working.
Concurrent to the tcpdumps of bond1/eth1/eth3, I looked at my ARP tables. While tcpdump was active and the partner system was able to ping the wonky host, my ARP table entries looked normal. Within a couple seconds of turning tcpdump off, the ARP table entries would again change to "
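A rough sketch for watching this from a script rather than by eye, assuming the broken entries are ones the kernel has not resolved to a MAC address (flags value 0x0 in /proc/net/arp). The helper name `unresolved_arp` is hypothetical.

```shell
#!/bin/bash
# List IPs on one device whose ARP entry has no resolved MAC address.
# Reads /proc/net/arp-format text on stdin; $1 is the device name.
# Hypothetical helper; assumes flags 0x0 marks an unresolved entry.
unresolved_arp() {
    awk -v dev="$1" 'NR > 1 && $3 == "0x0" && $6 == dev { print $1 }'
}
```

Calling `unresolved_arp bond1 < /proc/net/arp` once with tcpdump running and once after stopping it should show the affected addresses appear in the second run.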
I feel like I'm really close to figuring out what's wrong, but need a final push. If anyone here has any suggestions to get me over the hump, it'd be greatly appreciated.
[EDIT]
As a temporary workaround, I ended up turning off tcpdump and doing an ifconfig bond1 promisc. Obviously can't leave things this way as the security folks will have a fit if they ever scan the box.
[/EDIT]
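Since tcpdump puts an interface into promiscuous mode by default, the capture and the workaround are probably doing the same thing. A minimal sketch for checking the flag directly, assuming the hex value in /sys/class/net/<interface>/flags reflects it (IFF_PROMISC is 0x100); `is_promisc` is a made-up helper name.

```shell
#!/bin/bash
# Succeed if the hex interface flags value in $1 has IFF_PROMISC (0x100) set.
# Hypothetical helper; typical use:
#   is_promisc "$(cat /sys/class/net/bond1/flags)" && echo "bond1 is promiscuous"
is_promisc() {
    [ $(( $1 & 0x100 )) -ne 0 ]
}
```

As a cross-check, `tcpdump -p -i eth1` captures without enabling promiscuous mode. If pings still fail while that capture runs, the promiscuous flag rather than tcpdump itself is what makes the difference.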
Thanks in advance.
Responses
Hello.
The issue you described seems to be related to name resolution. That is because bond0 is the interface you use to reach the default gateway, and the ping command performs name resolution even if you give it just the host's IP address.
You can use ping -n and ping -U with -I.
-n Numeric output only. No attempt will be made to lookup symbolic names for host addresses.
-U Print full user-to-user latency (the old behaviour). Normally ping prints network round trip
   time, which can be different f.e. due to DNS failures.
-I interface address
   Set source address to specified interface address. Argument may be numeric IP address or
   name of device. When pinging IPv6 link-local address this option is required.
Example:
# ping -U 192.168.122.1 -I eth0
PING 192.168.122.1 (192.168.122.1) from 192.168.122.72 eth0: 56(84) bytes of data.
64 bytes from 192.168.122.1: icmp_seq=1 ttl=64 time=0.459 ms
64 bytes from 192.168.122.1: icmp_seq=2 ttl=64 time=0.545 ms
64 bytes from 192.168.122.1: icmp_seq=3 ttl=64 time=0.402 ms
Leaving bond1 in promiscuous mode is indeed not recommended.
If you sniff all four slaves (those of bond0 and bond1), tcpdump may show you which interfaces are actually being used.
Please let me know if there is anything else I can do to assist you with this issue. I will be glad to help.
Thanks!
