The LACP bond over two interfaces is not stable; ping shows about 20% loss
Hello,
I am trying to connect a RHEL 7.6 server to a Juniper EX4300 switch in LACP mode, i.e. bond mode = 4 / 802.3ad.
The switch has the following configuration:
show configuration interfaces ae10
description DESC;
aggregated-ether-options {
    lacp {
        active;
        periodic slow;
    }
}
unit 0 {
    family ethernet-switching {
        interface-mode trunk;
        vlan {
            members [ OOB_MGNT_DCE_RHEW OOB_MGNT_DCE_Monitoring OOB_MGNT_DCE_HW ];
        }
    }
}
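(For reference, the switch-side view of the negotiation can be read on the Junos CLI; ae10 is the aggregate name from the configuration above:
show lacp interfaces ae10
show lacp statistics interfaces ae10
)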
On the server side, this configuration:
# cat /etc/sysconfig/network-scripts/ifcfg-bond0_slave_1
HWADDR=AA:BB:CC:DD:EE:FF
TYPE=Ethernet
NAME="bond0 slave 1"
#UUID=488e980b-2145-4b37-a151-15be8cf2dee0
DEVICE=ens3f0
ONBOOT=yes
MASTER=bond0
SLAVE=yes
ZONE=public
UUID=1c4a0d15-e00d-a2fa-60ef-a83a0f504e79
# cat /etc/sysconfig/network-scripts/ifcfg-bond0_slave_2
HWADDR=00:11:22:33:44:55
TYPE=Ethernet
NAME="bond0 slave 2"
#UUID=504f0fb2-211e-4b38-8fe6-24a7b63dc1ef
DEVICE=ens3f1
ONBOOT=yes
MASTER=bond0
SLAVE=yes
ZONE=public
UUID=9df58229-b454-00ec-03ba-78099296b37f
# cat /etc/sysconfig/network-scripts/ifcfg-bond0
# Generated by VDSM version 4.30.17.1.git0de043f
DEVICE=bond0
BONDING_OPTS="downdelay=0 lacp_rate=fast miimon=1 mode=802.3ad updelay=0"
MACADDR=AA:BB:CC:DD:EE:FF
ONBOOT=yes
MTU=1500
DEFROUTE=no
IPV6INIT=no
TYPE=Bond
BONDING_MASTER=yes
PROXY_METHOD=none
BROWSER_ONLY=no
NAME="Bond bond0"
UUID=ad33d8b0-1f7b-cab9-9447-ba07f855b143
# cat /etc/sysconfig/network-scripts/ifcfg-bond0.343
# Generated by VDSM version 4.30.17.1.git0de043f
DEVICE=bond0.343
VLAN=yes
BRIDGE=ovirtmgmt
ONBOOT=yes
MTU=1500
DEFROUTE=no
NM_CONTROLLED=yes
TYPE=Vlan
PHYSDEV=bond0
VLAN_ID=343
REORDER_HDR=yes
GVRP=no
MVRP=no
NAME="Vlan bond0.343"
UUID=183cec78-9fe0-8701-58de-59bc537df364
# cat /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
# Generated by VDSM version 4.30.17.1.git0de043f
DEVICE=ovirtmgmt
TYPE=Bridge
STP=no
ONBOOT=yes
IPADDR=10.x.x.19
NETMASK=255.255.255.192
GATEWAY=10.x.x.1
BOOTPROTO=none
MTU=1500
DEFROUTE=yes
NM_CONTROLLED=yes
IPV6INIT=no
DNS1=10.x.x.11
PROXY_METHOD=none
BROWSER_ONLY=no
PREFIX=26
IPV4_FAILURE_FATAL=no
NAME="Bridge ovirtmgmt"
UUID=9a0b07c0-2983-fe97-ec7f-ad2b51c3a3f0
Bond looks healthy, but ping responses have about 20% loss.
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 1
Up Delay (ms): 0
Down Delay (ms): 0
802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: aa:bb:cc:dd:ee:ff
Active Aggregator Info:
Aggregator ID: 20
Number of ports: 2
Actor Key: 9
Partner Key: 11
Partner Mac Address: aa:aa:aa:aa:aa:aa
Slave Interface: ens3f0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: aa:bb:cc:dd:ee:ff
Slave queue ID: 0
Aggregator ID: 20
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: aa:bb:cc:dd:ee:ff
port key: 9
port priority: 255
port number: 1
port state: 63
details partner lacp pdu:
system priority: 127
system mac address: aa:aa:aa:aa:aa:aa
oper key: 11
port priority: 127
port number: 22
port state: 63
Slave Interface: ens3f1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:11:22:33:44:55
Slave queue ID: 0
Aggregator ID: 20
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: aa:bb:cc:dd:ee:ff
port key: 9
port priority: 255
port number: 2
port state: 63
details partner lacp pdu:
system priority: 127
system mac address: aa:aa:aa:aa:aa:aa
oper key: 11
port priority: 127
port number: 23
port state: 63
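(For what it's worth, port state 63 in the output above decodes as all six LACP state flags set: Activity 1 + Timeout 2 + Aggregation 4 + Synchronization 8 + Collecting 16 + Distributing 32 = 63, i.e. both actor and partner consider each link fully bundled and forwarding.)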
I thought it would work with the defaults, without needing to try different bonding parameters.
Please help me find the mistake.
Thank you and best regards, Jiří Kameník
Responses
If you look in /proc/net/bonding/bond0, does it now say "LACP rate: slow", confirming the setting is applied?
If not, you will have to apply the new config file with nmcli con reload and then bounce the bond with nmcli con down bond0 and nmcli con up bond0 (if using NetworkManager), or with ifdown bond0; ifup bond0 (if using the network initscript).
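A minimal sequence for the NetworkManager case, assuming the connection is named bond0 (substitute the name shown by nmcli con show if it differs):
nmcli con reload
nmcli con down bond0 && nmcli con up bond0
grep "LACP rate" /proc/net/bonding/bond0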
If you can confirm the bond is running in slow mode, then I'd troubleshoot the physical layer.
Change cables. Try each bond slave individually. Try each switchport individually. If the Juniper aggregate spans more than one chassis (MC-LAG), try just one chassis.
You could packet capture on each bond slave at the same time (tcpdump -n -i ens3f0 -w /tmp/$(hostname)-$(date +"%Y-%m-%d-%H-%M-%S")_slave1.pcap and tcpdump -n -i ens3f1 -w /tmp/$(hostname)-$(date +"%Y-%m-%d-%H-%M-%S")_slave2.pcap) and on each switchport at the same time (I presume the switch has a way to do this), and see where the ICMP traffic goes missing.
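If the captures get large, the same commands can be limited to the ping traffic by appending a capture filter, for example (the output file names here are just placeholders):
tcpdump -n -i ens3f0 -w /tmp/slave1-icmp.pcap icmp
tcpdump -n -i ens3f1 -w /tmp/slave2-icmp.pcap icmp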
Keep in mind that a lack of ping replies doesn't necessarily mean there's a problem. The device you're pinging could have strict ICMP rate limiting or just be too busy to respond to all those pings.
You could also check for packet loss on the RHEL NICs; the easiest way is probably the xsos tool: run xsos --net.
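If xsos is not available, the drop and error counters can also be read directly from the kernel and the NIC driver, for example:
ip -s link show ens3f0
ip -s link show ens3f1
ethtool -S ens3f0 | grep -iE 'drop|err'
(the grep pattern is only an illustration; the statistic names vary by driver).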
You're very welcome to open a support case if you'd like our help looking at all that data.
If you filter those on icmp in Wireshark, you can see there are several pings where the remote end never replies. You can merge the two slave captures with mergecap to make that analysis easier.
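For example (the file names are placeholders standing in for the capture files above):
mergecap -w merged.pcap slave1.pcap slave2.pcap
tshark -r merged.pcap -Y icmp
with tshark being the command-line counterpart to the Wireshark filter mentioned above.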
So the problem is on the switch, or on the network between the switch and the ping target, or on the ping target itself. The cause could be something "bad" like packet loss, or something "harmless" like ICMP rate limiting.
btw you might not want to attach your organization's binary packet captures here where everyone can access them. I've been sanitising hostnames, MACs, and IPs out of your messages above too.
Yes, you had that one uploaded earlier. It also shows the ping target not replying.
If RHEL never receives the ping replies, then we will report ping packet loss.
As you say, that packet loss seems to also apply to actual traffic like SSH and other TCP.
So I think you have quantified that, given the following network:
System (a) --- (b) switch (c) --- (d) ping target (default gateway)
traffic leaves (a) and reaches (b) successfully, but replies are not delivered back from (b) to (a).
The loss appears to be either within the switch that owns (b) and (c), or on the network between (c) and (d), or on (d) itself.
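One way to narrow that down further (a sketch in Junos terms; the member interface names are not shown in the post, so ge-x/y/z is a placeholder) is to watch the error and drop counters on the aggregate and its member ports while the ping test runs:
show interfaces ae10 extensive | match error
show interfaces ge-x/y/z extensive | match error
Rising counters would point at the switch side; clean counters would push the suspicion toward the network behind it or the target itself.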
Hi,
I had the same issue. When I suspected loss of ping traffic, I checked the switch ports from the console. Half of the ports were configured as dynamic, and half were in a static configuration.
A static configuration sometimes doesn't work with LACP. Everything went back to normal after re-configuring the static ports to dynamic.
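In Junos terms the difference is whether the aggregate actually runs LACP or is a static LAG; a sketch, assuming the same ae10 aggregate as in the original post:
set interfaces ae10 aggregated-ether-options lacp active      (dynamic: exchanges LACPDUs with the bond)
delete interfaces ae10 aggregated-ether-options lacp          (static LAG: no LACPDUs, will not negotiate with mode 802.3ad)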
I would also recommend checking SSH connections between the servers through your LACP switch. See if you can successfully connect directly from one server to another via SSH. 1) For some reason (I haven't figured that part out yet) I can jump via SSH from one server to another directly without LACP (I mean when I remove one cable from each server). 2) But when I connect the second cable to the switch, I can't do that; I have to launch an individual CMD/CLI session to SSH to each server. The reason is something in the switch settings or routes. It can't be the firewall, because it is an air-gapped DC and all firewalls are disabled.
best regards