Bonded interface is unreachable after reboot but becomes reachable shortly afterward

Solution Verified - Updated -

Environment

  • Red Hat Enterprise Linux 8

    • Specifically with NetworkManager versions earlier than NetworkManager-1.30.0-7.el8
  • Network interfaces in an active-backup bond configuration

Issue

  • The network bond sends an incorrect MAC address in its gratuitous ARP (gARP) on boot.
  • When the system boots, it sends out a bad MAC address in the gARP. The correct MAC address is shown in ip link and in the bond status, so the problem appears to occur at some point during boot.
  • The bond first comes up with a random MAC address and assigns that address to the first sub-interface, rather than taking on the MAC address of the first sub-interface.

Resolution

  • Update NetworkManager to NetworkManager-1.30.0-7.el8 or later, as per erratum RHSA-2021:1574
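
To decide whether the update applies to a given host, the installed build can be compared against the fixed one. The following is a minimal sketch, not an official check: nm_is_affected and FIXED_NM are hypothetical names introduced here, and sort -V (GNU coreutils) only approximates RPM's own version comparison, so treat the result as a hint.

```shell
#!/bin/sh
# Fixed build named in this article.
FIXED_NM="1.30.0-7.el8"

# Hypothetical helper: exit 0 when the given VERSION-RELEASE string sorts
# strictly before the fixed build, exit 1 otherwise.
# Assumption: "sort -V" ordering is close enough to RPM ordering here.
nm_is_affected() {
    [ "$1" = "$FIXED_NM" ] && return 1
    lowest=$(printf '%s\n%s\n' "$1" "$FIXED_NM" | sort -V | head -n 1)
    [ "$lowest" = "$1" ]
}

# Query the RPM database only when rpm exists and the package is installed.
if command -v rpm >/dev/null 2>&1 &&
   installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' NetworkManager 2>/dev/null); then
    if nm_is_affected "$installed"; then
        echo "NetworkManager-$installed predates NetworkManager-$FIXED_NM: update recommended"
    else
        echo "NetworkManager-$installed is at or above NetworkManager-$FIXED_NM"
    fi
fi
```

The helper can also be called directly with a version string taken from another source, for example nm_is_affected "1.26.0-12.el8".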

Workaround

  • Set cloned-mac-address on the bond interface:

    nmcli connection modify "bond0" "802-3-ethernet.cloned-mac-address" <MAC ADDR>
    dracut -f    <--- required for the change to be applied on boot as well.
    
    • Example of setting cloned-mac-address on a bond with a VLAN interface on top of it:

      nmcli connection modify "bond0" "802-3-ethernet.cloned-mac-address" 11:22:33:44:55:66
      nmcli connection modify "bond0.1234" "802-3-ethernet.cloned-mac-address" 11:22:33:44:55:66 <--- applies to VLAN interface
      dracut -f
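
Because a mistyped address would be persisted and rebuilt into the initramfs, it may be worth validating the MAC address before applying the workaround. The sketch below wraps the same nmcli and dracut commands shown above. The function names is_valid_mac and set_bond_cloned_mac are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 only for six colon-separated hex octets.
is_valid_mac() {
    printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}

# Hypothetical wrapper around the workaround commands from this article.
# $1: NetworkManager connection name (for example bond0)
# $2: MAC address to pin on that connection
set_bond_cloned_mac() {
    conn=$1
    mac=$2
    if ! is_valid_mac "$mac"; then
        echo "refusing to set malformed MAC address: $mac" >&2
        return 1
    fi
    # Persist the address, then rebuild the initramfs so that the change
    # is also applied on boot.
    nmcli connection modify "$conn" 802-3-ethernet.cloned-mac-address "$mac" &&
        dracut -f
}

# Example invocation (run as root on the affected system):
#   set_bond_cloned_mac bond0 11:22:33:44:55:66
```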
      

Root Cause

NetworkManager attempted to restore a prior MAC address when a bond configuration changed and ended up assigning the random MAC address given to the bond on creation to the bond's sub-interfaces. In detail:

  • On bond creation, the empty bond is assigned a random MAC address by the kernel and NetworkManager remembers this.
  • The default behavior of bonds when a sub-interface is added to the bond is for the bond to use the sub-interface's MAC address as its own MAC address.
  • When fail_over_mac is used with the bond configuration, this changes the default MAC address assignment behavior. With fail_over_mac set (e.g. fail_over_mac=1 or fail_over_mac=follow) on an active-backup bond, the kernel still reports network link connectivity changes (i.e. whether the bond is connected to the network or not) but does not update the bond's MAC address.
  • NetworkManager recognized the random MAC address assigned on bond creation as the legitimate MAC address. When a sub-interface was added to the bond, NetworkManager assigned the random MAC address to the sub-interface as well.
  • The expected behavior is for the bond to pick up the first active sub-interface's MAC address. NetworkManager watches for the kernel to update the MAC address, but the kernel does not under the aforementioned scenario, so the bond and its sub-interfaces are not assigned the correct MAC addresses and are instead assigned the random address provided to the initially empty bond.
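
One way to observe the outcome described above is to compare the bond's current MAC address with the permanent hardware addresses of its sub-interfaces, which the bonding driver lists in /proc/net/bonding/<bond>. This is a rough sketch under two assumptions: the bond is named bond0, and no cloned-mac-address has been set deliberately (a pinned address would also differ from the permanent ones). The function name bond_mac_is_permanent is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 when the given MAC address appears as a
# "Permanent HW addr:" entry in a bonding status file, exit 1 otherwise.
# $1: MAC address currently used by the bond
# $2: path to the bonding status file
bond_mac_is_permanent() {
    awk -v mac="$1" '
        /Permanent HW addr:/ && tolower($4) == tolower(mac) { found = 1 }
        END { exit !found }
    ' "$2"
}

bond=bond0   # assumption: adjust to the real bond name
if [ -r "/proc/net/bonding/$bond" ]; then
    mac=$(cat "/sys/class/net/$bond/address")
    if bond_mac_is_permanent "$mac" "/proc/net/bonding/$bond"; then
        echo "$bond uses $mac, which belongs to one of its sub-interfaces"
    else
        echo "$bond uses $mac, which matches no sub-interface permanent address"
    fi
fi
```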

Diagnostic Steps

Steps to reproduce

  1. On a system (physical or virtual) with two or more network adapters, create an active-backup bond interface and add two or more of the network adapters as sub-interfaces to the bond.
  2. Review the MAC addresses of the devices. Below is an example:

    # ip link 
    [...]
    2: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond state UP group default qlen 1000
        link/ether 11:11:11:11:11:11 brd ff:ff:ff:ff:ff:ff permaddr 55:55:55:55:55:55
    3: bond: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
        link/ether 11:11:11:11:11:11 brd ff:ff:ff:ff:ff:ff
    4: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond state UP group default qlen 1000
        link/ether 22:22:22:22:22:22 brd ff:ff:ff:ff:ff:ff
    
  3. Reboot and review the MAC addresses:

    # ip link 
    [...]
    2: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond state UP group default qlen 1000
        link/ether 55:55:55:55:55:55 brd ff:ff:ff:ff:ff:ff
    3: bond: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
        link/ether 55:55:55:55:55:55 brd ff:ff:ff:ff:ff:ff
    4: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond state UP group default qlen 1000
        link/ether 22:22:22:22:22:22 brd ff:ff:ff:ff:ff:ff
    
  4. The MAC addresses should then be retained across further reboots:

    # ip link 
    [...]
    2: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond state UP group default qlen 1000
        link/ether 55:55:55:55:55:55 brd ff:ff:ff:ff:ff:ff
    3: bond: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
        link/ether 55:55:55:55:55:55 brd ff:ff:ff:ff:ff:ff
    4: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond state UP group default qlen 1000
        link/ether 22:22:22:22:22:22 brd ff:ff:ff:ff:ff:ff
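
Comparing these listings by eye across reboots is error-prone, so the interface-to-MAC mapping can be reduced to one line per device and diffed. The sketch below parses the single-line output of ip -o link show. The function name mac_table and the snapshot path under /var/tmp are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: read "ip -o link show" output on stdin and print
# "<interface> <MAC>" for every Ethernet-type device.
mac_table() {
    awk '{
        name = $2
        sub(/:$/, "", name)     # drop the trailing colon after the name
        sub(/@.*/, "", name)    # drop "@parent" on VLAN interfaces
        for (i = 3; i < NF; i++)
            if ($i == "link/ether") { print name, $(i + 1); break }
    }'
}

# Example usage around a reboot (paths are placeholders):
#   ip -o link show | mac_table > /var/tmp/macs.before
#   ... reboot ...
#   ip -o link show | mac_table | diff /var/tmp/macs.before -
```

Any line reported by diff identifies an interface whose MAC address changed across the reboot.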
    

What to look for

For the purposes of this example, the desired MAC addresses are 55:55:55:55:55:55 for eth0 and 55:55:55:55:55:56 for eth1.

  • Review dmesg, rsyslog logs (/var/log/messages), or system journal logs (journalctl -k) for instances of an incorrect MAC address being assigned to the bond on boot:

    $ grep -ir 'set new mac address' -B 3 var/log/messages
    Sep  1 21:09:01 hostname kernel: device eth0 left promiscuous mode
    Sep  1 21:09:01 hostname kernel: bond0: (slave eth0): making interface the new active one
    Sep  1 21:09:01 hostname kernel: device eth1 entered promiscuous mode
    Sep  1 21:09:01 hostname kernel: i40e 0000:19:00.0 eth0: set new mac address 55:55:55:55:55:55
    --
    Sep  1 21:38:24 hostname kernel: 8021q: adding VLAN 0 to HW filter on device eth0
    Sep  1 21:38:24 hostname kernel: bond0: (slave eth0): making interface the new active one       <---
    Sep  1 21:38:24 hostname kernel: device eth0 entered promiscuous mode
    Sep  1 21:38:24 hostname kernel: i40e 0000:19:00.0 eth0: set new mac address de:ad:00:00:be:ef  <---
    --
    Sep  1 21:40:27 hostname kernel: device eth0 left promiscuous mode
    Sep  1 21:40:27 hostname kernel: bond0: (slave eth1): making interface the new active one       <---
    Sep  1 21:40:27 hostname kernel: device eth1 entered promiscuous mode
    Sep  1 21:40:27 hostname kernel: i40e 0000:19:00.1 eth1: set new mac address ab:12:ab:12:ab:12  <---
    --
    Sep  1 21:40:30 hostname kernel: 8021q: adding VLAN 0 to HW filter on device eth0
    Sep  1 21:40:30 hostname kernel: bond0: (slave eth0): making interface the new active one       <---
    Sep  1 21:40:30 hostname kernel: device eth0 entered promiscuous mode
    Sep  1 21:40:30 hostname kernel: i40e 0000:19:00.0 eth0: set new mac address so:me:ra:nd:om     <---
    --
    Sep  2 01:20:53 hostname kernel: bond0: (slave eth1): making interface the new active one       <---
    Sep  2 01:20:53 hostname kernel: device eth0 left promiscuous mode
    Sep  2 01:20:53 hostname kernel: device eth1 entered promiscuous mode
    Sep  2 01:20:53 hostname kernel: i40e 0000:19:00.1 eth1: set new mac address 12:34:56:78:90:10  <---
    Sep  2 01:20:53 hostname kernel: i40e 0000:19:00.0 eth0: set new mac address 55:55:55:55:55:55
    --
    Sep  2 01:25:51 hostname kernel: 8021q: adding VLAN 0 to HW filter on device eth1
    Sep  2 01:25:51 hostname kernel: bond0: (slave eth1): making interface the new active one
    Sep  2 01:25:51 hostname kernel: device eth1 entered promiscuous mode
    Sep  2 01:25:51 hostname kernel: i40e 0000:19:00.1 eth1: set new mac address 55:55:55:55:55:56
    
    • In the above output, the first active sub-interface comes online and joins the bond but is assigned the random MAC address upon joining.
  • Review packet captures of the system as it comes online, using your preferred packet capture and analysis tool (tcpdump, Cisco EPC, etc.), and check the MAC address of the bond interface within the gratuitous ARP (gARP) sent from the problem system on boot. If the sender MAC address in the gARP packet does not match the origin MAC address as determined by the packet analysis tool, then this issue may be occurring.

  • Review the ARP table in the switch. If the MAC address of the bond interface does not match the MAC address in the ARP table while the problem system is unreachable, the issue may be occurring.
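
The log review in the first item above can be partly automated by printing only those "set new mac address" messages whose address is not one of the expected ones. This is a minimal sketch: the function name unexpected_mac_changes is hypothetical, and it assumes unannotated kernel log lines in which the MAC address is the last field.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print each
# "set new mac address" message whose address is not among the arguments.
# $@: MAC addresses that are expected on the sub-interfaces
unexpected_mac_changes() {
    awk -v expected="$*" '
        BEGIN {
            n = split(tolower(expected), list, " ")
            for (i = 1; i <= n; i++) ok[list[i]] = 1
        }
        /set new mac address/ {
            mac = tolower($NF)          # assumption: MAC is the last field
            if (!(mac in ok)) print
        }
    '
}

# Example usage with the addresses from this article:
#   journalctl -k | unexpected_mac_changes 55:55:55:55:55:55 55:55:55:55:55:56
#   unexpected_mac_changes 55:55:55:55:55:55 55:55:55:55:55:56 < /var/log/messages
```

No output means every logged MAC address change used an expected address.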

