Red Hat OpenShift Container Platform 4.6.25 update to 4.7.7 is failing when running on vSphere


Environment

  • Red Hat OpenShift Container Platform (OCP) 4.6 and 4.7
  • Red Hat Enterprise Linux 8.4

Issue

Resolution

OpenShift Container Platform

  • This issue has been resolved in Red Hat OpenShift Container Platform 4.7.11 via RHBA-2021:1550 for clusters installed on vSphere. For questions or further information, contact Red Hat Technical Support.
  • If updating to Red Hat OpenShift Container Platform 4.7.11 or later is not possible, one of the workarounds below can be applied.

Red Hat Enterprise Linux 8

  • The issue has been resolved with kernel-4.18.0-348.el8 in Red Hat Enterprise Linux 8.5 GA via RHSA-2021:4356.
  • The issue was tracked in private bug 1941714.

Red Hat Enterprise Linux 8.4.z

  • The issue has been resolved with kernel-4.18.0-305.7.1.el8_4 in Red Hat Enterprise Linux 8.4.z via RHSA-2021:2570.
  • The issue was tracked in private bug 1960702.

Workaround 1

  • If the update from Red Hat OpenShift Container Platform 4.6.25 to 4.7.7 is stuck, running the following commands on all Red Hat OpenShift Container Platform - Node(s) should resolve the problem.

    • ethtool -K <primary-interface> tx-udp_tnl-segmentation off
    • ethtool -K <primary-interface> tx-udp_tnl-csum-segmentation off
  • Important: This workaround is not persistent and is lost at the next reboot of the Red Hat OpenShift Container Platform - Node(s). It is therefore recommended only for unblocking an upgrade that is stuck. For planned upgrades, use workaround 2.
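
To confirm that workaround 1 took effect on a node, the offload state reported by `ethtool -k` can be checked. The following is a minimal sketch, not part of the original solution: the function name `tnl_offloads_disabled` and the interface name `ens192` are assumptions for illustration, and the parsing assumes the usual `feature: on|off` layout of `ethtool -k` output.

```shell
#!/bin/bash
# Reads `ethtool -k <interface>` output on stdin.
# Succeeds only if both UDP tunnel segmentation offloads are listed and off.
tnl_offloads_disabled() {
  awk '
    $1 == "tx-udp_tnl-segmentation:"      { seen++; if ($2 != "off") bad = 1 }
    $1 == "tx-udp_tnl-csum-segmentation:" { seen++; if ($2 != "off") bad = 1 }
    END { if (seen == 2 && !bad) exit 0; exit 1 }
  '
}

# Example call on a node (ens192 is a placeholder interface name):
#   ethtool -k ens192 | tnl_offloads_disabled && echo "offloads are disabled"
```

Because the setting is lost on reboot, the same check is worth repeating after any node restart.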

Workaround 2

  • Before attempting to update from Red Hat OpenShift Container Platform 4.6.25 to 4.7.7, the MachineConfig below can be applied to all Red Hat OpenShift Container Platform - Node(s) so that the ethtool commands run during Node start-up.
  1. The script below needs to be placed in /etc/NetworkManager/dispatcher.d/99-vsphere-disable-tx-udp-tnl on each Red Hat OpenShift Container Platform - Node (worker and Control-Plane) using the MachineConfig shown in step 2.

    #!/bin/bash
    # Workaround:
    # https://bugzilla.redhat.com/show_bug.cgi?id=1941714
    # https://bugzilla.redhat.com/show_bug.cgi?id=1935539
    # https://access.redhat.com/solutions/5997331
    
    driver=$(nmcli -t -m tabular -f general.driver dev show "${DEVICE_IFACE}")
    
    if [[ "$2" == "up" && "${driver}" == "vmxnet3" && -f /usr/sbin/ethtool ]]; then
      logger -s "99-vsphere-disable-tx-udp-tnl triggered by ${2} on device ${DEVICE_IFACE}."
      ethtool -K ${DEVICE_IFACE} tx-udp_tnl-segmentation off
      ethtool -K ${DEVICE_IFACE} tx-udp_tnl-csum-segmentation off
    fi
    
  2. The MachineConfig to create should look as follows for the Red Hat OpenShift Container Platform - Control-Plane Node(s).

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      labels:
        machineconfiguration.openshift.io/role: master
      name: 99-vsphere-networking-fix-master
    spec:
      config:
        ignition:
          config: {}
          security:
            tls: {}
          timeouts: {}
          version: 3.1.0
        networkd: {}
        passwd: {}
        storage:
          files:
          - contents:
              source: data:text/plain;charset=utf-8;base64,IyEvYmluL2Jhc2gKIyBXb3JrYXJvdW5kOgojIGh0dHBzOi8vYnVnemlsbGEucmVkaGF0LmNvbS9zaG93X2J1Zy5jZ2k/aWQ9MTk0MTcxNAojIGh0dHBzOi8vYnVnemlsbGEucmVkaGF0LmNvbS9zaG93X2J1Zy5jZ2k/aWQ9MTkzNTUzOQojIGh0dHBzOi8vYWNjZXNzLnJlZGhhdC5jb20vc29sdXRpb25zLzU5OTczMzEKCmRyaXZlcj0kKG5tY2xpIC10IC1tIHRhYnVsYXIgLWYgZ2VuZXJhbC5kcml2ZXIgZGV2IHNob3cgIiR7REVWSUNFX0lGQUNFfSIpCgppZiBbWyAiJDIiID09ICJ1cCIgJiYgIiR7ZHJpdmVyfSIgPT0gInZteG5ldDMiICYmIC1mIC91c3Ivc2Jpbi9ldGh0b29sIF1dOyB0aGVuCiAgbG9nZ2VyIC1zICI5OS12c3BoZXJlLWRpc2FibGUtdHgtdWRwLXRubCB0cmlnZ2VyZWQgYnkgJHsyfSBvbiBkZXZpY2UgJHtERVZJQ0VfSUZBQ0V9LiIKICBldGh0b29sIC1LICR7REVWSUNFX0lGQUNFfSB0eC11ZHBfdG5sLXNlZ21lbnRhdGlvbiBvZmYKICBldGh0b29sIC1LICR7REVWSUNFX0lGQUNFfSB0eC11ZHBfdG5sLWNzdW0tc2VnbWVudGF0aW9uIG9mZgpmaQo=
            mode: 484
            overwrite: true
            path: /etc/NetworkManager/dispatcher.d/99-vsphere-disable-tx-udp-tnl
      osImageURL: ""
    
  3. More details about MachineConfig and how to apply them to all Red Hat OpenShift Container Platform - Node(s) can be found in Using MachineConfig objects to configure nodes.

  4. Workaround 2 needs to remain in place even after a successful upgrade to Red Hat OpenShift Container Platform 4.7 and can only be removed once RHBZ #1952358 is resolved or Red Hat Technical Support advises accordingly.
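
The `source` field in the MachineConfig carries the dispatcher script as a base64 data URL, and `mode: 484` is the decimal form of octal 0744. The following is a minimal sketch for building and reviewing such a payload; the helper names `to_data_url` and `from_data_url` are hypothetical, and the `base64` invocations assume GNU coreutils.

```shell
#!/bin/bash
# Build the data URL for the MachineConfig `source` field from a script file.
to_data_url() {
  printf 'data:text/plain;charset=utf-8;base64,%s\n' "$(base64 < "$1" | tr -d '\n')"
}

# Decode a data URL of that form back into the script text for review.
from_data_url() {
  printf '%s' "${1#*;base64,}" | base64 -d
}

# Ignition expects the file mode in decimal; 484 decimal is 0744 octal.
printf '%o\n' 484   # prints 744
```

Decoding the `source` value from the manifest with `from_data_url` and comparing it with the script in step 1 is a quick way to confirm that the two match before applying the MachineConfig.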

Installation of Red Hat OpenShift Container Platform 4.7

Installing an older 4.7.z version is not recommended. If possible, install the latest errata. However, if a 4.7.z version lower than 4.7.11 must be installed with the platform set to none, or with the bare metal installation method, on vSphere with a virtual machine hardware version greater than 13, the MachineConfig manifest below must be applied during the installation, following the procedure documented in Customizing nodes on day 1.

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-vsphere-networking-fix-master
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 3.1.0
    networkd: {}
    passwd: {}
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,IyEvYmluL2Jhc2gKIyBXb3JrYXJvdW5kOgojIGh0dHBzOi8vYnVnemlsbGEucmVkaGF0LmNvbS9zaG93X2J1Zy5jZ2k/aWQ9MTk0MTcxNAojIGh0dHBzOi8vYnVnemlsbGEucmVkaGF0LmNvbS9zaG93X2J1Zy5jZ2k/aWQ9MTkzNTUzOQojIGh0dHBzOi8vYWNjZXNzLnJlZGhhdC5jb20vc29sdXRpb25zLzU5OTczMzEKCmRyaXZlcj0kKG5tY2xpIC10IC1tIHRhYnVsYXIgLWYgZ2VuZXJhbC5kcml2ZXIgZGV2IHNob3cgIiR7REVWSUNFX0lGQUNFfSIpCgppZiBbWyAiJDIiID09ICJ1cCIgJiYgIiR7ZHJpdmVyfSIgPT0gInZteG5ldDMiICYmIC1mIC91c3Ivc2Jpbi9ldGh0b29sIF1dOyB0aGVuCiAgbG9nZ2VyIC1zICI5OS12c3BoZXJlLWRpc2FibGUtdHgtdWRwLXRubCB0cmlnZ2VyZWQgYnkgJHsyfSBvbiBkZXZpY2UgJHtERVZJQ0VfSUZBQ0V9LiIKICBldGh0b29sIC1LICR7REVWSUNFX0lGQUNFfSB0eC11ZHBfdG5sLXNlZ21lbnRhdGlvbiBvZmYKICBldGh0b29sIC1LICR7REVWSUNFX0lGQUNFfSB0eC11ZHBfdG5sLWNzdW0tc2VnbWVudGF0aW9uIG9mZgpmaQo=
        mode: 484
        overwrite: true
        path: /etc/NetworkManager/dispatcher.d/99-vsphere-disable-tx-udp-tnl
  osImageURL: ""

Important: To apply the change to all OpenShift Container Platform - Node(s) (worker and Control-Plane), two files need to be created, one for each of the predefined roles (master and worker) created by the Red Hat OpenShift Container Platform 4 installer.
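
The two per-role files differ only in the role label and the object name, so they can be rendered from one template. The following is a minimal sketch under stated assumptions: the function name `render_machineconfig` and the worker object name `99-vsphere-networking-fix-worker` are hypothetical, and the empty stanzas of the manifest above are left out on the assumption that Ignition treats them as optional. Compare the result against the full manifest before use.

```shell
#!/bin/bash
# Print a MachineConfig for the given role (master or worker).
# $1: role, $2: data URL for the dispatcher script (the `source` value above).
render_machineconfig() {
  local role="$1" source_url="$2"
  cat <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: ${role}
  name: 99-vsphere-networking-fix-${role}
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: ${source_url}
        mode: 484
        overwrite: true
        path: /etc/NetworkManager/dispatcher.d/99-vsphere-disable-tx-udp-tnl
EOF
}

# Example: write one manifest per predefined role (SOURCE_URL is a placeholder
# for the data URL taken from the manifest above).
#   for role in master worker; do
#     render_machineconfig "$role" "$SOURCE_URL" > "99-vsphere-networking-fix-${role}.yaml"
#   done
```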

Root Cause

A change to the vmxnet3 driver in Red Hat Enterprise Linux 8.3 causes VXLAN packets that are required for the Software Defined Network (SDN) to be dropped.

Diagnostic Steps

  • The Red Hat OpenShift Container Platform 4.6.25 to 4.7.7 update is stuck on vSphere, the virtual machine hardware version is higher than 13, and oc get clusterversion reports the following error:

    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.6.25    True        True          34h     Unable to apply 4.7.7: the control plane is reporting an internal error
    
  • Various Red Hat OpenShift Container Platform - Cluster Operators are stuck in a degraded state, as shown below:

    $ oc get clusteroperator
    NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
    authentication                             4.7.7     False       True          True       33h
    baremetal                                  4.7.7     True        False         False      33h
    cloud-credential                           4.7.7     True        False         False      88d
    cluster-autoscaler                         4.7.7     True        False         False      88d
    config-operator                            4.7.7     True        False         False      88d
    console                                    4.7.7     False       False         True       32h
    csi-snapshot-controller                    4.7.7     True        False         False      125m
    dns                                        4.7.7     True        False         False      88d
    etcd                                       4.7.7     True        False         False      88d
    image-registry                             4.7.7     True        False         True       33h
    ingress                                    4.7.7     True        False         False      28h
    insights                                   4.7.7     True        False         False      88d
    kube-apiserver                             4.7.7     True        False         False      88d
    kube-controller-manager                    4.7.7     True        False         False      88d
    kube-scheduler                             4.7.7     True        False         False      88d
    kube-storage-version-migrator              4.7.7     True        False         False      75m
    machine-api                                4.7.7     True        False         False      88d
    machine-approver                           4.7.7     True        False         False      88d
    machine-config                             4.7.7     False       False         True       32h
    marketplace                                4.7.7     True        False         False      91m
    monitoring                                 4.7.7     False       False         True       33h
    network                                    4.7.7     True        False         False      88d
    node-tuning                                4.7.7     True        False         False      33h
    openshift-apiserver                        4.7.7     False       False         False      75m
    openshift-controller-manager               4.7.7     True        False         False      33h
    openshift-samples                          4.7.7     True        False         False      33h
    operator-lifecycle-manager                 4.7.7     True        False         False      88d
    operator-lifecycle-manager-catalog         4.7.7     True        False         False      88d
    operator-lifecycle-manager-packageserver   4.7.7     True        False         False      6m11s
    service-ca                                 4.7.7     True        False         False      88d
    storage                                    4.7.7     True        False         False      88d
    
  • The file /etc/NetworkManager/dispatcher.d/99-vsphere-disable-tx-udp-tnl is not present on any Red Hat OpenShift Container Platform - Node.

    $ oc debug node/worker-0
    sh-4.4# chroot /host
    sh-4.4# ls -l /etc/NetworkManager/dispatcher.d/99-vsphere-disable-tx-udp-tnl
    ls: cannot access '/etc/NetworkManager/dispatcher.d/99-vsphere-disable-tx-udp-tnl': No such file or directory
    
  • Check the hardware version of the virtual machines in vCenter:

    Using VMware PowerCLI for Powershell:
    
    $ Get-Folder "<OCP_Folder_Path>" | get-VM | Select-Object Name, HardwareVersion
    
    Name                    HardwareVersion
    ----                    ---------------
    openshift-x5mg6-worker2 vmx-15
    openshift-x5mg6-worker3 vmx-15
    openshift-x5mg6-worker1 vmx-15
    openshift-x5mg6-master3 vmx-15
    openshift-x5mg6-master2 vmx-15
    openshift-x5mg6-master1 vmx-15
    
    Using VMware govc:
    
    $ for i in $(govc find -type m -name 'openshift-x5mg6*'); do govc vm.info -json $i | jq -r '[.VirtualMachines[].Name, .VirtualMachines[].Config.Version] | join(" ")'; done
    
    openshift-x5mg6-worker2 vmx-15
    openshift-x5mg6-worker3 vmx-15
    openshift-x5mg6-worker1 vmx-15
    openshift-x5mg6-master3 vmx-15
    openshift-x5mg6-master2 vmx-15
    openshift-x5mg6-master1 vmx-15
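
Two of the checks above can be scripted against the command output. The following is a minimal sketch with hypothetical helper names: `unhealthy_operators` filters `oc get clusteroperator` output by the AVAILABLE and DEGRADED columns, and `hw_version_affected` tests a `vmx-N` string against the hardware version 13 threshold named in this solution.

```shell
#!/bin/bash
# Reads `oc get clusteroperator` output on stdin and prints the names of
# operators that are not available (column 3) or are degraded (column 5).
unhealthy_operators() {
  awk '$1 != "NAME" && ($3 != "True" || $5 == "True") { print $1 }'
}

# Succeeds if a hardware version string such as "vmx-15" is greater than 13.
# Returns 2 if the string does not have the form vmx-<number>.
hw_version_affected() {
  local n="${1#vmx-}"
  case "$n" in
    ''|*[!0-9]*) return 2 ;;
  esac
  [ "$n" -gt 13 ]
}

# Example calls:
#   oc get clusteroperator | unhealthy_operators
#   hw_version_affected vmx-15 && echo "hardware version is in the affected range"
```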
    
