ARO - SystemDLooping Update Risk

Solution Verified - Updated -

Environment

  • Clusters created at ARO (Azure Red Hat OpenShift) version:

    • less than 4.12.z
    • 4.11.z
    • 4.12.z
  • Update from (Current ARO version):

    • 4.12.z
  • Update to (Desired ARO version):

    • 4.13.46
  • Fixed in ARO versions:

    • 4.13.48
    • 4.14.35
    • 4.15.28
    • 4.16.8

Issue

Red Hat has identified an update risk for clusters updating into OpenShift 4.13.46 on ARO. Nodes affected by this issue are unable to reboot, which stalls the entire update process for ARO.

Red Hat recommends not updating into OpenShift 4.13.46 on ARO.

Resolution

For Azure Red Hat OpenShift (ARO), create a ticket with Red Hat Support or Microsoft Customer Support for further guidance on unblocking any upgrade issues. Do not attempt manual remediation.

NOTE: For self-managed OpenShift Container Platform on Microsoft Azure, refer here.

Root Cause

In OpenShift 4.13.46, a systemd dependency loop was detected. To break the loop, systemd deletes some dependencies, which causes kubelet and crio to never start on the affected nodes. For more information, see OCPBUGS-33694.
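
If journal output from an affected node is available (for example, collected as part of a support case), the dependency loop can usually be spotted in the systemd log lines. The following is a minimal sketch only: the helper name find_ordering_cycles is hypothetical, and the matched message wording is an assumption based on how systemd typically reports ordering cycles, so it may differ between systemd versions. It only reads text and does not change anything on the node.

```shell
# Hypothetical helper: reads journal text on stdin and prints the lines
# where systemd reports an ordering cycle or a job deleted to break one.
# The message wording is an assumption and may vary by systemd version.
find_ordering_cycles() {
  grep -E 'Found ordering cycle|deleted to break ordering cycle'
}
```

For example, with a saved journal file (the file name is illustrative): find_ordering_cycles < node-journal.txt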

Diagnostic Steps

For customers looking to start an update:

  • Run oc describe clusterversion. This lists all the versions the cluster has run. The version at the bottom of this list is the version the cluster was created at. If your cluster was created at version 4.12.z or older (a lower number), do not update into OpenShift 4.13.46 on ARO.
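
The check above can also be scripted. The following is a minimal sketch, not an official tool: the function names initial_version and is_412_or_older are hypothetical, and it assumes the update history is listed newest first, so that the last entry is the version the cluster was created at.

```shell
# Hypothetical helper: print the oldest entry in the update history.
# Assumes .status.history is ordered newest first, so the last word is
# the version the cluster was created at.
initial_version() {
  oc get clusterversion version -o jsonpath='{.status.history[*].version}' |
    awk '{ print $NF }'
}

# Hypothetical helper: succeed (exit 0) when the given x.y.z version is
# 4.12.z or older.
is_412_or_older() {
  v_major="${1%%.*}"      # text before the first dot
  v_rest="${1#*.}"        # text after the first dot
  v_minor="${v_rest%%.*}" # second version component
  [ "$v_major" -lt 4 ] || { [ "$v_major" -eq 4 ] && [ "$v_minor" -le 12 ]; }
}
```

A possible usage: if is_412_or_older "$(initial_version)"; then echo "Do not update into 4.13.46"; fi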

For customers experiencing a stalled update:

The following commands can be used to diagnose the issue.

  1. Verify that the cluster version shows a failed state for upgrade to 4.13.46.
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.25   True        True          3h8m    Unable to apply 4.13.46: wait has exceeded 40 minutes for these operators: etcd, kube-apiserver
  2. Run oc get mcp. The status of the MachineConfigPools should show that the pools are updating but not progressing past the first node.
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-96b90a0c2d5dea265139c1ce8bd7106e   False     True       False      3              0                   0                     0                      4h33m
worker   rendered-worker-349d41ff99e322a88d498b0ccc628eff   False     True       False     
  3. Run oc get nodes. The output should show one worker node and one master node in the NotReady state.
$ oc get nodes
NAME                             STATUS                        ROLES                  AGE     VERSION
aro-8zkjg-master-0               NotReady,SchedulingDisabled   control-plane,master   4h41m   v1.25.11+1485cc9
aro-8zkjg-master-1               Ready                         control-plane,master   4h41m   v1.25.11+1485cc9
aro-8zkjg-master-2               Ready                         control-plane,master   4h42m   v1.25.11+1485cc9
aro-8zkjg-worker-eastus1-44flv   NotReady,SchedulingDisabled   worker                 4h26m   v1.25.11+1485cc9
aro-8zkjg-worker-eastus2-kcklc   Ready                         worker                 4h26m   v1.25.11+1485cc9
aro-8zkjg-worker-eastus3-7kwlk   Ready                         worker                 4h26m   v1.25.11+1485cc9
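
The three checks above can be combined into small filters over the command output. This is a sketch under stated assumptions: the function names are hypothetical, the filters expect the default table output with a header row (as shown above), and the column positions are taken from the sample output in this article. The filters are read-only and do not attempt any remediation.

```shell
# Hypothetical helper: succeed when the cluster version output on stdin
# reports a failure to apply 4.13.46.
stalled_on_41346() {
  grep -qF 'Unable to apply 4.13.46'
}

# Hypothetical helper: print pools that are updating but have no updated
# machines. Columns: 3 = UPDATED, 4 = UPDATING, 8 = UPDATEDMACHINECOUNT.
stuck_pools() {
  awk 'NR > 1 && $3 == "False" && $4 == "True" && $8 == "0" { print $1 }'
}

# Hypothetical helper: print nodes whose STATUS column contains NotReady.
not_ready_nodes() {
  awk 'NR > 1 && $2 ~ /NotReady/ { print $1 }'
}
```

For example: oc get clusterversion | stalled_on_41346 && echo "update stalled"; oc get mcp | stuck_pools; oc get nodes | not_ready_nodes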

