ARO cluster or OCP cluster in Azure upgraded but Machine Config Pools Degraded with error: Marking Degraded due to: unexpected on-disk state

Solution Unverified

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • Azure Red Hat OpenShift (ARO)
    • 4
  • OpenShift Managed (Azure)
    • 4
  • Azure Key Vault Provider for Secrets Store CSI Driver (secrets-store-csi-driver-provider-azure)
  • Azure

Issue

  • After upgrading an ARO cluster or an OCP cluster installed in Azure, the node versions are inconsistent:

    $ oc get nodes
    NAME                      STATUS                    ROLES   AGE    VERSION
    aro-master-2              Ready                     master  403d   v1.21.1+6438632
    aro-master-0              Ready                     master  403d   v1.21.1+6438632
    aro-master-1              Ready                     master  403d   v1.21.1+6438632
    aro-worker-regionx-xxxxx  Ready                     worker  6h4m   v1.20.0+bbbc079
    aro-worker-regionx-xxxxx  Ready                     worker  3h20m  v1.20.0+bbbc079
    aro-worker-regionx-xxxxx  Ready,SchedulingDisabled  worker  2h2m   v1.20.0+bbbc079
    aro-worker-regionx-xxxxx  Ready                     worker  23d    v1.20.0+bbbc079
    aro-worker-regionx-xxxxx  Ready                     worker  5h31m  v1.20.0+bbbc079
    
  • A machine config pool is degraded and shows the errors specified in the "Diagnostic Steps" section:

    Marking Degraded due to: unexpected on-disk state validating against rendered-worker-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy"
    
  • The machine-config-daemon shows the following error for the azure.sock file:

    error: During /etc merge: Copying azure.sock: Cannot copy non-regular/non-symlink file: azure.sock
    

Resolution

Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.

Follow the "Diagnostic Steps" section to check if the issue is related to the azure.sock file. In other case, refer to KCS 5598401.

If the issue is with the azure.sock file, check if the "Azure Key Vault Provider for Secrets Store CSI Driver" is installed, and remove/uninstall it. Follow the steps in KCS 5598401 to allow the stuck node to continue with the upgrade.
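
For example, if the provider was deployed with Helm (a common installation method for this driver), it can be removed with a command similar to the one below. The release name and namespace are placeholders and depend on how the driver was installed in the cluster:

# release name and namespace are examples; adjust them to the actual installation
$ helm uninstall csi-secrets-store-provider-azure -n my_namespace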

Root Cause

The "Azure Key Vault Provider for Secrets Store CSI Driver" creates a privileged container mounting the host file system and writing a the azure.sock file to the /etc directory in the node, preventing the MCO to upgrade the node.

Diagnostic Steps

Check if any machineconfigpool is degraded:

$ oc get mcp
NAME    CONFIG                                            UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT  AGE
worker  rendered-worker-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  False    True      True      5             0                  0                    1                     403d
master  rendered-master-yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy  True     False     False     3             3                  3                    0                     403d
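
The detailed reason for the degraded state is recorded in the pool's status conditions and can be read, for example, with:

$ oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}'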

Check for a node NotReady or SchedulingDisabled:

$ oc get nodes
NAME                      STATUS                    ROLES   AGE    VERSION
[...]
aro-worker-regionx-xxxxx  Ready,SchedulingDisabled  worker  2h2m   v1.20.0+bbbc079
[...]
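
The machine-config state and the failure reason are also recorded as annotations on the affected node and can be inspected with something like:

$ oc describe node aro-worker-regionx-xxxxx | grep machineconfiguration.openshift.io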

Check the machine-config-daemon logs for the "expected target osImageURL" error:

$ oc get pods -n openshift-machine-config-operator -o wide
NAME                                        READY  STATUS   RESTARTS  AGE  IP             NODE
machine-config-daemon-yyyyy                 2/2    Running  0         14h  10.253.137.9   aro-worker-regionx-xxxxx
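
The daemon pod running on the affected node can also be selected directly with a field selector, for example:

$ oc get pods -n openshift-machine-config-operator -o wide --field-selector spec.nodeName=aro-worker-regionx-xxxxx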

$ oc logs -n openshift-machine-config-operator -c machine-config-daemon machine-config-daemon-yyyyy
[...]
2021-10-10T00:00:01.930763153Z E1010 00:00:01.930743    2326 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy"
[...]

Check for an error related to the file azure.sock:

$ oc logs -n openshift-machine-config-operator -c machine-config-daemon machine-config-daemon-yyyyy
[...]
2021-10-10T00:00:01.943571275Z Warning: failed to finalize previous deployment
2021-10-10T00:00:01.943571275Z          error: During /etc merge: Copying azure.sock: Cannot copy non-regular/non-symlink file: azure.sock
2021-10-10T00:00:01.943571275Z          check `journalctl -b -1 -u ostree-finalize-staged.service`
[...]
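
The journalctl hint from the log above can be followed directly on the node through a debug shell, for example:

$ oc debug node/aro-worker-regionx-xxxxx -- chroot /host journalctl -b -1 -u ostree-finalize-staged.service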

Check if there are pods from the csi-secrets-store-provider-azure provider:

$ oc get pods -A | grep "csi-secrets-store-provider-azure"
[...]
my_namespace             csi-secrets-store-provider-azure-xxxxx               1/1    Running    0         5d
[...]
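
If such pods exist, the provider DaemonSet typically mounts the host file system through a hostPath volume, which is how azure.sock ends up under /etc on the node. This can be confirmed with a command similar to the following (the DaemonSet and namespace names are examples and may differ):

$ oc get ds -n my_namespace csi-secrets-store-provider-azure -o yaml | grep -B2 -A4 hostPath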

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
