ARO cluster or OCP cluster in Azure upgraded but Machine Config Pools Degraded with error: Marking Degraded due to: unexpected on-disk state
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
- Azure Red Hat OpenShift (ARO) 4
- OpenShift Managed (Azure) 4
- Azure Key Vault Provider for Secrets Store CSI Driver (secrets-store-csi-driver-provider-azure) - Azure
Issue
- After upgrading an ARO cluster or an OCP cluster installed in Azure, the node versions are inconsistent:
$ oc get nodes
NAME STATUS ROLES AGE VERSION
aro-master-2 Ready master 403d v1.21.1+6438632
aro-master-0 Ready master 403d v1.21.1+6438632
aro-master-1 Ready master 403d v1.21.1+6438632
aro-worker-regionx-xxxxx Ready worker 6h4m v1.20.0+bbbc079
aro-worker-regionx-xxxxx Ready worker 3h20m v1.20.0+bbbc079
aro-worker-regionx-xxxxx Ready,SchedulingDisabled worker 2h2m v1.20.0+bbbc079
aro-worker-regionx-xxxxx Ready worker 23d v1.20.0+bbbc079
aro-worker-regionx-xxxxx Ready worker 5h31m v1.20.0+bbbc079
- A machine config pool is degraded and shows the errors specified in the "Diagnostic Steps" section:
Marking Degraded due to: unexpected on-disk state validating against rendered-worker-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy"
- The machine-config-daemon shows the following error for the azure.sock file:
error: During /etc merge: Copying azure.sock: Cannot copy non-regular/non-symlink file: azure.sock
Resolution
Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.
Follow the "Diagnostic Steps" section to check whether the issue is related to the azure.sock file. If it is not, refer to KCS 5598401.
If the issue is caused by the azure.sock file, check whether the "Azure Key Vault Provider for Secrets Store CSI Driver" is installed and, if so, uninstall it. Then follow the steps in KCS 5598401 to allow the stuck node to continue with the upgrade.
Root Cause
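The "Azure Key Vault Provider for Secrets Store CSI Driver" creates a privileged container that mounts the host file system and writes the azure.sock file to the /etc directory on the node. Because azure.sock is a socket rather than a regular file or symlink, the /etc merge performed during the OS update fails, which prevents the Machine Config Operator (MCO) from upgrading the node.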
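To confirm that a socket file is present under /etc on an affected node, something like the following may help. This is a minimal sketch: `list_sockets` is a hypothetical helper name, and the `oc debug` usage shown in the comments assumes cluster-admin access and that the node is reachable.

```shell
# List UNIX socket files under a directory, staying on the same file system.
# `list_sockets` is a hypothetical helper, not part of any Red Hat tooling.
list_sockets() {
  find "$1" -xdev -type s 2>/dev/null
}

# Possible usage on a node (NODE is a placeholder for the stuck node name):
#   oc debug node/"$NODE" -- chroot /host find /etc -xdev -type s
```

Any path printed by this check is a non-regular file that the /etc merge cannot copy.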
Diagnostic Steps
Check if any machineconfigpool is degraded:
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
worker rendered-worker-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx False True True 5 0 0 1 403d
master rendered-master-yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy True False False 3 3 3 0 403d
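On clusters with many pools, the degraded ones can be picked out of the output above with a small filter. This is a sketch only: `degraded_pools` is a hypothetical helper name, and it assumes the column layout shown above, where DEGRADED is the fifth column and the CONFIG column is not empty.

```shell
# Print the names of machine config pools whose DEGRADED column is "True".
# Reads `oc get mcp --no-headers` output on stdin.
# Assumption: DEGRADED is column 5, as in the output shown above.
degraded_pools() {
  awk '$5 == "True" { print $1 }'
}

# Possible usage against a live cluster (requires a logged-in oc session):
#   oc get mcp --no-headers | degraded_pools
```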
Check for a node that is NotReady or SchedulingDisabled:
$ oc get nodes
NAME STATUS ROLES AGE VERSION
[...]
aro-worker-regionx-xxxxx Ready,SchedulingDisabled worker 2h2m v1.20.0+bbbc079
[...]
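The node check can also be scripted. The sketch below assumes the default `oc get nodes` column order shown above (STATUS in column 2, VERSION in column 5); both function names are hypothetical.

```shell
# Print nodes whose STATUS is anything other than plain "Ready",
# for example "NotReady" or "Ready,SchedulingDisabled".
# Reads `oc get nodes --no-headers` output on stdin.
nodes_needing_attention() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Count nodes per kubelet version to spot a partially upgraded cluster.
node_version_counts() {
  awk '{ count[$5]++ } END { for (v in count) print count[v], v }' | sort -k2
}

# Possible usage against a live cluster:
#   oc get nodes --no-headers | nodes_needing_attention
#   oc get nodes --no-headers | node_version_counts
```

More than one line in the version count suggests that some nodes have not finished the upgrade.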
Check the machine-config-daemon logs for the error expected target osImageURL:
$ oc get pods -n openshift-machine-config-operator -o wide
NAME READY STATUS RESTARTS AGE IP NODE
machine-config-daemon-yyyyy 2/2 Running 0 14h 10.253.137.9 aro-worker-regionx-xxxxx
$ oc logs -n openshift-machine-config-operator -c machine-config-daemon machine-config-daemon-yyyyy
[...]
2021-10-10T00:00:01.930763153Z E1010 00:00:01.930743 2326 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy"
[...]
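To compare the two image digests without reading the full log line, the expected and current values can be extracted from the message. This is a sketch that assumes the log message keeps the exact wording shown above; `os_image_mismatch` is a hypothetical helper name.

```shell
# Extract the expected and current osImageURL values from machine-config-daemon
# log lines on stdin. Lines without the "expected target osImageURL" message
# produce no output.
os_image_mismatch() {
  sed -n 's/.*expected target osImageURL "\([^"]*\)", have "\([^"]*\)".*/expected=\1 have=\2/p'
}

# Possible usage (the pod name is the placeholder used in this article):
#   oc logs -n openshift-machine-config-operator -c machine-config-daemon \
#     machine-config-daemon-yyyyy | os_image_mismatch
```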
Check for an error related to the file azure.sock:
$ oc logs -n openshift-machine-config-operator -c machine-config-daemon machine-config-daemon-yyyyy
[...]
2021-10-10T00:00:01.943571275Z Warning: failed to finalize previous deployment
2021-10-10T00:00:01.943571275Z error: During /etc merge: Copying azure.sock: Cannot copy non-regular/non-symlink file: azure.sock
2021-10-10T00:00:01.943571275Z check `journalctl -b -1 -u ostree-finalize-staged.service`
[...]
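The log check for azure.sock can be wrapped in a small helper so it is easy to repeat per node. This is a sketch: `has_azure_sock_error` is a hypothetical name, and the `k8s-app=machine-config-daemon` label selector in the usage comments is an assumption about how the daemon pods are labelled; verify it on your cluster before relying on it.

```shell
# Succeed (exit 0) if the log text on stdin contains the /etc merge failure
# for azure.sock reported by the machine-config-daemon.
has_azure_sock_error() {
  grep -qF 'During /etc merge: Copying azure.sock'
}

# Possible usage (NODE is a placeholder for the stuck node name):
#   POD=$(oc get pods -n openshift-machine-config-operator \
#         -l k8s-app=machine-config-daemon \
#         --field-selector "spec.nodeName=$NODE" -o name)
#   oc logs -n openshift-machine-config-operator -c machine-config-daemon "$POD" \
#     | has_azure_sock_error && echo "azure.sock is blocking the /etc merge on $NODE"
```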
Check if there are csi-secrets-store-provider-azure pods:
$ oc get pods -A | grep "csi-secrets-store-provider-azure"
[...]
my_namespace csi-secrets-store-provider-azure-xxxxx 1/1 Running 0 5d
[...]