Cluster Updates Without Error but Machine Config Pools Degraded with `Marking Degraded due to: unexpected on-disk state` on OCP 4.6 and newer
Issue
- After performing an update to a newer version of OpenShift Container Platform, not all nodes are upgraded. For example:
  $ oc get node
  NAME                       STATUS                     ROLES    AGE   VERSION
  master-0.ocp.example.net   Ready                      master   34d   v1.17.1+9d33dd3
  master-1.ocp.example.net   Ready                      master   34d   v1.17.1+9d33dd3
  master-2.ocp.example.net   Ready                      master   34d   v1.17.1+9d33dd3
  worker-0.ocp.example.net   Ready                      worker   34d   v1.17.1+9d33dd3
  worker-1.ocp.example.net   Ready                      worker   34d   v1.17.1+9d33dd3
  worker-2.ocp.example.net   Ready,SchedulingDisabled   worker   34d   v1.17.1+912792b   <----------
- After performing an update to a newer version of OpenShift Container Platform, the MachineConfigOperator reports degraded pools:
  $ oc describe co/machine-config
  ...
  'Failed to resync $VERSION because: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool $POOL is not ready, retrying. Status: (pool degraded: true total: x, ready y, updated: y, unavailable: 1)]'
- A machine config pool is degraded, and the extensions of the machine-config ClusterOperator show an error similar to:
  worker: 'pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node worker0 is reporting: \"unexpected on-disk state validating against rendered-worker-abc: expected target osImageURL \\\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:xxx\\\", have \\\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:yyy\\\" (\\\"zzz\\\")\""'
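
The first symptom above (one node reporting a different kubelet version than the rest) can be hard to spot by eye on larger clusters. The following is a minimal sketch, not an official tool: `nodes_with_odd_version` is a hypothetical helper name, and it assumes the default `oc get node` column layout, where NAME is the first column and VERSION is the last (this does not hold for `-o wide`).

```shell
#!/bin/sh
# Hypothetical helper (not part of oc): reads default `oc get node` output on
# stdin and prints the names of nodes whose VERSION differs from the most
# common VERSION in the listing. If two versions are equally common, which
# one counts as "most common" is arbitrary.
nodes_with_odd_version() {
  awk '
    $1 == "NAME" { next }            # skip the header row, if present
    NF >= 5 {
      name[NR] = $1                  # NAME is the first column
      ver[NR]  = $NF                 # VERSION is the last column (assumption)
      count[$NF]++
    }
    END {
      best = ""
      for (v in count)
        if (best == "" || count[v] > count[best]) best = v
      for (i = 1; i <= NR; i++)
        if ((i in name) && ver[i] != best) print name[i]
    }
  '
}
```

Typical usage would be `oc get node --no-headers | nodes_with_odd_version`. Any node it prints is a candidate for the stuck update described above; `oc get mcp` can then be used to see which machine config pool reports as degraded.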
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 4.6+