Cluster operator image-registry degraded in a region that does not support multiple availability zones
Environment
- Azure Red Hat OpenShift
- Region that does not support multiple availability zones
Issue
- The image-registry cluster operator is degraded because one of the image-registry pods is stuck in Pending status with the following warning:
0/9 nodes are available: 3 node(s) didn't match pod topology spread constraints (missing required label), 3 node(s) had untolerated taint {node-role.kubernetes.io/dev: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 5 node(s) didn't match pod topology spread constraints. preemption: 0/9 nodes are available: 3 node(s) didn't match pod topology spread constraints, 6 Preemption is not helpful for scheduling.
Resolution
Step 1
Scale up the machine set to create new machines. The new machines are added to the availability set, and at least one of them is assigned a faultDomain of 1. Its node therefore gets the missing topology.kubernetes.io/zone value of 1, which allows the pod to be scheduled.
Step 2
Delete the old machines so that the only remaining workers are part of the availabilitySet.
Step 3
Scale the machine set back down to its original replica count.
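The three steps above can be sketched as a small helper that only prints the oc commands, so that each one can be run by hand once the previous step has settled (for example, after the new nodes report Ready). The function name, machine set name, machine names, and replica counts are hypothetical placeholders, not values taken from an affected cluster.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands for the three steps; it does not run them.
# All names and replica counts passed in are placeholders (assumptions).

NS=openshift-machine-api

# usage: print_plan <machineset> <original-replicas> <temporary-replicas> <old-machine>...
print_plan() {
  local machineset="$1" original="$2" temporary="$3"
  shift 3
  # Step 1: scale up so that new machines are created in the availability set.
  echo "oc scale machineset ${machineset} -n ${NS} --replicas=${temporary}"
  # Step 2: delete each old machine once the new nodes are Ready.
  local m
  for m in "$@"; do
    echo "oc delete machine ${m} -n ${NS}"
  done
  # Step 3: scale back down to the original replica count.
  echo "oc scale machineset ${machineset} -n ${NS} --replicas=${original}"
}
```

For example, print_plan my-machineset 3 6 old-machine-a old-machine-b old-machine-c prints five commands to review. The real machine set and machine names can be listed with oc get machineset -n openshift-machine-api and oc get machine -n openshift-machine-api.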
Root Cause
- In clusters installed with OpenShift v4.10 or later in a region with a single availability zone, the cluster-api-azure plugin in the machine-api controller sets topology.kubernetes.io/zone to "0" on one worker and to "1" on another.
- This happens because the machines that are added to the machineset are automatically added to an availabilitySet.
- In clusters created prior to OpenShift v4.10, the existing nodes in the machineset have not been deleted and recreated, so they were never added to an availability set. As a result, the only nodes that these pods can be scheduled on all have topology.kubernetes.io/zone: 0.
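To confirm this on a cluster, the zone label can be shown as an extra column and then summarized. The following is a minimal sketch: the zone_summary helper is a hypothetical name, and it assumes the default column layout of oc get nodes (NAME, STATUS, ROLES, AGE, VERSION) followed by the label column added with -L.

```shell
#!/usr/bin/env bash
# Sketch: count how many nodes carry each topology.kubernetes.io/zone value.
# Reads on stdin the output of:
#   oc get nodes -L topology.kubernetes.io/zone --no-headers
# Assumption: five default columns, with the zone as the sixth; a node
# without the label has an empty sixth column and is reported as <missing>.
zone_summary() {
  awk '{ zone = (NF >= 6) ? $6 : "<missing>"; count[zone]++ }
       END { for (z in count) print z, count[z] }' | sort
}
```

A possible invocation is oc get nodes -l node-role.kubernetes.io/worker -L topology.kubernetes.io/zone --no-headers | zone_summary. If every schedulable worker reports zone 0, the cluster matches the situation described above.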
Diagnostic Steps
- Check cluster operator:
$ oc get co
image-registry 4.11.28 True True True 2y
- Check the pods in the openshift-image-registry project:
$ oc get pod -n openshift-image-registry |grep -i image
cluster-image-registry-operator-fd7c9fb9d-fbx5c 1/1 Running 0 20d
image-registry-7df957fc6b-l4sws 0/1 Pending 0 6d
image-registry-7df957fc6b-l5dqb 1/1 Running 0 19d
- Check the events in the openshift-image-registry project:
$ oc get event -n openshift-image-registry
LAST SEEN TYPE REASON OBJECT MESSAGE
13d Warning FailedScheduling pod/image-registry-7df957fc6b-crcdk 0/9 nodes are available: 3 node(s) didn't match pod topology spread constraints (missing required label), 3 node(s) had untolerated taint {node-role.kubernetes.io/dev: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 5 node(s) didn't match pod topology spread constraints. preemption: 0/9 nodes are available: 3 node(s) didn't match pod topology spread constraints, 6 Preemption is not helpful for scheduling.
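The pod check above can be narrowed to just the stuck pods. This is a minimal sketch, assuming the default oc get pod column layout (NAME, READY, STATUS, RESTARTS, AGE); pending_pods is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print the names of pods whose STATUS column is Pending.
# Reads on stdin the output of:
#   oc get pod -n openshift-image-registry --no-headers
# Assumption: STATUS is the third whitespace-separated column.
pending_pods() {
  awk '$3 == "Pending" { print $1 }'
}
```

For example, oc get pod -n openshift-image-registry --no-headers | pending_pods lists the pending pods, and oc describe pod <pod-name> -n openshift-image-registry then shows the FailedScheduling events for one of them. The spread constraints themselves can be inspected with oc get deployment image-registry -n openshift-image-registry -o jsonpath='{.spec.template.spec.topologySpreadConstraints}'.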
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.