How to troubleshoot nodes getting stuck in "SchedulingDisabled" state?
Environment
- Red Hat OpenShift Container Platform (OCP 4)
- Red Hat OpenShift Service on AWS (ROSA 4)
- Red Hat OpenShift Dedicated 4 (OSD 4)
- Azure Red Hat OpenShift (ARO 4)
Issue
- Nodes are stuck in the "SchedulingDisabled" state after scaling down a machineset.
- Machines are stuck in the "Deleting" phase after scaling down a machineset.
- The cluster autoscaler is unable to scale down a node.
Resolution
Delete the PodDisruptionBudget (PDB) that is blocking the eviction. If needed, make a backup copy of the PDB policy first so that it can be restored later.
After the PDB policy is deleted, the node proceeds to drain. It can still take some time (up to 30 minutes) for the node to be fully deleted, so continue to check the node status.
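For example, the PDB can be identified, backed up, and deleted with commands like the following, where <pdb-name> and <namespace> are placeholders for the policy that blocks the eviction:
oc get pdb --all-namespaces
oc get pdb <pdb-name> -n <namespace> -o yaml > pdb-backup.yaml
oc delete pdb <pdb-name> -n <namespace>
Once the node has been removed, the policy can be restored from the backup with oc create -f pdb-backup.yaml if it is still needed.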
Root Cause
In this case, the machine controller was unable to drain the node associated with the machine stuck in the "Deleting" phase because of a restrictive pod disruption budget (PDB).
The following error is observed in the machine-controller container of the machine-api-controllers pod:
failed to drain node for machine. Cannot evict pod <pod-name> as it would violate the pod's disruption budget.
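For illustration only, a PDB like the following (hypothetical name and selector) allows zero disruptions when the matching workload runs a single replica, which is enough to block the drain:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb        # hypothetical name
  namespace: <namespace>
spec:
  minAvailable: 1          # with only one matching replica, no voluntary disruptions are allowed
  selector:
    matchLabels:
      app: example         # hypothetical label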
Additionally, the cluster autoscaler considers a node for removal if the following conditions apply:
- The sum of CPU and memory requests of all pods running on the node is less than 50% of the allocated resources on the node.
- The cluster autoscaler can move all pods running on the node to the other nodes.
- The node does not have the cluster-autoscaler scale-down disabled annotation (see the check after this list).
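As a quick check, assuming the upstream cluster-autoscaler annotation is in use, you can verify whether scale down has been explicitly disabled for a node (<node-name> is a placeholder):
oc get node <node-name> -o jsonpath='{.metadata.annotations.cluster-autoscaler\.kubernetes\.io/scale-down-disabled}'
An output of "true" means the autoscaler will not remove that node regardless of its utilization.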
If the following types of pods are present on a node, the cluster autoscaler will not remove the node:
- Pods with restrictive pod disruption budgets (PDBs).
- kube-system pods that are not run on the node by default.
- kube-system pods that do not have a PDB or have a PDB that is too restrictive.
- Pods that are not backed by a controller object such as a deployment, replica set, or stateful set.
- Pods with local storage.
- Pods that cannot be moved elsewhere because of a lack of resources, incompatible node selectors or affinity, matching anti-affinity, and so on.
- Pods that have a "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation. Pods in the categories above can still be evicted if they are annotated with "cluster-autoscaler.kubernetes.io/safe-to-evict": "true" (see the example after this list).
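If a pod that falls into one of the categories above is in fact safe to evict, it can be marked explicitly. <pod-name> and <namespace> are placeholders, and for controller-managed pods the annotation is better set in the pod template:
oc annotate pod <pod-name> -n <namespace> cluster-autoscaler.kubernetes.io/safe-to-evict=true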
Diagnostic Steps
(1) Check the machines that are stuck in the "Deleting" phase:
oc get machines -n openshift-machine-api
NAME                                           PHASE      TYPE        REGION      ZONE         AGE
ocp-XXX-demo-machine-worker-us-east-xx-xxxxx   Deleting   m5.xlarge   us-east-2   us-east-2a   83d
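The corresponding node usually remains cordoned while the drain is blocked; its STATUS shows "Ready,SchedulingDisabled" (illustrative output, node name and version shortened):
oc get nodes
NAME                          STATUS                     ROLES    AGE   VERSION
ip-xx-x-xxx-xx.ec2.internal   Ready,SchedulingDisabled   worker   83d   v1.xx.x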
(2) Check logs from machine-api-controller:
oc logs machine-api-controllers-xxxxxxxxxx-xxxxx -c machine-controller -n openshift-machine-api
I0705 16:16:10.316805 1 controller.go:175] ocp-XXX-demo-machine-worker-us-east-xx-xxxxx: reconciling Machine
I0705 16:16:10.329202 1 controller.go:219] ocp-XXX-demo-machine-worker-us-east-xx-xxxxx: reconciling machine triggers delete
I0705 16:16:10.479565 1 controller.go:709] evicting pod <namespace>/<pod-name>
E0705 16:16:10.486631 1 controller.go:709] error when evicting pods/"<pod-name>" -n "<namespace>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0705 16:16:15.487449 1 controller.go:709] evicting pod <namespace>/<pod-name>
E0705 16:16:15.496589 1 controller.go:709] error when evicting pods/"<pod-name>" -n "<namespace>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0705 16:16:20.496843 1 controller.go:709] evicting pod <namespace>/<pod-name>
E0705 16:16:20.502997 1 controller.go:709] error when evicting pods/"<pod-name>" -n "<namespace>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0705 16:16:25.503998 1 controller.go:709] evicting pod <namespace>/<pod-name>
E0705 16:16:25.510273 1 controller.go:709] error when evicting pods/"<pod-name>" -n "<namespace>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0705 16:16:30.510364 1 controller.go:709] evicting pod <namespace>/<pod-name>
W0705 16:16:30.510421 1 controller.go:456] drain failed for machine "ocp-XXX-demo-machine-worker-us-east-xx-xxxxx": error when evicting pods/"<pod-name>" -n "<namespace>": global timeout reached: 20s
E0705 16:16:30.510437 1 controller.go:234] ocp-XXX-demo-machine-worker-us-east-xx-xxxxx: failed to drain node for machine: requeue in: 20s
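(3) Check the PDBs in the affected pod's namespace; a policy that currently allows zero disruptions is the one blocking the drain (<namespace> and the PDB name below are placeholders, output is illustrative):
oc get pdb -n <namespace>
NAME          MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
example-pdb   1               N/A               0                     83d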