Troubleshooting nodes stuck in "SchedulingDisabled" when scaling down
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
- PodDisruptionBudget (PDB)
- Cluster Autoscaler
Issue
- Nodes stuck in SchedulingDisabled state after scaling down the MachineSet.
- Machines stuck in Deleting phase after scaling down the MachineSet.
- Cluster Autoscaler is unable to scale down a node.
- The following errors are observed in the machine-controller container of the machine-api-controllers pod when trying to scale down a node:

      failed to drain node for machine: requeue in: 20s
      error when evicting pods/"[pod_name]" -n "[namespace_name]" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
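A quick way to confirm the symptom is to list the nodes and look for the SchedulingDisabled status (illustrative output; the node name below is a placeholder):

    $ oc get nodes
    NAME                                         STATUS                     ROLES    AGE   VERSION
    ip-xx-x-xxx-xxx.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   83d   v1.xx.x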
Resolution
Refer to the solution "PodDisruptionBudget (PDB) could cause Machine Config Operator (MCO) to be degraded" in case a PDB does not allow the pods on the node to be evicted.
After following any of the procedures in the solution linked above, the node will proceed to drain. However, it can take some time (up to 30 minutes) for the node to be fully deleted, so continue checking the node status.
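As a minimal sketch of what that remediation typically looks like (not a substitute for the linked solution; the namespace, PDB, and deployment names below are placeholders, not from this case): find the PDB that currently allows zero disruptions, then either scale the workload up or relax the PDB so at least one pod may be evicted.

    # Find PDBs that currently allow zero voluntary disruptions
    $ oc get pdb --all-namespaces
    NAMESPACE      NAME         MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
    my-namespace   my-app-pdb   2               N/A               0                     83d

    # Option 1: scale the workload above minAvailable so an eviction is allowed
    $ oc scale deployment my-app -n my-namespace --replicas=3

    # Option 2: relax the PDB itself (e.g., lower minAvailable)
    $ oc edit pdb my-app-pdb -n my-namespace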
Root Cause
In this case, the Machine Controller was unable to drain the node associated with the machine stuck in the Deleting phase, because a PodDisruptionBudget did not allow the remaining pods on the node to be evicted.
Refer also to the solution "How does scale down work in Cluster Autoscaler in OCP 4" for information about how nodes are considered for removal and what can prevent a node from being removed.
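As a hypothetical example of how this situation arises: a PDB whose minAvailable equals the workload's replica count leaves zero allowed disruptions, so every eviction attempt during the drain is rejected. The manifest below is illustrative only (names are placeholders; on older 4.x releases the apiVersion may be policy/v1beta1):

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: my-app-pdb          # placeholder
      namespace: my-namespace   # placeholder
    spec:
      minAvailable: 2           # equal to the deployment's replicas -> 0 allowed disruptions
      selector:
        matchLabels:
          app: my-app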
Diagnostic Steps
- Check the machines that are stuck in the Deleting phase:

      $ oc get machines -n openshift-machine-api
      NAME                                           PHASE      TYPE        REGION      ZONE         AGE
      ocp-XXX-demo-machine-worker-us-east-xx-xxxxx   Deleting   m5.xlarge   us-east-2   us-east-2a   83d
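  Optionally, describe the stuck machine as well; drain-related failures are typically surfaced in its status and events (the machine name is the placeholder from above):

      $ oc describe machine ocp-XXX-demo-machine-worker-us-east-xx-xxxxx -n openshift-machine-api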
- Check logs from machine-api-controllers:

      $ oc logs machine-api-controllers-xxxxxxxxxx-xxxxx -c machine-controller -n openshift-machine-api
      [...]
      I0705 16:16:10.316805 1 controller.go:175] ocp-XXX-demo-machine-worker-us-east-xx-xxxxx: reconciling Machine
      I0705 16:16:10.329202 1 controller.go:219] ocp-XXX-demo-machine-worker-us-east-xx-xxxxx: reconciling machine triggers delete
      I0705 16:16:10.479565 1 controller.go:709] evicting pod <namespace>/<pod-name>
      E0705 16:16:10.486631 1 controller.go:709] error when evicting pods/"<pod-name>" -n "<namespace>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      I0705 16:16:15.487449 1 controller.go:709] evicting pod <namespace>/<pod-name>
      E0705 16:16:15.496589 1 controller.go:709] error when evicting pods/"<pod-name>" -n "<namespace>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      I0705 16:16:20.496843 1 controller.go:709] evicting pod <namespace>/<pod-name>
      E0705 16:16:20.502997 1 controller.go:709] error when evicting pods/"<pod-name>" -n "<namespace>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      I0705 16:16:25.503998 1 controller.go:709] evicting pod <namespace>/<pod-name>
      E0705 16:16:25.510273 1 controller.go:709] error when evicting pods/"<pod-name>" -n "<namespace>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      I0705 16:16:30.510364 1 controller.go:709] evicting pod <namespace>/<pod-name>
      W0705 16:16:30.510421 1 controller.go:456] drain failed for machine "ocp-XXX-demo-machine-worker-us-east-xx-xxxxx": error when evicting pods/"<pod-name>" -n "<namespace>": global timeout reached: 20s
      E0705 16:16:30.510437 1 controller.go:234] ocp-XXX-demo-machine-worker-us-east-xx-xxxxx: failed to drain node for machine: requeue in: 20s