How to troubleshoot nodes getting stuck in "SchedulingDisabled" state?
Environment
- Red Hat OpenShift Container Platform (OCP 4)
- Red Hat OpenShift Service on AWS (ROSA 4)
- Red Hat OpenShift Dedicated 4 (OSD 4)
- Azure Red Hat OpenShift (ARO 4)
Issue
- Nodes are stuck in the "SchedulingDisabled" state after scaling down a machineset.
- Machines are stuck in the "Deleting" phase after scaling down a machineset.
- The cluster autoscaler is unable to scale down a node.
Resolution
Delete the PodDisruptionBudget (PDB) that is blocking the eviction. If needed, make a backup copy of the PDB policy first so that it can be restored later.
After the PDB policy is deleted, the node proceeds to drain. It can still take some time (up to 30 minutes) for the node to be fully deleted, so continue to check the node status.
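For example, the PDB can be identified, backed up, and deleted with commands like the following, where <pdb-name> and <namespace> are placeholders for the policy that blocks the eviction:
oc get pdb --all-namespaces
oc get pdb <pdb-name> -n <namespace> -o yaml > pdb-backup.yaml
oc delete pdb <pdb-name> -n <namespace>
Once the node has been removed, the policy can be restored from the backup with oc create -f pdb-backup.yaml if it is still needed.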
Root Cause
In this case, the machine controller was unable to drain the node associated with the machine stuck in the "Deleting" phase because of a restrictive pod disruption budget (PDB).
The following error is observed in the machine-controller container of the machine-api-controllers pod:
failed to drain node for machine. Cannot evict pod <pod-name> as it would violate the pod's disruption budget.
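For illustration only, a PDB like the following (hypothetical name and selector) allows zero disruptions when the matching workload runs a single replica, which is enough to block the drain:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb        # hypothetical name
  namespace: <namespace>
spec:
  minAvailable: 1          # with only one matching replica, no voluntary disruptions are allowed
  selector:
    matchLabels:
      app: example         # hypothetical label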
Additionally, the cluster autoscaler considers a node for removal if the following conditions apply:
- The sum of CPU and memory requests of all pods running on the node is less than 50% of the allocated resources on the node.
- The cluster autoscaler can move all pods running on the node to the other nodes.
- The node does not have the cluster-autoscaler scale-down disabled annotation (see the check after this list).
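As a quick check, assuming the upstream cluster-autoscaler annotation is in use, you can verify whether scale down has been explicitly disabled for a node (<node-name> is a placeholder):
oc get node <node-name> -o jsonpath='{.metadata.annotations.cluster-autoscaler\.kubernetes\.io/scale-down-disabled}'
An output of "true" means the autoscaler will not remove that node regardless of its utilization.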
If the following types of pods are present on a node, the cluster autoscaler will not remove the node:
- Pods with restrictive pod disruption budgets (PDBs).
- kube-system pods that are not run on the node by default.
- kube-system pods that do not have a PDB or have a PDB that is too restrictive.
- Pods that are not backed by a controller object such as a deployment, replica set, or stateful set.
- Pods with local storage.
- Pods that cannot be moved elsewhere because of a lack of resources, incompatible node selectors or affinity, matching anti-affinity, and so on.
- Pods that have a "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation. Pods in the categories above can still be evicted if they are annotated with "cluster-autoscaler.kubernetes.io/safe-to-evict": "true" (see the example after this list).
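If a pod that falls into one of the categories above is in fact safe to evict, it can be marked explicitly. <pod-name> and <namespace> are placeholders, and for controller-managed pods the annotation is better set in the pod template:
oc annotate pod <pod-name> -n <namespace> cluster-autoscaler.kubernetes.io/safe-to-evict=true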
Diagnostic Steps
(1) Check the machines that are stuck in the "Deleting" phase:
oc get machines -n openshift-machine-api
NAME                                           PHASE      TYPE        REGION      ZONE         AGE
ocp-XXX-demo-machine-worker-us-east-xx-xxxxx   Deleting   m5.xlarge   us-east-2   us-east-2a   83d
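The corresponding node usually remains cordoned while the drain is blocked; its STATUS shows "Ready,SchedulingDisabled" (illustrative output, node name and version shortened):
oc get nodes
NAME                          STATUS                     ROLES    AGE   VERSION
ip-xx-x-xxx-xx.ec2.internal   Ready,SchedulingDisabled   worker   83d   v1.xx.x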
(2) Check logs from machine-api-controller:
oc logs machine-api-controllers-xxxxxxxxxx-xxxxx -c machine-controller -n openshift-machine-api
I0705 16:16:10.316805 1 controller.go:175] ocp-XXX-demo-machine-worker-us-east-xx-xxxxx: reconciling Machine
I0705 16:16:10.329202 1 controller.go:219] ocp-XXX-demo-machine-worker-us-east-xx-xxxxx: reconciling machine triggers delete
I0705 16:16:10.479565 1 controller.go:709] evicting pod <namespace>/<pod-name>
E0705 16:16:10.486631 1 controller.go:709] error when evicting pods/"<pod-name>" -n "<namespace>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0705 16:16:15.487449 1 controller.go:709] evicting pod <namespace>/<pod-name>
E0705 16:16:15.496589 1 controller.go:709] error when evicting pods/"<pod-name>" -n "<namespace>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0705 16:16:20.496843 1 controller.go:709] evicting pod <namespace>/<pod-name>
E0705 16:16:20.502997 1 controller.go:709] error when evicting pods/"<pod-name>" -n "<namespace>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0705 16:16:25.503998 1 controller.go:709] evicting pod <namespace>/<pod-name>
E0705 16:16:25.510273 1 controller.go:709] error when evicting pods/"<pod-name>" -n "<namespace>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0705 16:16:30.510364 1 controller.go:709] evicting pod <namespace>/<pod-name>
W0705 16:16:30.510421 1 controller.go:456] drain failed for machine "ocp-XXX-demo-machine-worker-us-east-xx-xxxxx": error when evicting pods/"<pod-name>" -n "<namespace>": global timeout reached: 20s
E0705 16:16:30.510437 1 controller.go:234] ocp-XXX-demo-machine-worker-us-east-xx-xxxxx: failed to drain node for machine: requeue in: 20s
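(3) Check the PDBs in the affected pod's namespace; a policy that currently allows zero disruptions is the one blocking the drain (<namespace> and the PDB name below are placeholders, output is illustrative):
oc get pdb -n <namespace>
NAME          MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
example-pdb   1               N/A               0                     83d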