Troubleshooting nodes stuck in "SchedulingDisabled" when scaling down

Solution Verified

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • PodDisruptionBudget (PDB)
  • Cluster Autoscaler

Issue

  • Nodes are stuck in the SchedulingDisabled state after the MachineSet is scaled down.
  • Machines are stuck in the Deleting phase after the MachineSet is scaled down.
  • The Cluster Autoscaler is unable to scale down a node.
  • The following errors appear in the machine-controller container of the machine-api-controllers pod when a node is being scaled down:

    failed to drain node for machine: requeue in: 20s
    
    error when evicting pods/"[pod_name]" -n "[namespace_name]" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    
    

Resolution

If a PDB prevents the node from being drained, refer to the solution "PodDisruptionBudget (PDB) could cause Machine Config Operator (MCO) to be degraded".

After you follow one of the procedures in the solution linked above, the node drain proceeds. It can still take up to 30 minutes for the node to be fully deleted, so keep checking the node status.
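
The repeated status check can be scripted. The following is a minimal sketch, not a supported tool: the function name wait_for_node_removal and its arguments are hypothetical, and it assumes that a failing "oc get node" means the node is gone (an authentication or network error would look the same, so confirm the result manually).

```shell
#!/bin/sh
# Hypothetical helper: poll until the node object disappears or a timeout expires.
# Usage: wait_for_node_removal <node-name> [timeout-seconds] [interval-seconds]
# Assumption: any failure of `oc get node` is treated as "node removed".
wait_for_node_removal() {
    node=$1
    timeout=${2:-1800}    # default 30 minutes, the upper bound given in this solution
    interval=${3:-30}     # seconds between checks
    elapsed=0
    while oc get node "$node" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "node $node still present after ${elapsed}s" >&2
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "node $node removed after ~${elapsed}s"
}
```

For example, "wait_for_node_removal <node-name>" checks every 30 seconds and returns a non-zero status if the node still exists after 30 minutes.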

Root Cause

In this case, the Machine Controller could not drain the node associated with the machine stuck in the Deleting phase, because a PodDisruptionBudget did not allow one or more pods on that node to be evicted.
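
To see which PDBs currently allow no evictions, the disruptionsAllowed value in each PDB status can be listed and filtered. This is a sketch under stated assumptions: the function names blocking_pdbs and list_blocking_pdbs are hypothetical, and the wrapper needs a logged-in oc client with permission to list PDBs in all namespaces.

```shell
#!/bin/sh
# Hypothetical filter: reads "NAMESPACE NAME ALLOWED" rows on stdin and prints
# every PDB whose allowed-disruption count is 0 as "namespace/name".
# Rows where the count is not set (shown as <none>) are not printed.
blocking_pdbs() {
    awk '$3 == 0 { print $1 "/" $2 }'
}

# Hypothetical wrapper: fetch all PDBs from the cluster and apply the filter.
list_blocking_pdbs() {
    oc get pdb --all-namespaces --no-headers \
        -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed' \
        | blocking_pdbs
}
```

A PDB listed here blocks eviction of the pods it selects. Whether it is the one blocking this particular drain depends on whether those pods run on the affected node.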

Refer also to "how does scale down works in Cluster Autoscaler in OCP 4" for information about how nodes are considered for removal and what can prevent a node from being removed.

Diagnostic Steps

  1. Check the machines that are stuck in the Deleting phase:

    $ oc get machines -n openshift-machine-api
    NAME                                           PHASE      TYPE        REGION      ZONE         AGE
    ocp-XXX-demo-machine-worker-us-east-xx-xxxxx   Deleting   m5.xlarge   us-east-2   us-east-2a   83d
    
  2. Check the logs of the machine-controller container in the machine-api-controllers pod:

    $ oc logs -n openshift-machine-api machine-api-controllers-xxxxxxxxxx-xxxxx -c machine-controller
    [...]
    I0705 16:16:10.316805       1 controller.go:175] ocp-XXX-demo-machine-worker-us-east-xx-xxxxx: reconciling Machine
    I0705 16:16:10.329202       1 controller.go:219] ocp-XXX-demo-machine-worker-us-east-xx-xxxxx: reconciling machine triggers delete
    I0705 16:16:10.479565       1 controller.go:709] evicting pod <namespace>/<pod-name>
    E0705 16:16:10.486631       1 controller.go:709] error when evicting pods/"<pod-name>" -n "<namespace>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    I0705 16:16:15.487449       1 controller.go:709] evicting pod <namespace>/<pod-name>
    E0705 16:16:15.496589       1 controller.go:709] error when evicting pods/"<pod-name>" -n "<namespace>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    I0705 16:16:20.496843       1 controller.go:709] evicting pod <namespace>/<pod-name>
    E0705 16:16:20.502997       1 controller.go:709] error when evicting pods/"<pod-name>" -n "<namespace>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    I0705 16:16:25.503998       1 controller.go:709] evicting pod <namespace>/<pod-name>
    E0705 16:16:25.510273       1 controller.go:709] error when evicting pods/"<pod-name>" -n "<namespace>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    I0705 16:16:30.510364       1 controller.go:709] evicting pod <namespace>/<pod-name>
    W0705 16:16:30.510421       1 controller.go:456] drain failed for machine "ocp-XXX-demo-machine-worker-us-east-xx-xxxxx": error when evicting pods/"<pod-name>" -n "<namespace>": global timeout reached: 20s
    E0705 16:16:30.510437       1 controller.go:234] ocp-XXX-demo-machine-worker-us-east-xx-xxxxx: failed to drain node for machine: requeue in: 20s
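
Both diagnostic steps can be condensed into two small filters. This is a sketch, not part of the product: the function names are hypothetical, the column paths assume the Machine status fields phase and nodeRef.name, and the log pattern assumes the message format shown in the output above.

```shell
#!/bin/sh
# Hypothetical filter: reads "NAME PHASE NODE" rows on stdin and prints the
# machine name and node name of every machine in the Deleting phase.
deleting_machines() {
    awk '$2 == "Deleting" { print $1, $3 }'
}

# Hypothetical filter: reads machine-controller log lines on stdin and prints
# each pod whose eviction was refused by a disruption budget, once, as
# "namespace/pod". The final "drain failed" timeout line is not matched.
pdb_blocked_pods() {
    sed -n 's|.*error when evicting pods/"\([^"]*\)" -n "\([^"]*\)".*disruption budget.*|\2/\1|p' \
        | sort -u
}

# Hypothetical wrappers: feed live cluster data into the filters above.
list_deleting_machines() {
    oc get machines -n openshift-machine-api --no-headers \
        -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,NODE:.status.nodeRef.name' \
        | deleting_machines
}

list_pdb_blocked_pods() {
    oc logs -n openshift-machine-api deployment/machine-api-controllers \
        -c machine-controller | pdb_blocked_pods
}
```

The pods printed by list_pdb_blocked_pods are the ones to compare against the PDBs in their namespaces.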
    

