Poison Pill Operator not re-creating the stopped nodes

Solution In Progress - Updated -

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • AWS provider
  • Poison Pill (PP) Operator

Issue

  • After the Machine Health Check and the Poison Pill Operator are triggered, the worker nodes remain stopped in NotReady status.

Resolution

The default MachineHealthCheck (MHC) recovery strategy is to destroy the machines and recreate them. This is an "active" mechanism that does not require cooperation from the failed node, so the cluster always regains its full capacity. When a node is stopped or shut down, the default MHC recovery strategy is all that is needed to bring it back. Poison Pill remediation does not work in this case because the node is not responding. Therefore, remove the "remediation" spec that points to the Poison Pill remediation template from the MHC and set "AllowedRemediation". After that, when a node is stopped, the MHC picks up the node and recovers it automatically.
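
The removal step could be sketched as follows. This is a minimal sketch, not a verified procedure: it assumes the "remediation" spec mentioned above is the `spec.remediationTemplate` field of the MachineHealthCheck, and the helper names `remediation_patch` and `remove_pp_remediation` are hypothetical. The "AllowedRemediation" setting is not shown because its value is not specified here.

```shell
# Namespace used by the other commands in this article.
MHC_NAMESPACE="openshift-machine-api"

# Print a JSON patch that removes the remediation template reference.
# Assumption: the field is spec.remediationTemplate; confirm with
# `oc get machinehealthcheck <name> -n openshift-machine-api -o yaml` first.
remediation_patch() {
  printf '%s' '[{"op":"remove","path":"/spec/remediationTemplate"}]'
}

# Apply the patch to the MHC whose name is passed as the first argument.
remove_pp_remediation() {
  oc patch machinehealthcheck "$1" -n "$MHC_NAMESPACE" \
    --type=json -p "$(remediation_patch)"
}
```

To use it, list the health checks with `oc get machinehealthcheck -n openshift-machine-api`, then run `remove_pp_remediation` with the name of the MHC that references the Poison Pill template.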

Root Cause

The behavior described here is unfortunately expected, because Poison Pill is a completely passive mechanism. It can guarantee that the node will enter a safe state (stopped) so that workloads can move elsewhere, but restoring capacity is only best-effort.
When the node is stopped manually, there is no Poison Pill (PP) process running on it to trigger the reboot.

Diagnostic Steps

Stop a worker node from the AWS console, then check the nodes, Poison Pill remediations, machine sets, and machines:

$ oc get nodes
$ oc get ppr -A
$ oc get machineset -n openshift-machine-api 
$ oc get machine -n openshift-machine-api -o wide

After the node is deleted and recreated, it remains in NotReady status.
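
To narrow the node listing down to the affected nodes, a small filter can help. This is an illustrative sketch: the function name `not_ready_nodes` is hypothetical, and it assumes the usual `oc get nodes` column layout in which the second column is STATUS.

```shell
# Print the names of nodes whose STATUS column contains "NotReady".
# Expects `oc get nodes --no-headers` style output on standard input.
not_ready_nodes() {
  awk '$2 ~ /NotReady/ { print $1 }'
}
```

For example: `oc get nodes --no-headers | not_ready_nodes`. Nodes reported as `NotReady,SchedulingDisabled` are also matched.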

