Poison Pill Operator not re-creating the stopped nodes
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 4
- AWS provider
- Poison Pill (PP) Operator
Issue
- After the Machine Health Check and Poison Pill Operator are triggered, the worker nodes remain stopped in NotReady status.
Resolution
The default MachineHealthCheck (MHC) recovery strategy is to destroy the unhealthy machines and recreate them. This is an “active” mechanism that does not require cooperation from the failed node, so the cluster always regains its full capacity. When a node is stopped or shut down, the default MHC recovery strategy is therefore all that is needed to bring it back. Poison Pill remediation will not work in this case because the node is not responding. To resolve the issue, remove the "remediation" spec that points to the Poison Pill remediation template from the MHC and set "AllowedRemediation". When a node is then stopped, the MHC picks up the node and recovers it automatically.
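As a rough sketch of the change described above, the reference to the Poison Pill template can be dropped from the MHC with a JSON patch. This assumes the "remediation" spec mentioned here corresponds to the `spec.remediationTemplate` field of the MachineHealthCheck resource; the MHC name `worker-mhc` is hypothetical and must be replaced with the name used in your cluster. The "AllowedRemediation" setting is not covered by this sketch.

```shell
# Hypothetical MHC name; list the real ones with:
#   oc get machinehealthcheck -n openshift-machine-api
MHC_NAME="${MHC_NAME:-worker-mhc}"

# JSON patch that removes the reference to the Poison Pill remediation template
# (assumed to live under spec.remediationTemplate).
PATCH='[{"op": "remove", "path": "/spec/remediationTemplate"}]'

# Apply the patch so the MHC falls back to its default strategy.
remove_remediation_template() {
  oc patch machinehealthcheck "$MHC_NAME" \
    -n openshift-machine-api \
    --type=json \
    -p "$PATCH"
}
```

Call `remove_remediation_template` after setting `MHC_NAME`. A JSON patch `remove` operation fails if the path does not exist, so an error here may simply mean the MHC has no remediation template configured.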
Root Cause
The behavior described here is unfortunately expected. Poison Pill is a completely passive mechanism: it can guarantee that the node will enter a safe state (stopped) so that workloads can move elsewhere, but restoring capacity is only best-effort.
When the node is stopped manually, there is no Poison Pill (PP) process running there to trigger the reboot.
Diagnostic Steps
Stop a worker node from the AWS console, then check the nodes, Poison Pill remediations, machine sets, and machines:
$ oc get nodes
$ oc get ppr -A
$ oc get machineset -n openshift-machine-api
$ oc get machine -n openshift-machine-api -o wide
After the node is deleted and recreated, it remains in NotReady status.
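To spot the affected nodes quickly, the output of `oc get nodes` can be filtered down to the names of nodes that are not ready. The helper below is an illustrative sketch, not part of any Red Hat tooling; the function name `not_ready_nodes` is made up for this example, and it assumes the default column layout in which the second column is STATUS.

```shell
# Print the names of nodes whose STATUS column starts with "NotReady"
# (this also matches "NotReady,SchedulingDisabled").
# Reads `oc get nodes --no-headers` output from standard input.
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}
```

Typical usage is `oc get nodes --no-headers | not_ready_nodes`. Reading from standard input keeps the filter usable on saved output as well, for example from a must-gather.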
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.