Chapter 4. Using Machine Deletion Remediation

You can use the Machine Deletion Remediation Operator to reprovision unhealthy nodes using the Machine API. You can use the Machine Deletion Remediation Operator in conjunction with the Node Health Check Operator.

4.1. About the Machine Deletion Remediation Operator

The Machine Deletion Remediation (MDR) Operator works with the NodeHealthCheck controller to reprovision unhealthy nodes by using the Machine API. MDR follows the annotation on the node to the associated machine object, confirms that the machine has an owning controller (for example, MachineSetController), and deletes the machine. After the machine CR is deleted, the owning controller creates a replacement.

The prerequisites for MDR include:

  • a Machine API-based cluster that is able to programmatically destroy and create cluster nodes,
  • nodes that are associated with machines, and
  • declaratively managed machines.

You can then modify the NodeHealthCheck CR to use MDR as its remediator. An example MDR template object and NodeHealthCheck configuration are provided in the documentation.

The MDR process works as follows:

  • The Node Health Check Operator detects an unhealthy node and creates an MDR CR.
  • The MDR Operator watches for the MDR CR associated with the unhealthy node and, if the associated machine has an owning controller, deletes the machine.
  • When the node is healthy again, the NodeHealthCheck controller deletes the MDR CR.
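
The lookup that MDR performs before deleting a machine can be approximated from the command line. The following is a minimal sketch, not the Operator's implementation. It assumes that the node carries the machine.openshift.io/machine annotation in <namespace>/<name> form, which is how the Machine API links nodes to machines. The function names are hypothetical.

```shell
# Hypothetical helpers; they approximate the MDR lookup but are not part of the Operator.

# Print the "<namespace>/<name>" machine reference stored in the node annotation.
machine_for_node() {
  oc get node "$1" \
    -o jsonpath='{.metadata.annotations.machine\.openshift\.io/machine}'
}

# Print the kinds of the owner references on the machine, for example "MachineSet".
# Empty output means that the machine has no owner references.
machine_owner_kinds() {
  ref="$1"   # expected form: <namespace>/<name>
  oc get machines.machine.openshift.io "${ref#*/}" -n "${ref%%/*}" \
    -o jsonpath='{.metadata.ownerReferences[*].kind}'
}
```

For example, running machine_owner_kinds "$(machine_for_node <node-name>)" shows whether a replacement can be expected after the machine is deleted.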

4.2. Installing the Machine Deletion Remediation Operator by using the web console

You can use the Red Hat OpenShift web console to install the Machine Deletion Remediation Operator.

Prerequisites

  • Log in as a user with cluster-admin privileges.

Procedure

  1. In the Red Hat OpenShift web console, navigate to Operators → OperatorHub.
  2. Select the Machine Deletion Remediation Operator, or MDR, from the list of available Operators, and then click Install.
  3. Keep the default selection of Installation mode and namespace to ensure that the Operator is installed to the openshift-workload-availability namespace.
  4. Click Install.

Verification

To confirm that the installation is successful:

  1. Navigate to the Operators → Installed Operators page.
  2. Check that the Operator is installed in the openshift-workload-availability namespace and its status is Succeeded.

If the Operator is not installed successfully:

  1. Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.
  2. Navigate to the Workloads → Pods page and check the log of the pod in the openshift-workload-availability project for any reported issues.

4.3. Installing the Machine Deletion Remediation Operator by using the CLI

You can use the OpenShift CLI (oc) to install the Machine Deletion Remediation Operator.

You can install the Machine Deletion Remediation Operator in your own namespace or in the openshift-workload-availability namespace.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create a Namespace custom resource (CR) for the Machine Deletion Remediation Operator:

    1. Define the Namespace CR and save the YAML file, for example, workload-availability-namespace.yaml:

      apiVersion: v1
      kind: Namespace
      metadata:
        name: openshift-workload-availability
    2. To create the Namespace CR, run the following command:

      $ oc create -f workload-availability-namespace.yaml
  2. Create an OperatorGroup CR:

    1. Define the OperatorGroup CR and save the YAML file, for example, workload-availability-operator-group.yaml:

      apiVersion: operators.coreos.com/v1
      kind: OperatorGroup
      metadata:
        name: workload-availability-operator-group
        namespace: openshift-workload-availability
    2. To create the OperatorGroup CR, run the following command:

      $ oc create -f workload-availability-operator-group.yaml
  3. Create a Subscription CR:

    1. Define the Subscription CR and save the YAML file, for example, machine-deletion-remediation-subscription.yaml:

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
          name: machine-deletion-remediation-operator
          namespace: openshift-workload-availability # (1)
      spec:
          channel: stable
          name: machine-deletion-remediation-operator
          source: redhat-operators
          sourceNamespace: openshift-marketplace
          package: machine-deletion-remediation

      (1) Specify the namespace where you want to install the Machine Deletion Remediation Operator. If you install the Operator in the openshift-workload-availability namespace, the Namespace and OperatorGroup CRs already exist.
    2. To create the Subscription CR, run the following command:

      $ oc create -f machine-deletion-remediation-subscription.yaml

Verification

  1. Verify that the installation succeeded by inspecting the CSV resource:

    $ oc get csv -n openshift-workload-availability

    Example output

    NAME                                  DISPLAY                                 VERSION   REPLACES                              PHASE
    machine-deletion-remediation.v0.3.0   Machine Deletion Remediation Operator   0.3.0     machine-deletion-remediation.v0.2.1   Succeeded
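
If you want to check the phase in a script rather than by reading the table, a sketch along the following lines might help. The function name is hypothetical, and the check is deliberately strict: it passes only when every CSV in the namespace reports Succeeded, which includes any other Operators installed in the same namespace.

```shell
# Hypothetical helper: succeed only when the namespace contains at least one CSV
# and every CSV in it is in the Succeeded phase.
csvs_succeeded() {
  phases="$(oc get csv -n "$1" \
    -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}')" || return 1
  [ -n "$phases" ] || return 1
  # grep -v selects lines other than "Succeeded"; any such line means failure.
  ! printf '%s\n' "$phases" | grep -qvx 'Succeeded'
}
```

For example, csvs_succeeded openshift-workload-availability && echo "installed" prints a message only after the installation has completed.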

4.4. Configuring the Machine Deletion Remediation Operator

You can use the Machine Deletion Remediation Operator, with the Node Health Check Operator, to create the MachineDeletionRemediationTemplate Custom Resource (CR). This CR defines the remediation strategy for the nodes.

The MachineDeletionRemediationTemplate CR resembles the following YAML file:

apiVersion: machine-deletion-remediation.medik8s.io/v1alpha1
kind: MachineDeletionRemediationTemplate
metadata:
  name: machinedeletionremediationtemplate-sample
  namespace: openshift-workload-availability
spec:
  template:
    spec: {}
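
To use this template, a NodeHealthCheck CR has to reference it as its remediator. The following sketch prints such a manifest so that it can be piped to oc. The helper name, the CR name, the minHealthy threshold, the node selector, and the unhealthy conditions are all illustrative assumptions; only the remediationTemplate reference is taken from the sample template above.

```shell
# Hypothetical helper: print a NodeHealthCheck manifest that uses the sample
# MachineDeletionRemediationTemplate as its remediator.
nhc_manifest() {
  cat <<'EOF'
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nodehealthcheck-sample          # assumed name
spec:
  minHealthy: 51%                       # assumed threshold
  selector:                             # assumed selector: worker nodes only
    matchExpressions:
      - key: node-role.kubernetes.io/worker
        operator: Exists
  remediationTemplate:
    apiVersion: machine-deletion-remediation.medik8s.io/v1alpha1
    kind: MachineDeletionRemediationTemplate
    name: machinedeletionremediationtemplate-sample
    namespace: openshift-workload-availability
  unhealthyConditions:                  # assumed conditions and durations
    - type: Ready
      status: "False"
      duration: 300s
    - type: Ready
      status: Unknown
      duration: 300s
EOF
}
```

Running nhc_manifest | oc apply -f - creates the CR. Review the selector and durations against your own cluster before applying it.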

4.5. Troubleshooting the Machine Deletion Remediation Operator

4.5.1. General troubleshooting

Issue
You want to troubleshoot issues with the Machine Deletion Remediation Operator.
Resolution

Check the Operator logs.

$ oc logs <machine-deletion-remediation-controller-manager-name> -c manager -n <namespace-name>
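
If you do not know the name of the controller manager pod, you can list the candidates first. The following sketch filters pod names by the prefix suggested by the placeholder in the command above. The prefix is an assumption, so adjust the pattern if the pods in your cluster are named differently. The function name is hypothetical.

```shell
# Hypothetical helper: list pods in the given namespace whose names contain the
# assumed controller manager prefix. Output has the form "pod/<name>".
mdr_manager_pods() {
  oc get pods -n "$1" -o name |
    grep 'machine-deletion-remediation-controller-manager'
}
```

Each line of output can be passed to oc logs together with -c manager and the same namespace.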

4.5.2. Unsuccessful remediation

Issue
An unhealthy node was not remediated.
Resolution

Verify that the MachineDeletionRemediation CR was created by running the following command:

$ oc get mdr -A

If the NodeHealthCheck controller did not create the MachineDeletionRemediation CR when the node turned unhealthy, check the logs of the NodeHealthCheck controller. Additionally, ensure that the NodeHealthCheck CR includes the required specification to use the remediation template.

If the MachineDeletionRemediation CR was created, ensure that its name matches the unhealthy node object.
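
The name comparison can be scripted. The following sketch lists every MachineDeletionRemediation CR whose name does not match any node in the cluster. It relies on the naming behavior described above, and the function name is hypothetical.

```shell
# Hypothetical helper: print the names of MDR CRs that do not match any node name.
mdr_without_node() {
  nodes="$(oc get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')" || return 1
  oc get mdr -A -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' |
    while IFS= read -r name; do
      [ -n "$name" ] || continue
      # -x matches the whole line and -F treats the name as a literal string.
      printf '%s\n' "$nodes" | grep -qxF -- "$name" || printf '%s\n' "$name"
    done
}
```

Empty output means that every remediation CR corresponds to an existing node.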

4.5.3. Machine Deletion Remediation Operator resources exist even after uninstalling the Operator

Issue
The Machine Deletion Remediation Operator resources, such as the remediation CR and the remediation template CR, exist even after uninstalling the Operator.
Resolution

To remove the Machine Deletion Remediation Operator resources, select the Delete all operand instances for this operator checkbox before uninstalling the Operator. This checkbox is available only in Red Hat OpenShift 4.13 and later. In any version of Red Hat OpenShift, you can instead delete the resources by running the relevant command for each resource type:

$ oc delete mdr <machine-deletion-remediation> -n <namespace>
$ oc delete mdrt <machine-deletion-remediation-template> -n <namespace>

The remediation CR mdr must be created and deleted by the same entity, for example, NHC. If the remediation CR mdr is still present when the MDR Operator is deleted, the CR is deleted together with the Operator.

The remediation template CR mdrt exists only if you use MDR with NHC. When the MDR Operator is deleted by using the web console, the remediation template CR mdrt is also deleted.

4.6. Gathering data about the Machine Deletion Remediation Operator

To collect debugging information about the Machine Deletion Remediation Operator, use the must-gather tool. For information about the must-gather image for the Machine Deletion Remediation Operator, see Gathering data about specific features.

4.7. Additional resources