Replacing nodes

Red Hat OpenShift Data Foundation 4.10

Instructions for how to safely replace a node in an OpenShift Data Foundation cluster.

Red Hat Storage Documentation Team

Abstract

This document explains how to safely replace a node in a Red Hat OpenShift Data Foundation cluster.

Making open source more inclusive

Red Hat is committed to replacing problematic language in our code, documentation, and web properties. We are beginning with these four terms: master, slave, blacklist, and whitelist. Because of the enormity of this endeavor, these changes will be implemented gradually over several upcoming releases. For more details, see our CTO Chris Wright’s message.

Providing feedback on Red Hat documentation

We appreciate your input on our documentation. Do let us know how we can make it better.

To give feedback, create a Bugzilla ticket:

Go to the Bugzilla website.
In the Component section, choose documentation.
Fill in the Description field with your suggestion for improvement. Include a link to the relevant part(s) of documentation.
Click Submit Bug.

Preface

For OpenShift Data Foundation, node replacement can be performed proactively for an operational node and reactively for a failed node for the following deployments:

For Amazon Web Services (AWS)
- User-provisioned infrastructure
- Installer-provisioned infrastructure
For VMware
- User-provisioned infrastructure
- Installer-provisioned infrastructure
For Red Hat Virtualization
- Installer-provisioned infrastructure
For Microsoft Azure
- Installer-provisioned infrastructure
For local storage devices
- Bare metal
- VMware
- Red Hat Virtualization
- IBM Power
For replacing your storage nodes in external mode, see Red Hat Ceph Storage documentation.

Chapter 1. OpenShift Data Foundation deployed using dynamic devices

1.1. OpenShift Data Foundation deployed on AWS

1.1.1. Replacing an operational AWS node on user-provisioned infrastructure

Perform this procedure to replace an operational node on AWS user-provisioned infrastructure.

Prerequisites

Red Hat recommends that replacement nodes are configured with similar infrastructure and resources to the node being replaced.
You must be logged into the OpenShift Container Platform (RHOCP) cluster.

Procedure

Identify the node that needs to be replaced.
Mark the node as unschedulable using the following command:
```
$ oc adm cordon <node_name>
```
Drain the node using the following command:
```
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
```
Important
This activity might take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
Delete the node using the following command:
```
$ oc delete nodes <node_name>
```
Create a new AWS machine instance with the required infrastructure. See Platform requirements.
Create a new OpenShift Container Platform node using the new AWS machine instance.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
```
$ oc get csr
```
Approve all required OpenShift Container Platform CSRs for the new node:
```
$ oc adm certificate approve <Certificate_Name>
```
Click Compute → Nodes and confirm that the new node is in the Ready state.
Apply the OpenShift Data Foundation label to the new node.
From the web user interface
For the new node, click Action Menu (⋮) → Edit Labels
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From the command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
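
The pending-CSR check in this procedure can be scripted. The following is a minimal sketch rather than part of the documented procedure: `pending_csrs` is a hypothetical helper name, and it assumes that `oc get csr` prints a header row with CONDITION as the last column. Review each request before approving it.

```shell
# Print the names of CSRs whose last column (CONDITION) is exactly "Pending".
# Reads the tabular output of `oc get csr` on standard input.
# Assumption: the first line is a header and CONDITION is the final column.
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Example use against a live cluster (not run here):
#   oc get csr | pending_csrs | while read -r csr; do
#     oc adm certificate approve "$csr"
#   done
```

Filtering on the exact value "Pending" skips requests that are already "Approved,Issued", so running the loop again does not touch them.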

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Data Foundation pods are in Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s)
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.
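
The `lsblk` check for the “crypt” keyword can be narrowed to the relevant rows. This is a sketch under stated assumptions: `encrypted_devicesets` is a hypothetical helper, and it assumes the default `lsblk` column order in which TYPE is the sixth column.

```shell
# List ocs-deviceset entries from `lsblk` output (standard input) whose
# TYPE column is "crypt". Leading tree-drawing characters are stripped
# from the device name before it is printed.
# Assumption: default lsblk columns (NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT).
encrypted_devicesets() {
  awk '$1 ~ /ocs-deviceset/ && $6 == "crypt" {
    sub(/^[^A-Za-z0-9]*/, "", $1)
    print $1
  }'
}

# Example use inside the chroot environment (not run here):
#   lsblk | encrypted_devicesets
```

Empty output on a node that hosts new OSDs suggests that the devices are not encrypted, which is worth raising with Red Hat Support when cluster-wide encryption is expected.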

1.1.2. Replacing an operational AWS node on installer-provisioned infrastructure

Use this procedure to replace an operational node on AWS installer-provisioned infrastructure (IPI).

Procedure

Log in to OpenShift Web Console and click Compute → Nodes.
Identify the node that needs to be replaced. Take a note of its Machine Name.
Mark the node as unschedulable using the following command:
```
$ oc adm cordon <node_name>
```
Drain the node using the following command:
```
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
```
Important
This activity might take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
Click Compute → Machines. Search for the required machine.
Beside the required machine, click the Action menu (⋮) → Delete Machine.
Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into the Running state.
Important
This activity might take 5-10 minutes or more.
Click Compute → Nodes and confirm that the new node is in the Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From User interface
For the new node, click Action Menu (⋮) → Edit Labels
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From Command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
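
The cordon and drain steps in this procedure must run in order, and the drain should not start if the cordon fails. A minimal wrapper might look like the following. The function name and the `OC` override variable are assumptions of this sketch; the `oc` flags are the ones used in the procedure.

```shell
# OC can be overridden (for example OC="echo oc") to preview the commands
# without contacting a cluster. This variable is an assumption of the sketch.
OC="${OC:-oc}"

# Cordon a node and then drain it, stopping at the first failure.
cordon_and_drain() {
  if [ -z "$1" ]; then
    echo "usage: cordon_and_drain <node_name>" >&2
    return 2
  fi
  $OC adm cordon "$1" &&
    $OC adm drain "$1" --force --delete-emptydir-data=true --ignore-daemonsets
}

# Example use (not run here):
#   cordon_and_drain <node_name>
```

Setting `OC="echo oc"` before the call prints the two commands instead of running them, which is a convenient way to check the node name first.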

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Data Foundation pods are in Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s)
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.
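
The web console check for the CSI plugin pods has a command-line counterpart. The sketch below is not part of the documented procedure: `csi_pods_not_running` is a hypothetical helper, and it assumes that STATUS is the third column of `oc get pods -o wide` output and that the node name appears as its own field on each row.

```shell
# Print the csi-cephfsplugin-* and csi-rbdplugin-* pods that are scheduled
# on the given node and are not in Running state.
# Reads `oc get pods -o wide -n openshift-storage` output on standard input.
csi_pods_not_running() {
  awk -v node="$1" '
    $1 ~ /^csi-(cephfsplugin|rbdplugin)-/ && $3 != "Running" {
      # The NODE column position can shift when RESTARTS contains spaces,
      # so scan the remaining fields for an exact node-name match.
      for (i = 4; i <= NF; i++) if ($i == node) { print $1; break }
    }'
}

# Example use (not run here):
#   oc get pods -o wide -n openshift-storage | csi_pods_not_running <new_node_name>
```

Empty output means that no CSI plugin pod on that node is in a state other than Running. It does not prove that the pods exist, so a quick look at the full pod list is still useful.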

1.1.3. Replacing a failed AWS node on user-provisioned infrastructure

Perform this procedure to replace a failed node which is not operational on AWS user-provisioned infrastructure (UPI) for OpenShift Data Foundation.

Prerequisites

Red Hat recommends that replacement nodes are configured with similar infrastructure and resources to the node being replaced.
You must be logged into the OpenShift Container Platform (RHOCP) cluster.

Procedure

Identify the AWS machine instance of the node that needs to be replaced.
Log in to AWS and terminate the identified AWS machine instance.
Create a new AWS machine instance with the required infrastructure. See Platform requirements.
Create a new OpenShift Container Platform node using the new AWS machine instance.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
```
$ oc get csr
```
Approve all required OpenShift Container Platform CSRs for the new node:
```
$ oc adm certificate approve <Certificate_Name>
```
Click Compute → Nodes and confirm that the new node is in the Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From User interface
For the new node, click Action Menu (⋮) → Edit Labels
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From Command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
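
The Ready-state check from the web console can also be made from a terminal. The following sketch assumes the default `oc get nodes` layout, with NAME first and STATUS second; `node_is_ready` is a hypothetical helper name.

```shell
# Succeed only if the named node appears in `oc get nodes` output
# (standard input) with "Ready" as the first comma-separated value of its
# STATUS column. "Ready,SchedulingDisabled" therefore counts as Ready,
# while "NotReady" does not.
node_is_ready() {
  awk -v node="$1" '
    $1 == node { split($2, s, ","); if (s[1] == "Ready") found = 1 }
    END { exit(found ? 0 : 1) }'
}

# Example use (not run here):
#   oc get nodes | node_is_ready <new_node_name> && echo "node is Ready"
```

The helper returns a non-zero status for a node that is missing from the output, so a typo in the node name is reported as "not ready" rather than silently passing.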

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Data Foundation pods are in Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s)
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.

1.1.4. Replacing a failed AWS node on installer-provisioned infrastructure

Perform this procedure to replace a failed node which is not operational on AWS installer-provisioned infrastructure (IPI) for OpenShift Data Foundation.

Procedure

Log in to OpenShift Web Console and click Compute → Nodes.
Identify the faulty node and click on its Machine Name.
Click Actions → Edit Annotations, and click Add More.
Add machine.openshift.io/exclude-node-draining and click Save.
Click Actions → Delete Machine, and click Delete.
A new machine is automatically created. Wait for the new machine to start.
Important
This activity might take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
Click Compute → Nodes and confirm that the new node is in the Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From User interface
For the new node, click Action Menu (⋮) → Edit Labels
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From Command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Optional: If the failed AWS instance is not removed automatically, terminate the instance from the AWS console.
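
This procedure annotates and deletes the machine from the web console. A command-line equivalent might look like the sketch below. It is not taken from this guide: it assumes that Machine objects live in the `openshift-machine-api` namespace and that `oc annotate` and `oc delete` are acceptable substitutes for the console actions. The function name and the `OC` override are hypothetical.

```shell
# OC can be overridden (for example OC="echo oc") to preview the commands.
OC="${OC:-oc}"
# Assumption: Machine objects are in this namespace.
MACHINE_NS="openshift-machine-api"

# Add the exclude-node-draining annotation, then delete the machine.
# The delete is skipped if the annotation step fails.
delete_failed_machine() {
  if [ -z "$1" ]; then
    echo "usage: delete_failed_machine <machine_name>" >&2
    return 2
  fi
  $OC annotate machine "$1" machine.openshift.io/exclude-node-draining="" -n "$MACHINE_NS" &&
    $OC delete machine "$1" -n "$MACHINE_NS"
}

# Example use (not run here):
#   delete_failed_machine <machine_name>
```

Annotating before deleting matters for a failed node: without the annotation, the machine deletion can wait on a drain that an unreachable node never completes.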

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Data Foundation pods are in Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s)
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.

1.2. OpenShift Data Foundation deployed on VMware

To replace an operational node, see:
- Section 1.2.1, “Replacing an operational VMware node on user-provisioned infrastructure”
- Section 1.2.2, “Replacing an operational VMware node on installer-provisioned infrastructure”
To replace a failed node, see:
- Section 1.2.3, “Replacing a failed VMware node on user-provisioned infrastructure”
- Section 1.2.4, “Replacing a failed VMware node on installer-provisioned infrastructure”

1.2.1. Replacing an operational VMware node on user-provisioned infrastructure

Perform this procedure to replace an operational node on VMware user-provisioned infrastructure (UPI).

Prerequisites

Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
You must be logged into the OpenShift Container Platform (RHOCP) cluster.

Procedure

Identify the node and its VM that need to be replaced.
Mark the node as unschedulable using the following command:
```
$ oc adm cordon <node_name>
```
Drain the node using the following command:
```
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
```
Important
This activity might take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
Delete the node using the following command:
```
$ oc delete nodes <node_name>
```
Log in to vSphere and terminate the identified VM.
Important
Delete the VM only from the inventory and not from the disk.
Create a new VM on vSphere with the required infrastructure. See Platform requirements.
Create a new OpenShift Container Platform worker node using the new VM.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
```
$ oc get csr
```
Approve all required OpenShift Container Platform CSRs for the new node:
```
$ oc adm certificate approve <Certificate_Name>
```
Click Compute → Nodes and confirm that the new node is in the Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From User interface
For the new node, click Action Menu (⋮) → Edit Labels
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From Command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Data Foundation pods are in Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s)
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.
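
The label check at the start of these verification steps relies on a substring match with `grep`. A slightly stricter variant, shown here only as a sketch, splits the LABELS column and matches the label key exactly. The helper names are hypothetical, and the sketch assumes that LABELS is the last column of `oc get nodes --show-labels`.

```shell
# Print the names of nodes whose LABELS column (last field) contains the
# OpenShift Data Foundation storage label.
# Reads `oc get nodes --show-labels` output on standard input.
storage_nodes() {
  awk 'NR > 1 {
    n = split($NF, labels, ",")
    for (i = 1; i <= n; i++)
      if (labels[i] ~ /^cluster\.ocs\.openshift\.io\/openshift-storage=/) {
        print $1
        break
      }
  }'
}

# Succeed if the given node is among the labeled storage nodes.
# -F treats the node name literally, so dots in host names do not act
# as wildcards; -x requires a whole-line match.
has_storage_label() {
  storage_nodes | grep -qxF -- "$1"
}

# Example use (not run here):
#   oc get nodes --show-labels | has_storage_label <new_node_name>
```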

1.2.2. Replacing an operational VMware node on installer-provisioned infrastructure

Use this procedure to replace an operational node on VMware installer-provisioned infrastructure (IPI).

Procedure

Log in to OpenShift Web Console and click Compute → Nodes.
Identify the node that needs to be replaced. Take a note of its Machine Name.
Mark the node as unschedulable using the following command:
```
$ oc adm cordon <node_name>
```
Drain the node using the following command:
```
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
```
Important
This activity might take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
Click Compute → Machines. Search for the required machine.
Beside the required machine, click the Action menu (⋮) → Delete Machine.
Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into the Running state.
Important
This activity might take 5-10 minutes or more.
Click Compute → Nodes and confirm that the new node is in the Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From User interface
For the new node, click Action Menu (⋮) → Edit Labels
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From Command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
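
The Machine Name noted at the start of this procedure can also be read from the node object. The sketch below assumes that the node carries a `machine.openshift.io/machine` annotation of the form `<namespace>/<machine_name>`. That assumption and the helper name are not taken from this guide.

```shell
# Strip the "<namespace>/" prefix from a machine reference such as
# "openshift-machine-api/mycluster-worker-abc12" (a made-up example value).
machine_name_from_ref() {
  printf '%s\n' "${1##*/}"
}

# Example use against a live cluster (not run here):
#   ref="$(oc get node <node_name> \
#     -o jsonpath='{.metadata.annotations.machine\.openshift\.io/machine}')"
#   machine_name_from_ref "$ref"
```

Recording the machine name before the node is drained avoids having to search the Machines list for it later.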

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Data Foundation pods are in Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s)
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.

1.2.3. Replacing a failed VMware node on user-provisioned infrastructure

Perform this procedure to replace a failed node on VMware user-provisioned infrastructure (UPI).

Prerequisites

Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
You must be logged into the OpenShift Container Platform (RHOCP) cluster.

Procedure

Identify the node and its VM that need to be replaced.
Delete the node using the following command:
```
$ oc delete nodes <node_name>
```
Log in to vSphere and terminate the identified VM.
Important
Delete the VM only from the inventory and not from the disk.
Create a new VM on vSphere with the required infrastructure. See Platform requirements.
Create a new OpenShift Container Platform worker node using the new VM.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
```
$ oc get csr
```
Approve all required OpenShift Container Platform CSRs for the new node:
```
$ oc adm certificate approve <Certificate_Name>
```
Click Compute → Nodes and confirm that the new node is in the Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From User interface
For the new node, click Action Menu (⋮) → Edit Labels
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From Command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Data Foundation pods are in Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s)
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.

1.2.4. Replacing a failed VMware node on installer-provisioned infrastructure

Perform this procedure to replace a failed node which is not operational on VMware installer-provisioned infrastructure (IPI) for OpenShift Data Foundation.

Procedure

Log in to OpenShift Web Console and click Compute → Nodes.
Identify the faulty node and click on its Machine Name.
Click Actions → Edit Annotations, and click Add More.
Add machine.openshift.io/exclude-node-draining and click Save.
Click Actions → Delete Machine, and click Delete.
A new machine is automatically created. Wait for the new machine to start.
Important
This activity might take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
Click Compute → Nodes and confirm that the new node is in the Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From User interface
For the new node, click Action Menu (⋮) → Edit Labels
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From Command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Optional: If the failed VM is not removed automatically, terminate the VM from vSphere.

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Data Foundation pods are in Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s)
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.

1.3. OpenShift Data Foundation deployed on Red Hat Virtualization

1.3.1. Replacing an operational Red Hat Virtualization node on installer-provisioned infrastructure

Use this procedure to replace an operational node on Red Hat Virtualization installer-provisioned infrastructure (IPI).

Procedure

Log in to OpenShift Web Console and click Compute → Nodes.
Identify the node that needs to be replaced. Take a note of its Machine Name.
Mark the node as unschedulable using the following command:
```
$ oc adm cordon <node_name>
```
Drain the node using the following command:
```
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
```
Important
This activity might take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
Click Compute → Machines. Search for the required machine.
Beside the required machine, click the Action menu (⋮) → Delete Machine.
Click Delete to confirm the machine deletion. A new machine is automatically created. Wait for the new machine to start and transition into the Running state.
Important
This activity might take 5-10 minutes or more.
Click Compute → Nodes and confirm that the new node is in the Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From User interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From Command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
```
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
```
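
Several steps in this procedure involve waiting 5-10 minutes for a machine or node to change state. A generic polling helper, sketched below, can replace manual refreshing. The helper name, its arguments, and the `machine_running` example are assumptions of this sketch, not commands from this guide.

```shell
# Run a command repeatedly until it succeeds or the timeout elapses.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
# The body runs in a subshell so its variables do not leak to the caller.
wait_until() (
  timeout="$1"
  interval="$2"
  shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      exit 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
)

# Example use (not run here). It assumes that the Machine object reports its
# state in .status.phase and lives in the openshift-machine-api namespace:
#   machine_running() {
#     [ "$(oc get machine "$1" -n openshift-machine-api \
#          -o jsonpath='{.status.phase}')" = "Running" ]
#   }
#   wait_until 900 15 machine_running <machine_name>
```

The timeout counts only the time spent sleeping, so the total wait is slightly longer than the timeout when each check is itself slow.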

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Data Foundation pods are in Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s)
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.

1.3.2. Replacing a failed Red Hat Virtualization node on installer-provisioned infrastructure

Perform this procedure to replace a failed node which is not operational on Red Hat Virtualization installer-provisioned infrastructure (IPI) for OpenShift Data Foundation.

Procedure

Log in to OpenShift Web Console and click Compute → Nodes.
Identify the faulty node. Take a note of its Machine Name.
Log in to Red Hat Virtualization Administration Portal and remove the virtual disks associated with mon and OSDs from the failed Virtual Machine.
This step is required so that the disks are not deleted when the VM instance is deleted as part of the Delete machine step.
Important
Do not select the Remove Permanently option when removing the disk(s).
In the OpenShift Web Console, click Compute → Machines. Search for the required machine.
Click Actions → Edit Annotations, and click Add More.
Add machine.openshift.io/exclude-node-draining and click Save.
Click Actions → Delete Machine, and click Delete.
A new machine is automatically created. Wait for the new machine to start.
Important
This activity might take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
Click Compute → Nodes and confirm that the new node is in the Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From User interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From Command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Optional: If the failed VM is not removed automatically, remove the VM from Red Hat Virtualization Administration Portal.

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Data Foundation pods are in Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s)
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.

1.4. OpenShift Data Foundation deployed on Microsoft Azure

1.4.1. Replacing operational nodes on Azure installer-provisioned infrastructure

Use this procedure to replace an operational node on Azure installer-provisioned infrastructure (IPI).

Procedure

Log in to OpenShift Web Console and click Compute → Nodes.
Identify the node that needs to be replaced. Take a note of its Machine Name.
Mark the node as unschedulable using the following command:
```
$ oc adm cordon <node_name>
```
Drain the node using the following command:
```
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
```
Important
This activity might take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
Click Compute → Machines. Search for the required machine.
Beside the required machine, click the Action menu (⋮) → Delete Machine.
Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into the Running state.
Important
This activity might take 5-10 minutes or more.
Click Compute → Nodes and confirm that the new node is in the Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From User interface
For the new node, click Action Menu (⋮) → Edit Labels
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From Command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Data Foundation pods are in Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s)
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.
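
The OSD check in these verification steps matches any pod whose row contains both the node name and "osd". A narrower filter, sketched here, keeps only OSD daemon pods and prints their status. It assumes that OSD daemon pods are named `rook-ceph-osd-<id>-...` and that STATUS is the third column; `osd_pods_on_node` is a hypothetical helper.

```shell
# Print "<pod_name> <status>" for OSD daemon pods scheduled on the given node.
# Pods without a numeric OSD id after "rook-ceph-osd-" are skipped.
# Reads `oc get pods -o wide -n openshift-storage` output on standard input.
osd_pods_on_node() {
  awk -v node="$1" '
    $1 ~ /^rook-ceph-osd-[0-9]+-/ {
      # Scan the fields after STATUS for an exact node-name match.
      for (i = 4; i <= NF; i++) if ($i == node) { print $1, $3; break }
    }'
}

# Example use (not run here):
#   oc get pods -o wide -n openshift-storage | osd_pods_on_node <new_node_name>
```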

1.4.2. Replacing failed nodes on Azure installer-provisioned infrastructure

Perform this procedure to replace a failed node which is not operational on Azure installer-provisioned infrastructure (IPI) for OpenShift Data Foundation.

Procedure

Log in to OpenShift Web Console and click Compute → Nodes.
Identify the faulty node and click on its Machine Name.
Click Actions → Edit Annotations, and click Add More.
Add machine.openshift.io/exclude-node-draining and click Save.
Click Actions → Delete Machine, and click Delete.
A new machine is automatically created. Wait for the new machine to start.
Important
This activity might take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
Click Compute → Nodes and confirm that the new node is in the Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From User interface
For the new node, click Action Menu (⋮) → Edit Labels
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From Command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Optional: If the failed Azure instance is not removed automatically, terminate the instance from the Azure console.

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Data Foundation pods are in Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | grep -i <new_node_name> | grep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.
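
The encryption check above amounts to reading the TYPE column of the lsblk output for each ocs-deviceset device. As an illustrative sketch only, the hypothetical function `check_osd_encryption` below does this on text read from stdin. It assumes the default lsblk column layout, in which TYPE is the sixth column.

```shell
# Hypothetical helper: report whether each ocs-deviceset device is of type "crypt".
# Reads plain `lsblk` output on stdin; TYPE is assumed to be the sixth column.
check_osd_encryption() {
  awk '/ocs-deviceset/ {
    status = ($6 == "crypt") ? "encrypted" : "NOT encrypted"
    print $1, status
  }'
}

# Example usage inside the chroot environment, after defining the function there:
# lsblk | check_osd_encryption
```

Any device reported as NOT encrypted on a cluster with cluster-wide encryption enabled would be a reason to stop and investigate.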

Chapter 2. OpenShift Data Foundation deployed using local storage devices

2.1. Replacing storage nodes on bare metal infrastructure

To replace an operational node, see Section 2.1.1, “Replacing an operational node on bare metal user-provisioned infrastructure”
To replace a failed node, see Section 2.1.2, “Replacing a failed node on bare metal user-provisioned infrastructure”

2.1.1. Replacing an operational node on bare metal user-provisioned infrastructure

Prerequisites

Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
You must be logged into the OpenShift Container Platform (RHOCP) cluster.

Procedure

Identify the node and get the labels on the node to be replaced.
```
$ oc get nodes --show-labels | grep <node_name>
```
Identify the mon (if any) and OSDs that are running on the node to be replaced.
```
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
```

Scale down the deployments of the pods identified in the previous step.

For example:

$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage

Mark the node as unschedulable.
```
$ oc adm cordon <node_name>
```

Drain the node.

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets

Delete the node.
```
$ oc delete node <node_name>
```
Get a new bare metal machine with required infrastructure. See Installing a cluster on bare metal.
Important
For information about how to replace a master node when you have installed OpenShift Data Foundation on a three-node OpenShift compact bare-metal cluster, see the Backup and Restore guide in the OpenShift Container Platform documentation.
Create a new OpenShift Container Platform node using the new bare metal machine.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
```
$ oc get csr
```
Approve all required OpenShift Container Platform CSRs for the new node:
```
$ oc adm certificate approve <Certificate_Name>
```
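
When several CSRs are listed, it can help to isolate the ones that are still Pending before approving them. The sketch below is an assumption-laden convenience, not part of the documented procedure: `pending_csrs` is a hypothetical function name, and it relies on CONDITION being the last column of the `oc get csr` output.

```shell
# Hypothetical helper: print the names of CSRs whose CONDITION column is Pending.
# Reads `oc get csr --no-headers` output on stdin; CONDITION is the last column.
pending_csrs() {
  awk '$NF == "Pending" { print $1 }'
}

# Example usage (assumes an authenticated oc session):
# oc get csr --no-headers | pending_csrs
```

Review the printed names and confirm that they belong to the new node before passing them to oc adm certificate approve.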
Click Compute → Nodes in OpenShift Web Console and confirm that the new node is in the Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From User interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From Command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

Identify the namespace where OpenShift local storage operator is installed and assign it to local_storage_project variable:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)

For example:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project
openshift-local-storage
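
The grep local filter above matches any namespace whose name contains "local", so it can return more than one line. As a hedged alternative sketch, the hypothetical function `local_storage_namespaces` filters on the CSV name instead and removes duplicates. It assumes that the Local Storage Operator CSV name starts with local-storage-operator, which you should confirm on your cluster.

```shell
# Hypothetical helper: print the namespace(s) that contain a Local Storage Operator CSV.
# Reads `oc get csv --all-namespaces --no-headers` output on stdin.
# Assumption: the CSV name (second column) starts with "local-storage-operator".
local_storage_namespaces() {
  awk '$2 ~ /^local-storage-operator/ { print $1 }' | sort -u
}

# Example usage (assumes an authenticated oc session):
# local_storage_project=$(oc get csv --all-namespaces --no-headers | local_storage_namespaces)
```

If the result contains more than one namespace, set the local_storage_project variable manually to the correct one.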

Add a new worker node to localVolumeDiscovery and localVolumeSet.

Update the localVolumeDiscovery definition to include the new node and remove the failed node.

# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In the above example, server3.example.com was removed and newnode.example.com is the new node.

Determine which localVolumeSet to edit.

# oc get -n $local_storage_project localvolumeset
NAME          AGE
localblock   25h

Update the localVolumeSet definition to include the new node and remove the failed node.

# oc edit -n $local_storage_project localvolumeset localblock
[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In the above example, server3.example.com was removed and newnode.example.com is the new node.

Verify that the new localblock PV is available.

$ oc get pv | grep localblock | grep Available
local-pv-551d950     512Gi    RWO    Delete  Available            localblock     26s

Change to the openshift-storage project.
```
$ oc project openshift-storage
```
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
```
$ oc process -n openshift-storage ocs-osd-removal \
-p FAILED_OSD_IDS=<failed_osd_id> -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
```
<failed_osd_id>
Is the integer in the pod name immediately after the rook-ceph-osd prefix.
You can add comma separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
```
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
```

Ensure that the OSD removal is completed.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0
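
When more than one OSD ID was passed in FAILED_OSD_IDS, each one should have its own "completed removal" line. The following sketch extracts the IDs from the job log so that they can be compared with the IDs that were requested. The function name `removed_osd_ids` is hypothetical, and the pattern assumes the log wording shown in the example output above (matched case-sensitively).

```shell
# Hypothetical helper: print the ID of every OSD that the removal job reports as removed.
# Reads the ocs-osd-removal-job log on stdin.
removed_osd_ids() {
  sed -n 's/.*completed removal of OSD \([0-9][0-9]*\).*/\1/p'
}

# Example usage (assumes an authenticated oc session):
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | removed_osd_ids
```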

Important

If the ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging.

For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1

Delete the ocs-osd-removal-job.

# oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*

Verify that all other required OpenShift Data Foundation pods are in Running state.

Ensure that the new incremental mon is created and is in the Running state.

$ oc get pod -n openshift-storage | grep mon

Example output:

rook-ceph-mon-a-cd575c89b-b6k66         2/2     Running   0          38m
rook-ceph-mon-b-6776bc469b-tzzt8        2/2     Running   0          38m
rook-ceph-mon-d-5ff5d488b5-7v8xh        2/2     Running   0          4m8s

OSD and Mon might take several minutes to get to the Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | grep -i <new_node_name> | grep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.

2.1.2. Replacing a failed node on bare metal user-provisioned infrastructure

Prerequisites

Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
You must be logged into the OpenShift Container Platform (RHOCP) cluster.

Procedure

Identify the node and get the labels on the node to be replaced.
```
$ oc get nodes --show-labels | grep <node_name>
```
Identify the mon (if any) and OSDs that are running on the node to be replaced.
```
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
```
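
The OSD IDs in this listing are needed later in the procedure for the FAILED_OSD_IDS parameter, so it can be convenient to record them now, while the pods are still listed against the node. The sketch below is illustrative only: `osd_ids_on_node` is a hypothetical function name, and it assumes OSD pod names of the form rook-ceph-osd-<id>-<suffix>.

```shell
# Hypothetical helper: build a comma-separated list of OSD IDs for pods on a node.
# Reads `oc get pods -n openshift-storage -o wide --no-headers` output on stdin.
osd_ids_on_node() {
  node="$1"
  awk -v node="$node" '
    # Only names with a numeric part after "rook-ceph-osd-" are counted, which
    # skips rook-ceph-osd-prepare-* pods
    index($0, node) && $1 ~ /^rook-ceph-osd-[0-9]+-/ {
      split($1, parts, "-")
      ids = (ids == "" ? "" : ids ",") parts[4]
    }
    END { if (ids != "") print ids }'
}

# Example usage (assumes an authenticated oc session):
# oc get pods -n openshift-storage -o wide --no-headers | osd_ids_on_node <node_name>
```

The output, for example 0,2, has the same comma-separated form that FAILED_OSD_IDS accepts. Verify the IDs against the pod listing before using them, because OSD removal is destructive.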

Scale down the deployments of the pods identified in the previous step.

For example:

$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage

Mark the node as unschedulable.
```
$ oc adm cordon <node_name>
```

Remove the pods which are in Terminating state.

$ oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'
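
The one-line command above force-deletes every Terminating pod that it finds on the node, without confirmation. To preview what it would do, a dry-run variant can print the delete commands instead of running them. This is a sketch under the same column assumptions as the original command (namespace, pod name, and status in columns 1, 2, and 4), and `terminating_pod_delete_cmds` is a hypothetical name.

```shell
# Hypothetical dry-run helper: print, rather than execute, the force-delete
# command for every Terminating pod on a node.
# Reads `oc get pods -A -o wide --no-headers` output on stdin.
terminating_pod_delete_cmds() {
  node="$1"
  awk -v node="$node" '
    # With -A, column 1 is the namespace, column 2 the pod name, column 4 the status
    index($0, node) && $4 == "Terminating" {
      print "oc -n " $1 " delete pods " $2 " --grace-period=0 --force"
    }'
}

# Example usage (assumes an authenticated oc session):
# oc get pods -A -o wide --no-headers | terminating_pod_delete_cmds <node_name>
```

After reviewing the printed commands, you can run the original command from this step to perform the deletion.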

Drain the node.

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets

Delete the node.
```
$ oc delete node <node_name>
```
Get a new bare metal machine with required infrastructure. See Installing a cluster on bare metal.
Important
For information about how to replace a master node when you have installed OpenShift Data Foundation on a three-node OpenShift compact bare-metal cluster, see the Backup and Restore guide in the OpenShift Container Platform documentation.
Create a new OpenShift Container Platform node using the new bare metal machine.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
```
$ oc get csr
```
Approve all required OpenShift Container Platform CSRs for the new node:
```
$ oc adm certificate approve <Certificate_Name>
```
Click Compute → Nodes in OpenShift Web Console and confirm that the new node is in the Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From User interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From Command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

Identify the namespace where OpenShift local storage operator is installed and assign it to local_storage_project variable:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)

For example:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project
openshift-local-storage

Add a new worker node to localVolumeDiscovery and localVolumeSet.

Update the localVolumeDiscovery definition to include the new node and remove the failed node.

# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In the above example, server3.example.com was removed and newnode.example.com is the new node.

Determine which localVolumeSet to edit.

# oc get -n $local_storage_project localvolumeset
NAME          AGE
localblock   25h

Update the localVolumeSet definition to include the new node and remove the failed node.

# oc edit -n $local_storage_project localvolumeset localblock
[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In the above example, server3.example.com was removed and newnode.example.com is the new node.

Verify that the new localblock PV is available.

$ oc get pv | grep localblock | grep Available
local-pv-551d950     512Gi    RWO    Delete  Available            localblock     26s

Change to the openshift-storage project.
```
$ oc project openshift-storage
```
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
```
$ oc process -n openshift-storage ocs-osd-removal \
-p FAILED_OSD_IDS=<failed_osd_id> -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
```
<failed_osd_id>
Is the integer in the pod name immediately after the rook-ceph-osd prefix.
You can add comma separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
```
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
```

Ensure that the OSD removal is completed.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0

Important

If the ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging.

For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1

Delete the ocs-osd-removal-job.

# oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*

Verify that all other required OpenShift Data Foundation pods are in Running state.

Ensure that the new incremental mon is created and is in the Running state.

$ oc get pod -n openshift-storage | grep mon

Example output:

rook-ceph-mon-a-cd575c89b-b6k66         2/2     Running   0          38m
rook-ceph-mon-b-6776bc469b-tzzt8        2/2     Running   0          38m
rook-ceph-mon-d-5ff5d488b5-7v8xh        2/2     Running   0          4m8s

OSD and Mon might take several minutes to get to the Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | grep -i <new_node_name> | grep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.

2.2. Replacing storage nodes on IBM Z or LinuxONE infrastructure

You can choose one of the following procedures to replace storage nodes:

Section 2.2.1, “Replacing operational nodes on IBM Z or LinuxONE infrastructure”
Section 2.2.2, “Replacing failed nodes on IBM Z or LinuxONE infrastructure”

2.2.1. Replacing operational nodes on IBM Z or LinuxONE infrastructure

Use this procedure to replace an operational node on IBM Z or LinuxONE infrastructure.

Procedure

Log in to OpenShift Web Console.
Click Compute → Nodes.
Identify the node that needs to be replaced. Note its Machine Name.
Mark the node as unschedulable using the following command:
```
$ oc adm cordon <node_name>
```
Drain the node using the following command:
```
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
```
Important
This activity might take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
Click Compute → Machines. Search for the required machine.
Beside the required machine, click the Action menu (⋮) → Delete Machine.
Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into the Running state.
Important
This activity might take 5-10 minutes or more.
Click Compute → Nodes and confirm that the new node is in the Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From User interface
For the new node, click Action Menu (⋮) → Edit Labels
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
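
The command above lists every labeled storage node, which leaves the comparison with the new node's name to the reader. As a small sketch, the hypothetical function `node_has_storage_label` checks a single named node and returns a shell exit status. It assumes that the label list is the last column of the `oc get nodes --show-labels` output.

```shell
# Hypothetical helper: succeed if the named node carries the storage label.
# Reads `oc get nodes --show-labels --no-headers` output on stdin.
node_has_storage_label() {
  node="$1"
  awk -v node="$node" '
    # The comma-separated label list is assumed to be the last column
    $1 == node && index($NF, "cluster.ocs.openshift.io/openshift-storage=") { found = 1 }
    END { exit (found ? 0 : 1) }'
}

# Example usage (assumes an authenticated oc session):
# oc get nodes --show-labels --no-headers | node_has_storage_label <new_node_name> && echo labeled
```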

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Data Foundation pods are in Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | grep -i <new_node_name> | grep osd

Optional: If data encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.

2.2.2. Replacing failed nodes on IBM Z or LinuxONE infrastructure

Perform this procedure to replace a failed node which is not operational on IBM Z or LinuxONE infrastructure for OpenShift Data Foundation.

Procedure

Log in to OpenShift Web Console and click Compute → Nodes.
Identify the faulty node and click on its Machine Name.
Click Actions → Edit Annotations, and click Add More.
Add machine.openshift.io/exclude-node-draining and click Save.
Click Actions → Delete Machine, and click Delete.
A new machine is automatically created. Wait for the new machine to start.
Important
This activity might take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
Click Compute → Nodes and confirm that the new node is in the Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From the web user interface
For the new node, click Action Menu (⋮) → Edit Labels
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From the command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Data Foundation pods are in Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | grep -i <new_node_name> | grep osd

Optional: If data encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.

2.3. Replacing storage nodes on IBM Power infrastructure

For OpenShift Data Foundation, node replacement can be performed proactively for an operational node and reactively for a failed node for the IBM Power related deployments.

2.3.1. Replacing an operational or failed storage node on IBM Power

Prerequisites

Red Hat recommends that replacement nodes are configured with similar infrastructure and resources to the node being replaced.
You must be logged into the OpenShift Container Platform (RHOCP) cluster.

Procedure

Identify the node and get labels on the node to be replaced.
```
$ oc get nodes --show-labels | grep <node_name>
```
Identify the mon (if any) and object storage device (OSD) pods that are running on the node to be replaced.
```
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
```

Scale down the deployments of the pods identified in the previous step.

For example:

$ oc scale deployment rook-ceph-mon-a --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-1 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage

Mark the node as unschedulable.
```
$ oc adm cordon <node_name>
```

Remove the pods which are in Terminating state.

$ oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'

Drain the node.

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets

Delete the node.
```
$ oc delete node <node_name>
```
Get a new IBM Power machine with required infrastructure. See Installing a cluster on IBM Power.
Create a new OpenShift Container Platform node using the new IBM Power machine.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
```
$ oc get csr
```
Approve all required OpenShift Container Platform CSRs for the new node:
```
$ oc adm certificate approve <Certificate_Name>
```
Click Compute → Nodes in OpenShift Web Console and confirm that the new node is in the Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From User interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From Command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
```
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=''
```

Identify the namespace where OpenShift local storage operator is installed and assign it to local_storage_project variable:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)

For example:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project
openshift-local-storage

Add the new worker node to the localVolume.

Determine which localVolume to edit.

# oc get -n $local_storage_project localvolume
NAME           AGE
localblock    25h

Update the localVolume definition to include the new node and remove the failed node.

# oc edit -n $local_storage_project localvolume localblock
[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            #- worker-0
            - worker-1
            - worker-2
            - worker-3
[...]

Remember to save before exiting the editor.

In the above example, worker-0 was removed and worker-3 is the new node.

Verify that the new localblock PV is available.

$ oc get pv | grep localblock
NAME              CAPACITY   ACCESSMODES RECLAIMPOLICY STATUS     CLAIM             STORAGECLASS                 AGE
local-pv-3e8964d3    500Gi    RWO         Delete       Bound      ocs-deviceset-localblock-2-data-0-mdbg9  localblock     25h
local-pv-414755e0    500Gi    RWO         Delete       Bound      ocs-deviceset-localblock-1-data-0-4cslf  localblock     25h
local-pv-b481410    500Gi     RWO        Delete       Available                                            localblock     3m24s
local-pv-5c9b8982    500Gi    RWO         Delete       Bound      ocs-deviceset-localblock-0-data-0-g2mmc  localblock     25h
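
In a listing like the one above, the new PV is the one whose STATUS is Available. As an illustrative sketch, the hypothetical function `available_pvs` prints only those PVs for a given storage class. It assumes that the CLAIM column is empty for an Available PV, so that the storage class becomes the sixth field of a header-less listing.

```shell
# Hypothetical helper: print the names of Available PVs in a given storage class.
# Reads `oc get pv --no-headers` output on stdin.
available_pvs() {
  sc="$1"
  # Assumption: CLAIM is empty for Available PVs, so STORAGECLASS is field 6
  awk -v sc="$sc" '$5 == "Available" && $6 == sc { print $1 }'
}

# Example usage (assumes an authenticated oc session):
# oc get pv --no-headers | available_pvs localblock
```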

Change to the openshift-storage project.
```
$ oc project openshift-storage
```
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
1. Identify the PVC, because the PV associated with that PVC must be deleted later in this procedure.
```
$ osd_id_to_remove=1
$ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
```
  where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-1.
  Example output:
```
ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc
    ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc
```
  In this example, the PVC name is ocs-deviceset-localblock-0-data-0-g2mmc.
2. Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
```
$ oc process -n openshift-storage ocs-osd-removal \
-p FAILED_OSD_IDS=<failed_osd_id> -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
```
  <failed_osd_id>
  Is the integer in the pod name immediately after the rook-ceph-osd prefix. You can add comma separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
  Warning
  This step results in the OSD being completely removed from the cluster. Ensure that the correct value of osd_id_to_remove is provided.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
```
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
```

Ensure that the OSD removal is completed.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0

Important

If the ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging.

For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1

Delete the PV associated with the failed node.

Identify the PV associated with the PVC.

The PVC name must be identical to the name that is obtained while removing the failed OSD from the cluster.

# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
local-pv-5c9b8982  500Gi  RWO  Delete  Released  openshift-storage/ocs-deviceset-localblock-0-data-0-g2mmc  localblock  24h  worker-0
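
Because the PV to delete must belong to the PVC identified earlier, matching on the exact PVC name is safer than picking any Released PV by eye. The sketch below shows one way to do that. The function name `released_pv_for_pvc` is hypothetical, and it assumes that the CLAIM column has the form <namespace>/<pvc_name> in the fifth-plus-one field of a header-less listing.

```shell
# Hypothetical helper: print the Released PV whose claim matches a given PVC name.
# Reads `oc get pv --no-headers` output on stdin.
released_pv_for_pvc() {
  pvc="$1"
  awk -v pvc="$pvc" '
    $5 == "Released" {
      # CLAIM (field 6) is assumed to have the form <namespace>/<pvc_name>
      n = split($6, claim, "/")
      if (n == 2 && claim[2] == pvc) print $1
    }'
}

# Example usage (assumes an authenticated oc session):
# oc get pv --no-headers | released_pv_for_pvc ocs-deviceset-localblock-0-data-0-g2mmc
```

Empty output means that no Released PV matches that PVC name, in which case nothing should be deleted.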

If there is a PV in Released state, delete it.

# oc delete pv <persistent-volume>

For example:

# oc delete pv local-pv-5c9b8982
persistentvolume "local-pv-5c9b8982" deleted

Identify the crashcollector pod deployment.

$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

If there is an existing crashcollector pod deployment, delete it.

$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

Delete the ocs-osd-removal-job.

# oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*

Verify that all other required OpenShift Data Foundation pods are in Running state.

Ensure that the new incremental mon is created and is in the Running state.

$ oc get pod -n openshift-storage | grep mon

Example output:

rook-ceph-mon-b-74f6dc9dd6-4llzq                                   1/1     Running     0          6h14m
rook-ceph-mon-c-74948755c-h7wtx                                    1/1     Running     0          4h24m
rook-ceph-mon-d-598f69869b-4bv49                                   1/1     Running     0          162m

OSD and Mon might take several minutes to get to the Running state.
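
Because the mon pods can take a while to start, it can be useful to count the Running ones rather than reading the listing each time. The following is a minimal sketch with a hypothetical function name, `running_mon_count`. It assumes mon pod names start with rook-ceph-mon- and that STATUS is the third column; the example output above shows three Running mons.

```shell
# Hypothetical helper: count mon pods that are in the Running state.
# Reads `oc get pod -n openshift-storage --no-headers` output on stdin.
running_mon_count() {
  awk '$1 ~ /^rook-ceph-mon-/ && $3 == "Running" { n++ } END { print n + 0 }'
}

# Example usage (assumes an authenticated oc session):
# oc get pod -n openshift-storage --no-headers | running_mon_count
```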

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | grep -i <new_node_name> | grep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.

2.4. Replacing storage nodes on VMware infrastructure

To replace an operational node, see:
- Section 2.4.1, “Replacing an operational node on VMware user-provisioned infrastructure”
- Section 2.4.2, “Replacing an operational node on VMware installer-provisioned infrastructure”
To replace a failed node, see:
- Section 2.4.3, “Replacing a failed node on VMware user-provisioned infrastructure”
- Section 2.4.4, “Replacing a failed node on VMware installer-provisioned infrastructure”

2.4.1. Replacing an operational node on VMware user-provisioned infrastructure

Prerequisites

Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
You must be logged into the OpenShift Container Platform (RHOCP) cluster.

Procedure

Identify the node to be replaced and get its labels.
```
$ oc get nodes --show-labels | grep <node_name>
```
Identify the mon (if any) and OSDs that are running in the node to be replaced.
```
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
```

Scale down the deployments of the pods identified in the previous step.

For example:

$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage
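The deployment names passed to oc scale can be derived from the pod names listed in the previous step. The following is a minimal sketch, not part of the product: deployments_from_pods is a hypothetical helper name, and it assumes that mon and OSD pod names follow the <deployment>-<hash>-<suffix> pattern seen in the example outputs in this document.

```shell
# Hypothetical helper: reads `oc get pods -n openshift-storage -o wide` output
# on stdin and prints the owning deployment of each mon and OSD pod.
# Assumption: pod names look like <deployment>-<hash>-<suffix>.
deployments_from_pods() {
  awk '{ print $1 }' |
    # Keep only mon-<letter> and osd-<number> pods; osd-prepare job pods are skipped.
    grep -E '^rook-ceph-(mon-[a-z]+|osd-[0-9]+)-' |
    # Strip the trailing replica set hash and pod suffix.
    sed -E 's/-[a-z0-9]+-[a-z0-9]+$//'
}

# Usage (assumes a logged-in oc session):
#   oc get pods -n openshift-storage -o wide | grep -i <node_name> | deployments_from_pods
```

Review the printed names before scaling anything down; the crashcollector deployment is still selected by label as shown above.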

Mark the node as unschedulable.
```
$ oc adm cordon <node_name>
```

Drain the node.

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets

Delete the node.
```
$ oc delete node <node_name>
```
Log in to vSphere and terminate the identified VM.
Create a new VM on VMware with the required infrastructure. See Supported Infrastructure and Platforms.
Create a new OpenShift Container Platform worker node using the new VM.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
```
$ oc get csr
```
Approve all required OpenShift Container Platform CSRs for the new node:
```
$ oc adm certificate approve <Certificate_Name>
```
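When many CSRs are listed, it can help to filter for the ones that are still pending. This is a rough sketch under the assumption that the condition is the last column of the oc get csr output; pending_csrs is a hypothetical helper name.

```shell
# Hypothetical helper: reads `oc get csr` output on stdin and prints the
# names of CSRs whose last column (CONDITION) is exactly "Pending".
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Usage (assumes a logged-in oc session):
#   oc get csr | pending_csrs
```

Each printed name can then be reviewed and passed to the approve command shown above.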
In the OpenShift Web Console, click Compute → Nodes and confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From User interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From Command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

Identify the namespace where OpenShift local storage operator is installed and assign it to local_storage_project variable:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)

For example:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project
openshift-local-storage
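The grep local filter matches any namespace whose name contains "local", so it can return more than one line. A possible alternative, offered only as a sketch, matches on the CSV name instead. It assumes that the Local Storage Operator CSV name begins with local-storage-operator; local_storage_namespace is a hypothetical helper name.

```shell
# Hypothetical helper: reads `oc get csv --all-namespaces` output on stdin and
# prints each namespace that holds a Local Storage Operator CSV, once.
# Assumption: the CSV name (second column) starts with "local-storage-operator".
local_storage_namespace() {
  awk 'NR > 1 && $2 ~ /^local-storage-operator/ { print $1 }' | sort -u
}

# Usage (assumes a logged-in oc session):
#   local_storage_project=$(oc get csv --all-namespaces | local_storage_namespace)
```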

Add a new worker node to localVolumeDiscovery and localVolumeSet.

Update the localVolumeDiscovery definition to include the new node and remove the failed node.

# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In the above example, server3.example.com was removed and newnode.example.com is the new node.

Determine which localVolumeSet to edit.

# oc get -n $local_storage_project localvolumeset
NAME          AGE
localblock   25h

Update the localVolumeSet definition to include the new node and remove the failed node.

# oc edit -n $local_storage_project localvolumeset localblock
[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In the above example, server3.example.com was removed and newnode.example.com is the new node.

Verify that the new localblock PV is available.

$ oc get pv | grep localblock | grep Available
local-pv-551d950     512Gi    RWO    Delete  Available  localblock     26s

Change to the openshift-storage project.
```
$ oc project openshift-storage
```
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
```
$ oc process -n openshift-storage ocs-osd-removal \
-p FAILED_OSD_IDS=<failed_osd_id> -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
```
<failed_osd_id>
Is the integer in the pod name immediately after the rook-ceph-osd prefix.
You can add comma-separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
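The value for FAILED_OSD_IDS can be collected from the pod listing gathered earlier in the procedure. The sketch below is illustrative only: osd_ids_on_node is a hypothetical helper name, and it assumes that OSD pod names have the form rook-ceph-osd-<id>-<hash>-<suffix> and that the node name appears as its own column in the wide output.

```shell
# Hypothetical helper: reads `oc get pods -n openshift-storage -o wide` output
# on stdin and prints the OSD IDs found on the given node as a comma-separated
# list suitable for FAILED_OSD_IDS.
osd_ids_on_node() {
  node="$1"
  awk -v node="$node" '
    # Print the pod name when any later column equals the node name exactly.
    { for (i = 2; i <= NF; i++) if ($i == node) { print $1; next } }
  ' |
    # Keep the integer right after the rook-ceph-osd prefix; osd-prepare pods do not match.
    sed -n 's/^rook-ceph-osd-\([0-9][0-9]*\)-.*/\1/p' |
    paste -sd, -
}

# Usage (assumes a logged-in oc session, run before the node is drained):
#   oc get pods -n openshift-storage -o wide | osd_ids_on_node <node_name>
```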
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
```
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
```

Ensure that the OSD removal is completed.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0
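When several OSD IDs were removed in one job, each ID should have its own completion message. The following sketch checks the job log for every ID in a comma-separated list. It assumes the log line format shown in the example output above; removal_completed is a hypothetical helper name.

```shell
# Hypothetical helper: reads the ocs-osd-removal-job log on stdin and checks
# that a "completed removal of OSD <id>" line exists for every requested ID.
# Usage: removal_completed "0,1,2" < job.log
removal_completed() {
  ids="$1"
  log=$(cat)
  for id in $(printf '%s' "$ids" | tr ',' ' '); do
    # The trailing group stops ID 1 from matching a line about OSD 10.
    if ! printf '%s\n' "$log" | grep -Eiq "completed removal of OSD ${id}([^0-9]|\$)"; then
      echo "no completion message for OSD ${id}"
      return 1
    fi
  done
  return 0
}

# Usage (assumes a logged-in oc session):
#   oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | removal_completed 0,1,2
```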

Important

If the ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging.

For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1

Delete the ocs-osd-removal-job.

# oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*

Verify that all other required OpenShift Data Foundation pods are in Running state.

Ensure that the new incremental mon is created and is in the Running state.

$ oc get pod -n openshift-storage | grep mon

Example output:

rook-ceph-mon-a-cd575c89b-b6k66         2/2     Running     0          38m
rook-ceph-mon-b-6776bc469b-tzzt8        2/2     Running     0          38m
rook-ceph-mon-d-5ff5d488b5-7v8xh        2/2     Running     0          4m8s

OSD and Mon might take several minutes to get to the Running state.
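The mon check can also be scripted. This is a minimal sketch that counts mon pods and their states from the command output. The expected count of three is an assumption based on the example output, and mons_running is a hypothetical helper name.

```shell
# Hypothetical helper: reads `oc get pod -n openshift-storage` output on stdin.
# Succeeds only when the number of mon pods equals the expected count
# (first argument, default 3) and all of them are in Running state.
mons_running() {
  want="${1:-3}"
  awk -v want="$want" '
    $1 ~ /^rook-ceph-mon-[a-z]+-/ {
      total++
      if ($3 == "Running") running++
    }
    END {
      printf "%d of %d mon pods Running\n", running, total
      exit ((running == want && total == want) ? 0 : 1)
    }
  '
}

# Usage (assumes a logged-in oc session):
#   oc get pod -n openshift-storage | mons_running
```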

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
```
$ lsblk
```
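The visual check of the lsblk output can be approximated with a small filter. The sketch below assumes the default lsblk column order, in which TYPE is the sixth field, and treats the absence of any ocs-deviceset entry as a failure. check_osd_encryption is a hypothetical helper name.

```shell
# Hypothetical helper: reads `lsblk` output on stdin and reports every
# ocs-deviceset device whose TYPE column (field 6 by default) is not "crypt".
# Exit status is 0 only when at least one ocs-deviceset device was seen and
# all of them are of type crypt.
check_osd_encryption() {
  awk '
    /ocs-deviceset/ {
      seen++
      if ($6 != "crypt") { print "not encrypted: " $1; bad++ }
    }
    END { exit ((seen > 0 && bad == 0) ? 0 : 1) }
  '
}

# Usage (inside the chroot environment on the node):
#   lsblk | check_osd_encryption && echo "all OSD devices are encrypted"
```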
If verification steps fail, contact Red Hat Support.

2.4.2. Replacing an operational node on VMware installer-provisioned infrastructure

Prerequisites

Replacement nodes must be configured with similar infrastructure, resources, and disks to the node being replaced.
You must be logged into the OpenShift Container Platform (RHOCP) cluster.

Procedure

Log in to OpenShift Web Console and click Compute → Nodes.
Identify the node that needs to be replaced. Take a note of its Machine Name.

Get labels on the node to be replaced.

$ oc get nodes --show-labels | grep <node_name>

Identify the mon (if any) and OSDs that are running in the node to be replaced.
```
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
```

Scale down the deployments of the pods identified in the previous step.

For example:

$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage

Mark the node as unschedulable.
```
$ oc adm cordon <node_name>
```

Drain the node.

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets

Click Compute → Machines. Search for the required machine.
Beside the required machine, click the Action menu (⋮) → Delete Machine.
Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important
This activity might take 5-10 minutes or more.
In the OpenShift Web Console, click Compute → Nodes and confirm that the new node is in Ready state.
Physically add a new device to the node.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From User interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From Command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

Identify the namespace where OpenShift local storage operator is installed and assign it to local_storage_project variable:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)

For example:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project
openshift-local-storage

Add a new worker node to localVolumeDiscovery and localVolumeSet.

Update the localVolumeDiscovery definition to include the new node and remove the failed node.

# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In the above example, server3.example.com was removed and newnode.example.com is the new node.

Determine which localVolumeSet to edit.

# oc get -n $local_storage_project localvolumeset
NAME          AGE
localblock   25h

Update the localVolumeSet definition to include the new node and remove the failed node.

# oc edit -n $local_storage_project localvolumeset localblock
[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In the above example, server3.example.com was removed and newnode.example.com is the new node.

Verify that the new localblock PV is available.

$ oc get pv | grep localblock | grep Available
local-pv-551d950     512Gi    RWO    Delete  Available  localblock     26s

Change to the openshift-storage project.
```
$ oc project openshift-storage
```
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
```
$ oc process -n openshift-storage ocs-osd-removal \
-p FAILED_OSD_IDS=<failed_osd_id> -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
```
<failed_osd_id>
Is the integer in the pod name immediately after the rook-ceph-osd prefix.
You can add comma-separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
```
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
```

Ensure that the OSD removal is completed.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0

Important

If the ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging.

For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1

Identify the PV associated with the PVC.

# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
local-pv-d6bf175b  1490Gi  RWO  Delete  Released  openshift-storage/ocs-deviceset-0-data-0-6c5pw  localblock  2d22h  compute-1

If there is a PV in Released state, delete it.

# oc delete pv <persistent-volume>

For example:

# oc delete pv local-pv-d6bf175b
persistentvolume "local-pv-d6bf175b" deleted
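To see which Released PVs belong to which host before deleting anything, the listing can be reduced to name and hostname pairs. This is a sketch only: released_local_pvs is a hypothetical helper name, and it assumes that the hostname label requested with -L is the last column, as in the example output above.

```shell
# Hypothetical helper: reads `oc get pv -L kubernetes.io/hostname` output on
# stdin and prints "<pv-name> <hostname>" for each localblock PV in Released state.
released_local_pvs() {
  awk '
    {
      released = 0; localblock = 0
      # Column positions shift when CLAIM or REASON is empty, so scan all fields.
      for (i = 2; i <= NF; i++) {
        if ($i == "Released") released = 1
        if ($i == "localblock") localblock = 1
      }
      if (released && localblock) print $1, $NF
    }
  '
}

# Usage (assumes a logged-in oc session):
#   oc get pv -L kubernetes.io/hostname | released_local_pvs
```

Only delete a PV after confirming that its hostname matches the node that was replaced.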

Identify the crashcollector pod deployment.

$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage

If there is an existing crashcollector pod deployment, delete it.

$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage

Delete the ocs-osd-removal-job.

# oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*

Verify that all other required OpenShift Data Foundation pods are in Running state.

Ensure that the new incremental mon is created and is in the Running state.

$ oc get pod -n openshift-storage | grep mon

Example output:

rook-ceph-mon-a-cd575c89b-b6k66         2/2     Running     0          38m
rook-ceph-mon-b-6776bc469b-tzzt8        2/2     Running     0          38m
rook-ceph-mon-d-5ff5d488b5-7v8xh        2/2     Running     0          4m8s

OSD and Mon might take several minutes to get to the Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.

2.4.3. Replacing a failed node on VMware user-provisioned infrastructure

Prerequisites

Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
You must be logged into the OpenShift Container Platform (RHOCP) cluster.

Procedure

Identify the node to be replaced and get its labels.
```
$ oc get nodes --show-labels | grep <node_name>
```
Identify the mon (if any) and OSDs that are running in the node to be replaced.
```
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
```

Scale down the deployments of the pods identified in the previous step.

For example:

$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage

Mark the node as unschedulable.
```
$ oc adm cordon <node_name>
```

Remove the pods which are in Terminating state.

$ oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'
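Because the one-liner above force-deletes pods immediately, a dry run that only prints the delete commands can be useful for review first. The sketch below is one possible variant, not the documented method. Unlike the grep -i filter, it requires an exact match on the node name column. terminating_delete_cmds is a hypothetical helper name.

```shell
# Hypothetical helper: reads `oc get pods -A -o wide` output on stdin and
# prints, without running it, one force-delete command for each pod that is
# in Terminating state on the given node.
terminating_delete_cmds() {
  node="$1"
  awk -v node="$node" '
    $4 == "Terminating" {
      for (i = 5; i <= NF; i++) {
        if ($i == node) {
          print "oc -n " $1 " delete pods " $2 " --grace-period=0 --force"
          next
        }
      }
    }
  '
}

# Usage (assumes a logged-in oc session):
#   oc get pods -A -o wide | terminating_delete_cmds <node_name>
```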

Drain the node.

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets

Delete the node.
```
$ oc delete node <node_name>
```
Log in to vSphere and terminate the identified VM.
Create a new VM on VMware with the required infrastructure. See Supported Infrastructure and Platforms.
Create a new OpenShift Container Platform worker node using the new VM.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
```
$ oc get csr
```
Approve all required OpenShift Container Platform CSRs for the new node:
```
$ oc adm certificate approve <Certificate_Name>
```
In the OpenShift Web Console, click Compute → Nodes and confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From User interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From Command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

Identify the namespace where OpenShift local storage operator is installed and assign it to local_storage_project variable:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)

For example:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project
openshift-local-storage

Add a new worker node to localVolumeDiscovery and localVolumeSet.

Update the localVolumeDiscovery definition to include the new node and remove the failed node.

# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In the above example, server3.example.com was removed and newnode.example.com is the new node.

Determine which localVolumeSet to edit.

# oc get -n $local_storage_project localvolumeset
NAME          AGE
localblock   25h

Update the localVolumeSet definition to include the new node and remove the failed node.

# oc edit -n $local_storage_project localvolumeset localblock
[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In the above example, server3.example.com was removed and newnode.example.com is the new node.

Verify that the new localblock PV is available.

$ oc get pv | grep localblock | grep Available
local-pv-551d950     512Gi    RWO    Delete  Available  localblock     26s

Change to the openshift-storage project.
```
$ oc project openshift-storage
```
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
```
$ oc process -n openshift-storage ocs-osd-removal \
-p FAILED_OSD_IDS=<failed_osd_id> -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
```
<failed_osd_id>
Is the integer in the pod name immediately after the rook-ceph-osd prefix.
You can add comma-separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
```
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
```

Ensure that the OSD removal is completed.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0

Important

If the ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging.

For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1

Delete the ocs-osd-removal-job.

# oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*

Verify that all other required OpenShift Data Foundation pods are in Running state.

Ensure that the new incremental mon is created and is in the Running state.

$ oc get pod -n openshift-storage | grep mon

Example output:

rook-ceph-mon-a-cd575c89b-b6k66         2/2     Running     0          38m
rook-ceph-mon-b-6776bc469b-tzzt8        2/2     Running     0          38m
rook-ceph-mon-d-5ff5d488b5-7v8xh        2/2     Running     0          4m8s

OSD and Mon might take several minutes to get to the Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.

2.4.4. Replacing a failed node on VMware installer-provisioned infrastructure

Prerequisites

Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
You must be logged into the OpenShift Container Platform (RHOCP) cluster.

Procedure

Log in to OpenShift Web Console and click Compute → Nodes.
Identify the node that needs to be replaced. Take a note of its Machine Name.

Get labels on the node to be replaced.

$ oc get nodes --show-labels | grep <node_name>

Identify the mon (if any) and OSDs that are running in the node to be replaced.
```
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
```

Scale down the deployments of the pods identified in the previous step.

For example:

$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage

Mark the node as unschedulable.
```
$ oc adm cordon <node_name>
```

Remove the pods which are in Terminating state.

$ oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'

Drain the node.

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets

Click Compute → Machines. Search for the required machine.
Beside the required machine, click the Action menu (⋮) → Delete Machine.
Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important
This activity might take 5-10 minutes or more.
In the OpenShift Web Console, click Compute → Nodes and confirm that the new node is in Ready state.
Physically add a new device to the node.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From User interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From Command line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

Identify the namespace where OpenShift local storage operator is installed and assign it to local_storage_project variable:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)

For example:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project
openshift-local-storage

Add a new worker node to localVolumeDiscovery and localVolumeSet.

Update the localVolumeDiscovery definition to include the new node and remove the failed node.

# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In the above example, server3.example.com was removed and newnode.example.com is the new node.

Determine which localVolumeSet to edit.

# oc get -n $local_storage_project localvolumeset
NAME          AGE
localblock   25h

Update the localVolumeSet definition to include the new node and remove the failed node.

# oc edit -n $local_storage_project localvolumeset localblock
[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In the above example, server3.example.com was removed and newnode.example.com is the new node.

Verify that the new localblock PV is available.

$ oc get pv | grep localblock | grep Available
local-pv-551d950     512Gi    RWO    Delete  Available  localblock     26s

Change to the openshift-storage project.
```
$ oc project openshift-storage
```
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
```
$ oc process -n openshift-storage ocs-osd-removal \
-p FAILED_OSD_IDS=<failed_osd_id> -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
```
<failed_osd_id>
Is the integer in the pod name immediately after the rook-ceph-osd prefix.
You can add comma-separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
```
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
```

Ensure that the OSD removal is completed.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0

Important

If the ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging.

For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1

Identify the PV associated with the PVC.

# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
local-pv-d6bf175b  1490Gi  RWO  Delete  Released  openshift-storage/ocs-deviceset-0-data-0-6c5pw  localblock  2d22h  compute-1

If there is a PV in Released state, delete it.

# oc delete pv <persistent-volume>

For example:

# oc delete pv local-pv-d6bf175b
persistentvolume "local-pv-d6bf175b" deleted

Identify the crashcollector pod deployment.

$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage

If there is an existing crashcollector pod deployment, delete it.

$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage

Delete the ocs-osd-removal-job.

# oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
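The pipeline above lists every labeled node. To check one specific node, a filter such as the following could be used. It is a sketch that assumes the labels are the last, comma-separated column of the output and that the storage label has an empty value, as applied earlier in this procedure. has_storage_label is a hypothetical helper name.

```shell
# Hypothetical helper: reads `oc get nodes --show-labels` output on stdin and
# succeeds when the named node carries the OpenShift Data Foundation label
# cluster.ocs.openshift.io/openshift-storage with an empty value.
has_storage_label() {
  node="$1"
  awk -v node="$node" '
    $1 == node && $NF ~ /(^|,)cluster\.ocs\.openshift\.io\/openshift-storage=(,|$)/ { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Usage (assumes a logged-in oc session):
#   oc get nodes --show-labels | has_storage_label <new_node_name> && echo "label present"
```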

Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*

Verify that all other required OpenShift Data Foundation pods are in Running state.

Ensure that the new incremental mon is created and is in the Running state.

$ oc get pod -n openshift-storage | grep mon

Example output:

rook-ceph-mon-a-cd575c89b-b6k66         2/2     Running     0          38m
rook-ceph-mon-b-6776bc469b-tzzt8        2/2     Running     0          38m
rook-ceph-mon-d-5ff5d488b5-7v8xh        2/2     Running     0          4m8s

OSD and Mon might take several minutes to get to the Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.

2.5. Replacing storage nodes on Red Hat Virtualization infrastructure

To replace an operational node, see Section 2.5.1, “Replacing an operational node on Red Hat Virtualization installer-provisioned infrastructure”
To replace a failed node, see Section 2.5.2, “Replacing a failed node on Red Hat Virtualization installer-provisioned infrastructure”

2.5.1. Replacing an operational node on Red Hat Virtualization installer-provisioned infrastructure

Use this procedure to replace an operational node on Red Hat Virtualization installer-provisioned infrastructure (IPI).

Prerequisites

Red Hat recommends that replacement nodes are configured with similar infrastructure, resources and disks to the node being replaced.
You must be logged into the OpenShift Container Platform (RHOCP) cluster.

Procedure

Log in to OpenShift Web Console and click Compute → Nodes.
Identify the node that needs to be replaced. Take a note of its Machine Name.

Get labels on the node to be replaced.

$ oc get nodes --show-labels | grep <node_name>

Identify the mon (if any) and OSDs that are running in the node to be replaced.
```
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
```

Scale down the deployments of the pods identified in the previous step.

For example:

$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage

Mark the node as unschedulable.
```
$ oc adm cordon <node_name>
```

Drain the node.

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets

Click Compute → Machines. Search for the required machine.
Beside the required machine, click the Action menu (⋮) → Delete Machine.
Click Delete to confirm the machine deletion. A new machine is automatically created. Wait for the new machine to start and transition into Running state.
Important
This activity might take 5-10 minutes or more.
In the OpenShift Web Console, click Compute → Nodes and confirm that the new node is in Ready state.
Physically add the new device(s) to the node.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From the command-line interface
Run the following command to apply the OpenShift Data Foundation label to the new node:
```
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
```
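
If you prefer to script the wait for the Ready state together with the labeling, the two steps might be combined as in the sketch below. label_when_ready and its 15m default timeout are assumptions made for illustration. The sketch relies only on oc wait and the oc label command shown above.

```shell
# Sketch only: label_when_ready is a hypothetical helper.
# The 15m default timeout is an assumption; adjust it for your environment.
label_when_ready() {
  node=$1
  timeout=${2:-15m}
  # Block until the node reports the Ready condition, then apply the label.
  oc wait --for=condition=Ready "node/${node}" --timeout="${timeout}" &&
    oc label node "${node}" cluster.ocs.openshift.io/openshift-storage=""
}

# Usage: label_when_ready <new_node_name>
```

Because the two commands are chained with &&, the label is not applied if the wait times out.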

Identify the namespace where the OpenShift local storage operator is installed and assign it to the local_storage_project variable:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)

For example:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project
openshift-local-storage
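
Because grep local matches any namespace whose name contains "local", the variable could end up holding more than one value. The following sketch guards against that. single_namespace is a hypothetical helper that reads candidate namespaces on standard input and succeeds only when exactly one distinct name remains.

```shell
# Sketch only: single_namespace is a hypothetical helper.
# Reads candidate namespace names on stdin, one per line.
single_namespace() {
  candidates=$(awk 'NF' | sort -u)                        # drop blank lines and duplicates
  count=$(printf '%s\n' "$candidates" | grep -c .)        # count the remaining names
  if [ "$count" -ne 1 ]; then
    echo "expected 1 namespace, found $count" >&2
    return 1
  fi
  printf '%s\n' "$candidates"
}

# Usage:
# local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local | single_namespace)
```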

Add a new worker node to localVolumeDiscovery and localVolumeSet.

Update the localVolumeDiscovery definition to include the new node and remove the node being replaced.

# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In the above example, server3.example.com was removed and newnode.example.com is the new node.

Determine which localVolumeSet to edit.

# oc get -n $local_storage_project localvolumeset
NAME          AGE
localblock   25h

Update the localVolumeSet definition to include the new node and remove the node being replaced.

# oc edit -n $local_storage_project localvolumeset localblock
[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In the above example, server3.example.com was removed and newnode.example.com is the new node.

Verify that the new localblock PV is available.

$ oc get pv | grep localblock | grep Available
local-pv-551d950     512Gi    RWO    Delete  Available            localblock     26s

Change to the openshift-storage project.
```
$ oc project openshift-storage
```
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
```
$ oc process -n openshift-storage ocs-osd-removal \
-p FAILED_OSD_IDS=<failed_osd_id> -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
```
<failed_osd_id> is the integer in the pod name immediately after the rook-ceph-osd prefix.
To remove more than one OSD, pass comma-separated OSD IDs in the command, for example, FAILED_OSD_IDS=0,1,2.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
```
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
```
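
The <failed_osd_id> values used in the removal command above can be derived from the OSD pod names on the node being replaced. The sketch below is one way to do that. osd_ids_from_pods is a hypothetical helper, and the pod names in the demonstration are made-up samples.

```shell
# Sketch only: osd_ids_from_pods is a hypothetical helper.
# Reads pod names on stdin and prints a comma-separated list of OSD IDs.
# Only names of the form rook-ceph-osd-<integer>-... match, so pods such as
# rook-ceph-osd-prepare-* are skipped.
osd_ids_from_pods() {
  sed -n 's/^rook-ceph-osd-\([0-9][0-9]*\)-.*/\1/p' | sort -n -u | paste -sd, -
}

# Made-up sample pod names, for illustration only.
printf '%s\n' \
  rook-ceph-osd-0-6d77d6c7c6-m8xj6 \
  rook-ceph-osd-prepare-abc-xyz12 \
  rook-ceph-osd-2-7c9c6d8b5f-k2l4q | osd_ids_from_pods
# prints: 0,2
```

The result has the same comma-separated form that FAILED_OSD_IDS accepts.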

Ensure that the OSD removal is completed.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0

Important

If the ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging.

For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1

Identify the PV associated with the PVC.

# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
local-pv-d6bf175b  512Gi  RWO  Delete  Released  openshift-storage/ocs-deviceset-0-data-0-6c5pw  localblock  2d22h  server3.example.com

If there is a PV in Released state, delete it.

# oc delete pv <persistent-volume>

For example:

# oc delete pv local-pv-d6bf175b
persistentvolume "local-pv-d6bf175b" deleted
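
When several PVs are listed, it can help to narrow the output to the Released PVs that belonged to the replaced node before deleting anything. The following sketch assumes the column layout of the example output above, with STATUS in the fifth column and the hostname label in the last column. released_pvs_on_host is a hypothetical helper.

```shell
# Sketch only: released_pvs_on_host is a hypothetical helper.
# Reads `oc get pv -L kubernetes.io/hostname` output on stdin and prints the
# names of Released localblock PVs whose hostname label equals the argument.
# Assumes STATUS is column 5 and the hostname label is the last column.
released_pvs_on_host() {
  awk -v host="$1" '$5 == "Released" && /localblock/ && $NF == host { print $1 }'
}

# Demonstration using the example row from this procedure.
printf '%s\n' \
  'local-pv-d6bf175b  512Gi  RWO  Delete  Released  openshift-storage/ocs-deviceset-0-data-0-6c5pw  localblock  2d22h  server3.example.com' |
  released_pvs_on_host server3.example.com
# prints: local-pv-d6bf175b
```

Review the printed names before passing them to oc delete pv.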

Identify the crashcollector pod deployment.

$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

If there is an existing crashcollector pod deployment, delete it.

$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

Delete the ocs-osd-removal job.

# oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*

Verify that all other required OpenShift Data Foundation pods are in Running state.

Ensure that the new incremental mon is created and is in the Running state.

$ oc get pod -n openshift-storage | grep mon

Example output:

rook-ceph-mon-a-cd575c89b-b6k66         2/2     Running  0  38m
rook-ceph-mon-b-6776bc469b-tzzt8        2/2     Running  0  38m
rook-ceph-mon-d-5ff5d488b5-7v8xh        2/2     Running  0  4m8s

The OSD and mon pods might take several minutes to reach the Running state.

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run lsblk and check for the crypt keyword beside the ocs-deviceset name(s).
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.
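
The visual check of the lsblk output in the encryption step above can also be expressed as a filter. The sketch below is an illustration that assumes the input comes from lsblk -rno NAME,TYPE, which prints one "NAME TYPE" pair per line. encrypted_devicesets is a hypothetical helper, and the device names in the demonstration are made up.

```shell
# Sketch only: encrypted_devicesets is a hypothetical helper.
# Reads "NAME TYPE" pairs on stdin, prints each ocs-deviceset device whose
# TYPE is crypt, and returns non-zero when no such device is found.
encrypted_devicesets() {
  awk '$1 ~ /ocs-deviceset/ && $2 == "crypt" { print $1; found = 1 } END { exit !found }'
}

# Made-up sample input, for illustration only.
printf '%s\n' \
  'sda disk' \
  'ocs-deviceset-0-data-0-6c5pw-block-dmcrypt crypt' | encrypted_devicesets
# prints: ocs-deviceset-0-data-0-6c5pw-block-dmcrypt

# Usage inside the chroot environment:
# lsblk -rno NAME,TYPE | encrypted_devicesets
```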

2.5.2. Replacing a failed node on Red Hat Virtualization installer-provisioned infrastructure

Use this procedure to replace a failed node that is not operational on Red Hat Virtualization installer-provisioned infrastructure (IPI) for OpenShift Data Foundation.

Prerequisites

Red Hat recommends that replacement nodes are configured with similar infrastructure, resources and disks to the node being replaced.
You must be logged in to the OpenShift Container Platform (RHOCP) cluster.

Procedure

Log in to OpenShift Web Console and click Compute → Nodes.
Identify the node that needs to be replaced and note its Machine Name.

Get the labels on the node to be replaced.

$ oc get nodes --show-labels | grep <node_name>

Identify the mon (if any) and OSDs that are running on the node to be replaced.
```
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
```

Scale down the deployments of the pods identified in the previous step.

For example:

$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage

Mark the node as unschedulable.
```
$ oc adm cordon <node_name>
```

Remove the pods that are in the Terminating state.

$ oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'
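
The one-liner above force-deletes pods immediately. To preview what it would delete, the same filter can print the commands instead of running them. The following is a sketch, not a replacement for the documented step. print_force_deletes is a hypothetical helper, and it makes the same assumption as the one-liner that NAMESPACE, NAME and STATUS are columns 1, 2 and 4. The sample row is made up.

```shell
# Sketch only: print_force_deletes is a hypothetical helper.
# Reads `oc get pods -A -o wide` output on stdin and prints, without running,
# one force-delete command per Terminating pod on the given node.
print_force_deletes() {
  grep -i -- "$1" |
    awk '$4 == "Terminating" { print "oc -n " $1 " delete pods " $2 " --grace-period=0 --force" }'
}

# Made-up sample row, for illustration only.
printf '%s\n' \
  'openshift-storage rook-ceph-osd-0-6d77d6c7c6-m8xj6 2/2 Terminating 0 3d 10.1.2.3 server3.example.com <none> <none>' |
  print_force_deletes server3.example.com
# prints: oc -n openshift-storage delete pods rook-ceph-osd-0-6d77d6c7c6-m8xj6 --grace-period=0 --force
```

Usage against a cluster: oc get pods -A -o wide | print_force_deletes <node_name>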

Drain the node.

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets

Click Compute → Machines. Search for the required machine.
Beside the required machine, click the Action menu (⋮) → Delete Machine.
Click Delete to confirm the machine deletion. A new machine is automatically created. Wait for the new machine to start and transition into Running state.
Important
This activity might take 5 to 10 minutes or more.
Click Compute → Nodes in the OpenShift web console. Confirm that the new node is in Ready state.
Physically add the new device(s) to the node.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage and click Save.
From the command-line interface
Run the following command to apply the OpenShift Data Foundation label to the new node:
```
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
```

Identify the namespace where the OpenShift local storage operator is installed and assign it to the local_storage_project variable:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)

For example:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project
openshift-local-storage

Add a new worker node to localVolumeDiscovery and localVolumeSet.

Update the localVolumeDiscovery definition to include the new node and remove the failed node.

# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In the above example, server3.example.com was removed and newnode.example.com is the new node.

Determine which localVolumeSet to edit.

# oc get -n $local_storage_project localvolumeset
NAME          AGE
localblock   25h

Update the localVolumeSet definition to include the new node and remove the failed node.

# oc edit -n $local_storage_project localvolumeset localblock
[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In the above example, server3.example.com was removed and newnode.example.com is the new node.

Verify that the new localblock PV is available.

$ oc get pv | grep localblock | grep Available
local-pv-551d950     512Gi    RWO    Delete  Available            localblock     26s

Change to the openshift-storage project.
```
$ oc project openshift-storage
```
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
```
$ oc process -n openshift-storage ocs-osd-removal \
-p FAILED_OSD_IDS=<failed_osd_id> -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
```
<failed_osd_id> is the integer in the pod name immediately after the rook-ceph-osd prefix.
To remove more than one OSD, pass comma-separated OSD IDs in the command, for example, FAILED_OSD_IDS=0,1,2.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
```
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
```

Ensure that the OSD removal is completed.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0
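
Instead of polling the pod status by hand, the wait for the removal job might be scripted with oc wait, as in the sketch below. wait_for_osd_removal and its 10m default timeout are assumptions made for illustration. If the job fails rather than completes, the command simply times out, so check the pod logs in that case.

```shell
# Sketch only: wait_for_osd_removal is a hypothetical helper.
# The 10m default timeout is an assumption; adjust it for your environment.
wait_for_osd_removal() {
  timeout=${1:-10m}
  # Returns non-zero if the job does not reach the complete condition in time.
  oc wait --for=condition=complete job/ocs-osd-removal-job \
    -n openshift-storage --timeout="${timeout}"
}

# Usage: wait_for_osd_removal 5m
```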

Important

If the ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging.

For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1

Identify the PV associated with the PVC.

# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
local-pv-d6bf175b  512Gi  RWO  Delete  Released  openshift-storage/ocs-deviceset-0-data-0-6c5pw  localblock  2d22h  server3.example.com

If there is a PV in Released state, delete it.

# oc delete pv <persistent-volume>

For example:

# oc delete pv local-pv-d6bf175b
persistentvolume "local-pv-d6bf175b" deleted

Identify the crashcollector pod deployment.

$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

If there is an existing crashcollector pod deployment, delete it.

$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

Delete the ocs-osd-removal job.

# oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Verification steps

Execute the following command and verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*

Verify that all other required OpenShift Data Foundation pods are in Running state.

Ensure that the new incremental mon is created and is in the Running state.

$ oc get pod -n openshift-storage | grep mon

Example output:

rook-ceph-mon-a-cd575c89b-b6k66         2/2     Running  0   38m
rook-ceph-mon-b-6776bc469b-tzzt8        2/2     Running  0   38m
rook-ceph-mon-d-5ff5d488b5-7v8xh        2/2     Running  0   4m8s

The OSD and mon pods might take several minutes to reach the Running state.
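
The mon check above can be reduced to a count of Running mon pods. The following sketch assumes the default oc get pod column layout, with STATUS in the third column. running_mons is a hypothetical helper, and the demonstration uses the example output from this procedure.

```shell
# Sketch only: running_mons is a hypothetical helper.
# Reads `oc get pod` output on stdin and prints how many rook-ceph-mon pods
# report Running in the STATUS column (assumed to be column 3).
running_mons() {
  awk '$1 ~ /^rook-ceph-mon-/ && $3 == "Running" { n++ } END { print n + 0 }'
}

printf '%s\n' \
  'rook-ceph-mon-a-cd575c89b-b6k66         2/2     Running  0   38m' \
  'rook-ceph-mon-b-6776bc469b-tzzt8        2/2     Running  0   38m' \
  'rook-ceph-mon-d-5ff5d488b5-7v8xh        2/2     Running  0   4m8s' | running_mons
# prints: 3
```

Usage against a cluster: oc get pod -n openshift-storage | running_mons. The expected count of three matches the example output here and might differ in your deployment.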

Verify that new OSD pods are running on the replacement node.

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host(s).
```
$ oc debug node/<node_name>
$ chroot /host
```
2. Run lsblk and check for the crypt keyword beside the ocs-deviceset name(s).
```
$ lsblk
```
If verification steps fail, contact Red Hat Support.
