Chapter 2. OpenShift Container Storage deployed using local storage devices
2.1. Replacing storage nodes on bare metal infrastructure
- To replace an operational node, see Section 2.1.1, “Replacing an operational node on bare metal user-provisioned infrastructure”
- To replace a failed node, see Section 2.1.2, “Replacing a failed node on bare metal user-provisioned infrastructure”
2.1.1. Replacing an operational node on bare metal user-provisioned infrastructure
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
- If you upgraded to OpenShift Container Storage 4.7 from a previous version and have not already created a LocalVolumeSet object to enable automatic provisioning of devices, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.
- If you upgraded to OpenShift Container Storage 4.7 from a previous version and have not already created the LocalVolumeDiscovery object, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.
Procedure
Identify the node and get the labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and OSDs that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
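If several deployments need to be scaled down, a short loop can do it in one pass. This is only a sketch: rook-ceph-mon-c and rook-ceph-osd-0 are the illustrative deployment names from the example above, and worker-2.example.com is a hypothetical node name.
$ for deploy in rook-ceph-mon-c rook-ceph-osd-0; do
    oc scale deployment "$deploy" --replicas=0 -n openshift-storage
  done
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=worker-2.example.com --replicas=0 -n openshift-storage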
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Delete the node.
$ oc delete node <node_name>
- Get a new bare metal machine with required infrastructure. See Installing a cluster on bare metal.
- Create a new OpenShift Container Platform node using the new bare metal machine.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
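If there are many pending CSRs, they can be approved in one pass. The following is only a sketch; review the list of pending CSRs before approving them in bulk.
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve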
- Click Compute → Nodes in the OpenShift Web Console and confirm that the new node is in Ready state.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Identify the namespace where the OpenShift local storage operator is installed and assign it to the local_storage_project variable:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
For example:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project
openshift-local-storage
Add a new worker node to localVolumeDiscovery and localVolumeSet.
Update the localVolumeDiscovery definition to include the new node and remove the failed node.
# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
[...]
  nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
              - server1.example.com
              - server2.example.com
              #- server3.example.com
              - newnode.example.com
[...]
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
Determine which localVolumeSet to edit.
# oc get -n $local_storage_project localvolumeset
NAME         AGE
localblock   25h
Update the localVolumeSet definition to include the new node and remove the failed node.
# oc edit -n $local_storage_project localvolumeset localblock
[...]
  nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
              - server1.example.com
              - server2.example.com
              #- server3.example.com
              - newnode.example.com
[...]
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
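If you prefer a non-interactive update, the new node can also be appended with a JSON patch. This is only a sketch: it assumes the nodeSelector structure shown in the example above (a single nodeSelectorTerms entry with a single matchExpressions entry), uses the hypothetical node newnode.example.com, and only adds the new node; removing the failed node still requires an edit or a separate patch.
$ oc patch -n $local_storage_project localvolumeset localblock --type json \
    -p '[{"op": "add", "path": "/spec/nodeSelector/nodeSelectorTerms/0/matchExpressions/0/values/-", "value": "newnode.example.com"}]'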
Verify that the new localblock PV is available.
$ oc get pv | grep localblock | grep Available
local-pv-551d950   512Gi   RWO   Delete   Available   localblock   26s
Change to the openshift-storage project.
$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
$ oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=<failed_osd_id> FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -
<failed_osd_id> is the integer in the pod name immediately after the rook-ceph-osd prefix. You can add comma-separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or in clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
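For example, to remove OSDs 0 and 1 and force removal on a cluster that has only three OSDs (the IDs here are purely illustrative):
$ oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=0,1 FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -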
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Note: If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
Delete the ocs-osd-removal-job.
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
Ensure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-a-cd575c89b-b6k66    2/2   Running   0   38m
rook-ceph-mon-b-6776bc469b-tzzt8   2/2   Running   0   38m
rook-ceph-mon-d-5ff5d488b5-7v8xh   2/2   Running   0   4m8s
OSD and mon pods might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node_name>
$ chroot /host
Run lsblk and check for the crypt keyword beside the ocs-deviceset name(s).
$ lsblk
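To narrow the output down to encrypted devices only, a minimal sketch:
$ lsblk -o NAME,TYPE | grep crypt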
- If verification steps fail, contact Red Hat Support.
2.1.2. Replacing a failed node on bare metal user-provisioned infrastructure
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
- If you upgraded to OpenShift Container Storage 4.7 from a previous version and have not already created a LocalVolumeSet object to enable automatic provisioning of devices, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.
- If you upgraded to OpenShift Container Storage 4.7 from a previous version and have not already created the LocalVolumeDiscovery object, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.
Procedure
Identify the node and get the labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and OSDs that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Remove the pods which are in Terminating state.
$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
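Before force-deleting anything, you may want to list which pods on the node are actually in Terminating state. This is a minimal sketch that only prints the namespace and pod name:
$ oc get pods -A -o wide | grep -i <node_name> | awk '$4 == "Terminating" {print $1 "/" $2}'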
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
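On newer oc clients the --delete-local-data flag has been renamed (to the best of my knowledge, to --delete-emptydir-data); if the drain command above is rejected, the equivalent form is:
$ oc adm drain <node_name> --force --delete-emptydir-data --ignore-daemonsets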
Delete the node.
$ oc delete node <node_name>
- Get a new bare metal machine with required infrastructure. See Installing a cluster on bare metal.
- Create a new OpenShift Container Platform node using the new bare metal machine.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes in the OpenShift Web Console and confirm that the new node is in Ready state.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Identify the namespace where the OpenShift local storage operator is installed and assign it to the local_storage_project variable:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
For example:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project
openshift-local-storage
Add a new worker node to localVolumeDiscovery and localVolumeSet.
Update the localVolumeDiscovery definition to include the new node and remove the failed node.
# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
[...]
  nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
              - server1.example.com
              - server2.example.com
              #- server3.example.com
              - newnode.example.com
[...]
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
Determine which localVolumeSet to edit.
# oc get -n $local_storage_project localvolumeset
NAME         AGE
localblock   25h
Update the localVolumeSet definition to include the new node and remove the failed node.
# oc edit -n $local_storage_project localvolumeset localblock
[...]
  nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
              - server1.example.com
              - server2.example.com
              #- server3.example.com
              - newnode.example.com
[...]
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
Verify that the new localblock PV is available.
$ oc get pv | grep localblock | grep Available
local-pv-551d950   512Gi   RWO   Delete   Available   localblock   26s
Change to the openshift-storage project.
$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
$ oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=<failed_osd_id> FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -
<failed_osd_id> is the integer in the pod name immediately after the rook-ceph-osd prefix. You can add comma-separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or in clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Note: If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
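Rather than polling the job status manually, you can also wait for it to complete; a sketch using the job name shown above:
# oc wait --for=condition=complete job/ocs-osd-removal-job -n openshift-storage --timeout=600s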
Delete the ocs-osd-removal-job.
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
Ensure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-a-cd575c89b-b6k66    2/2   Running   0   38m
rook-ceph-mon-b-6776bc469b-tzzt8   2/2   Running   0   38m
rook-ceph-mon-d-5ff5d488b5-7v8xh   2/2   Running   0   4m8s
OSD and mon pods might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node_name>
$ chroot /host
Run lsblk and check for the crypt keyword beside the ocs-deviceset name(s).
$ lsblk
- If verification steps fail, contact Red Hat Support.
2.2. Replacing storage nodes on IBM Z or LinuxONE infrastructure
You can choose one of the following procedures to replace storage nodes:
2.2.1. Replacing operational nodes on IBM Z or LinuxONE infrastructure
Use this procedure to replace an operational node on IBM Z or LinuxONE infrastructure.
Procedure
- Log in to OpenShift Web Console.
- Click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
Mark the node as unschedulable using the following command:
$ oc adm cordon <node_name>
Drain the node using the following command:
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Important: This activity may take at least 5-10 minutes. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important: This activity may take at least 5-10 minutes.
- Click Compute → Nodes and confirm that the new node is in Ready state.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
(Optional) If data encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
Identify the Persistent Volume Claim bound to a given OSD.
$ oc describe pod/rook-ceph-osd-0-544db49d7f-qrgqm | grep pvc
ceph.rook.io/pvc=ocs-deviceset-thin-0-data-0lg6zp
Identify where the OSD pod runs.
$ oc get -o=custom-columns=NODE:.spec.nodeName pod/rook-ceph-osd-0-544db49d7f-qrgqm
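To list the node for every OSD pod at once rather than one pod at a time, a sketch (this assumes the OSD pods carry the app=rook-ceph-osd label):
$ oc get pods -n openshift-storage -l app=rook-ceph-osd -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName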
Create a debug pod and open a chroot environment for the host.
$ oc debug node/<node_name>
$ chroot /host
Verify the devices are encrypted.
$ dmsetup ls | grep ocs-deviceset
ocs-deviceset-0-data-0-57snx-block-dmcrypt (253:1)
$ lsblk | grep ocs-deviceset
`-ocs-deviceset-0-data-0-57snx-block-dmcrypt 253:1 0 512G 0 crypt
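Another way to confirm that a mapped device is an encryption target is to query it directly from the same chroot environment, assuming the cryptsetup utility is available on the host; the device-mapper name below is taken from the illustrative output above:
$ cryptsetup status ocs-deviceset-0-data-0-57snx-block-dmcrypt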
- If verification steps fail, contact Red Hat Support.
2.2.2. Replacing failed nodes on IBM Z or LinuxONE infrastructure
Perform this procedure to replace a failed node which is not operational on IBM Z or LinuxONE infrastructure for OpenShift Container Storage.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the faulty node and click on its Machine Name.
- Click Actions → Edit Annotations, and click Add More.
- Add machine.openshift.io/exclude-node-draining and click Save.
- Click Actions → Delete Machine, and click Delete.
A new machine is automatically created. Wait for the new machine to start.
Important: This activity may take at least 5-10 minutes. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Click Compute → Nodes and confirm that the new node is in Ready state.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From the web user interface
- For the new node, click Action Menu (⋮) → Edit Labels
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From the command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
(Optional) If data encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
Identify the Persistent Volume Claim bound to a given OSD.
$ oc describe pod/rook-ceph-osd-0-544db49d7f-qrgqm | grep pvc
ceph.rook.io/pvc=ocs-deviceset-thin-0-data-0lg6zp
Identify where the OSD pod runs.
$ oc get -o=custom-columns=NODE:.spec.nodeName pod/rook-ceph-osd-0-544db49d7f-qrgqm
Create a debug pod and open a chroot environment for the host.
$ oc debug node/<node_name>
$ chroot /host
Verify the devices are encrypted.
$ dmsetup ls | grep ocs-deviceset
ocs-deviceset-0-data-0-57snx-block-dmcrypt (253:1)
$ lsblk | grep ocs-deviceset
`-ocs-deviceset-0-data-0-57snx-block-dmcrypt 253:1 0 512G 0 crypt
- If verification steps fail, contact Red Hat Support.
2.3. Replacing storage nodes on Amazon EC2 infrastructure
To replace an operational Amazon EC2 node on user-provisioned and installer-provisioned infrastructures, see:
To replace a failed Amazon EC2 node on user-provisioned and installer-provisioned infrastructures, see:
2.3.1. Replacing an operational Amazon EC2 node on user-provisioned infrastructure
Perform this procedure to replace an operational node on Amazon EC2 I3 user-provisioned infrastructure (UPI).
Replacing storage nodes in Amazon EC2 I3 infrastructure is a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure and resources to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
Procedure
Identify the node and get labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and OSDs that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Delete the node.
$ oc delete node <node_name>
- Create a new Amazon EC2 I3 machine instance with the required infrastructure. See Supported Infrastructure and Platforms.
- Create a new OpenShift Container Platform node using the new Amazon EC2 I3 machine instance.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes in the OpenShift web console. Confirm that the new node is in Ready state.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
- Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Identify the namespace where the OpenShift local storage operator is installed and assign it to the local_storage_project variable:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
For example:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project
openshift-local-storage
Add the local storage devices available in the new worker node to the OpenShift Container Storage StorageCluster.
Add the new disk entries to LocalVolume CR.
Edit the LocalVolume CR. You can either remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}.
$ oc get -n $local_storage_project localvolume
Example output:
NAME          AGE
local-block   25h
$ oc edit -n $local_storage_project localvolume local-block
Example output:
[...]
  storageClassDevices:
    - devicePaths:
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441494EC
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84FE3E9
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE4
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441464EP
        # - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84F43E7
        # - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE8
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
      storageClassName: localblock
      volumeMode: Block
[...]
Make sure to save the changes after editing the CR.
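If you are unsure of the by-id paths of the instance-store devices on the replacement node, one way to look them up is from a node debug shell. This is only a sketch; newnode.example.com is a hypothetical node name:
$ oc debug node/newnode.example.com -- chroot /host ls -l /dev/disk/by-id/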
You can see that in this CR the following two new devices, referenced by their by-id paths, have been added:
- nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
- nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
Display PVs with localblock.
$ oc get pv | grep localblock
Example output:
local-pv-3646185e   2328Gi   RWO   Delete   Available                                               localblock   9s
local-pv-3933e86    2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-1-v9jp4   localblock   5h1m
local-pv-8176b2bf   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock   5h1m
local-pv-ab7cabb3   2328Gi   RWO   Delete   Available                                               localblock   9s
local-pv-ac52e8a    2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-0-knrgr   localblock   5h1m
local-pv-b7e6fd37   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-0-rdm7m   localblock   5h1m
local-pv-cb454338   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-1-h9hfm   localblock   5h1m
local-pv-da5e3175   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-1-g97lq   localblock   5h
...
Delete the storage resources associated with the failed node.
Identify the DeviceSet associated with the OSD to be replaced.
$ osd_id_to_remove=0
$ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.
Example output:
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
Identify the PV associated with the PVC.
$ oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.
Example output:
NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-nvs68   Bound    local-pv-8176b2bf   2328Gi     RWO            localblock     4h49m
In this example, the associated PV is local-pv-8176b2bf.
Change to the openshift-storage project.
$ oc project openshift-storage
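To capture just the name of the bound PV, a minimal sketch using the example PVC above:
$ oc get -n openshift-storage pvc ocs-deviceset-0-0-nvs68 -o jsonpath='{.spec.volumeName}{"\n"}'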
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} | oc create -f -
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Note: If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
Delete the PV which was identified in earlier steps. In this example, the PV name is local-pv-8176b2bf.
$ oc delete pv local-pv-8176b2bf
Example output:
persistentvolume "local-pv-8176b2bf" deleted
Delete the crashcollector pod deployment identified in an earlier step.
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
Delete the ocs-osd-removal-job.
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
Also, ensure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-a-64556f7659-c2ngc   1/1   Running   0   5h1m
rook-ceph-mon-b-7c8b74dc4d-tt6hd   1/1   Running   0   5h1m
rook-ceph-mon-d-57fb8c657-wg5f2    1/1   Running   0   27m
OSDs and mons might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node_name>
$ chroot /host
Run lsblk and check for the crypt keyword beside the ocs-deviceset name(s).
$ lsblk
- If verification steps fail, contact Red Hat Support.
2.3.2. Replacing an operational Amazon EC2 node on installer-provisioned infrastructure
Use this procedure to replace an operational node on Amazon EC2 I3 installer-provisioned infrastructure (IPI).
Replacing storage nodes in Amazon EC2 I3 infrastructure is a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure and resources to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
Get labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and OSDs that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important: This activity may take 5-10 minutes or more.
- Click Compute → Nodes in the OpenShift web console. Confirm that the new node is in Ready state.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
- Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Identify the namespace where the OpenShift local storage operator is installed and assign it to the local_storage_project variable:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
For example:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project
openshift-local-storage
Add the local storage devices available in the new worker node to the OpenShift Container Storage StorageCluster.
Add the new disk entries to LocalVolume CR.
Edit the LocalVolume CR. You can either remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}.
$ oc get -n $local_storage_project localvolume
Example output:
NAME          AGE
local-block   25h
$ oc edit -n $local_storage_project localvolume local-block
Example output:
[...]
  storageClassDevices:
    - devicePaths:
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441494EC
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84FE3E9
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE4
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441464EP
        # - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84F43E7
        # - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE8
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
      storageClassName: localblock
      volumeMode: Block
[...]
Make sure to save the changes after editing the CR.
You can see that in this CR the following two new devices, referenced by their by-id paths, have been added:
- nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
- nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
Display PVs with localblock.
$ oc get pv | grep localblock
Example output:
local-pv-3646185e   2328Gi   RWO   Delete   Available                                               localblock   9s
local-pv-3933e86    2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-1-v9jp4   localblock   5h1m
local-pv-8176b2bf   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock   5h1m
local-pv-ab7cabb3   2328Gi   RWO   Delete   Available                                               localblock   9s
local-pv-ac52e8a    2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-0-knrgr   localblock   5h1m
local-pv-b7e6fd37   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-0-rdm7m   localblock   5h1m
local-pv-cb454338   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-1-h9hfm   localblock   5h1m
local-pv-da5e3175   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-1-g97lq   localblock   5h
...
Delete the storage resources associated with the failed node.
Identify the DeviceSet associated with the OSD to be replaced.
$ osd_id_to_remove=0
$ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.
Example output:
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
Identify the PV associated with the PVC.
$ oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.
Example output:
NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-nvs68   Bound    local-pv-8176b2bf   2328Gi     RWO            localblock     4h49m
In this example, the associated PV is local-pv-8176b2bf.
Change to the openshift-storage project.
$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} | oc create -f -
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Note: If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
Delete the PV which was identified in earlier steps. In this example, the PV name is local-pv-8176b2bf.
$ oc delete pv local-pv-8176b2bf
Example output:
persistentvolume "local-pv-8176b2bf" deleted
Delete the crashcollector pod deployment identified in an earlier step.
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
Delete the rook-ceph-operator pod.
$ oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982
Example output:
pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
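If you would rather not copy the exact pod name, the operator pod can also be deleted by label; a sketch using the app=rook-ceph-operator label referenced in the next step:
$ oc delete -n openshift-storage pod -l app=rook-ceph-operator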
Verify that the rook-ceph-operator pod is restarted.
$ oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s
Creation of the new OSD may take several minutes after the operator starts.
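To watch for the new OSD pod coming up once the operator has restarted, a sketch (assuming the OSD pods carry the app=rook-ceph-osd label):
$ oc get pods -n openshift-storage -l app=rook-ceph-osd -o wide -w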
Delete the ocs-osd-removal-job.
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
Also, ensure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-a-64556f7659-c2ngc   1/1   Running   0   5h1m
rook-ceph-mon-b-7c8b74dc4d-tt6hd   1/1   Running   0   5h1m
rook-ceph-mon-d-57fb8c657-wg5f2    1/1   Running   0   27m
OSDs and mons might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node_name>
$ chroot /host
Run lsblk and check for the crypt keyword beside the ocs-deviceset name(s).
$ lsblk
- If verification steps fail, contact Red Hat Support.
2.3.3. Replacing a failed Amazon EC2 node on user-provisioned infrastructure
The ephemeral storage of Amazon EC2 I3 for OpenShift Container Storage might cause data loss when there is an instance power off. Use this procedure to recover from such an instance power off on Amazon EC2 infrastructure.
Replacing storage nodes in Amazon EC2 I3 infrastructure is a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure and resources to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
Procedure
Identify the node and get labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and OSDs that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Remove the pods which are in Terminating state.
$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Delete the node.
$ oc delete node <node_name>
- Create a new Amazon EC2 I3 machine instance with the required infrastructure. See Supported Infrastructure and Platforms.
- Create a new OpenShift Container Platform node using the new Amazon EC2 I3 machine instance.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes in the OpenShift web console. Confirm that the new node is in Ready state.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
- Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
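For example, with a hypothetical node name, apply the label and then confirm it was set:
$ oc label node newnode.example.com cluster.ocs.openshift.io/openshift-storage=""
$ oc get node newnode.example.com --show-labels | grep cluster.ocs.openshift.io/openshift-storage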
Identify the namespace where the OpenShift local storage operator is installed and assign it to the local_storage_project variable:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
For example:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project
openshift-local-storage
Add the local storage devices available in the new worker node to the OpenShift Container Storage StorageCluster.
Add the new disk entries to LocalVolume CR.
Edit the LocalVolume CR. You can either remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}.
$ oc get -n $local_storage_project localvolume
Example output:
NAME          AGE
local-block   25h
$ oc edit -n $local_storage_project localvolume local-block
Example output:
[...]
  storageClassDevices:
    - devicePaths:
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441494EC
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84FE3E9
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE4
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441464EP
        # - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84F43E7
        # - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE8
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
      storageClassName: localblock
      volumeMode: Block
[...]
Make sure to save the changes after editing the CR.
You can see that in this CR the following two new devices, referenced by their by-id paths, have been added:
- nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
- nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
Display PVs with localblock.
$ oc get pv | grep localblock
Example output:
local-pv-3646185e   2328Gi   RWO   Delete   Available                                               localblock   9s
local-pv-3933e86    2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-1-v9jp4   localblock   5h1m
local-pv-8176b2bf   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock   5h1m
local-pv-ab7cabb3   2328Gi   RWO   Delete   Available                                               localblock   9s
local-pv-ac52e8a    2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-0-knrgr   localblock   5h1m
local-pv-b7e6fd37   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-0-rdm7m   localblock   5h1m
local-pv-cb454338   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-1-h9hfm   localblock   5h1m
local-pv-da5e3175   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-1-g97lq   localblock   5h
...
Delete the storage resources associated with the failed node.
Identify the DeviceSet associated with the OSD to be replaced.
$ osd_id_to_remove=0
$ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.
Example output:
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
Identify the PV associated with the PVC.
$ oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.
Example output:
NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-nvs68   Bound    local-pv-8176b2bf   2328Gi     RWO            localblock     4h49m
In this example, the associated PV is local-pv-8176b2bf.
Change to the openshift-storage project.
$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} | oc create -f -
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Note: If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
Delete the PV which was identified in earlier steps. In this example, the PV name is local-pv-8176b2bf.
$ oc delete pv local-pv-8176b2bf
Example output:
persistentvolume "local-pv-8176b2bf" deleted
Delete the crashcollector pod deployment identified in an earlier step.
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
Delete the ocs-osd-removal-job.
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
Also, ensure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-a-64556f7659-c2ngc   1/1   Running   0   5h1m
rook-ceph-mon-b-7c8b74dc4d-tt6hd   1/1   Running   0   5h1m
rook-ceph-mon-d-57fb8c657-wg5f2    1/1   Running   0   27m
OSDs and mons might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node_name>
$ chroot /host
Run lsblk and check for the crypt keyword beside the ocs-deviceset name(s).
$ lsblk
- If verification steps fail, contact Red Hat Support.
2.3.4. Replacing a failed Amazon EC2 node on installer-provisioned infrastructure
The ephemeral storage of Amazon EC2 I3 for OpenShift Container Storage might cause data loss when there is an instance power off. Use this procedure to recover from such an instance power off on Amazon EC2 infrastructure.
Replacing storage nodes in Amazon EC2 I3 infrastructure is a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure and resources to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
Get the labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and OSDs that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Remove the pods which are in Terminating state.
$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important: This activity may take 5-10 minutes or more.
- Click Compute → Nodes in the OpenShift web console. Confirm that the new node is in Ready state.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
- Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Identify the namespace where the OpenShift local storage operator is installed and assign it to the local_storage_project variable:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
For example:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project
openshift-local-storage
Add the local storage devices available in the new worker node to the OpenShift Container Storage StorageCluster.
Add the new disk entries to LocalVolume CR.
Edit the LocalVolume CR. You can either remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}.
$ oc get -n $local_storage_project localvolume
Example output:
NAME          AGE
local-block   25h
$ oc edit -n $local_storage_project localvolume local-block
Example output:
[...]
  storageClassDevices:
    - devicePaths:
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441494EC
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84FE3E9
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE4
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441464EP
        # - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84F43E7
        # - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE8
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
        - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
      storageClassName: localblock
      volumeMode: Block
[...]
Make sure to save the changes after editing the CR.
You can see that in this CR the following two new devices, referenced by their by-id paths, have been added:
- nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
- nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
Display PVs with localblock.
$ oc get pv | grep localblock
Example output:
local-pv-3646185e   2328Gi   RWO   Delete   Available                                               localblock   9s
local-pv-3933e86    2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-1-v9jp4   localblock   5h1m
local-pv-8176b2bf   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock   5h1m
local-pv-ab7cabb3   2328Gi   RWO   Delete   Available                                               localblock   9s
local-pv-ac52e8a    2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-0-knrgr   localblock   5h1m
local-pv-b7e6fd37   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-0-rdm7m   localblock   5h1m
local-pv-cb454338   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-1-h9hfm   localblock   5h1m
local-pv-da5e3175   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-1-g97lq   localblock   5h
...
Delete the storage resources associated with the failed node.
Identify the DeviceSet associated with the OSD to be replaced.
$ osd_id_to_remove=0
$ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.
Example output:
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
Identify the PV associated with the PVC.
$ oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.
Example output:
NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-nvs68   Bound    local-pv-8176b2bf   2328Gi     RWO            localblock     4h49m
In this example, the associated PV is local-pv-8176b2bf.
Change to the openshift-storage project.
$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} | oc create -f -
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Note: If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
Delete the PV which was identified in earlier steps. In this example, the PV name is local-pv-8176b2bf.
$ oc delete pv local-pv-8176b2bf
Example output:
persistentvolume "local-pv-8176b2bf" deleted
Delete the crashcollector pod deployment identified in an earlier step.
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
Delete the ocs-osd-removal-job.
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
Also, ensure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-a-64556f7659-c2ngc   1/1   Running   0   5h1m
rook-ceph-mon-b-7c8b74dc4d-tt6hd   1/1   Running   0   5h1m
rook-ceph-mon-d-57fb8c657-wg5f2    1/1   Running   0   27m
OSDs and mons might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node_name>
$ chroot /host
Run lsblk and check for the crypt keyword beside the ocs-deviceset name(s).
$ lsblk
- If verification steps fail, contact Red Hat Support.
2.4. Replacing storage nodes on VMware infrastructure
To replace an operational node, see:
To replace a failed node, see:
2.4.1. Replacing an operational node on VMware user-provisioned infrastructure
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
-
If you upgraded to OpenShift Container Storage 4.7 from a previous version and have not already created a
LocalVolumeSet
object to enable automatic provisioning of devices, do so now following the procedure described in Post-update configuration changes for clusters backed by local storage. -
If you upgraded to OpenShift Container Storage 4.7 from a previous version and have not already created the
LocalVolumeDiscovery
object, do so now following the procedure described in Post-update configuration changes for clusters backed by local storage.
Procedure
Identify the NODE and get labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the
mon
(if any) and OSDs that are running in the node to be replaced.$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Delete the node.
$ oc delete node <node_name>
- Log in to vSphere and terminate the identified VM.
- Create a new VM on VMware with the required infrastructure. See Supported Infrastructure and Platforms.
- Create a new OpenShift Container Platform worker node using the new VM.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
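If several CSRs are pending, they can be approved in one pass with a small pipeline such as the following; this is an optional convenience and assumes every pending CSR belongs to the node you just added:
$ oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve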
- Click Compute → Nodes in OpenShift Web Console, confirm if the new node is in Ready state.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
-
Add
cluster.ocs.openshift.io/openshift-storage
and click Save.
- From Command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Identify the namespace where OpenShift local storage operator is installed and assign it to
local_storage_project
variable:$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
For example:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local) echo $local_storage_project openshift-local-storage
Add a new worker node to localVolumeDiscovery and localVolumeSet.
Update the localVolumeDiscovery
definition to include the new node and remove the failed node.# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices [...] nodeSelector: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - server1.example.com - server2.example.com #- server3.example.com - newnode.example.com [...]
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
Determine which localVolumeSet
to edit.# oc get -n $local_storage_project localvolumeset NAME AGE localblock 25h
Update the
localVolumeSet
definition to include the new node and remove the failed node.# oc edit -n $local_storage_project localvolumeset localblock [...] nodeSelector: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - server1.example.com - server2.example.com #- server3.example.com - newnode.example.com [...]
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
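If you prefer not to open an interactive editor, the same change can be made non-interactively with a JSON patch. This is only a sketch that assumes the default object layout shown above (a single nodeSelectorTerms entry with a single matchExpressions entry); adjust the path and hostname for your cluster:
# oc patch -n $local_storage_project localvolumeset localblock --type json -p '[{"op": "add", "path": "/spec/nodeSelector/nodeSelectorTerms/0/matchExpressions/0/values/-", "value": "newnode.example.com"}]'
Remember to remove the failed node from the values list as well, either in the editor or with a corresponding remove operation.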
Verify that the new
localblock
PV is available.$ oc get pv | grep localblock | grep Available local-pv-551d950 512Gi RWO Delete Available localblock 26s
Change to the
openshift-storage
project.$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
$ oc process -n openshift-storage ocs-osd-removal \ -p FAILED_OSD_IDS=<failed_osd_id> FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -
<failed_osd_id> is the integer in the pod name immediately after the rook-ceph-osd prefix. You can add comma separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
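For illustration only, a fully substituted form of the command might look like the following on a three-OSD cluster where OSD 0 is being removed; the OSD ID here is a hypothetical example:
$ oc process -n openshift-storage ocs-osd-removal \ -p FAILED_OSD_IDS=0 FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -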
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Note
If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
Delete the ocs-osd-removal-job.
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in
Running
state:-
csi-cephfsplugin-*
-
csi-rbdplugin-*
-
Verify that all other required OpenShift Container Storage pods are in Running state.
Ensure that the new incremental
mon
is created and is in the Running state.$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-a-cd575c89b-b6k66 2/2 Running 0 38m rook-ceph-mon-b-6776bc469b-tzzt8 2/2 Running 0 38m rook-ceph-mon-d-5ff5d488b5-7v8xh 2/2 Running 0 4m8s
OSDs and mons might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage| egrep -i new-node-name | egrep osd
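Optionally, you can view the placement of every OSD pod at once with a label selector; this assumes the default app=rook-ceph-osd label that Rook applies to OSD pods:
$ oc get pods -n openshift-storage -l app=rook-ceph-osd -o wide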
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node name> $ chroot /host
Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
$ lsblk
- If verification steps fail, contact Red Hat Support.
2.4.2. Replacing an operational node on VMware installer-provisioned infrastructure
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
-
If you upgraded to OpenShift Container Storage 4.7 from a previous version and have not already created a
LocalVolumeSet
object to enable automatic provisioning of devices, do so now following the procedure described in Post-update configuration changes for clusters backed by local storage. -
If you upgraded to OpenShift Container Storage 4.7 from a previous version and have not already created the
LocalVolumeDiscovery
object, do so now following the procedure described in Post-update configuration changes for clusters backed by local storage.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
Get labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the
mon
(if any) and OSDs that are running in the node to be replaced.$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important
This activity may take at least 5-10 minutes or more.
- Click Compute → Nodes in OpenShift Web Console, confirm if the new node is in Ready state.
- Physically add a new device to the node.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
-
Add
cluster.ocs.openshift.io/openshift-storage
and click Save.
- From Command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Identify the namespace where OpenShift local storage operator is installed and assign it to
local_storage_project
variable:$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
For example:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local) echo $local_storage_project openshift-local-storage
Add a new worker node to localVolumeDiscovery and localVolumeSet.
Update the localVolumeDiscovery
definition to include the new node and remove the failed node.# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices [...] nodeSelector: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - server1.example.com - server2.example.com #- server3.example.com - newnode.example.com [...]
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
Determine which localVolumeSet
to edit.# oc get -n $local_storage_project localvolumeset NAME AGE localblock 25h
Update the
localVolumeSet
definition to include the new node and remove the failed node.# oc edit -n $local_storage_project localvolumeset localblock [...] nodeSelector: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - server1.example.com - server2.example.com #- server3.example.com - newnode.example.com [...]
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
Verify that the new
localblock
PV is available.$ oc get pv | grep localblock | grep Available local-pv-551d950 512Gi RWO Delete Available localblock 26s
Change to the
openshift-storage
project.$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
$ oc process -n openshift-storage ocs-osd-removal \ -p FAILED_OSD_IDS=<failed_osd_id> FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -
<failed_osd_id> is the integer in the pod name immediately after the rook-ceph-osd prefix. You can add comma separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Note
If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
Identify the PV associated with the PVC.
# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released local-pv-d6bf175b 1490Gi RWO Delete Released openshift-storage/ocs-deviceset-0-data-0-6c5pw localblock 2d22h compute-1
If there is a PV in
Released
state, delete it.# oc delete pv <persistent-volume>
For example:
# oc delete pv local-pv-d6bf175b persistentvolume "local-pv-d6bf175b" deleted
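If several PVs are left in Released state on the old node, the following optional one-liner removes all Released localblock PVs in one pass; review the list first, because it deletes every PV that matches:
# oc get pv | grep localblock | grep Released | awk '{print $1}' | xargs -r oc delete pv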
Identify the
crashcollector
pod deployment.$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
If there is an existing
crashcollector
pod deployment, delete it.$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
Delete the ocs-osd-removal-job.
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in
Running
state:-
csi-cephfsplugin-*
-
csi-rbdplugin-*
-
Verify that all other required OpenShift Container Storage pods are in Running state.
Ensure that the new incremental
mon
is created and is in the Running state.$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-a-cd575c89b-b6k66 2/2 Running 0 38m rook-ceph-mon-b-6776bc469b-tzzt8 2/2 Running 0 38m rook-ceph-mon-d-5ff5d488b5-7v8xh 2/2 Running 0 4m8s
OSDs and mons might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage| egrep -i new-node-name | egrep osd
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node name> $ chroot /host
Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
$ lsblk
- If verification steps fail, contact Red Hat Support.
2.4.3. Replacing a failed node on VMware user-provisioned infrastructure
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
-
If you upgraded to OpenShift Container Storage 4.7 from a previous version and have not already created a
LocalVolumeSet
object to enable automatic provisioning of devices, do so now following the procedure described in Post-update configuration changes for clusters backed by local storage. -
If you upgraded to OpenShift Container Storage 4.7 from a previous version and have not already created the
LocalVolumeDiscovery
object, do so now following the procedure described in Post-update configuration changes for clusters backed by local storage.
Procedure
Identify the NODE and get labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the
mon
(if any) and OSDs that are running in the node to be replaced.$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Remove the pods which are in Terminating state.
$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
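The one-liner above can be hard to read; the following loop is an equivalent, more explicit sketch (replace <node_name> as before):
$ oc get pods -A -o wide | grep -i <node_name> | awk '$4 == "Terminating" {print $1, $2}' | while read ns pod; do oc -n "$ns" delete pod "$pod" --grace-period=0 --force; done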
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Delete the node.
$ oc delete node <node_name>
- Log in to vSphere and terminate the identified VM.
- Create a new VM on VMware with the required infrastructure. See Supported Infrastructure and Platforms.
- Create a new OpenShift Container Platform worker node using the new VM.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes in OpenShift Web Console, confirm if the new node is in Ready state.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
-
Add
cluster.ocs.openshift.io/openshift-storage
and click Save.
- From Command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Identify the namespace where OpenShift local storage operator is installed and assign it to
local_storage_project
variable:$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
For example:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local) echo $local_storage_project openshift-local-storage
Add a new worker node to localVolumeDiscovery and localVolumeSet.
Update the localVolumeDiscovery
definition to include the new node and remove the failed node.# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices [...] nodeSelector: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - server1.example.com - server2.example.com #- server3.example.com - newnode.example.com [...]
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
Determine which localVolumeSet
to edit.# oc get -n $local_storage_project localvolumeset NAME AGE localblock 25h
Update the
localVolumeSet
definition to include the new node and remove the failed node.# oc edit -n $local_storage_project localvolumeset localblock [...] nodeSelector: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - server1.example.com - server2.example.com #- server3.example.com - newnode.example.com [...]
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
Verify that the new
localblock
PV is available.$ oc get pv | grep localblock | grep Available local-pv-551d950 512Gi RWO Delete Available localblock 26s
Change to the
openshift-storage
project.$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
$ oc process -n openshift-storage ocs-osd-removal \ -p FAILED_OSD_IDS=<failed_osd_id> FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -
<failed_osd_id> is the integer in the pod name immediately after the rook-ceph-osd prefix. You can add comma separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Note
If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
Delete the ocs-osd-removal-job.
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in
Running
state:-
csi-cephfsplugin-*
-
csi-rbdplugin-*
-
Verify that all other required OpenShift Container Storage pods are in Running state.
Ensure that the new incremental
mon
is created and is in the Running state.$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-a-cd575c89b-b6k66 2/2 Running 0 38m rook-ceph-mon-b-6776bc469b-tzzt8 2/2 Running 0 38m rook-ceph-mon-d-5ff5d488b5-7v8xh 2/2 Running 0 4m8s
OSDs and mons might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage| egrep -i new-node-name | egrep osd
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node name> $ chroot /host
Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
$ lsblk
- If verification steps fail, contact Red Hat Support.
2.4.4. Replacing a failed node on VMware installer-provisioned infrastructure
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
-
If you upgraded to OpenShift Container Storage 4.7 from a previous version and have not already created a
LocalVolumeSet
object to enable automatic provisioning of devices, do so now following the procedure described in Post-update configuration changes for clusters backed by local storage. -
If you upgraded to OpenShift Container Storage 4.7 from a previous version and have not already created the
LocalVolumeDiscovery
object, do so now following the procedure described in Post-update configuration changes for clusters backed by local storage.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
Get labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the
mon
(if any) and OSDs that are running in the node to be replaced.$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Remove the pods which are in Terminating state.
$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important
This activity may take at least 5-10 minutes or more.
- Click Compute → Nodes in OpenShift Web Console, confirm if the new node is in Ready state.
- Physically add a new device to the node.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
-
Add
cluster.ocs.openshift.io/openshift-storage
and click Save.
- From Command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Identify the namespace where OpenShift local storage operator is installed and assign it to
local_storage_project
variable:$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
For example:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local) echo $local_storage_project openshift-local-storage
Add a new worker node to localVolumeDiscovery and localVolumeSet.
Update the localVolumeDiscovery
definition to include the new node and remove the failed node.# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices [...] nodeSelector: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - server1.example.com - server2.example.com #- server3.example.com - newnode.example.com [...]
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
Determine which localVolumeSet
to edit.# oc get -n $local_storage_project localvolumeset NAME AGE localblock 25h
Update the
localVolumeSet
definition to include the new node and remove the failed node.# oc edit -n $local_storage_project localvolumeset localblock [...] nodeSelector: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - server1.example.com - server2.example.com #- server3.example.com - newnode.example.com [...]
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
Verify that the new
localblock
PV is available.$ oc get pv | grep localblock | grep Available local-pv-551d950 512Gi RWO Delete Available localblock 26s
Change to the
openshift-storage
project.$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
$ oc process -n openshift-storage ocs-osd-removal \ -p FAILED_OSD_IDS=<failed_osd_id> FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -
<failed_osd_id> is the integer in the pod name immediately after the rook-ceph-osd prefix. You can add comma separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Note
If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
Identify the PV associated with the PVC.
# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released local-pv-d6bf175b 1490Gi RWO Delete Released openshift-storage/ocs-deviceset-0-data-0-6c5pw localblock 2d22h compute-1
If there is a PV in
Released
state, delete it.# oc delete pv <persistent-volume>
For example:
# oc delete pv local-pv-d6bf175b persistentvolume "local-pv-d6bf175b" deleted
Identify the
crashcollector
pod deployment.$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
If there is an existing
crashcollector
pod deployment, delete it.$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
Delete the ocs-osd-removal-job.
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in
Running
state:-
csi-cephfsplugin-*
-
csi-rbdplugin-*
-
Verify that all other required OpenShift Container Storage pods are in Running state.
Ensure that the new incremental
mon
is created and is in the Running state.$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-a-cd575c89b-b6k66 2/2 Running 0 38m rook-ceph-mon-b-6776bc469b-tzzt8 2/2 Running 0 38m rook-ceph-mon-d-5ff5d488b5-7v8xh 2/2 Running 0 4m8s
OSDs and mons might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage| egrep -i new-node-name | egrep osd
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node name> $ chroot /host
Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
$ lsblk
- If verification steps fail, contact Red Hat Support.
2.5. Replacing storage nodes on Red Hat Virtualization infrastructure
- To replace an operational node, see Section 2.5.1, “Replacing an operational node on Red Hat Virtualization installer-provisioned infrastructure”
- To replace a failed node, see Section 2.5.2, “Replacing a failed node on Red Hat Virtualization installer-provisioned infrastructure”
2.5.1. Replacing an operational node on Red Hat Virtualization installer-provisioned infrastructure
Use this procedure to replace an operational node on Red Hat Virtualization installer-provisioned infrastructure (IPI).
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure, resources and disks to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
-
If you upgraded to OpenShift Container Storage 4.7 from a previous version and have not already created a
LocalVolumeSet
object to enable automatic provisioning of devices, you can do it now by following the procedure in Post-update configuration changes for clusters backed by local storage. -
If you upgraded to OpenShift Container Storage 4.7 from a previous version and have not already created the
LocalVolumeDiscovery
object, you can do it now by following the procedure in Post-update configuration changes for clusters backed by local storage.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
Get labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and OSDs that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
Click Delete to confirm the machine deletion. A new machine is automatically created. Wait for the new machine to start and transition into Running state.
Important
This activity may take at least 5-10 minutes or more.
- Click Compute → Nodes in the OpenShift web console. Confirm if the new node is in Ready state.
- Physically add the new device(s) to the node.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
-
Add
cluster.ocs.openshift.io/openshift-storage
and click Save.
- From Command line interface
- Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Identify the namespace where OpenShift local storage operator is installed and assign it to
local_storage_project
variable:$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
For example:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local) echo $local_storage_project openshift-local-storage
Add a new worker node to localVolumeDiscovery and localVolumeSet.
Update the localVolumeDiscovery
definition to include the new node and remove the failed node.# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices [...] nodeSelector: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - server1.example.com - server2.example.com #- server3.example.com - newnode.example.com [...]
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
Determine which localVolumeSet
to edit.# oc get -n $local_storage_project localvolumeset NAME AGE localblock 25h
Update the
localVolumeSet
definition to include the new node and remove the failed node.# oc edit -n $local_storage_project localvolumeset localblock [...] nodeSelector: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - server1.example.com - server2.example.com #- server3.example.com - newnode.example.com [...]
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
Verify that the new
localblock
PV is available.$ oc get pv | grep localblock | grep Available local-pv-551d950 512Gi RWO Delete Available localblock 26s
Change to the
openshift-storage
project.$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
$ oc process -n openshift-storage ocs-osd-removal \ -p FAILED_OSD_IDS=<failed_osd_id> FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -
<failed_osd_id> is the integer in the pod name immediately after the rook-ceph-osd prefix. You can add comma separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Note
If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
Identify the PV associated with the PVC.
# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released local-pv-d6bf175b 512Gi RWO Delete Released openshift-storage/ocs-deviceset-0-data-0-6c5pw localblock 2d22h server3.example.com
If there is a PV in
Released
state, delete it.# oc delete pv <persistent-volume>
For example:
# oc delete pv local-pv-d6bf175b persistentvolume "local-pv-d6bf175b" deleted
Identify the
crashcollector
pod deployment.$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
If there is an existing
crashcollector
pod, delete it.$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
Delete the ocs-osd-removal job.
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in
Running
state:-
csi-cephfsplugin-*
-
csi-rbdplugin-*
-
Verify that all other required OpenShift Container Storage pods are in Running state.
Ensure that the new incremental
mon
is created and is in the Running state.$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-a-cd575c89b-b6k66 2/2 Running 0 38m rook-ceph-mon-b-6776bc469b-tzzt8 2/2 Running 0 38m rook-ceph-mon-d-5ff5d488b5-7v8xh 2/2 Running 0 4m8s
OSDs and mons might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage| egrep -i new-node-name | egrep osd
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node name> $ chroot /host
Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
$ lsblk
- If verification steps fail, contact Red Hat Support.
2.5.2. Replacing a failed node on Red Hat Virtualization installer-provisioned infrastructure
Perform this procedure to replace a failed node which is not operational on Red Hat Virtualization installer-provisioned infrastructure (IPI) for OpenShift Container Storage.
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure, resources and disks to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
-
If you upgraded to OpenShift Container Storage 4.7 from a previous version and have not already created a
LocalVolumeSet
object to enable automatic provisioning of devices, you can do it now by following the procedure in Post-update configuration changes for clusters backed by local storage. -
If you upgraded to OpenShift Container Storage 4.7 from a previous version and have not already created the
LocalVolumeDiscovery
object, you can do it now by following the procedure in Post-update configuration changes for clusters backed by local storage.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
Get the labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and OSDs that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Remove the pods which are in the
Terminating
state.$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
Click Delete to confirm the machine deletion. A new machine is automatically created. Wait for the new machine to start and transition into Running state.
Important
This activity may take at least 5-10 minutes or more.
- Click Compute → Nodes in the OpenShift web console. Confirm if the new node is in Ready state.
- Physically add the new device(s) to the node.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
- Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Identify the namespace where OpenShift local storage operator is installed and assign it to
local_storage_project
variable:$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
For example:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local) echo $local_storage_project openshift-local-storage
Add a new worker node to localVolumeDiscovery and localVolumeSet.
Update the localVolumeDiscovery
definition to include the new node and remove the failed node.# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices [...] nodeSelector: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - server1.example.com - server2.example.com #- server3.example.com - newnode.example.com [...]
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
Determine which localVolumeSet
to edit.# oc get -n $local_storage_project localvolumeset NAME AGE localblock 25h
Update the
localVolumeSet
definition to include the new node and remove the failed node.# oc edit -n $local_storage_project localvolumeset localblock [...] nodeSelector: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - server1.example.com - server2.example.com #- server3.example.com - newnode.example.com [...]
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
Verify that the new
localblock
PV is available.$ oc get pv | grep localblock | grep Available local-pv-551d950 512Gi RWO Delete Available localblock 26s
Change to the
openshift-storage
project.$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
$ oc process -n openshift-storage ocs-osd-removal \ -p FAILED_OSD_IDS=<failed_osd_id> FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -
<failed_osd_id> is the integer in the pod name immediately after the rook-ceph-osd prefix. You can add comma separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Note
If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
Identify the PV associated with the PVC.
# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released local-pv-d6bf175b 512Gi RWO Delete Released openshift-storage/ocs-deviceset-0-data-0-6c5pw localblock 2d22h server3.example.com
If there is a PV in Released state, delete it.
# oc delete pv <persistent-volume>
For example:
# oc delete pv local-pv-d6bf175b persistentvolume "local-pv-d6bf175b" deleted
Identify the
crashcollector
pod deployment.$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
If there is an existing crashcollector pod deployment, delete it.
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
Delete the ocs-osd-removal job.
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in
Running
state:-
csi-cephfsplugin-*
-
csi-rbdplugin-*
-
Verify that all other required OpenShift Container Storage pods are in Running state.
Ensure that the new incremental
mon
is created and is in the Running state.$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-a-cd575c89b-b6k66 2/2 Running 0 38m rook-ceph-mon-b-6776bc469b-tzzt8 2/2 Running 0 38m rook-ceph-mon-d-5ff5d488b5-7v8xh 2/2 Running 0 4m8s
OSDs and mons might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage| egrep -i new-node-name | egrep osd
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node name> $ chroot /host
Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
$ lsblk
- If verification steps fail, contact Red Hat Support.
2.6. Replacing storage nodes on IBM Power Systems infrastructure
For OpenShift Container Storage, node replacement can be performed proactively for an operational node and reactively for a failed node in IBM Power Systems deployments.
2.6.1. Replacing an operational or failed storage node on IBM Power Systems
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure and resources to the node being replaced.
- You must be logged into OpenShift Container Platform (RHOCP) cluster.
Procedure
Identify the node and get labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the
mon
(if any) and OSDs that are running in the node to be replaced.$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-a --replicas=0 -n openshift-storage $ oc scale deployment rook-ceph-osd-1 --replicas=0 -n openshift-storage $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Remove the pods which are in Terminating state.
$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Delete the node.
$ oc delete node <node_name>
- Get a new IBM Power machine with required infrastructure. See Installing a cluster on IBM Power Systems.
- Create a new OpenShift Container Platform node using the new IBM Power Systems machine.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in
Pending
state:$ oc get csr
Approve all required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes in OpenShift Web Console, confirm if the new node is in Ready state.
Apply the OpenShift Container Storage label to the new node using your preferred interface:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
-
Add
cluster.ocs.openshift.io/openshift-storage
and click Save.
- From Command line interface
- Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Identify the namespace where OpenShift local storage operator is installed and assign it to
local_storage_project
variable:$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
For example:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local) echo $local_storage_project openshift-local-storage
Add the newly added worker node to localVolumeSet.
Determine which
localVolumeSet
to edit.# oc get -n $local_storage_project localvolumeset NAME AGE localblock 25h
Update the
localVolumeSet
definition to include the new node and remove the failed node.# oc edit -n $local_storage_project localvolumeset localblock [...] nodeSelector: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: #- worker-0 - worker-1 - worker-2 - worker-3 [...]
Remember to save before exiting the editor.
In the above example, worker-0 was removed and worker-3 is the new node.
Verify that the new
localblock
PV is available.$ oc get pv | grep localblock NAME CAPACITY ACCESSMODES RECLAIMPOLICY STATUS CLAIM STORAGECLASS AGE local-pv-3e8964d3 500Gi RWO Delete Bound ocs-deviceset-localblock-2-data-0-mdbg9 localblock 25h local-pv-414755e0 500Gi RWO Delete Bound ocs-deviceset-localblock-1-data-0-4cslf localblock 25h local-pv-b481410 500Gi RWO Delete Available localblock 3m24s local-pv-5c9b8982 500Gi RWO Delete Bound ocs-deviceset-localblock-0-data-0-g2mmc localblock 25h
Change to the
openshift-storage
project.$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
Identify the PVC, because the PV associated with that specific PVC needs to be deleted later.
$ osd_id_to_remove=1 $ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-1.
Example output:
ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc
In this example, the PVC name is ocs-deviceset-localblock-0-data-0-g2mmc.
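If you want to capture that PVC name for later use (for example, to find the matching PV), an optional helper like the following works with the output shown above; the variable name is arbitrary:
$ pvc_name=$(oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc | head -n 1 | awk '{print $2}')
$ echo ${pvc_name}
ocs-deviceset-localblock-0-data-0-g2mmc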
Remove the failed OSD from the cluster.
$ oc process -n openshift-storage ocs-osd-removal \ -p FAILED_OSD_IDS=<failed_osd_id> FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -
You can remove more than one OSD by adding comma separated OSD IDs in the command. (For example: FAILED_OSD_IDS=0,1,2)
The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
Warning
This step results in the OSD being completely removed from the cluster. Ensure that the correct value of osd_id_to_remove is provided.
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Note
If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
Delete the PV associated with the failed node.
Identify the PV associated with the PVC. The PVC name should be identical to the one obtained in Step 16(a).
# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released local-pv-5c9b8982 500Gi RWO Delete Released openshift-storage/ocs-deviceset-localblock-0-data-0-g2mmc localblock 24h worker-0
Delete the PV.
# oc delete pv <persistent-volume>
For example:
# oc delete pv local-pv-5c9b8982 persistentvolume "local-pv-5c9b8982" deleted
Delete the
crashcollector
pod deployment.$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> -n openshift-storage
Delete the
ocs-osd-removal-job
.# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
-
csi-cephfsplugin-*
-
csi-rbdplugin-*
-
Verify that all other required OpenShift Container Storage pods are in Running state.
Ensure that the new incremental
mon
is created and is in theRunning
state.$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-b-74f6dc9dd6-4llzq 1/1 Running 0 6h14m rook-ceph-mon-c-74948755c-h7wtx 1/1 Running 0 4h24m rook-ceph-mon-d-598f69869b-4bv49 1/1 Running 0 162m
OSDs and mons might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage| egrep -i new-node-name | egrep osd
- If verification steps fail, contact Red Hat Support.