Chapter 12. Replacing a failed disk

If one of the disks fails in your Ceph cluster, complete the following procedures to replace it:

  1. Determining if there is a device name change, see Section 12.1, “Determining if there is a device name change”.
  2. Ensuring that the OSD is down and destroyed, see Section 12.2, “Ensuring that the OSD is down and destroyed”.
  3. Removing the old disk from the system and installing the replacement disk, see Section 12.3, “Removing the old disk from the system and installing the replacement disk”.
  4. Verifying that the disk replacement is successful, see Section 12.4, “Verifying that the disk replacement is successful”.

12.1. Determining if there is a device name change

Before you replace the disk, determine if the replacement disk for the replacement OSD has a different name in the operating system than the device that you want to replace. If the replacement disk has a different name, you must update Ansible parameters for the devices list so that subsequent runs of ceph-ansible, including when director runs ceph-ansible, do not fail as a result of the change. For an example of the devices list that you must change when you use director, see Section 5.3, “Mapping the Ceph Storage node disk layout”.

Warning

If the device name changes and you use the following procedures to update your system outside of ceph-ansible or director, there is a risk that the configuration management tools are out of sync with the system that they manage until you update the system definition files and the configuration is reasserted without error.

Persistent naming of storage devices

Storage devices that the sd driver manages might not always have the same name across reboots. For example, a disk that is normally identified by /dev/sdc might be named /dev/sdb. It is also possible for the replacement disk, /dev/sdc, to appear in the operating system as /dev/sdd even if you want to use it as a replacement for /dev/sdc. To address this issue, use names that are persistent and match the following pattern: /dev/disk/by-*. For more information, see Persistent Naming in the Red Hat Enterprise Linux (RHEL) 7 Storage Administration Guide.

Depending on the naming method that you use to deploy Ceph, you might need to update the devices list after you replace the OSD. Use the following list of naming methods to determine if you must change the devices list:

The major and minor number range method

If you used sd and want to continue to use it, after you install the new disk, check if the name has changed. If the name did not change, for example, if the same name appears correctly as /dev/sdd, it is not necessary to change the name after you complete the disk replacement procedures.

Important

This naming method is not recommended because there is still a risk that the name becomes inconsistent over time. For more information, see Persistent Naming in the RHEL 7 Storage Administration Guide.

The by-path method

If you use this method, and you add a replacement disk in the same slot, then the path is consistent and no change is necessary.

Important

Although this naming method is preferable to the major and minor number range method, use caution to ensure that the target numbers do not change. For example, use persistent binding and update the names if a host adapter is moved to a different PCI slot. In addition, there is the possibility that the SCSI host numbers can change if a HBA fails to probe, if drivers are loaded in a different order, or if a new HBA is installed on the system. The by-path naming method also differs between RHEL7 and RHEL8. For more information, see:

The by-uuid method
If you use this method, you can use the blkid utility to set the new disk to have the same UUID as the old disk. For more information, see Persistent Naming in the RHEL 7 Storage Administration Guide.
The by-id method
If you use this method, you must change the devices list because this identifier is a property of the device and the device has been replaced.

When you add the new disk to the system, if it is possible to modify the persistent naming attributes according to the RHEL7 Storage Administrator Guide, see Persistent Naming, so that the device name is unchanged, then it is not necessary to update the devices list and re-run ceph-ansible, or trigger director to re-run ceph-ansible and you can proceed with the disk replacement procedures. However, you can re-run ceph-ansible to ensure that the change did not result in any inconsistencies.

12.2. Ensuring that the OSD is down and destroyed

On the server that hosts the Ceph Monitor, use the ceph command in the running monitor container to ensure that the OSD that you want to replace is down, and then destroy it.

Procedure

  1. Identify the name of the running Ceph monitor container and store it in an environment variable called MON:

    MON=$(podman ps | grep ceph-mon | awk {'print $1'})
  2. Alias the ceph command so that it executes within the running Ceph monitor container:

    alias ceph="podman exec $MON ceph"
  3. Use the new alias to verify that the OSD that you want to replace is down:

    [root@overcloud-controller-0 ~]# ceph osd tree | grep 27
    27   hdd 0.04790         osd.27                    down  1.00000 1.00000
  4. Destroy the OSD. The following example command destroys OSD 27:

    [root@overcloud-controller-0 ~]# ceph osd destroy 27 --yes-i-really-mean-it
    destroyed osd.27

12.3. Removing the old disk from the system and installing the replacement disk

On the container host with the OSD that you want to replace, remove the old disk from the system and install the replacement disk.

Prerequisites:

The ceph-volume command is present in the Ceph container but is not installed on the overcloud node. Create an alias so that the ceph-volume command runs the ceph-volume binary inside the Ceph container. Then use the ceph-volume command to clean the new disk and add it as an OSD.

Procedure

  1. Ensure that the failed OSD is not running:

    systemctl stop ceph-osd@27
  2. Identify the image ID of the ceph container image and store it in an environment variable called IMG:

    IMG=$(podman images | grep ceph | awk {'print $3'})
  3. Alias the ceph-volume command so that it runs inside the $IMG Ceph container, with the ceph-volume entry point and relevant directories:

    alias ceph-volume="podman run --rm --privileged --net=host --ipc=host -v /run/lock/lvm:/run/lock/lvm:z -v /var/run/udev/:/var/run/udev/:z -v /dev:/dev -v /etc/ceph:/etc/ceph:z -v /var/lib/ceph/:/var/lib/ceph/:z -v /var/log/ceph/:/var/log/ceph/:z --entrypoint=ceph-volume $IMG --cluster ceph"
  4. Verify that the aliased command runs successfully:

    ceph-volume lvm list
  5. Check that your new OSD device is not already part of LVM. Use the pvdisplay command to inspect the device, and ensure that the VG Name field is empty. Replace <NEW_DEVICE> with the /dev/* path of your new OSD device:

    [root@overcloud-computehci-2 ~]# pvdisplay <NEW_DEVICE>
      --- Physical volume ---
      PV Name               /dev/sdj
      VG Name               ceph-0fb0de13-fc8e-44c8-99ea-911e343191d2
      PV Size               50.00 GiB / not usable 1.00 GiB
      Allocatable           yes (but full)
      PE Size               1.00 GiB
      Total PE              49
      Free PE               0
      Allocated PE          49
      PV UUID               kOO0If-ge2F-UH44-6S1z-9tAv-7ypT-7by4cp
    [root@overcloud-computehci-2 ~]#

    If the VG Name field is not empty, then the device belongs to a volume group that you must remove.

  6. If the device belongs to a volume group, use the lvdisplay command to check if there is a logical volume in the volume group. Replace <VOLUME_GROUP> with the value of the VG Name field that you retrieved from the pvdisplay command:

    [root@overcloud-computehci-2 ~]# lvdisplay | grep <VOLUME_GROUP>
      LV Path                /dev/ceph-0fb0de13-fc8e-44c8-99ea-911e343191d2/osd-data-a0810722-7673-43c7-8511-2fd9db1dbbc6
      VG Name                ceph-0fb0de13-fc8e-44c8-99ea-911e343191d2
    [root@overcloud-computehci-2 ~]#

    If the LV Path field is not empty, then the device contains a logical volume that you must remove.

  7. If the new device is part of a logical volume or volume group, remove the logical volume, volume group, and the device association as a physical volume within the LVM system.

    • Replace <LV_PATH> with the value of the LV Path field.
    • Replace <VOLUME_GROUP> with the value of the VG Name field.
    • Replace <NEW_DEVICE> with the /dev/* path of your new OSD device.

      [root@overcloud-computehci-2 ~]# lvremove --force <LV_PATH>
        Logical volume "osd-data-a0810722-7673-43c7-8511-2fd9db1dbbc6" successfully removed
      [root@overcloud-computehci-2 ~]# vgremove --force <VOLUME_GROUP>
        Volume group "ceph-0fb0de13-fc8e-44c8-99ea-911e343191d2" successfully removed
      [root@overcloud-computehci-2 ~]# pvremove <NEW_DEVICE>
        Labels on physical volume "/dev/sdj" successfully wiped.
  8. Ensure that the new OSD device is clean. In the following example, the device is /dev/sdj:

    [root@overcloud-computehci-2 ~]# ceph-volume lvm zap /dev/sdj
    --> Zapping: /dev/sdj
    --> --destroy was not specified, but zapping a whole device will remove the partition table
    Running command: /usr/sbin/wipefs --all /dev/sdj
    Running command: /bin/dd if=/dev/zero of=/dev/sdj bs=1M count=10
     stderr: 10+0 records in
    10+0 records out
    10485760 bytes (10 MB, 10 MiB) copied, 0.010618 s, 988 MB/s
    --> Zapping successful for: <Raw Device: /dev/sdj>
    [root@overcloud-computehci-2 ~]#
  9. Create the new OSD with the existing OSD ID by using the new device but pass --no-systemd so that ceph-volume does not attempt to start the OSD. This is not possible from within the container:

    ceph-volume lvm create --osd-id 27 --data /dev/sdj --no-systemd
  10. Start the OSD outside of the container:

    systemctl start ceph-osd@27

12.4. Verifying that the disk replacement is successful

To check that your disk replacement is successful, on the undercloud, complete the following steps.

Procedure

  1. Check if the device name changed, update the devices list according to the naming method you used to deploy Ceph. For more information, see Section 12.1, “Determining if there is a device name change”.
  2. To ensure that the change did not introduce any inconsistencies, re-run the overcloud deploy command to perform a stack update.
  3. In cases where you have hosts that have different device lists, you might have to define an exception. For example, you might use the following example heat environment file to deploy a node with three OSD devices.

    parameter_defaults:
      CephAnsibleDisksConfig:
        devices:
          - /dev/sdb
          - /dev/sdc
          - /dev/sdd
        osd_scenario: lvm
        osd_objectstore: bluestore

    The CephAnsibleDisksConfig parameter applies to all nodes that host OSDs, so you cannot update the devices parameter with the new device list. Instead, you must define an exception for the new host that has a different device list. For more information about defining an exception, see Section 5.5, “Mapping the disk layout to non-homogeneous Ceph Storage nodes”.