11.2. Scaling down and replacing Ceph Storage nodes

In some cases, you might need to scale down your Ceph cluster, or even replace a Ceph Storage node, for example, if a Ceph Storage node is faulty. In either situation, you must disable and rebalance any Ceph Storage node that you want to remove from the overcloud to avoid data loss.

注記

This procedure uses steps from the Red Hat Ceph Storage Administration Guide to manually remove Ceph Storage nodes. For more in-depth information about manual removal of Ceph Storage nodes, see Starting, stopping, and restarting Ceph daemons that run in containers and Removing a Ceph OSD using the command-line interface.

Procedure

  1. Log in to a Controller node as the heat-admin user. The director stack user has an SSH key to access the heat-admin user.
  2. List the OSD tree and find the OSDs for your node. For example, the node you want to remove might contain the following OSDs:

    -2 0.09998     host overcloud-cephstorage-0
    0 0.04999         osd.0                         up  1.00000          1.00000
    1 0.04999         osd.1                         up  1.00000          1.00000
  3. Disable the OSDs on the Ceph Storage node. In this case, the OSD IDs are 0 and 1.

    [heat-admin@overcloud-controller-0 ~]$ sudo podman exec ceph-mon-<HOSTNAME> ceph osd out 0
    [heat-admin@overcloud-controller-0 ~]$ sudo podman exec ceph-mon-<HOSTNAME> ceph osd out 1
  4. The Ceph Storage cluster begins rebalancing. Wait for this process to complete. Follow the status by using the following command:

    [heat-admin@overcloud-controller-0 ~]$ sudo podman exec ceph-mon-<HOSTNAME> ceph -w
  5. After the Ceph cluster completes rebalancing, log in to the Ceph Storage node you are removing, in this case, overcloud-cephstorage-0, as the heat-admin user, and stop and disable the node.

    [heat-admin@overcloud-cephstorage-0 ~]$ sudo systemctl stop ceph-osd@0
    [heat-admin@overcloud-cephstorage-0 ~]$ sudo systemctl stop ceph-osd@1
    [heat-admin@overcloud-cephstorage-0 ~]$ sudo systemctl disable ceph-osd@0
    [heat-admin@overcloud-cephstorage-0 ~]$ sudo systemctl disable ceph-osd@1
  6. Stop the OSDs.

    [heat-admin@overcloud-cephstorage-0 ~]$ sudo systemctl stop ceph-osd@0
    [heat-admin@overcloud-cephstorage-0 ~]$ sudo systemctl stop ceph-osd@1
  7. While logged in to the Controller node, remove the OSDs from the CRUSH map so that they no longer receive data.

    [heat-admin@overcloud-controller-0 ~]$ sudo podman exec ceph-mon-<HOSTNAME> ceph osd crush remove osd.0
    [heat-admin@overcloud-controller-0 ~]$ sudo podman exec ceph-mon-<HOSTNAME> ceph osd crush remove osd.1
  8. Remove the OSD authentication key.

    [heat-admin@overcloud-controller-0 ~]$ sudo podman exec ceph-mon-<HOSTNAME> ceph auth del osd.0
    [heat-admin@overcloud-controller-0 ~]$ sudo podman exec ceph-mon-<HOSTNAME> ceph auth del osd.1
  9. Remove the OSD from the cluster.

    [heat-admin@overcloud-controller-0 ~]$ sudo podman exec ceph-mon-<HOSTNAME> ceph osd rm 0
    [heat-admin@overcloud-controller-0 ~]$ sudo podman exec ceph-mon-<HOSTNAME> ceph osd rm 1
  10. Remove the Storage node from the CRUSH map:

    [heat-admin@overcloud-controller-0 ~]$ sudo docker exec ceph-mon-<HOSTNAME> ceph osd crush rm <NODE>
    [heat-admin@overcloud-controller-0 ~]$ sudo ceph osd crush remove <NODE>

    You can confirm the <NODE> name as defined in the CRUSH map by searching the CRUSH tree:

    [heat-admin@overcloud-controller-0 ~]$ sudo podman exec ceph-mon-<HOSTNAME> ceph osd crush tree | grep overcloud-osd-compute-3 -A 4
                    "name": "overcloud-osd-compute-3",
                    "type": "host",
                    "type_id": 1,
                    "items": []
                },
    [heat-admin@overcloud-controller-0 ~]$

    In the CRUSH tree, ensure that the items list is empty. If the list is not empty, revisit step 7.

  11. Leave the node and return to the undercloud as the stack user.

    [heat-admin@overcloud-controller-0 ~]$ exit
    [stack@director ~]$
  12. Disable the Ceph Storage node so that director does not reprovision it.

    [stack@director ~]$ openstack baremetal node list
    [stack@director ~]$ openstack baremetal node maintenance set UUID
  13. Removing a Ceph Storage node requires an update to the overcloud stack in director with the local template files. First identify the UUID of the overcloud stack:

    $ openstack stack list
  14. Identify the UUIDs of the Ceph Storage node you want to delete:

    $ openstack server list
  15. Delete the node from the stack and update the plan accordingly:

    重要

    If you passed any extra environment files when you created the overcloud, pass them again here using the -e option to avoid making undesired changes to the overcloud. For more information, see Modifying the overcloud environment in the Director Installation and Usage guide.

    $ openstack overcloud node delete /
    --stack <stack-name> /
    --templates /
    -e <other-environment-files> /
    <node_UUID>
  16. Wait until the stack completes its update. Use the heat stack-list --show-nested command to monitor the stack update.
  17. Add new nodes to the director node pool and deploy them as Ceph Storage nodes. Use the CephStorageCount parameter in parameter_defaults of your environment file, in this case, ~/templates/storage-config.yaml, to define the total number of Ceph Storage nodes in the overcloud.

    parameter_defaults:
      ControllerCount: 3
      OvercloudControlFlavor: control
      ComputeCount: 3
      OvercloudComputeFlavor: compute
      CephStorageCount: 3
      OvercloudCephStorageFlavor: ceph-storage
      CephMonCount: 3
      OvercloudCephMonFlavor: ceph-mon
    注記

    For more information about how to define the number of nodes per role, see 「Assigning nodes and flavors to roles」.

  18. After you update your environment file, redeploy the overcloud:

    $ openstack overcloud deploy --templates -e <ENVIRONMENT_FILE>

    Director provisions the new node and updates the entire stack with the details of the new node.

  19. Log in to a Controller node as the heat-admin user and check the status of the Ceph Storage node:

    [heat-admin@overcloud-controller-0 ~]$ sudo ceph status
  20. Confirm that the value in the osdmap section matches the number of nodes in your cluster that you want. The Ceph Storage node that you removed is replaced with a new node.