Chapter 9. Replacing DistributedComputeHCI nodes

During hardware maintenance you may need to scale down, scale up, or replace a DistributedComputeHCI node at an edge site. To replace a DistributedComputeHCI node, remove services from the node you are replacing, scale the number of nodes down, and then follow the procedures for scaling those nodes back up.

9.1. Removing Red Hat Ceph Storage services

Before you remove an HCI (hyperconverged) node from a cluster, you must remove its Red Hat Ceph Storage services. To do so, disable and remove the ceph-osd service from the cluster services on the node that you are removing, and then stop and disable the mon, mgr, and osd services.

Procedure

  1. On the undercloud, use SSH to connect to the DistributedComputeHCI node that you want to remove:

    $ ssh tripleo-admin@<dcn-computehci-node>
  2. Start a cephadm shell. Use the configuration file and keyring file for the site that contains the host you are removing:

    $ sudo cephadm shell --config /etc/ceph/dcn2.conf \
    --keyring /etc/ceph/dcn2.client.admin.keyring
  3. Record the OSDs (object storage devices) associated with the DistributedComputeHCI node that you are removing, for reference in a later step:

    [ceph: root@dcn2-computehci2-1 ~]# ceph osd tree -c /etc/ceph/dcn2.conf
    …
    -3       0.24399     host dcn2-computehci2-1
     1   hdd 0.04880         osd.1                           up  1.00000 1.00000
     7   hdd 0.04880         osd.7                           up  1.00000 1.00000
    11   hdd 0.04880         osd.11                          up  1.00000 1.00000
    15   hdd 0.04880         osd.15                          up  1.00000 1.00000
    18   hdd 0.04880         osd.18                          up  1.00000 1.00000
    …
  4. Use SSH to connect to another node in the same cluster and remove the monitor from the cluster:

    $ sudo cephadm shell --config /etc/ceph/dcn2.conf \
    --keyring /etc/ceph/dcn2.client.admin.keyring
    
    [ceph: root@dcn2-computehci2-0]# ceph mon remove dcn2-computehci2-1 -c /etc/ceph/dcn2.conf
    removing mon.dcn2-computehci2-1 at [v2:172.23.3.153:3300/0,v1:172.23.3.153:6789/0], there will be 2 monitors
  5. Use SSH to log in again to the node that you are removing from the cluster.
  6. Stop and disable the mgr service:

    [tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl --type=service | grep ceph
    ceph-crash@dcn2-computehci2-1.service    loaded active     running       Ceph crash dump collector
    ceph-mgr@dcn2-computehci2-1.service      loaded active     running       Ceph Manager
    
    [tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl stop ceph-mgr@dcn2-computehci2-1
    
    [tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl --type=service | grep ceph
    ceph-crash@dcn2-computehci2-1.service  loaded active running Ceph crash dump collector
    
    [tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl disable ceph-mgr@dcn2-computehci2-1
    Removed /etc/systemd/system/multi-user.target.wants/ceph-mgr@dcn2-computehci2-1.service.
  7. Start the cephadm shell:

    $ sudo cephadm shell --config /etc/ceph/dcn2.conf \
    --keyring /etc/ceph/dcn2.client.admin.keyring
  8. Verify that the mgr service for the node is removed from the cluster:

    [ceph: root@dcn2-computehci2-1 ~]# ceph -s
    
    cluster:
        id:     b9b53581-d590-41ac-8463-2f50aa985001
        health: HEALTH_WARN
                3 pools have too many placement groups
                mons are allowing insecure global_id reclaim
    
      services:
        mon: 2 daemons, quorum dcn2-computehci2-2,dcn2-computehci2-0 (age 2h)
        mgr: dcn2-computehci2-2(active, since 20h), standbys: dcn2-computehci2-0
        osd: 15 osds: 15 up (since 3h), 15 in (since 3h)
    
      data:
        pools:   3 pools, 384 pgs
        objects: 32 objects, 88 MiB
        usage:   16 GiB used, 734 GiB / 750 GiB avail
        pgs:     384 active+clean

    In the output, the mgr line no longer lists the node from which you removed the mgr service.
  9. Export the Red Hat Ceph Storage specification:

    [ceph: root@dcn2-computehci2-1 ~]# ceph orch ls --export > spec.yml
  10. Edit the specifications in the spec.yml file:

    • Remove all instances of the host <dcn-computehci-node> from spec.yml
    • Remove all instances of the <dcn-computehci-node> entry from the following:

      • service_type: osd
      • service_type: mon
      • service_type: host
  11. Reapply the Red Hat Ceph Storage specification:

    [ceph: root@dcn2-computehci2-1 /]# ceph orch apply -i spec.yml
  12. Remove the OSDs that you identified using ceph osd tree:

    [ceph: root@dcn2-computehci2-1 /]# ceph orch osd rm --zap 1 7 11 15 18
    Scheduled OSD(s) for removal
  13. Verify the status of the OSDs being removed. Do not continue until the following command returns no output:

    [ceph: root@dcn2-computehci2-1 /]# ceph orch osd rm status
    OSD_ID  HOST                    STATE     PG_COUNT  REPLACE  FORCE  DRAIN_STARTED_AT
    1       dcn2-computehci2-1      draining  27        False    False  2021-04-23 21:35:51.215361
    7       dcn2-computehci2-1      draining  8         False    False  2021-04-23 21:35:49.111500
    11      dcn2-computehci2-1      draining  14        False    False  2021-04-23 21:35:50.243762
  14. Verify that no daemons remain on the host you are removing:

    [ceph: root@dcn2-computehci2-1 /]# ceph orch ps dcn2-computehci2-1

    If daemons are still present, you can remove them with the following command:

    [ceph: root@dcn2-computehci2-1 /]# ceph orch host drain dcn2-computehci2-1
  15. Remove the <dcn-computehci-node> host from the Red Hat Ceph Storage cluster:

    [ceph: root@dcn2-computehci2-1 /]# ceph orch host rm dcn2-computehci2-1
    Removed host 'dcn2-computehci2-1'
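
Steps 3 and 12 rely on copying OSD IDs by hand from the ceph osd tree output. As a minimal sketch, the following shell function collects those IDs for one host. The function name osd_ids_for_host is hypothetical, and the parsing assumes the plain-text column layout shown in step 3, which might differ between Ceph releases. It also assumes that awk and paste are available in the shell where you run it.

```shell
# Hypothetical helper, not part of Ceph: print the IDs of the OSDs listed
# under one host bucket in plain-text `ceph osd tree` output (read from stdin).
osd_ids_for_host() {
    awk -v host="$1" '
        # Bucket lines (root, host, ...) have a negative ID in column 1.
        # Track whether the current bucket is the requested host.
        $1 ~ /^-[0-9]+$/ { in_host = ($3 == "host" && $4 == host); next }
        # OSD lines under that host: column 1 is the ID, column 4 is osd.<ID>.
        in_host && $4 ~ /^osd\.[0-9]+$/ { print $1 }
    ' | paste -sd " " -
}
```

For example, the following command would print the IDs on one line, separated by spaces. Review the list before you pass it to ceph orch osd rm:

    ceph osd tree -c /etc/ceph/dcn2.conf | osd_ids_for_host dcn2-computehci2-1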

9.2. Removing the Image service (glance) services

Remove the Image service (glance) services from a node when you remove the node from service.

Procedure

  • Stop and disable the Image service services by using systemctl on the node that you are removing:

    [root@dcn2-computehci2-1 ~]# systemctl stop tripleo_glance_api.service
    [root@dcn2-computehci2-1 ~]# systemctl stop tripleo_glance_api_tls_proxy.service
    
    [root@dcn2-computehci2-1 ~]# systemctl disable tripleo_glance_api.service
    Removed /etc/systemd/system/multi-user.target.wants/tripleo_glance_api.service.
    [root@dcn2-computehci2-1 ~]# systemctl disable tripleo_glance_api_tls_proxy.service
    Removed /etc/systemd/system/multi-user.target.wants/tripleo_glance_api_tls_proxy.service.
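
The same stop-then-disable pattern recurs for the Ceph, Image service, and Block Storage units in this chapter. A minimal sketch of a wrapper follows. The function name stop_and_disable and the SYSTEMCTL override variable are hypothetical conveniences, not part of systemd or TripleO. Unlike the procedure above, the sketch stops and disables one unit before it moves to the next.

```shell
# Hypothetical wrapper: stop and then disable each systemd unit given as an
# argument. Set SYSTEMCTL="echo systemctl" to preview the commands instead of
# running them (SYSTEMCTL is an assumption of this sketch, not a systemd feature).
stop_and_disable() {
    for unit in "$@"; do
        ${SYSTEMCTL:-systemctl} stop "$unit" || return 1
        ${SYSTEMCTL:-systemctl} disable "$unit" || return 1
    done
}
```

To preview the commands without changing anything, you could run:

    SYSTEMCTL="echo systemctl" stop_and_disable tripleo_glance_api.service tripleo_glance_api_tls_proxy.service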

9.3. Removing the Block Storage (cinder) services

You must remove the cinder-volume and etcd services from the DistributedComputeHCI node when you remove it from service.

Procedure

  1. Identify and disable the cinder-volume service on the node you are removing:

    (central) [stack@site-undercloud-0 ~]$ openstack volume service list --service cinder-volume
    | cinder-volume | dcn2-computehci2-1@tripleo_ceph | az-dcn2    | enabled | up    | 2022-03-23T17:41:43.000000 |
    (central) [stack@site-undercloud-0 ~]$ openstack volume service set --disable dcn2-computehci2-1@tripleo_ceph cinder-volume
  2. Log on to a different DistributedComputeHCI node in the stack:

    $ ssh tripleo-admin@dcn2-computehci2-0
  3. Remove the cinder-volume service associated with the node that you are removing:

    [root@dcn2-computehci2-0 ~]# podman exec -it cinder_volume cinder-manage service remove cinder-volume dcn2-computehci2-1@tripleo_ceph
    Service cinder-volume on host dcn2-computehci2-1@tripleo_ceph removed.
  4. Stop and disable the tripleo_cinder_volume service on the node that you are removing:

    [root@dcn2-computehci2-1 ~]# systemctl stop tripleo_cinder_volume.service
    [root@dcn2-computehci2-1 ~]# systemctl disable tripleo_cinder_volume.service
    Removed /etc/systemd/system/multi-user.target.wants/tripleo_cinder_volume.service
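
Step 1 identifies the cinder-volume entry for the node by reading the service list. As a sketch, the filter below selects the Host values that belong to exactly one node. The function name cinder_hosts_for_node is hypothetical, and it assumes that its input is one Host value per line, such as the output of openstack volume service list --service cinder-volume -f value -c Host.

```shell
# Hypothetical filter: read cinder-volume Host values (node@backend) from
# stdin and print only those that belong to the named node.
cinder_hosts_for_node() {
    # Split each Host value at "@" and compare the part before it with the
    # node name exactly, so that "node-1" does not also match "node-10".
    awk -F "@" -v node="$1" '$1 == node'
}
```

The exact comparison matters at sites where host names share a prefix, for example dcn2-computehci2-1 and dcn2-computehci2-10.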

9.4. Delete the DistributedComputeHCI node

Set the provisioned parameter to a value of false and remove the node from the stack. Disable the nova-compute service and delete the relevant network agent.

Procedure

  1. Copy the overcloud-baremetal-deploy.yaml file to create a scale-down copy:

    cp /home/stack/dcn2/overcloud-baremetal-deploy.yaml \
    /home/stack/dcn2/baremetal-deployment-scaledown.yaml
  2. Edit the baremetal-deployment-scaledown.yaml file. Identify the host that you want to remove and set the provisioned parameter to false:

    instances:
    ...
      - hostname: dcn2-computehci2-1
        provisioned: false
  3. Remove the node from the stack:

    openstack overcloud node delete --stack dcn2 --baremetal-deployment /home/stack/dcn2/baremetal-deployment-scaledown.yaml
  4. Optional: If you are going to reuse the node, use ironic to clean the disk. This is required if the node will host Ceph OSDs:

    openstack baremetal node manage $UUID
    openstack baremetal node clean $UUID --clean-steps '[{"interface":"deploy", "step": "erase_devices_metadata"}]'
    openstack baremetal node provide $UUID
  5. Redeploy the central site. Include all templates that you used for the initial configuration:

    openstack overcloud deploy \
    --deployed-server \
    --stack central \
    --templates /usr/share/openstack-tripleo-heat-templates/ \
    -r ~/control-plane/central_roles.yaml \
    -n ~/network-data.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/network-environment.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/dcn-storage.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/nova-az-config.yaml \
    -e /home/stack/central/overcloud-networks-deployed.yaml \
    -e /home/stack/central/overcloud-vip-deployed.yaml \
    -e /home/stack/central/deployed_metal.yaml \
    -e /home/stack/central/deployed_ceph.yaml \
    -e /home/stack/central/dcn_ceph.yaml \
    -e /home/stack/central/glance_update.yaml
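
Before you run the node delete command in step 3, it can help to confirm that the scale-down file really marks the intended host. The following is a rough, line-based check rather than a YAML parser. The function name host_marked_unprovisioned is hypothetical, and the check assumes that each instance begins with a "- hostname:" line, as in the example in step 2.

```shell
# Hypothetical check: succeed only if the given file contains an instance
# entry for the host that is followed by "provisioned: false".
# Usage: host_marked_unprovisioned <file> <hostname>
host_marked_unprovisioned() {
    awk -v host="$2" '
        # A new list item starts; remember whether it is the requested host.
        $1 == "-" && $2 == "hostname:" { in_host = ($3 == host); next }
        # Within that item, look for the provisioned flag set to false.
        in_host && $1 == "provisioned:" && $2 == "false" { found = 1 }
        END { exit !found }
    ' "$1"
}
```

For example:

    host_marked_unprovisioned /home/stack/dcn2/baremetal-deployment-scaledown.yaml dcn2-computehci2-1 && echo "marked for removal"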

9.5. Replacing a removed DistributedComputeHCI node

9.5.1. Replacing a removed DistributedComputeHCI node

To add new HCI nodes to your DCN deployment, you must redeploy the edge stack with the additional node, perform a Ceph export of that stack, and then perform a stack update for the central location. A stack update of the central location adds configurations specific to edge sites.

Prerequisites

The node counts are correct in the nodes_data.yaml file of the stack in which you want to replace a node or to which you want to add a node.

Procedure

  1. Set the EtcdInitialClusterState parameter to existing in one of the templates called by your deploy script:

    parameter_defaults:
      EtcdInitialClusterState: existing
  2. Redeploy using the deployment script specific to the stack:

    (undercloud) [stack@site-undercloud-0 ~]$ ./overcloud_deploy_dcn2.sh
    …
    Overcloud Deployed without error
  3. Export the Red Hat Ceph Storage data from the stack:

    (undercloud) [stack@site-undercloud-0 ~]$ sudo -E openstack overcloud export ceph --stack dcn1,dcn2 --config-download-dir /var/lib/mistral --output-file ~/central/dcn2_scale_up_ceph_external.yaml
  4. Replace dcn_ceph_external.yaml with the newly generated dcn2_scale_up_ceph_external.yaml in the deploy script for the central location.
  5. Perform a stack update at central:

    (undercloud) [stack@site-undercloud-0 ~]$ ./overcloud_deploy.sh
    ...
    Overcloud Deployed without error
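
Because a missing EtcdInitialClusterState setting only shows up later as a cinder-volume service that is down, a quick pre-flight check of the environment files can be worthwhile. The sketch below assumes that the parameter is written unquoted on its own line, as in step 1. The function name and the example file path are hypothetical.

```shell
# Hypothetical pre-flight check: succeed if at least one of the given files
# sets EtcdInitialClusterState to "existing" on a line of its own.
etcd_state_is_existing() {
    grep -Eq '^[[:space:]]*EtcdInitialClusterState:[[:space:]]*existing[[:space:]]*$' "$@"
}
```

For example, with an environment file at a path of your choosing:

    etcd_state_is_existing ~/dcn2/etcd.yaml || echo "EtcdInitialClusterState is not set to existing"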

9.6. Verify the functionality of a replaced DistributedComputeHCI node

  1. Ensure that the value of the Status field is enabled and that the value of the State field is up:

    (central) [stack@site-undercloud-0 ~]$ openstack compute service list -c Binary -c Host -c Zone -c Status -c State
    +----------------+-----------------------------------------+------------+---------+-------+
    | Binary         | Host                                    | Zone       | Status  | State |
    +----------------+-----------------------------------------+------------+---------+-------+
    ...
    | nova-compute   | dcn1-compute1-0.redhat.local            | az-dcn1    | enabled | up    |
    | nova-compute   | dcn1-compute1-1.redhat.local            | az-dcn1    | enabled | up    |
    | nova-compute   | dcn2-computehciscaleout2-0.redhat.local | az-dcn2    | enabled | up    |
    | nova-compute   | dcn2-computehci2-0.redhat.local         | az-dcn2    | enabled | up    |
    | nova-compute   | dcn2-computescaleout2-0.redhat.local    | az-dcn2    | enabled | up    |
    | nova-compute   | dcn2-computehci2-2.redhat.local         | az-dcn2    | enabled | up    |
    ...
  2. Ensure that all network agents are in the up state:

    (central) [stack@site-undercloud-0 ~]$ openstack network agent list -c "Agent Type" -c Host -c Alive -c State
    +--------------------+-----------------------------------------+-------+-------+
    | Agent Type         | Host                                    | Alive | State |
    +--------------------+-----------------------------------------+-------+-------+
    | DHCP agent         | dcn3-compute3-1.redhat.local            | :-)   | UP    |
    | Open vSwitch agent | central-computehci0-1.redhat.local      | :-)   | UP    |
    | DHCP agent         | dcn3-compute3-0.redhat.local            | :-)   | UP    |
    | DHCP agent         | central-controller0-2.redhat.local      | :-)   | UP    |
    | Open vSwitch agent | dcn3-compute3-1.redhat.local            | :-)   | UP    |
    | Open vSwitch agent | dcn1-compute1-1.redhat.local            | :-)   | UP    |
    | Open vSwitch agent | central-computehci0-0.redhat.local      | :-)   | UP    |
    | DHCP agent         | central-controller0-1.redhat.local      | :-)   | UP    |
    | L3 agent           | central-controller0-2.redhat.local      | :-)   | UP    |
    | Metadata agent     | central-controller0-1.redhat.local      | :-)   | UP    |
    | Open vSwitch agent | dcn2-computescaleout2-0.redhat.local    | :-)   | UP    |
    | Open vSwitch agent | dcn2-computehci2-5.redhat.local         | :-)   | UP    |
    | Open vSwitch agent | central-computehci0-2.redhat.local      | :-)   | UP    |
    | DHCP agent         | central-controller0-0.redhat.local      | :-)   | UP    |
    | Open vSwitch agent | central-controller0-1.redhat.local      | :-)   | UP    |
    | Open vSwitch agent | dcn2-computehci2-0.redhat.local         | :-)   | UP    |
    | Open vSwitch agent | dcn1-compute1-0.redhat.local            | :-)   | UP    |
    ...
  3. Verify the status of the Ceph Cluster:

    1. Use SSH to connect to the new DistributedComputeHCI node and check the status of the Ceph cluster:

      [root@dcn2-computehci2-5 ~]# podman exec -it ceph-mon-dcn2-computehci2-5 \
      ceph -s -c /etc/ceph/dcn2.conf
    2. Verify that both the ceph mon and ceph mgr services exist for the new node:

      services:
          mon: 3 daemons, quorum dcn2-computehci2-2,dcn2-computehci2-0,dcn2-computehci2-5 (age 3d)
          mgr: dcn2-computehci2-2(active, since 3d), standbys: dcn2-computehci2-0, dcn2-computehci2-5
          osd: 20 osds: 20 up (since 3d), 20 in (since 3d)
    3. Verify the status of the Ceph OSDs with ceph osd tree. Ensure that all OSDs for the new node have a STATUS of up:

      [root@dcn2-computehci2-5 ~]# podman exec -it ceph-mon-dcn2-computehci2-5 ceph osd tree -c /etc/ceph/dcn2.conf
      ID CLASS WEIGHT  TYPE NAME                           STATUS REWEIGHT PRI-AFF
      -1       0.97595 root default
      -5       0.24399     host dcn2-computehci2-0
       0   hdd 0.04880         osd.0                           up  1.00000 1.00000
       4   hdd 0.04880         osd.4                           up  1.00000 1.00000
       8   hdd 0.04880         osd.8                           up  1.00000 1.00000
      13   hdd 0.04880         osd.13                          up  1.00000 1.00000
      17   hdd 0.04880         osd.17                          up  1.00000 1.00000
      -9       0.24399     host dcn2-computehci2-2
       3   hdd 0.04880         osd.3                           up  1.00000 1.00000
       5   hdd 0.04880         osd.5                           up  1.00000 1.00000
      10   hdd 0.04880         osd.10                          up  1.00000 1.00000
      14   hdd 0.04880         osd.14                          up  1.00000 1.00000
      19   hdd 0.04880         osd.19                          up  1.00000 1.00000
      -3       0.24399     host dcn2-computehci2-5
       1   hdd 0.04880         osd.1                           up  1.00000 1.00000
       7   hdd 0.04880         osd.7                           up  1.00000 1.00000
      11   hdd 0.04880         osd.11                          up  1.00000 1.00000
      15   hdd 0.04880         osd.15                          up  1.00000 1.00000
      18   hdd 0.04880         osd.18                          up  1.00000 1.00000
      -7       0.24399     host dcn2-computehciscaleout2-0
       2   hdd 0.04880         osd.2                           up  1.00000 1.00000
       6   hdd 0.04880         osd.6                           up  1.00000 1.00000
       9   hdd 0.04880         osd.9                           up  1.00000 1.00000
      12   hdd 0.04880         osd.12                          up  1.00000 1.00000
      16   hdd 0.04880         osd.16                          up  1.00000 1.00000
  4. Verify that the cinder-volume service for the new DistributedComputeHCI node has a Status of enabled and a State of up:

    (central) [stack@site-undercloud-0 ~]$ openstack volume service list --service cinder-volume -c Binary -c Host -c Zone -c Status -c State
    +---------------+---------------------------------+------------+---------+-------+
    | Binary        | Host                            | Zone       | Status  | State |
    +---------------+---------------------------------+------------+---------+-------+
    | cinder-volume | hostgroup@tripleo_ceph          | az-central | enabled | up    |
    | cinder-volume | dcn1-compute1-1@tripleo_ceph    | az-dcn1    | enabled | up    |
    | cinder-volume | dcn1-compute1-0@tripleo_ceph    | az-dcn1    | enabled | up    |
    | cinder-volume | dcn2-computehci2-0@tripleo_ceph | az-dcn2    | enabled | up    |
    | cinder-volume | dcn2-computehci2-2@tripleo_ceph | az-dcn2    | enabled | up    |
    | cinder-volume | dcn2-computehci2-5@tripleo_ceph | az-dcn2    | enabled | up    |
    +---------------+---------------------------------+------------+---------+-------+
    Note

    If the State of the cinder-volume service is down, then the service has not been started on the node.

  5. Use SSH to connect to the new DistributedComputeHCI node and check the status of the Image service (glance) services with systemctl:

    [root@dcn2-computehci2-5 ~]# systemctl --type service | grep glance
      tripleo_glance_api.service                        loaded active     running       glance_api container
      tripleo_glance_api_healthcheck.service            loaded activating start   start glance_api healthcheck
      tripleo_glance_api_tls_proxy.service              loaded active     running       glance_api_tls_proxy container
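
The Status and State checks in steps 1 and 4 can also be scripted. As a sketch, the filter below prints every host whose service is not both enabled and up. It assumes three whitespace-separated columns (Host, Status, State), which is what the -f value -c Host -c Status -c State options of the openstack client are expected to produce. The function name services_not_ready is hypothetical.

```shell
# Hypothetical filter: read "Host Status State" rows from stdin and print the
# host of every row that is not both "enabled" and "up".
services_not_ready() {
    awk '$2 != "enabled" || $3 != "up" { print $1 }'
}
```

For example, the following command prints nothing when every listed service is enabled and up:

    openstack compute service list -f value -c Host -c Status -c State | services_not_ready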

9.7. Troubleshooting DistributedComputeHCI state down

If the replacement node was deployed without the EtcdInitialClusterState parameter value set to existing, then the cinder-volume service of the replaced node shows down when you run openstack volume service list.

Procedure

  1. Log in to the replacement node and check the /var/log/containers/stdouts/etcd.log log file. Confirm that the etcd service reports a cluster ID mismatch:

    2022-04-06T18:00:11.834104130+00:00 stderr F 2022-04-06 18:00:11.834045 E | rafthttp: request cluster ID mismatch (got 654f4cf0e2cfb9fd want 918b459b36fe2c0c)
  2. Set the EtcdInitialClusterState parameter to the value of existing in your deployment templates and rerun the deployment script.
  3. Use SSH to connect to the replacement node and run the following commands as root:

    [root@dcn2-computehci2-4 ~]# systemctl stop tripleo_etcd
    [root@dcn2-computehci2-4 ~]# rm -rf /var/lib/etcd/*
    [root@dcn2-computehci2-4 ~]# systemctl start tripleo_etcd
  4. Recheck the /var/log/containers/stdouts/etcd.log log file to verify that the node successfully joined the cluster:

    2022-04-06T18:24:22.130059875+00:00 stderr F 2022-04-06 18:24:22.129395 I | etcdserver/membership: added member 96f61470cd1839e5 [https://dcn2-computehci2-4.internalapi.redhat.local:2380] to cluster 654f4cf0e2cfb9fd
  5. Run openstack volume service list and confirm that the State of the cinder-volume service on the replacement node is up.
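
To make the log check in step 1 repeatable, the mismatch message can be extracted with sed. This is a sketch that assumes the message format shown in step 1. The function name etcd_cluster_id_mismatch is hypothetical.

```shell
# Hypothetical helper: print the "got" and "want" cluster IDs from the most
# recent "request cluster ID mismatch" message in an etcd log file.
# Prints nothing if the log contains no such message.
etcd_cluster_id_mismatch() {
    sed -n 's/.*request cluster ID mismatch (got \([0-9a-f]*\) want \([0-9a-f]*\)).*/\1 \2/p' "$1" | tail -n 1
}
```

For example, run the following command as root on the replacement node. Empty output suggests that the log contains no mismatch message:

    etcd_cluster_id_mismatch /var/log/containers/stdouts/etcd.log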