Chapter 8. Replacing DistributedComputeHCI nodes
During hardware maintenance you may need to scale down, scale up, or replace a DistributedComputeHCI node at an edge site. To replace a DistributedComputeHCI node, remove services from the node you are replacing, scale the number of nodes down, and then follow the procedures for scaling those nodes back up.
8.1. Removing Red Hat Ceph Storage services
Before removing an HCI (hyperconverged infrastructure) node from a cluster, you must remove the Red Hat Ceph Storage services. To remove the Red Hat Ceph Storage services, you must disable and remove the ceph-osd service from the cluster services on the node that you are removing, and then stop and disable the mon, mgr, and osd services.
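Before you begin, you can take an inventory of the Ceph daemons that cephadm manages on the node; a quick check, run from a cephadm shell opened with the site's configuration and keyring files as shown in the procedure that follows:
# Optional check: list every Ceph daemon (mon, mgr, osd, crash) that is running
# on the node being removed.
[ceph: root@dcn2-computehci2-1 ~]# ceph orch ps dcn2-computehci2-1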
Procedure
On the undercloud, use SSH to connect to the DistributedComputeHCI node that you want to remove:
$ ssh tripleo-admin@<dcn-computehci-node>
Start a cephadm shell. Use the configuration file and keyring file for the site that the host being removed is in:
$ sudo cephadm shell --config /etc/ceph/dcn2.conf \
  --keyring /etc/ceph/dcn2.client.admin.keyring
Record the OSDs (object storage devices) associated with the DistributedComputeHCI node you are removing, for reference in a later step:
[ceph: root@dcn2-computehci2-1 ~]# ceph osd tree -c /etc/ceph/dcn2.conf
…
-3       0.24399     host dcn2-computehci2-1
 1   hdd 0.04880         osd.1     up  1.00000  1.00000
 7   hdd 0.04880         osd.7     up  1.00000  1.00000
11   hdd 0.04880         osd.11    up  1.00000  1.00000
15   hdd 0.04880         osd.15    up  1.00000  1.00000
18   hdd 0.04880         osd.18    up  1.00000  1.00000
…
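Optionally, you can extract the same OSD IDs programmatically instead of copying them by hand; a minimal sketch, assuming the jq utility is available inside the cephadm shell:
# Optional sketch: print the IDs of the OSDs that belong to the host being
# removed, one per line (for the example host above: 1, 7, 11, 15, 18).
[ceph: root@dcn2-computehci2-1 ~]# ceph osd tree -f json -c /etc/ceph/dcn2.conf | \
    jq -r '.nodes[] | select(.type=="host" and .name=="dcn2-computehci2-1") | .children[]'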
Use SSH to connect to another node in the same cluster and remove the monitor from the cluster:
$ sudo cephadm shell --config /etc/ceph/dcn2.conf \
  --keyring /etc/ceph/dcn2.client.admin.keyring
[ceph: root@dcn-computehci2-0]# ceph mon remove dcn2-computehci2-1 -c /etc/ceph/dcn2.conf
removing mon.dcn2-computehci2-1 at [v2:172.23.3.153:3300/0,v1:172.23.3.153:6789/0], there will be 2 monitors
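Optionally, confirm that the removed node is no longer in the monitor map and that the remaining monitors form a quorum; a quick check from the same cephadm shell:
# Optional check: the monitor map should now list only the two remaining monitors.
[ceph: root@dcn-computehci2-0]# ceph mon stat -c /etc/ceph/dcn2.conf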
Use SSH to log in again to the node that you are removing from the cluster.
Stop and disable the mgr service:
[tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl --type=service | grep ceph
ceph-crash@dcn2-computehci2-1.service   loaded active running Ceph crash dump collector
ceph-mgr@dcn2-computehci2-1.service     loaded active running Ceph Manager
[tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl stop ceph-mgr@dcn2-computehci2-1
[tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl --type=service | grep ceph
ceph-crash@dcn2-computehci2-1.service   loaded active running Ceph crash dump collector
[tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl disable ceph-mgr@dcn2-computehci2-1
Removed /etc/systemd/system/multi-user.target.wants/ceph-mgr@dcn2-computehci2-1.service.
Start the cephadm shell:
$ sudo cephadm shell --config /etc/ceph/dcn2.conf \
  --keyring /etc/ceph/dcn2.client.admin.keyring
Verify that the mgr service for the node is removed from the cluster:
[ceph: root@dcn2-computehci2-1 ~]# ceph -s
  cluster:
    id:     b9b53581-d590-41ac-8463-2f50aa985001
    health: HEALTH_WARN
            3 pools have too many placement groups
            mons are allowing insecure global_id reclaim
  services:
    mon: 2 daemons, quorum dcn2-computehci2-2,dcn2-computehci2-0 (age 2h)
    mgr: dcn2-computehci2-2(active, since 20h), standbys: dcn2-computehci2-0    1
    osd: 15 osds: 15 up (since 3h), 15 in (since 3h)
  data:
    pools:   3 pools, 384 pgs
    objects: 32 objects, 88 MiB
    usage:   16 GiB used, 734 GiB / 750 GiB avail
    pgs:     384 active+clean
1  The node that the mgr service is removed from is no longer listed when the mgr service is successfully removed.
Export the Red Hat Ceph Storage specification:
[ceph: root@dcn2-computehci2-1 ~]# ceph orch ls --export > spec.yml
Edit the specifications in the spec.yml file. Remove all instances of the host <dcn-computehci-node> from spec.yml, including every <dcn-computehci-node> entry under the following (you can confirm the edit with the check after this list):
- service_type: osd
- service_type: mon
- service_type: host
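Optionally, you can confirm that no references to the removed host remain before you reapply the specification; a minimal sketch, assuming the exported file is spec.yml in the current directory of the cephadm shell:
# Optional check: list any lines that still reference the host being removed;
# keep editing spec.yml until this prints nothing.
[ceph: root@dcn2-computehci2-1 /]# grep -n 'dcn2-computehci2-1' spec.yml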
Reapply the Red Hat Ceph Storage specification:
[ceph: root@dcn2-computehci2-1 /]# ceph orch apply -i spec.yml
Remove the OSDs that you identified using ceph osd tree:
[ceph: root@dcn2-computehci2-1 /]# ceph orch osd rm --zap 1 7 11 15 18
Scheduled OSD(s) for removal
Verify the status of the OSDs being removed. Do not continue until the following command returns no output:
[ceph: root@dcn2-computehci2-1 /]# ceph orch osd rm status
OSD_ID  HOST                STATE     PG_COUNT  REPLACE  FORCE  DRAIN_STARTED_AT
1       dcn2-computehci2-1  draining  27        False    False  2021-04-23 21:35:51.215361
7       dcn2-computehci2-1  draining  8         False    False  2021-04-23 21:35:49.111500
11      dcn2-computehci2-1  draining  14        False    False  2021-04-23 21:35:50.243762
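Instead of rerunning the command by hand, you can poll until the drain completes; a minimal sketch, run inside the cephadm shell:
# Optional sketch: poll `ceph orch osd rm status` once a minute until it no
# longer reports any OSDs for the host that is being drained.
while ceph orch osd rm status | grep -q 'dcn2-computehci2-1'; do
    sleep 60
done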
Verify that no daemons remain on the host you are removing:
[ceph: root@dcn2-computehci2-1 /]# ceph orch ps dcn2-computehci2-1
If daemons are still present, you can remove them with the following command:
[ceph: root@dcn2-computehci2-1 /]# ceph orch host drain dcn2-computehci2-1
Remove the <dcn-computehci-node> host from the Red Hat Ceph Storage cluster:
[ceph: root@dcn2-computehci2-1 /]# ceph orch host rm dcn2-computehci2-1
Removed host 'dcn2-computehci2-1'
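Optionally, confirm that the host no longer appears in the cluster inventory; a quick check from the same cephadm shell:
# Optional check: dcn2-computehci2-1 must not appear in the host list.
[ceph: root@dcn2-computehci2-1 /]# ceph orch host ls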
8.2. Removing the Image service (glance) services
Remove image services from a node when you remove it from service.
Procedure
Disable the Image service services by using systemctl on the node that you are removing:
[root@dcn2-computehci2-1 ~]# systemctl stop tripleo_glance_api.service
[root@dcn2-computehci2-1 ~]# systemctl stop tripleo_glance_api_tls_proxy.service
[root@dcn2-computehci2-1 ~]# systemctl disable tripleo_glance_api.service
Removed /etc/systemd/system/multi-user.target.wants/tripleo_glance_api.service.
[root@dcn2-computehci2-1 ~]# systemctl disable tripleo_glance_api_tls_proxy.service
Removed /etc/systemd/system/multi-user.target.wants/tripleo_glance_api_tls_proxy.service.
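Optionally, confirm that the Image service units are stopped and disabled; a quick check using the same unit names:
# Optional check: these should report "inactive" and "disabled" for both units.
[root@dcn2-computehci2-1 ~]# systemctl is-active tripleo_glance_api.service tripleo_glance_api_tls_proxy.service
[root@dcn2-computehci2-1 ~]# systemctl is-enabled tripleo_glance_api.service tripleo_glance_api_tls_proxy.service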
8.3. Removing the Block Storage (cinder) services
You must remove the cinder-volume and etcd services from the DistributedComputeHCI node when you remove it from service.
Procedure
Identify and disable the cinder-volume service on the node you are removing:
(central) [stack@site-undercloud-0 ~]$ openstack volume service list --service cinder-volume
| cinder-volume | dcn2-computehci2-1@tripleo_ceph | az-dcn2 | enabled | up | 2022-03-23T17:41:43.000000 |
(central) [stack@site-undercloud-0 ~]$ openstack volume service set --disable dcn2-computehci2-1@tripleo_ceph cinder-volume
Log on to a different DistributedComputeHCI node in the stack:
$ ssh tripleo-admin@dcn2-computehci2-0
Remove the cinder-volume service associated with the node that you are removing:
[root@dcn2-computehci2-0 ~]# podman exec -it cinder_volume cinder-manage service remove cinder-volume dcn2-computehci2-1@tripleo_ceph
Service cinder-volume on host dcn2-computehci2-1@tripleo_ceph removed.
Stop and disable the tripleo_cinder_volume service on the node that you are removing:
[root@dcn2-computehci2-1 ~]# systemctl stop tripleo_cinder_volume.service
[root@dcn2-computehci2-1 ~]# systemctl disable tripleo_cinder_volume.service
Removed /etc/systemd/system/multi-user.target.wants/tripleo_cinder_volume.service
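Optionally, verify from the central location that the removed host no longer appears in the Block Storage service list; a quick check using the same command as the first step:
# Optional check: dcn2-computehci2-1@tripleo_ceph should no longer be listed
# after the cinder-manage removal above.
(central) [stack@site-undercloud-0 ~]$ openstack volume service list --service cinder-volume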
8.4. Delete the DistributedComputeHCI node
Set the provisioned parameter to a value of false and remove the node from the stack. Disable the nova-compute service and delete the relevant network agent; an example of these cleanup commands follows the procedure.
Procedure
Copy the overcloud-baremetal-deploy.yaml file:
cp /home/stack/dcn2/overcloud-baremetal-deploy.yaml \
  /home/stack/dcn2/baremetal-deployment-scaledown.yaml
Edit the baremetal-deployment-scaledown.yaml file. Identify the host that you want to remove and set the provisioned parameter to false:
instances:
  ...
  - hostname: dcn2-computehci2-1
    provisioned: false
Remove the node from the stack:
openstack overcloud node delete --stack dcn2 --baremetal-deployment /home/stack/dcn2/baremetal-deployment-scaledown.yaml
Optional: If you are going to reuse the node, use ironic to clean the disk. This is required if the node will host Ceph OSDs:
openstack baremetal node manage $UUID
openstack baremetal node clean $UUID --clean-steps '[{"interface":"deploy", "step": "erase_devices_metadata"}]'
openstack baremetal provide $UUID
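Cleaning can take several minutes. The node is ready for reuse when it returns to the available state; a minimal sketch of one way to wait for this, assuming $UUID is still set to the node UUID:
# Optional sketch: poll the provision state every 30 seconds until cleaning
# finishes and the node returns to "available".
while [ "$(openstack baremetal node show $UUID -f value -c provision_state)" != "available" ]; do
    sleep 30
done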
Redeploy the central site. Include all templates that you used for the initial configuration:
openstack overcloud deploy \
  --deployed-server \
  --stack central \
  --templates /usr/share/openstack-tripleo-heat-templates/ \
  -r ~/control-plane/central_roles.yaml \
  -n ~/network-data.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-environment.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/dcn-storage.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/nova-az-config.yaml \
  -e /home/stack/central/overcloud-networks-deployed.yaml \
  -e /home/stack/central/overcloud-vip-deployed.yaml \
  -e /home/stack/central/deployed_metal.yaml \
  -e /home/stack/central/deployed_ceph.yaml \
  -e /home/stack/central/dcn_ceph.yaml \
  -e /home/stack/central/glance_update.yaml
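As noted at the start of this procedure, the nova-compute service and network agent records for the deleted node remain in the control plane until you handle them. A minimal sketch of these cleanup commands with the standard OpenStack CLI, assuming the deleted host was dcn2-computehci2-1.redhat.local:
# Disable the nova-compute service that belonged to the deleted host.
(central) [stack@site-undercloud-0 ~]$ openstack compute service set --disable dcn2-computehci2-1.redhat.local nova-compute
# List the network agents that are still registered for the deleted host, then
# delete each one by its ID.
(central) [stack@site-undercloud-0 ~]$ openstack network agent list --host dcn2-computehci2-1.redhat.local
(central) [stack@site-undercloud-0 ~]$ openstack network agent delete <agent-id>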
8.5. Replacing a removed DistributedComputeHCI node
8.5.1. Replacing a removed DistributedComputeHCI node
To add new HCI nodes to your DCN deployment, you must redeploy the edge stack with the additional node, perform a ceph export of that stack, and then perform a stack update for the central location. A stack update of the central location adds configurations specific to edge sites.
Prerequisites
The node counts are correct in the nodes_data.yaml file of the stack that you want to replace the node in or add a new node to.
Procedure
You must set the EtcdInitialClusterState parameter to existing in one of the templates called by your deploy script:
parameter_defaults:
  EtcdInitialClusterState: existing
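For example, you can put the snippet in its own environment file and pass that file to your deploy script with the -e option. A minimal sketch, assuming a hypothetical file name of ~/dcn2/etcd-initial-cluster-state.yaml:
# A minimal sketch: write the snippet to a hypothetical environment file, then
# pass it to the deploy command in overcloud_deploy_dcn2.sh with -e.
cat > ~/dcn2/etcd-initial-cluster-state.yaml <<'EOF'
parameter_defaults:
  EtcdInitialClusterState: existing
EOF
# Add "-e ~/dcn2/etcd-initial-cluster-state.yaml" to the openstack overcloud deploy
# arguments in overcloud_deploy_dcn2.sh before rerunning the script.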
Redeploy using the deployment script specific to the stack:
(undercloud) [stack@site-undercloud-0 ~]$ ./overcloud_deploy_dcn2.sh
…
Overcloud Deployed without error
Export the Red Hat Ceph Storage data from the stack:
(undercloud) [stack@site-undercloud-0 ~]$ sudo -E openstack overcloud export ceph --stack dcn1,dcn2 --config-download-dir /var/lib/mistral --output-file ~/central/dcn2_scale_up_ceph_external.yaml
Replace dcn_ceph_external.yaml with the newly generated dcn2_scale_up_ceph_external.yaml in the deploy script for the central location.
Perform a stack update at central:
(undercloud) [stack@site-undercloud-0 ~]$ ./overcloud_deploy.sh
...
Overcloud Deployed without error
8.6. Verify the functionality of a replaced DistributedComputeHCI node
Ensure that the value of the Status field is enabled, and that the value of the State field is up:
(central) [stack@site-undercloud-0 ~]$ openstack compute service list -c Binary -c Host -c Zone -c Status -c State
+----------------+-----------------------------------------+------------+---------+-------+
| Binary         | Host                                    | Zone       | Status  | State |
+----------------+-----------------------------------------+------------+---------+-------+
...
| nova-compute   | dcn1-compute1-0.redhat.local            | az-dcn1    | enabled | up    |
| nova-compute   | dcn1-compute1-1.redhat.local            | az-dcn1    | enabled | up    |
| nova-compute   | dcn2-computehciscaleout2-0.redhat.local | az-dcn2    | enabled | up    |
| nova-compute   | dcn2-computehci2-0.redhat.local         | az-dcn2    | enabled | up    |
| nova-compute   | dcn2-computescaleout2-0.redhat.local    | az-dcn2    | enabled | up    |
| nova-compute   | dcn2-computehci2-2.redhat.local         | az-dcn2    | enabled | up    |
...
Ensure that all network agents are in the up state:
(central) [stack@site-undercloud-0 ~]$ openstack network agent list -c "Agent Type" -c Host -c Alive -c State
+--------------------+-----------------------------------------+-------+-------+
| Agent Type         | Host                                    | Alive | State |
+--------------------+-----------------------------------------+-------+-------+
| DHCP agent         | dcn3-compute3-1.redhat.local            | :-)   | UP    |
| Open vSwitch agent | central-computehci0-1.redhat.local      | :-)   | UP    |
| DHCP agent         | dcn3-compute3-0.redhat.local            | :-)   | UP    |
| DHCP agent         | central-controller0-2.redhat.local      | :-)   | UP    |
| Open vSwitch agent | dcn3-compute3-1.redhat.local            | :-)   | UP    |
| Open vSwitch agent | dcn1-compute1-1.redhat.local            | :-)   | UP    |
| Open vSwitch agent | central-computehci0-0.redhat.local      | :-)   | UP    |
| DHCP agent         | central-controller0-1.redhat.local      | :-)   | UP    |
| L3 agent           | central-controller0-2.redhat.local      | :-)   | UP    |
| Metadata agent     | central-controller0-1.redhat.local      | :-)   | UP    |
| Open vSwitch agent | dcn2-computescaleout2-0.redhat.local    | :-)   | UP    |
| Open vSwitch agent | dcn2-computehci2-5.redhat.local         | :-)   | UP    |
| Open vSwitch agent | central-computehci0-2.redhat.local      | :-)   | UP    |
| DHCP agent         | central-controller0-0.redhat.local      | :-)   | UP    |
| Open vSwitch agent | central-controller0-1.redhat.local      | :-)   | UP    |
| Open vSwitch agent | dcn2-computehci2-0.redhat.local         | :-)   | UP    |
| Open vSwitch agent | dcn1-compute1-0.redhat.local            | :-)   | UP    |
...
Verify the status of the Ceph cluster:
Use SSH to connect to the new DistributedComputeHCI node and check the status of the Ceph cluster:
[root@dcn2-computehci2-5 ~]# podman exec -it ceph-mon-dcn2-computehci2-5 \
  ceph -s -c /etc/ceph/dcn2.conf
Verify that both the ceph mon and ceph mgr services exist for the new node:
services:
  mon: 3 daemons, quorum dcn2-computehci2-2,dcn2-computehci2-0,dcn2-computehci2-5 (age 3d)
  mgr: dcn2-computehci2-2(active, since 3d), standbys: dcn2-computehci2-0, dcn2-computehci2-5
  osd: 20 osds: 20 up (since 3d), 20 in (since 3d)
Verify the status of the Ceph OSDs with ceph osd tree. Ensure that all OSDs for the new node have a STATUS of up:
[root@dcn2-computehci2-5 ~]# podman exec -it ceph-mon-dcn2-computehci2-5 ceph osd tree -c /etc/ceph/dcn2.conf
ID  CLASS  WEIGHT   TYPE NAME                            STATUS  REWEIGHT  PRI-AFF
-1         0.97595  root default
-5         0.24399      host dcn2-computehci2-0
 0    hdd  0.04880          osd.0                            up   1.00000  1.00000
 4    hdd  0.04880          osd.4                            up   1.00000  1.00000
 8    hdd  0.04880          osd.8                            up   1.00000  1.00000
13    hdd  0.04880          osd.13                           up   1.00000  1.00000
17    hdd  0.04880          osd.17                           up   1.00000  1.00000
-9         0.24399      host dcn2-computehci2-2
 3    hdd  0.04880          osd.3                            up   1.00000  1.00000
 5    hdd  0.04880          osd.5                            up   1.00000  1.00000
10    hdd  0.04880          osd.10                           up   1.00000  1.00000
14    hdd  0.04880          osd.14                           up   1.00000  1.00000
19    hdd  0.04880          osd.19                           up   1.00000  1.00000
-3         0.24399      host dcn2-computehci2-5
 1    hdd  0.04880          osd.1                            up   1.00000  1.00000
 7    hdd  0.04880          osd.7                            up   1.00000  1.00000
11    hdd  0.04880          osd.11                           up   1.00000  1.00000
15    hdd  0.04880          osd.15                           up   1.00000  1.00000
18    hdd  0.04880          osd.18                           up   1.00000  1.00000
-7         0.24399      host dcn2-computehciscaleout2-0
 2    hdd  0.04880          osd.2                            up   1.00000  1.00000
 6    hdd  0.04880          osd.6                            up   1.00000  1.00000
 9    hdd  0.04880          osd.9                            up   1.00000  1.00000
12    hdd  0.04880          osd.12                           up   1.00000  1.00000
16    hdd  0.04880          osd.16                           up   1.00000  1.00000
Verify that the cinder-volume service for the new DistributedComputeHCI node has a Status of enabled and a State of up:
(central) [stack@site-undercloud-0 ~]$ openstack volume service list --service cinder-volume -c Binary -c Host -c Zone -c Status -c State
+---------------+---------------------------------+------------+---------+-------+
| Binary        | Host                            | Zone       | Status  | State |
+---------------+---------------------------------+------------+---------+-------+
| cinder-volume | hostgroup@tripleo_ceph          | az-central | enabled | up    |
| cinder-volume | dcn1-compute1-1@tripleo_ceph    | az-dcn1    | enabled | up    |
| cinder-volume | dcn1-compute1-0@tripleo_ceph    | az-dcn1    | enabled | up    |
| cinder-volume | dcn2-computehci2-0@tripleo_ceph | az-dcn2    | enabled | up    |
| cinder-volume | dcn2-computehci2-2@tripleo_ceph | az-dcn2    | enabled | up    |
| cinder-volume | dcn2-computehci2-5@tripleo_ceph | az-dcn2    | enabled | up    |
+---------------+---------------------------------+------------+---------+-------+
Note
If the State of the cinder-volume service is down, then the service has not been started on the node.
Use SSH to connect to the new DistributedComputeHCI node and check the status of the Glance services with systemctl:
[root@dcn2-computehci2-5 ~]# systemctl --type service | grep glance
tripleo_glance_api.service              loaded active     running       glance_api container
tripleo_glance_api_healthcheck.service  loaded activating start   start glance_api healthcheck
tripleo_glance_api_tls_proxy.service    loaded active     running       glance_api_tls_proxy container
8.7. Troubleshooting DistributedComputeHCI state down
If the replacement node was deployed without the EtcdInitialClusterState parameter value set to existing, then the cinder-volume service of the replaced node shows down when you run openstack volume service list.
Procedure
Log on to the replacement node and check the logs for the etcd service in the /var/log/containers/stdouts/etcd.log log file. Confirm that the logs show the etcd service reporting a cluster ID mismatch:
2022-04-06T18:00:11.834104130+00:00 stderr F 2022-04-06 18:00:11.834045 E | rafthttp: request cluster ID mismatch (got 654f4cf0e2cfb9fd want 918b459b36fe2c0c)
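You can search the log for this error directly; a quick check, run as root on the replacement node:
# Optional check: any match indicates that the replacement node formed its own
# etcd cluster instead of joining the existing one.
[root@dcn2-computehci2-4 ~]# grep -i 'cluster ID mismatch' /var/log/containers/stdouts/etcd.log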
Set the EtcdInitialClusterState parameter to the value of existing in your deployment templates and rerun the deployment script.
Use SSH to connect to the replacement node and run the following commands as root:
[root@dcn2-computehci2-4 ~]# systemctl stop tripleo_etcd
[root@dcn2-computehci2-4 ~]# rm -rf /var/lib/etcd/*
[root@dcn2-computehci2-4 ~]# systemctl start tripleo_etcd
Recheck the /var/log/containers/stdouts/etcd.log log file to verify that the node successfully joined the cluster:
2022-04-06T18:24:22.130059875+00:00 stderr F 2022-04-06 18:24:22.129395 I | etcdserver/membership: added member 96f61470cd1839e5 [https://dcn2-computehci2-4.internalapi.redhat.local:2380] to cluster 654f4cf0e2cfb9fd
Check the state of the cinder-volume service, and confirm that it reads up on the replacement node when you run openstack volume service list.