Chapter 20. Replacing Controller nodes
In certain circumstances a Controller node in a high availability cluster might fail. In these situations, you must remove the node from the cluster and replace it with a new Controller node.
Complete the steps in this section to replace a Controller node. The Controller node replacement process involves running the openstack overcloud deploy command to update the overcloud with a request to replace a Controller node.
The following procedure applies only to high availability environments. Do not use this procedure if you are using only one Controller node.
20.1. Preparing for Controller replacement
Before you replace an overcloud Controller node, it is important to check the current state of your Red Hat OpenStack Platform environment. Checking the current state can help avoid complications during the Controller replacement process. Use the following list of preliminary checks to determine if it is safe to perform a Controller node replacement. Run all commands for these checks on the undercloud.
Procedure
Check the current status of the overcloud stack on the undercloud:
$ source stackrc
$ openstack overcloud status
Only continue if the overcloud stack has a deployment status of DEPLOY_SUCCESS.
Install the database client tools:
$ sudo dnf -y install mariadb
Configure root user access to the database:
$ sudo cp /var/lib/config-data/puppet-generated/mysql/root/.my.cnf /root/.
Perform a backup of the undercloud databases:
$ mkdir /home/stack/backup
$ sudo mysqldump --all-databases --quick --single-transaction | gzip > /home/stack/backup/dump_db_undercloud.sql.gz
Check that your undercloud has at least 10 GB of free storage to accommodate image caching and conversion when you provision the new node:
$ df -h
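The free-space check above can be scripted. The following is a minimal sketch; the helper name is hypothetical and the `/var` path is an assumption about where images are staged:

```shell
# Hypothetical helper (not part of the product tooling): succeed when the
# available space, in kilobytes as reported by `df -k --output=avail`,
# covers the required number of gigabytes.
has_free_gb() {
  avail_kb=$1
  needed_gb=$2
  [ "$avail_kb" -ge $((needed_gb * 1024 * 1024)) ]
}

# Example wiring on the undercloud (assumption: images are staged under /var):
#   avail=$(df -k --output=avail /var | tail -1)
#   has_free_gb "$avail" 10 || echo "WARNING: less than 10 GB free"
```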
If you are reusing the IP address for the new Controller node, ensure that you delete the port used by the old Controller node:
$ openstack port delete <port>
Check the status of Pacemaker on the running Controller nodes. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to view the Pacemaker status:
$ ssh tripleo-admin@192.168.0.47 'sudo pcs status'
The output shows all services that are running on the existing nodes and those that are stopped on the failed node.
Check the following parameters on each node of the overcloud MariaDB cluster:
- wsrep_local_state_comment: Synced
- wsrep_cluster_size: 2
Use the following command to check these parameters on each running Controller node. In this example, the Controller node IP addresses are 192.168.0.47 and 192.168.0.46:
$ for i in 192.168.0.46 192.168.0.47 ; do echo "*** $i ***" ; ssh tripleo-admin@$i "sudo podman exec \$(sudo podman ps --filter name=galera-bundle -q) mysql -e \"SHOW STATUS LIKE 'wsrep_local_state_comment'; SHOW STATUS LIKE 'wsrep_cluster_size';\""; done
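Extracting a single value from the `SHOW STATUS` output can be done with a small filter. This is a sketch; the helper name is hypothetical, and it assumes the tab-separated `Variable_name<TAB>Value` format that the mysql client prints:

```shell
# Hypothetical parser for the SHOW STATUS output, which the mysql client
# prints as tab-separated "Variable_name<TAB>Value" lines:
# print the value for one key.
wsrep_value() {
  awk -v k="$1" '$1 == k { print $2 }'
}

# Example, piping in the output of the check above:
#   ... mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';" | wsrep_value wsrep_cluster_size
```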
Check the RabbitMQ status. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to view the RabbitMQ status:
$ ssh tripleo-admin@192.168.0.47 "sudo podman exec \$(sudo podman ps -f name=rabbitmq-bundle -q) rabbitmqctl cluster_status"
The running_nodes key should show only the two available nodes and not the failed node.
If fencing is enabled, disable it. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to check the status of fencing:
$ ssh tripleo-admin@192.168.0.47 "sudo pcs property show stonith-enabled"
Run the following command to disable fencing:
$ ssh tripleo-admin@192.168.0.47 "sudo pcs property set stonith-enabled=false"
Log in to the failed Controller node and stop all the nova_* containers that are running:
$ sudo systemctl stop tripleo_nova_api.service
$ sudo systemctl stop tripleo_nova_api_cron.service
$ sudo systemctl stop tripleo_nova_compute.service
$ sudo systemctl stop tripleo_nova_conductor.service
$ sudo systemctl stop tripleo_nova_metadata.service
$ sudo systemctl stop tripleo_nova_placement.service
$ sudo systemctl stop tripleo_nova_scheduler.service
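The commands above can also be written as a loop. This is a sketch; the `nova_units` helper is hypothetical and simply enumerates the same seven units listed above:

```shell
# Hypothetical helper: build the list of nova service units stopped above.
nova_units() {
  for svc in api api_cron compute conductor metadata placement scheduler; do
    echo "tripleo_nova_${svc}.service"
  done
}

# On the failed Controller node you would then run:
#   for unit in $(nova_units); do sudo systemctl stop "$unit"; done
```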
Optional: If you are using the Bare Metal service (ironic) as the virt driver, you must manually update the service entries in your cell database for any bare metal instances whose instances.host is set to the Controller node that you are removing. Contact Red Hat Support for assistance.
Note
This manual update of the cell database when using the Bare Metal service (ironic) as the virt driver is a temporary workaround to ensure that the nodes are rebalanced, until BZ2017980 is complete.
20.2. Removing a Ceph Monitor daemon
If your Controller node is running a Ceph monitor service, complete the following steps to remove the ceph-mon daemon.
Adding a new Controller node to the cluster also adds a new Ceph monitor daemon automatically.
Procedure
Connect to the Controller node that you want to replace:
$ ssh tripleo-admin@192.168.0.47
List the Ceph mon services:
$ sudo systemctl --type=service | grep ceph
ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@crash.controller-0.service loaded active running Ceph crash.controller-0 for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mgr.controller-0.mufglq.service loaded active running Ceph mgr.controller-0.mufglq for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mon.controller-0.service loaded active running Ceph mon.controller-0 for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@rgw.rgw.controller-0.ikaevh.service loaded active running Ceph rgw.rgw.controller-0.ikaevh for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
Stop the Ceph mon service:
$ sudo systemctl stop ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mon.controller-0.service
Disable the Ceph mon service:
$ sudo systemctl disable ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mon.controller-0.service
- Disconnect from the Controller node that you want to replace.
Use SSH to connect to another Controller node in the same cluster:
$ ssh tripleo-admin@192.168.0.46
The Ceph specification file is modified and applied later in this procedure. To manipulate the file, you must export it:
$ sudo cephadm shell -- ceph orch ls --export > spec.yaml
Remove the monitor from the cluster:
$ sudo cephadm shell -- ceph mon remove controller-0
removing mon.controller-0 at [v2:172.23.3.153:3300/0,v1:172.23.3.153:6789/0], there will be 2 monitors
Disconnect from the Controller node and log back into the Controller node you are removing from the cluster:
$ ssh tripleo-admin@192.168.0.47
List the Ceph mgr services:
$ sudo systemctl --type=service | grep ceph
ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@crash.controller-0.service loaded active running Ceph crash.controller-0 for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mgr.controller-0.mufglq.service loaded active running Ceph mgr.controller-0.mufglq for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@rgw.rgw.controller-0.ikaevh.service loaded active running Ceph rgw.rgw.controller-0.ikaevh for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
Stop the Ceph mgr service:
$ sudo systemctl stop ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mgr.controller-0.mufglq.service
Disable the Ceph mgr service:
$ sudo systemctl disable ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mgr.controller-0.mufglq.service
Start a cephadm shell:
$ sudo cephadm shell
Verify that the Ceph mgr service for the Controller node is removed from the cluster:
$ ceph -s
  cluster:
    id:     b9b53581-d590-41ac-8463-2f50aa985001
    health: HEALTH_OK
  services:
    mon: 2 daemons, quorum controller-2,controller-1 (age 2h)
    mgr: controller-2(active, since 20h), standbys: controller-1
    osd: 15 osds: 15 up (since 3h), 15 in (since 3h)
  data:
    pools:   3 pools, 384 pgs
    objects: 32 objects, 88 MiB
    usage:   16 GiB used, 734 GiB / 750 GiB avail
    pgs:     384 active+clean
The node is not listed if the Ceph mgr service is successfully removed.
Export the Red Hat Ceph Storage specification:
$ ceph orch ls --export > spec.yaml
- In the spec.yaml specification file, remove all instances of the host, for example controller-0, from the service_type: mon and service_type: mgr sections.
Reapply the Red Hat Ceph Storage specification:
$ ceph orch apply -i spec.yaml
Verify that no Ceph daemons remain on the removed host:
$ ceph orch ps controller-0
Note
If daemons are present, use the following command to remove them:
$ ceph orch host drain controller-0
Prior to running the ceph orch host drain command, back up the contents of /etc/ceph. Restore the contents after running the ceph orch host drain command. You must back up before you run the ceph orch host drain command until https://bugzilla.redhat.com/show_bug.cgi?id=2153827 is resolved.
Remove the controller-0 host from the Red Hat Ceph Storage cluster:
$ ceph orch host rm controller-0
Removed host 'controller-0'
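The backup-and-restore workaround described in the note above can be sketched as follows. The helper names are hypothetical, the archive path under /home/stack is an assumption, and SUDO is only a hook so the logic can be exercised unprivileged:

```shell
# Sketch: preserve a directory (normally /etc/ceph) before running
# `ceph orch host drain`, and restore it afterwards.
SUDO="${SUDO-sudo}"   # set SUDO="" to run unprivileged, e.g. in a test

backup_dir() {        # backup_dir <dir> <archive>
  $SUDO tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}

restore_dir() {       # restore_dir <archive> <parent-dir>
  $SUDO tar -xzf "$1" -C "$2"
}

# On the Controller node (assumption: /home/stack is writable):
#   backup_dir /etc/ceph /home/stack/ceph-backup.tar.gz
#   ceph orch host drain controller-0
#   restore_dir /home/stack/ceph-backup.tar.gz /etc
```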
Exit the cephadm shell:
$ exit
Additional Resources
For more information on controlling Red Hat Ceph Storage services with systemd, see Understanding process management for Ceph
For more information on editing and applying Red Hat Ceph Storage specification files, see Deploying the Ceph monitor daemons using the service specification
20.3. Preparing the cluster for Controller node replacement
Before you replace the node, ensure that Pacemaker is not running on the node and then remove that node from the Pacemaker cluster.
Procedure
To view the list of IP addresses for the Controller nodes, run the following command:
(undercloud)$ metalsmith -c Hostname -c "IP Addresses" list
+------------------------+-----------------------+
| Hostname               | IP Addresses          |
+------------------------+-----------------------+
| overcloud-compute-0    | ctlplane=192.168.0.44 |
| overcloud-controller-0 | ctlplane=192.168.0.47 |
| overcloud-controller-1 | ctlplane=192.168.0.45 |
| overcloud-controller-2 | ctlplane=192.168.0.46 |
+------------------------+-----------------------+
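If you need the IP of a single node in a script, the table can be filtered. This is a sketch; the helper name is hypothetical and it assumes the pipe-delimited table layout shown above:

```shell
# Hypothetical helper: pull the ctlplane IP for one hostname out of the
# pipe-delimited `metalsmith -c Hostname -c "IP Addresses" list` table.
ctlplane_ip() {
  awk -F'|' -v h="$1" '$2 ~ h { split($3, a, "="); gsub(/ /, "", a[2]); print a[2] }'
}

# Usage sketch:
#   metalsmith -c Hostname -c "IP Addresses" list | ctlplane_ip overcloud-controller-0
```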
Log in to the node and confirm the Pacemaker status. If Pacemaker is running, use the pcs cluster command to stop Pacemaker. This example stops Pacemaker on overcloud-controller-0:
(undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs status | grep -w Online | grep -w overcloud-controller-0"
(undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs cluster stop overcloud-controller-0"
Note
If the node is physically unavailable or stopped, it is not necessary to perform the previous operation, because Pacemaker is already stopped on that node.
After you stop Pacemaker on the node, delete the node from the Pacemaker cluster. The following example logs in to overcloud-controller-1 to remove overcloud-controller-0:
(undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs cluster node remove overcloud-controller-0"
If the node that you want to replace is unreachable (for example, due to a hardware failure), run the pcs command with the additional --skip-offline and --force options to forcibly remove the node from the cluster:
(undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs cluster node remove overcloud-controller-0 --skip-offline --force"
After you remove the node from the Pacemaker cluster, remove the node from the list of known hosts in Pacemaker:
(undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs host deauth overcloud-controller-0"
You can run this command whether the node is reachable or not.
To ensure that the new Controller node uses the correct STONITH fencing device after replacement, delete the devices from the node by entering the following command:
(undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs stonith delete <stonith_resource_name>"
- Replace <stonith_resource_name> with the name of the STONITH resource that corresponds to the node. The resource name uses the format <resource_agent>-<host_mac>. You can find the resource agent and the host MAC address in the FencingConfig section of the fencing.yaml file.
The overcloud database must continue to run during the replacement procedure. To ensure that Pacemaker does not stop Galera during this procedure, select a running Controller node and run the following command on the undercloud with the IP address of the Controller node:
(undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs resource unmanage galera-bundle"
Remove the OVN northbound database server for the replaced Controller node from the cluster:
Obtain the server ID of the OVN northbound database server to be replaced:
$ ssh tripleo-admin@<controller_ip> sudo podman exec ovn_cluster_north_db_server ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound 2>/dev/null|grep -A4 Servers:
Replace <controller_ip> with the IP address of any active Controller node.
You should see output similar to the following:
Servers:
96da (96da at tcp:172.17.1.55:6643) (self) next_index=26063 match_index=26063
466b (466b at tcp:172.17.1.51:6643) next_index=26064 match_index=26063 last msg 2936 ms ago
ba77 (ba77 at tcp:172.17.1.52:6643) next_index=26064 match_index=26063 last msg 2936 ms ago
In this example, 172.17.1.55 is the internal IP address of the Controller node that is being replaced, so the northbound database server ID is 96da.
Using the server ID you obtained in the preceding step, remove the OVN northbound database server by running the following command:
$ ssh tripleo-admin@172.17.1.52 sudo podman exec ovn_cluster_north_db_server ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/kick OVN_Northbound 96da
In this example, you would replace 172.17.1.52 with the IP address of any active Controller node, and replace 96da with the server ID of the OVN northbound database server.
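The server-ID lookup in the preceding steps can be scripted rather than read off by eye. This is a sketch; the helper name is hypothetical, and it assumes the `Servers:` listing format shown above (the same approach works for the southbound database below):

```shell
# Hypothetical helper: given cluster/status output, print the short server ID
# of the first entry whose address contains the given IP.
ovn_server_id() {
  awk -v ip="$1" '$0 ~ ("tcp:" ip ":") { print $1; exit }'
}

# Usage sketch with the status command above:
#   ssh tripleo-admin@<controller_ip> sudo podman exec ovn_cluster_north_db_server \
#       ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound \
#       | ovn_server_id 172.17.1.55
```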
Remove the OVN southbound database server for the replaced Controller node from the cluster:
Obtain the server ID of the OVN southbound database server to be replaced:
$ ssh tripleo-admin@<controller_ip> sudo podman exec ovn_cluster_south_db_server ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound 2>/dev/null | grep -A4 Servers:
Replace <controller_ip> with the IP address of any active Controller node.
You should see output similar to the following:
Servers:
e544 (e544 at tcp:172.17.1.55:6644) last msg 42802690 ms ago
17ca (17ca at tcp:172.17.1.51:6644) last msg 5281 ms ago
6e52 (6e52 at tcp:172.17.1.52:6644) (self)
In this example, 172.17.1.55 is the internal IP address of the Controller node that is being replaced, so the southbound database server ID is e544.
Using the server ID you obtained in the preceding step, remove the OVN southbound database server by running the following command:
$ ssh tripleo-admin@172.17.1.52 sudo podman exec ovn_cluster_south_db_server ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/kick OVN_Southbound e544
In this example, you would replace 172.17.1.52 with the IP address of any active Controller node, and replace e544 with the server ID of the OVN southbound database server.
Run the following clean-up commands to prevent the replaced node from rejoining the cluster. Replace <replaced_controller_ip> with the IP address of the Controller node that you are replacing:
$ ssh tripleo-admin@<replaced_controller_ip> sudo systemctl disable --now tripleo_ovn_cluster_south_db_server.service tripleo_ovn_cluster_north_db_server.service
$ ssh tripleo-admin@<replaced_controller_ip> sudo rm -rfv /var/lib/openvswitch/ovn/.ovn* /var/lib/openvswitch/ovn/ovn*.db
20.4. Replacing a bootstrap Controller node
If you want to replace the Controller node that you use for bootstrap operations and keep the node name, complete the following steps to set the name of the bootstrap Controller node after the replacement process.
Currently, when a bootstrap Controller node is replaced, the OVN database cluster is partitioned into two database clusters for both the northbound and southbound databases. This situation makes instances unusable.
To find the name of the bootstrap Controller node, run the following command:
ssh tripleo-admin@<controller_ip> "sudo hiera -c /etc/puppet/hiera.yaml ovn_dbs_short_bootstrap_node_name"
Workaround: Do not reuse the original bootstrap node hostname and IP address for the new Controller node. RHOSP director sorts the hostnames and then selects the first hostname in the list as the bootstrap node. Choose a name for the new Controller node so that it does not become the first hostname after sorting.
You can track the progress of the fix for this known issue in BZ 2222543.
Procedure
Find the name of the bootstrap Controller node by running the following command:
ssh tripleo-admin@<controller_ip> "sudo hiera -c /etc/puppet/hiera.yaml pacemaker_short_bootstrap_node_name"
- Replace <controller_ip> with the IP address of any active Controller node.
Check if your environment files include the ExtraConfig section. If the ExtraConfig parameter does not exist, create the environment file ~/templates/bootstrap-controller.yaml and add the following content:
parameter_defaults:
  ExtraConfig:
    pacemaker_short_bootstrap_node_name: NODE_NAME
    mysql_short_bootstrap_node_name: NODE_NAME
Replace NODE_NAME with the name of an existing Controller node that you want to use in bootstrap operations after the replacement process.
If your environment files already include the ExtraConfig parameter, add only the lines that set the pacemaker_short_bootstrap_node_name and mysql_short_bootstrap_node_name parameters.
For information about troubleshooting the bootstrap Controller node replacement, see the article Replacement of the first Controller node fails at step 1 if the same hostname is used for a new node.
20.5. Unprovisioning and removing Controller nodes
To unprovision and remove Controller nodes, complete the following steps.
Procedure
Source the stackrc file:
$ source ~/stackrc
Identify the UUID of the overcloud-controller-0 node:
(undercloud)$ NODE=$(metalsmith -c UUID -f value show overcloud-controller-0)
Set the node to maintenance mode:
$ openstack baremetal node maintenance set $NODE
Copy the overcloud-baremetal-deploy.yaml file:
$ cp /home/stack/templates/overcloud-baremetal-deploy.yaml /home/stack/templates/unprovision_controller-0.yaml
In the unprovision_controller-0.yaml file, lower the Controller count to unprovision the Controller node that you are replacing. In this example, the count is reduced from 3 to 2. Move the controller-0 node to the instances dictionary and set the provisioned parameter to false:
- name: Controller
  count: 2
  hostname_format: controller-%index%
  defaults:
    resource_class: BAREMETAL.controller
    networks: [ ... ]
  instances:
  - hostname: controller-0
    name: <IRONIC_NODE_UUID_or_NAME>
    provisioned: false
- name: Compute
  count: 2
  hostname_format: compute-%index%
  defaults:
    resource_class: BAREMETAL.compute
    networks: [ ... ]
Run the node unprovision command:
$ openstack overcloud node delete \
  --stack overcloud \
  --baremetal-deployment /home/stack/templates/unprovision_controller-0.yaml
The following nodes will be unprovisioned:
+--------------+-------------------------+--------------------------------------+
| hostname     | name                    | id                                   |
+--------------+-------------------------+--------------------------------------+
| controller-0 | baremetal-35400-leaf1-2 | b0d5abf7-df28-4ae7-b5da-9491e84c21ac |
+--------------+-------------------------+--------------------------------------+
Are you sure you want to unprovision these overcloud nodes and ports [y/N]?
Optional: Delete the ironic node:
$ openstack baremetal node delete <IRONIC_NODE_UUID>
- Replace <IRONIC_NODE_UUID> with the UUID of the node.
20.6. Deploying a new Controller node to the overcloud
To deploy a new Controller node to the overcloud, complete the following steps.
Prerequisites
- The new Controller node must be registered, inspected, and tagged ready for provisioning. For more information, see Provisioning bare metal overcloud nodes
Procedure
Log in to director and source the stackrc credentials file:
$ source ~/stackrc
Provision the overcloud with the original overcloud-baremetal-deploy.yaml environment file:
$ openstack overcloud node provision \
  --stack overcloud \
  --network-config \
  --output /home/stack/templates/overcloud-baremetal-deployed.yaml \
  /home/stack/templates/overcloud-baremetal-deploy.yaml
Note
If you want to use the same scheduling, placement, or IP addresses, you can edit the overcloud-baremetal-deploy.yaml environment file. Set the hostname, name, and networks for the new controller-0 instance in the instances section. For example:
- name: Controller
  count: 3
  hostname_format: controller-%index%
  defaults:
    resource_class: BAREMETAL.controller
    networks:
    - network: external
      subnet: external_subnet
    - network: internal_api
      subnet: internal_api_subnet01
    - network: storage
      subnet: storage_subnet01
    - network: storage_mgmt
      subnet: storage_mgmt_subnet01
    - network: tenant
      subnet: tenant_subnet01
    network_config:
      template: templates/multiple_nics/multiple_nics_dvr.j2
      default_route_network:
      - external
  instances:
  - hostname: controller-0
    name: baremetal-35400-leaf1-2
    networks:
    - network: external
      subnet: external_subnet
      fixed_ip: 10.0.0.224
    - network: internal_api
      subnet: internal_api_subnet01
      fixed_ip: 172.17.0.97
    - network: storage
      subnet: storage_subnet01
      fixed_ip: 172.18.0.24
    - network: storage_mgmt
      subnet: storage_mgmt_subnet01
      fixed_ip: 172.19.0.129
    - network: tenant
      subnet: tenant_subnet01
      fixed_ip: 172.16.0.11
- name: Compute
  count: 2
  hostname_format: compute-%index%
  defaults: [ ... ]
When the node is provisioned, remove the instances section from the overcloud-baremetal-deploy.yaml file.
To create the cephadm user on the new Controller node, export a basic Ceph specification containing the new host information:
$ openstack overcloud ceph spec --stack overcloud \
  /home/stack/templates/overcloud-baremetal-deployed.yaml \
  -o ceph_spec_host.yaml
Note
If your environment uses a custom role, include the --roles-data option.
Add the cephadm user to the new Controller node:
$ openstack overcloud ceph user enable \
  --stack overcloud ceph_spec_host.yaml
Add the new Controller node as a host in the Ceph cluster:
$ sudo cephadm shell \
    -- ceph orch host add controller-3 <IP_ADDRESS> <LABELS>
Inferring fsid 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
Using recent ceph image undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph@sha256:3075e8708792ebd527ca14849b6af4a11256a3f881ab09b837d7af0f8b2102ea
Added host 'controller-3' with addr '192.168.24.31'
In this example, <IP_ADDRESS> and <LABELS> are replaced with 192.168.24.31 and _admin mon mgr.
- Replace <IP_ADDRESS> with the IP address of the Controller node.
- Replace <LABELS> with any required Ceph labels.
Re-run the openstack overcloud deploy command:
$ openstack overcloud deploy --stack overcloud --templates \
  -n /home/stack/templates/network_data.yaml \
  -r /home/stack/templates/roles_data.yaml \
  -e /home/stack/templates/overcloud-baremetal-deployed.yaml \
  -e /home/stack/templates/overcloud-networks-deployed.yaml \
  -e /home/stack/templates/overcloud-vips-deployed.yaml \
  -e /home/stack/templates/bootstrap_node.yaml \
  -e [ ... ]
Note
If the replacement Controller node is the bootstrap node, include the bootstrap_node.yaml environment file.
20.7. Deploying Ceph services on the new Controller node
After you provision a new Controller node and the Ceph monitor services are running, you can deploy the mgr, rgw, and osd Ceph services on the Controller node.
Prerequisites
- The new Controller node is provisioned and is running Ceph monitor services.
Procedure
Export the Red Hat Ceph Storage specification:
$ sudo cephadm shell -- ceph orch ls --export > spec.yml
In the spec.yml specification file, replace the previous Controller node name with the new Controller node name.
Note
Do not use the basic Ceph environment file ceph_spec_host.yaml because it does not contain all necessary cluster information.
Apply the modified Ceph specification file:
$ cat spec.yml | sudo cephadm shell -- ceph orch apply -i -
Inferring fsid 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
Using recent ceph image undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph@sha256:3075e8708792ebd527ca14849b6af4a11256a3f881ab09b837d7af0f8b2102ea
Scheduled crash update...
Scheduled mgr update...
Scheduled mon update...
Scheduled osd.default_drive_group update...
Scheduled rgw.rgw update...
Verify the visibility of the new monitor:
$ sudo cephadm shell -- ceph status
20.8. Cleaning up after Controller node replacement
After you complete the node replacement, you can finalize the Controller cluster.
Procedure
- Log into a Controller node.
Enable Pacemaker management of the Galera cluster and start Galera on the new node:
[tripleo-admin@overcloud-controller-0 ~]$ sudo pcs resource refresh galera-bundle
[tripleo-admin@overcloud-controller-0 ~]$ sudo pcs resource manage galera-bundle
Enable fencing:
[tripleo-admin@overcloud-controller-0 ~]$ sudo pcs property set stonith-enabled=true
Perform a final status check to ensure that the services are running correctly:
[tripleo-admin@overcloud-controller-0 ~]$ sudo pcs status
Note
If any services have failed, use the pcs resource refresh command to resolve and restart the failed services.
Exit to director:
[tripleo-admin@overcloud-controller-0 ~]$ exit
Source the overcloudrc file so that you can interact with the overcloud:
$ source ~/overcloudrc
Check the network agents in your overcloud environment:
(overcloud) $ openstack network agent list
If any agents appear for the old node, remove them:
(overcloud) $ for AGENT in $(openstack network agent list --host overcloud-controller-1.localdomain -c ID -f value) ; do openstack network agent delete $AGENT ; done
If necessary, add your router to the L3 agent host on the new node. Use the following example command to add a router named r1 to the L3 agent with the UUID 2d1c1dc1-d9d4-4fa9-b2c8-f29cd1a649d4:
(overcloud) $ openstack network agent add router --l3 2d1c1dc1-d9d4-4fa9-b2c8-f29cd1a649d4 r1
Clean the cinder services.
List the cinder services:
(overcloud) $ openstack volume service list
Log in to a Controller node, connect to the cinder-api container, and use the cinder-manage service remove command to remove leftover services:
[tripleo-admin@overcloud-controller-0 ~]$ sudo podman exec -it cinder_api cinder-manage service remove cinder-backup <host>
[tripleo-admin@overcloud-controller-0 ~]$ sudo podman exec -it cinder_api cinder-manage service remove cinder-scheduler <host>
Clean the RabbitMQ cluster.
- Log into a Controller node.
Use the podman exec command to launch bash, and verify the status of the RabbitMQ cluster:
[tripleo-admin@overcloud-controller-0 ~]$ podman exec -it rabbitmq-bundle-podman-0 bash
[tripleo-admin@overcloud-controller-0 ~]$ rabbitmqctl cluster_status
Use the rabbitmqctl command to forget the replaced Controller node:
[tripleo-admin@overcloud-controller-0 ~]$ rabbitmqctl forget_cluster_node <node_name>
If you replaced a bootstrap Controller node, you must remove the environment file ~/templates/bootstrap-controller.yaml after the replacement process, or delete the pacemaker_short_bootstrap_node_name and mysql_short_bootstrap_node_name parameters from your existing environment file. This step prevents director from attempting to override the Controller node name in subsequent replacements. For more information, see Replacing a bootstrap Controller node.
If you are using the Object Storage service (swift) on the overcloud, you must synchronize the swift rings after updating the overcloud nodes. Use a script, similar to the following example, to distribute ring files from a previously existing Controller node (Controller node 0 in this example) to all Controller nodes and restart the Object Storage service containers on those nodes:
#!/bin/sh
set -xe

SRC="tripleo-admin@overcloud-controller-0.ctlplane"
ALL="tripleo-admin@overcloud-controller-0.ctlplane tripleo-admin@overcloud-controller-1.ctlplane tripleo-admin@overcloud-controller-2.ctlplane"

# Fetch the current set of ring files:
ssh "${SRC}" 'sudo tar -czvf - /var/lib/config-data/puppet-generated/swift_ringbuilder/etc/swift/{*.builder,*.ring.gz,backups/*.builder}' > swift-rings.tar.gz

# Upload rings to all nodes, put them into the correct place, and restart swift services:
for DST in ${ALL}; do
  cat swift-rings.tar.gz | ssh "${DST}" 'sudo tar -C / -xvzf -'
  ssh "${DST}" 'sudo podman restart swift_copy_rings'
  ssh "${DST}" 'sudo systemctl restart tripleo_swift*'
done