Chapter 20. Replacing Controller nodes

In certain circumstances a Controller node in a high availability cluster might fail. In these situations, you must remove the node from the cluster and replace it with a new Controller node.

Complete the steps in this section to replace a Controller node. The Controller node replacement process involves running the openstack overcloud deploy command to update the overcloud with a request to replace a Controller node.

Important

The following procedure applies only to high availability environments. Do not use this procedure if you are using only one Controller node.

20.1. Preparing for Controller replacement

Before you replace an overcloud Controller node, it is important to check the current state of your Red Hat OpenStack Platform environment. Checking the current state can help avoid complications during the Controller replacement process. Use the following list of preliminary checks to determine if it is safe to perform a Controller node replacement. Run all commands for these checks on the undercloud.

Procedure

  1. Check the current status of the overcloud stack on the undercloud:

    $ source stackrc
    $ openstack overcloud status

    Only continue if the overcloud stack has a deployment status of DEPLOY_SUCCESS.

  2. Install the database client tools:

    $ sudo dnf -y install mariadb
  3. Configure root user access to the database:

    $ sudo cp /var/lib/config-data/puppet-generated/mysql/root/.my.cnf /root/.
  4. Perform a backup of the undercloud databases:

    $ mkdir /home/stack/backup
    $ sudo mysqldump --all-databases --quick --single-transaction | gzip > /home/stack/backup/dump_db_undercloud.sql.gz
  5. Check that your undercloud has a minimum of 10 GB of free storage to accommodate image caching and conversion when you provision the new node:

    $ df -h
  6. If you are reusing the IP address for the new Controller node, ensure that you delete the port that the old Controller node used:

    $ openstack port delete <port>
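
    If you do not know the port ID, you can look it up by the fixed IP address of the old Controller node, for example (the IP address is illustrative):

    $ openstack port list --fixed-ip ip-address=192.168.0.47 -c ID -c "Fixed IP Addresses"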
  7. Check the status of Pacemaker on the running Controller nodes. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to view the Pacemaker status:

    $ ssh tripleo-admin@192.168.0.47 'sudo pcs status'

    The output shows all services that are running on the existing nodes and those that are stopped on the failed node.

  8. Check the following parameters on each node of the overcloud MariaDB cluster:

    • wsrep_local_state_comment: Synced
    • wsrep_cluster_size: 2

      Use the following command to check these parameters on each running Controller node. In this example, the Controller node IP addresses are 192.168.0.47 and 192.168.0.46:

      $ for i in 192.168.0.46 192.168.0.47 ; do echo "*** $i ***" ; ssh tripleo-admin@$i "sudo podman exec \$(sudo podman ps --filter name=galera-bundle -q) mysql -e \"SHOW STATUS LIKE 'wsrep_local_state_comment'; SHOW STATUS LIKE 'wsrep_cluster_size';\""; done
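
      If the cluster is healthy, each node reports output similar to the following illustrative example:

      *** 192.168.0.46 ***
      Variable_name             Value
      wsrep_local_state_comment Synced
      Variable_name             Value
      wsrep_cluster_size        2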
  9. Check the RabbitMQ status. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to view the RabbitMQ status:

    $ ssh tripleo-admin@192.168.0.47 "sudo podman exec \$(sudo podman ps -f name=rabbitmq-bundle -q) rabbitmqctl cluster_status"

    The running_nodes key should show only the two available nodes and not the failed node.
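
    Depending on the RabbitMQ version, the relevant part of the output looks similar to the following illustrative example:

    Running Nodes

    rabbit@overcloud-controller-1
    rabbit@overcloud-controller-2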

  10. If fencing is enabled, disable it. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to check the status of fencing:

    $ ssh tripleo-admin@192.168.0.47 "sudo pcs property show stonith-enabled"

    Run the following command to disable fencing:

    $ ssh tripleo-admin@192.168.0.47 "sudo pcs property set stonith-enabled=false"
  11. Log in to the failed Controller node and stop all the nova_* containers that are running:

    $ sudo systemctl stop tripleo_nova_api.service
    $ sudo systemctl stop tripleo_nova_api_cron.service
    $ sudo systemctl stop tripleo_nova_compute.service
    $ sudo systemctl stop tripleo_nova_conductor.service
    $ sudo systemctl stop tripleo_nova_metadata.service
    $ sudo systemctl stop tripleo_nova_placement.service
    $ sudo systemctl stop tripleo_nova_scheduler.service
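
    To confirm that no nova services are still running on the node, you can list the running units; the command should return no tripleo_nova units:

    $ sudo systemctl list-units 'tripleo_nova*' --state=running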
  12. Optional: If you are using the Bare Metal Service (ironic) as the virt driver, you must manually update the service entries in your cell database for any bare metal instances whose instances.host is set to the controller that you are removing. Contact Red Hat Support for assistance.

    Note

    This manual update of the cell database when using Bare Metal Service (ironic) as the virt driver is a temporary workaround to ensure the nodes are rebalanced, until BZ2017980 is complete.

20.2. Removing a Ceph Monitor daemon

If your Controller node is running a Ceph monitor service, complete the following steps to remove the ceph-mon daemon.

Note

Adding a new Controller node to the cluster also adds a new Ceph monitor daemon automatically.

Procedure

  1. Connect to the Controller node that you want to replace:

    $ ssh tripleo-admin@192.168.0.47
  2. List the Ceph mon services:

    $ sudo systemctl --type=service | grep ceph
    ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@crash.controller-0.service          loaded active running Ceph crash.controller-0 for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
    ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mgr.controller-0.mufglq.service     loaded active running Ceph mgr.controller-0.mufglq for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
    ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mon.controller-0.service            loaded active running Ceph mon.controller-0 for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
    ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@rgw.rgw.controller-0.ikaevh.service loaded active running Ceph rgw.rgw.controller-0.ikaevh for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
  3. Stop the Ceph mon service:

    $ sudo systemctl stop ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mon.controller-0.service
  4. Disable the Ceph mon service:

    $ sudo systemctl disable ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mon.controller-0.service
  5. Disconnect from the Controller node that you want to replace.
  6. Use SSH to connect to another Controller node in the same cluster:

    $ ssh tripleo-admin@192.168.0.46
  7. The Ceph specification file is modified and applied later in this procedure. To manipulate the file, you must export it:

    $ sudo cephadm shell -- ceph orch ls --export > spec.yaml
  8. Remove the monitor from the cluster:

    $ sudo cephadm shell -- ceph mon remove controller-0
      removing mon.controller-0 at [v2:172.23.3.153:3300/0,v1:172.23.3.153:6789/0], there will be 2 monitors
  9. Disconnect from the Controller node and log back into the Controller node you are removing from the cluster:

    $ ssh tripleo-admin@192.168.0.47
  10. List the Ceph mgr services:

    $ sudo systemctl --type=service | grep ceph
    ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@crash.controller-0.service          loaded active running Ceph crash.controller-0 for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
    ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mgr.controller-0.mufglq.service     loaded active running Ceph mgr.controller-0.mufglq for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
    ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@rgw.rgw.controller-0.ikaevh.service loaded active running Ceph rgw.rgw.controller-0.ikaevh for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
  11. Stop the Ceph mgr service:

    $ sudo systemctl stop ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mgr.controller-0.mufglq.service
  12. Disable the Ceph mgr service:

    $ sudo systemctl disable ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mgr.controller-0.mufglq.service
  13. Start a cephadm shell:

    $ sudo cephadm shell
  14. Verify that the Ceph mgr service for the Controller node is removed from the cluster:

    $ ceph -s
      cluster:
        id:     b9b53581-d590-41ac-8463-2f50aa985001
        health: HEALTH_OK

      services:
        mon: 2 daemons, quorum controller-2,controller-1 (age 2h)
        mgr: controller-2(active, since 20h), standbys: controller-1
        osd: 15 osds: 15 up (since 3h), 15 in (since 3h)

      data:
        pools:   3 pools, 384 pgs
        objects: 32 objects, 88 MiB
        usage:   16 GiB used, 734 GiB / 750 GiB avail
        pgs:     384 active+clean

    The node is not listed if the Ceph mgr service is successfully removed.

  15. Export the Red Hat Ceph Storage specification:

    $ ceph orch ls --export > spec.yaml
  16. In the spec.yaml specification file, remove all instances of the host that you are removing, for example controller-0, from the service_type: mon and service_type: mgr sections.
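
    For example, an exported mon entry might look similar to the following illustrative snippet; you would delete the controller-0 line here and make the same change in the mgr entry:

    service_type: mon
    service_name: mon
    placement:
      hosts:
      - controller-0
      - controller-1
      - controller-2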
  17. Reapply the Red Hat Ceph Storage specification:

    $ ceph orch apply -i spec.yaml
  18. Verify that no Ceph daemons remain on the removed host:

    $ ceph orch ps controller-0
    Note

    If daemons are present, use the following command to remove them:

    $ ceph orch host drain controller-0

    Before you run the ceph orch host drain command, back up the contents of /etc/ceph, and restore the contents after the drain completes. This backup is required until https://bugzilla.redhat.com/show_bug.cgi?id=2153827 is resolved.
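
    A hedged example of taking and restoring the backup from the Controller node host, outside the cephadm shell (the archive path is illustrative); run the first command before the drain and the second command after the drain completes:

    $ sudo tar -czf /home/tripleo-admin/etc-ceph-backup.tar.gz /etc/ceph
    $ sudo tar -xzf /home/tripleo-admin/etc-ceph-backup.tar.gz -C /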

  19. Remove the controller-0 host from the Red Hat Ceph Storage cluster:

    $ ceph orch host rm controller-0
      Removed host 'controller-0'
  20. Exit the cephadm shell:

    $ exit

Additional Resources

For more information on controlling Red Hat Ceph Storage services with systemd, see Understanding process management for Ceph

For more information on editing and applying Red Hat Ceph Storage specification files, see Deploying the Ceph monitor daemons using the service specification

20.3. Preparing the cluster for Controller node replacement

Before you replace the node, ensure that Pacemaker is not running on the node and then remove that node from the Pacemaker cluster.

Procedure

  1. To view the list of IP addresses for the Controller nodes, run the following command:

    (undercloud)$ metalsmith -c Hostname -c "IP Addresses" list
    +------------------------+-----------------------+
    | Hostname               | IP Addresses          |
    +------------------------+-----------------------+
    | overcloud-compute-0    | ctlplane=192.168.0.44 |
    | overcloud-controller-0 | ctlplane=192.168.0.47 |
    | overcloud-controller-1 | ctlplane=192.168.0.45 |
    | overcloud-controller-2 | ctlplane=192.168.0.46 |
    +------------------------+-----------------------+
  2. Log in to a running Controller node and confirm the Pacemaker status of the node that you want to replace. If Pacemaker is running on that node, use the pcs cluster command to stop it. This example checks and stops Pacemaker on overcloud-controller-0:

    (undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs status | grep -w Online | grep -w overcloud-controller-0"
    (undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs cluster stop overcloud-controller-0"
    Note

    If the node is physically unavailable or stopped, you do not need to perform the previous operation, because Pacemaker is already stopped on that node.

  3. After you stop Pacemaker on the node, delete the node from the pacemaker cluster. The following example logs in to overcloud-controller-1 to remove overcloud-controller-0:

    (undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs cluster node remove overcloud-controller-0"

    If the node that you want to replace is unreachable (for example, due to a hardware failure), run the pcs command with the additional --skip-offline and --force options to forcibly remove the node from the cluster:

    (undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs cluster node remove overcloud-controller-0 --skip-offline --force"
  4. After you remove the node from the pacemaker cluster, remove the node from the list of known hosts in pacemaker:

    (undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs host deauth overcloud-controller-0"

    You can run this command whether the node is reachable or not.

  5. To ensure that the new Controller node uses the correct STONITH fencing device after replacement, delete the devices from the node by entering the following command:

    (undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs stonith delete <stonith_resource_name>"
    • Replace <stonith_resource_name> with the name of the STONITH resource that corresponds to the node. The resource name uses the format <resource_agent>-<host_mac>. You can find the resource agent and the host MAC address in the FencingConfig section of the fencing.yaml file.
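
    If you are unsure of the resource name, you can also list the configured STONITH resources and their status from a running Controller node, for example (the IP address is illustrative):

    (undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs stonith status"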
  6. The overcloud database must continue to run during the replacement procedure. To ensure that Pacemaker does not stop Galera during this procedure, select a running Controller node and run the following command on the undercloud with the IP address of the Controller node:

    (undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs resource unmanage galera-bundle"
  7. Remove the OVN northbound database server for the replaced Controller node from the cluster:

    1. Obtain the server ID of the OVN northbound database server to be replaced:

      $ ssh tripleo-admin@<controller_ip> sudo podman exec ovn_cluster_north_db_server ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound 2>/dev/null|grep -A4 Servers:

      Replace <controller_ip> with the IP address of any active Controller node.

      You should see output similar to the following:

      Servers:
      96da (96da at tcp:172.17.1.55:6643) (self) next_index=26063 match_index=26063
      466b (466b at tcp:172.17.1.51:6643) next_index=26064 match_index=26063 last msg 2936 ms ago
      ba77 (ba77 at tcp:172.17.1.52:6643) next_index=26064 match_index=26063 last msg 2936 ms ago

      In this example, 172.17.1.55 is the internal IP address of the Controller node that is being replaced, so the northbound database server ID is 96da.

    2. Using the server ID you obtained in the preceding step, remove the OVN northbound database server by running the following command:

      $ ssh tripleo-admin@172.17.1.52 sudo podman exec ovn_cluster_north_db_server ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/kick OVN_Northbound 96da

      In this example, you would replace 172.17.1.52 with the IP address of any active Controller node, and replace 96da with the server ID of the OVN northbound database server.

  8. Remove the OVN southbound database server for the replaced Controller node from the cluster:

    1. Obtain the server ID of the OVN southbound database server to be replaced:

      $ ssh tripleo-admin@<controller_ip> sudo podman exec ovn_cluster_south_db_server ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound 2>/dev/null|grep -A4 Servers:

      Replace <controller_ip> with the IP address of any active Controller node.

      You should see output similar to the following:

      Servers:
      e544 (e544 at tcp:172.17.1.55:6644) last msg 42802690 ms ago
      17ca (17ca at tcp:172.17.1.51:6644) last msg 5281 ms ago
      6e52 (6e52 at tcp:172.17.1.52:6644) (self)

      In this example, 172.17.1.55 is the internal IP address of the Controller node that is being replaced, so the southbound database server ID is e544.

    2. Using the server ID you obtained in the preceding step, remove the OVN southbound database server by running the following command:

      $ ssh tripleo-admin@172.17.1.52 sudo podman exec ovn_cluster_south_db_server ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/kick OVN_Southbound e544

      In this example, you would replace 172.17.1.52 with the IP address of any active Controller node, and replace e544 with the server ID of the OVN southbound database server.

  9. Run the following cleanup commands to prevent the replaced node from rejoining the cluster.

    Substitute <replaced_controller_ip> with the IP address of the Controller node that you are replacing:

    $ ssh tripleo-admin@<replaced_controller_ip> sudo systemctl disable --now tripleo_ovn_cluster_south_db_server.service tripleo_ovn_cluster_north_db_server.service
    
    $ ssh tripleo-admin@<replaced_controller_ip> sudo rm -rfv /var/lib/openvswitch/ovn/.ovn* /var/lib/openvswitch/ovn/ovn*.db
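
    To verify the removal, you can re-run the cluster/status commands from the previous steps on an active Controller node and confirm that the replaced node no longer appears in the Servers list, for example:

    $ ssh tripleo-admin@<controller_ip> sudo podman exec ovn_cluster_north_db_server ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound 2>/dev/null|grep -A4 Servers: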

20.4. Replacing a bootstrap Controller node

If you want to replace the Controller node that you use for bootstrap operations and keep the node name, complete the following steps to set the name of the bootstrap Controller node after the replacement process.

Important

Currently, when a bootstrap Controller node is replaced, the OVN database cluster is partitioned with two database clusters for both the northbound and southbound databases. This situation makes instances unusable.

To find the name of the bootstrap Controller node, run the following command:

ssh tripleo-admin@<controller_ip> "sudo hiera -c /etc/puppet/hiera.yaml ovn_dbs_short_bootstrap_node_name"

Workaround: Do not reuse the original bootstrap node hostname and IP address for the new Controller node. RHOSP director sorts the hostnames and then selects the first hostname in the list as the bootstrap node. Choose a name for the new Controller node so that it does not become the first hostname after sorting. For example, if the remaining Controller nodes are named overcloud-controller-1 and overcloud-controller-2, naming the new node overcloud-controller-3 instead of reusing overcloud-controller-0 prevents it from sorting first.

You can track the progress of the fix for this known issue in BZ 2222543.

Procedure

  1. Find the name of the bootstrap Controller node by running the following command:

    ssh tripleo-admin@<controller_ip> "sudo hiera -c /etc/puppet/hiera.yaml pacemaker_short_bootstrap_node_name"
    • Replace <controller_ip> with the IP address of any active Controller node.
  2. Check whether your environment files include the ExtraConfig parameter. If the ExtraConfig parameter does not exist, create the environment file ~/templates/bootstrap-controller.yaml and add the following content:

    parameter_defaults:
      ExtraConfig:
        pacemaker_short_bootstrap_node_name: NODE_NAME
        mysql_short_bootstrap_node_name: NODE_NAME
    • Replace NODE_NAME with the name of an existing Controller node that you want to use in bootstrap operations after the replacement process.

      If your environment files already include the ExtraConfig parameter, add only the lines that set the pacemaker_short_bootstrap_node_name and mysql_short_bootstrap_node_name parameters.

For information about troubleshooting the bootstrap Controller node replacement, see the article Replacement of the first Controller node fails at step 1 if the same hostname is used for a new node.

20.5. Unprovision and remove Controller nodes

To unprovision and remove Controller nodes, complete the following steps.

Procedure

  1. Source the stackrc file:

    $ source ~/stackrc
  2. Identify the UUID of the overcloud-controller-0 node:

    (undercloud)$ NODE=$(metalsmith -c UUID -f value show overcloud-controller-0)
  3. Set the node to maintenance mode:

    $ openstack baremetal node maintenance set $NODE
  4. Copy the overcloud-baremetal-deploy.yaml file:

    $ cp /home/stack/templates/overcloud-baremetal-deploy.yaml /home/stack/templates/unprovision_controller-0.yaml
  5. In the unprovision_controller-0.yaml file, lower the Controller count to unprovision the Controller node that you are replacing. In this example, the count is reduced from 3 to 2. Move the controller-0 node to the instances dictionary and set the provisioned parameter to false:

    - name: Controller
      count: 2
      hostname_format: controller-%index%
      defaults:
        resource_class: BAREMETAL.controller
        networks:
          [ ... ]
      instances:
      - hostname: controller-0
        name: <IRONIC_NODE_UUID_or_NAME>
        provisioned: false
    - name: Compute
      count: 2
      hostname_format: compute-%index%
      defaults:
        resource_class: BAREMETAL.compute
        networks:
          [ ... ]
  6. Run the node unprovision command:

    $ openstack overcloud node delete \
      --stack overcloud \
      --baremetal-deployment /home/stack/templates/unprovision_controller-0.yaml
    The following nodes will be unprovisioned:
    +--------------+-------------------------+--------------------------------------+
    | hostname     | name                    | id                                   |
    +--------------+-------------------------+--------------------------------------+
    | controller-0 | baremetal-35400-leaf1-2 | b0d5abf7-df28-4ae7-b5da-9491e84c21ac |
    +--------------+-------------------------+--------------------------------------+
    
    Are you sure you want to unprovision these overcloud nodes and ports [y/N]?

Optional

Delete the ironic node:

$ openstack baremetal node delete <IRONIC_NODE_UUID>
  • Replace <IRONIC_NODE_UUID> with the UUID of the node.

20.6. Deploying a new Controller node to the overcloud

To deploy a new Controller node to the overcloud, complete the following steps.

Procedure

  1. Log in to director and source the stackrc credentials file:

    $ source ~/stackrc
  2. Provision the overcloud with the original overcloud-baremetal-deploy.yaml environment file:

    $ openstack overcloud node provision \
      --stack overcloud \
      --network-config \
      --output /home/stack/templates/overcloud-baremetal-deployed.yaml \
      /home/stack/templates/overcloud-baremetal-deploy.yaml
    Note

    If you want to use the same scheduling, placement, or IP addresses, you can edit the overcloud-baremetal-deploy.yaml environment file. Set the hostname, name, and networks for the new controller-0 instance in the instances section. For example:

    - name: Controller
      count: 3
      hostname_format: controller-%index%
      defaults:
        resource_class: BAREMETAL.controller
        networks:
        - network: external
          subnet: external_subnet
        - network: internal_api
          subnet: internal_api_subnet01
        - network: storage
          subnet: storage_subnet01
        - network: storage_mgmt
          subnet: storage_mgmt_subnet01
        - network: tenant
          subnet: tenant_subnet01
        network_config:
          template: templates/multiple_nics/multiple_nics_dvr.j2
          default_route_network:
          - external
      instances:
      - hostname: controller-0
        name: baremetal-35400-leaf1-2
        networks:
        - network: external
          subnet: external_subnet
          fixed_ip: 10.0.0.224
        - network: internal_api
          subnet: internal_api_subnet01
          fixed_ip: 172.17.0.97
        - network: storage
          subnet: storage_subnet01
          fixed_ip: 172.18.0.24
        - network: storage_mgmt
          subnet: storage_mgmt_subnet01
          fixed_ip: 172.19.0.129
        - network: tenant
          subnet: tenant_subnet01
          fixed_ip: 172.16.0.11
    - name: Compute
      count: 2
      hostname_format: compute-%index%
      defaults:
        [ ... ]

    When the node is provisioned, remove the instances section from the overcloud-baremetal-deploy.yaml file.
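
    Optionally, after the provisioning completes, you can confirm that the new node is active, for example:

    $ openstack baremetal node list -c Name -c "Provisioning State"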

  3. To create the cephadm user on the new Controller node, export a basic Ceph specification containing the new host information:

    $ openstack overcloud ceph spec --stack overcloud \
      /home/stack/templates/overcloud-baremetal-deployed.yaml \
      -o ceph_spec_host.yaml
    Note

    If your environment uses a custom role, include the --roles-data option.

  4. Add the cephadm user to the new Controller node:

    $ openstack overcloud ceph user enable \
      --stack overcloud ceph_spec_host.yaml
  5. Add the new Controller node to the Red Hat Ceph Storage cluster:

    $ sudo cephadm shell \
      -- ceph orch host add controller-3 <IP_ADDRESS> <LABELS>
    • Replace <IP_ADDRESS> with the IP address of the Controller node.
    • Replace <LABELS> with any required Ceph labels.

    For example:

    $ sudo cephadm shell \
      -- ceph orch host add controller-3 192.168.24.31 _admin mon mgr
    Inferring fsid 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
    Using recent ceph image undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph@sha256:3075e8708792ebd527ca14849b6af4a11256a3f881ab09b837d7af0f8b2102ea
    Added host 'controller-3' with addr '192.168.24.31'
  6. Re-run the openstack overcloud deploy command:

    $ openstack overcloud deploy --stack overcloud --templates \
        -n /home/stack/templates/network_data.yaml \
        -r /home/stack/templates/roles_data.yaml \
        -e /home/stack/templates/overcloud-baremetal-deployed.yaml \
        -e /home/stack/templates/overcloud-networks-deployed.yaml \
        -e /home/stack/templates/overcloud-vips-deployed.yaml \
        -e /home/stack/templates/bootstrap-controller.yaml \
        -e [ ... ]
    Note

    If the replacement Controller node is the bootstrap node, include the ~/templates/bootstrap-controller.yaml environment file.

20.7. Deploying Ceph services on the new controller node

After you provision a new Controller node and the Ceph monitor services are running, you can deploy the mgr, rgw, and osd Ceph services on the new Controller node.

Prerequisites

  • The new Controller node is provisioned and is running Ceph monitor services.

Procedure

  1. Export the Red Hat Ceph Storage specification and, in the resulting spec.yml file, replace the previous Controller node name with the new Controller node name:

    $ sudo cephadm shell -- ceph orch ls --export > spec.yml
    Note

    Do not use the basic Ceph environment file ceph_spec_host.yaml as it does not contain all necessary cluster information.
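
    For example, assuming the previous and new host names are controller-0 and controller-3 (the names are illustrative), you can make the replacement with sed:

    $ sed -i 's/controller-0/controller-3/g' spec.yml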

  2. Apply the modified Ceph specification file:

    $ cat spec.yml | sudo cephadm shell -- ceph orch apply -i -
    Inferring fsid 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
    Using recent ceph image undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph@sha256:3075e8708792ebd527ca14849b6af4a11256a3f881ab09b837d7af0f8b2102ea
    Scheduled crash update...
    Scheduled mgr update...
    Scheduled mon update...
    Scheduled osd.default_drive_group update...
    Scheduled rgw.rgw update...
  3. Verify the visibility of the new monitor:

    $ sudo cephadm shell -- ceph status


20.8. Cleaning up after Controller node replacement

After you complete the node replacement, you can finalize the Controller cluster.

Procedure

  1. Log into a Controller node.
  2. Enable Pacemaker management of the Galera cluster and start Galera on the new node:

    [tripleo-admin@overcloud-controller-0 ~]$ sudo pcs resource refresh galera-bundle
    [tripleo-admin@overcloud-controller-0 ~]$ sudo pcs resource manage galera-bundle
  3. Enable fencing:

    [tripleo-admin@overcloud-controller-0 ~]$ sudo pcs property set stonith-enabled=true
  4. Perform a final status check to ensure that the services are running correctly:

    [tripleo-admin@overcloud-controller-0 ~]$ sudo pcs status
    Note

    If any services have failed, use the pcs resource refresh command to resolve and restart the failed services.

  5. Exit to director:

    [tripleo-admin@overcloud-controller-0 ~]$ exit
  6. Source the overcloudrc file so that you can interact with the overcloud:

    $ source ~/overcloudrc
  7. Check the network agents in your overcloud environment:

    (overcloud) $ openstack network agent list
  8. If any agents appear for the old node, remove them:

    (overcloud) $ for AGENT in $(openstack network agent list --host overcloud-controller-1.localdomain -c ID -f value) ; do openstack network agent delete $AGENT ; done
  9. If necessary, add your router to the L3 agent host on the new node. Use the following example command to add a router named r1 to the L3 agent using the UUID 2d1c1dc1-d9d4-4fa9-b2c8-f29cd1a649d4:

    (overcloud) $ openstack network agent add router --l3 2d1c1dc1-d9d4-4fa9-b2c8-f29cd1a649d4 r1
  10. Clean the cinder services.

    1. List the cinder services:

      (overcloud) $ openstack volume service list
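
      The output is similar to the following illustrative example; services that were hosted on the replaced Controller node are reported with the state down, and the value in the Host column is the <host> value to use in the next step:

      +------------------+------------------------------------+------+---------+-------+
      | Binary           | Host                               | Zone | Status  | State |
      +------------------+------------------------------------+------+---------+-------+
      | cinder-scheduler | overcloud-controller-0.localdomain | nova | enabled | down  |
      | cinder-scheduler | overcloud-controller-1.localdomain | nova | enabled | up    |
      | cinder-scheduler | overcloud-controller-2.localdomain | nova | enabled | up    |
      +------------------+------------------------------------+------+---------+-------+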
    2. Log in to a controller node, connect to the cinder-api container and use the cinder-manage service remove command to remove leftover services:

      [tripleo-admin@overcloud-controller-0 ~]$ sudo podman exec -it cinder_api cinder-manage service remove cinder-backup <host>
      [tripleo-admin@overcloud-controller-0 ~]$ sudo podman exec -it cinder_api cinder-manage service remove cinder-scheduler <host>
  11. Clean the RabbitMQ cluster.

    1. Log into a Controller node.
    2. Use the podman exec command to launch bash, and verify the status of the RabbitMQ cluster:

      [tripleo-admin@overcloud-controller-0 ~]$ podman exec -it rabbitmq-bundle-podman-0 bash
      [tripleo-admin@overcloud-controller-0 ~]$ rabbitmqctl cluster_status
    3. Use the rabbitmqctl command to forget the replaced controller node:

      [tripleo-admin@overcloud-controller-0 ~]$ rabbitmqctl forget_cluster_node <node_name>
  12. If you replaced a bootstrap Controller node, you must remove the environment file ~/templates/bootstrap-controller.yaml after the replacement process, or delete the pacemaker_short_bootstrap_node_name and mysql_short_bootstrap_node_name parameters from your existing environment file. This step prevents director from attempting to override the Controller node name in subsequent replacements. For more information, see Replacing a bootstrap Controller node.
  13. If you are using the Object Storage service (swift) on the overcloud, you must synchronize the swift rings after updating the overcloud nodes. Use a script, similar to the following example, to distribute ring files from a previously existing Controller node (Controller node 0 in this example) to all Controller nodes and restart the Object Storage service containers on those nodes:

    #!/bin/sh
    set -xe
    
    SRC="tripleo-admin@overcloud-controller-0.ctlplane"
    ALL="tripleo-admin@overcloud-controller-0.ctlplane tripleo-admin@overcloud-controller-1.ctlplane tripleo-admin@overcloud-controller-2.ctlplane"
    • Fetch the current set of ring files:

      ssh "${SRC}" 'sudo tar -czvf - /var/lib/config-data/puppet-generated/swift_ringbuilder/etc/swift/{*.builder,*.ring.gz,backups/*.builder}' > swift-rings.tar.gz
    • Upload rings to all nodes, put them into the correct place, and restart swift services:

      for DST in ${ALL}; do
        cat swift-rings.tar.gz | ssh "${DST}" 'sudo tar -C / -xvzf -'
        ssh "${DST}" 'sudo podman restart swift_copy_rings'
        ssh "${DST}" 'sudo systemctl restart tripleo_swift*'
      done