Chapter 9. Scaling the Overcloud

Warning

Do not use openstack server delete to remove nodes from the overcloud. Read the procedures defined in this section to properly remove and replace nodes.

There might be situations where you need to add or remove nodes after the creation of the overcloud. For example, you might need to add more Compute nodes to the overcloud. This situation requires updating the overcloud.

Use the following table to determine support for scaling each node type:

Table 9.1. Scale Support for Each Node Type

Node Type             Scale Up?   Scale Down?   Notes
Controller            N           N
Compute               Y           Y
Ceph Storage Nodes    Y           N             You must have at least 1 Ceph Storage node from the initial overcloud creation.
Block Storage Nodes   N           N
Object Storage Nodes  Y           Y             Requires manual ring management, which is described in Section 9.6, “Replacing Object Storage Nodes”.

Important

Make sure to leave at least 10 GB free space before scaling the overcloud. This free space accommodates image conversion and caching during the node provisioning process.

9.1. Adding Additional Nodes

To add more nodes to the director’s node pool, create a new JSON file (for example, newnodes.json) containing the new node details to register:

{
  "nodes":[
    {
        "mac":[
            "dd:dd:dd:dd:dd:dd"
        ],
        "cpu":"4",
        "memory":"6144",
        "disk":"40",
        "arch":"x86_64",
        "pm_type":"pxe_ipmitool",
        "pm_user":"admin",
        "pm_password":"p@55w0rd!",
        "pm_addr":"192.0.2.207"
    },
    {
        "mac":[
            "ee:ee:ee:ee:ee:ee"
        ],
        "cpu":"4",
        "memory":"6144",
        "disk":"40",
        "arch":"x86_64",
        "pm_type":"pxe_ipmitool",
        "pm_user":"admin",
        "pm_password":"p@55w0rd!",
        "pm_addr":"192.0.2.208"
    }
  ]
}

See Section 5.1, “Registering Nodes for the Overcloud” for an explanation of these parameters.

Run the following command to register these nodes:

$ openstack baremetal import --json newnodes.json

After registering the new nodes, launch the introspection process for them. Use the following commands for each new node:

$ openstack baremetal node manage [NODE UUID]
$ openstack overcloud node introspect [NODE UUID] --provide

This detects and benchmarks the hardware properties of the nodes.
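When several nodes are registered at once, the two commands above can be wrapped in a loop. The following is a minimal sketch, not part of the director tooling: introspect_nodes and the DRY_RUN variable are hypothetical names introduced here for illustration. With DRY_RUN=1 the function only prints the commands it would run.

```shell
#!/usr/bin/env bash
# introspect_nodes UUID [UUID...]
# Hypothetical wrapper around the two per-node commands shown above.
# With DRY_RUN=1 the commands are printed instead of executed.
introspect_nodes() {
    local uuid
    for uuid in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "openstack baremetal node manage $uuid"
            echo "openstack overcloud node introspect $uuid --provide"
        else
            # Stop at the first failure so a bad node is not silently skipped.
            openstack baremetal node manage "$uuid" || return 1
            openstack overcloud node introspect "$uuid" --provide || return 1
        fi
    done
}
```

For example, run DRY_RUN=1 introspect_nodes [NODE1_UUID] [NODE2_UUID] to review the commands before running them for real.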

After the introspection process completes, tag each new node for its desired role. For example, for a Compute node, use the following command:

$ openstack baremetal node set --property capabilities='profile:compute,boot_option:local' [NODE UUID]

Scaling the overcloud requires running the openstack overcloud deploy command again with the desired number of nodes for a role. For example, to scale to 5 Compute nodes:

$ openstack overcloud deploy --templates --compute-scale 5 [OTHER_OPTIONS]

This updates the entire overcloud stack in place. It does not delete the overcloud and re-create the stack.

Important

Make sure to include all environment files and options from your initial overcloud creation. This includes the same scale parameters for non-Compute nodes.

9.2. Removing Compute Nodes

There might be situations where you need to remove Compute nodes from the overcloud. For example, you might need to replace a problematic Compute node.

Important

Before removing a Compute node from the overcloud, migrate the workload from the node to other Compute nodes. See Chapter 8, Migrating Virtual Machines Between Compute Nodes for more details.

Next, disable the node’s Compute service on the overcloud. This stops new instances from being scheduled on the node.

$ source ~/overcloudrc
$ openstack compute service list
$ openstack compute service set [hostname] nova-compute --disable
$ source ~/stackrc

Removing overcloud nodes requires an update to the overcloud stack in the director using the local template files. First identify the UUID of the overcloud stack:

$ openstack stack list

Identify the UUIDs of the nodes to delete:

$ openstack server list

Run the following commands to update the overcloud plan and delete the nodes from the stack:

$ openstack overcloud deploy --update-plan-only \
  --templates  \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/templates/network-environment.yaml \
  -e /home/stack/templates/storage-environment.yaml \
  -e /home/stack/templates/rhel-registration/environment-rhel-registration.yaml \
  [-e |...]
$ openstack overcloud node delete --stack [STACK_UUID] --templates -e [ENVIRONMENT_FILE] [NODE1_UUID] [NODE2_UUID] [NODE3_UUID]

Important

If you passed any extra environment files when you created the overcloud, pass them here again using the -e or --environment-file option to avoid making undesired manual changes to the overcloud.

Important

Make sure the openstack overcloud node delete command runs to completion before you continue. Use the openstack stack list command and check that the overcloud stack has reached an UPDATE_COMPLETE status.
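The status check can be scripted rather than repeated by hand. The following is a minimal sketch under stated assumptions: stack_state and wait_for_stack are hypothetical helper names, and the commented example assumes the stack is named overcloud and that your client accepts the -f value and -c output options.

```shell
#!/usr/bin/env bash
# stack_state STATUS
# Hypothetical helper: maps a stack status string to an exit code.
#   0 = UPDATE_COMPLETE
#   1 = any *_FAILED status
#   2 = anything else (still in progress, or not yet updated)
stack_state() {
    case "$1" in
        UPDATE_COMPLETE) return 0 ;;
        *_FAILED)        return 1 ;;
        *)               return 2 ;;
    esac
}

# wait_for_stack GET_STATUS_CMD [INTERVAL_SECONDS]
# Calls GET_STATUS_CMD repeatedly until it prints a final status,
# then prints that status and returns 0 (complete) or 1 (failed).
wait_for_stack() {
    local get_status="$1" interval="${2:-30}" status rc
    while true; do
        status=$("$get_status")
        rc=0
        stack_state "$status" || rc=$?
        [ "$rc" -eq 2 ] || { echo "$status"; return "$rc"; }
        sleep "$interval"
    done
}

# Example status source (assumptions: stack name "overcloud", and a client
# that supports "-f value -c COLUMN"):
#   overcloud_status() {
#       openstack stack list -f value -c "Stack Name" -c "Stack Status" |
#           awk '$1 == "overcloud" { print $2 }'
#   }
#   wait_for_stack overcloud_status 30
```

The status source is passed in as a command name so that the polling logic can be tried out with a stub before it is pointed at the real undercloud.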

Finally, remove the node’s Compute service:

$ source ~/overcloudrc
$ openstack compute service list
$ openstack compute service delete [service-id]
$ source ~/stackrc

And remove the node’s Open vSwitch agent:

$ source ~/overcloudrc
$ openstack network agent list
$ openstack network agent delete [openvswitch-agent-id]
$ source ~/stackrc

You are now free to remove the node from the overcloud and re-provision it for other purposes.

9.3. Replacing Compute Nodes

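If a Compute node fails, you can replace the node with a working one. Replacing a Compute node uses the following process:

  • Migrate the workload off the existing Compute node. See Chapter 8, Migrating Virtual Machines Between Compute Nodes.
  • Remove the Compute node from the overcloud. See Section 9.2, “Removing Compute Nodes”.
  • Scale out the overcloud with a new Compute node. See Section 9.1, “Adding Additional Nodes”.

This process ensures that a node can be replaced without affecting the availability of any instances.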

9.4. Replacing Controller Nodes

In certain circumstances a Controller node in a high availability cluster might fail. In these situations, you must remove the node from the cluster and replace it with a new Controller node. This also includes ensuring the node connects to the other nodes in the cluster.

This section provides instructions on how to replace a Controller node. The process involves running the openstack overcloud deploy command to update the overcloud with a request to replace a controller node. Note that this process is not completely automatic; during the overcloud stack update process, the openstack overcloud deploy command will at some point report a failure and halt the overcloud stack update. At this point, the process requires some manual intervention. Then the openstack overcloud deploy process can continue.

Important

The following procedure only applies to high availability environments. Do not use this procedure if only using one Controller node.

9.4.1. Preliminary Checks

Before attempting to replace an overcloud Controller node, it is important to check the current state of your Red Hat OpenStack Platform environment. Checking the current state can help avoid complications during the Controller replacement process. Use the following list of preliminary checks to determine if it is safe to perform a Controller node replacement. Run all commands for these checks on the undercloud.

  1. Check the current status of the overcloud stack on the undercloud:

    $ source ~/stackrc
    $ openstack stack list --nested

    The overcloud stack and its subsequent child stacks should have a status of either CREATE_COMPLETE or UPDATE_COMPLETE.

  2. Perform a backup of the undercloud databases:

    $ mkdir /home/stack/backup
    $ sudo mysqldump --all-databases --quick --single-transaction | gzip > /home/stack/backup/dump_db_undercloud.sql.gz
    $ sudo systemctl stop openstack-ironic-api.service openstack-ironic-conductor.service openstack-ironic-inspector.service openstack-ironic-inspector-dnsmasq.service
    $ sudo cp /var/lib/ironic-inspector/inspector.sqlite /home/stack/backup
    $ sudo systemctl start openstack-ironic-api.service openstack-ironic-conductor.service openstack-ironic-inspector.service openstack-ironic-inspector-dnsmasq.service
  3. Check that your undercloud has at least 10 GB of free storage to accommodate image caching and conversion when provisioning the new node.
  4. Check the status of Pacemaker on the running Controller nodes. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to get the Pacemaker status:

    $ ssh heat-admin@192.168.0.47 'sudo pcs status'

    The output should show all services running on the existing nodes and stopped on the failed node.

  5. Check the following parameters on each node of the overcloud’s MariaDB cluster:

    • wsrep_local_state_comment: Synced
    • wsrep_cluster_size: 2

      Use the following command to check these parameters on each running Controller node (respectively using 192.168.0.47 and 192.168.0.46 for IP addresses):

      $ for i in 192.168.0.47 192.168.0.46 ; do echo "*** $i ***" ; ssh heat-admin@$i "sudo mysql --exec=\"SHOW STATUS LIKE 'wsrep_local_state_comment'\" ; sudo mysql --exec=\"SHOW STATUS LIKE 'wsrep_cluster_size'\""; done
  6. Check the RabbitMQ status. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to get the status:

    $ ssh heat-admin@192.168.0.47 "sudo rabbitmqctl cluster_status"

    The running_nodes key should only show the two available nodes and not the failed node.

  7. Disable fencing, if enabled. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to disable fencing:

    $ ssh heat-admin@192.168.0.47 "sudo pcs property set stonith-enabled=false"

    Check the fencing status with the following command:

    $ ssh heat-admin@192.168.0.47 "sudo pcs property show stonith-enabled"
  8. Check the nova-compute service on the director node:

    $ sudo systemctl status openstack-nova-compute
    $ openstack hypervisor list

    The output should show all non-maintenance mode nodes as up.

  9. Make sure all undercloud services are running:

    $ sudo systemctl -t service
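The MariaDB check in step 5 lends itself to a scripted pass/fail test. The following is a minimal sketch: check_wsrep is a hypothetical helper that reads the SHOW STATUS output on standard input and compares it with the expected values from step 5. It assumes each variable appears on one line followed by its value, which matches the tab-separated layout the mysql client typically prints when its output is piped.

```shell
#!/usr/bin/env bash
# check_wsrep EXPECTED_SIZE
# Reads "SHOW STATUS" output on stdin and succeeds only if the node reports
# wsrep_local_state_comment = Synced and wsrep_cluster_size = EXPECTED_SIZE.
check_wsrep() {
    local expected="$1"
    awk -v expected="$expected" '
        { gsub(/[|]/, " ") }   # tolerate table borders if present
        $1 == "wsrep_local_state_comment" { state = $2 }
        $1 == "wsrep_cluster_size"        { size  = $2 }
        END {
            if (state == "Synced" && size == expected) {
                print "healthy"
                exit 0
            }
            printf "unhealthy: state=%s size=%s\n", state, size
            exit 1
        }'
}
```

To use it, pipe the output of the ssh command from step 5 for a single Controller node into check_wsrep 2.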

9.4.2. Removing a Ceph Monitor Daemon

This procedure removes a ceph-mon daemon from the storage cluster. If your Controller node is running a Ceph monitor service, complete the following steps to remove the ceph-mon daemon. This procedure assumes the Controller is reachable.

Note

A new Ceph monitor daemon will be added after a new Controller is added to the cluster.

  1. Connect to the controller to be replaced and become root:

    # ssh heat-admin@192.168.0.47
    # sudo su -
    Note

    If the controller is unreachable, skip steps 1 and 2 and continue the procedure at step 3 on any working controller node.

  2. As root, stop the monitor:

    # systemctl stop ceph-mon@<monitor_hostname>

    For example:

    # systemctl stop ceph-mon@overcloud-controller-2
  3. Remove the monitor from the cluster:

    # ceph mon remove <mon_id>
  4. On the Ceph monitor node, remove the monitor entry from /etc/ceph/ceph.conf. For example, if you remove controller-2, then remove the IP and hostname for controller-2.

    Before:

    mon host = 172.18.0.21,172.18.0.22,172.18.0.24
    mon initial members = overcloud-controller-2,overcloud-controller-1,overcloud-controller-0

    After:

    mon host = 172.18.0.22,172.18.0.24
    mon initial members = overcloud-controller-1,overcloud-controller-0
  5. Apply the same change to /etc/ceph/ceph.conf on the other overcloud nodes.

    Note

    The ceph.conf file is updated on the relevant overcloud nodes by director when the replacement controller node is added. Normally, this configuration file is managed only by director and should not be manually edited, but it is edited in this step to ensure consistency in case the other nodes restart before the new node is added.

  6. Optionally, archive the monitor data and save it on another server:

    # mv /var/lib/ceph/mon/<cluster>-<daemon_id> /var/lib/ceph/mon/removed-<cluster>-<daemon_id>
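The edit to mon host and mon initial members in step 4 amounts to dropping one entry from a comma-separated list. The following is a minimal sketch of that operation: remove_list_entry is a hypothetical helper that only computes the new value and does not modify /etc/ceph/ceph.conf.

```shell
#!/usr/bin/env bash
# remove_list_entry LIST ENTRY
# Prints the comma-separated LIST with every exact match of ENTRY removed.
remove_list_entry() {
    local list="$1" drop="$2" item out=""
    local IFS=','
    for item in $list; do
        if [ "$item" != "$drop" ]; then
            out="${out:+$out,}$item"
        fi
    done
    echo "$out"
}
```

For example, remove_list_entry "172.18.0.21,172.18.0.22,172.18.0.24" 172.18.0.21 prints the value shown on the "After" mon host line above.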

9.4.3. Node Replacement

Identify the index of the node to remove. The node index is the suffix on the instance name in the openstack server list output.

[stack@director ~]$ openstack server list
+--------------------------------------+------------------------+
| ID                                   | Name                   |
+--------------------------------------+------------------------+
| 861408be-4027-4f53-87a6-cd3cf206ba7a | overcloud-compute-0    |
| 0966e9ae-f553-447a-9929-c4232432f718 | overcloud-compute-1    |
| 9c08fa65-b38c-4b2e-bd47-33870bff06c7 | overcloud-compute-2    |
| a7f0f5e1-e7ce-4513-ad2b-81146bc8c5af | overcloud-controller-0 |
| cfefaf60-8311-4bc3-9416-6a824a40a9ae | overcloud-controller-1 |
| 97a055d4-aefd-481c-82b7-4a5f384036d2 | overcloud-controller-2 |
+--------------------------------------+------------------------+
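The suffix rule can be expressed as a short helper. The following is a minimal sketch: node_index is a hypothetical function name, and it assumes instance names end in a hyphen followed by the numeric index, as in the listing above.

```shell
#!/usr/bin/env bash
# node_index INSTANCE_NAME
# Prints the numeric suffix of an overcloud instance name,
# or returns 1 if the name has no numeric suffix.
node_index() {
    local name="$1"
    local suffix="${name##*-}"   # strip everything up to the last hyphen
    case "$suffix" in
        ''|*[!0-9]*)
            echo "no numeric suffix in '$name'" >&2
            return 1
            ;;
        *)
            echo "$suffix"
            ;;
    esac
}
```

For example, node_index overcloud-controller-1 prints 1.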

In this example, the aim is to remove the overcloud-controller-1 node and replace it with overcloud-controller-3. First, set the node into maintenance mode so the director does not reprovision the failed node. Correlate the instance ID from the openstack server list output with the node ID from the openstack baremetal node list output:

[stack@director ~]$ openstack baremetal node list
+--------------------------------------+------+--------------------------------------+
| UUID                                 | Name | Instance UUID                        |
+--------------------------------------+------+--------------------------------------+
| 36404147-7c8a-41e6-8c72-a6e90afc7584 | None | 7bee57cf-4a58-4eaf-b851-2a8bf6620e48 |
| 91eb9ac5-7d52-453c-a017-c0e3d823efd0 | None | None                                 |
| 75b25e9a-948d-424a-9b3b-f0ef70a6eacf | None | None                                 |
| 038727da-6a5c-425f-bd45-fda2f4bd145b | None | 763bfec2-9354-466a-ae65-2401c13e07e5 |
| dc2292e6-4056-46e0-8848-d6e96df1f55d | None | 2017b481-706f-44e1-852a-2ee857c303c4 |
| c7eadcea-e377-4392-9fc3-cf2b02b7ec29 | None | 5f73c7d7-4826-49a5-b6be-8bfd558f3b41 |
| da3a8d19-8a59-4e9d-923a-6a336fe10284 | None | cfefaf60-8311-4bc3-9416-6a824a40a9ae |
| 807cb6ce-6b94-4cd1-9969-5c47560c2eee | None | c07c13e6-a845-4791-9628-260110829c3a |
+--------------------------------------+------+--------------------------------------+
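Matching the instance ID to a node UUID by eye is error-prone with long UUIDs. The following is a minimal sketch that does the lookup from the table output: node_for_instance is a hypothetical helper, and it assumes the column order shown above, with UUID first and Instance UUID third.

```shell
#!/usr/bin/env bash
# node_for_instance INSTANCE_UUID
# Reads the table printed by "openstack baremetal node list" on stdin and
# prints the node UUID whose Instance UUID column matches INSTANCE_UUID.
# Returns 1 if no row matches.
node_for_instance() {
    local instance="$1"
    awk -F'|' -v want="$instance" '
        NF >= 4 {
            node = $2
            inst = $4
            gsub(/[ \t]/, "", node)   # trim cell padding
            gsub(/[ \t]/, "", inst)
            if (inst == want) { print node; found = 1 }
        }
        END { exit (found ? 0 : 1) }'
}
```

For example, piping the listing above into node_for_instance cfefaf60-8311-4bc3-9416-6a824a40a9ae prints da3a8d19-8a59-4e9d-923a-6a336fe10284, the node that hosts overcloud-controller-1.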

Set the node into maintenance mode:

[stack@director ~]$ openstack baremetal node maintenance set da3a8d19-8a59-4e9d-923a-6a336fe10284

Tag the new node with the control profile.

[stack@director ~]$ openstack baremetal node set --property capabilities='profile:control,boot_option:local' 75b25e9a-948d-424a-9b3b-f0ef70a6eacf

The overcloud’s database must continue running during the replacement procedure. To ensure Pacemaker does not stop Galera during this procedure, select a running Controller node and run the following command on the undercloud using the Controller node’s IP address:

[stack@director ~]$ ssh heat-admin@192.168.0.47 "sudo pcs resource unmanage galera"

Create a YAML file (~/templates/remove-controller.yaml) that defines the node index to remove:

parameters:
  ControllerRemovalPolicies:
    [{'resource_list': ['1']}]

Note

You can speed up the replacement process by reducing the number of settle tries in Corosync. Include the CorosyncSettleTries parameter in the ~/templates/remove-controller.yaml environment file:

parameter_defaults:
  CorosyncSettleTries: 5

After identifying the node index, redeploy the overcloud and include the remove-controller.yaml environment file:

[stack@director ~]$ openstack overcloud deploy --templates --control-scale 3 -e ~/templates/remove-controller.yaml [OTHER OPTIONS]

If you passed any extra environment files or options when you created the overcloud, pass them again here to avoid making undesired changes to the overcloud.

However, note that the -e ~/templates/remove-controller.yaml is only required once in this instance.

The director removes the old node, creates a new one, and updates the overcloud stack. You can check the status of the overcloud stack with the following command:

[stack@director ~]$ openstack stack list --nested

9.4.4. Manual Intervention

During the ControllerNodesPostDeployment stage, the overcloud stack update halts with an UPDATE_FAILED error at ControllerDeployment_Step1. This is because some Puppet modules do not support node replacement. This point in the process requires some manual intervention. Follow these configuration steps:

  1. Get a list of IP addresses for the Controller nodes. For example:

    [stack@director ~]$ openstack server list
    ... +------------------------+ ... +-------------------------+
    ... | Name                   | ... | Networks                |
    ... +------------------------+ ... +-------------------------+
    ... | overcloud-compute-0    | ... | ctlplane=192.168.0.44   |
    ... | overcloud-controller-0 | ... | ctlplane=192.168.0.47   |
    ... | overcloud-controller-2 | ... | ctlplane=192.168.0.46   |
    ... | overcloud-controller-3 | ... | ctlplane=192.168.0.48   |
    ... +------------------------+ ... +-------------------------+
  2. Check the nodeid value of the removed node in the /etc/corosync/corosync.conf file on an existing node. For example, the existing node is overcloud-controller-0 at 192.168.0.47:

    [stack@director ~]$ ssh heat-admin@192.168.0.47 "sudo cat /etc/corosync/corosync.conf"

    This displays a nodelist that contains the ID for the removed node (overcloud-controller-1):

    nodelist {
      node {
        ring0_addr: overcloud-controller-0
        nodeid: 1
      }
      node {
        ring0_addr: overcloud-controller-1
        nodeid: 2
      }
      node {
        ring0_addr: overcloud-controller-2
        nodeid: 3
      }
    }

    Note the nodeid value of the removed node for later. In this example, it is 2.

  3. Delete the failed node from the Corosync configuration on each node and restart Corosync. For this example, log into overcloud-controller-0 and overcloud-controller-2 and run the following commands:

    [stack@director ~]$ ssh heat-admin@192.168.0.47 "sudo pcs cluster localnode remove overcloud-controller-1"
    [stack@director ~]$ ssh heat-admin@192.168.0.47 "sudo pcs cluster reload corosync"
    [stack@director ~]$ ssh heat-admin@192.168.0.46 "sudo pcs cluster localnode remove overcloud-controller-1"
    [stack@director ~]$ ssh heat-admin@192.168.0.46 "sudo pcs cluster reload corosync"
  4. Log into one of the remaining nodes and delete the node from the cluster with the crm_node command:

    [stack@director ~]$ ssh heat-admin@192.168.0.47
    [heat-admin@overcloud-controller-0 ~]$ sudo crm_node -R overcloud-controller-1 --force

    Stay logged into this node.

  5. Delete the failed node from the RabbitMQ cluster:

    [heat-admin@overcloud-controller-0 ~]$ sudo rabbitmqctl forget_cluster_node rabbit@overcloud-controller-1
  6. Delete the failed node from MongoDB. First, find the IP address for the node’s Internal API connection.

    [heat-admin@overcloud-controller-0 ~]$ sudo netstat -tulnp | grep 27017
    tcp        0      0 192.168.0.47:27017    0.0.0.0:*               LISTEN      13415/mongod

    Check whether the node is the primary member of the replica set:

    [root@overcloud-controller-0 ~]# echo "db.isMaster()" | mongo --host 192.168.0.47:27017
    MongoDB shell version: 2.6.11
    connecting to: 192.168.0.47:27017/echo
    {
      "setName" : "tripleo",
      "setVersion" : 1,
      "ismaster" : true,
      "secondary" : false,
      "hosts" : [
        "192.168.0.47:27017",
        "192.168.0.46:27017",
        "192.168.0.45:27017"
      ],
      "primary" : "192.168.0.47:27017",
      "me" : "192.168.0.47:27017",
      "electionId" : ObjectId("575919933ea8637676159d28"),
      "maxBsonObjectSize" : 16777216,
      "maxMessageSizeBytes" : 48000000,
      "maxWriteBatchSize" : 1000,
      "localTime" : ISODate("2016-06-09T09:02:43.340Z"),
      "maxWireVersion" : 2,
      "minWireVersion" : 0,
      "ok" : 1
    }
    bye

    This should indicate if the current node is the primary. If not, use the IP address of the node indicated in the primary key.

    Connect to MongoDB on the primary node:

    [heat-admin@overcloud-controller-0 ~]$ mongo --host 192.168.0.47
    MongoDB shell version: 2.6.9
    connecting to: 192.168.0.47:27017/test
    Welcome to the MongoDB shell.
    For interactive help, type "help".
    For more comprehensive documentation, see
    http://docs.mongodb.org/
    Questions? Try the support group
    http://groups.google.com/group/mongodb-user
    tripleo:PRIMARY>

    Check the status of the MongoDB cluster:

    tripleo:PRIMARY> rs.status()

    Identify the node using the _id key and remove the failed node using the name key. In this case, we remove Node 1, which has 192.168.0.45:27017 for name:

    tripleo:PRIMARY> rs.remove('192.168.0.45:27017')
    Important

    You must run the command against the PRIMARY member of the replica set. If you see the following message:

    "replSetReconfig command must be sent to the current replica set primary."

    Log in to MongoDB again on the node designated as PRIMARY.

    Note

    The following output is normal when removing the failed node from the replica set:

    2016-05-07T03:57:19.541+0000 DBClientCursor::init call() failed
    2016-05-07T03:57:19.543+0000 Error: error doing query: failed at src/mongo/shell/query.js:81
    2016-05-07T03:57:19.545+0000 trying reconnect to 192.168.0.47:27017 (192.168.0.47) failed
    2016-05-07T03:57:19.547+0000 reconnect 192.168.0.47:27017 (192.168.0.47) ok

    Exit MongoDB:

    tripleo:PRIMARY> exit
  7. Update the list of nodes in the Galera cluster:

    [heat-admin@overcloud-controller-0 ~]$ sudo pcs resource update galera wsrep_cluster_address=gcomm://overcloud-controller-0,overcloud-controller-3,overcloud-controller-2
  8. Configure the Galera cluster check on the new node. Copy the /etc/sysconfig/clustercheck file from the existing node to the same location on the new node.
  9. Configure the root user’s Galera access on the new node. Copy the /root/.my.cnf file from the existing node to the same location on the new node.
  10. Add the new node to the cluster:

    [heat-admin@overcloud-controller-0 ~]$ sudo pcs cluster node add overcloud-controller-3
  11. Check the /etc/corosync/corosync.conf file on each node. If the nodeid of the new node is the same as the removed node, update the value to a new nodeid value. For example, the /etc/corosync/corosync.conf file contains an entry for the new node (overcloud-controller-3):

    nodelist {
      node {
        ring0_addr: overcloud-controller-0
        nodeid: 1
      }
      node {
        ring0_addr: overcloud-controller-2
        nodeid: 3
      }
      node {
        ring0_addr: overcloud-controller-3
        nodeid: 2
      }
    }

    Note that in this example, the new node uses the same nodeid as the removed node. Update this value to an unused node ID value. For example:

    node {
      ring0_addr: overcloud-controller-3
      nodeid: 4
    }

    Update this nodeid value on each Controller node’s /etc/corosync/corosync.conf file, including the new node.

  12. Restart the Corosync service on the existing nodes only. For example, on overcloud-controller-0:

    [heat-admin@overcloud-controller-0 ~]$ sudo pcs cluster reload corosync

    And on overcloud-controller-2:

    [heat-admin@overcloud-controller-2 ~]$ sudo pcs cluster reload corosync

    Do not run this command on the new node.

  13. Start the new Controller node:

    [heat-admin@overcloud-controller-0 ~]$ sudo pcs cluster start overcloud-controller-3
  14. Restart the Galera cluster and return it to Pacemaker management:

    [heat-admin@overcloud-controller-0 ~]$ sudo pcs resource cleanup galera
    [heat-admin@overcloud-controller-0 ~]$ sudo pcs resource manage galera
  15. Enable and restart some services through Pacemaker. The cluster is currently in maintenance mode, so you need to temporarily disable maintenance mode to enable the services. For example:

    [heat-admin@overcloud-controller-3 ~]$ sudo pcs property set maintenance-mode=false --wait
  16. Wait until the Galera service starts on all nodes.

    [heat-admin@overcloud-controller-3 ~]$ sudo pcs status | grep galera -A1
    Master/Slave Set: galera-master [galera]
    Masters: [ overcloud-controller-0 overcloud-controller-2 overcloud-controller-3 ]

    If need be, perform a cleanup on the new node:

    [heat-admin@overcloud-controller-3 ~]$ sudo pcs resource cleanup galera --node overcloud-controller-3
  17. Switch the cluster back into maintenance mode:

    [heat-admin@overcloud-controller-3 ~]$ sudo pcs property set maintenance-mode=true --wait
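Step 11 calls for an unused nodeid value. One simple way to pick one is to take the highest nodeid in the current nodelist and add one. The following is a minimal sketch of that approach: next_nodeid is a hypothetical helper that reads the contents of /etc/corosync/corosync.conf on standard input, and it assumes each ID appears on its own "nodeid: N" line as in the examples above.

```shell
#!/usr/bin/env bash
# next_nodeid
# Reads corosync.conf content on stdin and prints (highest nodeid + 1).
# Prints 1 if the input contains no nodeid lines.
next_nodeid() {
    awk '
        $1 == "nodeid:" && ($2 + 0) > max { max = $2 + 0 }
        END { print max + 1 }'
}
```

For the nodelist in step 11, which contains the IDs 1, 3, and 2, the helper prints 4, the same value used in the example. Run it against the file on an existing node, for example with sudo cat /etc/corosync/corosync.conf | next_nodeid.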

The manual configuration is complete. Re-run the overcloud deployment command to continue the stack update:

[stack@director ~]$ openstack overcloud deploy --templates --control-scale 3 [OTHER OPTIONS]

Important

If you passed any extra environment files or options when you created the overcloud, pass them again here to avoid making undesired changes to the overcloud. However, note that the remove-controller.yaml file is no longer needed.

9.4.5. Finalizing Overcloud Services

After the overcloud stack update completes, some final configuration is required. Log in to one of the Controller nodes and refresh any stopped services in Pacemaker:

[heat-admin@overcloud-controller-0 ~]$ for i in `sudo pcs status|grep -B2 Stop |grep -v "Stop\|Start"|awk -F"[" '/\[/ {print substr($NF,0,length($NF)-1)}'`; do echo $i; sudo pcs resource cleanup $i; done

Perform a final status check to make sure services are running correctly:

[heat-admin@overcloud-controller-0 ~]$ sudo pcs status

Note

If any services have failed, resolve the underlying issues and then use the pcs resource cleanup command to restart the services.

If the Controller nodes use fencing, delete the old fencing record and create a new one:

[heat-admin@overcloud-controller-0 ~]$ sudo pcs stonith show
[heat-admin@overcloud-controller-0 ~]$ sudo pcs stonith delete my-ipmilan-for-controller-1
[heat-admin@overcloud-controller-0 ~]$ sudo pcs stonith create my-ipmilan-for-controller-3 fence_ipmilan pcmk_host_list=overcloud-controller-3 ipaddr=192.0.2.208 login=admin passwd=p@55w0rd! lanplus=1 cipher=1 op monitor interval=60s
[heat-admin@overcloud-controller-0 ~]$ sudo pcs constraint location my-ipmilan-for-controller-3 avoids overcloud-controller-3

Re-enable fencing:

[heat-admin@overcloud-controller-0 ~]$ sudo pcs property set stonith-enabled=true

Note

For more information on fencing configuration, see Section 7.7, “Fencing the Controller Nodes”.

Exit to the director:

[heat-admin@overcloud-controller-0 ~]$ exit

9.4.6. Finalizing L3 Agent Router Hosting

Source the overcloudrc file so that you can interact with the overcloud. Check your routers to make sure the L3 agents are properly hosting the routers in your overcloud environment. In this example, we use a router with the name r1:

[stack@director ~]$ source ~/overcloudrc
[stack@director ~]$ neutron l3-agent-list-hosting-router r1

This list might still show the old node instead of the new node. To replace it, list the L3 network agents in your environment:

[stack@director ~]$ neutron agent-list | grep "neutron-l3-agent"

Identify the UUID for the agents on the new node and the old node. Add the router to the agent on the new node and remove the router from the agent on the old node. For example:

[stack@director ~]$ neutron l3-agent-router-add fd6b3d6e-7d8c-4e1a-831a-4ec1c9ebb965 r1
[stack@director ~]$ neutron l3-agent-router-remove b40020af-c6dd-4f7a-b426-eba7bac9dbc2 r1

Perform a final check on the router and make sure all agents are active:

[stack@director ~]$ neutron l3-agent-list-hosting-router r1

Delete the existing Neutron agents that point to the old Controller node. For example:

[stack@director ~]$ neutron agent-list -F id -F host | grep overcloud-controller-1
| ddae8e46-3e8e-4a1b-a8b3-c87f13c294eb | overcloud-controller-1.localdomain |
[stack@director ~]$ neutron agent-delete ddae8e46-3e8e-4a1b-a8b3-c87f13c294eb

9.4.7. Finalizing Compute Services

Compute services for the removed node still exist in the overcloud and require removal. Source the overcloudrc file so that you can interact with the overcloud. Check the compute services for the removed node:

[stack@director ~]$ source ~/overcloudrc
[stack@director ~]$ nova service-list | grep "overcloud-controller-1.localdomain"

Remove the compute services for the node. For example, if the nova-scheduler service for overcloud-controller-1.localdomain has an ID of 5, run the following command:

[stack@director ~]$ nova service-delete 5

Perform this task for each service of the removed node.

Check the openstack-nova-consoleauth service on the new node.

[stack@director ~]$ nova service-list | grep consoleauth

If the service is not running, log into a Controller node and restart the service:

[stack@director ~]$ ssh heat-admin@192.168.0.47
[heat-admin@overcloud-controller-0 ~]$ sudo pcs resource restart openstack-nova-consoleauth

9.4.8. Conclusion

The failed Controller node and its related services are now replaced with a new node.

Important

If you disabled automatic ring building for Object Storage, you must manually build the Object Storage ring files for the new node. See Section 9.6, “Replacing Object Storage Nodes” for more information on manually building ring files.

9.5. Replacing Ceph Storage Nodes

The director provides a method to replace Ceph Storage nodes in a director-created cluster. You can find these instructions in the Red Hat Ceph Storage for the Overcloud guide.

9.6. Replacing Object Storage Nodes

This section describes how to replace Object Storage nodes while maintaining the integrity of the cluster. In this example, we have a two-node Object Storage cluster where the node overcloud-objectstorage-1 needs to be replaced. Our aim is to add one more node, then remove overcloud-objectstorage-1 (effectively replacing it).

  1. Create an environment file called ~/templates/swift-upscale.yaml with the following content:

    parameter_defaults:
      ObjectStorageCount: 3

    The ObjectStorageCount parameter defines how many Object Storage nodes are in the environment. In this situation, we scale from 2 to 3 nodes.

  2. Include the swift-upscale.yaml file with the rest of your overcloud’s environment files (ENVIRONMENT_FILES) as part of the openstack overcloud deploy:

    $ openstack overcloud deploy --templates ENVIRONMENT_FILES -e swift-upscale.yaml
    Note

    Add swift-upscale.yaml to the end of the environment file list so its parameters supersede previous environment file parameters.

    After redeployment completes, the overcloud now contains an additional Object Storage node.

  3. Data now needs to be replicated to the new node. Before removing a node (in this case, overcloud-objectstorage-1) you should wait for a replication pass to finish on the new node. You can check the replication pass progress in /var/log/swift/swift.log. When the pass finishes, the Object Storage service should log entries similar to the following:

    Mar 29 08:49:05 localhost object-server: Object replication complete.
    Mar 29 08:49:11 localhost container-server: Replication run OVER
    Mar 29 08:49:13 localhost account-server: Replication run OVER
  4. To remove the old node from the ring, reduce the ObjectStorageCount in swift-upscale.yaml to omit the old node. In this case, we reduce it to 2:

    parameter_defaults:
      ObjectStorageCount: 2
  5. Create a new environment file named remove-object-node.yaml. This file will identify and remove the specified Object Storage node. The following content specifies the removal of overcloud-objectstorage-1:

    parameter_defaults:
      ObjectStorageRemovalPolicies:
        [{'resource_list': ['1']}]
  6. Include both environment files with the deployment command:

    $ openstack overcloud deploy --templates ENVIRONMENT_FILES -e swift-upscale.yaml -e remove-object-node.yaml ...

The director deletes the Object Storage node from the overcloud and updates the rest of the nodes on the overcloud to accommodate the node removal.
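The replication check in step 3 can be scripted. The following is a minimal sketch: replication_pass_done is a hypothetical helper that reads log lines on standard input and succeeds once all three completion messages shown in step 3 are present. It only tests for presence, so feed it lines logged after the redeployment (for example, by filtering on timestamp first) to avoid matching an earlier pass.

```shell
#!/usr/bin/env bash
# replication_pass_done
# Reads swift log lines on stdin. Returns 0 only if the object, container,
# and account servers have each logged a completed replication pass.
replication_pass_done() {
    awk '
        /object-server: Object replication complete/ { obj = 1 }
        /container-server: Replication run OVER/     { con = 1 }
        /account-server: Replication run OVER/       { acc = 1 }
        END { exit ((obj && con && acc) ? 0 : 1) }'
}
```

For example, on the new node: sudo cat /var/log/swift/swift.log | replication_pass_done && echo "replication pass finished".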