Chapter 9. Scaling the Overcloud
There might be situations where you need to add or remove nodes after the creation of the Overcloud. For example, you might need to add more Compute nodes to the Overcloud. This situation requires updating the Overcloud.
With High Availability for Compute instances (or Instance HA, as described in High Availability for Compute Instances), upgrades or scale-up operations are not possible. Any attempts to do so will fail.
If you have Instance HA enabled, disable it before performing an upgrade or scale-up. To do so, perform a rollback as described in Rollback.
Use the following table to determine support for scaling each node type:
Table 9.1. Scale Support for Each Node Type
Node Type            | Scale Up? | Scale Down? | Notes
---------------------|-----------|-------------|------
Controller           | N         | N           |
Compute              | Y         | Y           |
Ceph Storage Nodes   | Y         | N           | You must have at least 1 Ceph Storage node from the initial Overcloud creation.
Block Storage Nodes  | N         | N           |
Object Storage Nodes | Y         | Y           | Requires manual ring management, which is described in Section 9.6, “Replacing Object Storage Nodes”.
Make sure you have at least 10 GB of free space before scaling the Overcloud. This free space accommodates image conversion and caching during the node provisioning process.
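For example, to confirm the available space on the Undercloud (a minimal check, assuming the default installation keeps images and caches on the root filesystem):

$ df -h /
# check that the Avail column reports at least 10G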
9.1. Adding Additional Nodes
To add more nodes to the director’s node pool, create a new JSON file (for example, newnodes.json) containing the new node details to register:
{ "nodes":[ { "mac":[ "dd:dd:dd:dd:dd:dd" ], "cpu":"4", "memory":"6144", "disk":"40", "arch":"x86_64", "pm_type":"pxe_ipmitool", "pm_user":"admin", "pm_password":"p@55w0rd!", "pm_addr":"192.0.2.207" }, { "mac":[ "ee:ee:ee:ee:ee:ee" ], "cpu":"4", "memory":"6144", "disk":"40", "arch":"x86_64", "pm_type":"pxe_ipmitool", "pm_user":"admin", "pm_password":"p@55w0rd!", "pm_addr":"192.0.2.208" } ] }
See Section 5.1, “Registering Nodes for the Overcloud” for an explanation of these parameters.
Run the following command to register these nodes:
$ openstack baremetal import --json newnodes.json
After registering the new nodes, launch the introspection process for them. Use the following commands for each new node:
$ ironic node-set-provision-state [NODE UUID] manage
$ openstack baremetal introspection start [NODE UUID]
$ ironic node-set-provision-state [NODE UUID] provide
This detects and benchmarks the hardware properties of the nodes.
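To spot-check the results, you can view each node's discovered properties (an optional verification, not part of the original procedure):

$ ironic node-show [NODE UUID]
# the properties field should now include values such as cpus, memory_mb, and local_gb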
After the introspection process completes, tag each new node for its desired role. For example, for a Compute node, use the following command:
$ ironic node-update [NODE UUID] add properties/capabilities='profile:compute,boot_option:local'
Set the boot images to use during the deployment. Find the UUIDs for the bm-deploy-kernel and bm-deploy-ramdisk images:
$ glance image-list
+--------------------------------------+------------------------+
| ID                                   | Name                   |
+--------------------------------------+------------------------+
| 09b40e3d-0382-4925-a356-3a4b4f36b514 | bm-deploy-kernel       |
| 765a46af-4417-4592-91e5-a300ead3faf6 | bm-deploy-ramdisk      |
| ef793cd0-e65c-456a-a675-63cd57610bd5 | overcloud-full         |
| 9a51a6cb-4670-40de-b64b-b70f4dd44152 | overcloud-full-initrd  |
| 4f7e33f4-d617-47c1-b36f-cbe90f132e5d | overcloud-full-vmlinuz |
+--------------------------------------+------------------------+
Set these UUIDs for the new node’s deploy_kernel and deploy_ramdisk settings:
$ ironic node-update [NODE UUID] add driver_info/deploy_kernel='09b40e3d-0382-4925-a356-3a4b4f36b514'
$ ironic node-update [NODE UUID] add driver_info/deploy_ramdisk='765a46af-4417-4592-91e5-a300ead3faf6'
Scaling the Overcloud requires running the openstack overcloud deploy command again with the desired number of nodes for a role. For example, to scale to 5 Compute nodes:
$ openstack overcloud deploy --templates --compute-scale 5 [OTHER_OPTIONS]
This updates the entire Overcloud stack. Note that this only updates the stack; it does not delete the Overcloud and replace it with a new stack.
Make sure to include all environment files and options from your initial Overcloud creation. This includes the same scale parameters for non-Compute nodes.
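For example, a complete scale-out command might look like the following. The two environment file names are hypothetical placeholders for whichever files your initial deployment used, and the non-Compute scale values repeat the counts from the initial deployment:

$ openstack overcloud deploy --templates \
  -e ~/templates/network-environment.yaml \
  -e ~/templates/storage-environment.yaml \
  --control-scale 3 --ceph-storage-scale 3 --compute-scale 5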
9.2. Removing Compute Nodes
There might be situations where you need to remove Compute nodes from the Overcloud. For example, you might need to replace a problematic Compute node.
Before removing a Compute node from the Overcloud, migrate the workload from the node to other Compute nodes. See Section 8.9, “Migrating VMs from an Overcloud Compute Node” for more details.
Next, disable the node’s Compute service on the Overcloud. This prevents new instances from being scheduled on the node.
$ source ~/stack/overcloudrc
$ nova service-list
$ nova service-disable [hostname] nova-compute
$ source ~/stack/stackrc
Removing Overcloud nodes requires an update to the overcloud stack in the director using the local template files. First identify the UUID of the Overcloud stack:
$ heat stack-list
Identify the UUIDs of the nodes to delete:
$ nova list
Run the following command to delete the nodes from the stack and update the plan accordingly:
$ openstack overcloud node delete --stack [STACK_UUID] --templates -e [ENVIRONMENT_FILE] [NODE1_UUID] [NODE2_UUID] [NODE3_UUID]
If you passed any extra environment files when you created the Overcloud, pass them here again using the -e or --environment-file option to avoid making undesired manual changes to the Overcloud.
Make sure the openstack overcloud node delete command runs to completion before you continue. Use the openstack stack list command and check that the overcloud stack has reached an UPDATE_COMPLETE status.
Finally, remove the node’s Compute service:
$ source ~/stack/overcloudrc
$ nova service-list
$ nova service-delete [service-id]
$ source ~/stack/stackrc
And remove the node’s Open vSwitch agent:
$ source ~/stack/overcloudrc
$ neutron agent-list
$ neutron agent-delete [openvswitch-agent-id]
$ source ~/stack/stackrc
You are now free to remove the node from the Overcloud and re-provision it for other purposes.
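As an optional final check before re-provisioning, confirm from the director that the bare metal node is no longer associated with an Overcloud instance:

$ source ~/stack/stackrc
$ ironic node-list
# the removed node should show no Instance UUID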
9.3. Replacing Compute Nodes
If a Compute node fails, you can replace the node with a working one. Replacing a Compute node uses the following process:
- Migrate workload off the existing Compute node and shut down the node. See Section 8.9, “Migrating VMs from an Overcloud Compute Node” for this process.
- Remove the Compute node from the Overcloud. See Section 9.2, “Removing Compute Nodes” for this process.
- Scale out the Overcloud with a new Compute node. See Section 9.1, “Adding Additional Nodes” for this process.
This process ensures that a node can be replaced without affecting the availability of any instances.
9.4. Replacing Controller Nodes
In certain circumstances a Controller node in a high availability cluster might fail. In these situations, you must remove the node from the cluster and replace it with a new Controller node. This also includes ensuring the node connects to the other nodes in the cluster.
This section provides instructions on how to replace a Controller node. The process involves running the openstack overcloud deploy command to update the Overcloud with a request to replace a Controller node. Note that this process is not completely automatic; during the Overcloud stack update process, the openstack overcloud deploy command will at some point report a failure and halt the Overcloud stack update. At this point, the process requires some manual intervention. Then the openstack overcloud deploy process can continue.
The following procedure only applies to high availability environments. Do not use this procedure if you are using only one Controller node.
9.4.1. Preliminary Checks
Before attempting to replace an Overcloud Controller node, it is important to check the current state of your Red Hat OpenStack Platform environment. Checking the current state can help avoid complications during the Controller replacement process. Use the following list of preliminary checks to determine if it is safe to perform a Controller node replacement. Run all commands for these checks on the Undercloud.
Check the current status of the overcloud stack on the Undercloud:

$ source stackrc
$ heat stack-list --show-nested
The overcloud stack and its subsequent child stacks should have either a CREATE_COMPLETE or UPDATE_COMPLETE status.

Perform a backup of the Undercloud databases:
$ mkdir /home/stack/backup
$ sudo mysqldump --all-databases --quick --single-transaction | gzip > /home/stack/backup/dump_db_undercloud.sql.gz
$ sudo systemctl stop openstack-ironic-api.service openstack-ironic-conductor.service openstack-ironic-inspector.service openstack-ironic-inspector-dnsmasq.service
$ sudo cp /var/lib/ironic-inspector/inspector.sqlite /home/stack/backup
$ sudo systemctl start openstack-ironic-api.service openstack-ironic-conductor.service openstack-ironic-inspector.service openstack-ironic-inspector-dnsmasq.service
- Check that your Undercloud contains 10 GB of free storage to accommodate image caching and conversion when provisioning the new node.
Check the status of Pacemaker on the running Controller nodes. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to get the Pacemaker status:
$ ssh heat-admin@192.168.0.47 'sudo pcs status'
The output should show all services running on the existing nodes and stopped on the failed node.
Check the following parameters on each node of the Overcloud’s MariaDB cluster:
- wsrep_local_state_comment: Synced
- wsrep_cluster_size: 2
Use the following command to check these parameters on each running Controller node (in this example, the running Controller nodes use the IP addresses 192.168.0.47 and 192.168.0.46):
$ for i in 192.168.0.47 192.168.0.46 ; do echo "*** $i ***" ; ssh heat-admin@$i "sudo mysql --exec=\"SHOW STATUS LIKE 'wsrep_local_state_comment'\" ; sudo mysql --exec=\"SHOW STATUS LIKE 'wsrep_cluster_size'\""; done
Check the RabbitMQ status. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to get the status:
$ ssh heat-admin@192.168.0.47 "sudo rabbitmqctl cluster_status"
The running_nodes key should only show the two available nodes and not the failed node.

Disable fencing, if enabled. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to disable fencing:
$ ssh heat-admin@192.168.0.47 "sudo pcs property set stonith-enabled=false"
Check the fencing status with the following command:
$ ssh heat-admin@192.168.0.47 "sudo pcs property show stonith-enabled"
Check the nova-compute service on the director node:

$ sudo systemctl status openstack-nova-compute
$ nova hypervisor-list
The output should show all non-maintenance mode nodes as up.

Make sure all Undercloud services are running:
$ sudo systemctl -t service
9.4.2. Node Replacement
Identify the index of the node to remove. The node index is the suffix on the instance name from the nova list output.
[stack@director ~]$ nova list
+--------------------------------------+------------------------+
| ID                                   | Name                   |
+--------------------------------------+------------------------+
| 861408be-4027-4f53-87a6-cd3cf206ba7a | overcloud-compute-0    |
| 0966e9ae-f553-447a-9929-c4232432f718 | overcloud-compute-1    |
| 9c08fa65-b38c-4b2e-bd47-33870bff06c7 | overcloud-compute-2    |
| a7f0f5e1-e7ce-4513-ad2b-81146bc8c5af | overcloud-controller-0 |
| cfefaf60-8311-4bc3-9416-6a824a40a9ae | overcloud-controller-1 |
| 97a055d4-aefd-481c-82b7-4a5f384036d2 | overcloud-controller-2 |
+--------------------------------------+------------------------+
In this example, the aim is to remove the overcloud-controller-1 node and replace it with overcloud-controller-3. First, set the node into maintenance mode so the director does not reprovision the failed node. Correlate the instance ID from nova list with the node ID from ironic node-list:
[stack@director ~]$ ironic node-list
+--------------------------------------+------+--------------------------------------+
| UUID                                 | Name | Instance UUID                        |
+--------------------------------------+------+--------------------------------------+
| 36404147-7c8a-41e6-8c72-a6e90afc7584 | None | 7bee57cf-4a58-4eaf-b851-2a8bf6620e48 |
| 91eb9ac5-7d52-453c-a017-c0e3d823efd0 | None | None                                 |
| 75b25e9a-948d-424a-9b3b-f0ef70a6eacf | None | None                                 |
| 038727da-6a5c-425f-bd45-fda2f4bd145b | None | 763bfec2-9354-466a-ae65-2401c13e07e5 |
| dc2292e6-4056-46e0-8848-d6e96df1f55d | None | 2017b481-706f-44e1-852a-2ee857c303c4 |
| c7eadcea-e377-4392-9fc3-cf2b02b7ec29 | None | 5f73c7d7-4826-49a5-b6be-8bfd558f3b41 |
| da3a8d19-8a59-4e9d-923a-6a336fe10284 | None | cfefaf60-8311-4bc3-9416-6a824a40a9ae |
| 807cb6ce-6b94-4cd1-9969-5c47560c2eee | None | c07c13e6-a845-4791-9628-260110829c3a |
+--------------------------------------+------+--------------------------------------+
Set the node into maintenance mode:
[stack@director ~]$ ironic node-set-maintenance da3a8d19-8a59-4e9d-923a-6a336fe10284 true
Tag the new node with the control profile:
[stack@director ~]$ ironic node-update 75b25e9a-948d-424a-9b3b-f0ef70a6eacf add properties/capabilities='profile:control,boot_option:local'
Create a YAML file (~/templates/remove-controller.yaml) that defines the node index to remove:
parameters:
  ControllerRemovalPolicies:
    [{'resource_list': ['1']}]
If replacing the node with index 0, edit the heat templates and change the bootstrap node index and node validation index before starting replacement. Create a copy of the director’s Heat template collection (see Section 6.18, “Using Customized Core Heat Templates”) and run the following command on the overcloud.yaml file:
$ sudo sed -i "s/resource\.0/resource.1/g" ~/templates/my-overcloud/overcloud.yaml
This changes the node index for the following resources:
ControllerBootstrapNodeConfig:
  type: OS::TripleO::BootstrapNode::SoftwareConfig
  properties:
    bootstrap_nodeid: {get_attr: [Controller, resource.0.hostname]}
    bootstrap_nodeid_ip: {get_attr: [Controller, resource.0.ip_address]}
And:
AllNodesValidationConfig:
  type: OS::TripleO::AllNodes::Validation
  properties:
    PingTestIps:
      list_join:
      - ' '
      - - {get_attr: [Controller, resource.0.external_ip_address]}
        - {get_attr: [Controller, resource.0.internal_api_ip_address]}
        - {get_attr: [Controller, resource.0.storage_ip_address]}
        - {get_attr: [Controller, resource.0.storage_mgmt_ip_address]}
        - {get_attr: [Controller, resource.0.tenant_ip_address]}
You can speed up the replacement process by reducing the number of settle tries in Corosync. Include the following hieradata in the ExtraConfig parameter in an environment file:
parameter_defaults:
  ExtraConfig:
    pacemaker::corosync::settle_tries: 5
After identifying the node index, redeploy the Overcloud and include the remove-controller.yaml environment file:
[stack@director ~]$ openstack overcloud deploy --templates --control-scale 3 -e ~/templates/remove-controller.yaml [OTHER OPTIONS]
If you passed any extra environment files or options when you created the Overcloud, pass them again here to avoid making undesired changes to the Overcloud. However, note that -e ~/templates/remove-controller.yaml is only required once in this instance.
The director removes the old node, creates a new one, and updates the Overcloud stack. You can check the status of the Overcloud stack with the following command:
[stack@director ~]$ heat stack-list --show-nested
9.4.3. Manual Intervention
During the ControllerNodesPostDeployment stage, the Overcloud stack update halts with an UPDATE_FAILED error at ControllerLoadBalancerDeployment_Step1. This is because some Puppet modules do not support node replacement. This point in the process requires some manual intervention. Follow these configuration steps:
Get a list of IP addresses for the Controller nodes. For example:
[stack@director ~]$ nova list
... +------------------------+ ... +-------------------------+
... | Name                   | ... | Networks                |
... +------------------------+ ... +-------------------------+
... | overcloud-compute-0    | ... | ctlplane=192.168.0.44   |
... | overcloud-controller-0 | ... | ctlplane=192.168.0.47   |
... | overcloud-controller-2 | ... | ctlplane=192.168.0.46   |
... | overcloud-controller-3 | ... | ctlplane=192.168.0.48   |
... +------------------------+ ... +-------------------------+
Check the nodeid value of the removed node in the /etc/corosync/corosync.conf file on an existing node. For example, the existing node is overcloud-controller-0 at 192.168.0.47:

[stack@director ~]$ ssh heat-admin@192.168.0.47 "sudo cat /etc/corosync/corosync.conf"
This displays a nodelist that contains the ID for the removed node (overcloud-controller-1):

nodelist {
  node {
    ring0_addr: overcloud-controller-0
    nodeid: 1
  }
  node {
    ring0_addr: overcloud-controller-1
    nodeid: 2
  }
  node {
    ring0_addr: overcloud-controller-2
    nodeid: 3
  }
}
Note the nodeid value of the removed node for later. In this example, it is 2.

Delete the failed node from the Corosync configuration on each node and restart Corosync. For this example, log into overcloud-controller-0 and overcloud-controller-2 and run the following commands:

[stack@director] ssh heat-admin@192.168.0.47 "sudo pcs cluster localnode remove overcloud-controller-1"
[stack@director] ssh heat-admin@192.168.0.47 "sudo pcs cluster reload corosync"
[stack@director] ssh heat-admin@192.168.0.46 "sudo pcs cluster localnode remove overcloud-controller-1"
[stack@director] ssh heat-admin@192.168.0.46 "sudo pcs cluster reload corosync"
Log into one of the remaining nodes and delete the node from the cluster with the crm_node command:

[stack@director] ssh heat-admin@192.168.0.47
[heat-admin@overcloud-controller-0 ~]$ sudo crm_node -R overcloud-controller-1 --force
Stay logged into this node.
Delete the failed node from the RabbitMQ cluster:
[heat-admin@overcloud-controller-0 ~]$ sudo rabbitmqctl forget_cluster_node rabbit@overcloud-controller-1
Delete the failed node from MongoDB. First, find the IP address for the node’s Internal API connection:
[heat-admin@overcloud-controller-0 ~]$ sudo netstat -tulnp | grep 27017
tcp        0      0 192.168.0.47:27017    0.0.0.0:*    LISTEN    13415/mongod
Check whether the node is the primary of the replica set:

[root@overcloud-controller-0 ~]# echo "db.isMaster()" | mongo --host 192.168.0.47:27017
MongoDB shell version: 2.6.11
connecting to: 192.168.0.47:27017/echo
{
    "setName" : "tripleo",
    "setVersion" : 1,
    "ismaster" : true,
    "secondary" : false,
    "hosts" : [
        "192.168.0.47:27017",
        "192.168.0.46:27017",
        "192.168.0.45:27017"
    ],
    "primary" : "192.168.0.47:27017",
    "me" : "192.168.0.47:27017",
    "electionId" : ObjectId("575919933ea8637676159d28"),
    "maxBsonObjectSize" : 16777216,
    "maxMessageSizeBytes" : 48000000,
    "maxWriteBatchSize" : 1000,
    "localTime" : ISODate("2016-06-09T09:02:43.340Z"),
    "maxWireVersion" : 2,
    "minWireVersion" : 0,
    "ok" : 1
}
bye
This should indicate whether the current node is the primary. If not, use the IP address of the node indicated in the primary key.

Connect to MongoDB on the primary node:
[heat-admin@overcloud-controller-0 ~]$ mongo --host 192.168.0.47
MongoDB shell version: 2.6.9
connecting to: 192.168.0.47:27017/test
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
    http://docs.mongodb.org/
Questions? Try the support group
    http://groups.google.com/group/mongodb-user
tripleo:PRIMARY>
Check the status of the MongoDB cluster:
tripleo:PRIMARY> rs.status()
Identify the node using the _id key and remove the failed node using the name key. In this case, we remove Node 1, which has 192.168.0.45:27017 for name:

tripleo:PRIMARY> rs.remove('192.168.0.45:27017')
Important
You must run the command against the PRIMARY replica set. If you see the following message:

"replSetReconfig command must be sent to the current replica set primary."

Log back into MongoDB on the node designated as PRIMARY.

Note
The following output is normal when removing the failed node’s replica set:
2016-05-07T03:57:19.541+0000 DBClientCursor::init call() failed
2016-05-07T03:57:19.543+0000 Error: error doing query: failed at src/mongo/shell/query.js:81
2016-05-07T03:57:19.545+0000 trying reconnect to 192.168.0.47:27017 (192.168.0.47) failed
2016-05-07T03:57:19.547+0000 reconnect 192.168.0.47:27017 (192.168.0.47) ok
Exit MongoDB:
tripleo:PRIMARY> exit
Update the list of nodes in the Galera cluster:
[heat-admin@overcloud-controller-0 ~]$ sudo pcs resource update galera wsrep_cluster_address=gcomm://overcloud-controller-0,overcloud-controller-3,overcloud-controller-2
- Configure the Galera cluster check on the new node. Copy the /etc/sysconfig/clustercheck file from the existing node to the same location on the new node.
- Configure the root user’s Galera access on the new node. Copy the /root/.my.cnf file from the existing node to the same location on the new node.

Add the new node to the cluster:

[heat-admin@overcloud-controller-0 ~]$ sudo pcs cluster node add overcloud-controller-3
Check the /etc/corosync/corosync.conf file on each node. If the nodeid of the new node is the same as that of the removed node, update it to a new value. For example, the /etc/corosync/corosync.conf file contains an entry for the new node (overcloud-controller-3):

nodelist {
  node {
    ring0_addr: overcloud-controller-0
    nodeid: 1
  }
  node {
    ring0_addr: overcloud-controller-2
    nodeid: 3
  }
  node {
    ring0_addr: overcloud-controller-3
    nodeid: 2
  }
}
Note that in this example, the new node uses the same nodeid as the removed node. Update this value to an unused node ID value. For example:

node {
  ring0_addr: overcloud-controller-3
  nodeid: 4
}
Update this nodeid value on each Controller node’s /etc/corosync/corosync.conf file, including the new node.

Restart the Corosync service on the existing nodes only. For example, on
:[heat-admin@overcloud-controller-0 ~]$ sudo pcs cluster reload corosync
And on
overcloud-controller-2
:[heat-admin@overcloud-controller-2 ~]$ sudo pcs cluster reload corosync
Do not run this command on the new node.
Start the new Controller node:
[heat-admin@overcloud-controller-0 ~]$ sudo pcs cluster start overcloud-controller-3
Enable the keystone service on the new node. Copy the /etc/keystone directory from a remaining node to the director host:

[heat-admin@overcloud-controller-0 ~]$ sudo -i
[root@overcloud-controller-0 ~]$ scp -r /etc/keystone stack@192.168.0.1:~/.

Log in to the new Controller node. Remove the /etc/keystone directory from the new Controller node and copy the keystone files from the director host:

[heat-admin@overcloud-controller-3 ~]$ sudo -i
[root@overcloud-controller-3 ~]$ rm -rf /etc/keystone
[root@overcloud-controller-3 ~]$ scp -r stack@192.168.0.1:~/keystone /etc/.
[root@overcloud-controller-3 ~]$ chown -R keystone: /etc/keystone
[root@overcloud-controller-3 ~]$ chown root /etc/keystone/logging.conf /etc/keystone/default_catalog.templates

Edit /etc/keystone/keystone.conf and set the admin_bind_host and public_bind_host parameters to the new Controller node’s IP addresses. To find these IP addresses, use the ip addr command and look for the IP address within the following networks:

- admin_bind_host - Provisioning network
- public_bind_host - Internal API network

Note
These networks might differ if you deployed the Overcloud using a custom ServiceNetMap parameter.

For example, if the Provisioning network uses the 192.168.0.0/24 subnet and the Internal API uses the 172.17.0.0/24 subnet, use the following commands to find the node’s IP addresses on those networks:

[root@overcloud-controller-3 ~]$ ip addr | grep "192\.168\.0\..*/24"
[root@overcloud-controller-3 ~]$ ip addr | grep "172\.17\.0\..*/24"
Enable and restart some services through Pacemaker. The cluster is currently in maintenance mode; temporarily disable maintenance mode to bring the services up. For example:
[heat-admin@overcloud-controller-3 ~]$ sudo pcs property set maintenance-mode=false --wait
Wait until the Galera service starts on all nodes.
[heat-admin@overcloud-controller-3 ~]$ sudo pcs status | grep galera -A1
Master/Slave Set: galera-master [galera]
Masters: [ overcloud-controller-0 overcloud-controller-2 overcloud-controller-3 ]
If need be, perform a cleanup on the new node:

[heat-admin@overcloud-controller-3 ~]$ sudo pcs resource cleanup galera --node overcloud-controller-3
Wait until the httpd service starts on all nodes:

[heat-admin@overcloud-controller-3 ~]$ sudo pcs status | grep httpd -A1
Clone Set: httpd-clone [httpd]
Started: [ overcloud-controller-0 overcloud-controller-2 overcloud-controller-3 ]
If need be, perform a cleanup on the new node:

[heat-admin@overcloud-controller-3 ~]$ sudo pcs resource cleanup httpd --node overcloud-controller-3
Switch the cluster back into maintenance mode:
[heat-admin@overcloud-controller-3 ~]$ sudo pcs property set maintenance-mode=true --wait
The manual configuration is complete. Re-run the Overcloud deployment command to continue the stack update:
[stack@director ~]$ openstack overcloud deploy --templates --control-scale 3 [OTHER OPTIONS]
If you passed any extra environment files or options when you created the Overcloud, pass them again here to avoid making undesired changes to the Overcloud. However, note that the remove-controller.yaml file is no longer needed.
9.4.4. Finalizing Overcloud Services
After the Overcloud stack update completes, some final configuration is required. Log in to one of the Controller nodes and refresh any stopped services in Pacemaker:
[heat-admin@overcloud-controller-0 ~]$ for i in `sudo pcs status|grep -B2 Stop |grep -v "Stop\|Start"|awk -F"[" '/\[/ {print substr($NF,0,length($NF)-1)}'`; do echo $i; sudo pcs resource cleanup $i; done
Perform a final status check to make sure services are running correctly:
[heat-admin@overcloud-controller-0 ~]$ sudo pcs status
If any services have failed, use the pcs resource cleanup command to restart them after resolving any underlying issues.
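For example, to clean up the httpd resource seen earlier in this procedure (substitute the resource name reported by pcs status):

[heat-admin@overcloud-controller-0 ~]$ sudo pcs resource cleanup httpd-clone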
Exit to the director:
[heat-admin@overcloud-controller-0 ~]$ exit
9.4.5. Finalizing L3 Agent Router Hosting
Source the overcloudrc file so that you can interact with the Overcloud. Check your routers to make sure the L3 agents are properly hosting the routers in your Overcloud environment. In this example, we use a router with the name r1:
[stack@director ~]$ source ~/overcloudrc
[stack@director ~]$ neutron l3-agent-list-hosting-router r1
This list might still show the old node instead of the new node. To replace it, list the L3 network agents in your environment:
[stack@director ~]$ neutron agent-list | grep "neutron-l3-agent"
Identify the UUID for the agents on the new node and the old node. Add the router to the agent on the new node and remove the router from old node. For example:
[stack@director ~]$ neutron l3-agent-router-add fd6b3d6e-7d8c-4e1a-831a-4ec1c9ebb965 r1
[stack@director ~]$ neutron l3-agent-router-remove b40020af-c6dd-4f7a-b426-eba7bac9dbc2 r1
Perform a final check on the router and make sure all are active:
[stack@director ~]$ neutron l3-agent-list-hosting-router r1
Delete the existing Neutron agents that point to old Controller node. For example:
[stack@director ~]$ neutron agent-list -F id -F host | grep overcloud-controller-1
| ddae8e46-3e8e-4a1b-a8b3-c87f13c294eb | overcloud-controller-1.localdomain |
[stack@director ~]$ neutron agent-delete ddae8e46-3e8e-4a1b-a8b3-c87f13c294eb
9.4.6. Finalizing Compute Services
Compute services for the removed node still exist in the Overcloud and require removal. Source the overcloudrc file so that you can interact with the Overcloud. Check the compute services for the removed node:
[stack@director ~]$ source ~/overcloudrc
[stack@director ~]$ nova service-list | grep "overcloud-controller-1.localdomain"
Remove the compute services for the node. For example, if the nova-scheduler service for overcloud-controller-1.localdomain has an ID of 5, run the following command:
[stack@director ~]$ nova service-delete 5
Perform this task for each service of the removed node.
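A minimal sketch that deletes every service entry for the removed node in one pass (assuming the service ID is the second whitespace-separated field of the nova service-list table, as in the layout printed by this client):

[stack@director ~]$ for ID in $(nova service-list | awk '/overcloud-controller-1.localdomain/ {print $2}'); do nova service-delete $ID; done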
Check the openstack-nova-consoleauth service on the new node:
[stack@director ~]$ nova service-list | grep consoleauth
If the service is not running, log into a Controller node and restart the service:
[stack@director] ssh heat-admin@192.168.0.47
[heat-admin@overcloud-controller-0 ~]$ pcs resource restart openstack-nova-consoleauth
9.4.7. Conclusion
The failed Controller node and its related services are now replaced with a new node.
If you disabled automatic ring building for Object Storage, as in Section 9.6, “Replacing Object Storage Nodes”, you need to manually build the Object Storage ring files for the new node. See Section 9.6, “Replacing Object Storage Nodes” for more information on manually building ring files.
9.5. Replacing Ceph Storage Nodes
The director provides a method to replace Ceph Storage nodes in a director-created cluster. You can find these instructions in the Red Hat Ceph Storage for the Overcloud guide.
9.6. Replacing Object Storage Nodes
To replace nodes on the Object Storage cluster, you need to:
- Update the Overcloud with the new Object Storage nodes and prevent the director from creating the ring files.
- Manually add/remove the nodes to the cluster using swift-ring-builder.
The following procedure describes how to replace nodes while maintaining the integrity of the cluster. In this example, we have a two-node Object Storage cluster. The aim is to add an additional node, then replace the faulty node.
First, create an environment file called ~/templates/swift-ring-prevent.yaml with the following content:
parameter_defaults:
  SwiftRingBuild: false
  RingBuild: false
  ObjectStorageCount: 3
The SwiftRingBuild and RingBuild parameters define whether the Overcloud automatically builds the ring files for Object Storage and Controller nodes respectively. The ObjectStorageCount parameter defines how many Object Storage nodes are in our environment. In this situation, we scale from 2 to 3 nodes.
Include the swift-ring-prevent.yaml file with the rest of your Overcloud’s environment files as part of the openstack overcloud deploy command:
$ openstack overcloud deploy --templates [ENVIRONMENT_FILES] -e swift-ring-prevent.yaml
Add this file to the end of the environment file list so its parameters supersede previous environment file parameters.
After redeployment completes, the Overcloud now contains an additional Object Storage node. However, the node’s storage directory has not been created and ring files for the node’s object store are unbuilt. This means you must create the storage directory and build the ring files manually.
Use the following procedure to also build ring files on Controller nodes.
Log in to the new node and create the storage directory:
$ sudo mkdir -p /srv/node/d1
$ sudo chown -R swift:swift /srv/node/d1
You can also mount an external storage device at this directory.
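For example, to back the storage directory with a dedicated disk (the device name /dev/sdb1 is hypothetical; substitute your own), format and mount it before setting ownership:

$ sudo mkfs.xfs /dev/sdb1    # hypothetical device
$ sudo mount /dev/sdb1 /srv/node/d1
$ sudo chown -R swift:swift /srv/node/d1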
Copy the existing ring files to the node. Log into a Controller node as the heat-admin user and then change to the superuser. For example, given a Controller node with an IP address of 192.168.201.24:
$ ssh heat-admin@192.168.201.24
$ sudo -i
Copy the /etc/swift/*.builder files from the Controller node to the new Object Storage node’s /etc/swift/ directory. If necessary, transfer the files to the director host:
[root@overcloud-controller-0 ~]# scp /etc/swift/*.builder stack@192.1.2.1:~/.
Then transfer the files to the new node:
[stack@director ~]$ scp ~/*.builder heat-admin@192.1.2.24:~/.
Log into the new Object Storage node as the heat-admin user and then change to the superuser. For example, given an Object Storage node with an IP address of 192.168.201.29:
$ ssh heat-admin@192.168.201.29
$ sudo -i
Copy the files to the /etc/swift directory:
# cp /home/heat-admin/*.builder /etc/swift/.
Add the new Object Storage node to the account, container, and object rings. Run the following commands for the new node:
# swift-ring-builder /etc/swift/account.builder add zX-IP:6002/d1 weight
# swift-ring-builder /etc/swift/container.builder add zX-IP:6001/d1 weight
# swift-ring-builder /etc/swift/object.builder add zX-IP:6000/d1 weight
Replace the following values in these commands:
- zX - Replace X with the corresponding integer of a specified zone (for example, z1 for Zone 1).
- IP - The IP address that the account, container, and object services use to listen. This should match the IP address of each storage node; specifically, the value of bind_ip in the DEFAULT sections of /etc/swift/object-server.conf, /etc/swift/account-server.conf, and /etc/swift/container-server.conf.
- weight - Describes the relative weight of the device in comparison to other devices. This is usually 100.
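For example, assuming the new node from this procedure (192.168.201.29) sits in Zone 1 with the usual weight of 100, the commands become:

# swift-ring-builder /etc/swift/account.builder add z1-192.168.201.29:6002/d1 100
# swift-ring-builder /etc/swift/container.builder add z1-192.168.201.29:6001/d1 100
# swift-ring-builder /etc/swift/object.builder add z1-192.168.201.29:6000/d1 100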
Check the existing values of the current nodes in the ring files by running swift-ring-builder on each ring file without further arguments:
# swift-ring-builder /etc/swift/account.builder
Remove the node you aim to replace from the account, container, and object rings. Run the following commands for each node:
# swift-ring-builder /etc/swift/account.builder remove IP
# swift-ring-builder /etc/swift/container.builder remove IP
# swift-ring-builder /etc/swift/object.builder remove IP
Replace IP with the IP address of the node.
Redistribute the partitions across all the nodes:
# swift-ring-builder /etc/swift/account.builder rebalance
# swift-ring-builder /etc/swift/container.builder rebalance
# swift-ring-builder /etc/swift/object.builder rebalance
Change the ownership of all /etc/swift/ contents to the root user and swift group:
# chown -R root:swift /etc/swift
Restart the openstack-swift-proxy service:
# systemctl restart openstack-swift-proxy.service
At this point, the ring files (*.ring.gz and *.builder) should be updated on the new node:
/etc/swift/account.builder
/etc/swift/account.ring.gz
/etc/swift/container.builder
/etc/swift/container.ring.gz
/etc/swift/object.builder
/etc/swift/object.ring.gz
Copy these files to /etc/swift/ on the Controller nodes and the existing Object Storage nodes (except for the node to remove). If necessary, transfer the files to the director host:
[root@overcloud-objectstorage-2 swift]# scp *.builder stack@192.1.2.1:~/
[root@overcloud-objectstorage-2 swift]# scp *.ring.gz stack@192.1.2.1:~/
Then copy the files to the /etc/swift/ directory on each node.
On each node, change the ownership of all /etc/swift/ contents to the root user and swift group:
# chown -R root:swift /etc/swift
The new node is added and a part of the ring. Before removing the old node from the ring, check that the new node completes a full data replication pass.
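One way to monitor this is with the swift-recon tool, if it is available in your deployment; run it from a node with the object ring in place and check that the reported replication times advance past the point when the new node joined:

# swift-recon --replication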
To remove the old node from the ring, reduce the ObjectStorageCount parameter to omit the old node. In this case, we reduce from 3 to 2:
parameter_defaults:
  SwiftRingBuild: false
  RingBuild: false
  ObjectStorageCount: 2
Create a new environment file (remove-object-node.yaml) to identify and remove the old Object Storage node. In this case, we remove overcloud-objectstorage-1:
parameter_defaults:
  ObjectStorageRemovalPolicies:
    [{'resource_list': ['1']}]
Include both environment files with the deployment command:
$ openstack overcloud deploy --templates -e swift-ring-prevent.yaml -e remove-object-node.yaml ...
The director deletes the Object Storage node from the Overcloud and updates the rest of the nodes on the Overcloud to accommodate the node removal.