Chapter 1. Introduction and planning an Instance HA deployment

High availability for Compute instances (Instance HA) is a tool that you can use to evacuate instances from a failed Compute node and re-create the instances on a different Compute node.

Instance HA works with shared storage or local storage environments, which means that evacuated instances maintain the same network configuration, such as static IP addresses and floating IP addresses. The re-created instances also maintain the same characteristics inside the new Compute node.

1.1. How Instance HA works

When a Compute node fails, the overcloud fencing agent fences the node, then the Instance HA agents evacuate instances from the failed Compute node to a different Compute node.

The following events occur when a Compute node fails and triggers Instance HA:

  1. At the time of failure, the IPMI agent performs first-layer fencing, which includes physically resetting the node to ensure that it shuts down and preventing data corruption or multiple identical instances on the overcloud. When the node is offline, it is considered fenced.
  2. After the physical IPMI fencing, the fence-nova agent automatically performs second-layer fencing and marks the fenced node with the "evacuate=yes" cluster per-node attribute by running the following command:

    $ attrd_updater -n evacuate -A name="evacuate" host="FAILEDHOST" value="yes"

    FAILEDHOST is the name of the failed Compute node.

  3. The nova-evacuate agent continually runs in the background and periodically checks the cluster for nodes with the "evacuate=yes" attribute. When nova-evacuate detects that the fenced node contains this attribute, the agent starts evacuating the node. The evacuation process is similar to the manual instance evacuation process that you can perform at any time.
  4. When the failed node restarts after the IPMI reset, the nova-compute process on that node also starts automatically. Because the node was previously fenced, it does not run any new instances until Pacemaker un-fences the node.
  5. When Pacemaker detects that the Compute node is online, it starts the compute-unfence-trigger resource agent on the node, which releases the node and so that it can run instances again.

Additional resources

1.2. Planning your Instance HA deployment

Before you deploy Instance HA, review the resource names for compliance and configure your storage and networking based on your environment.

  • Compute node host names and Pacemaker remote resource names must comply with the W3C naming conventions. For more information, see Declaring Namespaces and Names and Tokens in the W3C documentation.
  • Typically, Instance HA requires that you configure shared storage for disk images of instances. Therefore, if you attempt to use the no-shared-storage option, you might receive an InvalidSharedStorage error during evacuation, and the instances will not start on another Compute node.

    However, if all your instances are configured to boot from an OpenStack Block Storage (cinder) volume, you do not need to configure shared storage for the disk image of the instances, and you can evacuate all instances using the no-shared-storage option.

    During evacuation, if your instances are configured to boot from a Block Storage volume, any evacuated instances boot from the same volume on another Compute node. Therefore, the evacuated instances immediately restart their jobs because the OS image and the application data are stored on the OpenStack Block Storage volume.

  • If you deploy Instance HA in a Spine-Leaf environment, you must define a single internal_api network for the Controller and Compute nodes. You can then define a subnet for each leaf. For more information about configuring Spine-Leaf networks, see Creating a roles data file in the Spine Leaf Networking guide.
  • From Red Hat OpenStack Platform 13 and later, you use director to upgrade Instance HA as a part of the overcloud upgrade. For more information about upgrading the overcloud, see Keeping Red Hat OpenStack Platform Updated guide.
  • Disabling Instance HA with the director after installation is not supported. For a workaround to manually remove Instance HA components from your deployment, see the article How can I remove Instance HA components from the controller nodes? .

    Important

    This workaround is not verified for production environments. You must verify the procedure in a test environment before you implement it in a production environment.

1.3. Instance HA resource agents

Instance HA uses the fence_compute, NovaEvacuate, and comput-unfence-trigger resource agents to evacuate and re-created instance if a Compute node fails.

Agent nameName inside clusterRole

fence_compute

fence-nova

Marks a Compute node for evacuation when the node becomes unavailable.

NovaEvacuate

nova-evacuate

Evacuates instances from failed nodes. This agent runs on one of the Controller nodes.

Dummy

compute-unfence-trigger

Releases a fenced node and enables the node to run instances again.