Chapter 2. How Instance HA Works

OpenStack uses Instance HA to automate the process of evacuating instances from a Compute node when that node fails. The following procedure describes the sequence of events that are triggered when a Compute node fails.

  1. At the time of failure, the IPMI agent performs first-layer fencing and physically resets the node to ensure that it is powered off. Evacuating instances from online Compute nodes might result in data corruption or in multiple identical instances running on the overcloud. When the node is powered off, it is considered fenced.
  2. After the physical IPMI fencing, the fence-nova agent performs second-layer fencing and marks the fenced node with the “evacuate=yes” cluster per-node attribute. To do this, the agent runs the following command:

    $ attrd_updater -n evacuate -A name="evacuate" host="FAILEDHOST" value="yes"

    Where FAILEDHOST is the hostname of the failed Compute node.

  3. The nova-evacuate agent continually runs in the background, periodically checking the cluster for nodes with the “evacuate=yes” attribute. When nova-evacuate detects that the fenced node contains this attribute, the agent starts evacuating the node using the process described in Evacuate Instances.
  4. While the failed node boots up from the IPMI reset, the nova-compute process on that node starts automatically. Because the node was fenced earlier, it does not run any new instances until Pacemaker unfences it.
  5. When Pacemaker sees that the Compute node is online again, it tries to start the compute-unfence-trigger resource on the node, reverting the force-down API call and setting the node as enabled again.

2.1. Designating specific instances to be evacuated

By default, all instances are to be evacuated, but it is also possible to tag images or flavors for evacuation.

To tag an image:

$ openstack image set --tag evacuable ID-OF-THE-IMAGE

To tag a flavor:

$ nova flavor-key ID-OF-THE-FLAVOR set evacuable=true