Chapter 2. How Instance HA Works

OpenStack uses Instance HA to automatically evacuate instances from a Compute node when that node fails. The following sequence of events is triggered when a Compute node fails.

  1. At the time of failure, the IPMI agent performs first-layer fencing and physically resets the node to ensure that it is powered off. This step is necessary because evacuating instances from a Compute node that is still online might result in data corruption or in multiple identical instances running on the overcloud. When the node is powered off, it is considered fenced.
  2. After the physical IPMI fencing, the fence-nova agent performs second-layer fencing and marks the fenced node with the “evacuate=yes” cluster per-node attribute. To do this, the agent runs the following command:

    $ attrd_updater -n evacuate -A name="evacuate" host="FAILEDHOST" value="yes"

    Where FAILEDHOST is the hostname of the failed Compute node.

    Note

    By default, all instances are evacuated. You can also tag specific images or flavors for evacuation.

    To tag an image:

    $ openstack image set --tag evacuable ID-OF-THE-IMAGE

    To tag a flavor:

    $ nova flavor-key ID-OF-THE-FLAVOR set evacuable=true
  3. The nova-evacuate agent runs continually in the background and periodically checks the cluster for nodes with the “evacuate=yes” attribute. When nova-evacuate detects that the fenced node has this attribute, the agent starts evacuating the node using the process described in Evacuate Instances.
  4. While the failed node boots up after the IPMI reset, the nova-compute process on that node starts automatically. Because the node was fenced earlier, it cannot run any new instances until Pacemaker unfences it.
  5. When Pacemaker sees that the Compute node is online again, it tries to start the compute-unfence-trigger resource on the node, which reverts the force-down API call and sets the node as enabled again.
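
The marking and polling in steps 2 and 3 can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the code of the actual agents: it runs no cluster commands, a temporary directory stands in for the cluster per-node attributes, and the names `ATTR_DIR`, `mark_for_evacuation`, `get_evacuate_attr`, and `poll_once` are hypothetical. Resetting the flag to "no" after a node is handled is also an assumption of the sketch, made so that a node is not reported twice.

```shell
#!/usr/bin/env bash
# Hypothetical simulation of steps 2 and 3; no real cluster commands are run.
# A temporary directory stands in for the cluster per-node attributes.
ATTR_DIR="$(mktemp -d)"

# Stand-in for the fence-nova agent: mark a fenced node with evacuate=yes.
mark_for_evacuation() {
    printf 'yes\n' > "$ATTR_DIR/$1"
}

# Read the evacuate attribute of a node; print "no" if it was never set.
get_evacuate_attr() {
    if [ -f "$ATTR_DIR/$1" ]; then
        cat "$ATTR_DIR/$1"
    else
        printf 'no\n'
    fi
}

# Stand-in for one polling pass of the nova-evacuate agent: report every
# node that carries evacuate=yes, then reset its flag (assumption of this
# sketch) so that the next pass does not report it again.
poll_once() {
    local node
    for node in "$@"; do
        if [ "$(get_evacuate_attr "$node")" = "yes" ]; then
            printf 'evacuating %s\n' "$node"
            printf 'no\n' > "$ATTR_DIR/$node"
        fi
    done
}
```

For example, after `mark_for_evacuation compute-1`, a call to `poll_once compute-0 compute-1` reports only `compute-1`, and a second call reports nothing. The node names are placeholders.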