Appendix A. Automated Evacuation Through Instance HA
With Instance HA, OpenStack automates the process of evacuating instances from a Compute node when that node fails. The following process describes the sequence of events triggered when a Compute node fails.
When a Compute node fails, the IPMI agent performs first-level fencing and physically resets the node to ensure that it is powered off. Evacuating instances from online Compute nodes could result in data corruption or multiple identical instances running on the overcloud. Once the node is powered off, it is considered fenced.
After the physical IPMI fencing, the fence-nova agent performs second-level fencing and marks the fenced node with the “evacuate=yes” cluster per-node attribute. To do this, the agent runs:
$ attrd_updater -n evacuate -A name="evacuate" host="FAILEDHOST" value="yes"
Where FAILEDHOST is the hostname of the failed Compute node.
The nova-evacuate agent constantly runs in the background, periodically checking the cluster for nodes with the “evacuate=yes” attribute. Once nova-evacuate detects that the fenced node has this attribute, the agent starts evacuating the node using the same process as described in Evacuate Instances.
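The periodic check can be pictured as a filter over per-node attribute records. The following is a minimal sketch, not the agent's actual code: it assumes each record is a single line in the name/host/value form shown in the command above, and the function name hosts_to_evacuate is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# Reads attribute records on stdin, one per line, in the ASSUMED form
#   name="evacuate" host="HOST" value="VALUE"
# and prints the HOST of every record whose value is "yes".
hosts_to_evacuate() {
    sed -n 's/^name="evacuate" host="\([^"]*\)" value="yes"$/\1/p'
}
```

Given two records, one for compute-0 with value="no" and one for compute-1 with value="yes", the function prints only compute-1. Lines that do not match the assumed format are silently ignored.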
Meanwhile, as the failed node boots up from the IPMI reset, the nova-compute-checkevacuate agent waits (by default, for 120 seconds) before checking whether nova-evacuate has finished the evacuation. If it has not, the agent checks again after the same time interval.
Once nova-compute-checkevacuate verifies that the instances are fully evacuated, it triggers another process to make the fenced node available again for hosting instances.
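The wait-and-recheck behaviour of nova-compute-checkevacuate described above can be sketched as a simple polling loop. This is an illustrative sketch under stated assumptions, not the agent's implementation: the function name wait_until_evacuated, the check command passed as its first argument, and the optional retry limit are all hypothetical. Only the 120-second default interval comes from the description above.

```shell
#!/bin/sh
# Hypothetical sketch of a wait-then-check loop.
# Usage: wait_until_evacuated CHECK_CMD [INTERVAL_SECONDS] [MAX_TRIES]
#   CHECK_CMD        command or function that exits 0 once evacuation is done
#   INTERVAL_SECONDS wait before every check (default 120, as described above)
#   MAX_TRIES        assumption for this sketch: 0 (default) retries forever
wait_until_evacuated() {
    check_cmd=$1
    interval=${2:-120}
    max_tries=${3:-0}
    tries=0
    while :; do
        # Wait first, then check, so the evacuation has time to progress.
        sleep "$interval"
        if "$check_cmd"; then
            return 0
        fi
        tries=$((tries + 1))
        if [ "$max_tries" -gt 0 ] && [ "$tries" -ge "$max_tries" ]; then
            return 1
        fi
    done
}
```

A caller would supply its own check, for example a function that returns success once no node still carries the “evacuate=yes” attribute, and would make the node available again only after wait_until_evacuated returns 0.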