Does host power management need to be enabled for Virtual Machine HA to be enabled?

We have RHV clusters that had host power management enabled until a network communication issue between the hosts and the hosted engine caused the hosts to reboot (nothing was wrong with the hosts other than that the HE could not communicate with them over the network). The reboot caused an outage until the perfectly healthy hosts came back online. Because of this, we disabled power management.

Fast-forward a few weeks and we had another incident, where the UCS blade housing one of the RHV hosts had a DIMM issue. Once the DIMM was replaced on the blade server, all was well; however, the VMs on the host that had the memory issue never restarted on the other healthy hosts. RHEL support reviewed the issue and determined that we do not have VM HA enabled - which brings us full circle...

In reading the recommended documentation sent by RHEL support, I see that VM HA requires that host power management is enabled. We need to find a solution with the following characteristics:

  • VM HA
  • host/hosted engine configs that allow for VM HA and persistent hosts in the event of poor or broken network communication to the HE

Responses

Hello, I'd try to get to the bottom of the network timeout issue. If it happens regularly, it might be worth increasing the timeout. Do you use your management network for migrations and/or storage traffic? If so, investigate moving that traffic to a separate network to increase reliability.

If an HA VM runs on two hosts simultaneously, the filesystem will get corrupted. This is why the failed host needs to be fenced first (i.e., to ensure that there is no lingering qemu-kvm process accessing storage), or a reboot must be manually acknowledged via the GUI (human intervention required). For this reason, power management is needed for the VM to be 'highly available'.
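
To make the rule concrete, here is a rough sketch of the decision in Python. It is only a model of the reasoning above, not RHV or engine code: the HostState class, its fields, and the function name are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    """Hypothetical view of a non-responsive host (not an RHV API object)."""
    power_management_enabled: bool
    fence_succeeded: bool = False              # fence agent confirmed the power cycle
    manually_confirmed_rebooted: bool = False  # admin acknowledged the reboot in the GUI


def safe_to_restart_ha_vms(host: HostState) -> bool:
    """Return True only when no old qemu-kvm process can still be touching storage."""
    # Automatic path: the host was fenced through power management.
    if host.power_management_enabled and host.fence_succeeded:
        return True
    # Manual path: a human confirmed that the host has been rebooted.
    return host.manually_confirmed_rebooted


# Without power management and without manual confirmation, HA VMs stay down.
print(safe_to_restart_ha_vms(HostState(power_management_enabled=False)))
```

With power management disabled, the only way this returns True is the manual confirmation, which fits the DIMM incident described above, where the VMs were never restarted elsewhere.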

Thanks @Marcus - The network issue was addressed when it happened. Our concern was that one issue was leading to another, more severe issue. However, only a few hosts in the cluster rebooted because of the network issue. Therefore, if HA had been enabled on the VMs, they would have restarted somewhere else. We plan on enabling power management again, setting the timeout to a higher value, like 60 or 90 seconds, instead of the default 30, and enabling HA on the VMs.
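
As a back-of-the-envelope illustration of why the higher timeout should help, here is a deliberately simplified model in Python. It assumes a host is fenced only once it has been unreachable for longer than the timeout; the real engine logic has more steps than this, and the 45-second outage is a made-up number.

```python
def would_fence(unreachable_seconds: float, timeout_seconds: float) -> bool:
    """Simplified rule: fence only after the host has been unreachable longer than the timeout."""
    return unreachable_seconds > timeout_seconds


outage = 45  # hypothetical length of a network blip, in seconds

# Compare the default 30 seconds with the proposed 60 and 90 seconds.
results = {timeout: would_fence(outage, timeout) for timeout in (30, 60, 90)}
print(results)  # {30: True, 60: False, 90: False}
```

Under this model, a blip that would trigger fencing at 30 seconds is ridden out at 60 or 90, at the cost of HA VMs on a genuinely dead host taking that much longer to restart.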