We have RHV clusters that had host power management enabled until a network communication issue between the hosts and the hosted engine (HE) caused the hosts to reboot. Nothing was wrong with the hosts themselves; the HE simply could not reach them over the network. The reboot caused an outage that lasted until the perfectly healthy hosts came back online. Because of this, we disabled power management.
Fast-forward a few weeks and we had another incident, where the UCS blade housing one of the RHV hosts had a DIMM issue. Once the DIMM was replaced on the blade server, all was well; however, the VMs on the host that had the memory issue were never restarted on the other healthy hosts. RHEL support reviewed the issue and determined that we do not have VM HA enabled, which brings us full circle...
In reading the recommended documentation sent by RHEL support, I see that VM HA requires that host power management is enabled. We need to find a solution with the following characteristics:
- VM HA
- host/hosted engine configs that allow for VM HA while keeping healthy hosts up (not fenced or rebooted) when network communication to the HE is poor or broken