7.6. Host Resilience

7.6.1. Host High Availability

The Red Hat Virtualization Manager uses fencing to keep the hosts in a cluster responsive. A Non Responsive host is different from a Non Operational host. The Manager can communicate with a Non Operational host, but the host has an incorrect configuration, for example a missing logical network. The Manager cannot communicate with a Non Responsive host.
If a host with a power management device loses communication with the Manager, it can be fenced (rebooted) from the Administration Portal. All the virtual machines running on that host are stopped, and highly available virtual machines are started on a different host.
All power management operations are done using a proxy host, as opposed to directly by the Red Hat Virtualization Manager. At least two hosts are required for power management operations.
Fencing allows a cluster to react to unexpected host failures as well as enforce power saving, load balancing, and virtual machine availability policies. You should configure the fencing parameters for your host's power management device and periodically test that they are still correct.
Hosts can be fenced automatically using the power management parameters, or manually by right-clicking on a host and using the options on the menu. In a fencing operation, an unresponsive host is rebooted, and if the host does not return to an active status within a prescribed time, it remains unresponsive pending manual intervention and troubleshooting.
If the host is required to run virtual machines that are highly available, power management must be enabled and configured.

7.6.2. Power Management by Proxy in Red Hat Virtualization

The Red Hat Virtualization Manager does not communicate directly with fence agents. Instead, the Manager uses a proxy to send power management commands to a host power management device. The Manager uses VDSM to execute power management device actions, so another host in the environment is used as a fencing proxy.
You can select between:
  • Any host in the same cluster as the host requiring fencing.
  • Any host in the same data center as the host requiring fencing.
A viable fencing proxy host has a status of either UP or Maintenance.

7.6.3. Setting Fencing Parameters on a Host

The parameters for host fencing are set using the Power Management fields on the New Host or Edit Host windows. Power management enables the system to fence a troublesome host using an additional interface such as a Remote Access Card (RAC).
All power management operations are done using a proxy host, as opposed to directly by the Red Hat Virtualization Manager. At least two hosts are required for power management operations.

Procedure 7.16. Setting fencing parameters on a host

  1. Use the Hosts resource tab, tree mode, or the search function to find and select the host in the results list.
  2. Click Edit to open the Edit Host window.
  3. Click the Power Management tab.

    Figure 7.2. Power Management Settings

  4. Select the Enable Power Management check box to enable the fields.
  5. Select the Kdump integration check box to prevent the host from being fenced while it performs a kernel crash dump.

    Important

    When you enable Kdump integration on an existing host, the host must be reinstalled for kdump to be configured. See Section 7.5.11, “Reinstalling Hosts”.
  6. Optionally, select the Disable policy control of power management check box if you do not want your host's power management to be controlled by the Scheduling Policy of the host's cluster.
  7. Click the plus (+) button to add a new power management device. The Edit fence agent window opens.

    Figure 7.3. Edit fence agent

  8. Enter the Address, User Name, and Password of the power management device.
  9. Select the power management device Type from the drop-down list.

    Note

    For more information on how to set up a custom power management device, see https://access.redhat.com/articles/1238743.
  10. Enter the SSH Port number used by the power management device to communicate with the host.
  11. Enter the Slot number used to identify the blade of the power management device.
  12. Enter the Options for the power management device. Use a comma-separated list of 'key=value' entries.
  13. Select the Secure check box to enable the power management device to connect securely to the host.
  14. Click the Test button to ensure the settings are correct. The message "Test Succeeded, Host Status is: on" is displayed upon successful verification.

    Warning

    Power management parameters (userid, password, options, etc.) are tested by the Red Hat Virtualization Manager only during setup, and afterwards only when you test them manually. If you choose to ignore alerts about incorrect parameters, or if the parameters are changed on the power management hardware without the corresponding change in the Red Hat Virtualization Manager, fencing is likely to fail when most needed.
  15. Click OK to close the Edit fence agent window.
  16. In the Power Management tab, optionally expand the Advanced Parameters and use the up and down buttons to specify the order in which the Manager will search the host's cluster and dc (data center) for a fencing proxy.
  17. Click OK.
You are returned to the list of hosts. Note that the exclamation mark next to the host's name has now disappeared, signifying that power management has been successfully configured.

7.6.4. fence_kdump Advanced Configuration

kdump

Select a host to view the status of the kdump service in the General tab of the details pane:

  • Enabled: kdump is configured properly and the kdump service is running.
  • Disabled: the kdump service is not running (in this case kdump integration will not work properly).
  • Unknown: happens only for hosts with an older VDSM version that does not report kdump status.
For more information on installing and using kdump, see the Red Hat Enterprise Linux 7 Kernel Crash Dump Guide.

fence_kdump

Enabling Kdump integration in the Power Management tab of the New Host or Edit Host window configures a standard fence_kdump setup. If the environment's network configuration is simple and the Manager's FQDN is resolvable on all hosts, the default fence_kdump settings are sufficient for use.

However, there are some cases where advanced configuration of fence_kdump is necessary. Environments with more complex networking may require manual changes to the configuration of the Manager, fence_kdump listener, or both. For example, if the Manager's FQDN is not resolvable on all hosts with Kdump integration enabled, you can set a proper host name or IP address using engine-config:
# engine-config -s FenceKdumpDestinationAddress=A.B.C.D
The following example cases may also require configuration changes:
  • The Manager has two NICs, where one of these is public-facing, and the second is the preferred destination for fence_kdump messages.
  • You need to execute the fence_kdump listener on a different IP or port.
  • You need to set a custom interval for fence_kdump notification messages, to prevent possible packet loss.
Customized fence_kdump detection settings are recommended for advanced users only, as changes to the default configuration are only necessary in more complex networking setups. For configuration options for the fence_kdump listener, see Section 7.6.4.1, “fence_kdump listener Configuration”. To configure fence_kdump on the Manager, see Section 7.6.4.2, “Configuring fence_kdump on the Manager”.

7.6.4.1. fence_kdump listener Configuration

Edit the configuration of the fence_kdump listener. This is only necessary in cases where the default configuration is not sufficient.

Procedure 7.17. Manually Configuring the fence_kdump Listener

  1. In /etc/ovirt-engine/ovirt-fence-kdump-listener.conf.d/, create a new file (for example, my-fence-kdump.conf).
  2. Enter your customization with the syntax OPTION=value and save the file.

    Important

    The edited values must also be changed in engine-config, as outlined in Table 7.9, “fence_kdump Listener Configuration Options”. See Section 7.6.4.2, “Configuring fence_kdump on the Manager”.
  3. Restart the fence_kdump listener:
    # systemctl restart ovirt-fence-kdump-listener.service
The following options can be customized if required:

Table 7.9. fence_kdump Listener Configuration Options

| Variable | Description | Default | Note |
|---|---|---|---|
| LISTENER_ADDRESS | Defines the IP address to receive fence_kdump messages on. | 0.0.0.0 | If the value of this parameter is changed, it must match the value of FenceKdumpDestinationAddress in engine-config. |
| LISTENER_PORT | Defines the port to receive fence_kdump messages on. | 7410 | If the value of this parameter is changed, it must match the value of FenceKdumpDestinationPort in engine-config. |
| HEARTBEAT_INTERVAL | Defines the interval in seconds of the listener's heartbeat updates. | 30 | If the value of this parameter is changed, it must be half the size or smaller than the value of FenceKdumpListenerTimeout in engine-config. |
| SESSION_SYNC_INTERVAL | Defines the interval in seconds to synchronize the listener's host kdumping sessions in memory to the database. | 5 | If the value of this parameter is changed, it must be half the size or smaller than the value of KdumpStartedTimeout in engine-config. |
| REOPEN_DB_CONNECTION_INTERVAL | Defines the interval in seconds to reopen the database connection which was previously unavailable. | 30 | - |
| KDUMP_FINISHED_TIMEOUT | Defines the maximum timeout in seconds after the last received message from kdumping hosts after which the host kdump flow is marked as FINISHED. | 60 | If the value of this parameter is changed, it must be double the size or higher than the value of FenceKdumpMessageInterval in engine-config. |
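As an illustration of the options above, the following sketch writes a drop-in file that binds the listener to a specific address and a non-default port. The address 192.0.2.10, the port 7411, and the scratch directory are assumptions chosen for the example; on a Manager the file belongs in /etc/ovirt-engine/ovirt-fence-kdump-listener.conf.d/.

```shell
# Sketch only: write the sample file to a scratch directory so the snippet
# can run on any machine. The values below are illustrative assumptions.
conf_dir="$(mktemp -d)"
conf_file="$conf_dir/my-fence-kdump.conf"

# One OPTION=value pair per line, as described in the procedure above.
cat > "$conf_file" <<'EOF'
LISTENER_ADDRESS=192.0.2.10
LISTENER_PORT=7411
EOF

cat "$conf_file"
```

With these values, the Note column requires FenceKdumpDestinationAddress to be set to 192.0.2.10 and FenceKdumpDestinationPort to 7411 in engine-config, and the listener service must be restarted for the file to take effect.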

7.6.4.2. Configuring fence_kdump on the Manager

Edit the Manager's kdump configuration. This is only necessary in cases where the default configuration is not sufficient. The current configuration values can be found using:
# engine-config -g OPTION
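To review all of the fence_kdump related values in one pass, a small loop over the option names listed in the Kdump Configuration Options table can be used. This is a sketch rather than a product feature: the variable name kdump_options is made up for the example, and the fallback branch exists only so that the snippet also runs on a machine where engine-config is not installed.

```shell
# Option names taken from the Kdump Configuration Options table.
kdump_options="FenceKdumpDestinationAddress FenceKdumpDestinationPort FenceKdumpMessageInterval FenceKdumpListenerTimeout KdumpStartedTimeout"

for opt in $kdump_options; do
    if command -v engine-config >/dev/null 2>&1; then
        # On the Manager: print the current value of the option.
        engine-config -g "$opt"
    else
        # Elsewhere: only show the command that would be run.
        echo "would run: engine-config -g $opt"
    fi
done
```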

Procedure 7.18. Manually Configuring Kdump with engine-config

  1. Edit kdump's configuration using the engine-config command:
    # engine-config -s OPTION=value

    Important

    The edited values must also be changed in the fence_kdump listener configuration file as outlined in the Kdump Configuration Options table. See Section 7.6.4.1, “fence_kdump listener Configuration”.
  2. Restart the ovirt-engine service:
    # systemctl restart ovirt-engine.service
  3. Reinstall all hosts with Kdump integration enabled, if required (see the table below).
The following options can be configured using engine-config:

Table 7.10. Kdump Configuration Options

| Variable | Description | Default | Note |
|---|---|---|---|
| FenceKdumpDestinationAddress | Defines the hostname(s) or IP address(es) to send fence_kdump messages to. If empty, the Manager's FQDN is used. | Empty string (Manager FQDN is used) | If the value of this parameter is changed, it must match the value of LISTENER_ADDRESS in the fence_kdump listener configuration file, and all hosts with Kdump integration enabled must be reinstalled. |
| FenceKdumpDestinationPort | Defines the port to send fence_kdump messages to. | 7410 | If the value of this parameter is changed, it must match the value of LISTENER_PORT in the fence_kdump listener configuration file, and all hosts with Kdump integration enabled must be reinstalled. |
| FenceKdumpMessageInterval | Defines the interval in seconds between messages sent by fence_kdump. | 5 | If the value of this parameter is changed, it must be half the size or smaller than the value of KDUMP_FINISHED_TIMEOUT in the fence_kdump listener configuration file, and all hosts with Kdump integration enabled must be reinstalled. |
| FenceKdumpListenerTimeout | Defines the maximum timeout in seconds since the last heartbeat to consider the fence_kdump listener alive. | 90 | If the value of this parameter is changed, it must be double the size or higher than the value of HEARTBEAT_INTERVAL in the fence_kdump listener configuration file. |
| KdumpStartedTimeout | Defines the maximum timeout in seconds to wait until the first message from the kdumping host is received (to detect that host kdump flow has started). | 30 | If the value of this parameter is changed, it must be double the size or higher than the value of SESSION_SYNC_INTERVAL in the fence_kdump listener configuration file, and FenceKdumpMessageInterval. |
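The timing rules in the Note columns of the listener and Manager tables can be checked before applying a change. The sketch below is a hypothetical helper (the name check_kdump_timing and its argument order are not part of the product); it takes whole-second values and reports every rule that a proposed combination would break.

```shell
# check_kdump_timing HEARTBEAT_INTERVAL SESSION_SYNC_INTERVAL KDUMP_FINISHED_TIMEOUT \
#                    FenceKdumpMessageInterval FenceKdumpListenerTimeout KdumpStartedTimeout
# Prints each violated rule to stderr and returns non-zero if any rule is broken.
check_kdump_timing() {
    heartbeat=$1; session_sync=$2; kdump_finished=$3
    msg_interval=$4; listener_timeout=$5; started_timeout=$6
    rc=0
    if [ "$listener_timeout" -lt $((2 * heartbeat)) ]; then
        echo "FenceKdumpListenerTimeout must be at least 2 x HEARTBEAT_INTERVAL" >&2
        rc=1
    fi
    if [ "$started_timeout" -lt $((2 * session_sync)) ]; then
        echo "KdumpStartedTimeout must be at least 2 x SESSION_SYNC_INTERVAL" >&2
        rc=1
    fi
    if [ "$started_timeout" -lt $((2 * msg_interval)) ]; then
        echo "KdumpStartedTimeout must be at least 2 x FenceKdumpMessageInterval" >&2
        rc=1
    fi
    if [ "$kdump_finished" -lt $((2 * msg_interval)) ]; then
        echo "KDUMP_FINISHED_TIMEOUT must be at least 2 x FenceKdumpMessageInterval" >&2
        rc=1
    fi
    return "$rc"
}

# Documented defaults: every rule holds.
check_kdump_timing 30 5 60 5 90 30 && echo "defaults are consistent"
```

Raising HEARTBEAT_INTERVAL to 60 while leaving FenceKdumpListenerTimeout at 90, for example, fails the first rule, because 90 is less than twice 60.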

7.6.5. Soft-Fencing Hosts

Hosts can sometimes become non-responsive due to an unexpected problem, and though VDSM is unable to respond to requests, the virtual machines that depend upon VDSM remain alive and accessible. In these situations, restarting VDSM returns VDSM to a responsive state and resolves this issue.
"SSH Soft Fencing" is a process where the Manager attempts to restart VDSM via SSH on non-responsive hosts. If the Manager fails to restart VDSM via SSH, the responsibility for fencing falls to the external fencing agent if an external fencing agent has been configured.
Soft-fencing over SSH works as follows. Fencing must be configured and enabled on the host, and a valid proxy host (a second host, in an UP state, in the data center) must exist. When the connection between the Manager and the host times out, the following happens:
  1. On the first network failure, the status of the host changes to "connecting".
  2. The Manager then either makes three attempts to ask VDSM for its status or waits for an interval determined by the load on the host. To give VDSM the maximum amount of time to respond, the Manager chooses whichever of the two takes longer. The interval is calculated from the following configuration values: TimeoutToResetVdsInSeconds (default 60 seconds) + DelayResetPerVmInSeconds (default 0.5 seconds) * (number of virtual machines running on the host) + DelayResetForSpmInSeconds (default 20 seconds) * (1 if the host runs as SPM, 0 if it does not).
  3. If the host does not respond when that interval has elapsed, vdsm restart is executed via SSH.
  4. If vdsm restart does not succeed in re-establishing the connection between the host and the Manager, the status of the host changes to Non Responsive and, if power management is configured, fencing is handed off to the external fencing agent.
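The interval in step 2 can be worked through with a short calculation. The sketch below assumes the default values quoted in that step and uses a hypothetical helper name, soft_fence_interval. It computes only the load-based interval, not the time taken by the three status attempts that the Manager compares it against.

```shell
# soft_fence_interval RUNNING_VMS IS_SPM
#   RUNNING_VMS  number of virtual machines running on the host
#   IS_SPM       1 if the host runs as SPM, 0 otherwise
# The defaults mirror the documented values; for this sketch they can be
# overridden through environment variables of the same name.
soft_fence_interval() {
    LC_ALL=C awk \
        -v vms="$1" -v spm="$2" \
        -v base="${TimeoutToResetVdsInSeconds:-60}" \
        -v per_vm="${DelayResetPerVmInSeconds:-0.5}" \
        -v spm_delay="${DelayResetForSpmInSeconds:-20}" \
        'BEGIN { print base + per_vm * vms + spm_delay * spm }'
}

soft_fence_interval 10 1   # 60 + 0.5*10 + 20*1
soft_fence_interval 3 0    # 60 + 0.5*3  + 20*0
```

For a host that runs 10 virtual machines and is also the SPM, the interval is 60 + 5 + 20 = 85 seconds.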

Note

Soft-fencing over SSH can be executed on hosts that have no power management configured. This is distinct from "fencing": fencing can be executed only on hosts that have power management configured.

7.6.6. Using Host Power Management Functions

Summary

When power management has been configured for a host, you can access a number of options from the Administration Portal interface. While each power management device has its own customizable options, they all support the basic options to start, stop, and restart a host.

Procedure 7.19. Using Host Power Management Functions

  1. Use the Hosts resource tab, tree mode, or the search function to find and select the host in the results list.
  2. Click the Power Management drop-down menu.
  3. Select one of the following options:
    • Restart: This option stops the host and waits until the host's status changes to Down. When the agent has verified that the host is down, the highly available virtual machines are restarted on another host in the cluster. The agent then restarts this host. When the host is ready for use its status displays as Up.
    • Start: This option starts the host and lets it join a cluster. When it is ready for use its status displays as Up.
    • Stop: This option powers off the host. Before using this option, ensure that the virtual machines running on the host have been migrated to other hosts in the cluster. Otherwise, the virtual machines will crash and only the highly available virtual machines will be restarted on another host. When the host has been stopped, its status displays as Non Operational.

    Important

    When two fencing agents are defined on a host, they can be used concurrently or sequentially. For concurrent agents, both agents must respond to the Stop command for the host to be stopped, whereas the host starts as soon as one agent responds to the Start command. For sequential agents, the primary agent is used first to start or stop a host; if it fails, the secondary agent is used.
  4. Selecting one of the above options opens a confirmation window. Click OK to confirm and proceed.
Result

The selected action is performed.
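The difference between concurrent and sequential fence agents described in the Important note of this procedure can be modelled with a few shell functions. This is an illustrative model of the success conditions only: the function names and the agent_ok and agent_fail stand-ins are hypothetical, and the agents are called one after another here rather than in parallel.

```shell
# Stand-ins for fence agents: each takes an action and reports success or failure.
agent_ok()   { return 0; }
agent_fail() { return 1; }

# Concurrent Stop: both agents must succeed for the host to count as stopped.
concurrent_stop() {
    first=0;  "$1" stop || first=1
    second=0; "$2" stop || second=1
    [ "$first" -eq 0 ] && [ "$second" -eq 0 ]
}

# Concurrent Start: one successful agent is enough for the host to go up.
concurrent_start() {
    first=0;  "$1" start || first=1
    second=0; "$2" start || second=1
    [ "$first" -eq 0 ] || [ "$second" -eq 0 ]
}

# Sequential: try the primary agent, fall back to the secondary if it fails.
sequential_action() {
    action=$1; primary=$2; secondary=$3
    "$primary" "$action" || "$secondary" "$action"
}

concurrent_stop agent_ok agent_fail   || echo "concurrent stop: host not stopped"
concurrent_start agent_ok agent_fail  && echo "concurrent start: host started"
sequential_action stop agent_fail agent_ok && echo "sequential stop: secondary agent succeeded"
```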

7.6.7. Manually Fencing or Isolating a Non Responsive Host

Summary

If a host unpredictably goes into a non-responsive state, for example due to a hardware failure, it can significantly affect the performance of the environment. If you do not have a power management device, or it is incorrectly configured, you can reboot the host manually.

Warning

Do not use the Confirm Host has been rebooted option unless you have manually rebooted the host. Using this option while the host is still running can lead to virtual machine image corruption.

Procedure 7.20. Manually fencing or isolating a non-responsive host

  1. On the Hosts tab, select the host. The status must display as Non Responsive.
  2. Manually reboot the host. This could mean physically entering the lab and rebooting the host.
  3. In the Administration Portal, right-click the host entry and select Confirm Host has been rebooted.
  4. A message displays prompting you to ensure that the host has been shut down or rebooted. Select the Approve Operation check box and click OK.
Result

You have manually rebooted your host, allowing highly available virtual machines to be started on active hosts. You confirmed your manual fencing action in the Administration Portal, and the host is back online.