Red Hat Training

A Red Hat training course is available for Red Hat Virtualization

7.6. Host Resilience

7.6.1. Host High Availability

The Red Hat Virtualization Manager uses fencing to keep hosts in a cluster responsive. A Non Responsive host is different from a Non Operational host. Non Operational hosts can be communicated with by the Manager, but have an incorrect configuration, for example a missing logical network. Non Responsive hosts cannot be communicated with by the Manager.

Fencing allows a cluster to react to unexpected host failures and enforce power saving, load balancing, and virtual machine availability policies. You should configure the fencing parameters for your host’s power management device and test their correctness from time to time. In a fencing operation, a non-responsive host is rebooted, and if the host does not return to an active status within a prescribed time, it remains non-responsive pending manual intervention and troubleshooting.

Note

To automatically check the fencing parameters, you can configure the PMHealthCheckEnabled (false by default) and PMHealthCheckIntervalInSec (3600 sec by default) engine-config options.

When set to true, PMHealthCheckEnabled will check all host agents at the interval specified by PMHealthCheckIntervalInSec, and raise warnings if it detects issues. See Section 18.2.2, “Syntax for the engine-config Command” for more information about configuring engine-config options.

Power management operations can be performed by Red Hat Virtualization Manager after it reboots, by a proxy host, or manually in the Administration Portal. All the virtual machines running on the non-responsive host are stopped, and highly available virtual machines are started on a different host. At least two hosts are required for power management operations.

After the Manager starts up, it automatically attempts to fence non-responsive hosts that have power management enabled after the quiet time (5 minutes by default) has elapsed. The quiet time can be configured by updating the DisableFenceAtStartupInSec engine-config option.

Note

The DisableFenceAtStartupInSec engine-config option helps prevent a scenario where the Manager attempts to fence hosts while they boot up. This can occur after a data center outage because a host’s boot process is normally longer than the Manager boot process.

Hosts can be fenced automatically by the proxy host using the power management parameters, or manually by right-clicking on a host and using the options on the menu.

Important

If a host runs virtual machines that are highly available, power management must be enabled and configured.

7.6.2. Power Management by Proxy in Red Hat Virtualization

The Red Hat Virtualization Manager does not communicate directly with fence agents. Instead, the Manager uses a proxy to send power management commands to a host power management device. The Manager uses VDSM to execute power management device actions, so another host in the environment is used as a fencing proxy.

You can select between:

  • Any host in the same cluster as the host requiring fencing.
  • Any host in the same data center as the host requiring fencing.

A viable fencing proxy host has a status of either UP or Maintenance.

7.6.3. Setting Fencing Parameters on a Host

The parameters for host fencing are set using the Power Management fields on the New Host or Edit Host windows. Power management enables the system to fence a troublesome host using an additional interface such as a Remote Access Card (RAC).

All power management operations are done using a proxy host, as opposed to directly by the Red Hat Virtualization Manager. At least two hosts are required for power management operations.

Setting fencing parameters on a host

  1. Click ComputeHosts and select the host.
  2. Click Edit.
  3. Click the Power Management tab.
  4. Select the Enable Power Management check box to enable the fields.
  5. Select the Kdump integration check box to prevent the host from fencing while performing a kernel crash dump.

    Important

    If you enable or disable Kdump integration on an existing host, you must reinstall the host.

  6. Optionally, select the Disable policy control of power management check box if you do not want your host’s power management to be controlled by the Scheduling Policy of the host’s cluster.
  7. Click the + button to add a new power management device. The Edit fence agent window opens.
  8. Enter the Address, User Name, and Password of the power management device.
  9. Select the power management device Type from the drop-down list.

    Note

    For more information on how to set up a custom power management device, see https://access.redhat.com/articles/1238743.

  10. Enter the SSH Port number used by the power management device to communicate with the host.
  11. Enter the Slot number used to identify the blade of the power management device.
  12. Enter the Options for the power management device. Use a comma-separated list of 'key=value' entries.
  13. Select the Secure check box to enable the power management device to connect securely to the host.
  14. Click the Test button to ensure the settings are correct. Test Succeeded, Host Status is: on will display upon successful verification.

    Warning

    Power management parameters (userid, password, options, etc) are tested by Red Hat Virtualization Manager only during setup and manually after that. If you choose to ignore alerts about incorrect parameters, or if the parameters are changed on the power management hardware without the corresponding change in Red Hat Virtualization Manager, fencing is likely to fail when most needed.

  15. Click OK to close the Edit fence agent window.
  16. In the Power Management tab, optionally expand the Advanced Parameters and use the up and down buttons to specify the order in which the Manager will search the host’s cluster and dc (datacenter) for a fencing proxy.
  17. Click OK.

You are returned to the list of hosts. Note that the exclamation mark next to the host’s name has now disappeared, signifying that power management has been successfully configured.

7.6.4. fence_kdump Advanced Configuration

kdump

Click the name of a host to view the status of the kdump service in the General tab of the details view:

  • Enabled: kdump is configured properly and the kdump service is running.
  • Disabled: the kdump service is not running (in this case kdump integration will not work properly).
  • Unknown: happens only for hosts with an earlier VDSM version that does not report kdump status.

For more information on installing and using kdump, see the Red Hat Enterprise Linux 7 Kernel Crash Dump Guide.

fence_kdump

Enabling Kdump integration in the Power Management tab of the New Host or Edit Host window configures a standard fence_kdump setup. If the environment’s network configuration is simple and the Manager’s FQDN is resolvable on all hosts, the default fence_kdump settings are sufficient for use.

However, there are some cases where advanced configuration of fence_kdump is necessary. Environments with more complex networking may require manual changes to the configuration of the Manager, fence_kdump listener, or both. For example, if the Manager’s FQDN is not resolvable on all hosts with Kdump integration enabled, you can set a proper host name or IP address using engine-config:

engine-config -s FenceKdumpDestinationAddress=A.B.C.D

The following example cases may also require configuration changes:

  • The Manager has two NICs, where one of these is public-facing, and the second is the preferred destination for fence_kdump messages.
  • You need to execute the fence_kdump listener on a different IP or port.
  • You need to set a custom interval for fence_kdump notification messages, to prevent possible packet loss.

Customized fence_kdump detection settings are recommended for advanced users only, as changes to the default configuration are only necessary in more complex networking setups. For configuration options for the fence_kdump listener see Section 7.6.4.1, “fence_kdump listener Configuration”. For configuration of kdump on the Manager see Section 7.6.4.2, “Configuring fence_kdump on the Manager”.

7.6.4.1. fence_kdump listener Configuration

Edit the configuration of the fence_kdump listener. This is only necessary in cases where the default configuration is not sufficient.

Manually Configuring the fence_kdump Listener

  1. Create a new file (for example, my-fence-kdump.conf) in /etc/ovirt-engine/ovirt-fence-kdump-listener.conf.d/.
  2. Enter your customization with the syntax OPTION=value and save the file.

    Important

    The edited values must also be changed in engine-config as outlined in the fence_kdump Listener Configuration Options table in Section 7.6.4.2, “Configuring fence_kdump on the Manager”.

  3. Restart the fence_kdump listener:

    # systemctl restart ovirt-fence-kdump-listener.service

The following options can be customized if required:

Table 7.9. fence_kdump Listener Configuration Options

VariableDescriptionDefaultNote

LISTENER_ADDRESS

Defines the IP address to receive fence_kdump messages on.

0.0.0.0

If the value of this parameter is changed, it must match the value of FenceKdumpDestinationAddress in engine-config.

LISTENER_PORT

Defines the port to receive fence_kdump messages on.

7410

If the value of this parameter is changed, it must match the value of FenceKdumpDestinationPort in engine-config.

HEARTBEAT_INTERVAL

Defines the interval in seconds of the listener’s heartbeat updates.

30

If the value of this parameter is changed, it must be half the size or smaller than the value of FenceKdumpListenerTimeout in engine-config.

SESSION_SYNC_INTERVAL

Defines the interval in seconds to synchronize the listener’s host kdumping sessions in memory to the database.

5

If the value of this parameter is changed, it must be half the size or smaller than the value of KdumpStartedTimeout in engine-config.

REOPEN_DB_CONNECTION_INTERVAL

Defines the interval in seconds to reopen the database connection which was previously unavailable.

30

-

KDUMP_FINISHED_TIMEOUT

Defines the maximum timeout in seconds after the last received message from kdumping hosts after which the host kdump flow is marked as FINISHED.

60

If the value of this parameter is changed, it must be double the size or higher than the value of FenceKdumpMessageInterval in engine-config.

7.6.4.2. Configuring fence_kdump on the Manager

Edit the Manager’s kdump configuration. This is only necessary in cases where the default configuration is not sufficient. The current configuration values can be found using:

# engine-config -g OPTION

Manually Configuring Kdump with engine-config

  1. Edit kdump’s configuration using the engine-config command:

    # engine-config -s OPTION=value
    Important

    The edited values must also be changed in the fence_kdump listener configuration file as outlined in the Kdump Configuration Options table. See Section 7.6.4.1, “fence_kdump listener Configuration”.

  2. Restart the ovirt-engine service:

    # systemctl restart ovirt-engine.service
  3. Reinstall all hosts with Kdump integration enabled, if required (see the table below).

The following options can be configured using engine-config:

Table 7.10. Kdump Configuration Options

VariableDescriptionDefaultNote

FenceKdumpDestinationAddress

Defines the hostname(s) or IP address(es) to send fence_kdump messages to. If empty, the Manager’s FQDN is used.

Empty string (Manager FQDN is used)

If the value of this parameter is changed, it must match the value of LISTENER_ADDRESS in the fence_kdump listener configuration file, and all hosts with Kdump integration enabled must be reinstalled.

FenceKdumpDestinationPort

Defines the port to send fence_kdump messages to.

7410

If the value of this parameter is changed, it must match the value of LISTENER_PORT in the fence_kdump listener configuration file, and all hosts with Kdump integration enabled must be reinstalled.

FenceKdumpMessageInterval

Defines the interval in seconds between messages sent by fence_kdump.

5

If the value of this parameter is changed, it must be half the size or smaller than the value of KDUMP_FINISHED_TIMEOUT in the fence_kdump listener configuration file, and all hosts with Kdump integration enabled must be reinstalled.

FenceKdumpListenerTimeout

Defines the maximum timeout in seconds since the last heartbeat to consider the fence_kdump listener alive.

90

If the value of this parameter is changed, it must be double the size or higher than the value of HEARTBEAT_INTERVAL in the fence_kdump listener configuration file.

KdumpStartedTimeout

Defines the maximum timeout in seconds to wait until the first message from the kdumping host is received (to detect that host kdump flow has started).

30

If the value of this parameter is changed, it must be double the size or higher than the value of SESSION_SYNC_INTERVAL in the fence_kdump listener configuration file, and FenceKdumpMessageInterval.

7.6.5. Soft-Fencing Hosts

Hosts can sometimes become non-responsive due to an unexpected problem, and though VDSM is unable to respond to requests, the virtual machines that depend upon VDSM remain alive and accessible. In these situations, restarting VDSM returns VDSM to a responsive state and resolves this issue.

"SSH Soft Fencing" is a process where the Manager attempts to restart VDSM via SSH on non-responsive hosts. If the Manager fails to restart VDSM via SSH, the responsibility for fencing falls to the external fencing agent if an external fencing agent has been configured.

Soft-fencing over SSH works as follows. Fencing must be configured and enabled on the host, and a valid proxy host (a second host, in an UP state, in the data center) must exist. When the connection between the Manager and the host times out, the following happens:

  1. On the first network failure, the status of the host changes to "connecting".
  2. The Manager then makes three attempts to ask VDSM for its status, or it waits for an interval determined by the load on the host. The formula for determining the length of the interval is configured by the configuration values TimeoutToResetVdsInSeconds (the default is 60 seconds) + [DelayResetPerVmInSeconds (the default is 0.5 seconds)]*(the count of running virtual machines on host) + [DelayResetForSpmInSeconds (the default is 20 seconds)] * 1 (if host runs as SPM) or 0 (if the host does not run as SPM). To give VDSM the maximum amount of time to respond, the Manager chooses the longer of the two options mentioned above (three attempts to retrieve the status of VDSM or the interval determined by the above formula).
  3. If the host does not respond when that interval has elapsed, vdsm restart is executed via SSH.
  4. If vdsm restart does not succeed in re-establishing the connection between the host and the Manager, the status of the host changes to Non Responsive and, if power management is configured, fencing is handed off to the external fencing agent.
Note

Soft-fencing over SSH can be executed on hosts that have no power management configured. This is distinct from "fencing": fencing can be executed only on hosts that have power management configured.

7.6.6. Using Host Power Management Functions

When power management has been configured for a host, you can access a number of options from the Administration Portal interface. While each power management device has its own customizable options, they all support the basic options to start, stop, and restart a host.

Using Host Power Management Functions

  1. Click ComputeHosts and select the host.
  2. Click the Management drop-down menu and select one of the following Power Management options:

    • Restart: This option stops the host and waits until the host’s status changes to Down. When the agent has verified that the host is down, the highly available virtual machines are restarted on another host in the cluster. The agent then restarts this host. When the host is ready for use its status displays as Up.
    • Start: This option starts the host and lets it join a cluster. When it is ready for use its status displays as Up.
    • Stop: This option powers off the host. Before using this option, ensure that the virtual machines running on the host have been migrated to other hosts in the cluster. Otherwise the virtual machines will crash and only the highly available virtual machines will be restarted on another host. When the host has been stopped its status displays as Non-Operational.

      Note

      If Power Management is not enabled, you can restart or stop the host by selecting it, clicking the Management drop-down menu, and selecting an SSH Management option, Restart or Stop.

      Important

      When two fencing agents are defined on a host, they can be used concurrently or sequentially. For concurrent agents, both agents have to respond to the Stop command for the host to be stopped; and when one agent responds to the Start command, the host will go up. For sequential agents, to start or stop a host, the primary agent is used first; if it fails, the secondary agent is used.

  3. Click OK.

7.6.7. Manually Fencing or Isolating a Non Responsive Host

If a host unpredictably goes into a non-responsive state, for example, due to a hardware failure, it can significantly affect the performance of the environment. If you do not have a power management device, or if it is incorrectly configured, you can reboot the host manually.

Warning

Do not use the Confirm host has been rebooted option unless you have manually rebooted the host. Using this option while the host is still running can lead to a virtual machine image corruption.

Manually fencing or isolating a non-responsive host

  1. In the Administration Portal, click ComputeHosts and confirm the host’s status is Non Responsive.
  2. Manually reboot the host. This could mean physically entering the lab and rebooting the host.
  3. In the Administration Portal, select the host and click More ActionsConfirm 'Host has been Rebooted'.
  4. Select the Approve Operation check box and click OK.
  5. If your hosts take an unusually long time to boot, you can set ServerRebootTimeout to specify how many seconds to wait before determining that the host is Non Responsive:

    # engine-config --set ServerRebootTimeout=integer