7.7. Host Resilience
- 7.7.1. Host High Availability
- 7.7.2. Power Management by Proxy in Red Hat Enterprise Virtualization
- 7.7.3. Setting Fencing Parameters on a Host
- 7.7.4. fence_kdump Advanced Configuration
- 7.7.5. Soft-Fencing Hosts
- 7.7.6. Using Host Power Management Functions
- 7.7.7. Manually Fencing or Isolating a Non Responsive Host
7.7.1. Host High Availability
The Red Hat Enterprise Virtualization Manager uses fencing to keep the hosts in a cluster responsive. A Non Responsive host is different from a Non Operational host: the Manager can communicate with a Non Operational host, but the host has an incorrect configuration, for example a missing logical network. The Manager cannot communicate with a Non Responsive host at all.
If a host with a power management device loses communication with the Manager, it can be fenced (rebooted) from the Administration Portal. All the virtual machines running on that host are stopped, and highly available virtual machines are started on a different host.
All power management operations are done using a proxy host, as opposed to directly by the Red Hat Enterprise Virtualization Manager. At least two hosts are required for power management operations.
Fencing allows a cluster to react to unexpected host failures as well as enforce power saving, load balancing, and virtual machine availability policies. You should configure the fencing parameters for your host's power management device and test their correctness from time to time.
Hosts can be fenced automatically using the power management parameters, or manually by right-clicking on a host and using the options on the menu. In a fencing operation, an unresponsive host is rebooted, and if the host does not return to an active status within a prescribed time, it remains unresponsive pending manual intervention and troubleshooting.
If the host is required to run virtual machines that are highly available, power management must be enabled and configured.
7.7.2. Power Management by Proxy in Red Hat Enterprise Virtualization
The Red Hat Enterprise Virtualization Manager does not communicate directly with fence agents. Instead, the Manager uses a proxy to send power management commands to a host power management device. The Manager uses VDSM to execute power management device actions, so another host in the environment is used as a fencing proxy.
You can select between:
- Any host in the same cluster as the host requiring fencing.
- Any host in the same data center as the host requiring fencing.
A viable fencing proxy host has a status of either UP or Maintenance.
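The proxy search order can typically be inspected with engine-config. The following is a hedged sketch: the option name FenceProxyDefaultPreferences is an assumption based on this release's engine-config vocabulary, so verify that it exists in your installation with engine-config -l before relying on it.
# engine-config -l | grep -i FenceProxy
# engine-config -g FenceProxyDefaultPreferences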
7.7.3. Setting Fencing Parameters on a Host
The parameters for host fencing are set using the Power Management fields on the New Host or Edit Host windows. Power management enables the system to fence a troublesome host using an additional interface such as a Remote Access Card (RAC).
All power management operations are done using a proxy host, as opposed to directly by the Red Hat Enterprise Virtualization Manager. At least two hosts are required for power management operations.
Procedure 7.21. Setting fencing parameters on a host
- Use the Hosts resource tab, tree mode, or the search function to find and select the host in the results list.
- Click Edit to open the Edit Host window.
- Click the Power Management tab.
- Select the Enable Power Management check box to enable the fields.
- Select the Kdump integration check box to prevent the host from fencing while performing a kernel crash dump.
Important
When you enable Kdump integration on an existing host, the host must be reinstalled for kdump to be configured. See Section 7.5.12, “Reinstalling Virtualization Hosts”.
- The Primary option is selected by default if you are configuring a new power management device. If you are adding an additional device, set it to Secondary.
- Select the Concurrent check box to enable multiple fence agents to be used concurrently.
- Enter the Address, User Name, and Password of the power management device.
- Select the power management device Type from the drop-down menu.
Note
With the Red Hat Enterprise Virtualization 3.5 release, you now have the option to use a custom power management device. For more information on how to set up a custom power management device, see https://access.redhat.com/articles/1238743.
- Enter the Port number used by the power management device to communicate with the host.
- Enter the specific Options of the power management device. Use a comma-separated list of 'key=value' or 'key' entries. An example is shown after this procedure.
- Click Test to test the power management device. The message Test Succeeded, Host Status is: on displays upon successful verification.
Warning
Power management parameters (user ID, password, options, and so on) are tested by the Red Hat Enterprise Virtualization Manager only during setup, and manually after that. If you choose to ignore alerts about incorrect parameters, or if the parameters are changed on the power management hardware without the corresponding change in the Red Hat Enterprise Virtualization Manager, fencing is likely to fail when it is most needed.
- Click OK to save the changes and close the window.
Result
You are returned to the list of hosts. Note that the exclamation mark next to the host's name has now disappeared, signifying that power management has been successfully configured.
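As an illustration of the Options field in the procedure above, the following hypothetical value for an IPMI-based device passes two extra flags to the fence agent as a comma-separated list (the exact keys depend on the device Type you selected):
lanplus=1,power_wait=4
You can also verify a device independently of the Manager by running its fence agent by hand from another host. The following is a minimal sketch using fence_ipmilan, assuming the fence-agents package is installed and substituting a placeholder address and credentials:
# fence_ipmilan -a 192.0.2.10 -l admin -p secret -o status
If the manual query succeeds but the Manager's test fails, the problem is likely in the parameters entered in the Power Management tab rather than in the device itself.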
7.7.4. fence_kdump Advanced Configuration
kdump
The kdump service is available by default on new Red Hat Enterprise Linux 6.6 and 7.1 hosts and Hypervisors. On older hosts, Kdump integration cannot be enabled; these hosts must be upgraded in order to use this feature.
Select a host to view the status of the kdump service in the General tab of the details pane:
- Enabled: kdump is configured properly and the kdump service is running.
- Disabled: the kdump service is not running (in this case kdump integration will not work properly).
- Unknown: displayed only for hosts with an older VDSM version that does not report kdump status.
For more information on installing and using kdump, see the Kernel Crash Dump Guide for Red Hat Enterprise Linux 7, or the kdump Crash Recovery Service section of the Deployment Guide for Red Hat Enterprise Linux 6.
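You can also check the service directly on a host rather than through the details pane. On a Red Hat Enterprise Linux 7 host:
# systemctl status kdump
On a Red Hat Enterprise Linux 6 host:
# service kdump status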
fence_kdump
Enabling Kdump integration in the Power Management tab of the New Host or Edit Host window configures a standard fence_kdump setup. If the environment's network configuration is simple and the Manager's FQDN is resolvable on all hosts, the default fence_kdump settings are sufficient for use.
However, there are some cases where advanced configuration of fence_kdump is necessary. Environments with more complex networking may require manual changes to the configuration of the Manager, the fence_kdump listener, or both. For example, if the Manager's FQDN is not resolvable on all hosts with Kdump integration enabled, you can set a proper host name or IP address using engine-config:
engine-config -s FenceKdumpDestinationAddress=A.B.C.D
The following example cases may also require configuration changes:
- The Manager has two NICs, where one of these is public-facing, and the second is the preferred destination for fence_kdump messages.
- You need to execute the fence_kdump listener on a different IP or port.
- You need to set a custom interval for fence_kdump notification messages, to prevent possible packet loss.
Customized fence_kdump detection settings are recommended for advanced users only, as changes to the default configuration are only necessary in more complex networking setups. For configuration options for the fence_kdump listener see Section 7.7.4.1, “fence_kdump listener Configuration”. For configuration of kdump on the Manager see Section 7.7.4.2, “Configuring fence_kdump on the Manager”.
7.7.4.1. fence_kdump listener Configuration
Edit the configuration of the fence_kdump listener. This is only necessary in cases where the default configuration is not sufficient.
Procedure 7.22. Manually Configuring the fence_kdump Listener
- Create a new file (for example, my-fence-kdump.conf) in /etc/ovirt-engine/ovirt-fence-kdump-listener.conf.d/.
- Enter your customization with the syntax OPTION=value and save the file. An example file follows this procedure.
Important
The edited values must also be changed in engine-config as outlined in the Kdump Configuration Options table in Section 7.7.4.2, “Configuring fence_kdump on the Manager”.
- Restart the fence_kdump listener:
# service ovirt-fence-kdump-listener restart
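For example, a minimal drop-in file that binds the listener to a dedicated interface might look as follows (the address is a placeholder; substitute your own):
# cat /etc/ovirt-engine/ovirt-fence-kdump-listener.conf.d/my-fence-kdump.conf
LISTENER_ADDRESS=192.0.2.1
As noted above, the same address must then be set as FenceKdumpDestinationAddress in engine-config.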
The following options can be customized if required:
Table 7.6. fence_kdump Listener Configuration Options
| Variable | Description | Default | Note |
|---|---|---|---|
| LISTENER_ADDRESS | Defines the IP address to receive fence_kdump messages on. | 0.0.0.0 | If the value of this parameter is changed, it must match the value of FenceKdumpDestinationAddress in engine-config. |
| LISTENER_PORT | Defines the port to receive fence_kdump messages on. | 7410 | If the value of this parameter is changed, it must match the value of FenceKdumpDestinationPort in engine-config. |
| HEARTBEAT_INTERVAL | Defines the interval in seconds of the listener's heartbeat updates. | 30 | If the value of this parameter is changed, it must be half the size or smaller than the value of FenceKdumpListenerTimeout in engine-config. |
| SESSION_SYNC_INTERVAL | Defines the interval in seconds to synchronize the listener's host kdumping sessions in memory to the database. | 5 | If the value of this parameter is changed, it must be half the size or smaller than the value of KdumpStartedTimeout in engine-config. |
| REOPEN_DB_CONNECTION_INTERVAL | Defines the interval in seconds to reopen the database connection which was previously unavailable. | 30 | - |
| KDUMP_FINISHED_TIMEOUT | Defines the maximum timeout in seconds after the last received message from kdumping hosts after which the host kdump flow is marked as FINISHED. | 60 | If the value of this parameter is changed, it must be double the size or higher than the value of FenceKdumpMessageInterval in engine-config. |
7.7.4.2. Configuring fence_kdump on the Manager
Edit the Manager's kdump configuration. This is only necessary in cases where the default configuration is not sufficient. The current configuration values can be found using:
# engine-config -g OPTION
Procedure 7.23. Manually Configuring Kdump with engine-config
- Edit kdump's configuration using the engine-config command:
# engine-config -s OPTION=value
Important
The edited values must also be changed in the fence_kdump listener configuration file as outlined in the Kdump Configuration Options table. See Section 7.7.4.1, “fence_kdump listener Configuration”.
- Restart the ovirt-engine service:
# service ovirt-engine restart
- Reinstall all hosts with Kdump integration enabled, if required (see the table below).
The following options can be configured using engine-config:
Table 7.7. Kdump Configuration Options
| Variable | Description | Default | Note |
|---|---|---|---|
| FenceKdumpDestinationAddress | Defines the hostname(s) or IP address(es) to send fence_kdump messages to. If empty, the Manager's FQDN is used. | Empty string (Manager FQDN is used) | If the value of this parameter is changed, it must match the value of LISTENER_ADDRESS in the fence_kdump listener configuration file, and all hosts with Kdump integration enabled must be reinstalled. |
| FenceKdumpDestinationPort | Defines the port to send fence_kdump messages to. | 7410 | If the value of this parameter is changed, it must match the value of LISTENER_PORT in the fence_kdump listener configuration file, and all hosts with Kdump integration enabled must be reinstalled. |
| FenceKdumpMessageInterval | Defines the interval in seconds between messages sent by fence_kdump. | 5 | If the value of this parameter is changed, it must be half the size or smaller than the value of KDUMP_FINISHED_TIMEOUT in the fence_kdump listener configuration file, and all hosts with Kdump integration enabled must be reinstalled. |
| FenceKdumpListenerTimeout | Defines the maximum timeout in seconds since the last heartbeat to consider the fence_kdump listener alive. | 90 | If the value of this parameter is changed, it must be double the size or higher than the value of HEARTBEAT_INTERVAL in the fence_kdump listener configuration file. |
| KdumpStartedTimeout | Defines the maximum timeout in seconds to wait until the first message from the kdumping host is received (to detect that host kdump flow has started). | 30 | If the value of this parameter is changed, it must be double the size or higher than the value of SESSION_SYNC_INTERVAL in the fence_kdump listener configuration file, and FenceKdumpMessageInterval. |
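As a worked example tying the two halves together, the following sketch implements the two-NIC case described at the beginning of Section 7.7.4: fence_kdump messages are directed at the Manager's dedicated interface (placeholder address), and the matching LISTENER_ADDRESS change from Section 7.7.4.1 is assumed to have been made already:
# engine-config -g FenceKdumpDestinationAddress
# engine-config -s FenceKdumpDestinationAddress=192.0.2.1
# service ovirt-engine restart
Because FenceKdumpDestinationAddress was changed, all hosts with Kdump integration enabled must then be reinstalled, as the table above indicates.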
7.7.5. Soft-Fencing Hosts
Sometimes a host becomes non-responsive due to an unexpected problem, and though VDSM is unable to respond to requests, the virtual machines that depend upon VDSM remain alive and accessible. In these situations, restarting VDSM returns VDSM to a responsive state and resolves this issue.
Red Hat Enterprise Virtualization 3.3 introduced "soft-fencing over SSH". Prior to Red Hat Enterprise Virtualization 3.3, non-responsive hosts were fenced only by external fencing devices. In Red Hat Enterprise Virtualization 3.3, the fencing process was expanded to include "SSH Soft Fencing", a process whereby the Manager attempts to restart VDSM via SSH on non-responsive hosts. If the Manager fails to restart VDSM via SSH, responsibility for fencing falls to the external fencing agent, if one has been configured.
Soft-fencing over SSH works as follows. Fencing must be configured and enabled on the host, and a valid proxy host (a second host, in an UP state, in the data center) must exist. When the connection between the Manager and the host times out, the following happens:
- On the first network failure, the status of the host changes to "connecting".
- The Manager then either makes three attempts to ask VDSM for its status, or waits for an interval determined by the load on the host, whichever is longer, to give VDSM the maximum amount of time to respond. The length of the interval is determined by the following configuration values: TimeoutToResetVdsInSeconds (default: 60 seconds) + DelayResetPerVmInSeconds (default: 0.5 seconds) × (the number of virtual machines running on the host) + DelayResetForSpmInSeconds (default: 20 seconds) × (1 if the host runs as Storage Pool Manager (SPM), 0 otherwise). A worked example follows this list.
- If the host does not respond when that interval has elapsed, vdsm restart is executed via SSH.
- If vdsm restart does not succeed in re-establishing the connection between the host and the Manager, the status of the host changes to Non Responsive and, if power management is configured, fencing is handed off to the external fencing agent.
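As a worked example of the interval formula with the default values: a host running ten virtual machines and acting as SPM yields 60 + (0.5 × 10) + (20 × 1) = 85 seconds, while the same host without the SPM role yields 60 + 5 + 0 = 65 seconds. The Manager waits for the longer of this interval and the three status attempts before proceeding to vdsm restart.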
Note
Soft-fencing over SSH can be executed on hosts that have no power management configured. This is distinct from "fencing": fencing can be executed only on hosts that have power management configured.
7.7.6. Using Host Power Management Functions
Summary
When power management has been configured for a host, you can access a number of options from the Administration Portal interface. While each power management device has its own customizable options, they all support the basic options to start, stop, and restart a host.
Procedure 7.24. Using Host Power Management Functions
- Use the Hosts resource tab, tree mode, or the search function to find and select the host in the results list.
- Click the Power Management drop-down menu.
- Select one of the following options:
- Restart: This option stops the host and waits until the host's status changes to Down. When the agent has verified that the host is down, the highly available virtual machines are restarted on another host in the cluster. The agent then restarts this host. When the host is ready for use, its status displays as Up.
- Start: This option starts the host and lets it join a cluster. When it is ready for use, its status displays as Up.
- Stop: This option powers off the host. Before using this option, ensure that the virtual machines running on the host have been migrated to other hosts in the cluster. Otherwise the virtual machines will crash and only the highly available virtual machines will be restarted on another host. When the host has been stopped, its status displays as Non-Operational.
Important
When two fencing agents are defined on a host, they can be used concurrently or sequentially. For concurrent agents, both agents have to respond to the Stop command for the host to be stopped, and when one agent responds to the Start command, the host will go up. For sequential agents, to start or stop a host, the primary agent is used first; if it fails, the secondary agent is used.
- Selecting one of the above options opens a confirmation window. Click OK to confirm and proceed.
Result
The selected action is performed.
7.7.7. Manually Fencing or Isolating a Non Responsive Host
Summary
If a host unpredictably goes into a non-responsive state, for example, due to a hardware failure, it can significantly affect the performance of the environment. If you do not have a power management device, or if it is incorrectly configured, you can reboot the host manually.
Warning
Do not use the Confirm host has been rebooted option unless you have manually rebooted the host. Using this option while the host is still running can lead to virtual machine image corruption.
Procedure 7.25. Manually fencing or isolating a non-responsive host
- On the Hosts tab, select the host. The status must display as non-responsive.
- Manually reboot the host. This could mean physically entering the lab and rebooting the host.
- On the Administration Portal, right-click the host entry and select Confirm 'Host has been Rebooted'.
- A message displays prompting you to ensure that the host has been shut down or rebooted. Select the Approve Operation check box and click OK.
Result
You have manually rebooted your host, allowing highly available virtual machines to be started on active hosts. You confirmed your manual fencing action in the Administration Portal, and the host is back online.
