
Support for Red Hat Enterprise Linux High Availability Cluster Stretch and Multi-Site Architectures

Updated 2015-10-06T19:54:22+00:00


The purpose of the High Availability Add-On for Red Hat Enterprise Linux (RHEL) 5, 6, and 7 is to provide an environment where critical services can run with minimal downtime in the face of unanticipated failures. Highly available clusters are designed so that if one or more cluster members (cluster nodes) fail, the clustered services can continue to run by relocating to other members of the cluster. If a sufficient number of members fail, the cluster reaches a point where manual intervention is required to resume services. The number of member losses a cluster can tolerate varies with the cluster architecture: a standard cluster loses quorum when 50% or more of its nodes fail, while a cluster with a quorum disk can withstand more failures at the cost of additional complexity. This article explains the requirements for a multi-site cluster utilizing Red Hat Enterprise Linux (RHEL) Server 5, 6, or 7 with the High Availability Add-On.


Environment

  • Red Hat Enterprise Linux (RHEL) 5, 6, and 7 with the High Availability or Resilient Storage Add-On

Types of Multi-Site Clusters

Multi-Site Clusters

The term multi-site cluster can refer to several types of cluster configurations. The most common are multi-site disaster recovery clusters and stretch clusters. As described below, multi-site clusters are supported with the caveat that the site-to-site failover operation is carried out manually by the cluster administrator. Only certain configurations of stretch clusters can be supported at this time.

Multi-Site Disaster Recovery Clusters

A multi-site cluster established for disaster recovery consists of two completely separate clusters. These clusters typically have the same configuration, with one active and the other passive (and sometimes powered off). If the primary site fails, the secondary site is manually activated and takes over all services.

Multi-site clusters are generally supported without any special considerations, since the implementation involves two separate clusters with the same configuration/architecture at two physical locations. Shared storage must be replicated from the primary site to the back-up site using array-based replication. During a site failover, the cluster administrator must first reverse the direction of the storage replication so that the back-up site becomes the primary and then start up the back-up cluster. These steps cannot be safely automated: relying on heuristics such as site-to-site link failure could cause spurious primary/back-up toggling during intermittent network failures.

Stretch Clusters

Multi-site or stretch clusters are designed to withstand the loss or failure of all members at a given physical site. This can be a challenge for a number of reasons:

  • A large percentage of cluster members might be lost simultaneously.
  • Loss of connectivity to all members at a given site might be more likely because site-to-site network and storage connectivity is often less redundant, more expensive, and less reliable than single-site connectivity.
  • Some method of multi-site storage replication is required so that clustered services data is still available after site loss.

For the purposes of this document, a stretch cluster is one that comprises a single infrastructure and membership spanning all sites. Membership of the cluster is logically divided into two groups so that cluster services can continue with minimal disruption when an entire group fails or becomes unreachable. If there is shared storage, it is replicated via either hardware or software replication mechanisms so that each group has access to a replica. The groups are typically, but not necessarily, at different physical locations, often with reduced communication inter-connectivity and increased delay compared to a single site.

The following are some examples of what qualifies as a stretch cluster:

  • Multiple connected physical chassis where no chassis has a majority of the cluster nodes.
  • Cluster members that are located in the same room or data center but are not all connected to the same switch within one hop.
  • Cluster members that are located in different physical sites connected by a physical site-to-site link.

The limitations, requirements, and guidelines listed in the remainder of this document generally apply to stretch clusters.

Support Requirements

Limitations and Requirements

Only certain configurations of stretch clusters can be supported by Red Hat. For a deployment to be officially supported, all stretch clusters must be approved through a formal architectural review by Red Hat Global Support Services, which ensures that the deployed cluster meets established guidelines.

In addition to the specific restrictions and limitations noted below, the guidelines in Red Hat Enterprise Linux Cluster, High Availability, and GFS Deployment Best Practices should also be reviewed and incorporated into the design of the cluster.

Stretch-cluster deployments should have a burn-in/testing period during which the architecture is validated in a non-production environment, but with production loads and under a variety of failure conditions, to confirm that the configuration behaves as expected and that the cluster meets the requirements of the deployment.

All stretch clusters must meet these requirements:

  • Both physical sites must be connected by a network interconnect (for example, a site-to-site fiber interconnect) that provides LAN-like latency of less than or equal to 2ms (<=2ms RTT). Higher-latency site-to-site connections are not supported. For measuring latency, see the following article: How can I determine the latency of my Multi-site cluster? (a simple latency check is also sketched after this list).

  • A stretch cluster can only span a maximum of 2 physical sites (not including any quorum device configured from a third neutral site).

  • The cluster nodes must be distributed evenly across the two physical sites, with each site containing an equal number of cluster nodes.

  • Stretch clusters can have a minimum of 2 and a maximum of 16 cluster nodes, in total, across all physical sites.

  • All cluster nodes must have fencing configured.

    • NOTE: Usage of fence_scsi is unsupported in RHEL 5 and has special requirements in RHEL 6 and 7.
  • Both physical sites must be on the same logical network, and routing between the two physical sites is not supported.

  • Using a quorum server (an odd-numbered extra node) as the tie-breaker is not supported in stretch clusters.

  • RHEL 5 and 6 Only: A quorum device is required for all stretch clusters composed of 4 or more cluster nodes.

  • RHEL 7 Only: auto_tie_breaker should be enabled in the corosync configuration for all stretch clusters.

  • GFS, GFS2, clvmd, cmirror are not supported in stretch clusters.

  • The cluster node IP addresses used for cluster heartbeat traffic should be organized so that the cluster heartbeat ring crosses from site to site only twice. Sequential address numbering is required because corosync uses the IP addresses to determine each cluster node's position in the ring; an invalid configuration would cause dramatic performance degradation. Example configurations that are valid and invalid are shown below:

VALID Cluster Node IP Address Configuration
site 1 nodeA: x.x.x.1  
site 1 nodeB: x.x.x.2  
site 2 nodeC: x.x.x.3
site 2 nodeD: x.x.x.4

INVALID Cluster Node IP Address Configuration
site 1 nodeA: x.x.x.1  
site 1 nodeB: x.x.x.3  
site 2 nodeC: x.x.x.2
site 2 nodeD: x.x.x.4
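
As referenced in the latency requirement earlier in this list, a quick sanity check of the site-to-site round-trip time can be performed by running ping from a node at one site to a node at the other. The hostname below is hypothetical, and the article linked above describes the formal measurement procedure:

ping -c 100 nodeC.site2.example.com
# run from a node at site 1; the rtt avg/max values in the summary must remain <= 2 ms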

NOTE: Cluster architectures that deviate significantly from those described in this section are unlikely to be approved for supported production deployments. Red Hat recommends keeping your architecture as close as possible to the architectures described as supported.

Handling Site Failures or Site-to-Site Link Failures

For all supported use cases, a site-to-site link failure or a complete site failure may require human intervention to continue cluster operation, because a site-to-site link failure will often prevent fencing from working between the sites unless special measures are taken to allow fencing to succeed in such an architecture. Recovery from such failures can be achieved by using fence_ack_manual to restore operation after fencing has failed due to the loss of the site-to-site link. Before issuing the fence_ack_manual command, the administrator must confirm that the nodes being manually fenced are completely shut down and not using shared resources.
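
As a rough sketch only (this applies to the cman-based stack in RHEL 5 and 6, the node name is hypothetical, and the exact syntax varies between releases, so check the fence_ack_manual man page on your system), manually acknowledging a failed fence operation against a node that has been verified to be powered off might look like this:

fence_ack_manual node2.site2.example.com      # RHEL 6 style: pass the cluster node name directly
fence_ack_manual -n node2.site2.example.com   # RHEL 5 style: the node name is given with -n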

In some environments, fence_scsi may be used as a backup or even primary fencing method, allowing fencing to complete when the sites have split or when one site cannot reach the other's fence devices. However, limitations exist around using fence_scsi on RHEL 5 or with certain storage-replication mechanisms, as described in the limitations above. If independent, redundant links can be provided between the sites, fencing is less likely to fail as a result of a single network link going down, which may reduce the need for manual intervention. An entire site failure, however, may still necessitate such intervention.

Supported Storage Architectures

There are four supported use cases, which are described below:

Quorum Preservation

Because stretch clusters must have their membership split evenly across sites, the failure of a site or of the site-to-site link would cause a loss of quorum at both sites unless special measures are taken to work around this. The available methods vary depending on the RHEL release and the size of the cluster.

RHEL 5 or 6

For two-node clusters, quorum preservation is handled by two_node mode in /etc/cluster/cluster.conf:

<cman two_node="1" expected_votes="1"/>

Clusters with 4 or more cluster nodes must use an iSCSI-based quorum disk located at a third (neutral) site.
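
As a minimal sketch only (the label, vote counts, and timing values are assumptions and must be tuned for the specific deployment during the architectural review), a four-node stretch cluster is commonly given a quorum disk with one less vote than the number of nodes, so that a single surviving site plus the quorum disk retains quorum. In /etc/cluster/cluster.conf this might look like:

<cman expected_votes="7"/>
<!-- hypothetical iSCSI-backed quorum disk hosted at the third (neutral) site:
     4 node votes + 3 qdisk votes = 7 expected votes; 2 surviving nodes + qdisk = 5 votes keeps quorum -->
<quorumd label="stretch_qdisk" votes="3" interval="2" tko="10"/>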


RHEL 7

On RHEL 7, corosync offers several options for maintaining quorum during splits or in other situations where quorum would otherwise be lost.
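
For example, a minimal quorum stanza in /etc/corosync/corosync.conf might enable auto_tie_breaker (required for stretch clusters, as noted in the requirements above), optionally together with wait_for_all; treat this as a sketch and settle the exact option set during the architectural review:

quorum {
    provider: corosync_votequorum
    # in an even split, the partition containing the node with the lowest node ID keeps quorum
    auto_tie_breaker: 1
    # a freshly started partition does not become quorate until it has seen all nodes at least once
    wait_for_all: 1
}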

Using a Quorum Disk (RHEL 5 and 6 only)

If a quorum disk is used and is located at a third (neutral) site, fencing must still be overridden via fence_ack_manual in the event of a site split or site failure where fencing is failing.

The use case Fully Interconnected SAN with Disaster Recovery Mirroring may have a quorum disk located on the primary storage array. Since this use case requires a manual restart of the cluster to fail storage over from the primary to the secondary site, the quorum disk can effectively be relocated from the primary to the secondary site in the event of a primary-site failure. This is possible because the quorum disk does not contain any persistent data. The administrator must make sure to initialize a quorum disk on the secondary array using mkqdisk (as sketched below), and to configure the secondary cluster nodes to use it after a primary site or storage-array failure.
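
A minimal sketch of initializing the replacement quorum disk on the secondary array follows; the multipath device path is hypothetical, and the label (stretch_qdisk here) must match the quorumd label configured in cluster.conf:

mkqdisk -c /dev/mapper/secondary_qdisk_lun -l stretch_qdisk
# writes a new quorum-disk header and label onto the secondary array's LUN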

Modifying the expected_votes for the cluster to be one less than the actual number of votes available will not work: CMAN enforces the rule that the number of expected votes must be at least equal to the total number of nodes. Solutions that let both sides win by using an algorithm similar to CMAN's two_node mode will not work with more than two cluster nodes, since a temporary inter-site link failure would lead to a non-deterministic fencing race once the link is restored.

It is important to understand that a quorum disk is not an arbitration mechanism by itself; that is, availability of a quorum disk is not sufficient for a site to continue operation after the other site has failed. Clusters still require fencing even when a quorum disk is used. The failure of an inter-site link or a total site failure will cause fencing to fail, which is why manual intervention is often required to restore the cluster to operation after a site failure.

Red Hat does not support any of the following quorum-disk replication methods:

  • Using LVM mirroring to replicate the quorum disk.
  • Using mdraid to replicate the quorum disk (possibility of race conditions because writing to the quorum disk is not an atomic action if storage-based replication is used).
  • Using Disaster Recovery Mirroring (both sites need write access).

NOTE: Although using a quorum disk on Synchronous or Coherent Array-Based Replication shared storage may work in some scenarios, this type of configuration is untested by Red Hat. It is recommended that testing be carried out for the different failure scenarios and that a procedure be in place for bringing the cluster back up or failing over safely.

For more information on quorum-disk integration with a stretch cluster, review the following document: Is a quorum disk supported or required on a Red Hat High Availability Cluster stretch cluster?

Shared Storage

If the cluster is using shared storage, it must be configured so that either site can maintain connectivity to a storage replica when the other site, or the site-to-site connectivity, fails.

When a stretch cluster is using multiple storage arrays for redundancy, some form of data replication between them must be used. Data replication between the two sites can be handled by vendor-specific replication mechanisms or by operating-system–level data replication (LVM mirroring). Non-clustered (tagged) LVM mirroring can be used for HA-LVM configurations. For configurations using operating-system based replication such as LVM mirroring, each storage array must be fully interconnected to all nodes in the cluster throughout all sites.
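
As an illustrative sketch only (the volume group, logical volume, size, and multipath device names are hypothetical), a synchronously mirrored logical volume for an HA-LVM service can be created with one mirror leg on each site's array and a small disk-based mirror log:

# one physical volume from each site's array, plus a small LUN for the mirror log
lvcreate -m 1 --mirrorlog disk -L 100G -n ha_lv stretch_vg \
    /dev/mapper/siteA_lun /dev/mapper/siteB_lun /dev/mapper/mirrorlog_lun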

If array-based replication is used, there is additional complexity. Generally, for each LUN or individual logical device, one storage target is designated as the primary and is replicated automatically by the storage array to the back-up target. The primary target is read/write accessible while the back-up is read-only, and the replication direction can be reversed in the event of a failure of the primary. Array-based replication may only be used as described in the Fully Interconnected SAN with Disaster Recovery Mirroring use case.

There are currently no resource agents provided in Red Hat Enterprise Linux that elect one site over another as the master for a given logical device. Therefore, toggling the storage replication when switching a site from back-up to primary must be done manually. Since a site failure requires fence_ack_manual and administrator intervention regardless, this step should be added to the disaster recovery plan.

Fully Interconnected SAN with LVM Mirroring

There are two storage arrays, one at each physical site, with full SAN connectivity between all arrays and all cluster nodes at each site.

LVM mirroring is used to synchronously replicate storage between the two arrays. If the mirror is broken due to a site or array failure, mirror recovery must be performed manually, as the cluster software does not handle reconstituting a broken mirror. Mirror recovery must be performed while the mirror is not being accessed by a live service; this requires planned downtime, as services must be disabled or frozen.
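
A rough sketch of such a manual recovery under rgmanager, reusing the hypothetical names from the previous example and assuming a clustered service named ha_service; the exact repair procedure depends on the nature of the failure and the LVM version, so consult the LVM documentation before acting:

clusvcadm -d ha_service                 # disable the clustered service so the mirror is not in use
lvconvert --repair stretch_vg/ha_lv     # repair the mirror, rebuilding the failed leg on available physical volumes
clusvcadm -e ha_service                 # re-enable the service once the mirror is consistent again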

For more information about LVM mirroring, see the following article: Red Hat Enterprise Linux Cluster, High Availability, and GFS Deployment Best Practices.

Synchronous or Coherent Array-Based Replication

There are two storage arrays, one at each physical site, with the primary site using the primary-site storage array and the secondary site using the secondary-site storage array. Both arrays are used by the cluster simultaneously. The array implements the site-to-site data replication; the replication is synchronous, and data coherency is maintained between the two sites. In this configuration, the storage-array replication is completely transparent to the RHEL HA cluster and does not require any special support from the cluster software. As long as the two storage arrays present a consistent and coherent view of storage to the nodes at each site, this configuration is supportable.

Examples of storage technologies that would be supported include NetApp MetroCluster. Note that the fencing agent fence_scsi is unsupported on NetApp MetroCluster.

Fully Interconnected SAN with Disaster Recovery Mirroring

In this configuration there are two storage arrays, one at each physical site, with full SAN connectivity between all arrays and all nodes at each site. However, only one array is used by the cluster at a time (the active array); the other array is used for replication and site-failover purposes (the passive array). The cluster nodes at the secondary site may be active and accessing the single active array. Array-based replication can be used to keep the active and passive arrays in sync. Below is a list of supported array-based replication methods (there may be others that work just as well):

  • EMC SRDF
  • EMC MirrorView
  • Hitachi TrueCopy

If the passive array located in the secondary site fails, no specific action needs to be taken by the clustering software (although some repair action is required in order to restore the passive array). If the secondary site fails, services running on nodes in the secondary site can be relocated to the primary site after fencing is confirmed (this might require manual override of fencing). If the primary site or storage array fails, cluster services halt and the cluster must be manually stopped and reconfigured at the secondary site to use the secondary storage array.

Asynchronous or Active/Passive Array-Based Replication

In this unsupported configuration, there are two storage arrays, one at each physical site, with the primary site using the primary-site storage array and the secondary site using the secondary-site storage array. Both arrays are used by the cluster simultaneously. The array implements the site-to-site data replication (possibly using EMC SRDF or MirrorView, Hitachi TrueCopy, etc.).

This configuration is not supported directly by Red Hat because there is currently no integration between the array-replication mechanisms and the cluster-infrastructure software. There are third-party software packages that integrate with the RHEL HA stack to provide array-based replication for stretch clusters. These third-party solutions would be supported by their respective software/hardware vendors. Please see the following for more information on one such solution:

Third Party Software

Software provided by Red Hat's partners (and other third parties) that integrates with Red Hat's cluster software, including resource agents, fencing agents, application scripts, and event scripts, is not supported by Red Hat directly. Please contact your vendor for support information regarding these technologies. In some cases, we will work with our partner to triage the issue. Notable examples are: