Support for Red Hat Enterprise Linux Cluster and High Availability Stretch Architectures

Updated 2013-11-29T06:58:50+00:00

Introduction

The purpose of the High Availability Add-On for Red Hat Enterprise Linux (RHEL) 5 and 6 is to provide an environment where critical services can run with minimal downtime in the face of unanticipated failures. Highly available clusters are designed so that if one or more cluster members (cluster nodes) fail, the clustered services can continue to run by relocating to other members of the cluster. If a sufficient number of members fail, the cluster reaches a point where manual intervention is required to resume services. The number of member losses a cluster can tolerate depends on the cluster architecture: a normal cluster loses quorum when 50% or more of the nodes fail, while a cluster with a quorum disk can withstand more failures, although a quorum disk introduces additional complexity. This article explains the requirements for a multi-site cluster utilizing RHEL Server 5 or 6 with the High Availability Add-On.

Environment

  • Red Hat Enterprise Linux Server 5 (with the High Availability and Resilient Storage Add-Ons)
  • Red Hat Enterprise Linux Server 6 (with the High Availability and Resilient Storage Add-Ons)

Multi-Site Clusters

The term multi-site cluster can refer to several types of cluster configurations. The most common are multi-site disaster recovery clusters and stretch clusters. As described below, multi-site clusters are supported with the caveat that the site-to-site failover operation is carried out manually by the cluster administrator. Only certain configurations of stretch clusters can be supported at this time.

Multi-Site Disaster Recovery Clusters

A multi-site cluster established for disaster recovery comprises two completely different clusters. These clusters typically have the same configuration, with one active and the other passive (and sometimes powered off). If the primary site fails, the secondary site is manually activated and takes over all services.

Multi-site clusters are supported since implementation involves two separate clusters with the same configuration/architecture at two physical locations. Shared storage must be replicated from the primary to the back-up site using array-based replication. During a site failover, the cluster administrator must first toggle the directionality of the storage replication so that the back-up site becomes the primary and then start up the back-up cluster. These steps cannot be automated since using heuristics like site-to-site link failure might result in primary/back-up toggling when there are intermittent network failures.
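
The exact failover commands are environment-specific, but a minimal sketch of the manual sequence on RHEL 6 might look like the following. The replication step is a vendor-specific placeholder, and the service names and ordering should be verified for your deployment:

# 1. Confirm out-of-band that the primary site is really down.
# 2. Reverse the replication direction so the back-up copy becomes
#    read/write (vendor-specific command, shown only as a placeholder):
#    <vendor-replication-cli> promote <backup-site-lun>
# 3. Start the cluster stack on the back-up site's nodes:
service cman start
service rgmanager start
# 4. Verify that the clustered services have started:
clustat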

Stretch Clusters

Multi-site or stretch clusters are designed to withstand the loss or failure of all members at a given physical site. This can be a challenge for a number of reasons:

  • A large percentage of cluster members might be lost simultaneously.
  • Loss of connectivity to all members at a given site might be more likely because site-to-site network and storage connectivity is often less redundant, more expensive, and less reliable than single-site connectivity.
  • Some method of multi-site storage replication is required so that clustered services data is still available after site loss.

For the purposes of this document, a stretch cluster is one that comprises a single infrastructure. There is one membership that spans all sites. Membership of the cluster is logically divided into two groups so that cluster services can continue with minimal disruption when an entire group fails or becomes unreachable. If there is shared storage, it is replicated via either hardware or software replication mechanisms so that each group has access to a replica. The groups are typically, but not necessarily, at different physical locations, often with reduced communication inter-connectivity and increased delay compared to a single site.

The notion of different physical sites can seem arbitrary, since nodes may be separated by as little as a single room, by different rooms in the same building or datacenter, or by entirely different buildings. Because of this ambiguity, this document outlines several use cases that fall into the stretch-cluster category and defines the supportability status of each.

Supported Use Cases

This section describes supported use cases for stretch clusters. Only certain configurations of stretch clusters can be supported at this time. All stretch clusters require a formal architecture review from Red Hat Support to ensure that the deployed cluster meets established guidelines.

By following the guidelines below in conjunction with an architecture review by Red Hat Support, you can deploy a supported stretch cluster that matches one of the supported use cases described later in this document. In addition to the specific restrictions and limitations noted below, the guidelines in Red Hat Enterprise Linux Cluster, High Availability, and GFS Deployment Best Practices should also be followed.

Stretch-cluster deployments must have a burn-in/testing period during which the architecture is validated in a non-production environment, but with production loads and under a variety of failure conditions to adequately test the cluster.

All stretch clusters must meet these requirements:

  • Both physical sites must be connected by a network interconnect (for example, a site-to-site fiber link) that provides LAN-like latency of 2 ms round-trip time or less (<=2ms RTT). Higher-latency site-to-site connections are not supported.
  • A stretch cluster can only span a maximum of 2 physical sites (not including a quorum disk located at a third, neutral site).
  • The cluster nodes must be distributed evenly across the two physical sites, with each site containing the same number of nodes.
  • Stretch clusters can have a minimum of 2 and a maximum of 16 cluster nodes, in total, across all physical sites.
  • All cluster nodes must have fencing configured. There are only limited use cases in which fence_scsi is supported as the fencing agent.
  • Both physical sites must be on the same logical network, and routing between the two physical sites is not supported. One of the following supported communication methods must be used for your RHEL release.
  • A quorum disk is required for all stretch clusters that are composed of 4 or more cluster nodes. Using a quorum server as the tie-breaker is not supported on a stretch cluster.
  • GFS, GFS2, clvmd, and cmirror are not supported on stretch clusters.
  • The cluster node IP addresses used for cluster heartbeat traffic should be assigned so that the cluster heartbeat ring crosses between sites only twice. Sequential address numbering is required because corosync uses the IP addresses to determine each node's position in the ring; an invalid configuration causes dramatic performance degradation. Valid and invalid example configurations are shown below, and an illustrative cluster.conf fragment follows the address examples:
VALID Cluster Node IP Address Configuration
site 1 nodeA: x.x.x.1  
site 1 nodeB: x.x.x.2  
site 2 nodeC: x.x.x.3
site 2 nodeD: x.x.x.4


INVALID Cluster Node IP Address Configuration
site 1 nodeA: x.x.x.1  
site 1 nodeB: x.x.x.3  
site 2 nodeC: x.x.x.2
site 2 nodeD: x.x.x.4
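
As a rough illustration of the node-distribution, addressing, and fencing requirements above, a cluster.conf fragment for the valid four-node layout could be structured as follows. The node names, fence-device names, and the port attribute are assumptions for illustration only; the real configuration must be validated during the Red Hat architecture review:

<clusternodes>
  <!-- Site 1: nodeA and nodeB resolve to x.x.x.1 and x.x.x.2 -->
  <clusternode name="nodeA.example.com" nodeid="1" votes="1">
    <fence>
      <method name="1">
        <device name="site1-fencedev" port="nodeA"/>
      </method>
    </fence>
  </clusternode>
  <clusternode name="nodeB.example.com" nodeid="2" votes="1">
    <!-- fence block analogous to nodeA -->
  </clusternode>
  <!-- Site 2: nodeC and nodeD resolve to x.x.x.3 and x.x.x.4 -->
  <clusternode name="nodeC.example.com" nodeid="3" votes="1">
    <!-- fence block analogous to nodeA, using site 2's fence device -->
  </clusternode>
  <clusternode name="nodeD.example.com" nodeid="4" votes="1">
    <!-- fence block analogous to nodeA, using site 2's fence device -->
  </clusternode>
</clusternodes>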

Cluster architectures that deviate significantly from the architectures described in this section are unlikely to be approved for supported production deployments. Red Hat recommends that you keep your architectures as close as possible to the architectures described as supported.

For all supported use cases, a site-to-site link failure or a complete site failure requires human intervention to continue cluster operation, since a site-to-site link failure prevents fencing from working between the sites. Operation can be restored by using fence_ack_manual to acknowledge fencing manually after it has failed due to the lost site-to-site link. The administrator must confirm that the affected site is down before issuing the fence_ack_manual command. The four supported use cases are described below:
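
As a sketch, the manual override is run from a surviving node only after the administrator has verified out-of-band that the fenced site is down. The exact syntax differs between releases, so verify it against fence_ack_manual(8) on your system:

# RHEL 6 syntax:
fence_ack_manual nodeC.example.com
# RHEL 5 syntax (assumed; check the man page on your release):
fence_ack_manual -n nodeC.example.com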

Quorum Preservation

Quorum preservation during a site failure can be handled in one of the following ways for the No Shared Storage, Fully Interconnected SAN with LVM Mirroring, Synchronous/Coherent Array-Based Replication, and Fully Interconnected SAN with Disaster Recovery Mirroring use cases:

  • For two-node clusters, quorum preservation is handled by CMAN's two_node mode:
<cman two_node="1" expected_votes="1"/>
  • Clusters with 4 or more cluster nodes should use an iSCSI-based quorum disk located at a third (neutral) site.
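
For a four-node stretch cluster, the tie-breaker could be expressed in cluster.conf roughly as follows. The label, vote count, and timing values are illustrative assumptions; the correct values for a given environment should be settled during the architecture review:

<!-- Quorum disk on an iSCSI LUN exported from a third (neutral) site.
     With 4 nodes at 1 vote each plus 1 qdisk vote, expected_votes is 5
     and quorum is 3, so a surviving site (2 nodes + qdisk) keeps quorum. -->
<cman expected_votes="5"/>
<quorumd label="stretch-qdisk" votes="1" interval="2" tko="10"/>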

Quorum Disk

If a quorum disk is used and is located at a third (neutral) site, fencing must still be overridden via fence_ack_manual; therefore, relocation of services from the failed site to the back-up site will still require manual intervention.

The use case Fully Interconnected SAN with Disaster Recovery Mirroring may have a quorum disk located on the primary storage array. Since this use case requires a manual restart of the cluster to fail storage over from the primary to the secondary site, the quorum disk can effectively be relocated from the primary to the secondary site in the event of a primary-site failure. This is possible because the quorum disk does not contain any persistent data. The administrator must initialize a quorum disk on the secondary array using mkqdisk, and configure the secondary cluster nodes to use it after a primary site/storage array failure.
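
A rough sketch of that recovery step is shown below; the device path and label are placeholders:

# On a node at the secondary site, after confirming the primary
# site/storage array failure:
mkqdisk -c /dev/mapper/secondary-qdisk-lun -l stretch-qdisk
# Then update the <quorumd> entry in /etc/cluster/cluster.conf to reference
# this device or label before restarting the cluster at the secondary site.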

Modifying the expected_votes for the cluster to be one less than the actual number of votes available will not work. CMAN enforces the rule that the number of votes expected must be at least equal to the number of total nodes. Solutions involving letting both sides win by using an algorithm similar to CMAN two_node mode will not work with more than two cluster nodes since a temporary inter-site link failure would lead to a non-deterministic fencing race once the link is restored.

It is important to understand that a quorum disk is not an arbitration mechanism by itself; that is, availability of a quorum disk is not sufficient for a site to continue operation after the other site has failed. Clusters still require fencing even when a quorum disk is used. The failure of an inter-site link or a total site failure will cause fencing to fail, which is why manual intervention is often required to restore the cluster to operation after a site failure.

Red Hat does not support any of the following quorum-disk replication methods:

  • Using LVM mirroring to replicate the quorum disk
  • Using mdraid to replicate the quorum disk (possibility of race conditions because writing to the quorum disk is not an atomic action if storage-based replication is used)
  • Using Disaster Recovery Mirroring (both sites need write access)

SAN Shared Storage

The SAN must be configured so that either site can maintain connectivity to a storage replica when the other site, or the site-to-site connectivity, fails.

If array-based replication is used, there is additional complexity. Generally, one SAN (or LUN within a SAN) will be primary and will be replicated to the back-up. The primary will be read/write accessible, and the back-up will be read-only. Because resource agents that fail over the direction of array replication are not available, array-based replication may only be used as described in the Fully Interconnected SAN with Disaster Recovery Mirroring use case below.

Storage must be replicated between the two sites. Data replication can be handled by vendor-specific, array-based hardware replication or by operating-system-level replication (non-clustered LVM mirroring). Non-clustered (tagged) LVM mirroring can be used for HA-LVM configurations; if this type of replication is used, each storage array must be fully interconnected to all cluster nodes at each site.
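
As an illustration of the tagged (non-clustered) HA-LVM approach, each node's /etc/lvm/lvm.conf restricts activation to its root volume group and to volume groups tagged with its own hostname. The volume-group name below is an assumption:

# /etc/lvm/lvm.conf on nodeA (HA-LVM tagging approach)
# Only the root VG and VGs tagged with this node's hostname may be activated.
volume_list = [ "vg_root", "@nodeA.example.com" ]
# Rebuild the initramfs afterwards so the setting takes effect at boot
# (RHEL 6 example):
#   dracut -H -f /boot/initramfs-$(uname -r).img $(uname -r)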

There are currently no resource agents provided in Red Hat Enterprise Linux that elect one site over another as the master for a given LUN. Therefore, the direction of storage replication must be toggled manually when a site is promoted from back-up to primary. Since site failure requires fence_ack_manual and administrator intervention regardless, this step should be added to the disaster recovery plan.

Fully Interconnected SAN with LVM Mirroring

There are two storage arrays, one at each physical site, with full SAN connectivity between all arrays and all cluster nodes at each site.

LVM mirroring is used to synchronously replicate storage between the two arrays. If the mirror is broken due to site or array failure, mirror recovery must be done manually, as the cluster software does not handle reconstituting a broken mirror. Mirror recovery must be performed while the mirror is not being accessed by a live service. This requires planned downtime as services must be shut down.
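
A minimal sketch of creating such a mirror, with one leg on each site's array, is shown below. The device paths, size, and names are placeholders, and the choice of mirror-log placement is a design decision to agree on during the architecture review:

# One PV from each site's array in a single volume group:
pvcreate /dev/mapper/site1_lun /dev/mapper/site2_lun
vgcreate stretch_vg /dev/mapper/site1_lun /dev/mapper/site2_lun
# Mirrored LV with one leg per site; a core (in-memory) mirror log avoids a
# third log device at the cost of a full resync when the LV is activated:
lvcreate -m 1 --mirrorlog core -L 100G -n ha_lv stretch_vg \
    /dev/mapper/site1_lun /dev/mapper/site2_lun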

For more information about LVM mirroring, see the following article: Red Hat Enterprise Linux Cluster, High Availability, and GFS Deployment Best Practices.

Synchronous or Coherent Array-Based Replication

There are two storage arrays, one at each physical site, with the nodes at the primary site using the primary-site storage array and the nodes at the secondary site using the secondary-site storage array. Both arrays are used by the cluster simultaneously. The arrays implement the site-to-site data replication; replication is synchronous, and data coherency is maintained between the two sites. In this configuration, the storage-array replication is completely transparent to the RHEL HA cluster and does not need any special support from the cluster software. As long as the two storage arrays present a consistent and coherent view of storage to the nodes at each site, this configuration is supportable.

Examples of storage technologies that would be supported include NetApp MetroCluster. Note, however, that the fencing agent fence_scsi is unsupported on NetApp MetroCluster.

Fully Interconnected SAN with Disaster Recovery Mirroring

In this configuration there are two storage arrays, one at each physical site, with full SAN connectivity between all arrays and all nodes at each site. However, only one array is used by the cluster at a time (the active array), and the other array is used for replication and site-failover purposes (the passive array). The cluster nodes at the secondary site may be active and accessing the one active array. Array-based replication can be used to keep the active and passive arrays in sync. Below is a list of supported array-based replication methods (there may be others that work just as well):

  • EMC SRDF (MirrorView)
  • Hitachi TrueCopy

If the passive array located in the secondary site fails, no specific action needs to be taken by the clustering software (although some repair action is required in order to restore the passive array). If the secondary site fails, services running on nodes in the secondary site can be relocated to the primary site after fencing is confirmed (this might require manual override of fencing). If the primary site or storage array fails, cluster services halt and the cluster must be manually stopped and reconfigured at the secondary site to use the secondary storage array.

Storage Asynchronous or Active/Passive Array-Based Replication

In this unsupported configuration, there are two storage arrays, one at each physical site, with the primary site using the primary-site storage array and the secondary site using the secondary-site storage array. Both arrays are used by the cluster simultaneously. The array implements the site-to-site data replication (possibly using EMC SRDF or MirrorView, Hitachi TrueCopy, etc.).

This configuration is not supported directly by Red Hat because there is currently no integration between the array-replication mechanisms and the cluster-infrastructure software. There are third-party software packages that integrate with the RHEL HA stack to provide array-based replication for stretch clusters. These third-party solutions are supported by their respective software/hardware vendors. Please see the following for more information:

Third Party Software

Software provided by Red Hat's partners (and other third parties) that integrates with Red Hat's cluster software, including resource agents, fencing agents, application scripts, and event scripts, is not supported by Red Hat directly. Please contact your vendor for support information regarding these technologies. In some cases, we will work with our partner to triage the issue. Notable examples are: