This document describes recommended practices for deploying and upgrading Red Hat Enterprise Linux clusters using the High Availability Add-On and Red Hat Global File System (GFS/GFS2).
As clusters can be used in a wide range of both supported and unsupported scenarios to meet different business needs, this document describes certain exceptions and/or conditions that warrant a review by Red Hat to determine whether or not a cluster is supportable. An exception does not necessarily mean that a cluster is not supportable by Red Hat: while some exceptions result in unsupportable scenarios, others just need careful review and analysis before proceeding.
To obtain an architecture review, follow the instructions provided in What information is required for a Architecture Review of Red Hat Enterprise Linux High Availability and Resilient Storage?.
- Red Hat Enterprise Linux 5+ with the High Availability Add On or Resilient Storage Add On
- Red Hat Enterprise Linux 6+ with the High Availability Add On or Resilient Storage Add On
Additional Support Links
These additional articles describe some of the concepts below in more detail. Refer to them as needed:
Architecture Review Process Details:
- What information is required for a Architecture Review of Red Hat Enterprise Linux High Availability and Resilient Storage?
General Architecture Support Articles:
- Support for Red Hat Enterprise Linux Cluster and High Availability Stretch Architectures
- Single-Node Support in Red Hat Enterprise Linux 5 and Red Hat Enterprise Linux 6
- Support for Broadcast Mode in Red Hat Enterprise Linux Clustering and High Availability Environments
- Is unicast supported in the High Availability Add On or Clustering for Red Hat Enterprise Linux?
Hardware Support Information:
- Fence Support Matrix for Red Hat Enterprise Linux High Availability/Clustering and Red Hat Enterprise Virtualization
- Fence Device and Agent Information for Red Hat Enterprise Linux
Third Party Hardware/Software Integration:
- Support for an EMC PowerPath-Managed LUN Used as a Quorum Device on Red Hat Enterprise Linux
- Does RHEL-6 support DRBD?
- Configuring GFS2 for optimal performance with SAS Grid Manager in RHEL
- Can Oracle Database 10g and 11g be managed by a Red Hat High Availability cluster?
Deployment Recommended Practices
This section describes recommended practices for cluster deployment using the Red Hat High Availability and Resilient Storage Add-Ons.
Choosing a Cluster Architecture
Red Hat Clustering is designed to run at a single physical site where latencies are expected to be LAN-like (under 2 ms). This remains the recommended and most rigorously tested configuration for running the Red Hat High Availability and Resilient Storage Add-Ons.
Multi-site or disaster-tolerant clusters are separate clusters that run at different physical sites, typically using SAN-based storage replication to replicate data. Multi-site clusters are usually used in an active/passive manner for disaster recovery with manual failover of the active cluster to the passive cluster.
Stretch clusters are single-cluster configurations that span multiple physical sites.
Additional details on the supportability of multi-site and stretch clusters can be found in Support for Red Hat Enterprise Linux Cluster and High Availability Stretch Architectures. Note that all stretch clusters require an architecture review.
Selecting Cluster Node Hardware
The maximum number of cluster nodes supported by the High Availability Add-On is 16, and the same limit applies to the Resilient Storage Add-On, which includes GFS2 and CLVM. For more information, see the following article. However, most Red Hat clustering customers use node counts well below the maximum. In general, if your cluster requires more than eight nodes, verify your cluster architecture with Red Hat before deployment to confirm that it is supportable. Red Hat's clustering solution is primarily designed to provide high availability and cold application failover; it is not meant for high-performance or load-sharing clusters.
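The sizing guidance above can be expressed as a small planning check. This is a minimal sketch: the function name and the returned labels are hypothetical, and only the thresholds (16 nodes supported at most, architecture review advised above eight) come from this article.

```python
# Hypothetical planning helper; thresholds are taken from this article.
MAX_SUPPORTED_NODES = 16   # High Availability / Resilient Storage limit
REVIEW_ADVISED_ABOVE = 8   # verify the architecture with Red Hat above this


def node_count_advice(nodes: int) -> str:
    """Classify a planned cluster size against the article's guidance."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    if nodes > MAX_SUPPORTED_NODES:
        return "unsupported"
    if nodes > REVIEW_ADVISED_ABOVE:
        return "review advised"
    return "ok"


for n in (2, 8, 9, 16, 17):
    print(n, node_count_advice(n))
```

A nine-node design is still within the supported limit, but it falls in the range where confirming the architecture with Red Hat before deployment is advisable.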
Homogeneous hardware configurations (nodes with similar specifications in terms of CPU sockets, cores, memory, etc.) are recommended. If you require a heterogeneous cluster with nodes that have diverse hardware configurations (for example, if Node1 has four processors with 8 GB of RAM while Node2 has two processors with 2 GB of RAM), Red Hat recommends that you create an initial proof-of-concept cluster to ensure that you encounter no surprises during production. You may also submit your cluster architecture for review by Red Hat prior to deployment.
NOTE: Certain commercial hardware variants (servers and network switches) have known issues with multicast, which means that usage of GFS2 with a high volume of POSIX (fcntl) lock activity can result in performance and stability issues.
Selecting and Configuring Cluster Storage
There are no recommended practices for selecting cluster storage hardware.
When using LVM in a cluster:
- When using HA-LVM, if a synchronization is required (for example, after adding a leg to a mirror or restoring a failed mirror leg), it is recommended to disable the service utilizing that mirror. If this is not possible, an administrator may freeze (clusvcadm -Z) the service in place to prevent failover while the mirror is rebuilding. Once the mirror synchronization has completed, administrators must unfreeze (clusvcadm -U) the service.
- There are known issues if a service relocation occurs while a mirror leg is synchronizing, notably bug #692186.
- If utilizing LVM mirroring and synchronization of a mirror is required while an application is utilizing the mirror, it is recommended practice to test this process on a non-production cluster with realistic application load to ensure that the application is not adversely impacted by the I/O performance drop caused by the mirror synchronization.
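The freeze, wait, unfreeze procedure for HA-LVM mirror resynchronization can be sketched as a small Python wrapper. It assumes that `clusvcadm -Z` and `clusvcadm -U` freeze and unfreeze the named service as described above, and that `lvs --noheadings -o copy_percent` prints a plain number (C locale) for a mirrored logical volume. The function names, the service name and the LV path are hypothetical; the command runner is injectable so the logic can be exercised without a real cluster.

```python
import subprocess
import time
from typing import Callable, Sequence

# A runner takes an argv list and returns the command's stdout as text.
Runner = Callable[[Sequence[str]], str]


def run_cmd(argv: Sequence[str]) -> str:
    """Run a command, raise CalledProcessError on failure, return stdout."""
    result = subprocess.run(list(argv), check=True, capture_output=True, text=True)
    return result.stdout


def mirror_in_sync(lv_path: str, run: Runner = run_cmd) -> bool:
    """Return True once lvs reports the mirror as 100% synchronized."""
    out = run(["lvs", "--noheadings", "-o", "copy_percent", lv_path]).strip()
    if not out:
        # An empty field usually means the LV is not a mirror at all.
        raise ValueError(f"{lv_path}: lvs reported no copy_percent")
    return float(out) >= 100.0


def resync_with_frozen_service(service: str, lv_path: str,
                               run: Runner = run_cmd,
                               poll_seconds: float = 30.0,
                               sleep: Callable[[float], None] = time.sleep) -> None:
    """Freeze a service, wait for its mirror to resync, then unfreeze it."""
    run(["clusvcadm", "-Z", service])   # freeze: no failover or relocation
    while not mirror_in_sync(lv_path, run):
        sleep(poll_seconds)              # mirror is still rebuilding
    # Unfreeze only after a confirmed full sync; if anything above raises,
    # the service stays frozen so an administrator can inspect it first.
    run(["clusvcadm", "-U", service])
```

Leaving the service frozen on error is a deliberate choice here: unfreezing during an incomplete resynchronization would reopen the relocation window that the article warns about.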
Setting Up Cluster Networks
Red Hat strongly recommends separating the private cluster heartbeat network from the application network wherever possible.
Red Hat supports the use of a redundant ring, which provides additional resilience when the cluster heartbeat network fails, only on RHEL 6 Update 4 and later and on RHEL 7.
Red Hat recommends using NIC bonding so that a single link failure cannot prevent communication between nodes. However, pay special attention to the configured bonding mode, because the supported modes differ between RHEL releases.
Network infrastructure used for the cluster heartbeat network must be switched and should be configured to eliminate single points of failure that could lead to split-brain situations (for example, it would not be recommended to have four nodes on Switch A and four nodes on Switch B with only a single interconnect between Switch A and Switch B). In addition, Red Hat does not support the use of crossover cables for the cluster heartbeat network between nodes in a two-node cluster.
Red Hat Clustering in Red Hat Enterprise Linux 5 and the High Availability Add-On in Red Hat Enterprise Linux 6 use multicast for cluster membership. If multicast cannot be enabled in your production network, broadcast may be considered as an alternative in RHEL 5.6+. In RHEL 6, contact Red Hat to discuss alternative networking options (UDP unicast is supported in RHEL 6.2 and later) before proceeding with deployment.
Using Quorum Disk and Tweaking Cluster Membership Timers
The use of qdiskd with Red Hat Clustering is optional in most cases. The exception is a configuration with a two-node cluster where the fence devices are on a separate network from the cluster heartbeat network, requiring a quorum disk to prevent split-brain situations from causing a fence race.
The use of qdiskd for clusters with more than four nodes is not recommended, as it adds complexity with very little benefit: in a cluster of more than four nodes, it is highly unlikely that more than 50% of the nodes will fail at the same time.
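The reasoning about qdiskd can be made concrete with simple-majority quorum arithmetic. This sketch assumes one vote per node and a plain majority rule (quorum = floor(votes / 2) + 1); it does not model cman's special two-node setting or any qdiskd votes. The function names are hypothetical, and `qdisk_advice` only encodes the two rules stated in this section.

```python
def majority_quorum(expected_votes: int) -> int:
    """Votes needed for quorum under a simple-majority rule."""
    return expected_votes // 2 + 1


def tolerated_node_failures(nodes: int, votes_per_node: int = 1) -> int:
    """Nodes that can fail simultaneously while the survivors stay quorate."""
    total = nodes * votes_per_node
    return (total - majority_quorum(total)) // votes_per_node


def qdisk_advice(nodes: int, fencing_on_heartbeat_network: bool) -> str:
    """Encode this section's qdiskd guidance (hypothetical helper)."""
    if nodes == 2 and not fencing_on_heartbeat_network:
        return "required"          # prevents a fence race after a heartbeat split
    if nodes > 4:
        return "not recommended"   # extra complexity, very little benefit
    return "optional"


for n in (3, 4, 5, 8, 16):
    print(f"{n} nodes: quorum={majority_quorum(n)}, "
          f"survives {tolerated_node_failures(n)} simultaneous failures")
```

For a five-node cluster, quorum is 3 votes, so quorum is lost only if three nodes (more than half) fail together; that is the scenario a quorum disk would help with, and it becomes less likely as the node count grows.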
Red Hat does not recommend changing the default timer values associated with cluster membership as changing them could have a cascading impact on other timers and overall cluster behavior. If you need to tweak any of the cluster membership timer values (for example, token and consensus timeout), you must obtain Red Hat's approval before deploying your cluster for production.
Using Highly Available Resources
Red Hat Clustering works best with resources in an active/passive mode, requiring a cold failover or a restart if the current active application instance fails. Use of Red Hat Clustering or the Red Hat High Availability Add-On is not advisable for applications running in an active/active load-sharing mode.
Supported file-system resources are as follows:
- Red Hat Enterprise Linux 5: ext3, XFS (RHEL 5.7+), GFS and GFS2
- Red Hat Enterprise Linux 6: ext3, ext4, XFS (RHEL 6.2+), GFS2
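The supported file-system matrix above can be encoded as a lookup for use in provisioning checks. This is a sketch that transcribes only the two bullets above; the function name is hypothetical, and releases other than RHEL 5 and 6 are deliberately rejected rather than guessed.

```python
from typing import Set


def supported_fs(major: int, minor: int) -> Set[str]:
    """Cluster file-system resources listed as supported for a RHEL release."""
    if major == 5:
        fs = {"ext3", "gfs", "gfs2"}
        if minor >= 7:
            fs.add("xfs")   # XFS is listed from RHEL 5.7
    elif major == 6:
        fs = {"ext3", "ext4", "gfs2"}
        if minor >= 2:
            fs.add("xfs")   # XFS is listed from RHEL 6.2
    else:
        raise ValueError(f"RHEL {major} is outside the scope of this table")
    return fs


print(sorted(supported_fs(6, 4)))
```

Note that `"gfs" in supported_fs(6, 4)` is `False`: only GFS2 is listed for RHEL 6.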
Contact Red Hat to obtain approval if your cluster requires any of the following types of resources:
- NFS on top of GFS/GFS2
- Active/active (load balancing) resource configurations
- Usage of third-party software resource agents that ship with Red Hat Enterprise Linux (SAP, Sybase ASE, Oracle 10g/11g)
- Usage of DRBD (Please see article Does RHEL-6 support DRBD?)
- Custom resource agents that are not shipped with Red Hat Enterprise Linux
- NOTE: Custom resource agents are allowed, but only the resource agents shipped with Red Hat Enterprise Linux are fully supported by Red Hat.
Using GFS and GFS2
Please refer to the following article for detailed GFS and GFS2 recommended practices: How to Improve GFS/GFS2 File System Performance and Prevent Processes from Hanging
The following features/scenarios will result in an unsupported cluster deployment.
Items marked for Technology Preview indicate that while the item is presently unsupported, Red Hat is working to fully support the feature in a future release of Red Hat Enterprise Linux.
Each of these items applies to both Red Hat Enterprise Linux 5 and Red Hat Enterprise Linux 6 deployments, except where specifically noted:
- Overall architecture
- Oracle RAC on GFS2 is unsupported
- Staged/rolling upgrades between major releases are not supported (for example, a rolling upgrade from Red Hat Enterprise Linux 5 to Red Hat Enterprise Linux 6).
- Cluster node count greater than 16 is unsupported
- Usage of MD RAID for cluster storage is unsupported
- Snapshotting of clustered logical volumes is unsupported unless that volume has been activated exclusively on one node (as of release lvm2-2.02.84-1.el5 in RHEL 5.7 or lvm2-2.02.83-3.el6 in RHEL 6.1)
- Using multiple SAN devices to mirror GFS/GFS2 or clustered logical volumes across different subsets of the cluster nodes is unsupported
- Corosync using broadcast instead of multicast in RHEL 6 is unsupported (except for demo and pre-sales engagements)
- In RHEL 5.6+, broadcast mode is supported, with certain restrictions, as an alternative to multicast.
- In RHEL 6.2+, UDP unicast is fully supported as an alternative to multicast.
- Corosync's Redundant Ring Protocol is a Technology Preview in RHEL 6.0-6.3; it became fully supported in RHEL 6.4+, as described in the following article.
- The supported limits for the heartbeat token timeout are described in the following reference: What are the supported limits for heartbeat token timeout in Red Hat Cluster Suite?
- High Availability Resources
- Usage of NFS in an active/active configuration on top of either GFS or GFS2 is unsupported
- Usage of NFS and Samba on top of same GFS/GFS2 instance is unsupported
- Running the Red Hat High Availability Add-On or clusters on virtualized guests has limited support. For details, see the following article: Virtualization Support for High Availability in Red Hat Enterprise Linux 5 and 6
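Several of the items above depend on the RHEL release, so they lend themselves to an automated pre-deployment checklist. The sketch below covers only a subset of the list (node count, MD RAID, Oracle RAC on GFS2, rolling major upgrades, transport choice, redundant ring) and flags anything the article does not explicitly call supported. The `ClusterPlan` fields and transport labels are hypothetical, and the check is scoped to RHEL 5 and 6; it does not replace an architecture review by Red Hat.

```python
from dataclasses import dataclass
from typing import List, Tuple

TRANSPORTS = {"multicast", "broadcast", "udpu"}  # hypothetical labels


@dataclass
class ClusterPlan:
    """Hypothetical description of a planned cluster; fields are illustrative."""
    rhel: Tuple[int, int]            # (major, minor), for example (6, 4)
    nodes: int
    transport: str = "multicast"
    redundant_ring: bool = False
    md_raid_storage: bool = False
    oracle_rac_on_gfs2: bool = False
    rolling_major_upgrade: bool = False


def unsupported_items(plan: ClusterPlan) -> List[str]:
    """Return the listed unsupported items that this plan would hit."""
    major, minor = plan.rhel
    if major not in (5, 6):
        raise ValueError("this checklist covers RHEL 5 and 6 only")
    if plan.transport not in TRANSPORTS:
        raise ValueError(f"unknown transport {plan.transport!r}")
    issues = []
    if plan.nodes > 16:
        issues.append("more than 16 nodes")
    if plan.md_raid_storage:
        issues.append("MD RAID for cluster storage")
    if plan.oracle_rac_on_gfs2:
        issues.append("Oracle RAC on GFS2")
    if plan.rolling_major_upgrade:
        issues.append("rolling upgrade between major releases")
    if plan.transport == "broadcast":
        if major == 6:
            issues.append("broadcast transport on RHEL 6")
        elif (major, minor) < (5, 6):
            issues.append("broadcast transport before RHEL 5.6")
    if plan.transport == "udpu" and (major, minor) < (6, 2):
        issues.append("UDP unicast before RHEL 6.2")
    if plan.redundant_ring and (major, minor) < (6, 4):
        issues.append("redundant ring before RHEL 6.4")
    return issues


print(unsupported_items(ClusterPlan(rhel=(6, 1), nodes=17, transport="broadcast")))
```

Encoding release boundaries as `(major, minor)` tuples keeps the comparisons readable and matches the "RHEL x.y+" phrasing used throughout the list.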