Red Hat Enterprise Linux Cluster, High Availability, and GFS Deployment Recommended Practices

Overview

This document describes recommended practices for deploying and upgrading Red Hat Enterprise Linux clusters using the High Availability Add-On and Red Hat Global File System (GFS/GFS2).

As clusters can be used in a wide range of both supported and unsupported scenarios to meet different business needs, this document describes certain exceptions and/or conditions that warrant a review by Red Hat to determine whether or not a cluster is supportable. An exception does not necessarily mean that a cluster is not supportable by Red Hat: while some exceptions result in unsupportable scenarios, others just need careful review and analysis before proceeding.

To obtain an architecture review, follow the instructions provided in What information is required for an Architecture Review of Red Hat Enterprise Linux High Availability and Resilient Storage?.

Environment

  • Red Hat Enterprise Linux 5+ with the High Availability Add-On or Resilient Storage Add-On
  • Red Hat Enterprise Linux 6+ with the High Availability Add-On or Resilient Storage Add-On

These additional articles describe some of the concepts below in more detail. Please refer to them as needed:

Reference Articles

Architecture Review Process Details

General Architecture Support Articles

Hardware Support Information

Third Party Hardware/Software Integration

This section describes recommended practices for cluster deployment using the Red Hat High Availability and Resilient Storage Add-Ons.

Choosing a Cluster Architecture

Red Hat Clustering is designed to run at a single physical site where latencies are expected to be LAN-like (under 2 msec). This remains the most recommended and rigorously tested configuration for running the Red Hat High Availability and Resilient Storage Add-Ons.

Multi-Site Clusters

Multi-site or disaster-tolerant clusters are separate clusters that run at different physical sites, typically using SAN-based storage replication to replicate data. Multi-site clusters are usually used in an active/passive manner for disaster recovery with manual failover of the active cluster to the passive cluster.

Stretch Clusters

Stretch clusters are single-cluster configurations that span multiple physical sites.

Additional details on the supportability of multi-site and stretch clusters can be found in Support for Red Hat Enterprise Linux Cluster and High Availability Stretch Architectures.

Selecting Cluster Node Hardware

The maximum number of cluster nodes supported by the High Availability Add-On is 16, and the same is true for the Resilient Storage Add-On, which includes GFS2 and CLVM. For more information, please see the following article. However, the majority of Red Hat's clustering customers use node counts much lower than the maximum. In general, if your cluster requires more than eight nodes, it is advisable to verify your cluster architecture with Red Hat before deployment to confirm that it is supportable. Red Hat's clustering solution is primarily designed to provide high availability and cold application failover, and it is not meant to be used for either high-performance or load-sharing clusters.
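
The node-count guidance above can be expressed as a small planning check. This is only an illustrative sketch: the function name and the returned labels are hypothetical, and the two thresholds (16 and 8) are the ones stated in this article, not values read from any Red Hat tool.

```python
MAX_SUPPORTED_NODES = 16  # stated maximum for both add-ons
REVIEW_THRESHOLD = 8      # above this, a review by Red Hat is advisable


def assess_node_count(nodes: int) -> str:
    """Classify a planned node count against the limits in this article.

    The labels returned here are illustrative, not official support terms.
    """
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    if nodes > MAX_SUPPORTED_NODES:
        return "unsupported"
    if nodes > REVIEW_THRESHOLD:
        return "review recommended"
    return "within common practice"


if __name__ == "__main__":
    for planned in (2, 8, 9, 16, 17):
        print(planned, assess_node_count(planned))
```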

Homogeneous hardware configurations (nodes with similar specifications in terms of CPU sockets, cores, memory, etc.) are recommended. If you require a heterogeneous cluster with nodes that have diverse hardware configurations (for example, if Node1 has four processors with 8 GB of RAM while Node2 has two processors with 2 GB of RAM), Red Hat recommends that you create an initial proof-of-concept cluster to ensure that you encounter no surprises during production. You may also submit your cluster architecture for review by Red Hat prior to deployment.

NOTE: Certain commercial hardware variants (servers and network switches) have known issues with multicast, which means that usage of GFS2 with a high volume of POSIX (fcntl) lock activity can result in performance and stability issues.

Selecting and Configuring Cluster Storage

There are no recommended practices for selecting cluster storage hardware.

When using LVM in a cluster:

  • When using HA-LVM, if a synchronization is required (for example, after adding a leg to a mirror or restoring a failed mirror leg), it is recommended to disable the service utilizing that mirror. If this is not possible, an administrator may freeze (clusvcadm -Z) the service in place to prevent failover while the mirror is rebuilding. Once the mirror synchronization has completed, administrators must unfreeze (clusvcadm -U) the service.
    • There are known issues if a service relocation occurs while synchronizing a mirror leg, notably bug #692186.
    • If utilizing LVM mirroring and synchronization of a mirror is required while an application is utilizing the mirror, it is recommended practice to test this process on a non-production cluster with realistic application load to ensure that the application is not adversely impacted by the I/O performance drop caused by the mirror synchronization.
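
The freeze/unfreeze procedure described above might be wrapped as follows, so that the service is unfrozen even if the synchronization step fails. This is a minimal sketch under stated assumptions: the helper names (run_command, frozen_service) and the service name are hypothetical, and only the clusvcadm -Z and clusvcadm -U invocations come from this article. The example at the bottom uses a dry-run runner that prints the commands instead of executing them.

```python
import subprocess
from contextlib import contextmanager


def run_command(argv):
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(argv, check=True)


@contextmanager
def frozen_service(service, run=run_command):
    """Freeze `service` for the duration of the block, then unfreeze it.

    `run` is injectable so the sequence can be dry-run or tested
    without a real cluster.
    """
    run(["clusvcadm", "-Z", service])      # freeze: prevent failover
    try:
        yield
    finally:
        run(["clusvcadm", "-U", service])  # unfreeze, even after an error


if __name__ == "__main__":
    # Dry run: print the commands rather than executing them.
    def dry_run(argv):
        print(" ".join(argv))

    with frozen_service("example_service", run=dry_run):
        print("(wait here until the mirror synchronization completes)")
```

How the wait for synchronization is implemented is deliberately left open here; it depends on how the mirror state is monitored in your environment.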

Setting Up Cluster Networks

Red Hat highly recommends separating the private network used for cluster heartbeat from the application network wherever possible.

Red Hat supports the use of a redundant ring, which provides additional resilience when the cluster heartbeat network fails, only on RHEL 6 Update 4 and later and on RHEL 7.

Red Hat recommends using NIC bonding so that a single link failure does not prevent communication between nodes. However, pay special attention to which bonding mode is configured, as each RHEL release limits the supported modes.

Network infrastructure used for the cluster heartbeat network must be switched and should be configured to eliminate single points of failure that could lead to split-brain situations (for example, it would not be recommended to have four nodes on Switch A and four nodes on Switch B with only a single interconnect between Switch A and Switch B). In addition, Red Hat does not support the use of crossover cables for the cluster heartbeat network between nodes in a two-node cluster.
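
To see why the four-plus-four switch layout above is risky, consider what happens when the single interconnect fails and the nodes divide into two groups. The sketch below assumes one vote per node, no quorum disk, and a strict-majority rule; the function name is hypothetical and the model is a simplification rather than a description of the cluster software's internals.

```python
def surviving_partition(partition_sizes):
    """Return the index of the partition holding a strict majority of nodes.

    Returns None when no partition has more than half of all nodes,
    which is the case for an even split.
    """
    total = sum(partition_sizes)
    for index, size in enumerate(partition_sizes):
        if 2 * size > total:
            return index
    return None


if __name__ == "__main__":
    # Four nodes per switch: losing the interconnect leaves no majority.
    print(surviving_partition([4, 4]))
    # Five and three: the larger side keeps a majority.
    print(surviving_partition([5, 3]))
```

Under these assumptions an even split leaves neither half with a majority, so the lone interconnect becomes a single point of failure for the whole cluster.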

Red Hat Clustering in Red Hat Enterprise Linux 5 and the High Availability Add-On in Red Hat Enterprise Linux 6 use multicasting for cluster membership. If multicasting cannot be enabled in your production network, broadcast may be considered as an alternative in RHEL 5.6+. In RHEL 6, you must contact Red Hat to discuss alternate networking options before proceeding with deployment.

Using Quorum Disk and Tweaking Cluster Membership Timers

The use of qdiskd with Red Hat Clustering is optional in most cases. The exception is a configuration with a two-node cluster where the fence devices are on a separate network from the cluster heartbeat network, requiring a quorum disk to prevent split-brain situations from causing a fence race.

The use of qdiskd for clusters with more than four nodes is not recommended, as it adds complexity with very little benefit. In a cluster with more than four nodes, it is highly unlikely that more than 50% of the nodes will fail at the same time, so users are advised against using qdiskd in such situations.
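
The arithmetic behind this advice can be sketched as follows, assuming one vote per node, no quorum disk, and a simple-majority quorum of floor(n/2) + 1 votes. These are illustrative assumptions, and the function names are hypothetical; special handling such as a dedicated two-node mode is not modeled.

```python
def quorum_votes(nodes: int) -> int:
    """Votes needed for a simple majority with one vote per node."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Simultaneous node failures the cluster can absorb and stay quorate."""
    return nodes - quorum_votes(nodes)


if __name__ == "__main__":
    print("nodes  quorum  tolerated failures")
    for n in (2, 3, 4, 5, 8, 16):
        print(f"{n:5d}  {quorum_votes(n):6d}  {tolerated_failures(n):18d}")
```

Under these assumptions a two-node cluster tolerates no failures without extra help, while larger clusters already absorb several simultaneous failures, which is why a quorum disk adds little beyond four nodes.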

Red Hat does not recommend changing the default timer values associated with cluster membership as changing them could have a cascading impact on other timers and overall cluster behavior. If you need to tweak any of the cluster membership timer values (for example, token and consensus timeout), you must obtain Red Hat's approval before deploying your cluster for production.

Using Highly Available Resources

Red Hat Clustering works best with resources in an active/passive mode, requiring a cold failover or a restart if the current active application instance fails. Use of Red Hat Clustering or the Red Hat High Availability Add-On is not advisable for applications running in an active/active load-sharing mode.

Supported file-system resources are as follows:

  • Red Hat Enterprise Linux 5: ext3, XFS (RHEL 5.7+), GFS and GFS2
  • Red Hat Enterprise Linux 6: ext3, ext4, XFS (RHEL 6.2+), GFS2
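
The support list above can be encoded as a lookup table for use in a pre-deployment check. This is an illustrative sketch: the table and function names are hypothetical, and the data covers only the releases and file systems named in this article, so a False result for any other release means "not listed here" rather than "unsupported".

```python
# Minimum minor release per file system, taken from the list above.
# 0 means the article states no minimum minor release.
SUPPORTED_FS = {
    5: {"ext3": 0, "xfs": 7, "gfs": 0, "gfs2": 0},
    6: {"ext3": 0, "ext4": 0, "xfs": 2, "gfs2": 0},
}


def fs_resource_supported(major: int, minor: int, fs: str) -> bool:
    """Return True if `fs` is listed as a supported file-system resource
    for RHEL `major`.`minor` in this article."""
    minimum = SUPPORTED_FS.get(major, {}).get(fs.lower())
    return minimum is not None and minor >= minimum


if __name__ == "__main__":
    print(fs_resource_supported(5, 6, "XFS"))   # XFS requires RHEL 5.7+
    print(fs_resource_supported(6, 2, "xfs"))
    print(fs_resource_supported(6, 5, "gfs"))   # GFS is not listed for RHEL 6
```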

Contact Red Hat to obtain an approval if your cluster requires any of the following types of resources:

  • NFS on top of GFS/GFS2
  • Active/active (load balancing) resource configurations
  • Usage of third-party software resource agents that ship with Red Hat Enterprise Linux (SAP, Sybase ASE, Oracle 10g/11g)
  • Usage of DRBD (Please see article Does RHEL-6 support DRBD?)
  • Custom resource agents that are not shipped with Red Hat Enterprise Linux
    • NOTE: Custom resource agents are allowed, but only the resource agents shipped with Red Hat Enterprise Linux are fully supported by Red Hat.

Using GFS and GFS2

Please refer to the following article for detailed GFS and GFS2 recommended practices: How to Improve GFS/GFS2 File System Performance and Prevent Processes from Hanging

Unsupported Items

The following features/scenarios will result in an unsupported cluster deployment.

Items marked for Technology Preview indicate that while the item is presently unsupported, Red Hat is working to fully support the feature in a future release of Red Hat Enterprise Linux.

Each of these items applies to both Red Hat Enterprise Linux 5 and Red Hat Enterprise Linux 6 deployments, except where specifically noted:

  • Overall architecture 
    • Oracle RAC on GFS2 is unsupported on all versions of RHEL
    • Staged/rolling upgrades between major releases are not supported (for example, a rolling upgrade from Red Hat Enterprise Linux 5 to Red Hat Enterprise Linux 6).
  • Hardware
    • Cluster node count greater than 16 is unsupported
  • Storage
    • Usage of MD RAID for cluster storage is unsupported
    • Snapshotting of clustered logical volumes is unsupported unless that volume has been activated exclusively on one node (as of release lvm2-2.02.84-1.el5 in RHEL 5.7 or lvm2-2.02.83-3.el6 in RHEL 6.1)
    • Using multiple SAN devices to mirror GFS/GFS2 or clustered logical volumes across different subsets of the cluster nodes is unsupported
  • Networking
  • High Availability Resources
    • Usage of NFS in an active/active configuration on top of either GFS or GFS2 is unsupported
    • Usage of NFS and Samba on top of the same GFS/GFS2 instance is unsupported
  • Running the Red Hat High Availability Add-On or clusters on virtualized guests has limited support. For details, see the following article: Virtualization Support for High Availability in Red Hat Enterprise Linux 5 and 6

11 Comments

The article https://access.redhat.com/kb/docs/DOC-35662 referenced here is unavailable to non-Red Hat associates.

The link about DRBD support (https://access.redhat.com/kb/docs/DOC-50386) refers to an archived (not accessible) document.

This document states several times that broadcast/unicast is tech preview in RHEL 6 and unsupported, but it is supported in 6.2 and 6.3.

The link to "Virtualization Support for High Availability in Red Hat Enterprise Linux 5 and 6" should be https://access.redhat.com/site/articles/29440

Thanks Don! I've fixed the link.

I'm looking for information about HA support for PowerLinux. Is it supported?

Hi, can someone please help differentiate when to use high availability or resilient storage add-ons?

Thank you!

Lemuel,

In the event this information is still applicable, Resilient Storage includes Cluster Logical Volume Manager daemon and GFS2 related packages and support. High Availability does not include these packages or support for their use. Otherwise, the products are identical.

Dear sir,

I am Krishna Reddy. We are working on clustering in RHEL 7.1.

We have configured a basic two-node cluster in RHEL 7.1 with the following services:

  1. A virtual IP resource, added using the built-in ocf:heartbeat:IPaddr2 resource agent
  2. An Apache web server resource, so that the website keeps working when one node fails.

Our module description: we have a module that consists of one shell script, which in turn executes multiple binary executables to complete a task.

We have the following queries about supporting our module in a cluster:

  1. Does RHEL 7.1 clustering support configuring custom user applications that consist of binary executables?
  2. Is there any mechanism to detect the failure of the master node and send a notification message to the passive node?
  3. How much time does it take to switch over and start executing on the passive node when the master node fails?
  4. How do we make the passive node the master node when the master node fails due to a network failure?
  5. How do we detect when one of the Ethernet interface links goes down or fails, and send a notification message to the passive node?
  6. How do we detect when the heartbeat fails between the services?
  7. We used the Keepalived service for redundancy, but its 3-second advertisement window delays the switchover, and because of this delay we lose 3 seconds of data. Is there any way to minimize this loss using Keepalived or RHEL 7.1 clustering?

Support requested from the RHEL team:

  1. Please provide the procedure to deploy/enable user applications in RHEL clustering so that node failure is monitored and the application switches over to the other node.
  2. Please provide the procedure to detect node failure and send a notification to the active node.

We request you to please clarify the above issues.

Please do the needful. Awaiting your reply...

This article doesn't appear to have been updated for RHEL 7.x (pcs). Please update this information.

Hello Sadashiva,

While some of these guidelines are applicable to RHEL 7 as well as RHEL 5 and 6, you can refer to this article for more information regarding support policies for RHEL 7 Resilient Storage clusters, and our official deployment guide for assistance configuring a RHEL 7 pacemaker cluster. If you have any specific supportability questions, please don't hesitate to reach out to technical support for more assistance!

Regards,

Cole Towsley, RHCE Software Maintenance Engineer Red Hat Inc.