Support Policies for RHEL High Availability Clusters - General Requirements for Fencing/STONITH

Contents

Overview

Applicable Environments

  • Red Hat Enterprise Linux (RHEL) with the High Availability Add-On

Useful References and Guides

Introduction

Fencing, also known as STONITH ("Shoot The Other Node In The Head"), is a key aspect of a stable High Availability cluster design.

This guide offers Red Hat's policies and requirements around fencing, fence devices, and STONITH in a RHEL High Availability cluster, including clusters deployed in conjunction with other products, such as RHEL Resilient Storage, Red Hat OpenStack Platform, Red Hat Storage, Red Hat Satellite, and others. Users of RHEL High Availability clusters must adhere to these policies in order to be eligible for support from Red Hat with the appropriate product support subscriptions.

Policies

STONITH/fencing must be enabled: Wherever the RHEL High Availability software offers the ability to disable STONITH, fenced, or fencing functionality, Red Hat does not support clusters that have fencing disabled via those mechanisms.

  • pacemaker clusters: Cluster property stonith-enabled=false is not a supported configuration in pacemaker clusters. It must be set to true - the default value - for the cluster deployment in question to receive support and consideration from Red Hat on any High-Availability-related concern, whether that concern be inherently related to fencing or not.

  • cman clusters: FENCE_JOIN=no is not a supported configuration in cman clusters. It must be set to yes - the default value - for the cluster deployment in question to receive support and consideration from Red Hat on any High-Availability-related concern, whether that concern be inherently related to fencing or not.
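As a hedged illustration only, the settings above can be inspected and corrected with commands along these lines (exact syntax varies by RHEL and pcs version):

```shell
# pacemaker clusters: show the current value of the stonith-enabled property
pcs property show stonith-enabled

# restore the supported default if fencing had been disabled
pcs property set stonith-enabled=true

# cman clusters: FENCE_JOIN is read from /etc/sysconfig/cman;
# it must be absent or set to "yes"
grep FENCE_JOIN /etc/sysconfig/cman
```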


Every node must be managed by a fence device: For a cluster to receive support and consideration from Red Hat, every node in that cluster must have a configured fence device associated with it.

  • pacemaker clusters: pacemaker offers many ways for the cluster to dynamically or statically determine that a node can be managed by a particular stonith device in the configuration. Administrators must ensure that every node in the cluster is manageable by some stonith device configured in that cluster.

  • cman clusters without pacemaker: For every node in the cluster, there must exist at least one device in that node's <fence/> stanza in /etc/cluster/cluster.conf.
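As an illustrative sketch, a per-node fence device in a pacemaker cluster might be defined as follows; the agent, address, credentials, and node name are placeholders, and parameter names vary across fence-agents versions:

```shell
# Create an IPMI-based stonith device that is only valid for node1
pcs stonith create fence-node1 fence_ipmilan \
    ip=192.0.2.10 username=admin password=secret \
    pcmk_host_list=node1.example.com

# List the configured stonith devices to verify node coverage
pcs stonith
```

In cman clusters, the equivalent association is a <device/> entry inside each node's <fence/> stanza in /etc/cluster/cluster.conf.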


sbd watchdog-timeout fencing instead of a stonith device: sbd with watchdog-timeout fencing can be used in pacemaker clusters as an alternative to a fence-agent-based device - if configured according to all other relevant support policies applicable to sbd. All nodes must run sbd or else have an associated stonith device.
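A minimal sketch of enabling watchdog-based sbd in a pacemaker cluster, assuming a functioning hardware watchdog (e.g. /dev/watchdog) is present on every node; the timeout value is a placeholder, not a recommendation:

```shell
# On every node: point sbd at the watchdog device in /etc/sysconfig/sbd
# (e.g. SBD_WATCHDOG_DEV=/dev/watchdog), then enable the service
systemctl enable sbd

# Tell pacemaker how long to wait before treating a watchdog
# self-fence as complete
pcs property set stonith-watchdog-timeout=10s
```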


Clusters with shared block storage or DLM require power- or storage-based devices: If the nodes of a cluster share access to block storage devices in any way - even if only in an active/passive manner - or if the cluster has any components that use DLM, then the cluster is subject to more stringent fencing requirements. In such clusters, every node must be managed by a device that controls either:

  • The power state of that node (sbd qualifies here), or
  • Access to all block storage devices available to that node that are shared with other nodes.

If a node in a cluster with shared storage or DLM is associated only with a device using an alternative agent that does not manage power or storage access - such as fence_kdump - then that cluster will not receive support or consideration from Red Hat.

If a node in a cluster with shared storage is associated only with a device using a storage-based agent that does not control access to all block storage devices shared by the cluster, then that cluster will not receive support or consideration from Red Hat.

If a node is associated only with a device using a power-based agent that does not authoritatively control that node's power state, then that cluster will not receive support or consideration from Red Hat. For instance, if a node has a power-based device but that server has a redundant or independent power source that can keep the server operational through the disabling of the cluster-managed device, then that device does not meet the requirements for support.
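To illustrate the storage-access requirement above, a hedged sketch using fence_scsi: the device must be given every block device the nodes share, since omitting any shared device fails the "all block storage devices" condition (device paths and node names are placeholders):

```shell
# fence_scsi revokes SCSI persistent reservations; list every shared LUN,
# because any device left off this list remains writable by a fenced node
pcs stonith create fence-scsi fence_scsi \
    devices=/dev/mapper/shared-lun1,/dev/mapper/shared-lun2 \
    pcmk_host_list="node1 node2" \
    meta provides=unfencing
```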


Clusters with no shared block storage or DLM may use alternative agents and manual fencing: Clusters that do not share block storage in any way and do not use DLM may use devices with alternative agents that do not control power or storage access - such as fence_kdump - as their only automatic means of fencing. Red Hat's support for such use cases is subject to the following conditions:

  • Events which trigger fencing will execute the configured agent, and if that operation fails, an administrator must intervene to manually fence the node by powering it off. After manual fencing by powering off, the administrator can acknowledge to the cluster that manual fencing has taken place using the appropriate command - [pacemaker clusters] | [cman clusters]
  • Red Hat does not place a high priority on developing features or behaviors specific to configurations that use such a fence agent that does not manage access to shared resources. Cluster functionality is designed around configurations that employ proper power- or storage-based fence mechanisms.
  • Even without shared storage, some applications may behave incorrectly or present conflicts in some manner if manual fencing is acknowledged without the node in question having been properly powered off. Red Hat Support will not provide support or consideration for behaviors following manual-fence acknowledgement where it cannot be proven that the manually-fenced node was fully powered off before acknowledgement was provided.
  • Red Hat still recommends the usage of a power-based agent or sbd for optimal behavior in the cluster.
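The manual-fencing acknowledgement referenced above can be issued - only after confirming that the node is fully powered off - with commands along these lines (the node name is a placeholder):

```shell
# pacemaker clusters: tell the cluster that node1 has been manually fenced
pcs stonith confirm node1.example.com

# cman clusters
fence_ack_manual node1.example.com
```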

NOTE: Most Red Hat OpenStack Platform (RH-OSP) deployments with highly available controllers fall into this category of clusters without shared storage. While RH-OSP deployments may utilize distributed storage throughout such a cluster, those mechanisms do not carry the same conditions and considerations as true shared-block-storage setups. Red Hat still recommends power-based fencing or sbd in such setups, but these clusters may be used with alternative agents and manual fencing if preferred.


Limited support for environments using fence agents not provided by Red Hat: In cluster deployments utilizing any fence agent that is not distributed or supported by Red Hat, Red Hat Support may not assist with investigations or engagements in which fencing activity is involved. If problematic behavior results from or follows usage of a third-party fence agent, Red Hat may require that the behavior be reproduced in a configuration using only Red Hat-provided components in order for the investigation to proceed. Red Hat recommends using one of the power- or storage-based fence agents it provides, or sbd.


Limitations around acknowledgement of manual fencing: Acknowledgement of manual fencing - [pacemaker clusters] | [cman clusters] - is intended only for execution by an administrator after a node has been confirmed to be powered off completely. Any behavior or scenario resulting from any other usage of such acknowledgement will not be considered or supported by Red Hat.

2 Comments

I do not prefer this KB article, for two reasons: 1) this information is provided by Red Hat in a Knowledge Base article, while all of it should be in the official Configuration and Administration manuals. Currently the manuals fail to provide the information in this KB article - and even partly contradict it. Would Red Hat expect users to read both the manuals and all KB articles to understand how to set up clusters and which configurations are supported (or more generally, to use any of the Red Hat products...)? 2) this article is used by Red Hat support to refuse all support for all configurations where "stonith-enabled=false". But there's no indication why fencing/stonith should be enabled for all configurations - even for configurations that do not have any shared resources...

stonith-enabled should be enabled and fencing devices should be configured because, if a node fails to properly recover, the remaining nodes will elect a fencing host that will remotely reboot the failed node. When stonith is not enabled and fencing devices are not configured, this will never happen, and a node in a doubtful state might impair the remaining services. For example, if pacemaker failed to stop rabbitmq on a node and the next step would be to force a reboot of that node to recover from the failure, the reboot won't happen, and rabbitmq might fail on the other nodes as a result. This mechanism ensures that the node is no longer reachable by rebooting it, in the expectation that it will properly resume its services once back up.