High Availability Cluster nodes shut down gracefully rather than powering off when fenced in RHEL

Solution Verified - Updated -

Environment

  • Red Hat Cluster Suite 4+
  • Red Hat Enterprise Linux Server 5 (with the High Availability Add-On)
  • Red Hat Enterprise Linux Server 6 (with the High Availability Add-On)
  • The 'acpid' daemon is running and/or ACPI is enabled in the kernel
  • Integrated fencing device (any IPMI BMC, IBM RSA, HP iLO, Dell DRAC, etc.)

For Red Hat Enterprise Linux 7, please refer to Solution 1578823.

Issue

  • When a cluster node is fenced, a graceful shutdown occurs: the node issues an init 6. We expect it to be powered off immediately.
  • A failure such as a kernel panic occurs and the node is not powered off immediately when it is fenced.
  • Fenced server does not reboot cluster node
  • A pair of HP DL380 servers is managed under RHCS. Sometimes fencing does not work, and the power cord must be unplugged and replugged to restart the node.

Resolution

Stop and disable the acpid daemon:

# service acpid stop  
# chkconfig --level 123456 acpid off
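
To confirm that the change took effect, the output of `chkconfig --list acpid` can be inspected for any runlevel still marked "on". The following is a minimal sketch, not part of the original solution: the helper name `acpid_enabled_runlevels` is hypothetical, and it assumes the usual `chkconfig --list` line layout (service name followed by `runlevel:state` fields).

```shell
# Hypothetical helper: read one "chkconfig --list <service>" line on stdin
# and print the runlevels in which the service is still "on".
# Prints an empty line if the service is off in every runlevel.
acpid_enabled_runlevels() {
    awk '{
        out = ""
        for (i = 2; i <= NF; i++) {          # field 1 is the service name
            split($i, kv, ":")               # kv[1] = runlevel, kv[2] = on/off
            if (kv[2] == "on")
                out = out (out == "" ? "" : " ") kv[1]
        }
        print out
    }'
}

# Demonstration on a sample line (assumed format, not captured from a real host).
echo "acpid 0:off 1:off 2:on 3:on 4:on 5:on 6:off" | acpid_enabled_runlevels
```

On a cluster node, the same helper would be fed real output with `chkconfig --list acpid | acpid_enabled_runlevels`; an empty result means acpid will not start at boot. `service acpid status` shows whether the daemon is still running in the current session.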

Alternatively, see the Cluster Administration documentation for other options.

Root Cause

Fence agents for integrated fencing devices (such as an IPMI-compliant BMC, or a lights-out device such as the IBM RSA-II, the HP iLO or the Dell DRAC) explicitly send a power-down command to the fence device. However, some fence devices generate an ACPI power-off event instead, which the acpid daemon on the host intercepts, triggering a graceful reboot instead of a hard power-off.

When a power fence device is used to fence a cluster node, that node should shut down almost immediately, as if the power cable had been pulled.

A clean shutdown can take an excessive amount of time to complete, thus delaying the fence operation. Even worse, if the clean shutdown gets stuck at some point, or if the node panics or freezes, the fence operation won't succeed, preventing the failover, and requiring manual intervention in order to return the cluster to an operational state.

The behavior of the fence operation through integrated devices, such as BMCs or lights-out management controllers, depends heavily on the vendor-specific model and firmware version of the device. Therefore, stopping and disabling the acpid daemon is a recommended best practice for any cluster node that uses an integrated fencing mechanism.

Below is a list of some of the most common fence agents that can trigger acpid to perform a graceful shutdown. If a graceful shutdown is occurring, acpid should be disabled.

  • fence_bladecenter
  • fence_drac
  • fence_drac5
  • fence_ilo
  • fence_ilo_mp
  • fence_ipmilan
  • fence_rsa
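
To check whether a node's configuration uses one of the agents listed above, the `agent` attributes of the fence devices in the cluster configuration can be compared against that list. The sketch below is illustrative only: the function name `list_acpi_sensitive_agents` is hypothetical, and it assumes the configuration is the XML file /etc/cluster/cluster.conf with `<fencedevice agent="..."/>` entries, as on RHEL 5 and 6.

```shell
# Agents named in the list above; extend as needed for other devices.
ACPI_SENSITIVE_AGENTS="fence_bladecenter fence_drac fence_drac5 fence_ilo fence_ilo_mp fence_ipmilan fence_rsa"

# Hypothetical helper: read cluster.conf XML on stdin and print, once each
# and in sorted order, every configured agent that appears in the list above.
list_acpi_sensitive_agents() {
    grep -o 'agent="[^"]*"' |            # pull out agent="..." attributes
        sed 's/^agent="//; s/"$//' |     # keep only the agent name
        sort -u |
        while read -r agent; do
            for known in $ACPI_SENSITIVE_AGENTS; do
                if [ "$agent" = "$known" ]; then
                    echo "$agent"
                fi
            done
        done
}
```

On a cluster node this would be run as `list_acpi_sensitive_agents < /etc/cluster/cluster.conf`. Any output indicates an integrated fence device for which disabling acpid is worth considering.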

Please note that disabling the acpid daemon doesn't disable the power management and saving features, which are handled by the ACPI kernel module. The 'acpid' daemon notifies user-space programs of ACPI events.

Some fencing devices delay for a few seconds prior to powering the machine off, simulating an administrator holding the ACPI-controlled power button for a few seconds. Fencing will not complete until the fencing device reports that fencing was successful.


1 Comment

This is quite interesting. It is a wonder that this was not known when RHCS was coming out. After all, acpid is not a new thing.