Resources run on two nodes simultaneously, data on shared storage is corrupted, and/or other unexpected behavior occurs in a RHEL High Availability cluster using fence_ipmilan with method=cycle

Solution Unverified - Updated -

Environment

  • Red Hat Enterprise Linux (RHEL) 5, 6, 7, 8 with the High Availability Add On
  • One or more stonith or fence devices configured to use one of the following agents: fence_ipmilan, fence_ilo3, fence_ilo4, fence_imm, or fence_idrac
  • Or a fencing device configured with method=cycle in /etc/cluster/cluster.conf in cman-based clusters or in the CIB for pacemaker-based clusters

Issue

  • A node had trouble communicating and the cluster decided to fence it and take over its resources, but it seems that another node mounted file system resources before the node got powered off, and data was corrupted.
  • fence_ipmilan returns success before a node actually gets powered off
  • A node failed to stop a resource and so needed to be fenced, and somehow that node was still alive to log the completion of that fence action from another node. How can this be possible if the node should have powered off before fencing completed?
Aug 17 08:33:08 node2 stonith-ng[17738]:   notice: remote_op_done: Operation reboot of node2 by node1 for stonith_admin.cman.120215@sapha014hb0.ee6744ed: OK
  • When a node is fenced in my pacemaker cluster due to a resource stop timeout, the rest of the cluster logs "telling cman to remove nodeid 9 from cluster", the membership changes, but GFS2 access stays blocked. All nodes log "Trying to acquire journal lock" but nothing else happens. We only see this behavior with method="cycle" in our stonith device.

Resolution

IMPORTANT: Configure all IPMI-based fencing agents, such as fence_ipmilan, fence_ilo3, fence_ilo4, fence_imm, and fence_idrac, to use method=onoff (the default in most cases) instead of cycle. Also make sure that each cluster node is configured to power off immediately when it is fenced; this applies to RHEL 5 and 6 cluster nodes as well as to RHEL 7 cluster nodes.

If you have set the method attribute to cycle for any fence agent, change it so that the method attribute has a value of onoff.

RHEL 6

Several fence agents default to cycle for the method attribute on RHEL 6. If you are using any of the following fence agents, add the attribute method=onoff to each configured device that uses them.

  • fence_ipmilan
  • fence_ilo3
  • fence_ilo4
  • fence_imm
  • fence_idrac

pacemaker-based clusters

Update the stonith device configurations to set method=onoff explicitly. Do not simply remove the method attribute: leaving the value off of an attribute when updating causes it to be un-set, and the agent then falls back to its default, which for the agents listed above is cycle.

# pcs stonith update node1-ipmi method=onoff

Purely cman-based clusters

Update any fencedevice definitions in /etc/cluster/cluster.conf to use method="onoff" instead.

<fencedevice name="node1-ipmi" agent="fence_ipmilan" ipaddr="node1-ipmi.example.com" userid="myuser" password="StrongPassword" lanplus="1" method="onoff"/>
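To find devices that still need this change, the fencedevice definitions can be scanned programmatically. The following is a minimal sketch, not a Red Hat tool: the function name, the sample XML, and the device names are illustrative assumptions, and the agent list is simply the one given above for RHEL 6. It only inspects fencedevice elements, so any agent parameters set elsewhere in cluster.conf would still need a manual review.

```python
import xml.etree.ElementTree as ET

# Agents this article lists as defaulting to method=cycle on RHEL 6.
CYCLE_DEFAULT_AGENTS = {
    "fence_ipmilan", "fence_ilo3", "fence_ilo4", "fence_imm", "fence_idrac",
}


def risky_fencedevices(cluster_conf_xml):
    """Return names of fencedevice elements that would end up using cycle.

    A device is flagged if it sets method="cycle" explicitly, or if it sets
    no method at all and its agent is in CYCLE_DEFAULT_AGENTS.
    """
    root = ET.fromstring(cluster_conf_xml)
    risky = []
    for dev in root.iter("fencedevice"):
        method = dev.get("method")
        agent = dev.get("agent")
        if method == "cycle" or (method is None and agent in CYCLE_DEFAULT_AGENTS):
            risky.append(dev.get("name"))
    return risky


# Hypothetical configuration used only to exercise the function.
SAMPLE = """<cluster name="example" config_version="1">
  <fencedevices>
    <fencedevice name="node1-ipmi" agent="fence_ipmilan" method="onoff"/>
    <fencedevice name="node2-ipmi" agent="fence_ipmilan" method="cycle"/>
    <fencedevice name="node3-ilo" agent="fence_ilo3"/>
  </fencedevices>
</cluster>"""

if __name__ == "__main__":
    print(risky_fencedevices(SAMPLE))
```

On a real node the input would be the contents of /etc/cluster/cluster.conf rather than the sample string.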

RHEL 7 or later

The only fencing agent that defaults to method=cycle on RHEL 7 is fence_ilo3. There are two ways to change this:

pacemaker-based clusters
  • Update the fence-agents packages to the version provided by erratum RHBA-2018:0758 or later. The erratum changes the default value of method to onoff for the fence_ilo3 agent.
  • WORKAROUND: Update the stonith device configurations to use method=onoff explicitly. For agents other than fence_ilo3, or for fence_ilo3 once the erratum above (or a later one) is installed, it is also sufficient to not specify a method at all. Leaving the value off of an attribute when updating causes it to be un-set, so the default is used.
# pcs stonith update node1-ipmi method=onoff
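
To review which stonith devices in a pacemaker-based cluster set method explicitly, the CIB can be inspected as XML. The sketch below assumes the CIB has been saved to a file or string beforehand (for example with pcs cluster cib); the function name and the sample CIB fragment are illustrative assumptions, not output copied from a real cluster. Devices reported with no explicit method use the agent default, which depends on the agent and RHEL release as described in this article.

```python
import xml.etree.ElementTree as ET


def stonith_methods(cib_xml):
    """Map each stonith primitive id to (agent type, explicit method or None)."""
    root = ET.fromstring(cib_xml)
    found = {}
    for prim in root.iter("primitive"):
        if prim.get("class") != "stonith":
            continue
        method = None
        # Only look at the device's own instance attributes.
        for nv in prim.findall("instance_attributes/nvpair"):
            if nv.get("name") == "method":
                method = nv.get("value")
        found[prim.get("id")] = (prim.get("type"), method)
    return found


# Hypothetical, heavily trimmed CIB used only to exercise the function.
SAMPLE_CIB = """<cib>
  <configuration>
    <resources>
      <primitive id="node1-ipmi" class="stonith" type="fence_ipmilan">
        <instance_attributes id="node1-ipmi-instance_attributes">
          <nvpair id="n1-ipaddr" name="ipaddr" value="node1-ipmi.example.com"/>
          <nvpair id="n1-method" name="method" value="cycle"/>
        </instance_attributes>
      </primitive>
      <primitive id="node2-ilo" class="stonith" type="fence_ilo3">
        <instance_attributes id="node2-ilo-instance_attributes">
          <nvpair id="n2-ipaddr" name="ipaddr" value="node2-ilo.example.com"/>
        </instance_attributes>
      </primitive>
    </resources>
  </configuration>
</cib>"""

if __name__ == "__main__":
    for dev_id, (agent, method) in stonith_methods(SAMPLE_CIB).items():
        print(dev_id, agent, method if method is not None else "(agent default)")
```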

NOTE: RHEL 8 or later defaults to onoff for the attribute method for all fence-agents that use the method attribute.

Root Cause

fence_ipmilan offers a special method attribute that controls how a reboot operation is carried out. If using the default value of onoff, then the agent sends a power-off command to the device, then sends a power-on, and evaluates the results of those and reports that back as the exit status. This ensures that no successful return code can be sent back to the cluster stack until a node is successfully powered off.

However, the alternate value of cycle results in the agent issuing a single command to the hardware device telling it to cycle the node itself. This relies on the device firmware carrying out the action in the proper way and reporting the status successfully, since both before and after the status of the server will be "on", so there is no way to confirm that it actually powered off. Some server make/model firmwares might actually return a successful status from this cycle request before proceeding to power off the server. The end result is that the fence agent may believe the operation was a success several seconds or more before a node actually powers off.
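
The difference between the two methods can be illustrated with a toy model. This is not the fence agent's code: FakeBMC, reboot_onoff, and reboot_cycle are hypothetical names, and the model assumes a firmware that acknowledges a cycle request before acting on it. The point is only that onoff has an observable "off" state to check, while cycle does not.

```python
class FakeBMC:
    """Hypothetical stand-in for an IPMI device; not a real API."""

    def __init__(self):
        self.power = "on"
        self.cycle_pending = False

    def power_off(self):
        self.power = "off"

    def power_on(self):
        self.power = "on"

    def status(self):
        return self.power

    def cycle(self):
        # Assumed firmware behavior: acknowledge now, power-cycle later.
        self.cycle_pending = True
        return True


def reboot_onoff(bmc):
    """Power off, confirm the off state, then power on.

    Returns (success, off_was_confirmed).
    """
    bmc.power_off()
    if bmc.status() != "off":
        return False, False
    bmc.power_on()
    return bmc.status() == "on", True


def reboot_cycle(bmc):
    """Send a single cycle request.

    The status is "on" both before and after, so the agent can only trust
    the acknowledgement. Returns (success, off_was_confirmed).
    """
    acknowledged = bmc.cycle()
    return acknowledged, False


if __name__ == "__main__":
    print("onoff:", reboot_onoff(FakeBMC()))
    print("cycle:", reboot_cycle(FakeBMC()))
```

In the cycle case the model reports success while the simulated node is still powered on with the cycle merely pending, which mirrors the window described above.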

This can cause problems for the cluster stack in a few ways. The most significant is that the successful completion of fencing signals the resource manager on other nodes to start recovering resources that were running on the fenced node, so those resources can end up running on two nodes simultaneously. If one node thinks the other has powered off and, for example, takes over a file system resource, mounts it, and submits I/O to it while the other node is still issuing I/O to it, data corruption could ensue.

While this ultimately would be a problem on the IPMI-device firmware side, Red Hat is considering whether a change is necessary to prevent usage of the cycle method within High Availability clusters, or whether there is some alternative solution that could prevent issues like this. This investigation is occurring in Red Hat Bugzilla #1271780.

This applies to all IPMI-based fencing agents, such as fence_ipmilan, fence_ilo3, fence_ilo4, fence_ilo5, fence_imm, and fence_idrac.

Diagnostic Steps

  • To demonstrate the nature of this problem, simply execute fence_ipmilan -o reboot -m cycle [...] from one node against another node's fence device, then interact with a console on that fenced node constantly while waiting for the fence_ipmilan operation to complete. If the node is still responsive on its console or ssh session after the fence_ipmilan command has exited with a success status, then the cluster is susceptible to unexpected behavior when using the cycle method, and it should be avoided.
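
One way to make "still responsive" measurable during this test is to probe a TCP port on the fenced node immediately after the fence command exits. The sketch below assumes that an accepted TCP connection (for example to the SSH port) is a reasonable sign the node has not powered off yet; the function name, host name, and port are illustrative, not part of any Red Hat tooling.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat the node as not responding.
        return False


if __name__ == "__main__":
    # Hypothetical target; replace with the node that was just fenced.
    print(port_is_open("node2.example.com", 22))
```

If this returns True right after fence_ipmilan has exited with a success status, the node was most likely still running when fencing was reported as complete.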

