Fencing fails in a RHEL 7, 8, 9 High Availability cluster because systemd initiates a graceful shutdown
Issue
- Fencing fails because systemd-logind handles the "power button" signal and initiates a graceful shutdown instead of power-cycling the system.
- When one node fences the other, the fenced node processes a power-button press and starts to shut down gracefully. Meanwhile, the fence operation fails on the fencing node, apparently because it takes too long and times out.
- Do we need to disable acpi / acpid in RHEL 7 clusters like we did in previous releases?
- Do I need to do anything in addition to disabling ACPI on RHEL 7 cluster nodes to prevent a node from shutting down gracefully when it is fenced? For example:
Aug 13 21:07:22 node01 systemd-logind: Power key pressed.
Aug 13 21:07:22 node01 systemd-logind: Powering Off...
Aug 13 21:07:22 node01 systemd-logind: System is powering down.
Aug 13 21:07:42 node02 stonith-ng[2803]: notice: log_operation: Operation 'reboot' [3114] for device 'node01-ilo' returned: -62 (Timer expired)
- A cluster node gracefully rebooted instead of being hard killed on RHEL 7:
Nov 2 10:57:01 node41 stonith-ng[8161]: notice: Operation reboot of node42 by node42 for crmd.20238@uxplpsgrd03.8b66209c: OK
Nov 2 10:57:01 node42 crmd[20238]: crit: We were allegedly just fenced by node41 for node42!
Nov 2 10:57:01 node42 stonith-ng[20234]: notice: Operation reboot of node42 by node41 for crmd.20238@node42.8b66209c: OK
Nov 2 10:57:01 node42 systemd-logind: Power key pressed.
- A cluster node gracefully rebooted instead of being hard killed on RHEL 8:
Sep 18 16:19:11 rhel8-1 stonith-ng[8161]: notice: Operation reboot of rhel8-1 by rhel8-2 for crmd.20238@uxplpsgrd03.8b66209c: OK
Sep 18 16:19:11 rhel8-1 crmd[20238]: crit: We were allegedly just fenced by rhel8-1 for rhel8-2!
Sep 18 16:19:11 rhel8-1 systemd-logind[792]: Session 1 logged out. Waiting for processes to exit.
Sep 18 16:19:11 rhel8-1 systemd-logind[792]: Removed session 1.
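To check whether a node shows the symptom from the excerpts above, its system log can be scanned for the systemd-logind "Power key pressed." message. The following is a minimal sketch, not a supported tool: the function name find_power_key_events is hypothetical, and the pattern matches only the message wording shown in the RHEL 7 excerpts (with or without a PID after systemd-logind).

```python
import re

# Matches the systemd-logind message seen in the excerpts above,
# with or without a "[PID]" suffix after the process name.
POWER_KEY_RE = re.compile(r"systemd-logind(\[\d+\])?: Power key pressed\.")


def find_power_key_events(lines):
    """Return the log lines in which systemd-logind reports a power key press.

    `lines` is any iterable of strings, for example an open log file.
    The function name is hypothetical and used only for illustration.
    """
    return [line.rstrip("\n") for line in lines if POWER_KEY_RE.search(line)]
```

Passing an open log file (for instance /var/log/messages, assuming that is where syslog messages are written on the node) returns the matching lines. A match close to the time of a fence operation suggests the node handled the power button in software rather than being powered off immediately.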
Environment
- Red Hat Enterprise Linux (RHEL) 7, 8, 9 with the High Availability Add-On
- One or more pacemaker cluster nodes (or pacemaker_remote nodes) associated with a stonith device that uses a power method that connects to a BMC or system-management controller such as an iLO, RSA, DRAC, or iDRAC.