Why does a fencing operation complete with success but the node is not rebooted and instead pacemaker is shut down on that node?
Issue
- While performing a manual fence operation using pcs stonith fence node2, the command completes without error, but node2 is not actually rebooted. It remains up and running, and only the Pacemaker stack is shut down on node2:
[root@node1 ~]# pcs stonith fence node2
Node: node2 fenced
Logs from node1:
stonith-ng[1485]: notice: Client stonith_admin.6884.651b3028 wants to fence (reboot) 'node2' with device '(any)'
stonith-ng[1485]: notice: Requesting peer fencing (reboot) of node2
stonith-ng[1485]: notice: fence_node1 can not fence (reboot) node2: static-list
stonith-ng[1485]: notice: fence_node2 can fence (reboot) node2 (aka. 'vm-node2'): static-list
stonith-ng[1485]: notice: fence_node1 can not fence (reboot) node2: static-list
stonith-ng[1485]: notice: fence_node2 can fence (reboot) node2 (aka. 'vm-node2'): static-list
stonith-ng[1485]: notice: Operation 'reboot' [6885] (call 2 from stonith_admin.6884) for host 'node2' with device 'fence_node2' returned: 0 (OK)
stonith-ng[1485]: notice: Operation reboot of node2 by node1 for stonith_admin.6884@node1.671eb0a4: OK
crmd[1490]: notice: Peer node2 was terminated (reboot) by node1 on behalf of stonith_admin.6884: OK
Logs from node2:
stonith-ng[3038]: notice: fence_node1 can not fence (reboot) node2: static-list
stonith-ng[3038]: notice: fence_node2 can fence (reboot) node2 (aka. 'vm-node2'): static-list
stonith-ng[3038]: notice: Operation reboot of node2 by node1 for stonith_admin.6884@node1.671eb0a4: OK
crmd[3042]: crit: We were allegedly just fenced by node1 for node1!
cib[3037]: warning: new_event_notification (3037-3042-13): Broken pipe (32)
cib[3037]: warning: Notification of client crmd/0fff530d-0e73-42f0-bbac-c0100a4ab62b failed
pacemakerd[3036]: warning: The crmd process (3042) can no longer be respawned, shutting the cluster down.
lrmd[3039]: warning: new_event_notification (3039-3042-8): Bad file descriptor (9)
lrmd[3039]: warning: Could not notify client crmd/e36d493d-4cd9-4e96-946d-cf1165dbfe2c: Bad file descriptor
pacemakerd[3036]: notice: Shutting down Pacemaker
pacemakerd[3036]: notice: Stopping pengine
pengine[3041]: notice: Caught 'Terminated' signal
pacemakerd[3036]: notice: Stopping attrd
attrd[3040]: notice: Caught 'Terminated' signal
lrmd[3039]: notice: Caught 'Terminated' signal
pacemakerd[3036]: notice: Stopping lrmd
pacemakerd[3036]: notice: Stopping stonith-ng
stonith-ng[3038]: notice: Caught 'Terminated' signal
cib[3037]: warning: new_event_notification (3037-3038-11): Broken pipe (32)
cib[3037]: warning: Notification of client stonithd/7c8647e4-b100-4d2b-b9bf-b4c94fdc6e80 failed
cib[3037]: warning: new_event_notification (3037-3040-12): Broken pipe (32)
cib[3037]: warning: Notification of client attrd/0bcd95ee-116f-4423-8b89-666d97583822 failed
pacemakerd[3036]: notice: Stopping cib
cib[3037]: notice: Caught 'Terminated' signal
cib[3037]: notice: Disconnected from Corosync
cib[3037]: notice: Disconnected from Corosync
pacemakerd[3036]: notice: Shutdown complete
pacemakerd[3036]: notice: Attempting to inhibit respawning after fatal error
- The same behavior is seen during stonith operations issued by the cluster itself, but it cannot be reproduced when the operation is issued manually with the fence_vmware_soap or fence_vmware_rest command.
- The hypervisor may report either of the following messages in the logs of the virtual machine that was fenced:
Task: Reset virtual machine
or:
Task: Reconfigure virtual machine
- From the point of view of the rest of the nodes in the cluster, the node that was fenced appears as Pending.
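The symptoms listed above can be checked on the fenced node with a short script. The following is a minimal sketch, not part of the original article: the helper names fenced_but_alive and not_rebooted_since and the BOOT_ID_FILE variable are hypothetical. It assumes the Pacemaker messages are written to a plain-text log file (for example /var/log/messages, which depends on the local syslog configuration) and that the node is Linux, where /proc/sys/kernel/random/boot_id holds an identifier that changes on every boot.

```shell
#!/bin/sh
# Hypothetical helpers for confirming the symptom described in this article.

# File holding the kernel boot ID; it changes on every boot.
# Overridable so the helper can be exercised against a test file.
BOOT_ID_FILE=${BOOT_ID_FILE:-/proc/sys/kernel/random/boot_id}

# Succeeds (exit status 0) if the given log file contains the crmd message
# that is logged when a node learns it was fenced while still running.
fenced_but_alive() {
    # -q: no output, exit status only; -F: match the string literally
    grep -qF 'We were allegedly just fenced' "$1"
}

# Succeeds (exit status 0) if the boot ID saved before the fence operation
# still matches the current one, meaning the node has not rebooted.
not_rebooted_since() {
    [ "$1" = "$(cat "$BOOT_ID_FILE")" ]
}
```

One way to use it: on node2, save the output of cat /proc/sys/kernel/random/boot_id before running pcs stonith fence node2 from node1, then pass the saved value to not_rebooted_since and the log path to fenced_but_alive afterwards. If both succeed, the fence operation was reported as OK without the node actually restarting.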
Environment
Red Hat Enterprise Linux 7 with High-Availability or Resilient Storage Add-Ons
Pacemaker cluster
VMware hypervisor