fence_aws fence action fails with "Timed out waiting to power OFF" and then "Unable to obtain correct plug status or plug is not available" when a node is panicked in a High Availability cluster
Issue
- When an AWS Pacemaker cluster node experiences a kernel panic or is crashed by running echo c > /proc/sysrq-trigger, the fence action against it fails with "Timed out waiting to power OFF". When the action is retried, it fails repeatedly with "Unable to obtain correct plug status or plug is not available".
- Fencing the node manually with pcs stonith fence succeeds.
- The issue may be intermittent.
Apr 2 22:18:04 ip-10-0-0-17 corosync[14259]: [TOTEM ] A processor failed, forming new configuration.
Apr 2 22:18:05 ip-10-0-0-17 corosync[14259]: [TOTEM ] A new membership (10.0.0.17:177) was formed. Members left: 2
Apr 2 22:18:05 ip-10-0-0-17 corosync[14259]: [TOTEM ] Failed to receive the leave message. failed: 2
Apr 2 22:18:05 ip-10-0-0-17 corosync[14259]: [CPG ] downlist left_list: 1 received
Apr 2 22:18:05 ip-10-0-0-17 corosync[14259]: [QUORUM] Members[1]: 1
Apr 2 22:18:05 ip-10-0-0-17 corosync[14259]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 2 22:18:05 ip-10-0-0-17 pacemakerd[14283]: notice: Node node2 state is now lost
...
Apr 2 22:18:06 ip-10-0-0-17 crmd[14289]: notice: Requesting fencing (reboot) of node node2
...
Apr 2 22:19:11 ip-10-0-0-17 fence_aws: Failed: Timed out waiting to power OFF
...
Apr 2 22:19:11 ip-10-0-0-17 stonith-ng[14285]: error: Operation 'reboot' [18866] (call 42 from crmd.14289) for host 'node2' with device 'aws_fence' returned: -62 (Timer expired)
...
Apr 2 22:19:11 ip-10-0-0-17 crmd[14289]: notice: Peer node2 was not terminated (reboot) by node1 on behalf of crmd.14289: Timer expired
...
Apr 2 22:19:11 ip-10-0-0-17 pengine[14288]: warning: Cluster node node2 will be fenced: peer is no longer part of the cluster
...
Apr 2 22:19:11 ip-10-0-0-17 crmd[14289]: notice: Requesting fencing (reboot) of node node2
...
Apr 2 22:19:12 ip-10-0-0-17 fence_aws: Failed: Unable to obtain correct plug status or plug is not available
...
Apr 2 22:19:14 ip-10-0-0-17 fence_aws: Failed: Unable to obtain correct plug status or plug is not available
...
Apr 2 22:19:14 ip-10-0-0-17 stonith-ng[14285]: error: Operation 'reboot' [19126] (call 43 from crmd.14289) for host 'node2' with device 'aws_fence' returned: -201 (Generic Pacemaker error)
Apr 2 22:19:14 ip-10-0-0-17 stonith-ng[14285]: notice: Couldn't find anyone to fence (reboot) node2 with any device
Apr 2 22:19:14 ip-10-0-0-17 stonith-ng[14285]: error: Operation reboot of node2 by <no-one> for crmd.14289@node1.5abdec11: No route to host
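To check whether a surviving node has logged the same two failure signatures shown in the excerpt above, a small helper along the following lines may be useful. This is a minimal sketch, not part of the original article: the function name summarize_fence_aws_failures is hypothetical, and it assumes the fence_aws messages appear in the log text exactly as quoted above.

```shell
# Hypothetical helper (name is illustrative, not from the article).
# Reads log text on stdin and counts the two fence_aws failure
# signatures quoted in the excerpt above.
# Prints one line: "timeout=<count> plug_status=<count>".
summarize_fence_aws_failures() {
    awk '
        # First failure: the power-off request timed out.
        /fence_aws: Failed: Timed out waiting to power OFF/ { timeout++ }
        # Retries: the agent could not read the plug (instance) status.
        /fence_aws: Failed: Unable to obtain correct plug status or plug is not available/ { plug++ }
        # Unset awk variables format as 0 with %d.
        END { printf "timeout=%d plug_status=%d\n", timeout, plug }
    '
}
```

The function could be fed the system log on the node that attempted the fencing, for example summarize_fence_aws_failures < /var/log/messages (the log path is an assumption and depends on the local syslog configuration). A nonzero timeout count followed by nonzero plug_status counts would match the pattern described in this article.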
Environment
- Red Hat Enterprise Linux 7 or 8 (with the High Availability Add-on)
- Amazon Web Services (AWS) EC2 instances as cluster nodes