fence_aws fence action fails with "Timed out waiting to power OFF" and then "Unable to obtain correct plug status or plug is not available" when a node is panicked in a High Availability cluster

Solution In Progress - Updated -

Issue

  • When a node in a Pacemaker cluster running on AWS panics, or is deliberately crashed with echo c > /proc/sysrq-trigger, the fence action against it fails with "Timed out waiting to power OFF". Each retry of the action then fails with "Unable to obtain correct plug status or plug is not available".
  • Fencing the node manually with pcs stonith fence succeeds.
  • The issue may be intermittent.
Apr  2 22:18:04 ip-10-0-0-17 corosync[14259]: [TOTEM ] A processor failed, forming new configuration.
Apr  2 22:18:05 ip-10-0-0-17 corosync[14259]: [TOTEM ] A new membership (10.0.0.17:177) was formed. Members left: 2
Apr  2 22:18:05 ip-10-0-0-17 corosync[14259]: [TOTEM ] Failed to receive the leave message. failed: 2
Apr  2 22:18:05 ip-10-0-0-17 corosync[14259]: [CPG   ] downlist left_list: 1 received
Apr  2 22:18:05 ip-10-0-0-17 corosync[14259]: [QUORUM] Members[1]: 1
Apr  2 22:18:05 ip-10-0-0-17 corosync[14259]: [MAIN  ] Completed service synchronization, ready to provide service.
Apr  2 22:18:05 ip-10-0-0-17 pacemakerd[14283]:  notice: Node node2 state is now lost
...
Apr  2 22:18:06 ip-10-0-0-17 crmd[14289]:  notice: Requesting fencing (reboot) of node node2
...
Apr  2 22:19:11 ip-10-0-0-17 fence_aws: Failed: Timed out waiting to power OFF
...
Apr  2 22:19:11 ip-10-0-0-17 stonith-ng[14285]:   error: Operation 'reboot' [18866] (call 42 from crmd.14289) for host 'node2' with device 'aws_fence' returned: -62 (Timer expired)
...
Apr  2 22:19:11 ip-10-0-0-17 crmd[14289]:  notice: Peer node2 was not terminated (reboot) by node1 on behalf of crmd.14289: Timer expired
...
Apr  2 22:19:11 ip-10-0-0-17 pengine[14288]: warning: Cluster node node2 will be fenced: peer is no longer part of the cluster
...
Apr  2 22:19:11 ip-10-0-0-17 crmd[14289]:  notice: Requesting fencing (reboot) of node node2
...
Apr  2 22:19:12 ip-10-0-0-17 fence_aws: Failed: Unable to obtain correct plug status or plug is not available
...
Apr  2 22:19:14 ip-10-0-0-17 fence_aws: Failed: Unable to obtain correct plug status or plug is not available
...
Apr  2 22:19:14 ip-10-0-0-17 stonith-ng[14285]:   error: Operation 'reboot' [19126] (call 43 from crmd.14289) for host 'node2' with device 'aws_fence' returned: -201 (Generic Pacemaker error)
Apr  2 22:19:14 ip-10-0-0-17 stonith-ng[14285]:  notice: Couldn't find anyone to fence (reboot) node2 with any device
Apr  2 22:19:14 ip-10-0-0-17 stonith-ng[14285]:   error: Operation reboot of node2 by <no-one> for crmd.14289@node1.5abdec11: No route to host
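
To check whether a cluster shows the same sequence, it can help to filter the system log down to the fence_aws failures and the stonith-ng errors quoted above. The following is a minimal sketch, not part of the original solution: the function name fence_failures is hypothetical, and the default log path /var/log/messages is an assumption that may not match your logging configuration.

```shell
# Hypothetical helper: print fence_aws failures and stonith-ng errors
# from a syslog-format file, matching the message shapes shown above.
# The default path /var/log/messages is an assumption; pass another
# file as the first argument if your logs live elsewhere.
fence_failures() {
    grep -E 'fence_aws: Failed:|stonith-ng\[[0-9]+\]: +error:' "${1:-/var/log/messages}"
}
```

Running fence_failures on the surviving node (or fence_failures /path/to/saved.log on a copy of the log) should list the "Timed out waiting to power OFF" and "Unable to obtain correct plug status" lines alongside the corresponding stonith-ng errors, if they are present.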

Environment

  • Red Hat Enterprise Linux 7 or 8 (with the High Availability Add-on)
  • Amazon Web Services (AWS) EC2 instances as cluster nodes
