fence_aws fence action fails with "Timed out waiting to power OFF" and "Unable to obtain correct plug status or plug is not available" when a node panics in a High Availability cluster
Issue
- When a Pacemaker cluster node on AWS experiences a kernel panic, or is deliberately crashed by running echo c > /proc/sysrq-trigger, the fence action against that node fails with "Timed out waiting to power OFF". When the action is retried, it fails repeatedly with "Unable to obtain correct plug status or plug is not available". Fencing the node manually with the pcs stonith fence command succeeds.
- This issue may occur intermittently.
Apr 2 22:18:04 ip-10-0-0-17 corosync[14259]: [TOTEM ] A processor failed, forming new configuration.
Apr 2 22:18:05 ip-10-0-0-17 corosync[14259]: [TOTEM ] A new membership (10.0.0.17:177) was formed. Members left: 2
Apr 2 22:18:05 ip-10-0-0-17 corosync[14259]: [TOTEM ] Failed to receive the leave message. failed: 2
Apr 2 22:18:05 ip-10-0-0-17 corosync[14259]: [CPG ] downlist left_list: 1 received
Apr 2 22:18:05 ip-10-0-0-17 corosync[14259]: [QUORUM] Members[1]: 1
Apr 2 22:18:05 ip-10-0-0-17 corosync[14259]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 2 22:18:05 ip-10-0-0-17 pacemakerd[14283]: notice: Node node2 state is now lost
...
Apr 2 22:18:06 ip-10-0-0-17 crmd[14289]: notice: Requesting fencing (reboot) of node node2
...
Apr 2 22:19:11 ip-10-0-0-17 fence_aws: Failed: Timed out waiting to power OFF
...
Apr 2 22:19:11 ip-10-0-0-17 stonith-ng[14285]: error: Operation 'reboot' [18866] (call 42 from crmd.14289) for host 'node2' with device 'aws_fence' returned: -62 (Timer expired)
...
Apr 2 22:19:11 ip-10-0-0-17 crmd[14289]: notice: Peer node2 was not terminated (reboot) by node1 on behalf of crmd.14289: Timer expired
...
Apr 2 22:19:11 ip-10-0-0-17 pengine[14288]: warning: Cluster node node2 will be fenced: peer is no longer part of the cluster
...
Apr 2 22:19:11 ip-10-0-0-17 crmd[14289]: notice: Requesting fencing (reboot) of node node2
...
Apr 2 22:19:12 ip-10-0-0-17 fence_aws: Failed: Unable to obtain correct plug status or plug is not available
...
Apr 2 22:19:14 ip-10-0-0-17 fence_aws: Failed: Unable to obtain correct plug status or plug is not available
...
Apr 2 22:19:14 ip-10-0-0-17 stonith-ng[14285]: error: Operation 'reboot' [19126] (call 43 from crmd.14289) for host 'node2' with device 'aws_fence' returned: -201 (Generic Pacemaker error)
Apr 2 22:19:14 ip-10-0-0-17 stonith-ng[14285]: notice: Couldn't find anyone to fence (reboot) node2 with any device
Apr 2 22:19:14 ip-10-0-0-17 stonith-ng[14285]: error: Operation reboot of node2 by <no-one> for crmd.14289@node1.5abdec11: No route to host
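To check whether a node's logs show the same sequence as the excerpt above (a power-off timeout followed by plug-status failures), a small script along these lines may help. This is only a sketch: the function names are hypothetical, the two message strings are copied from the log lines above, and the assumption that fence_aws always logs failures in the form "fence_aws: Failed: <message>" is based solely on this excerpt.

```python
import re

# Failure strings copied verbatim from the fence_aws log lines in this article.
TIMEOUT_MSG = "Timed out waiting to power OFF"
PLUG_MSG = "Unable to obtain correct plug status or plug is not available"

# Assumed line shape (taken from the excerpt): "... fence_aws: Failed: <message>"
FAILURE_RE = re.compile(r"fence_aws: Failed: (?P<msg>.+?)\s*$")


def summarize_fence_aws_failures(lines):
    """Count occurrences of the two known failure messages in the given log lines."""
    counts = {TIMEOUT_MSG: 0, PLUG_MSG: 0}
    for line in lines:
        match = FAILURE_RE.search(line)
        if match and match.group("msg") in counts:
            counts[match.group("msg")] += 1
    return counts


def matches_issue_pattern(lines):
    """Return True if a power-off timeout is later followed by a plug-status failure."""
    seen_timeout = False
    for line in lines:
        match = FAILURE_RE.search(line)
        if not match:
            continue
        if match.group("msg") == TIMEOUT_MSG:
            seen_timeout = True
        elif match.group("msg") == PLUG_MSG and seen_timeout:
            return True
    return False
```

For example, the functions could be given the lines of the system log on the surviving node, such as `matches_issue_pattern(open("/var/log/messages"))`. The log path is an assumption and depends on how syslog is configured on the node.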
Environment
- Red Hat Enterprise Linux 7, 8, 9, 10 (with the High Availability Add-On)
- Amazon Web Services (AWS) EC2 instances used as cluster nodes