A node shuts down pacemaker after being fenced and restarting corosync and pacemaker
Issue
- In theory, this issue can occur on any platform if the timing is unfortunate. However, it may be more likely to occur on Google Cloud Platform because of the way the fence_gce fence agent performs a reboot.
- General case: A node left the corosync membership due to token loss. After a stonith action was initiated against the node, it rejoined the corosync membership before it was rebooted. After rebooting and starting cluster services, the node received a "We were allegedly just fenced" message and shut down its pacemaker and corosync services.

    # In this example, token loss occurred at 01:27:57 due to a network issue, after the token timeout expired.
    # A new one-node membership reflecting token loss formed at 01:28:21, after the consensus timeout expired.
    # Node 1 initiated a stonith action against node 2.
    # Node 2 rejoined the corosync membership at 01:28:23, when the network issue was resolved.
    # A new two-node membership formed, with node 2 back in the CPG group.

    May 4 01:27:57 fastvm-rhel-8-0-23 corosync[1722]: [TOTEM ] A processor failed, forming new configuration.
    May 4 01:28:21 fastvm-rhel-8-0-23 corosync[1722]: [TOTEM ] A new membership (1.116f4) was formed. Members left: 2
    May 4 01:28:21 fastvm-rhel-8-0-23 corosync[1722]: [QUORUM] Members[1]: 1
    ...
    May 4 01:28:22 fastvm-rhel-8-0-23 pacemaker-schedulerd[1739]: warning: Cluster node node2 will be fenced: peer is no longer part of the cluster
    ...
    May 4 01:28:22 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: Delaying 'reboot' action targeting node2 using xvm2 for 20s
    May 4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]: [TOTEM ] A new membership (1.116f8) was formed. Members joined: 2
    May 4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]: [QUORUM] Members[2]: 1 2
    May 4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]: [MAIN ] Completed service synchronization, ready to provide service.

    # At 01:28:45, node 1 received confirmation that node 2 had been successfully rebooted.

    May 4 01:28:45 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: Operation 'reboot' [43895] (call 28 from pacemaker-controld.1740) targeting node2 using xvm2 returned 0 (OK)

    # In order to fully complete the stonith action, it needed to deliver the confirmation message to
    # all nodes in the CPG group. Node 2 was still in the CPG group from the rejoin at 01:28:23.
    # A new membership without node 2 had not yet been formed, because
    # (token timeout + consensus timeout) had not yet expired since the reboot.
    # So the message was not delivered until node 2 started cluster services after boot.
    # In receiving this message, node 2 received notification that it had been fenced.
    # So it shut itself down in response.

    # Node 1:
    May 4 01:29:01 fastvm-rhel-8-0-23 corosync[1722]: [TOTEM ] A processor failed, forming new configuration.
    May 4 01:29:09 fastvm-rhel-8-0-23 corosync[1722]: [TOTEM ] A new membership (1.116fc) was formed. Members joined: 2 left: 2
    May 4 01:29:09 fastvm-rhel-8-0-23 corosync[1722]: [QUORUM] Members[2]: 1 2
    May 4 01:29:09 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: Operation 'reboot' targeting node2 by node1 for pacemaker-controld.1740@node1: OK
    May 4 01:29:09 fastvm-rhel-8-0-23 pacemaker-controld[1740]: notice: Peer node2 was terminated (reboot) by node1 on behalf of pacemaker-controld.1740: OK
    ...
    May 4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]: [CFG ] Node 2 was shut down by sysadmin
    May 4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]: [TOTEM ] A new membership (1.11700) was formed. Members left: 2
    May 4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]: [QUORUM] Members[1]: 1
    May 4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]: [MAIN ] Completed service synchronization, ready to provide service.

    # Node 2:
    May 04 01:29:09 [1155] fastvm-rhel-8-0-24 corosync notice [TOTEM ] A new membership (1.116fc) was formed. Members joined: 1
    May 04 01:29:09 [1155] fastvm-rhel-8-0-24 corosync notice [QUORUM] Members[2]: 1 2
    May 04 01:29:09 fastvm-rhel-8-0-24 pacemaker-fenced [1319] (remote_op_done) notice: Operation 'reboot' targeting node2 by node1 for pacemaker-controld.1740@node1: OK | id=b69b57a1
    May 04 01:29:09 fastvm-rhel-8-0-24 pacemaker-controld [1323] (tengine_stonith_notify) crit: We were allegedly just fenced by node1 for node1!
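For context on the "(token timeout + consensus timeout)" window mentioned in the log comments above, the effective values can be read from corosync's runtime configuration on any cluster node. A minimal sketch, assuming the usual corosync cmap key names (exact keys and defaults can vary by corosync version):

    # Show the token and consensus timeouts corosync is actually using, in milliseconds.
    corosync-cmapctl | grep -E 'totem\.(token|consensus)'

    # A failed node is only removed from the membership roughly (token + consensus) ms
    # after it stops responding. If the fenced node reboots and rejoins within that
    # window, it is still part of the CPG group that must receive the fencing
    # confirmation, so the confirmation reaches it only after it restarts cluster
    # services, and it shuts itself down in response.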
- GCP case: A Google Compute Engine (GCE) virtual machine was fenced and rebooted by the fence_gce agent. It rejoined the cluster before the fence action completed. Shortly afterward, it shut down its pacemaker and corosync services and left the cluster.

    # In this example, node 2 was rebooted at 23:27:15.
    # It rejoined the cluster at 23:27:23.
    # Then at 23:28:12, the fence action was declared complete,
    # and node 2 shut down its cluster services.

    # Node 1
    Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]: notice: Client stonith_admin.1366.66468bec wants to fence (reboot) 'nwahl-rhel7-node2' with device '(any)'
    Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]: notice: Requesting peer fencing (reboot) of nwahl-rhel7-node2
    Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]: notice: gce_fence can fence (reboot) nwahl-rhel7-node2: static-list
    Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]: notice: gce_fence can fence (reboot) nwahl-rhel7-node2: static-list
    Dec 11 23:27:22 nwahl-rhel7-node1 corosync[990]: [TOTEM ] A processor failed, forming new configuration.
    Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [TOTEM ] A new membership (10.138.0.2:169) was formed. Members left: 2
    Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [TOTEM ] Failed to receive the leave message. failed: 2
    Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [CPG ] downlist left_list: 1 received
    Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[1]: 1
    Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [MAIN ] Completed service synchronization, ready to provide service.
    ...
    Dec 11 23:27:36 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[2]: 1 2
    ...
    Dec 11 23:28:12 nwahl-rhel7-node1 stonith-ng[1158]: notice: Operation 'reboot' [1367] (call 2 from stonith_admin.1366) for host 'nwahl-rhel7-node2' with device 'gce_fence' returned: 0 (OK)

    # Node 2
    Dec 11 23:26:44 nwahl-rhel7-node2 systemd: Started Session 1 of user nwahl.
    Dec 11 23:27:25 nwahl-rhel7-node2 journal: Runtime journal is using 8.0M (max allowed 365.8M, trying to leave 548.7M free of 3.5G available → current limit 365.8M).
    ...
    Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]: notice: Operation reboot of nwahl-rhel7-node2 by nwahl-rhel7-node1 for stonith_admin.1366@nwahl-rhel7-node1.c3382af8: OK
    Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]: error: stonith_construct_reply: Triggered assert at commands.c:2343 : request != NULL
    Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]: warning: Can't create a sane reply
    Dec 11 23:28:12 nwahl-rhel7-node2 crmd[1110]: crit: We were allegedly just fenced by nwahl-rhel7-node1 for nwahl-rhel7-node1!
    Dec 11 23:28:12 nwahl-rhel7-node2 pacemakerd[1055]: warning: Shutting cluster down because crmd[1110] had fatal failure
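For reference, a fence_gce device like the one in these logs is normally defined as a stonith resource. The following is only an illustrative sketch with placeholder project, zone, and node-to-instance names (it is not taken from the cluster above, and it does not by itself prevent the behavior described):

    # Hypothetical fence_gce stonith device; replace the project, zone, and
    # instance names with real values for your GCP environment.
    pcs stonith create gce_fence fence_gce \
        project=example-project zone=us-central1-a \
        pcmk_host_map="node1:instance-1;node2:instance-2"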
Environment
- Red Hat Enterprise Linux 7 (with the High Availability Add-on)
- Red Hat Enterprise Linux 8 (with the High Availability Add-on)
- Red Hat Enterprise Linux 9 (with the High Availability Add-on)
- Google Cloud Platform (optional)