A node shuts down pacemaker after being fenced and restarting corosync and pacemaker

Solution In Progress - Updated -

Issue

  • In theory, this issue can occur on any platform if the timing is unfortunate. However, it may be more likely to occur on Google Cloud Platform because of the way the fence_gce fence agent performs reboots.

    • General case: A node left the corosync membership because the token was lost. After a stonith action was initiated against this node, it rejoined the corosync membership before it was rebooted. After the node rebooted and started cluster services, it received a "We were allegedly just fenced" message and shut down its pacemaker and corosync services. (The token and consensus timeouts behind this timing are sketched after the log excerpts below.)

      # # In this example, token loss occurred at 01:27:57 due to a network issue, after the token timeout expired.
      # # A new one-node membership reflecting token loss formed at 01:28:21, after the consensus timeout expired.
      # # Node 1 initiated a stonith action against node 2.
      # # Node 2 rejoined the corosync membership at 01:28:23, when the network issue was resolved.
      # # A new two-node membership formed, with node 2 back in the CPG group.
      May  4 01:27:57 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A processor failed, forming new configuration.
      May  4 01:28:21 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A new membership (1.116f4) was formed. Members left: 2
      May  4 01:28:21 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[1]: 1
      ...
      May  4 01:28:22 fastvm-rhel-8-0-23 pacemaker-schedulerd[1739]: warning: Cluster node node2 will be fenced: peer is no longer part of the cluster
      ...
      May  4 01:28:22 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: Delaying 'reboot' action targeting node2 using xvm2 for 20s
      May  4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A new membership (1.116f8) was formed. Members joined: 2
      May  4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[2]: 1 2
      May  4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]:  [MAIN  ] Completed service synchronization, ready to provide service.
      
      # # At 01:28:45, node 1 received confirmation that node 2 had been successfully rebooted.
      May  4 01:28:45 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: Operation 'reboot' [43895] (call 28 from pacemaker-controld.1740) targeting node2 using xvm2 returned 0 (OK)
      
      # # In order to fully complete the stonith action, it needed to deliver the confirmation message to
      # # all nodes in the CPG group. Node 2 was still in the CPG group from the rejoin at 01:28:23.
      # # A new membership without node 2 had not yet been formed, because
      # # (token timeout + consensus timeout) had not yet expired since the reboot.
      # # So the message was not delivered until node 2 started cluster services after boot.
      # # In receiving this message, node 2 received notification that it had been fenced.
      # # So it shut itself down in response.
      # # Node 1:
      May  4 01:29:01 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A processor failed, forming new configuration.
      May  4 01:29:09 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A new membership (1.116fc) was formed. Members joined: 2 left: 2
      May  4 01:29:09 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[2]: 1 2
      May  4 01:29:09 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: Operation 'reboot' targeting node2 by node1 for pacemaker-controld.1740@node1: OK
      May  4 01:29:09 fastvm-rhel-8-0-23 pacemaker-controld[1740]: notice: Peer node2 was terminated (reboot) by node1 on behalf of pacemaker-controld.1740: OK
      ...
      May  4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]:  [CFG   ] Node 2 was shut down by sysadmin
      May  4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A new membership (1.11700) was formed. Members left: 2
      May  4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[1]: 1
      May  4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]:  [MAIN  ] Completed service synchronization, ready to provide service.
      
      # # Node 2:
      May 04 01:29:09 [1155] fastvm-rhel-8-0-24 corosync notice  [TOTEM ] A new membership (1.116fc) was formed. Members joined: 1
      May 04 01:29:09 [1155] fastvm-rhel-8-0-24 corosync notice  [QUORUM] Members[2]: 1 2
      May 04 01:29:09 fastvm-rhel-8-0-24 pacemaker-fenced    [1319] (remote_op_done)  notice: Operation 'reboot' targeting node2 by node1 for pacemaker-controld.1740@node1: OK | id=b69b57a1
      May 04 01:29:09 fastvm-rhel-8-0-24 pacemaker-controld  [1323] (tengine_stonith_notify)  crit: We were allegedly just fenced by node1 for node1!
      
    • GCP case: A Google Compute Engine (GCE) virtual machine was fenced by the fence_gce agent and rebooted. It rejoined the cluster before the fence action completed. Shortly afterward, it shut down its pacemaker and corosync services and left the cluster. (A rough illustration of this race window follows the log excerpts below.)

      # # In this example, node 2 was rebooted at 23:27:15.
      # # It rejoined the cluster at 23:27:23.
      # # Then at 23:28:12, the fence action was declared complete,
      # # and node 2 shut down its cluster services.
      
      # # Node 1
      Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Client stonith_admin.1366.66468bec wants to fence (reboot) 'nwahl-rhel7-node2' with device '(any)'
      Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Requesting peer fencing (reboot) of nwahl-rhel7-node2
      Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: gce_fence can fence (reboot) nwahl-rhel7-node2: static-list
      Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: gce_fence can fence (reboot) nwahl-rhel7-node2: static-list
      Dec 11 23:27:22 nwahl-rhel7-node1 corosync[990]: [TOTEM ] A processor failed, forming new configuration.
      Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [TOTEM ] A new membership (10.138.0.2:169) was formed. Members left: 2
      Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [TOTEM ] Failed to receive the leave message. failed: 2
      Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [CPG   ] downlist left_list: 1 received
      Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[1]: 1
      Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [MAIN  ] Completed service synchronization, ready to provide service.
      ...
      Dec 11 23:27:36 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[2]: 1 2
      ...
      Dec 11 23:28:12 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Operation 'reboot' [1367] (call 2 from stonith_admin.1366) for host 'nwahl-rhel7-node2' with device 'gce_fence' returned: 0 (OK)
      
      # # Node 2
      Dec 11 23:26:44 nwahl-rhel7-node2 systemd: Started Session 1 of user nwahl.
      Dec 11 23:27:25 nwahl-rhel7-node2 journal: Runtime journal is using 8.0M (max allowed 365.8M, trying to leave 548.7M free of 3.5G available → current limit 365.8M).
      ...
      Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]:  notice: Operation reboot of nwahl-rhel7-node2 by nwahl-rhel7-node1 for stonith_admin.1366@nwahl-rhel7-node1.c3382af8: OK
      Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]:   error: stonith_construct_reply: Triggered assert at commands.c:2343 : request != NULL
      Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]: warning: Can't create a sane reply
      Dec 11 23:28:12 nwahl-rhel7-node2 crmd[1110]:    crit: We were allegedly just fenced by nwahl-rhel7-node1 for nwahl-rhel7-node1!
      Dec 11 23:28:12 nwahl-rhel7-node2 pacemakerd[1055]: warning: Shutting cluster down because crmd[1110] had fatal failure
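
The "(token timeout + consensus timeout)" interval referenced in the comments above comes from the totem section of /etc/corosync/corosync.conf. The snippet below is only a sketch with example values, not the configuration used in the logs above; on a running node the effective values can be read back with corosync-cmapctl.

      # Illustrative totem settings in /etc/corosync/corosync.conf (example values only)
      totem {
          version: 2
          # Milliseconds without a token before corosync logs
          # "A processor failed, forming new configuration."
          token: 3000
          # Milliseconds to reach consensus before a new membership is formed.
          # If unset, corosync derives this from the token timeout.
          consensus: 3600
      }

      # Check the effective runtime values on a running node
      # (exact key names can vary between corosync versions):
      corosync-cmapctl | grep -E 'totem\.(token|consensus)'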
      

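A rough way to see the race in the GCP case is to compare how long the fence agent takes to report completion with how long the fenced VM takes to reboot and rejoin the membership. The numbers below are read from the GCP example above (fence requested at 23:27:15, node 2 back in the membership at 23:27:36, fence declared complete at 23:28:12); the check itself is purely illustrative and not part of any supported tooling.

      # Illustrative only -- substitute timings observed in your own logs.
      FENCE_CONFIRM_SECONDS=57   # 23:27:15 -> 23:28:12: fence_gce reports the reboot as complete
      NODE_REJOIN_SECONDS=21     # 23:27:15 -> 23:27:36: rebooted VM is back in the corosync membership

      if [ "$NODE_REJOIN_SECONDS" -lt "$FENCE_CONFIRM_SECONDS" ]; then
          echo "The fenced node rejoins before the fence confirmation is delivered,"
          echo "so it receives the confirmation itself ('We were allegedly just fenced')"
          echo "and shuts down its cluster services."
      fi
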
Environment

  • Red Hat Enterprise Linux 7 (with the High Availability Add-On)
  • Red Hat Enterprise Linux 8 (with the High Availability Add-On)
  • Red Hat Enterprise Linux 9 (with the High Availability Add-On)
  • Google Cloud Platform (optional)
