A node shuts down pacemaker after getting fenced and restarting corosync and pacemaker

Solution Verified

Issue

  • In theory, this issue can happen on any platform if timing is unlucky, though it may be more likely on Google Cloud Platform due to the way the fence_gce fence agent performs a reboot.

    • Generic case: A node left the corosync membership due to token loss. After a stonith action against the node was initiated and before the node was rebooted, the node rejoined the corosync membership. After the node rebooted and started cluster services, it received a "We were allegedly just fenced" message and shut down its pacemaker and corosync services.

      # # In this example, token loss occurred at 01:27:57 due to a network issue, after the token timeout expired.
      # # A new one-node membership reflecting token loss formed at 01:28:21, after the consensus timeout expired.
      # # Node 1 initiated a stonith action against node 2.
      # # Node 2 rejoined the corosync membership at 01:28:23, when the network issue was resolved.
      # # A new two-node membership formed, with node 2 back in the CPG group.
      May  4 01:27:57 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A processor failed, forming new configuration.
      May  4 01:28:21 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A new membership (1.116f4) was formed. Members left: 2
      May  4 01:28:21 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[1]: 1
      ...
      May  4 01:28:22 fastvm-rhel-8-0-23 pacemaker-schedulerd[1739]: warning: Cluster node node2 will be fenced: peer is no longer part of the cluster
      ...
      May  4 01:28:22 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: Delaying 'reboot' action targeting node2 using xvm2 for 20s
      May  4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A new membership (1.116f8) was formed. Members joined: 2
      May  4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[2]: 1 2
      May  4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]:  [MAIN  ] Completed service synchronization, ready to provide service.
      
      # # At 01:28:45, node 1 received confirmation that node 2 had been successfully rebooted.
      May  4 01:28:45 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: Operation 'reboot' [43895] (call 28 from pacemaker-controld.1740) targeting node2 using xvm2 returned 0 (OK)
      
      # # To fully complete the stonith action, node 1 needed to deliver the confirmation message to
      # # all nodes in the CPG group. Node 2 was still in the CPG group from the rejoin at 01:28:23.
      # # A new membership without node 2 had not yet formed, because
      # # (token timeout + consensus timeout) had not yet expired since the reboot.
      # # So the message was not delivered until node 2 started cluster services after boot.
      # # On receiving this message, node 2 was notified that it had been fenced,
      # # so it shut itself down in response.
      # # Node 1:
      May  4 01:29:01 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A processor failed, forming new configuration.
      May  4 01:29:09 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A new membership (1.116fc) was formed. Members joined: 2 left: 2
      May  4 01:29:09 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[2]: 1 2
      May  4 01:29:09 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: Operation 'reboot' targeting node2 by node1 for pacemaker-controld.1740@node1: OK
      May  4 01:29:09 fastvm-rhel-8-0-23 pacemaker-controld[1740]: notice: Peer node2 was terminated (reboot) by node1 on behalf of pacemaker-controld.1740: OK
      ...
      May  4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]:  [CFG   ] Node 2 was shut down by sysadmin
      May  4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A new membership (1.11700) was formed. Members left: 2
      May  4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[1]: 1
      May  4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]:  [MAIN  ] Completed service synchronization, ready to provide service.
      
      # # Node 2:
      May 04 01:29:09 [1155] fastvm-rhel-8-0-24 corosync notice  [TOTEM ] A new membership (1.116fc) was formed. Members joined: 1
      May 04 01:29:09 [1155] fastvm-rhel-8-0-24 corosync notice  [QUORUM] Members[2]: 1 2
      May 04 01:29:09 fastvm-rhel-8-0-24 pacemaker-fenced    [1319] (remote_op_done)  notice: Operation 'reboot' targeting node2 by node1 for pacemaker-controld.1740@node1: OK | id=b69b57a1
      May 04 01:29:09 fastvm-rhel-8-0-24 pacemaker-controld  [1323] (tengine_stonith_notify)  crit: We were allegedly just fenced by node1 for node1!
      
    • GCP case: A Google Compute Engine (GCE) VM got fenced by the fence_gce agent and rebooted. It rejoined the cluster before the fence action completed. Shortly thereafter, it shut down its pacemaker and corosync services and left the cluster.

      # # In this example, the reboot of node 2 was requested at 23:27:15.
      # # Node 2 left the corosync membership at 23:27:23 and rejoined at 23:27:36.
      # # Then at 23:28:12, the fence action was declared complete,
      # # and node 2 shut down its cluster services.
      
      # # Node 1
      Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Client stonith_admin.1366.66468bec wants to fence (reboot) 'nwahl-rhel7-node2' with device '(any)'
      Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Requesting peer fencing (reboot) of nwahl-rhel7-node2
      Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: gce_fence can fence (reboot) nwahl-rhel7-node2: static-list
      Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: gce_fence can fence (reboot) nwahl-rhel7-node2: static-list
      Dec 11 23:27:22 nwahl-rhel7-node1 corosync[990]: [TOTEM ] A processor failed, forming new configuration.
      Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [TOTEM ] A new membership (10.138.0.2:169) was formed. Members left: 2
      Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [TOTEM ] Failed to receive the leave message. failed: 2
      Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [CPG   ] downlist left_list: 1 received
      Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[1]: 1
      Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [MAIN  ] Completed service synchronization, ready to provide service.
      ...
      Dec 11 23:27:36 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[2]: 1 2
      ...
      Dec 11 23:28:12 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Operation 'reboot' [1367] (call 2 from stonith_admin.1366) for host 'nwahl-rhel7-node2' with device 'gce_fence' returned: 0 (OK)
      
      # # Node 2
      Dec 11 23:26:44 nwahl-rhel7-node2 systemd: Started Session 1 of user nwahl.
      Dec 11 23:27:25 nwahl-rhel7-node2 journal: Runtime journal is using 8.0M (max allowed 365.8M, trying to leave 548.7M free of 3.5G available → current limit 365.8M).
      ...
      Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]:  notice: Operation reboot of nwahl-rhel7-node2 by nwahl-rhel7-node1 for stonith_admin.1366@nwahl-rhel7-node1.c3382af8: OK
      Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]:   error: stonith_construct_reply: Triggered assert at commands.c:2343 : request != NULL
      Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]: warning: Can't create a sane reply
      Dec 11 23:28:12 nwahl-rhel7-node2 crmd[1110]:    crit: We were allegedly just fenced by nwahl-rhel7-node1 for nwahl-rhel7-node1!
      Dec 11 23:28:12 nwahl-rhel7-node2 pacemakerd[1055]: warning: Shutting cluster down because crmd[1110] had fatal failure
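
Both cases above share the same log signature: on the fencing node, the target drops out of the corosync membership, reappears in it, and only then does the reboot operation return OK; on the fenced node, the "We were allegedly just fenced" message follows. The following is a minimal sketch for spotting that signature in saved log files. The function names are hypothetical, and the regular expressions are assumptions derived only from the excerpts shown above, so other pacemaker or corosync versions may word these messages differently.

```python
import re

# Patterns are assumptions based solely on the log excerpts in this article.
QUORUM_RE = re.compile(r"\[QUORUM\] Members\[\d+\]: ([\d ]+)")
RESULT_RE = re.compile(r"Operation 'reboot' \[\d+\].*returned:? 0 \(OK\)")
FENCED_RE = re.compile(r"We were allegedly just fenced by (\S+) for (\S+)!")


def rejoined_before_fence_result(lines, node_id):
    """Scan the fencing node's log in order.

    Return True if the target's corosync node ID left the membership,
    rejoined it, and a reboot operation then returned OK.
    """
    left = False      # target has dropped out of the membership
    rejoined = False  # target came back after dropping out
    for line in lines:
        match = QUORUM_RE.search(line)
        if match:
            members = match.group(1).split()
            if node_id not in members:
                left, rejoined = True, False
            elif left:
                rejoined = True
        elif rejoined and RESULT_RE.search(line):
            return True
    return False


def find_self_fence_notices(lines):
    """Return the fencing node name from each 'allegedly just fenced' line."""
    return [m.group(1) for m in map(FENCED_RE.search, lines) if m]


# Sample input: lines copied from the generic-case excerpt above.
node1_log = [
    "May  4 01:28:21 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[1]: 1",
    "May  4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[2]: 1 2",
    "May  4 01:28:45 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: "
    "Operation 'reboot' [43895] (call 28 from pacemaker-controld.1740) "
    "targeting node2 using xvm2 returned 0 (OK)",
]
node2_log = [
    "May 04 01:29:09 fastvm-rhel-8-0-24 pacemaker-controld  [1323] "
    "(tengine_stonith_notify)  crit: We were allegedly just fenced by node1 for node1!",
]

print(rejoined_before_fence_result(node1_log, "2"))
print(find_self_fence_notices(node2_log))
```

The check relies only on the order of lines, not on timestamps, so it works on both the RHEL 7 and RHEL 8 log formats shown here without date parsing. It expects the log of the node that performed the fencing for the first function and the log of the fenced node for the second.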
      

Environment

  • Red Hat Enterprise Linux 7 (with the High Availability Add-on)
  • Red Hat Enterprise Linux 8 (with the High Availability Add-on)
  • Google Cloud Platform (optional)
