A node shuts down Pacemaker after getting fenced by fence_gce and rejoining the cluster on Google Cloud Platform

Solution In Progress - Updated

Issue

  • A Google Compute Engine (GCE) VM was fenced (rebooted) by the fence_gce agent. It rebooted and rejoined the cluster before the fence action was reported complete. When the fence result was delivered, the node shut down its pacemaker and corosync services and left the cluster.
# In this example, node 2 was rebooted at 23:27:15.
# It rejoined the cluster at 23:27:23.
# Then at 23:28:12, the fence action was declared complete,
# and node 2 shut down its cluster services.

# Node 1
Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Client stonith_admin.1366.66468bec wants to fence (reboot) 'nwahl-rhel7-node2' with device '(any)'
Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Requesting peer fencing (reboot) of nwahl-rhel7-node2
Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: gce_fence can fence (reboot) nwahl-rhel7-node2: static-list
Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: gce_fence can fence (reboot) nwahl-rhel7-node2: static-list
Dec 11 23:27:22 nwahl-rhel7-node1 corosync[990]: [TOTEM ] A processor failed, forming new configuration.
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [TOTEM ] A new membership (10.138.0.2:169) was formed. Members left: 2
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [TOTEM ] Failed to receive the leave message. failed: 2
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [CPG   ] downlist left_list: 1 received
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[1]: 1
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [MAIN  ] Completed service synchronization, ready to provide service.
...
Dec 11 23:27:36 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[2]: 1 2
...
Dec 11 23:28:12 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Operation 'reboot' [1367] (call 2 from stonith_admin.1366) for host 'nwahl-rhel7-node2' with device 'gce_fence' returned: 0 (OK)


# Node 2
Dec 11 23:26:44 nwahl-rhel7-node2 systemd: Started Session 1 of user nwahl.
Dec 11 23:27:25 nwahl-rhel7-node2 journal: Runtime journal is using 8.0M (max allowed 365.8M, trying to leave 548.7M free of 3.5G available → current limit 365.8M).
...
Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]:  notice: Operation reboot of nwahl-rhel7-node2 by nwahl-rhel7-node1 for stonith_admin.1366@nwahl-rhel7-node1.c3382af8: OK
Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]:   error: stonith_construct_reply: Triggered assert at commands.c:2343 : request != NULL
Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]: warning: Can't create a sane reply
Dec 11 23:28:12 nwahl-rhel7-node2 crmd[1110]:    crit: We were allegedly just fenced by nwahl-rhel7-node1 for nwahl-rhel7-node1!
Dec 11 23:28:12 nwahl-rhel7-node2 pacemakerd[1055]: warning: Shutting cluster down because crmd[1110] had fatal failure
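
For reference, the logs above show a single stonith device named 'gce_fence' that uses the fence_gce agent with a static host list, and a reboot fence requested by a stonith_admin client (stonith_admin.1366 on node 1). The commands below are a minimal sketch of such a setup; the project, zone, and host-map values are placeholders and are not taken from the logs, and the exact command that triggered the fence is not shown in the logs.

# Hypothetical fence_gce device definition (placeholder project, zone, and host map)
pcs stonith create gce_fence fence_gce \
    project=example-project zone=us-west1-b \
    pcmk_host_map="nwahl-rhel7-node1:nwahl-rhel7-node1;nwahl-rhel7-node2:nwahl-rhel7-node2"

# Manually request a reboot fence of node 2, as the stonith_admin client in the
# Node 1 log at 23:27:15 appears to have done
stonith_admin --reboot nwahl-rhel7-node2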

Environment

  • Red Hat Enterprise Linux 7 (with the High Availability Add-On)
  • Red Hat Enterprise Linux 8 (with the High Availability Add-On)
  • Google Cloud Platform
