Multiple nodes get fenced when one node rejoins the corosync membership in a Pacemaker cluster

Issue

  • I started the corosync and pacemaker services on one cluster node. Shortly thereafter, multiple other nodes began leaving and rejoining, forming new corosync memberships. This caused pacemaker failures and eventually further fence events. The issue is difficult to describe succinctly, so an annotated example sequence of events follows.
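
For context, "started the services" here means the standard RHEL 7 cluster start; either of the following forms is typical (shown for reference, not taken from this case):

# pcs cluster start
# # ...or, starting the daemons directly:
# systemctl start corosync pacemaker
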
# # Node list:
# #  1: node1
# #  2: node2
# #  3: node3
# #  4: node4
# #  5: node5
# #  6: node6
# #  7: node7

# # node5 was rebooted and rejoined the cluster at 13:11.
Mar  1 13:11:32 node5 corosync[1661]: [QUORUM] Members[1]: 5
Mar  1 13:11:32 node5 corosync[1661]: [MAIN  ] Completed service synchronization, ready to provide service.

# # Pacemaker hit a "Timer expired" error shortly after the DC timeout.
Mar  1 13:11:55 node5 crmd[2107]: warning: Input I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
Mar  1 13:12:04 node5 crmd[2107]:   error: Node update 10 failed: Timer expired (-62)
Mar  1 13:12:04 node5 crmd[2107]:   error: Input I_ERROR received in state S_ELECTION from node_list_update_callback
Mar  1 13:12:04 node5 crmd[2107]:  notice: State transition S_ELECTION -> S_RECOVERY
Mar  1 13:12:04 node5 crmd[2107]: warning: Fast-tracking shutdown in response to errors
Mar  1 13:12:04 node5 crmd[2107]:   error: Input I_TERMINATE received in state S_RECOVERY from do_recover
Mar  1 13:12:04 node5 crmd[2107]:  notice: Disconnected from the LRM

# # At 13:12:49, pacemaker disconnected from corosync, threw some more errors, and then respawned.
# # This node still did not have quorum.
Mar  1 13:12:49 node5 corosync[1661]: [TOTEM ] A new membership (10.1.50.253:74214) was formed. Members
Mar  1 13:12:49 node5 crmd[2107]:  notice: Disconnected from Corosync
Mar  1 13:12:49 node5 corosync[1661]: [CPG   ] downlist left_list: 0 received
Mar  1 13:12:49 node5 corosync[1661]: [QUORUM] Members[1]: 5
Mar  1 13:12:49 node5 corosync[1661]: [MAIN  ] Completed service synchronization, ready to provide service.
Mar  1 13:12:49 node5 attrd[2105]:  notice: Recorded local node as attribute writer (was unset)
Mar  1 13:12:49 node5 cib[2102]: warning: new_event_notification (/dev/shm/qb-2102-2107-11-mAzzus/qb): Broken pipe (32)
Mar  1 13:12:49 node5 cib[2102]: warning: A-Sync reply to crmd failed: No message of desired type
Mar  1 13:12:49 node5 crmd[2107]:  notice: Disconnected from the CIB
Mar  1 13:12:49 node5 crmd[2107]:   error: Could not recover from internal error
Mar  1 13:12:49 node5 pacemakerd[2073]:   error: crmd[2107] exited with status 201 (Generic Pacemaker error)
Mar  1 13:12:49 node5 pacemakerd[2073]:  notice: Respawning failed child process: crmd
Mar  1 13:12:49 node5 crmd[13594]:  notice: Additional logging available in /var/log/cluster/corosync.log
Mar  1 13:12:49 node5 crmd[13594]:  notice: Connecting to cluster infrastructure: corosync
Mar  1 13:12:49 node5 crmd[13594]: warning: Quorum lost
...
Mar  1 13:13:11 node5 crmd[13594]:  notice: node5 was successfully unfenced by node5 (at the request of node5)
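
# # (The fence/unfence events above can be reviewed afterwards from any node with
# # stonith_admin, pacemaker's fencing CLI; the node name is taken from this example.)
# stonith_admin --history node5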

# # At 13:14:49, node5 finally formed a full, 7-node membership.
Mar  1 13:14:49 node5 corosync[1661]: [TOTEM ] A new membership (10.1.50.211:74218) was formed. Members joined: 3 4 1 6 2 7
Mar  1 13:14:49 node5 corosync[1661]: [CPG   ] downlist left_list: 4 received
... Skipping some ...
Mar  1 13:14:49 node5 corosync[1661]: [CPG   ] downlist left_list: 3 received
Mar  1 13:14:49 node5 stonith-ng[2103]:  notice: Node node7 state is now member
Mar  1 13:14:49 node5 corosync[1661]: [QUORUM] This node is within the primary component and will provide service.
Mar  1 13:14:49 node5 corosync[1661]: [QUORUM] Members[7]: 3 4 1 6 2 5 7
Mar  1 13:14:49 node5 corosync[1661]: [MAIN  ] Completed service synchronization, ready to provide service.
Mar  1 13:14:49 node5 crmd[13594]:  notice: Quorum acquired
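
# # (Once a full membership forms, the quorum state can be confirmed on any member
# # with corosync-quorumtool, e.g.:)
# corosync-quorumtool -s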

# # Two other nodes (6 and 7) immediately dropped out again, without a clean leave message.
Mar  1 13:14:49 node5 corosync[1661]: [TOTEM ] A new membership (10.1.50.211:74227) was formed. Members joined: 6 7 left: 6 7
Mar  1 13:14:49 node5 corosync[1661]: [TOTEM ] Failed to receive the leave message. failed: 6 7

# # Pacemaker's crmd failed 80 seconds later when its quorum update timed out. (The membership stood at 5 of 7 nodes.)
Mar  1 13:16:09 node5 crmd[13594]:   error: Quorum update 66 failed: Timer expired (-62)
Mar  1 13:16:09 node5 crmd[13594]:   error: Node update 74 failed: Timer expired (-62)
Mar  1 13:16:09 node5 crmd[13594]:   error: Node update 75 failed: Timer expired (-62)
Mar  1 13:16:09 node5 crmd[13594]:   error: Input I_ERROR received in state S_ELECTION from crmd_node_update_complete
Mar  1 13:16:09 node5 crmd[13594]:  notice: State transition S_ELECTION -> S_RECOVERY
Mar  1 13:16:09 node5 crmd[13594]: warning: Fast-tracking shutdown in response to errors

# # Meanwhile, the other nodes logged a flurry of conflicting membership formations:
Mar  1 13:11:32 node5 corosync[1661]: [TOTEM ] A new membership (10.1.50.253:74210) was formed. Members joined: 5
Mar  1 13:12:49 node5 corosync[1661]: [TOTEM ] A new membership (10.1.50.253:74214) was formed. Members
Mar  1 13:14:04 node2 corosync[1641]: [TOTEM ] A new membership (10.1.50.234:74214) was formed. Members left: 3 4 1
Mar  1 13:14:04 node6 corosync[1643]: [TOTEM ] A new membership (10.1.50.234:74214) was formed. Members left: 3 4 1
Mar  1 13:14:04 node7 corosync[1633]: [TOTEM ] A new membership (10.1.50.234:74214) was formed. Members left: 3 4 1
Mar  1 13:14:04 node1 corosync[1614]: [TOTEM ] A new membership (10.1.50.211:74213) was formed. Members left: 4 6 2 7
Mar  1 13:14:49 node2 corosync[1641]: [TOTEM ] A new membership (10.1.50.211:74218) was formed. Members joined: 3 4 1 5
Mar  1 13:14:49 node2 corosync[1641]: [TOTEM ] A new membership (10.1.50.211:74227) was formed. Members joined: 6 7 left: 6 7
Mar  1 13:14:49 node5 corosync[1661]: [TOTEM ] A new membership (10.1.50.211:74218) was formed. Members joined: 3 4 1 6 2 7
Mar  1 13:14:49 node5 corosync[1661]: [TOTEM ] A new membership (10.1.50.211:74227) was formed. Members joined: 6 7 left: 6 7
Mar  1 13:14:49 node6 corosync[12068]: [TOTEM ] A new membership (10.1.50.211:74227) was formed. Members joined: 3 4 1 2 5 7
Mar  1 13:14:49 node6 corosync[12068]: [TOTEM ] A new membership (10.1.50.234:74223) was formed. Members joined: 6
Mar  1 13:14:49 node6 corosync[1643]: [TOTEM ] A new membership (10.1.50.211:74218) was formed. Members joined: 3 4 1 5
Mar  1 13:14:49 node7 corosync[14507]: [TOTEM ] A new membership (10.1.50.211:74227) was formed. Members joined: 3 4 1 6 2 5
Mar  1 13:14:49 node7 corosync[14507]: [TOTEM ] A new membership (10.1.50.254:74223) was formed. Members joined: 7
Mar  1 13:14:49 node7 corosync[1633]: [TOTEM ] A new membership (10.1.50.211:74218) was formed. Members joined: 3 4 1 5
Mar  1 13:14:49 node1 corosync[1614]: [TOTEM ] A new membership (10.1.50.211:74218) was formed. Members joined: 4 6 2 5 7
Mar  1 13:14:49 node1 corosync[1614]: [TOTEM ] A new membership (10.1.50.211:74227) was formed. Members joined: 6 7 left: 6 7
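
# # A combined timeline like the one above can be assembled by collecting the TOTEM
# # membership lines from every node's logs and sorting them (paths are illustrative):
# grep -h 'TOTEM.*A new membership' node*/var/log/messages | sort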

Environment

  • Red Hat Enterprise Linux 7 (with the High Availability Add-on)
  • "udp over multicast" corosync transport protocol
