A fenced node was rebooted twice in a row in a Pacemaker cluster

Solution In Progress - Updated

Issue

  • A node was successfully rebooted by a fence action, but the operation was then marked as timed out and the node was fenced a second time shortly afterward. As shown in the logs below, the fenced node did not leave the corosync membership until well after the fence action was initiated; a sketch of commands for inspecting the relevant timers follows the log excerpt.
Mar 19 22:13:30 fastvm-rhel-8-0-23 pacemaker-schedulerd[338749]: warning: Unexpected result (error) was recorded for monitor of dummy1 on node2 at Mar 19 22:13:30 2021
Mar 19 22:13:30 fastvm-rhel-8-0-23 pacemaker-schedulerd[338749]: warning: Cluster node node2 will be fenced: dummy1 failed there
Mar 19 22:13:30 fastvm-rhel-8-0-23 pacemaker-schedulerd[338749]: warning: Scheduling Node node2 for STONITH
...
Mar 19 22:13:30 fastvm-rhel-8-0-23 pacemaker-fenced[338746]: notice: Requesting that node1 perform 'reboot' action targeting node2
...
Mar 19 22:13:32 fastvm-rhel-8-0-23 pacemaker-fenced[338746]: notice: Operation 'reboot' [338790] (call 3 from pacemaker-controld.338750) targeting node2 using xvm2 returned 0 (OK)
...
Mar 19 22:13:37 fastvm-rhel-8-0-23 corosync[1729]:  [KNET  ] link: host: 2 link: 0 is down
Mar 19 22:13:37 fastvm-rhel-8-0-23 corosync[1729]:  [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Mar 19 22:13:37 fastvm-rhel-8-0-23 corosync[1729]:  [KNET  ] host: host: 2 has no active links
Mar 19 22:13:43 fastvm-rhel-8-0-23 corosync[1729]:  [TOTEM ] Token has not been received in 12750 ms
Mar 19 22:13:47 fastvm-rhel-8-0-23 corosync[1729]:  [TOTEM ] A processor failed, forming new configuration.
Mar 19 22:13:49 fastvm-rhel-8-0-23 pcsd[1375]: INFO:tornado.access:200 GET /remote/get_configs?cluster_name=testcluster (192.168.22.24) 36.51ms
Mar 19 22:14:08 fastvm-rhel-8-0-23 corosync[1729]:  [TOTEM ] A new membership (1.11620) was formed. Members left: 2
Mar 19 22:14:08 fastvm-rhel-8-0-23 corosync[1729]:  [TOTEM ] Failed to receive the leave message. failed: 2
Mar 19 22:14:08 fastvm-rhel-8-0-23 corosync[1729]:  [CPG   ] downlist left_list: 1 received
Mar 19 22:14:08 fastvm-rhel-8-0-23 corosync[1729]:  [QUORUM] Members[1]: 1
Mar 19 22:14:08 fastvm-rhel-8-0-23 corosync[1729]:  [MAIN  ] Completed service synchronization, ready to provide service.
...
Mar 19 22:14:08 fastvm-rhel-8-0-23 pacemaker-fenced[338746]: error: Operation 'reboot' targeting node2 by node1 for pacemaker-controld.338750@node1: Timer expired
Mar 19 22:14:08 fastvm-rhel-8-0-23 pacemaker-fenced[338746]: error: Already sent notifications for 'reboot' targeting node2 by node1 for client pacemaker-controld.338750@node1: OK
Mar 19 22:14:08 fastvm-rhel-8-0-23 pacemaker-controld[338750]: notice: Stonith operation 3/5:5:0:c1e18fcd-50b5-44fa-bae0-49da438e92d7: Timer expired (-62)
Mar 19 22:14:08 fastvm-rhel-8-0-23 pacemaker-controld[338750]: notice: Stonith operation 3 for node2 failed (Timer expired): aborting transition.
Mar 19 22:14:08 fastvm-rhel-8-0-23 pacemaker-controld[338750]: notice: Transition 5 aborted: Stonith failed
Mar 19 22:14:08 fastvm-rhel-8-0-23 pacemaker-controld[338750]: notice: Peer node2 was not terminated (reboot) by node1 on behalf of pacemaker-controld.338750: Timer expired
...
Mar 19 22:14:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[338749]: warning: Cluster node node2 will be fenced: peer is no longer part of the cluster
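In the logs above, the fence device reported success at 22:13:32, yet the fence operation was not marked complete before its timer expired at 22:14:08, which is also when corosync finally removed node2 from the membership (note the 12750 ms token timeout). The commands below are a minimal diagnostic sketch, not the resolution of this article: they assume the node name node2 and fence device xvm2 from the example logs, which should be replaced with your own values.

# Fencing history for the target node (pcs 0.10 / RHEL 8; on RHEL 7 use: stonith_admin --history node2)
pcs stonith history show node2

# Cluster-wide stonith timeout and the fence device configuration (on RHEL 7 use: pcs stonith show xvm2)
pcs property list --all | grep stonith-timeout
pcs stonith config xvm2

# Corosync token/consensus timeouts, which determine how quickly a powered-off node leaves the membership
grep -E 'token|consensus' /etc/corosync/corosync.conf
corosync-cmapctl | grep -E 'totem\.token|totem\.consensus'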

Environment

  • Red Hat Enterprise Linux 7 (with the High Availability Add-on)
  • Red Hat Enterprise Linux 8 (with the High Availability Add-on)
