Cluster resource management is delayed after a node is fenced due to a resource failure in a RHEL 7 High Availability cluster

Solution Unverified - Updated 2024-06-14T18:43:17+00:00

Issue

We have a high token setting in our cluster, and whenever a node gets fenced due to a resource failure, the cluster blocks for a long time before resuming activity. The cluster appears to wait for the token timeout to expire; activity resumes only once corosync finally processes a membership change.
The cluster doesn't recover any resources for a period of time after a node gets fenced due to a failed resource operation.
I see my cluster resource sitting failed/stopped when a node gets fenced after a stop failure, and it isn't recovered for a few minutes.
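
To judge how long such a delay could last, it can help to check the token value configured in the totem section of /etc/corosync/corosync.conf. The following is a minimal Python sketch for doing that, not part of the original solution. The function name totem_token_ms and the sample configuration are hypothetical, and the parser assumes the simple "key: value" layout shown here.

```python
import re

def totem_token_ms(conf_text):
    """Return the totem 'token' value in milliseconds, or None if it is not set."""
    stack = []  # names of the currently open sections, outermost first
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.endswith("{"):
            stack.append(line[:-1].strip())  # entering a section such as "totem"
        elif line == "}":
            if stack:
                stack.pop()  # leaving the current section
        elif stack == ["totem"]:
            # Match only the exact "token" key, not keys such as
            # token_retransmits_before_loss_const.
            m = re.match(r"token\s*:\s*(\d+)$", line)
            if m:
                return int(m.group(1))
    return None

# Hypothetical sample; on a real node, read /etc/corosync/corosync.conf instead.
SAMPLE_CONF = """
totem {
    version: 2
    cluster_name: examplecluster
    transport: udpu
    token_retransmits_before_loss_const: 10
    token: 30000
}
nodelist {
    node {
        ring0_addr: node1.example.com
        nodeid: 1
    }
}
"""

ms = totem_token_ms(SAMPLE_CONF)
print(f"token timeout: {ms / 1000} s")  # prints "token timeout: 30.0 s"
```

A return value of None means the file does not set token explicitly, in which case corosync uses its built-in default.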

Environment

Red Hat Enterprise Linux (RHEL) 7 with the High Availability Add-On
pacemaker
A resource configured with an operation setting of on-fail=fence
- NOTE: When stonith is enabled, this is the default for the stop operation on all resources unless an administrator has configured it otherwise
