Cluster resource management is delayed after a node is fenced due to a resource failure in a RHEL 7 High Availability cluster

Solution Unverified - Updated -

Issue

  • Our cluster has a high token setting, and whenever a node is fenced because of a resource failure, the cluster blocks for a long time before resuming activity. The cluster appears to wait for the token timeout to expire; only then does corosync process the membership change, and activity resumes.
  • The cluster doesn't recover any resources for a period of time after a node is fenced because a resource operation failed.
  • My cluster resource sits in a failed/stopped state after a node is fenced following a stop failure, and it isn't recovered for a few minutes.
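
To judge whether the "high token setting" described above applies to a given cluster, it can help to read the configured token value. The following is a minimal sketch, assuming the usual corosync.conf layout with a totem section. The sample configuration, its values, and the helper name totem_token_ms are hypothetical and only for illustration. The sketch reads only the configured value and does not account for anything corosync derives from it at runtime.

```python
import re

# Hypothetical sample; on a real node the file is normally /etc/corosync/corosync.conf.
SAMPLE_CONF = """\
totem {
    version: 2
    cluster_name: example
    token: 30000
    interface {
        ringnumber: 0
    }
}
quorum {
    provider: corosync_votequorum
}
"""


def totem_token_ms(conf_text):
    """Return the totem 'token' value in milliseconds, or None if it is not set."""
    depth = 0          # current brace nesting level
    in_totem = False   # True while inside the top-level totem section
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.endswith("{"):
            if depth == 0 and line[:-1].strip() == "totem":
                in_totem = True
            depth += 1
        elif line == "}":
            depth -= 1
            if depth == 0:
                in_totem = False
        elif in_totem and depth == 1:
            # Only keys directly inside totem { } are considered, not sub-sections.
            match = re.fullmatch(r"token\s*:\s*(\d+)", line)
            if match:
                return int(match.group(1))
    return None


print(totem_token_ms(SAMPLE_CONF))
```

A result of None means no token value is set explicitly in the text that was parsed, so corosync would use its built-in default.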

Environment

  • Red Hat Enterprise Linux (RHEL) 7 with the High Availability Add-On
  • pacemaker
  • At least one resource configured with an operation setting of on-fail=fence
    • NOTE: When fencing is enabled, this is the default for the stop operation on all resources unless administrators specify otherwise in the configuration
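
To check whether a cluster matches the environment above, one option is to scan the resource definitions in the CIB for operations that fence on failure. The following is a minimal sketch, assuming the CIB XML has been saved to a string (for example from the output of pcs cluster cib). The sample XML, the resource names, and the helper ops_that_fence are hypothetical. The sketch also assumes fencing is enabled and ignores op_defaults and resource templates, either of which can change the effective on-fail value.

```python
import xml.etree.ElementTree as ET

# Hypothetical resources section, trimmed to the parts the sketch needs.
SAMPLE_CIB = """\
<resources>
  <primitive id="db" class="ocf" provider="heartbeat" type="pgsql">
    <operations>
      <op id="db-stop" name="stop" interval="0s" timeout="60s"/>
      <op id="db-monitor" name="monitor" interval="30s" on-fail="fence"/>
    </operations>
  </primitive>
  <primitive id="vip" class="ocf" provider="heartbeat" type="IPaddr2">
    <operations>
      <op id="vip-stop" name="stop" interval="0s" on-fail="block"/>
    </operations>
  </primitive>
</resources>
"""


def ops_that_fence(cib_xml):
    """Return (resource id, operation name, reason) for operations whose failure fences the node."""
    root = ET.fromstring(cib_xml)
    found = []
    for prim in root.iter("primitive"):
        stop_on_fail = None  # stays None if stop has no explicit on-fail
        for op in prim.iter("op"):
            on_fail = op.get("on-fail")
            if op.get("name") == "stop":
                stop_on_fail = on_fail
            elif on_fail == "fence":
                found.append((prim.get("id"), op.get("name"), "explicit"))
        if stop_on_fail == "fence":
            found.append((prim.get("id"), "stop", "explicit"))
        elif stop_on_fail is None:
            # Assumption: fencing is enabled, so stop falls back to on-fail=fence.
            found.append((prim.get("id"), "stop", "default"))
    return found


for resource_id, op_name, reason in ops_that_fence(SAMPLE_CIB):
    print(resource_id, op_name, reason)
```

In the sample, the db resource is reported twice (an explicit monitor setting and the implicit stop default), while vip is not reported because its stop operation overrides the default with on-fail=block.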
