Cluster services are blocked after regaining quorum in a > 2-node cluster when all nodes lose contact with each other in RHEL 5 or 6

Solution Verified - Updated -

Issue

  • We concurrently disconnected all connectivity between the hosts (by shutting off the associated switch ports). As expected, all three nodes discovered that they had lost quorum and correctly stopped all clustered services they were hosting. When we re-enabled all the switch ports between the cluster hosts, the cluster noticed that quorum had been regained, but none of the nodes attempted to restart any previously stopped services.
  • Cluster services such as clvmd, GFS2, rgmanager, etc., are blocked after all nodes in a cluster lose connectivity and then regain it.
  • In a 4-node cluster, there was a split down the middle in which 2 nodes could see each other, and the other 2 could see each other, but the two sides lost contact. After the network recovered, everything was still stuck.
  • Unexpected behaviour when testing network failure of a 3-node cluster.
  • Services on every node in the cluster are down. Each node claims it has lost quorum.

Environment

  • Red Hat Enterprise Linux (RHEL) 5 or 6 with the High Availability Add-On
  • More than 2 nodes in a cluster
  • An event that causes all nodes to lose contact with each other briefly, either from a network disruption or resource starvation of some kind
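
The quorum loss in the scenarios described under Issue can be checked with simple vote arithmetic. The following is a minimal sketch, assuming one vote per node, no quorum disk, and the usual majority rule (more than half of the expected votes). The function names are hypothetical and are not part of any cluster tool. It shows only why every partition becomes inquorate; it does not model why services remain blocked after connectivity returns.

```python
from typing import List


def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for quorum: a strict majority of the expected votes."""
    return expected_votes // 2 + 1


def is_quorate(partition_votes: int, expected_votes: int) -> bool:
    """True if a partition holds enough votes to be quorate."""
    return partition_votes >= quorum_threshold(expected_votes)


def partition_status(partitions: List[int]) -> List[bool]:
    """Given the vote count of each partition, report which ones are quorate.

    Assumption: the expected votes equal the sum over all partitions,
    i.e. every node has one vote and no quorum disk contributes votes.
    """
    expected = sum(partitions)
    return [is_quorate(votes, expected) for votes in partitions]


# 3-node cluster with every node isolated: nobody reaches 2 votes.
print(partition_status([1, 1, 1]))  # [False, False, False]

# 4-node cluster split down the middle: neither half reaches 3 votes.
print(partition_status([2, 2]))     # [False, False]

# 3-node cluster after the network recovers: quorum is regained.
print(partition_status([3]))        # [True]
```

In both failure scenarios no partition is quorate, which matches the reports that every node stopped its clustered services at the same time.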
