Cluster services showing group state of JOIN_STOP_WAIT, LEAVE_STOP_WAIT, or FAIL_ALL_STOP after a cluster node reboot in RHEL 5 or RHEL 6

Solution Verified

Issue

  • A node in the cluster had communication issues after starting, at which point we attempted to stop the cman service. Afterwards the rest of the cluster never recovered: cman_tool services shows groups stuck in LEAVE_STOP_WAIT or FAIL_ALL_STOP. No fencing or service management takes place while the cluster is in this state, and rejoining or removing the affected node does not cause the cluster to recover.

  • One of our clustered servers got fenced, and I saw that the fence was completed successfully by another node in the cluster. After the node rebooted and we started cman, it hung for around 5 minutes at "Starting fencing" before finally reporting "OK", but group_tool now shows JOIN_STOP_WAIT where the state should be none:

[root@node2 ~]# group_tool ls
type             level name     id       state
fence            0     default  00000000 JOIN_STOP_WAIT
[1 2 3 4]

The other nodes show FAIL_ALL_STOPPED (a quick way to check every node at once is sketched after this list):

[root@node4 ~]# group_tool ls
type             level name       id       state
fence            0     default    00010002 FAIL_ALL_STOPPED
[1 2 3 4]
dlm              1     rgmanager  00020002 none
[1 2 3]
  • When a node rejoins the cluster after being fenced, it cannot mount GFS2 file systems and fails with the error "node not a member of the default fence domain".
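
A quick way to confirm the symptom is to check the group state on every node at once. The loop below is a minimal sketch: node1 through node4 are placeholder hostnames, and it assumes passwordless ssh between the nodes. group_tool ls (or cman_tool services on RHEL 5) reports the group state shown throughout this article.

# Minimal sketch: survey the fence/dlm group state across the cluster.
# node1..node4 are placeholder hostnames - substitute your own members.
for n in node1 node2 node3 node4; do
    echo "=== $n ==="
    ssh "$n" 'group_tool ls'
done

A healthy group shows a state of none; any group left in JOIN_STOP_WAIT, LEAVE_STOP_WAIT, or FAIL_ALL_STOPPED on any node indicates the stuck condition described above.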

Environment

  • Red Hat Enterprise Linux (RHEL) 5 and 6 with the High Availability Add On
  • A node was recently stopped with service cman stop or fence_tool leave (illustrated below)
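
For reference, the stop condition above corresponds to running either one of the following on the node being taken out of the cluster. This is only a minimal illustration of the commands named in this article, not a recommended recovery step; node2 is a placeholder hostname taken from the output above.

[root@node2 ~]# service cman stop
[root@node2 ~]# fence_tool leave

On these releases, service cman stop leaves the fence domain as part of its shutdown sequence, while fence_tool leave only leaves the fence domain and does not stop the rest of the cluster stack.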
