Cluster services showing group state of JOIN_STOP_WAIT, LEAVE_STOP_WAIT, or FAIL_ALL_STOP after a cluster node reboot in RHEL 5 or RHEL 6


Issue

  • A node in the cluster had communication issues after starting, at which point we attempted to stop the cman service. After this, the rest of the cluster never recovered, with cman_tool services showing groups stuck in LEAVE_STOP_WAIT or FAIL_ALL_STOP. No fencing or service management takes place while in this state, and rejoining or removing the affected node does not cause the cluster to recover.

  • One of the nodes in our cluster was fenced, and I saw the fence was completed successfully by another node in the cluster. After the fenced node reboots and we start cman, it hangs for around 5 minutes at "Starting fencing" before finally reporting "OK", but group_tool then shows JOIN_STOP_WAIT where the state should be none.

[root@node2 ~]# group_tool ls
type             level name     id       state
fence            0     default  00000000 JOIN_STOP_WAIT
[1 2 3 4]

On the other nodes, group_tool shows FAIL_ALL_STOPPED:

[root@node4 ~]# group_tool ls
type             level name       id       state
fence            0     default    00010002 FAIL_ALL_STOPPED
[1 2 3 4]
dlm              1     rgmanager  00020002 none
[1 2 3]
  • When a node rejoins the cluster after being fenced, it cannot mount GFS2 file systems and gets the error "node not a member of the default fence domain"


Environment
  • Red Hat Enterprise Linux (RHEL) 5 and 6 with the High Availability Add On
  • A node recently stopped with service cman stop or fence_tool leave
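To spot affected nodes quickly, the stuck states above can be filtered out of `group_tool ls` output. The `stuck_groups` helper below is a hypothetical sketch (not part of the cluster tooling); on a live node you would pipe `group_tool ls` into it, and here it is fed the capture from node4 above as a stand-in:

```shell
#!/bin/sh
# Hypothetical helper: print any group whose state column shows one of the
# stuck states described in this article (JOIN_STOP_WAIT, LEAVE_STOP_WAIT,
# or FAIL_ALL_STOP/FAIL_ALL_STOPPED). Reads `group_tool ls` output on stdin.
stuck_groups() {
    # Columns: type, level, name, id, state; member lists like "[1 2 3 4]"
    # and the header line do not match and are skipped.
    awk '$5 ~ /(JOIN_STOP_WAIT|LEAVE_STOP_WAIT|FAIL_ALL_STOP)/ { print $1, $3, $5 }'
}

# Sample input: the node4 capture from above (stand-in for `group_tool ls`).
stuck_groups <<'EOF'
type             level name       id       state
fence            0     default    00010002 FAIL_ALL_STOPPED
[1 2 3 4]
dlm              1     rgmanager  00020002 none
[1 2 3]
EOF
```

On a healthy cluster every group's state is "none" and the helper prints nothing; on the node4 capture it prints the stuck fence group.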
