A node in the cluster had communication issues after starting, at which point we attempted to stop the cman service. After this the rest of the cluster never recovered, with cman_tool services showing groups stuck in FAIL_ALL_STOPPED. No fencing or service management takes place while in this state, and rejoining or removing the affected node does not cause the cluster to recover.
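For reference, the group state can be listed on any node with either of the commands below; both read the same groupd state, and the exact columns vary by release:

    # List the fence, dlm, and gfs groups and their recovery state. A
    # healthy group reports state "none"; a group stuck in FAIL_ALL_STOPPED
    # has completed the stop phase of failure recovery but never started:
    cman_tool services

    # group_tool gives the same view of groupd's groups:
    group_tool ls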
One of our servers in the cluster got fenced; I saw that the fencing was completed successfully by another node in the cluster. After a reboot, when we start cman it hangs for around five minutes in "Starting fencing" before finally showing "OK" as the status, but the fence group is then stuck in JOIN_STOP_WAIT where the state should be none:

    [root@node2 ~]# group_tool ls
    type             level name       id       state
    fence            0     default    00000000 JOIN_STOP_WAIT
    [1 2 3 4]
On the other nodes, it shows:

    [root@node4 ~]# group_tool ls
    type             level name       id       state
    fence            0     default    00010002 FAIL_ALL_STOPPED
    [1 2 3 4]
    dlm              1     rgmanager  00020002 none
    [1 2 3]
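To dig further on a node stuck in one of these states, groupd's internal event log can be dumped; a diagnostic sketch, to be run on each affected node:

    # Dump groupd's debug buffer, which records the stop/start/finish
    # events each group has processed and so shows where recovery stalled:
    group_tool dump

    # Dump the fence daemon's view of the default fence domain:
    group_tool dump fence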
- When a node rejoins the cluster after being fenced, it cannot mount GFS2 file systems and gets the error "node not a member of the default fence domain" (see the fence-domain check sketched after this list)
- Red Hat Enterprise Linux (RHEL) 5 and 6 with the High Availability Add-On
- A node recently stopped with service cman stop or was fenced (a hedged sketch of the clean stop order follows below)
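For the "node not a member of the default fence domain" mount failure above, a minimal check-then-join sketch, assuming the rest of the cluster is healthy; the device path and mount point are placeholders:

    # On RHEL 6, list the fence domain and confirm this node is a member
    # (on RHEL 5, check the fence group in "group_tool ls" instead):
    fence_tool ls

    # If the node is missing from the domain, join it before mounting;
    # fenced must already be running for this to succeed:
    fence_tool join

    # Retry the GFS2 mount; /dev/vg_cluster/lv_gfs2 is a placeholder:
    mount -t gfs2 /dev/vg_cluster/lv_gfs2 /mnt/gfs2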
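And for stopping cman on a node, the cluster services should be stopped in the reverse of their start order so that nothing is still using DLM or GFS2 when cman exits; a sketch for RHEL 5/6:

    # Relocate or stop managed services first, then unmount GFS2, then
    # stop clustered LVM, and only then stop cman itself:
    service rgmanager stop
    service gfs2 stop       # unmounts GFS2 file systems
    service clvmd stop      # only if clustered LVM is in use
    service cman stop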