A node in the cluster had communication issues after starting, at which point we attempted to stop the cman service. After this the rest of the cluster never recovered, with cman_tool services showing groups stuck in FAIL_ALL_STOPPED. No fencing or service management takes place while in this state, and rejoining or removing the affected node does not cause the cluster to recover.
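For reference, the group state can be listed on any node with either of the commands below; both read the same groupd state, and the exact columns vary by release:

    # List the fence, dlm, and gfs groups and their recovery state. A
    # healthy group reports state "none"; a group stuck in FAIL_ALL_STOPPED
    # has completed the stop phase of failure recovery but never started:
    cman_tool services

    # group_tool gives the same view of groupd's groups:
    group_tool ls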
One of our servers in the cluster got fenced; I saw that the fencing was completed successfully by another node in the cluster. After a reboot, when we start cman it hangs for around five minutes in "Starting fencing" before finally showing "OK" as the status, but the fence group is then stuck in JOIN_STOP_WAIT where the state should be none:

    [root@node2 ~]# group_tool ls
    type             level name       id       state
    fence            0     default    00000000 JOIN_STOP_WAIT
    [1 2 3 4]
On the other nodes, it shows:

    [root@node4 ~]# group_tool ls
    type             level name       id       state
    fence            0     default    00010002 FAIL_ALL_STOPPED
    [1 2 3 4]
    dlm              1     rgmanager  00020002 none
    [1 2 3]
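To dig further on a node stuck in one of these states, groupd's internal event log can be dumped; a diagnostic sketch, to be run on each affected node:

    # Dump groupd's debug buffer, which records the stop/start/finish
    # events each group has processed and so shows where recovery stalled:
    group_tool dump

    # Dump the fence daemon's view of the default fence domain:
    group_tool dump fence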
- When a node rejoins the cluster after being fenced, it cannot mount GFS2 file systems and gets the error "node not a member of the default fence domain" (see the fence-domain check sketched after this list)
- Red Hat Enterprise Linux (RHEL) 5 and 6 with the High Availability Add-On
- A node recently stopped with service cman stop or was fenced (a hedged sketch of the clean stop order follows below)
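For the "node not a member of the default fence domain" mount failure above, a minimal check-then-join sketch, assuming the rest of the cluster is healthy; the device path and mount point are placeholders:

    # On RHEL 6, list the fence domain and confirm this node is a member
    # (on RHEL 5, check the fence group in "group_tool ls" instead):
    fence_tool ls

    # If the node is missing from the domain, join it before mounting;
    # fenced must already be running for this to succeed:
    fence_tool join

    # Retry the GFS2 mount; /dev/vg_cluster/lv_gfs2 is a placeholder:
    mount -t gfs2 /dev/vg_cluster/lv_gfs2 /mnt/gfs2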
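And for stopping cman on a node, the cluster services should be stopped in the reverse of their start order so that nothing is still using DLM or GFS2 when cman exits; a sketch for RHEL 5/6:

    # Relocate or stop managed services first, then unmount GFS2, then
    # stop clustered LVM, and only then stop cman itself:
    service rgmanager stop
    service gfs2 stop       # unmounts GFS2 file systems
    service clvmd stop      # only if clustered LVM is in use
    service cman stop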