If you find that a node is being fenced at random, check for the following conditions.
The root cause of a fence is always a node losing its totem token, meaning that it lost communication with the rest of the cluster and stopped returning heartbeats within the allotted time.
Any situation in which a system fails to return a heartbeat within the specified token interval can lead to a fence. By default the token interval is 10 seconds. It can be changed by setting the desired value (in milliseconds) in the token attribute of the totem tag in the cluster.conf file (for example, <totem token="30000"/> for 30 seconds).
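As a minimal sketch, the fragment below shows where the totem tag sits in cluster.conf; the cluster name and config_version are placeholders, and the rest of the file (clusternodes, fencedevices, and so on) is unchanged.

```xml
<cluster name="mycluster" config_version="2">
  <!-- Raise the token timeout to 30 seconds (value is in milliseconds) -->
  <totem token="30000"/>
  <!-- existing clusternodes, fencedevices, rm sections remain as before -->
</cluster>
```

Remember to increment config_version and propagate the updated file to all nodes when changing cluster.conf.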
Ensure that the network is sound and working as expected.
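One quick, hedged sketch of such a check: look for packet loss between cluster nodes with ping. The output line is inlined here as a sample so the example is self-contained; in practice you would run something like ping -c 10 <peer-node> and parse its summary line.

```shell
# Sample ping summary line (in practice: ping -c 10 <peer-node>)
sample='10 packets transmitted, 10 received, 0% packet loss, time 9012ms'
# Extract the packet-loss percentage from the summary
loss=$(printf '%s\n' "$sample" | sed -n 's/.* \([0-9]*\)% packet loss.*/\1/p')
[ "$loss" -eq 0 ] && echo "no loss" || echo "loss: ${loss}%"
```

Any sustained packet loss or latency spikes approaching the token interval on the cluster interconnect deserve investigation before anything else.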
Ensure that the interfaces the cluster uses for inter-node communication are not using any bonding mode other than 0, 1, or 2. (Bonding modes 0 and 2 are supported as of Red Hat Enterprise Linux 6.4.)
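To verify the bonding mode in use, you can read the kernel's report for the bonded interface. The sample line below is inlined so the sketch is self-contained; in practice you would read /proc/net/bonding/bond0 (bond0 is a placeholder interface name).

```shell
# Sample line from /proc/net/bonding/bond0
sample='Bonding Mode: fault-tolerance (active-backup)'
# Strip the label to get the mode description
mode=$(printf '%s\n' "$sample" | sed -n 's/^Bonding Mode: //p')
echo "$mode"
```

Here "fault-tolerance (active-backup)" is mode 1, which is supported; modes such as balance-xor (mode 3) and above are not supported for cluster interconnects.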
Determine whether the system is freezing or kernel panicking. Set up the kdump utility and check whether a vmcore is captured during one of these fences.
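As a hedged sketch of a basic kdump configuration, the /etc/kdump.conf fragment below writes vmcores to the local filesystem; the path and compression options shown are common defaults, not requirements.

```
# /etc/kdump.conf -- write vmcores to the local filesystem
path /var/crash
# Compress the dump and filter out pages that are rarely needed
core_collector makedumpfile -c --message-level 1 -d 31
```

The crashkernel= kernel boot parameter must also reserve memory for the capture kernel, and the kdump service must be enabled (for example, chkconfig kdump on). If a vmcore appears in /var/crash after a fence, the node panicked rather than lost the token for network reasons.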
Make sure you are not wrongly attributing some other event to a fence, for example the quorum disk evicting a node due to a storage failure, or a third-party product such as Oracle RAC rebooting a node due to some outside condition. The messages logs are often very helpful in diagnosing such problems. Whenever a fence or node reboot occurs, it should be standard practice to inspect the messages logs of all nodes in the cluster from the time the reboot or fence occurred.
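An illustrative sketch of that log inspection: scan a messages log for fence-related activity around the time of the reboot. The log lines below are samples written to a temporary file so the example is self-contained (exact formats vary by release); in practice you would run the grep against /var/log/messages on every node.

```shell
# Write sample log lines to a scratch file (stand-in for /var/log/messages)
cat > /tmp/messages.sample <<'EOF'
Jan 10 03:12:01 node1 fenced[1680]: fencing node node2
Jan 10 03:12:09 node1 fenced[1680]: fence node2 success
EOF
# Look for fence daemon and quorum disk activity
grep -E 'fenced|qdiskd' /tmp/messages.sample
```

Seeing fenced activity on a surviving node confirms a genuine fence; qdiskd eviction messages or third-party reboot messages instead point to a different root cause.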
Thoroughly inspect the system for hardware faults that may cause it to stop responding to heartbeats when expected.