RHEL5 cluster on VMware rebooting/fencing randomly

I have a RHEL cluster on VMware that has been rebooting/fencing randomly for the past few months. I have checked the cluster log and the messages log, and neither ever shows anything unusual. I also didn't see any logged network heartbeat failure in dmesg. I'm unable to enable DLM debugging on the cluster since that would probably require downtime.
Every time the fencing happens, the logs look like the following:

Mar 27 04:30:26 node1 ntpd[6570]: synchronized to LOCAL(0), stratum 10
Mar 27 04:30:26 node1 ntpd[6570]: kernel time sync enabled 0001
Mar 27 04:31:32 node1 ntpd[6570]: synchronized to 10.171.8.4, stratum 3
Mar 27 04:54:02 node1 ntpd[6570]: synchronized to 10.171.8.5, stratum 3
Mar 27 06:06:32 node1 kernel: dlm: closing connection to node 2
Mar 27 06:07:02 node1 fenced[6051]: node1 -c not a cluster member after 30 sec post_fail_delay
Mar 27 06:07:02 node1 fenced[6051]: fencing node "node1 -c"

Could someone guide me on what else I should troubleshoot, since I'm quite new to RHEL clustering? I need to understand why, when DLM closes the connection to node 2, node 2 never tries to rejoin the cluster.
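For context, the "30 sec post_fail_delay" in the log above comes from the fence_daemon element of /etc/cluster/cluster.conf. A minimal sketch of where that setting lives (the cluster name, version, and delay values here are assumptions for illustration, not copied from my real config):

```xml
<?xml version="1.0"?>
<!-- Sketch only: attribute values are assumed, not my actual config -->
<cluster name="mycluster" config_version="1">
  <!-- post_fail_delay: seconds fenced waits after a node failure
       before fencing it; post_join_delay: seconds it waits after
       a node joins before fencing any failed members -->
  <fence_daemon post_fail_delay="30" post_join_delay="3"/>
  <!-- clusternodes, fencedevices, rm sections omitted -->
</cluster>
```

I'm mentioning it in case the timing of these delays is relevant to why the fence fires.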
