Node in Red Hat cluster being removed from the cluster due to missed heartbeats
Issue
We have one node in a six-node Red Hat cluster that we believed had hardware issues. It kept being dropped from the cluster, and our hardware vendor diagnosed hardware faults and replaced both the SP card and the motherboard. The failing node is a Sun Fire X4600 server running Red Hat Enterprise Linux 4.8 (as are all the other nodes in the cluster).
Since the hardware replacements the server itself appears to be stable. However, the cluster is still intermittently failing it, with the following errors logged on another cluster node:
Jun 3 17:04:24 bsqe01041 kernel: CMAN: node bsqe02041 has been removed from the cluster : Missed too many heartbeats
Jun 3 17:04:24 bsqe01041 kernel: CMAN: Started transition, generation 97
Jun 3 17:04:24 bsqe01041 clurgmgrd: [11564]: <info> Executing /etc/init.d/clu-pci status
Jun 3 17:04:25 bsqe01041 kernel: CMAN: Finished transition, generation 97
Jun 3 17:04:25 bsqe01041 fenced[6243]: fencing deferred to bsqe01040
Jun 3 17:04:37 bsqe01041 kernel: dm-cmirror: A cluster mirror log member has failed.
Jun 3 17:04:39 bsqe01041 clurgmgrd[11564]: <info> Magma Event: Membership Change
Jun 3 17:04:39 bsqe01041 clurgmgrd[11564]: <info> State change: bsqe02041 DOWN
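To see how often, and at what times, the node is being dropped, the CMAN removal messages can be pulled out of the syslog on each cluster member. The following is a minimal Python sketch and is not part of the original report: the function names are illustrative, and the pattern assumes only the message format shown in the excerpt above.

```python
import re
from collections import Counter

# Matches lines of the form shown in the excerpt above, e.g.
#   Jun 3 17:04:24 bsqe01041 kernel: CMAN: node bsqe02041 has been removed
#   from the cluster : Missed too many heartbeats
# The pattern is an assumption based solely on that one sample line.
REMOVAL = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<reporter>\S+)\s+kernel:\s+CMAN:\s+node\s+(?P<node>\S+)\s+"
    r"has been removed from the cluster\s*:\s*(?P<reason>.+?)\s*$"
)


def find_removals(lines):
    """Return (timestamp, reporting host, removed node, reason) per removal line."""
    events = []
    for line in lines:
        match = REMOVAL.match(line)
        if match:
            events.append(
                (match["stamp"], match["reporter"], match["node"], match["reason"])
            )
    return events


def removals_per_node(events):
    """Count how many times each node was reported as removed."""
    return Counter(node for _, _, node, _ in events)


# Sample input copied from the log excerpt above.
sample = [
    "Jun 3 17:04:24 bsqe01041 kernel: CMAN: node bsqe02041 has been removed "
    "from the cluster : Missed too many heartbeats",
    "Jun 3 17:04:24 bsqe01041 kernel: CMAN: Started transition, generation 97",
    "Jun 3 17:04:25 bsqe01041 fenced[6243]: fencing deferred to bsqe01040",
]

events = find_removals(sample)
for stamp, reporter, node, reason in events:
    print(f"{stamp}: {reporter} saw {node} removed ({reason})")
```

In practice the same functions could be fed an open log file (for example the system messages file on each node) instead of the `sample` list, so that removal times reported by different members can be compared side by side.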
Environment
RHEL 4.8
