High Availability does not Fail Over when a node goes offline

I have a 2-node cluster, which has a pretty minimal configuration:

2 nodes, each of which is a RHEL 6.7 VM running on a different ESX host server in a bladecenter.
Failover Domain which is Prioritized and Restricted
3 Virtual IP Address Resources
One Service Group, which consists of:
A parent virtual IP address (the primary)
The two other virtual IP addresses and one RHEL service as children. The service simply calls a script from /etc/init.d, which starts a custom Tomcat instance.
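
Since I can't upload the actual cluster.conf, here is a rough sketch of the layout from memory; the node names, IP addresses, and script name below are placeholders, not the real values:

    <?xml version="1.0"?>
    <cluster config_version="1" name="app_cluster">
      <clusternodes>
        <clusternode name="node1.example.com" nodeid="1"/>
        <clusternode name="node2.example.com" nodeid="2"/>
      </clusternodes>
      <rm>
        <failoverdomains>
          <!-- Prioritized (ordered) and restricted, as described above -->
          <failoverdomain name="app_domain" ordered="1" restricted="1">
            <failoverdomainnode name="node1.example.com" priority="1"/>
            <failoverdomainnode name="node2.example.com" priority="2"/>
          </failoverdomain>
        </failoverdomains>
        <resources>
          <ip address="192.168.1.10" monitor_link="1"/>
          <ip address="192.168.1.11" monitor_link="1"/>
          <ip address="192.168.1.12" monitor_link="1"/>
          <script file="/etc/init.d/custom-tomcat" name="custom-tomcat"/>
        </resources>
        <!-- The primary VIP is the parent; the other two VIPs and the init script are its children -->
        <service domain="app_domain" name="app_service" recovery="relocate">
          <ip ref="192.168.1.10">
            <ip ref="192.168.1.11"/>
            <ip ref="192.168.1.12"/>
            <script ref="custom-tomcat"/>
          </ip>
        </service>
      </rm>
    </cluster>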

The cluster is online and healthy. If I manually fail over the service from one node to the other, everything works fine.

If I try to simulate a hardware failure by doing a hard power-off of the (VM) node that currently owns the service, nothing happens.

On the second node, running clustat shows that node1 is offline, but it still lists the service as "started" with the first node as the owner (Last). It will not even allow me to force the service to migrate to the second node; that just hangs.
If I bring the power back up, the first node comes back online, and only THEN does the cluster actually try to fail the service over to the second node, by which point it is no longer necessary.

What I am wondering is how to make sure that the service fails over to the second node when the first node loses power (or suffers any other hardware failure). I am unable to upload my cluster.conf because the files are on a different network and the server has no internet connectivity.

Some additional notes:
I am not currently using fencing, though I have tested the VMware fencing option and that did not resolve the issue (a sketch of what I tried is below).
I also tried setting two_node="1" and expected_votes="1" on the cman tag in my cluster.conf in case this was a quorum issue, but that did not resolve it either.
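
Roughly, the fencing test added the following to the sketch above (the vCenter address, credentials, and VM names are placeholders):

    <cman two_node="1" expected_votes="1"/>
    <clusternodes>
      <clusternode name="node1.example.com" nodeid="1">
        <fence>
          <method name="vmware">
            <!-- port is the VM name as vCenter knows it -->
            <device name="vcenter_fence" port="node1-vm" ssl="on"/>
          </method>
        </fence>
      </clusternode>
      <!-- node2 is configured the same way, with port="node2-vm" -->
    </clusternodes>
    <fencedevices>
      <fencedevice agent="fence_vmware_soap" name="vcenter_fence"
                   ipaddr="vcenter.example.com" login="fence-user" passwd="********"/>
    </fencedevices>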

Thanks!

Responses

At first look it seems to be a multicast issue: when a node goes offline, the internal heartbeat communication between the nodes is not happening properly, so the second node still sees the first node as online and therefore the service does not move.

Also, fencing is required in every setup: until a failed node has been successfully fenced, service failover will not kick off. (That would also explain why your failover only proceeds once the powered-off node comes back, since rejoining the cluster lets the pending fence operation complete.) If you don't want power fencing, you could use SCSI fencing. The "two_node" and "expected_votes" attributes are required in a two-node cluster to avoid split-brain situations.
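
If you go the SCSI route, the cluster.conf entries would look roughly like this; note that fence_scsi needs shared storage that supports SCSI-3 persistent reservations (which may need extra configuration under VMware), and the device path here is a placeholder:

    <clusternodes>
      <clusternode name="node1.example.com" nodeid="1">
        <fence>
          <method name="scsi">
            <device name="scsi_fence"/>
          </method>
        </fence>
        <!-- Unfencing re-registers the node's key at boot so it can access the storage again -->
        <unfence>
          <device name="scsi_fence" action="on"/>
        </unfence>
      </clusternode>
      <!-- node2 gets the same fence and unfence stanzas -->
    </clusternodes>
    <fencedevices>
      <fencedevice agent="fence_scsi" name="scsi_fence" devices="/dev/disk/by-id/shared-lun-example"/>
    </fencedevices>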

Sadashiva,

Thank you for your response. It could just be that we used the wrong fencing type; using the VMware type did not resolve the problem. We are using two_node and expected_votes now, but as before, we still don't get failover on hard power-offs. Power fencing would be a valid check; we might look into that. As for multicast, we have also tested this with no firewall in place. The communication is there, since all of the failover commands work normally (i.e., when manually requested, or when the service in question fails); it only fails to happen on a hard power-off. The strangest part is that the failover that is supposed to happen actually does happen, but only after the powered-off node comes back online. It acts like the service is locked to the node and can't be released until the node that was running it allows it to.

We are currently looking at using Pacemaker instead of (or in addition to) cman, as our first tests with it handle this case without trouble.
