Red Hat High Availability 2 nodes 2 service groups failback behaviour

Hi,

I'm setting up a simple two-node HA cluster with two service groups and am struggling to achieve the behaviour I want. Does anyone have experience with this or know how to achieve it?

I'd like each service group to start up on its "home" (preferred) node, be capable of failing over to the other node, but NOT fail back automatically after a failover.

I have settled on the following setup:

<failoverdomains>
        <failoverdomain name="01-domain" nofailback="1" ordered="1" restricted="0">
                <failoverdomainnode name="node1" priority="1"/>
        </failoverdomain>
        <failoverdomain name="02-domain" nofailback="1" ordered="1" restricted="0">
                <failoverdomainnode name="node2" priority="1"/>
        </failoverdomain>
</failoverdomains>

With a service group in each failover domain, this seems to achieve the effect that the service groups start up on their preferred nodes and are able to fail over to the other node when something happens. However, even with nofailback set to 1, a service group will move back to its original node after that node has been successfully fenced and comes back online.

If I change the config to:

<failoverdomains>
        <failoverdomain name="01-domain" nofailback="1" ordered="1" restricted="0">
                <failoverdomainnode name="node1" priority="1"/>
                <failoverdomainnode name="node2" priority="2"/>
        </failoverdomain>
        <failoverdomain name="02-domain" nofailback="1" ordered="1" restricted="0">
                <failoverdomainnode name="node2" priority="1"/>
                <failoverdomainnode name="node1" priority="2"/>
        </failoverdomain>
</failoverdomains>

Then the nofailback setting seems to work as expected; however, I noticed that when I start the cluster, both service groups start on the same node. I can set the service groups not to auto start and then use clusvcadm -Fe to start them with respect to failover domain policies, but that option seems to be deprecated and also prevents the service groups from starting in the event of a crash.
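In other words, the services would be defined along these lines (the service names here are only illustrative, and the actual resource definitions are omitted):

<rm>
        <service name="01-service" domain="01-domain" autostart="0"/>
        <service name="02-service" domain="02-domain" autostart="0"/>
</rm>

and then started manually with "clusvcadm -Fe 01-service" and "clusvcadm -Fe 02-service".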

Can anyone help?

Many thanks
Nick

Responses

Nick, the second "failoverdomain" configuration looks perfect to me. How did you perform the test? Did you enable both "cman" and "rgmanager" to start on boot and power on/restart both nodes at the same time, or did you manually start the "rgmanager" service on both nodes at the same time? Also, please attach your cluster.conf file.

Hi Sadashiva, I'm just using the ccs --startall command to start up the cluster. The results vary, but often both service groups come online on the same node. I've attached the full cluster.conf.

Btw, I forgot to ask which RHEL version is being used for the cluster setup; I assume it is RHEL 6.x. With all cluster-related services configured to come up on boot, does the cluster behave the same when both nodes are rebooted at the same time?

Even if both nodes are rebooted (or the cluster software is started) at the same time, there is always going to be that one particular microsecond when the first rgmanager starts up on a cluster node that has achieved cluster quorum. At that moment, the other node is either ready to start cluster services, or it is not.

If the other node is ready, then both services should come up on their preferred nodes. But since we are talking about the first node that managed to start rgmanager, it is likely that the other node has not quite reached that state yet.

If the other node is not ready at that time, then rgmanager must start both services on the node that is ready. It cannot wait forever for the other node, for that would be the same as not starting the other service at all if the other node has failed, and that is not acceptable.

The man page of rgmanager has the following about the nofailback policy:

When nofailback is used, the following two behaviors should be noted:

* If a subset of cluster nodes forms a quorum, the node with the highest priority in the failover domain is selected to run a service bound to the domain. After this point, a higher priority member joining the cluster will not trigger a relocation.

* When a service is running outside of its unrestricted failover domain and a cluster member boots which is a part of the service's failover domain, the service will relocate to that member. That is, nofailback does not prevent transitions from outside of a failover domain to inside a failover domain. After this point, a higher priority member joining the cluster will not trigger a relocation.

Ordering, restriction, and nofailback are flags and may be combined in almost any way (ie, ordered+restricted, unordered+unrestricted, etc.). These combinations affect both where services start after initial quorum formation and which cluster members will take over services in the event that the service has failed. 

That would imply that in the original post's first configuration, both services would initially start up on the first node that successfully started rgmanager. Once the non-preferred service had completed its start-up, it would then immediately go down in order to move to its preferred node, and then start up again there.

I can set the service groups not to auto start and then use clusvcadm -Fe to start them with respect to failover domain policies, but that option seems to be deprecated

Where is it said that "clusvcadm -Fe" is deprecated?

and also prevents the service groups from starting in the event of a crash.

Only if both nodes crash so close together that the surviving node cannot "hand over" the current cluster state information to the recovering node - but then you are again in the situation where both nodes start up at roughly the same time. If it is important to you that both services automatically start on their appropriate nodes (rather than doing a start-stop-move-restart four-step to get there) during a controlled start-up, the same reasoning would probably apply in this situation too.

I might try the following:

  • use a failover domain configuration similar to the second example in the original post

  • disable autostart on the services

  • make a custom start-up script that runs after the cluster start-up script has completed. It should first wait a short, fixed amount of time (ideally slightly different on each node) to ensure all cluster components have completed their initial communications, and then run "clusvcadm -Fe <service>" on each service, ignoring any "service is already running" errors. A rough sketch follows below.
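A minimal sketch of such a script, assuming the service names 01-service and 02-service from the illustration above and the node names node1/node2 from the original post (adjust the names and delays to your environment, and run it from something that starts after the cluster init scripts, e.g. /etc/rc.local):

#!/bin/bash
# Start rgmanager-managed services on whichever node runs this first,
# honouring failover domain rules so each lands on its preferred node.

# Give cman/rgmanager a moment to finish their initial communications.
# Use a slightly different delay on each node so they don't race.
case "$(hostname -s)" in
    node1) sleep 30 ;;
    node2) sleep 45 ;;
    *)     sleep 60 ;;
esac

# -F: pick the best node according to failover domain rules.
# "service is already running" errors are harmless here, so ignore failures.
for svc in 01-service 02-service; do
    clusvcadm -Fe "$svc" || true
done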
