Quorum Disk?

Hi,

I have four (4) servers connected to SAN storage and to two (2) Layer 2 switches.

These four (4) servers are already configured as a cluster and access the SAN storage concurrently.

Server A and Server B are connected to two (2) SAN storage arrays concurrently. They also have Service Group 'X', which is running (active) on Server A and on standby on Server B. Server C and Server D only access one (1) SAN storage array concurrently (no active service is running on them).

The network cable for IP data is connected to the first switch, and the network cable for the IP heartbeat is connected to the second switch. Unfortunately, the heartbeat connections from the servers to the switch are not bonded (single link) because of the limited number of ports on the switch.

My client plans to replace these two switches with new ones, and has asked that the servers have no downtime while the network team replaces the switches.

My question: is there a way to keep these four servers running (no downtime) even if the network team unplugs all the connected heartbeat network cables and installs the new switches? Is it necessary to configure a quorum disk? I am just afraid that service X could 'freeze' because the heartbeat connection is down temporarily.

Attached are the "sample illustration" and "sample cluster.conf" for details.

Thanks in advance.

Best Regards,

Aldi

Responses

Aldi,

It will most probably go down. The only workaround that may save you is to run a crossover cable and build a bond, if the servers still have NIC ports left.

Adding a quorum disk also requires downtime, because you need to add a resource to cluster.conf.

Rescanning cluster.conf for the new resource brings the service down.

Kind regards,

Jan Gerrit Kootstra

Hi Jan,

Thanks for the comment, and yes, I think I should take these servers (especially the application services) down first. However, is it necessary to configure the quorum disk? Or do I just have to set up a bonded network interface for the heartbeat connection?

Thanks in advance.

Regards,

Aldi

A quorum disk is nice to have, but not a necessity.

A bonded heartbeat is a must-have.

I've read about "fencing wars" that only affect 2-node clusters. I read about it at this link: http://h30499.www3.hp.com/t5/System-Administration/Question-on-Quorum-Disk-amp-Fence-device-in-Redhat-Cluster/td-p/5255779

I'm afraid that this "fencing war" could also affect a 4-node cluster if I don't set up a quorum disk.

Regards,

Aldi

Rather than having a Quorum disk, you could just increase the number of votes that one of the nodes has. 

So you set one box to have 2 votes, effectively setting the total votes to 5. For the cluster to be quorate you would then need 3 votes, whether that be the 2-vote host and 1 other, or the 3 other hosts.

Within /etc/cluster/cluster.conf, look for the <clusternode> section and change the votes value for the node in question.
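As a rough illustration of the vote arithmetic above, here is a minimal Python sketch. It assumes the usual majority rule (quorum = floor(total votes / 2) + 1) and that a node without an explicit votes attribute counts as 1 vote. The node names and the trimmed cluster.conf fragment are invented for the example and are not taken from the attached configuration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down cluster.conf fragment. Node names are invented,
# and a real file also carries fencing, resource, and other sections.
SAMPLE_CONF = """
<cluster name="example" config_version="2">
  <clusternodes>
    <clusternode name="node-a" nodeid="1" votes="2"/>
    <clusternode name="node-b" nodeid="2" votes="1"/>
    <clusternode name="node-c" nodeid="3" votes="1"/>
    <clusternode name="node-d" nodeid="4" votes="1"/>
  </clusternodes>
</cluster>
"""


def node_votes(conf_xml):
    """Return {node name: votes}; a missing votes attribute is assumed to be 1."""
    root = ET.fromstring(conf_xml)
    return {
        node.get("name"): int(node.get("votes", "1"))
        for node in root.iter("clusternode")
    }


def quorum_threshold(total_votes):
    """Simple majority: strictly more than half of the total votes."""
    return total_votes // 2 + 1


votes = node_votes(SAMPLE_CONF)
total = sum(votes.values())
print(f"total votes = {total}, votes needed for quorum = {quorum_threshold(total)}")
```

With one node at 2 votes the total is 5 and the threshold is 3, which matches the figures given above. With four 1-vote nodes the same function also gives 3 (out of 4).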

To your original question, there unfortunately is no good way to prevent the cluster from falling apart in the event of a network outage.  Like you said, you can freeze the cluster service(s) within rgmanager, but this would have no impact on cman and corosync/openais detecting a membership change and taking the appropriate action (fencing if enough nodes still have quorum to do so, or losing quorum).  Once the outage is over, you would have nodes attempting to rejoin into a single cluster, but they would be unable to "merge" their states, more than likely causing a failure of the cluster because none would have quorum to fence/kill, and thus they'd all stay in a non-functional state. 

  https://access.redhat.com/knowledge/solutions/43931

  https://access.redhat.com/knowledge/solutions/308583

Your best option is to shut down the cluster before the outage, or take steps to restart the cluster afterwards.

As to a quorum disk, we generally do not recommend it unless it's explicitly required, as it adds complexity to the cluster and also opens more ways in which a node can fail.  You can review the common use cases in which it might make sense to have a quorum device here:

  https://access.redhat.com/knowledge/techbriefs/how-optimally-configure-quorum-disk-red-hat-enterprise-linux-clustering-and-hig

In your situation, a quorum device would not help in the case of a network outage, nor does it sound like you specifically need one for other purposes, unless you want to maintain quorum even after losing 2 nodes (which comes along with its own set of problems). 

Using asymmetrical vote counts is another option, but we generally recommend against this because you are still dealing with some randomness in whether you will maintain quorum or not.  For example, if you have 2 nodes on one switch and 2 nodes on another, and one side represents more votes than the other, if something happens to that side (power loss, etc.) then the other side can't maintain quorum.
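To make the asymmetry concrete, here is a small Python sketch of that two-switch example. The node names, the choice of which node carries 2 votes, and the switch placement are all assumptions made for illustration. It uses the same majority rule as before (floor(total / 2) + 1).

```python
def quorum_threshold(total_votes):
    """Simple majority: strictly more than half of the total votes."""
    return total_votes // 2 + 1


# Hypothetical layout: names, vote counts, and switch placement are invented.
VOTES = {"node-a": 2, "node-b": 1, "node-c": 1, "node-d": 1}
SWITCH_1 = {"node-a", "node-b"}  # side holding the 2-vote node (3 votes)
SWITCH_2 = {"node-c", "node-d"}  # lighter side (2 votes)


def is_quorate(surviving_nodes):
    """True if the surviving nodes hold enough votes to keep quorum."""
    total = sum(VOTES.values())
    alive = sum(VOTES[name] for name in surviving_nodes)
    return alive >= quorum_threshold(total)


print("switch 2 side lost, switch 1 side quorate:", is_quorate(SWITCH_1))
print("switch 1 side lost, switch 2 side quorate:", is_quorate(SWITCH_2))
```

Whether the cluster survives depends entirely on which side happens to fail: losing the lighter side is tolerated, while losing the heavier side leaves the remaining two nodes below the threshold.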

Hope this helps.

John Ruemker, RHCA

Senior Software Maintenance Engineer

Global Support Services

Red Hat, Inc.

Also, on the topic of "fence races" or "fence wars", a standard 4-node cluster with 1 vote each would not be susceptible to fence races.  If you were to have a split down the middle, where 2 nodes on each side could communicate with each other but not the other two, you'd have a loss of quorum (3/4 votes required).  When a node does not have quorum, it will not fence, and thus there would be no race here.

In a situation where 1 node was split from the rest, the 1 node would not have quorum and the other 3 would, so they would fence the 1 missing node and carry on.  No race here either.
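The two cases above can be checked exhaustively with a short Python sketch. It enumerates every two-way split of four 1-vote nodes and reports which side, if any, holds the 3 votes needed for quorum. The node names are placeholders, and the model only counts votes; it does not simulate membership or fencing itself.

```python
from itertools import combinations

NODES = ("a", "b", "c", "d")  # placeholder names, 1 vote each
QUORUM = len(NODES) // 2 + 1  # 3 of 4 votes


def two_way_splits(nodes):
    """Yield each way to split the nodes into two non-empty sides, once."""
    first, rest = nodes[0], nodes[1:]
    for size in range(len(rest) + 1):
        for others in combinations(rest, size):
            side_1 = (first,) + others
            side_2 = tuple(n for n in nodes if n not in side_1)
            if side_2:  # skip the "split" where nothing is separated
                yield side_1, side_2


def quorate_sides(split):
    """Return the sides of a split that hold enough votes for quorum."""
    return [side for side in split if len(side) >= QUORUM]


for split in two_way_splits(NODES):
    winners = quorate_sides(split)
    print(split, "-> quorate side:", winners[0] if winners else "none")
```

No split ever leaves both sides quorate, so at most one side is ever in a position to fence: a 2/2 split leaves nobody quorate, and a 1/3 split leaves only the 3-node side quorate.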

Some people prefer to allow 2 nodes to fail and have the cluster continue operating, in which case you could use a quorum device as explained in the link from my above comment.  This does open the possibility for fence races then, because 2 sides could maintain quorum simultaneously, so your heuristics *must* be designed to result in one side failing when there is a split.
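A deliberately simplified Python sketch of why the heuristics matter in that setup. It assumes a quorum device worth 1 vote (an illustrative value, chosen so that 2 nodes plus the device reach quorum) and models each side of a 2/2 split as either counting the device's vote or not. Real quorum-device behaviour involves timing and other details that this vote-counting model ignores.

```python
NODE_COUNT = 4
NODE_VOTES = 1
QDISK_VOTES = 1  # assumed value for illustration, not a recommendation

TOTAL = NODE_COUNT * NODE_VOTES + QDISK_VOTES  # 5 votes
QUORUM = TOTAL // 2 + 1                        # 3 votes


def side_is_quorate(nodes_on_side, counts_qdisk):
    """True if a side reaches quorum, optionally counting the device's vote."""
    votes = nodes_on_side * NODE_VOTES + (QDISK_VOTES if counts_qdisk else 0)
    return votes >= QUORUM


def split_outcome(side_1_counts_qdisk, side_2_counts_qdisk):
    """Outcome of a 2/2 split, given which sides still count the device."""
    quorate = [
        side_is_quorate(2, side_1_counts_qdisk),
        side_is_quorate(2, side_2_counts_qdisk),
    ]
    if all(quorate):
        return "both sides quorate: fence race possible"
    if any(quorate):
        return "one side quorate: it fences the other and carries on"
    return "no side quorate: cluster loses quorum"


print(split_outcome(True, True))    # heuristics pass on both sides
print(split_outcome(True, False))   # heuristics fail on one side
print(split_outcome(False, False))  # heuristics fail everywhere
```

Only the middle case gives a clean result, which is why the heuristics have to be chosen so that exactly one side keeps the device's vote during a split.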

-John