Recovering a failed service and handling a failed node fence

We recently deployed a Red Hat + Apache cluster and it is now in production, thanks in part to the help of the community members. However, we have run into two related issues that could be dangerous in the future.

The first issue is that if a service fails to stop for any reason, it is marked as failed and stays stopped. We had one incident where Apache stopped momentarily and the failure was detected, but when rgmanager tried to stop the service it failed to unmount one of the shared drives and as a result marked the service as failed. This is dangerous because the service stopped on one node and did not fail over to the other node, even though the other node was ready to accept it. So how can we force rgmanager to keep trying to fail the service over, and how can we force the service to stop on the first node?

The other issue is related but happened before deployment. When we configured the cluster, the fencing configuration was incorrect, so when we tested failover after a hard reset it failed because fencing failed. This has since been fixed and fencing is now working. However, if fencing fails again for any reason, or if the node gets stuck shutting down, we noticed that failover will not happen and we again end up in a situation where one node is down and the other node is ready to accept the service but cannot, because fencing has failed. So is there a way to force moving the service even if fencing has failed? By the way, we do not have shared storage (no GFS2); we use active/passive mode, so even if fencing fails there is no risk in starting the service on the other node.

Again I appreciate your support.

Responses

Regarding #1: if a service fails to stop, it will not be relocated to the other node. This is required to avoid any kind of corruption or having some of the processes running on both nodes at the same time.
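
Once the underlying problem (for example the filesystem that would not unmount) has been dealt with, the failed service normally has to be acknowledged by disabling it before it can be started again. A minimal sketch, assuming a hypothetical service name apache_svc and standby node node2:

    clustat                           # confirm the service is in the "failed" state
    clusvcadm -d apache_svc           # disable the service to clear the failed state
                                      # (check that mounts, IPs and httpd are really
                                      #  gone on the node where the stop failed)
    clusvcadm -e apache_svc -m node2  # enable the service on the standby node

For stop failures caused by a busy mount point, the fs resource also has force_unmount and self_fence options; review the resource agent documentation before relying on them.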

Regarding #2: a service will not be relocated to another cluster node if fencing does not succeed. The reason is much the same as for #1.

The best way forward is to find out why fencing is failing (or why the node is hanging) and correct it.
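
One way to narrow this down is to exercise the fence agent by hand, outside the cluster, and to watch what fenced logs while a real fence attempt runs. A rough sketch, with a hypothetical iLO address and credentials:

    # query the fence device directly, using the same agent the cluster is configured with
    fence_ilo -a 192.0.2.10 -l admin -p secret -o status

    # ask the cluster itself to fence a node (this really reboots/powers it off)
    fence_node node2.example.com

    # follow what fenced reports while the fence attempt runs
    tail -f /var/log/messages | grep -i fence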

Hi,

Regarding #2,
In order to provide fault tolerance for resources and application high availability, the cluster suite must be able to restrict access to clustered resources in the event of a node becoming errant. For example, should communication with a single node in the cluster fail, the other nodes must be able to restrict access to clustered resources such as shared storage so that the errant node cannot damage them. This is accomplished by means of a "fence device."

A fence device is a device or system external to the node that the cluster can use to restrict an errant node's access to shared resources. The most common fence devices are power fencing devices that allow the cluster to reboot or power off the errant node.

Fencing is a fundamental part of the Red Hat Cluster infrastructure, so it is important to validate that it is working properly and that there is enough redundancy (multiple fence devices configured) and resiliency (a reliable and available fencing mechanism) in place. Fencing is the component of Red Hat Cluster Suite that cuts off access to a resource (hard disk, etc.) from a node if it loses contact with the rest of the nodes in the cluster.

So, when a node loses contact with the other nodes in the same cluster, the cluster must be sure that the node has released its resources before taking any action that involves one of the resources used by the disconnected node. This cannot be accomplished by contacting the node itself, since the node is no longer responsive, so an external method is required. That external method is fencing. Without a fence device configured, we have no way to know that the resources previously used by the disconnected node have been released. If we do not configure fencing, the cluster may assume erroneously that the node has released its resources, and this can lead to data corruption and data loss. Without a fence device configured, data integrity cannot be guaranteed.

While fencing is in progress, no other cluster operation is allowed to run. This means that every operation performed on cluster services will freeze until the fence operation is completed, and the service will not fail over. If fencing does not succeed because of a fault in the fence device and no backup fence device is configured, cluster operations remain frozen until the problem is resolved. This is expected behavior, to ensure that the errant node cannot access the shared storage (LVM) or filesystem (ext3/4/XFS/GFS/GFS2).
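
When a cluster appears stuck like this, a few commands can confirm that a pending fence operation is what is blocking it (command availability and output details vary between releases):

    cman_tool nodes    # cluster membership as cman sees it
    fence_tool ls      # fence domain state; a pending victim points to an
                       # outstanding (possibly failing) fence operation
    clustat            # member and service status from rgmanager
    grep -i fenced /var/log/messages | tail   # fenced's record of repeated fence attempts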

So it is suggested that you verify that fencing works in the event that a cluster node needs to be fenced off from the cluster. For detailed information on the importance of fencing, please refer to the following article:

"What is fencing and why is it important?"
 https://access.redhat.com/knowledge/solutions/15575

Cheers,
Milan.

In addition to the above explanations, I would like to add one more article, which explains a possible way to avoid this issue by configuring a backup fence device. With that in place, if the first fence device fails, the cluster node will try the second fence device, and if that succeeds the cluster can relocate the service to another node. Refer to the article below for more details and other possible ways to handle it (a minimal configuration sketch follows the link):

Topic: "How can I prevent my Red Hat Enterprise Linux cluster from repeatedly failing to fence a node while the fence device is not accessible?"
--> https://access.redhat.com/knowledge/solutions/16657
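
As a rough illustration only (node, device, agent, address and credential values below are hypothetical placeholders), a backup fence method in cluster.conf looks roughly like this; fenced tries the methods in the order they are listed:

    <clusternode name="node1.example.com" nodeid="1">
        <fence>
            <!-- primary fence method: the node's iLO -->
            <method name="primary">
                <device name="ilo-node1"/>
            </method>
            <!-- backup fence method: a switched PDU, tried only if iLO fencing fails -->
            <method name="backup">
                <device name="apc-pdu" port="1"/>
            </method>
        </fence>
    </clusternode>

    <!-- elsewhere in cluster.conf -->
    <fencedevices>
        <fencedevice name="ilo-node1" agent="fence_ilo" ipaddr="192.0.2.10" login="admin" passwd="secret"/>
        <fencedevice name="apc-pdu" agent="fence_apc" ipaddr="192.0.2.20" login="apc" passwd="secret"/>
    </fencedevices>

After editing cluster.conf, remember to bump config_version and propagate the change (for example with cman_tool version -r on RHEL 6), then test both methods.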

If all configured fence levels fail, fenced will continue to loop through all fence levels until one of them succeeds. This is the standard behaviour. While this is happening, the logs will fill with messages from fenced saying that the fence actions have failed.

In such a case the cluster will have its activities blocked, therefore GFS will be blocked, and no service failover can occur (although the services that are already running continue to run undisturbed). The fence action must eventually complete successfully so that the cluster unblocks itself and resumes operations.

This problem is more noticeable in clusters using a single fence method of the iLO or DRAC type, because in such cases, if power to the machine is lost, the other node will try to fence it in a never-ending loop: it will never be able to contact the iLO or DRAC on the target machine, since the machine has no power.
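
In that specific situation, once you have physically confirmed that the failed node really is powered off, the pending fence operation can be acknowledged manually so the cluster can carry on. The exact syntax varies between releases (check fence_ack_manual(8) on your systems), and the node name below is a placeholder:

    # Only after physically verifying the node is down -- acknowledging a node
    # that is in fact still running risks data corruption.
    fence_ack_manual -n node1.example.com    # older releases
    fence_ack_manual node1.example.com       # newer releases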

Hope this helps :)