Recovering a failed service and handling a failed node fence

We recently deployed a Red Hat + Apache cluster and it is now in production, thanks in part to the help of the community members. However, we have run into two related issues that could be dangerous in the future.

The first issue is that if a service fails to stop for any reason, it is marked as failed and stays stopped. We had one incident where Apache stopped momentarily and the failure was detected, but when rgmanager tried to stop the service it failed to unmount one of the shared drives and therefore marked the whole service as failed. This is dangerous because the service stopped on one node and did not fail over to the other node, even though the other node was ready to accept it. How can we make rgmanager keep trying to fail the service over, and how can we force the service to stop on the first node? Our current configuration is sketched below for context.
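To make the setup concrete, here is a minimal sketch of the kind of service definition we are using; the resource names, device path, mount point and IP address are placeholders rather than our real values:

    <!-- Sketch only: names, paths and addresses are placeholders -->
    <rm>
      <failoverdomains>
        <failoverdomain name="apache_domain" ordered="0" restricted="1">
          <failoverdomainnode name="node1"/>
          <failoverdomainnode name="node2"/>
        </failoverdomain>
      </failoverdomains>
      <resources>
        <ip address="192.168.1.100" monitor_link="1"/>
        <fs name="shared_fs" device="/dev/mapper/shared_lv" mountpoint="/var/www/html"
            fstype="ext4" force_unmount="1"/>
        <apache name="httpd" server_root="/etc/httpd" config_file="conf/httpd.conf"/>
      </resources>
      <service name="web" domain="apache_domain" autostart="1" recovery="relocate">
        <ip ref="192.168.1.100"/>
        <fs ref="shared_fs"/>
        <apache ref="httpd"/>
      </service>
    </rm>

In particular, we would like to know whether options such as force_unmount (or self_fence) on the fs resource, together with recovery="relocate" on the service, are the intended way to make rgmanager kill whatever is holding the mount and relocate the service instead of marking it failed.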

The second issue is related but happened before deployment. When we first configured the cluster, the fencing configuration was incorrect, so when we tested failover after a hard reset it failed because fencing failed. That has been fixed and fencing now works. However, if fencing fails again for any reason, or if a node hangs on shutdown, we have noticed that failover will not happen, and we again end up with one node down and the other node ready to accept the service but unable to, because fencing has failed. Is there a way to force moving the service even if fencing has failed? Note that we do not use shared storage (GFS2); we run in Active/Passive mode, so even if fencing fails there is no risk in starting the service on the other node.
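For reference, our fencing is set up along these lines, with one IPMI-style power device per node; the agent, addresses and credentials below are placeholders, not our real configuration:

    <!-- Sketch only: agent, IPs and credentials are placeholders -->
    <clusternodes>
      <clusternode name="node1" nodeid="1">
        <fence>
          <method name="1">
            <device name="fence_node1"/>
          </method>
        </fence>
      </clusternode>
      <clusternode name="node2" nodeid="2">
        <fence>
          <method name="1">
            <device name="fence_node2"/>
          </method>
        </fence>
      </clusternode>
    </clusternodes>
    <fencedevices>
      <fencedevice name="fence_node1" agent="fence_ipmilan" ipaddr="192.168.1.201" login="admin" passwd="password"/>
      <fencedevice name="fence_node2" agent="fence_ipmilan" ipaddr="192.168.1.202" login="admin" passwd="password"/>
    </fencedevices>

What we are really asking is whether, when the fence device above is unreachable but we know the node is really down, there is a supported manual override (for example, is fence_ack_manual intended for this situation?) that lets the cluster proceed so the service can be relocated.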

Again, I appreciate your support.

Responses