Why did my entire cluster service failover after successfully recovering a resource in RHEL?


Environment

  • Red Hat Enterprise Linux (RHEL) 5 or 6 with the High Availability Add On
  • On RHEL 5: rgmanager prior to release 2.0.52-47.el5
  • On RHEL 6: rgmanager prior to release 3.0.12.1-17.el6
  • One or more cluster services in /etc/cluster/cluster.conf containing an <nfsclient>, <oracledb>, or <nfsexport> resource
    • If <nfsclient>, then allow_recover="1" (this is the default value if unspecified)
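
For reference, an affected service definition in cluster.conf might look like the following minimal sketch (resource names, device, and export path are placeholders; allow_recover="1" is shown explicitly here even though it is the default when unspecified):

        <service autostart="1" domain="d1" exclusive="0" name="nfs-svc" recovery="restart" nfslock="1">
            <fs device="/dev/myvg/mylv" name="my-fs" fsid="12345" mountpoint="/export" fstype="ext3">
                <nfsexport name="my-export">
                    <nfsclient name="my-client" target="*" options="rw,async,no_root_squash" allow_recover="1"/>
                </nfsexport>
            </fs>
        </service>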

Issue

  • An 'nfsclient' resource fails a status check and is recovered in place because the 'allow_recover' attribute defaults to '1', but even though the recovery succeeds, the entire service is still marked as failed and recovered:
Nov 14 15:11:25 node1 clurgmgrd: [1133]: <err> nfsclient:NFS_client3 is missing! 
Nov 14 15:11:25 node1 clurgmgrd[1133]: <notice> status on nfsclient "NFS_client3" returned 1 (generic error) 
Nov 14 15:11:25 node1 bash: [24099]: <info> Removing export: client3.example.com:/export 
Nov 14 15:11:25 node1 bash: [24099]: <info> Adding export: client3.test.com:/export (fsid=12345,rw,async,wdelay,no_root_squash) 
Nov 14 15:11:39 node1 clurgmgrd[1133]: <notice> Stopping service service:S01
  • My service still failed after successfully recovering an oracledb resource
  • My service still failed after successfully recovering an nfsexport resource

Resolution

  • For RHEL6 cluster nodes please update rgmanager to version 3.0.12.1-17.el6 or later as per Red Hat Errata RHBA-2013:0409-1.

  • For RHEL5 cluster nodes please update rgmanager to version 2.0.52-47.el5 or later as per Red Hat Errata RHBA-2013:1316-1.
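  • As a sketch of the update procedure (standard rpm/yum usage; a repository providing the fixed package is assumed to be configured), on each cluster node confirm the installed version and then update, verifying afterwards that the version is at or above the release listed for your RHEL version:
# rpm -q rgmanager
# yum update rgmanager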

Root Cause

  • A few resource types provided by rgmanager support a recover operation: if a status check on such a resource fails, rgmanager calls the agent's recover function, which attempts to restart just that resource without taking down and recovering the entire service (a simplified sketch of how an agent dispatches this action is shown below).

  • rgmanager contains a bug where a resource's last recorded status is not updated after the recovery operation. If that resource is not yet due for another status call on its agent, rgmanager reuses the stale (failed) status for subsequent checks, concludes the service has failed, and places the entire service into recovery.

  • This issue was corrected in RHEL 6 by Red Hat Bugzilla #879031.

  • This issue is being tracked in RHEL 5 by Red Hat Bugzilla #879029.
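
For illustration only, rgmanager resource agents (the scripts under /usr/share/cluster) are invoked with an action argument; a heavily simplified, hypothetical sketch of that dispatch (not the actual nfsclient.sh source) is:

        #!/bin/bash
        # Heavily simplified, hypothetical sketch of an rgmanager resource
        # agent's action dispatch; the real agents under /usr/share/cluster
        # contain far more logic. do_start/do_stop/do_status are stand-ins.
        do_start()  { echo "starting resource"; }
        do_stop()   { echo "stopping resource"; }
        do_status() { echo "checking resource"; }

        case "$1" in
            start)          do_start ;;
            stop)           do_stop ;;
            status|monitor) do_status ;;            # non-zero exit marks the resource as failed
            recover)        do_stop && do_start ;;  # restart only this resource, not the whole service
            *)              exit 1 ;;
        esac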

Diagnostic Steps

  • To reproduce, set up a service with an fs, nfsexport, and nfsclient similar to:
        <service autostart="1" domain="d1" exclusive="0" name="test" recovery="restart" nfslock="1">
            <ip address="192.168.143.88"/>
            <fs device="/dev/clust/lv1" name="clust-lv1" fsid="1234" mountpoint="/mnt/lv1" fstype="ext3" self_fence="0" force_fsck="0">
                <nfsexport name="export">
                    <nfsclient name="world" target="*" options="rw,sync,no_root_squash"/>
                </nfsexport>
            </fs>
        </service>
  • After starting the service, manually remove the export using exportfs (using the correct <host>:<path> parameters):
# exportfs -u *:/mnt/lv1
  • Watch /var/log/messages to see the resource being recovered successfully while the service is nonetheless stopped and recovered.
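For example, to follow the relevant messages while reproducing the issue (standard tail/grep usage; the daemon logs as clurgmgrd on RHEL 5 and rgmanager on RHEL 6):
# tail -f /var/log/messages | grep -E 'clurgmgrd|rgmanager'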
