Why did my entire cluster service failover after successfully recovering a resource in RHEL?


Environment

  • Red Hat Enterprise Linux (RHEL) 5 or 6 with the High Availability Add On
  • On RHEL 5: rgmanager prior to release 2.0.52-47.el5
  • On RHEL 6: rgmanager prior to release 3.0.12.1-17.el6
  • One or more cluster services in /etc/cluster/cluster.conf containing an <nfsclient>, <oracledb>, or <nfsexport> resource
    • If <nfsclient>, then allow_recover="1" (this is the default value if unspecified)
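
For reference, an affected service definition in cluster.conf might look like the following minimal sketch (resource names, device, and export path are placeholders; allow_recover="1" is shown explicitly here even though it is the default when unspecified):

        <service autostart="1" domain="d1" exclusive="0" name="nfs-svc" recovery="restart" nfslock="1">
            <fs device="/dev/myvg/mylv" name="my-fs" fsid="12345" mountpoint="/export" fstype="ext3">
                <nfsexport name="my-export">
                    <nfsclient name="my-client" target="*" options="rw,async,no_root_squash" allow_recover="1"/>
                </nfsexport>
            </fs>
        </service>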

Issue

  • An 'nfsclient' resource fails a status check and is recovered in place because the 'allow_recover' attribute defaults to '1', but even though the recovery succeeds, the entire service is still marked as failed and recovered:
Nov 14 15:11:25 node1 clurgmgrd: [1133]: <err> nfsclient:NFS_client3 is missing! 
Nov 14 15:11:25 node1 clurgmgrd[1133]: <notice> status on nfsclient "NFS_client3" returned 1 (generic error) 
Nov 14 15:11:25 node1 bash: [24099]: <info> Removing export: client3.example.com:/export 
Nov 14 15:11:25 node1 bash: [24099]: <info> Adding export: client3.test.com:/export (fsid=12345,rw,async,wdelay,no_root_squash) 
Nov 14 15:11:39 node1 clurgmgrd[1133]: <notice> Stopping service service:S01
  • My service still failed after successfully recovering an oracledb resource
  • My service still failed after successfully recovering an nfsexport resource

Resolution

  • For RHEL6 cluster nodes please update rgmanager to version 3.0.12.1-17.el6 or later as per Red Hat Errata RHBA-2013:0409-1.

  • For RHEL5 cluster nodes please update rgmanager to version 2.0.52-47.el5 or later as per Red Hat Errata RHBA-2013:1316-1.
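  • As a sketch of the update procedure (standard rpm/yum usage; a repository providing the fixed package is assumed to be configured), on each cluster node confirm the installed version and then update, verifying afterwards that the version is at or above the release listed for your RHEL version:
# rpm -q rgmanager
# yum update rgmanager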

Root Cause

  • A few resource types provided by rgmanager support a recover operation: if a status check on such a resource fails, rgmanager calls the agent's recover function, which attempts to restart just that resource without taking down and recovering the entire service (a simplified sketch of how an agent dispatches this action is shown below).

  • rgmanager contains a bug where a resource's last recorded status is not updated after the recovery operation. If that resource is not yet due for another status call on its agent, rgmanager reuses the stale (failed) status for subsequent checks, concludes the service has failed, and places the entire service into recovery.

  • This issue was corrected in RHEL 6 by Red Hat Bugzilla #879031.

  • This issue is being tracked in RHEL 5 by Red Hat Bugzilla #879029.
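
For illustration only, rgmanager resource agents (the scripts under /usr/share/cluster) are invoked with an action argument; a heavily simplified, hypothetical sketch of that dispatch (not the actual nfsclient.sh source) is:

        #!/bin/bash
        # Heavily simplified, hypothetical sketch of an rgmanager resource
        # agent's action dispatch; the real agents under /usr/share/cluster
        # contain far more logic. do_start/do_stop/do_status are stand-ins.
        do_start()  { echo "starting resource"; }
        do_stop()   { echo "stopping resource"; }
        do_status() { echo "checking resource"; }

        case "$1" in
            start)          do_start ;;
            stop)           do_stop ;;
            status|monitor) do_status ;;            # non-zero exit marks the resource as failed
            recover)        do_stop && do_start ;;  # restart only this resource, not the whole service
            *)              exit 1 ;;
        esac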

Diagnostic Steps

  • To reproduce, set up a service with an fs, nfsexport, and nfsclient similar to:
        <service autostart="1" domain="d1" exclusive="0" name="test" recovery="restart" nfslock="1">
            <ip address="192.168.143.88"/>
            <fs device="/dev/clust/lv1" name="clust-lv1" fsid="1234" mountpoint="/mnt/lv1" fstype="ext3" self_fence="0" force_fsck="0">
                <nfsexport name="export">
                    <nfsclient name="world" target="*" options="rw,sync,no_root_squash"/>
                </nfsexport>
            </fs>
        </service>
  • After starting the service, manually remove the export using exportfs (using the correct <host>:<path> parameters):
# exportfs -u *:/mnt/lv1
  • Watch /var/log/messages to see the resource being recovered successfully while the service is nonetheless stopped and recovered.
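For example, to follow the relevant messages while reproducing the issue (standard tail/grep usage; the daemon logs as clurgmgrd on RHEL 5 and rgmanager on RHEL 6):
# tail -f /var/log/messages | grep -E 'clurgmgrd|rgmanager'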
