Why did my entire cluster service failover after successfully recovering a resource in RHEL?
Environment
- Red Hat Enterprise Linux (RHEL) 5 or 6 with the High Availability Add On
- On RHEL 5: rgmanager prior to release 2.0.52-47.el5
- On RHEL 6: rgmanager prior to release 3.0.12.1-17.el6
- One or more cluster services in /etc/cluster/cluster.conf containing an <nfsclient>, <oracledb>, or <nfsexport> resource
- If <nfsclient>, then allow_recover="1" (this is the default value if unspecified)
Issue
- An 'nfsclient' resource fails and is recovered, thanks to the default value of '1' for the 'allow_recover' attribute, but even though the recovery succeeds the service is still stopped and recovered:
Nov 14 15:11:25 node1 clurgmgrd: [1133]: <err> nfsclient:NFS_client3 is missing!
Nov 14 15:11:25 node1 clurgmgrd[1133]: <notice> status on nfsclient "NFS_client3" returned 1 (generic error)
Nov 14 15:11:25 node1 bash: [24099]: <info> Removing export: client3.example.com:/export
Nov 14 15:11:25 node1 bash: [24099]: <info> Adding export: client3.test.com:/export (fsid=12345,rw,async,wdelay,no_root_squash)
Nov 14 15:11:39 node1 clurgmgrd[1133]: <notice> Stopping service service:S01
- My service still failed after successfully recovering an oracledb resource
- My service still failed after successfully recovering an nfsexport resource
Resolution
- For RHEL 6 cluster nodes, update rgmanager to version 3.0.12.1-17.el6 or later as per Red Hat Errata RHBA-2013:0409-1.
- For RHEL 5 cluster nodes, update rgmanager to version 2.0.52-47.el5 or later as per Red Hat Errata RHBA-2013:1316-1.
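For example, assuming the node is registered to the appropriate Red Hat update channels or repositories, the package can be checked and updated with the standard tools; the running rgmanager daemon then needs to be restarted (typically during a maintenance window) for the fix to take effect:
# rpm -q rgmanager
# yum update rgmanager
# rpm -q rgmanager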
Root Cause
- A few resource types provided by rgmanager support a recover operation: if a status check on that resource fails, rgmanager calls the agent's recover function, which attempts to restart that resource without taking down and recovering the entire service.
- rgmanager contains a bug in which a resource's last recorded status is not updated after the recover operation. If that resource is not yet due for a fresh status call, rgmanager may reuse the stale recorded status on subsequent checks, conclude that the resource has failed, and send the entire service into recovery.
- This issue was corrected in RHEL 6 by Red Hat Bugzilla #879031.
- This issue is being tracked in RHEL 5 by Red Hat Bugzilla #879029.
Diagnostic Steps
- To reproduce, set up a service with an fs, nfsexport, and nfsclient resource similar to:
<service autostart="1" domain="d1" exclusive="0" name="test" recovery="restart" nfslock="1">
<ip address="192.168.143.88"/>
<fs device="/dev/clust/lv1" name="clust-lv1" fsid="1234" mountpoint="/mnt/lv1" fstype="ext3" self_fence="0" force_fsck="0">
<nfsexport name="export">
<nfsclient name="world" target="*" options="rw,sync,no_root_squash"/>
</nfsexport>
</fs>
</service>
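Optionally, before starting the service, the resource tree for this configuration can be inspected with the rg_test utility shipped with rgmanager (the configuration path below assumes the default location):
# rg_test test /etc/cluster/cluster.conf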
- After starting the service, manually remove the export using exportfs (with the correct <host>:<path> parameters):
# exportfs -u *:/mnt/lv1
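To confirm the export was actually removed (and, after the agent's recovery, re-added), list the current export table; the path shown matches the example configuration above:
# exportfs -v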
- Watch /var/log/messages to see the resource recover successfully, but the service still stop and go into recovery.
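For example, the relevant messages can be followed live while reproducing; the exact message prefix differs between RHEL 5 (clurgmgrd) and RHEL 6 (rgmanager):
# tail -f /var/log/messages | grep -E 'clurgmgrd|rgmanager|nfsclient'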
