Why did my entire cluster service failover after successfully recovering a resource in RHEL?
Environment
- Red Hat Enterprise Linux (RHEL) 5 or 6 with the High Availability Add-On
- On RHEL 5: rgmanager prior to release 2.0.52-47.el5
- On RHEL 6: rgmanager prior to release 3.0.12.1-17.el6
- One or more cluster services in /etc/cluster/cluster.conf containing an <nfsclient>, <oracledb>, or <nfsexport> resource
- If <nfsclient>, then allow_recover="1" (this is the default value if unspecified)
Issue
- An 'nfsclient' resource fails its status check and is recovered (the 'allow_recover' attribute defaults to '1'), but even though the recovery succeeds, the whole service is still treated as failed and recovered:
Nov 14 15:11:25 node1 clurgmgrd: [1133]: <err> nfsclient:NFS_client3 is missing!
Nov 14 15:11:25 node1 clurgmgrd[1133]: <notice> status on nfsclient "NFS_client3" returned 1 (generic error)
Nov 14 15:11:25 node1 bash: [24099]: <info> Removing export: client3.example.com:/export
Nov 14 15:11:25 node1 bash: [24099]: <info> Adding export: client3.test.com:/export (fsid=12345,rw,async,wdelay,no_root_squash)
Nov 14 15:11:39 node1 clurgmgrd[1133]: <notice> Stopping service service:S01
- My service still failed after successfully recovering an oracledb resource
- My service still failed after successfully recovering an nfsexport resource
Resolution
- For RHEL 6 cluster nodes, update rgmanager to version 3.0.12.1-17.el6 or later, as per Red Hat Errata RHBA-2013:0409-1.
- For RHEL 5 cluster nodes, update rgmanager to version 2.0.52-47.el5 or later, as per Red Hat Errata RHBA-2013:1316-1.
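To confirm which rgmanager build a node is currently running, and to pull in the fixed package from its configured repositories (assuming the node is registered for updates; follow your normal cluster maintenance procedure when updating a node):

# rpm -q rgmanager
# yum update rgmanager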
Root Cause
- A few resource types provided by rgmanager support a recover operation: if a status check on such a resource fails, rgmanager calls the recover function for it, which attempts to restart that one resource without having to take down and recover the entire service.
- rgmanager contains a bug that results in a resource's last recorded status not being updated after the recover operation. In some situations rgmanager reuses this stale recorded status for subsequent status checks (when the resource is not yet due for another status call on its agent), concludes that the resource has failed, and puts the entire service into recovery.
- This issue was corrected in RHEL 6 by Red Hat Bugzilla #879031.
- This issue is being tracked in RHEL 5 by Red Hat Bugzilla #879029.
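For illustration only, the sketch below shows the general shape of a resource agent's recover action restarting just the one resource; the helper functions are hypothetical placeholders and this is not the actual rgmanager agent code:

#!/bin/sh
# Hypothetical resource-agent sketch, not the actual rgmanager nfsclient agent.
# do_start/do_stop/do_status stand in for the agent's real actions.
do_start()  { echo "starting resource";  }
do_stop()   { echo "stopping resource";  }
do_status() { echo "checking resource";  }

case "$1" in
    start)          do_start ;;
    stop)           do_stop ;;
    status|monitor) do_status ;;
    recover)
        # "recover" restarts only this one resource, leaving the rest of the
        # service untouched. The bug described above meant the resource's
        # cached status was not refreshed after a successful recover, so a
        # later status check could still see the old failure and send the
        # whole service into recovery.
        do_stop && do_start
        ;;
esac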
Diagnostic Steps
- To reproduce, set up a service with an fs, nfsexport, and nfsclient similar to:
<service autostart="1" domain="d1" exclusive="0" name="test" recovery="restart" nfslock="1">
<ip address="192.168.143.88"/>
<fs device="/dev/clust/lv1" name="clust-lv1" fsid="1234" mountpoint="/mnt/lv1" fstype="ext3" self_fence="0" force_fsck="0">
<nfsexport name="export">
<nfsclient name="world" target="*" options="rw,sync,no_root_squash"/>
</nfsexport>
</fs>
</service>
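Before starting the test, the configuration can be sanity-checked with the standard cluster tooling; on RHEL 6 something like the following is typically available (exact tooling varies by release):

# ccs_config_validate
# rg_test test /etc/cluster/cluster.conf

The first command validates cluster.conf against the cluster schema, and the second prints the resource tree as rgmanager will evaluate it.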
- After starting the service, manually remove the export using exportfs (using the correct <host>:<path> parameters):
# exportfs -u *:/mnt/lv1
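If you are unsure which <host>:<path> pair the service is exporting, list the current export table first:

# exportfs -v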
- Watch /var/log/messages to see the resource being recovered, but the service still being stopped and recovered afterwards.
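One way to follow this in real time, using the same log facilities that appear in the messages above:

# tail -f /var/log/messages | grep -E 'clurgmgrd|rgmanager|nfsclient'

The service state before and after the recovery can also be checked with clustat.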