Service containing nfsexport resource fails to unmount fs resource even with force_unmount="1" in a RHEL 5 or 6 High Availability cluster
Environment
- Red Hat Enterprise Linux (RHEL) 5 or 6 with the High Availability Add On
- rgmanager
- One or more services in /etc/cluster/cluster.conf containing an nfsexport resource
- NFS is configured to NOT serve NFS version 4 to clients
- If the server is serving NFS version 4, see the similar issue that covers such configurations
Issue
- Stopping or relocating a clustered NFS service with an ext3 file system fails with errors such as the following:

clurgmgrd: <notice> Stopping service service:NFS
clurgmgrd: <info> Removing export: :/nfs
clurgmgrd: <info> unmounting /nfs
clurgmgrd: <err> 'umount /nfs' failed, error=0
clurgmgrd: <crit> #12: RG service:NFS failed to stop; intervention required
clurgmgrd: <notice> Service service:NFS is failed
Resolution
There are several possible reasons why an fs resource may fail to unmount, resulting in a service failure. For services containing nfsexport and nfsclient resources, using a configuration where the IP stops before the nfsclient may correct this issue. One example of this type of configuration is as follows:
<service autostart="1" exclusive="0" name="nfs" recovery="relocate" nfslock="1">
  <ip address="192.168.2.5/24" monitor_link="1"/>
  <fs device="/dev/clust/lv1" force_fsck="0" force_unmount="1" fsid="5020" fstype="ext3" mountpoint="/nfs" name="lv1" options="defaults" self_fence="0">
    <nfsexport name="nfs_export">
      <nfsclient name="nfs-world" target="*" options="rw,sync"/>
    </nfsexport>
  </fs>
</service>
Notice that the IP is not a parent of the nfsexport or nfsclient. rgmanager's built-in stop order dictates that the ip will stop before the fs when stopping the service, achieving the desired result. There are other configurations that may accomplish the same thing (such as putting the ip as a child of the nfsclient), but the above is the recommended layout.
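To confirm the order in which rgmanager will start and stop the resources for a given layout, the rg_test utility shipped with rgmanager can perform a dry run against the configuration. A minimal sketch, assuming the service name "nfs" from the example above; available subcommands can vary slightly between rgmanager versions:

# Display the resource tree as rgmanager interprets the configuration
rg_test test /etc/cluster/cluster.conf

# Dry-run a stop of the "nfs" service to see the order in which resources
# would be stopped, without actually running the resource agents
rg_test noop /etc/cluster/cluster.conf stop service nfs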
In situations where the above does not solve the problem, consider also adding force_unmount="1" and self_fence="1" to the fs resource, to account for unexpected situations where it fails to stop.
- force_unmount will cause the fs resource agent to attempt to kill any processes using the file system when stopping, in an effort to free it up.
- self_fence will cause the node to reboot itself if the fs resource fails to stop. This will free up the fs resource, allowing another node to pick up the service.
Example of both:
<fs device="/dev/clust/lv1" force_fsck="0" force_unmount="1" fsid="5020" fstype="ext3" mountpoint="/nfs" name="lv1" options="defaults" self_fence="1">
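After editing /etc/cluster/cluster.conf, the change must be validated and propagated to the other cluster nodes. A hedged outline of the usual steps; the exact tooling differs between RHEL 5 and RHEL 6, so adjust for your environment:

# Increment config_version in /etc/cluster/cluster.conf first, then:

# RHEL 6: validate the configuration and push it to all cluster members
ccs_config_validate
cman_tool version -r

# RHEL 5: distribute the updated configuration with
ccs_tool update /etc/cluster/cluster.conf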
Root Cause
As mentioned previously, there are several potential reasons why this could happen. rgmanager has had a few bugs in the past that could result in processes not being killed correctly with force_unmount="1", preventing the fs from being unmounted (all fixed in recent versions). In other situations, processes using the fs could not be killed for one reason or another (this is uncommon), but self_fence should account for that failure.
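For context, when force_unmount="1" is set, the stop path of the fs resource agent behaves roughly like the following manual sequence (a simplified sketch, not the agent's exact logic; /nfs matches the mountpoint used in the examples here):

# List any processes holding the mount point open
fuser -vm /nfs

# Kill the processes using the file system, then unmount it
fuser -km /nfs
umount /nfs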
However, a common problem is for there to be a misconfiguration in the service resulting in the nfsclient stopping before the ip resource. This happens most commonly when the ip resource is a parent of all the other resources, like so:
<service autostart="1" exclusive="0" name="nfs" recovery="relocate" nfslock="1">
  <ip address="192.168.2.5/24" monitor_link="1">
    <fs device="/dev/clust/lv1" force_fsck="0" force_unmount="1" fsid="5020" fstype="ext3" mountpoint="/nfs" name="lv1" options="defaults" self_fence="0">
      <nfsexport name="nfs_export">
        <nfsclient name="nfs-world" target="*" options="rw,sync"/>
      </nfsexport>
    </fs>
  </ip>
</service>
This is problematic for two reasons:
a) For NFS clients that are accessing the export, there is a short period of time where the IP is still available but the export no longer exists. This means they'll receive permission denied, stale file handle, or other I/O errors when accessing that share before the IP is brought down.
b) If the nfs export is removed while clients are still directly accessing it, then the file system may fail to unmount, even though lsof shows no open file handles.
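In that case, user-space tools come up empty because the reference is held by the in-kernel NFS server rather than by a process. A rough diagnostic sketch, using the /nfs mountpoint from the examples above:

# User-space checks show nothing holding the mount point...
lsof /nfs
fuser -vm /nfs

# ...but the kernel NFS server may still have the export active,
# which can keep the file system busy (the kernel's export table is
# visible under /proc/fs/nfs/exports on most kernels)
exportfs -v
cat /proc/fs/nfs/exports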
Rearranging the service so that the ip resource stops before the nfsclient solves both of these problems. For issue a), if the ip stops before the nfsclient, then instead of reaching the IP address and getting access errors, NFS clients will fail to make a connection to that IP at all. While they are retrying, the service should be relocating to another node or restarting, and when the IP address returns the NFS export will already be set up and listening on that IP, allowing the clients to continue accessing where they left off. It also solves problem b), for reasons that are not clear at the time of this writing.
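On the client side, this relies on standard NFS hard-mount retry semantics: while the service IP is unreachable, I/O is retried rather than failed back to the application. A hedged example of a typical NFSv3 client mount against the service IP from the examples; the mount options and the /mnt/nfs client mountpoint are illustrative only:

# Hard-mount the clustered export so that I/O retries while the
# service IP is relocating, instead of returning errors to applications
mount -t nfs -o vers=3,hard,timeo=600,retrans=2 192.168.2.5:/nfs /mnt/nfs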
