fs or clusterfs resource fails to stop when a process has its current working directory (cwd) within the resource's mountpoint in a RHEL 6 High Availability cluster

Solution Unverified - Updated May 28 2015 at 3:40 PM

Environment

- Red Hat Enterprise Linux (RHEL) 6 with the High Availability Add-On
- resource-agents releases starting with 3.9.2-40.el6 up to, but not including, 3.9.2-40.el6_5.5
- One or more <fs/>, <clusterfs/>, or <netfs/> resources in a service in /etc/cluster/cluster.conf
- One or more processes that change directory to, or set their current working directory to, a location on the mountpoint of one of those resources

Issue

- My fs resource fails to stop, even though I have force_unmount enabled.
- If a process has its cwd on the mountpoint of a cluster-managed fs or clusterfs resource, rgmanager cannot stop that resource and the node self-fences:

Mar 08 18:21:04 rgmanager Stopping service service:myService
Mar 08 18:21:26 rgmanager [fs] unmounting /myFS
Mar 08 18:21:26 rgmanager [fs] umount failed: 1
Mar 08 18:21:26 rgmanager [fs] Sending SIGTERM to processes on /myFS
Mar 08 18:21:31 rgmanager [fs] unmounting /myFS
Mar 08 18:21:31 rgmanager [fs] umount failed: 1
Mar 08 18:21:31 rgmanager [fs] Sending SIGKILL to processes on /myFS
Mar 08 18:21:36 rgmanager [fs] unmounting /myFS
Mar 08 18:21:36 rgmanager [fs] umount failed: 1
Mar 08 18:21:37 rgmanager [fs] Sending SIGKILL to processes on /myFS
Mar 08 18:21:37 rgmanager [fs] 'umount /myFS' failed, error=1
Mar 08 18:21:37 rgmanager [fs] umount failed - REBOOTING

Resolution

Update to resource-agents-3.9.2-40.el6_5.5 or later, or to resource-agents-3.9.5-12.el6 or later.
Also see the general recommendations for preventing a file-system-based resource from failing to stop.
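
To check whether a node is running an affected build, the installed release can be compared against the range given in the Environment section. The following is a minimal sketch, not a Red Hat supplied tool. The function names are hypothetical, and it assumes z-stream builds follow the 3.9.2-40.el6_5.N naming pattern, so verify the result against the erratum if in doubt.

```shell
# Succeeds (exit 0) if the given VERSION-RELEASE string falls in the affected
# range stated in this article: 3.9.2-40.el6 up to, but not including,
# 3.9.2-40.el6_5.5.
# Assumption: z-stream builds are named 3.9.2-40.el6_5.N.
is_affected_release() {
    case "$1" in
        3.9.2-40.el6|3.9.2-40.el6_5.[0-4]) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: query the installed package and report the result.
report_installed_release() {
    # rpm exits non-zero if the package is not installed.
    ra_vr=$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' resource-agents) || return 2
    if is_affected_release "$ra_vr"; then
        echo "resource-agents-$ra_vr is in the affected range; update it"
    else
        echo "resource-agents-$ra_vr is outside the affected range"
    fi
}
```

Run `report_installed_release` on each cluster node; nothing is queried until the function is called.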

Root Cause

Red Hat resolved this issue in Bugzilla #1051115 for RHEL 6 Update 6, and in Bugzilla #1051185 for RHEL 6 Update 5 through an asynchronous erratum.

RHEL 6 Update 5 (resource-agents-3.9.2-40.el6) changed the file system utility library shared by several resource agents (fs, clusterfs, netfs). The change altered how those agents detect processes that are using the mountpoint in question and how they kill those processes when force_unmount is set. It was needed to address a separate issue in which a stop operation on one of these resource types could block if there was an unresponsive NFS mount anywhere on the system.

The change also introduced a bug: the resource agent did not detect or kill processes that had no files open on the mountpoint but only had their current working directory ("cwd") there. As a result, if the file system could not be unmounted during a stop operation because such a process was still using the mountpoint, the process might not be killed, and the <fs/>, <clusterfs/>, or <netfs/> resource could fail to stop even though force_unmount was enabled. If self_fence is enabled, this failure to stop causes the node to reboot itself.
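
The distinction between a process that holds files open on a mountpoint and one that merely has its cwd there can be illustrated by reading the /proc/<pid>/cwd symlinks directly. The sketch below is an illustration only and is not how the resource agent itself detects processes. The function names are hypothetical, and the prefix match does not account for other file systems mounted beneath the mountpoint.

```shell
# Succeeds if path $2 is the directory $1 itself or lies beneath it.
path_is_under() {
    piu_mnt=${1%/}      # drop one trailing slash so "/myFS/" behaves like "/myFS"
    case "$2" in
        "$piu_mnt"|"$piu_mnt"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints the PID of every process whose cwd is on or below the mountpoint $1.
# Run as root so that the cwd links of all processes are readable.
pids_with_cwd_under() {
    for pcu_link in /proc/[0-9]*/cwd; do
        # A process may exit between the glob and the readlink; skip it.
        pcu_cwd=$(readlink "$pcu_link" 2>/dev/null) || continue
        if path_is_under "$1" "$pcu_cwd"; then
            pcu_pid=${pcu_link#/proc/}
            echo "${pcu_pid%/cwd}"
        fi
    done
}
```

For example, `pids_with_cwd_under /myFS` lists the processes that would keep /myFS busy through their cwd alone, whether or not they hold any file open there.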

A similar issue, described in a separate solution, was later discovered affecting processes that use shared memory backed by the mountpoint that the resource manages.

Diagnostic Steps

While the <fs/>, <clusterfs/>, or <netfs/> resource is started, run lsof and look for any process whose FD column shows cwd and whose NAME column shows a directory within the resource's mountpoint. For example:
```
COMMAND  PID   USER  FD   TYPE  DEVICE  SIZE/OFF  NODE  NAME
myApp    4400  root  cwd  DIR   253,0       4096     2  /myFS
```
- For each process found, look through the rest of the lsof output for other entries showing that the process has a file open somewhere on that mountpoint.
- If a process has its cwd on that mountpoint but holds no other file open there, the resource is susceptible to failing to stop because of this issue.
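
The manual check above can be approximated with a small filter over lsof output. This is a rough sketch under stated assumptions: the function name is hypothetical, it expects the default lsof column layout shown in the example (PID in column 2, FD in column 4, NAME last), and it will misread rows whose NAME contains spaces or whose columns are shifted by a blank field.

```shell
# Reads default-format lsof output on stdin and prints the PIDs that have a
# cwd entry on the mountpoint $1 but no other entry (open file, txt, mem,
# and so on) on that mountpoint.
cwd_only_pids() {
    awk -v mnt="${1%/}" '
        # True if path p is the mountpoint itself or lies beneath it.
        function on_mount(p) { return p == mnt || index(p, mnt "/") == 1 }
        on_mount($NF) {
            if ($4 == "cwd") cwd[$2] = 1
            else other[$2] = 1
        }
        END {
            for (pid in cwd)
                if (!(pid in other)) print pid
        }
    '
}
```

A typical invocation would be `lsof 2>/dev/null | cwd_only_pids /myFS`. Any PID it prints matches the susceptible pattern described in the steps above.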

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
