How do POSIX fcntl locks work on GFS2?


Introduction

POSIX fcntl locks are locks that are accessed from the fcntl(2) system call with the F_GETLK, F_SETLK and F_SETLKW commands. They provide an advisory locking API that is used by a number of applications, including those running on NFS, since NFS does not support flock(2). The advisory nature of the locking means that there is nothing to prevent a process from accessing a file without taking a lock of the appropriate kind first. It relies on the assumption that all applications that need to access a particular file cooperate when requesting locks of this type. That is, they only perform read operations under a read lock, and read or write operations under a write lock.
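
For illustration, here is a minimal C sketch (not taken from this article; the file path is hypothetical) of acquiring an exclusive write lock on a byte range with F_SETLKW and then releasing it:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/gfs2/data.db", O_RDWR);   /* hypothetical path */
    if (fd < 0) { perror("open"); return 1; }

    struct flock fl = {
        .l_type   = F_WRLCK,     /* exclusive (write) lock */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 100,         /* lock bytes 0..99 only */
    };

    /* F_SETLKW blocks until the range is free; F_SETLK would fail instead. */
    if (fcntl(fd, F_SETLKW, &fl) < 0) { perror("fcntl(F_SETLKW)"); return 1; }

    /* ... read or write the locked range here ... */

    fl.l_type = F_UNLCK;         /* release the range */
    fcntl(fd, F_SETLK, &fl);
    close(fd);
    return 0;
}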

The operation of POSIX fcntl locks is not intuitive, and care must be taken not to accidentally unlock them prematurely. This can happen if any file descriptor held by the process on the file in question is closed, whether or not it was the file descriptor on which the lock was originally granted. POSIX fcntl locks will also be unlocked if the process holding them exits. POSIX fcntl locks are not inherited across the fork(2) system call.
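
The following sketch (again illustrative, with a hypothetical path and error checking omitted for brevity) shows the close(2) pitfall described above: closing a second descriptor for the same file silently releases a lock taken through the first descriptor:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd1 = open("/mnt/gfs2/data.db", O_RDWR);  /* hypothetical path */
    int fd2 = open("/mnt/gfs2/data.db", O_RDWR);  /* second fd, same file */

    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    fcntl(fd1, F_SETLKW, &fl);   /* whole-file write lock taken via fd1 */

    close(fd2);                  /* releases the lock held via fd1 as well */

    /* At this point the process no longer holds the fcntl lock, even though
     * fd1 is still open. The lock would also be lost when the process exits,
     * and a child created with fork() never inherits it. */
    close(fd1);
    return 0;
}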

GFS2 Implementation

In GFS2, POSIX fcntl locks are largely implemented in user space. In older GFS2 versions this is done in the gfs_controld daemon; in more recent versions the same code lives in dlm_controld. The code was moved from gfs_controld to dlm_controld so that OCFS2 could share the same POSIX fcntl lock implementation. The implementation uses corosync's ring-based messaging to ensure that all nodes see the same fcntl POSIX lock requests in the same order. The DLM itself is not involved in the POSIX fcntl lock subsystem: the only connection is that the code now lives in dlm_controld, and the rest of the DLM is entirely separate from the POSIX fcntl lock code. The DLM is not used directly to implement fcntl POSIX locks because it provides neither a range locking API nor the hierarchical locking that would allow a range locking API to be built on top of it.

Options for cluster.conf

There are a number of options in the configuration file /etc/cluster/cluster.conf that control certain aspects of POSIX fcntl locks in GFS2. The most important of these is plock_rate_limit=, an attribute of the <dlm> tag (or, for RHEL 5, the <gfs_controld> tag) in /etc/cluster/cluster.conf. If plock_rate_limit= is set to 0, the rate at which fcntl POSIX locks may be granted is unlimited (this is the recommended setting). Otherwise, the number sets the maximum number of fcntl POSIX locks granted per second. The default value of plock_rate_limit= is 100 locks/sec in RHEL 5, so most users of POSIX locks will want to change this setting. In RHEL 6 the default was changed to 0 (unlimited), so no adjustment is necessary.

The plock_ownership= option turns on caching of fcntl POSIX locks, which provides a performance advantage when a single node makes repeated use of the same lock with few requests from other nodes in the cluster. By default it is set to 0 (disabled). There are further options to tune this lock cache, described in the man pages for gfs_controld (RHEL 5) or dlm_controld (RHEL 6 and later).

A snippet of a /etc/cluster/cluster.conf for RHEL 5 with these options configured:

<?xml version="1.0"?>
<cluster config_version="42" name="rh5nodesThree">
  ....
  <gfs_controld plock_ownership="1" plock_rate_limit="0"/>
</cluster>

A snippet of a /etc/cluster/cluster.conf for RHEL 6 with these options configured:

<?xml version="1.0"?>
<cluster config_version="42" name="rh6nodesThree">
  ....
  <dlm plock_ownership="1" plock_rate_limit="0"/>  
</cluster>

Options for pacemaker

The same options that exist in cluster.conf are available in Pacemaker and are added to the dlm resource. An example is shown in the GFS2 documentation for RHEL 7. The command below is the same as in that documentation, except that the option args="--plock_ownership 1 --plock_rate_limit 0" is added to the controld resource.

# pcs resource create dlm ocf:pacemaker:controld args="--plock_ownership 1  --plock_rate_limit 0"  op monitor interval=30s on-fail=fence clone interleave=true ordered=true

Mount options

The localflocks mount option tells GFS2 not to handle fcntl POSIX locks or make flock locks cluster-wide. Instead, it lets the VFS handle these locks as it would on a local filesystem. If you mount your GFS2 file system using lock_nolock as the locking protocol, this is the case anyway, so the mount option is only effective for lock_dlm mounts.

Using localflocks requires an audit of all applications using the filesystem to ensure that they will not cause data corruption when it is enabled. The performance gain from making locking local is considerable. Normally the only reason to use localflocks is when exporting a GFS2 filesystem over NFS, since most applications that use fcntl POSIX locking also require the locking to be cluster-wide rather than node-local.

As noted, localflocks is a mount option, and there is no option in /etc/cluster/cluster.conf to enable it on all GFS2 filesystems in the cluster. The option is enabled from the command line with the -o option of the mount command, or as a mount option of a clusterfs resource in the /etc/cluster/cluster.conf file.

This is an example of mounting a GFS2 filesystem with the localflocks option from the command line:

# mount -o rw,noatime,nodiratime,localflocks -t gfs2 /dev/mapper/myGFS2-lv1 /mnt/gfs2

Here is an example of a clusterfs resource in a /etc/cluster/cluster.conf adding the localflocks option to a GFS2 filesystem:

<?xml version="1.0"?>
<cluster config_version="42" name="rh5nodesThree">
  ....
  <rm>
   <failoverdomains/>
   <resources>
     <clusterfs device="/dev/mapper/myGFS2-lv1" name="myGFS2" force_umount="1" fsid="1003"
                fstype="gfs2" mountpoint="/mnt/gfs2" 
                options="rw,noatime,nodiratime,localflocks" self_fence="0"/>
   </resources>
  </rm>
</cluster>

Here is an example of adding the localflocks option to a GFS2 filesystem in Pacemaker from the command line:

# pcs resource create clusterfs Filesystem device="/dev/cluster_vg/cluster_lv" directory="/var/mountpoint" fstype="gfs2" "options=noatime,localflocks" op monitor interval=10s on-fail=fence clone interleave=true

Deviations from POSIX

There are a couple of issues that must be taken into account when using fcntl locks with GFS2, where the implementation differs from the POSIX standard. The first is a consequence of how the interface is defined and relates to the F_GETLK command. This command returns the process identifier (PID) of a process that is blocking access to the file in question. The obvious intent is that a signal could then be sent to that process. The problem in the clustered case is that the PID might be on any node in the cluster. This, unfortunately, makes the interface less than useful when running a clustered filesystem. Applications that rely on this behavior will not work unmodified on GFS2.
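
A short sketch (illustrative only, with a hypothetical path) of an F_GETLK query; the caveat on GFS2 is that the returned l_pid may belong to a process on another node, so it cannot simply be signalled locally:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/gfs2/data.db", O_RDWR);   /* hypothetical path */
    if (fd < 0) { perror("open"); return 1; }

    /* Ask whether a whole-file write lock could be granted right now. */
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    if (fcntl(fd, F_GETLK, &fl) < 0) { perror("fcntl(F_GETLK)"); return 1; }

    if (fl.l_type == F_UNLCK)
        printf("range is free, a write lock could be granted\n");
    else
        /* On GFS2 this PID may be on another cluster node. */
        printf("blocked by pid %ld\n", (long)fl.l_pid);

    close(fd);
    return 0;
}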

The other issue is that F_SETLKW will wait in an uninterruptible fashion on GFS2. This issue has been resolved with the following errata.

POSIX locks should not be used to determine whether a cluster node is able to read and write to a gfs2 filesystem, as documented in the following solution: A POSIX lock on a gfs2 filesystem is acquired before the cluster node has been fenced and waiting node ignores process signals

ping_pong test

There is a test program called ping_pong that is widely used for testing the basic performance of fcntl POSIX locks on GFS2 and other filesystems. If there is a problem with fcntl POSIX locking speed, it may be useful to run the ping_pong test.

Please note that it will not help diagnose any other issues in either GFS2 or the DLM. If you are having performance problems that are related only to fcntl POSIX locking, review the following article, which provides further help in narrowing down the issue. It is much more likely that any performance issues stem not from the fcntl POSIX locking itself, but from whatever operation is being done while the lock is held.

NFS

The use of NFS over GFS2 is only supported in very restricted use cases. Currently this requires active/passive NFS only, with the localflocks mount option set on each GFS2 mount. This is required to ensure that the Linux NFS server works correctly with respect to fcntl POSIX locking. NFS is only supported on GFS2 when it is the sole application accessing the filesystem directly. Red Hat cannot yet support combinations of NFS, Samba, and local applications on the same GFS2 filesystem. For more information about GFS2 being exported via NFS, review the following articles:

Alternatives to fcntl POSIX locks

The first alternative to consider is flock(2), which has the advantage of being implemented via the DLM. It supports only uninterruptible waiting and does not support range locking like fcntl(2), but this is often not a problem on GFS2: the underlying inodes are locked as a unit, so there is no performance advantage to be gained from using range locks on files.
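
As a rough comparison, here is a minimal flock(2) sketch (illustrative, with a hypothetical path); note that the lock applies to the whole file and there is no byte-range granularity:

#include <sys/file.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/gfs2/data.db", O_RDWR);   /* hypothetical path */
    if (fd < 0) { perror("open"); return 1; }

    /* Blocks until the exclusive lock is granted; the lock covers the
     * whole file rather than a byte range. */
    if (flock(fd, LOCK_EX) < 0) { perror("flock"); return 1; }

    /* ... work on the file ... */

    flock(fd, LOCK_UN);   /* closing the descriptor would also release it */
    close(fd);
    return 0;
}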

Another alternative is to use the DLM directly. It has a user space API that may also be used to implement advisory locking. However, if the intent is to lock files on the filesystem, the direct use of the DLM is considerably more complex and has little advantage over using flock(2).
