How do POSIX fcntl locks work on GFS2?
- Introduction
- GFS2 Implementation
- Deviations from POSIX
- ping_pong test
- NFS
- Alternatives to fcntl POSIX locks
Introduction
POSIX fcntl locks are locks that are accessed from the fcntl(2) system call with the F_GETLK, F_SETLK and F_SETLKW commands. They provide an advisory locking API that is used by a number of applications, including those running on NFS, since NFS does not support flock(2). The advisory nature of the locking means that there is nothing to prevent a process from accessing a file without taking a lock of the appropriate kind first. It relies on the assumption that all applications that need to access a particular file cooperate when requesting locks of this type: they only perform read operations under a read lock, and read or write operations under a write lock.
The operation of POSIX fcntl locks is not intuitive, and care must be taken not to accidentally unlock them prematurely. This can happen if any file descriptor held by the process on the file in question is closed, whether or not it was the file descriptor on which the lock was originally granted. POSIX fcntl locks are also released when the process holding them exits, and they are not inherited across the fork(2) system call.
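As an illustration of the API described above, here is a minimal sketch, not taken from any particular application, of acquiring and releasing an exclusive fcntl POSIX lock on a byte range with F_SETLKW. The path /mnt/gfs2/data is purely a hypothetical example.

/* Sketch: take an exclusive fcntl POSIX lock on the first 100 bytes of
 * a file, blocking until it is granted (F_SETLKW). */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/gfs2/data", O_RDWR);   /* hypothetical path */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct flock fl = {
        .l_type   = F_WRLCK,     /* exclusive (write) lock */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 100,         /* lock bytes 0..99 */
    };

    if (fcntl(fd, F_SETLKW, &fl) == -1) {      /* block until granted */
        perror("fcntl(F_SETLKW)");
        close(fd);
        return 1;
    }

    /* ... read or write the locked range here ... */

    fl.l_type = F_UNLCK;                       /* explicit unlock; the lock is
                                                  also dropped if *any* file
                                                  descriptor on this file held
                                                  by the process is closed */
    fcntl(fd, F_SETLK, &fl);
    close(fd);
    return 0;
}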
GFS2 Implementation
In GFS2, POSIX fcntl locks are largely implemented in user space. In older GFS2 versions this is done in the gfs_controld daemon; in more recent versions, dlm_controld contains the same code that was in gfs_controld. The reason for moving the code from gfs_controld to dlm_controld was to allow OCFS2 to share the same POSIX fcntl lock implementation. The implementation uses the ring-based system of the corosync daemon to ensure that all nodes see the same fcntl POSIX lock requests in the same order. The DLM is not involved in the POSIX fcntl lock subsystem directly; the only connection is that the code has been moved into dlm_controld, and the rest of the DLM is entirely separate from the POSIX fcntl lock code. The DLM is not used directly to implement fcntl POSIX locks because it provides neither a range locking API nor the hierarchical locking that might allow a range locking API to be built on top of it.
Options for cluster.conf
There are a number of options in the configuration file /etc/cluster/cluster.conf that control certain aspects of the POSIX fcntl locks in GFS2. The most important of these is plock_rate_limit=, which is an attribute of the <dlm> tag (or, for RHEL 5, the <gfs_controld> tag) in /etc/cluster/cluster.conf. If plock_rate_limit= is set to 0, the rate at which fcntl POSIX locks may be granted is unlimited (this is the recommended setting). Otherwise, the number sets the maximum number of fcntl POSIX locks granted per second. The default value for plock_rate_limit= is 100 locks/sec in RHEL 5, so most users of POSIX locks will want to change this setting. In RHEL 6 the default was changed to 0 (unlimited), so no adjustment is necessary.
The plock_ownership= option turns on caching of fcntl POSIX locks, which provides a performance advantage when a single node is making continued, repetitive use of the same lock with few requests from other nodes in the cluster. By default it is set to 0 (disabled). There are also options to tune the cache of these locks, which are described in the man pages for gfs_controld (RHEL 5) or dlm_controld (RHEL 6 and above).
A snippet of a /etc/cluster/cluster.conf for RHEL 5 with these options configured:
<?xml version="1.0"?>
<cluster config_version="42" name="rh5nodesThree">
....
<gfs_controld plock_ownership="1" plock_rate_limit="0"/>
</cluster>
A snippet of a /etc/cluster/cluster.conf for RHEL 6 with these options configured:
<?xml version="1.0"?>
<cluster config_version="42" name="rh6nodesThree">
....
<dlm plock_ownership="1" plock_rate_limit="0"/>
</cluster>
Options for pacemaker
The options from cluster.conf also exist in pacemaker and are added to the dlm resource. There is an example shown in the GFS2 documentation for RHEL 7. The command below is the same as in that documentation, except that the option args="--plock_ownership 1 --plock_rate_limit 0" is added to the controld resource.
# pcs resource create dlm ocf:pacemaker:controld args="--plock_ownership 1 --plock_rate_limit 0" op monitor interval=30s on-fail=fence clone interleave=true ordered=true
Mount options
The localflocks mount option tells GFS2 not to deal with fcntl POSIX locks or make flock locks cluster-wide. Instead, it lets the VFS deal with these locks as it would for a local filesystem. If you mount your GFS2 filesystem using lock_nolock as the locking protocol, this will be the case anyway, so the mount option is only effective for lock_dlm mounts.
Using localflocks requires an audit of all applications using the filesystem to ensure that they will not cause data corruption when it is enabled. The performance gain from making locking local is considerable. Normally the only reason for using the localflocks option is when you are exporting a GFS2 filesystem as an NFS export, since most applications that use fcntl POSIX locking will also require that the locking is cluster-wide rather than node-local.
As noted, localflocks is a mount option, and there is no option in /etc/cluster/cluster.conf to enable it on all GFS2 filesystems in the cluster. The option is enabled from the command line with the -o option of the mount command, or as a mount option of a clusterfs resource in the /etc/cluster/cluster.conf file.
Here is an example of mounting a GFS2 filesystem with the localflocks option from the command line:
$ mount -o rw,noatime,nodiratime,localflocks -t gfs2 /dev/mapper/myGFS2-lv1 /mnt/gfs2
Here is an example of a clusterfs resource in /etc/cluster/cluster.conf adding the localflocks option to a GFS2 filesystem:
<?xml version="1.0"?>
<cluster config_version="42" name="rh5nodesThree">
....
<rm>
<failoverdomains/>
<resources>
<clusterfs device="/dev/mapper/myGFS2-lv1" name="myGFS2" force_umount="1" fsid="1003"
fstype="gfs2" mountpoint="/mnt/gfs2"
options="rw,noatime,nodiratime,localflocks" self_fence="0"/>
</resources>
</rm>
</cluster>
Here is an example of adding the localflocks option to a GFS2 filesystem in pacemaker from the command line:
# pcs resource create clusterfs Filesystem device="/dev/cluster_vg/cluster_lv" directory="/var/mountpoint" fstype="gfs2" "options=noatime,localflocks" op monitor interval=10s on-fail=fence clone interleave=true
Deviations from POSIX
There are a couple of issues that must be taken into account when using fcntl locks with GFS2, where the implementation differs from the POSIX standard. The first is a consequence of how the interface is defined and relates to the F_GETLK command. This command returns the process identifier (PID) of a process that is blocking access to the file in question; the obvious intent is that a signal could then be sent to that process. The problem with this in the clustered case is that the PID might be on any node in the cluster, which unfortunately makes the interface less than useful when running a clustered filesystem. Applications that rely on this behavior will not work unmodified on GFS2.
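A minimal sketch, again using a hypothetical path, of the kind of F_GETLK usage that is affected:

/* Sketch: query a conflicting lock with F_GETLK. On a local filesystem,
 * l_pid identifies the blocking process; on GFS2 that PID may belong to a
 * process on another cluster node, so it cannot reliably be signalled. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/gfs2/data", O_RDWR);   /* hypothetical path */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct flock fl = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,          /* 0 means "to end of file" */
    };

    if (fcntl(fd, F_GETLK, &fl) == -1) {
        perror("fcntl(F_GETLK)");
    } else if (fl.l_type == F_UNLCK) {
        printf("no conflicting lock; a write lock could be placed\n");
    } else {
        /* On GFS2 this PID may belong to a process on a different node. */
        printf("blocked by PID %ld\n", (long)fl.l_pid);
    }

    close(fd);
    return 0;
}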
The other issue is that F_SETLKW will wait in an uninterruptible fashion on GFS2. This issue has been resolved with the following errata.
POSIX locks should not be used to determine whether a cluster node is able to read and write to a GFS2 filesystem, as documented in the following solution: A POSIX lock on a gfs2 filesystem is acquired before the cluster node has been fenced and waiting node ignores process signals
ping_pong test
There is a test program called ping_pong that is widely used for testing the basic performance of fcntl POSIX locks on GFS2 and other filesystems. If there is a problem with fcntl POSIX locking speed, it may be useful to run the ping_pong test.
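ping_pong is not shipped with GFS2 itself; it is a small C program distributed with the ctdb/Samba sources. Assuming it has been built and the GFS2 filesystem is mounted at /mnt/gfs2 (both assumptions for illustration), a typical invocation on each node looks roughly like this, where the numeric argument is conventionally the number of cluster nodes plus one:

$ ./ping_pong /mnt/gfs2/ping_pong.dat 4

The program reports the rate of fcntl lock/unlock cycles it achieves, so running it on one node and then on several nodes at once gives a rough picture of local and contended locking performance.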
Please note that it will not help diagnose any other issues in either GFS2 or the DLM. If you are having performance problems that are only related to fcntl POSIX locking, review the following article, which will provide further help in narrowing down the issue. It is much more likely that any performance issues do not stem directly from the fcntl POSIX locking, but from whatever operation is being done while the lock is held.
NFS
The use of NFS over GFS2 is only supported in very restricted use cases. Currently that means active/passive NFS only, with the localflocks mount option set on each GFS2 mount. This is required to ensure that the Linux NFS server works correctly with respect to fcntl POSIX locking. NFS is only supported on GFS2 when it is the sole application accessing the filesystem directly. Red Hat cannot yet support combinations of NFS, Samba, and local applications on the same GFS2 filesystem. For more information about a GFS2 filesystem being exported via NFS, review the following articles:
- How should I configure my gfs2 filesystem when using NFS or Samba as a service on Red Hat Clustering?
- Are NFS and Samba/CIFS exports of the same directory/filesystem supported on Red Hat Enterprise Linux?
- Is active/active clustered NFS supported on GFS/GFS2 filesystems?
Alternatives to fcntl POSIX locks
The first alternative to consider is flock(2), which has the advantage of being implemented via the DLM. It only supports uninterruptible waiting and it does not support range locking like fcntl(2), but this is often not a problem on GFS2, since the underlying inodes are locked as a unit, so there is no performance advantage to be gained by using range locks on files.
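For comparison, a minimal sketch, using the same hypothetical path as in the earlier examples, of whole-file advisory locking with flock(2):

/* Sketch: whole-file advisory locking with flock(2). On GFS2 mounted with
 * lock_dlm (and without localflocks) this lock is cluster-wide. */
#include <sys/file.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/gfs2/data", O_RDWR);   /* hypothetical path */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    if (flock(fd, LOCK_EX) == -1) {   /* blocks until the exclusive lock is granted */
        perror("flock(LOCK_EX)");
        close(fd);
        return 1;
    }

    /* ... work on the file while holding the lock ... */

    flock(fd, LOCK_UN);               /* release; also released when fd is closed */
    close(fd);
    return 0;
}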
Another alternative is to use the DLM directly. It has a user space API that may also be used to implement advisory locking. However, if the intent is to lock files on the filesystem, the direct use of the DLM is considerably more complex and has little advantage over using flock(2).