Chapter 2. Recommendations for GFS2 usage
When deploying a GFS2 file system, there are a variety of general recommendations you should take into account.
2.1. Configuring atime updates
Each file inode and directory inode has three time stamps associated with it:
- ctime: The last time the inode status was changed
- mtime: The last time the file (or directory) data was modified
- atime: The last time the file (or directory) data was accessed
If atime updates are enabled, as they are by default on GFS2 and other Linux file systems, then every time a file is read its inode needs to be updated.
Because few applications use the information provided by atime, those updates can require a significant amount of unnecessary write traffic and file locking traffic. That traffic can degrade performance; therefore, it may be preferable to turn off or reduce the frequency of atime updates.
The following methods of reducing the effects of atime updating are available:
- Mount with relatime (relative atime), which updates the atime if the previous atime update is older than the mtime or ctime update. This is the default mount option for GFS2 file systems.
- Mount with noatime or nodiratime. Mounting with noatime disables atime updates for both files and directories on that file system, while mounting with nodiratime disables atime updates only for directories on that file system. It is generally recommended that you mount GFS2 file systems with the noatime or nodiratime mount option whenever possible, with the preference for noatime where the application allows for this. For more information about the effect of these arguments on GFS2 file system performance, see GFS2 Node Locking.
Use the following command to mount a GFS2 file system with the noatime Linux mount option.
mount BlockDevice MountPoint -o noatime
- BlockDevice: Specifies the block device where the GFS2 file system resides.
- MountPoint: Specifies the directory where the GFS2 file system should be mounted.
In this example, the GFS2 file system resides on /dev/vg01/lvol0 and is mounted on directory /mygfs2 with atime updates turned off.
# mount /dev/vg01/lvol0 /mygfs2 -o noatime
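To confirm which atime behavior a mounted file system actually ended up with, you can inspect its mount options. The following is a minimal sketch rather than part of the documented procedure: the helper names has_opt and atime_mode are hypothetical, and the commented example assumes that findmnt from util-linux is installed.

```shell
# has_opt OPTIONS NAME: succeed if NAME is one of the comma-separated OPTIONS.
has_opt() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# atime_mode OPTIONS: print the most restrictive atime-related option present.
# nodiratime is reported ahead of relatime when both are set.
atime_mode() {
    if has_opt "$1" noatime; then
        echo noatime
    elif has_opt "$1" nodiratime; then
        echo nodiratime
    elif has_opt "$1" relatime; then
        echo relatime
    else
        echo other
    fi
}

# Example (assumes /mygfs2 is mounted as in the example above):
#   atime_mode "$(findmnt -no OPTIONS /mygfs2)"
```

The classification works on the option string only, so it can also be applied to options recorded in a configuration file before anything is mounted.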
2.2. VFS tuning options: research and experiment
Like all Linux file systems, GFS2 sits on top of a layer called the virtual file system (VFS). The VFS provides good defaults for the cache settings for most workloads and should not need changing in most cases. If, however, you have a workload that is not running efficiently (for example, cache is too large or too small) then you may be able to improve the performance by using the sysctl(8) command to adjust the values of the sysctl files in the /proc/sys/vm directory. Documentation for these files can be found in the kernel source tree.
For example, the values for dirty_background_ratio and vfs_cache_pressure may be adjusted depending on your situation. To fetch the current values, use the following commands:
# sysctl -n vm.dirty_background_ratio
# sysctl -n vm.vfs_cache_pressure
The following commands adjust the values:
# sysctl -w vm.dirty_background_ratio=20
# sysctl -w vm.vfs_cache_pressure=500
You can permanently change the values of these parameters by editing the system's sysctl configuration file.
To find the optimal values for your use cases, research the various VFS options and experiment on a test cluster before deploying into full production.
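Once experiments on a test cluster have produced values you want to keep, it can help to generate the persistent settings from a small script so that every node gets the same content. The sketch below is an illustration only: the function name render_vm_sysctl and the drop-in file name are hypothetical, and it assumes a distribution that reads /etc/sysctl.d and provides sysctl --system.

```shell
# render_vm_sysctl DIRTY_BACKGROUND_RATIO VFS_CACHE_PRESSURE
# Print sysctl.conf-style lines for the two settings, rejecting non-numeric input.
render_vm_sysctl() {
    for value in "$1" "$2"; do
        case "$value" in
            ''|*[!0-9]*)
                echo "not a non-negative integer: '$value'" >&2
                return 1
                ;;
        esac
    done
    printf 'vm.dirty_background_ratio = %s\n' "$1"
    printf 'vm.vfs_cache_pressure = %s\n' "$2"
}

# Example, as root (the file name 90-gfs2-vm.conf is a hypothetical choice):
#   render_vm_sysctl 20 500 > /etc/sysctl.d/90-gfs2-vm.conf
#   sysctl --system
```

The values 20 and 500 simply repeat the ones used in the commands above; they are not recommendations.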
2.3. SELinux on GFS2
Use of Security Enhanced Linux (SELinux) with GFS2 incurs a small performance penalty. To avoid this overhead, you may choose not to use SELinux with GFS2 even on a system with SELinux in enforcing mode. When mounting a GFS2 file system, you can ensure that SELinux will not attempt to read the seclabel element on each file system object by using one of the context options as described on the mount(8) man page; SELinux will assume that all content in the file system is labeled with the seclabel element provided in the context mount options. This will also speed up processing as it avoids another disk read of the extended attribute block that could contain seclabel elements.
For example, on a system with SELinux in enforcing mode, you can use the following mount command to mount the GFS2 file system if the file system is going to contain Apache content. This label will apply to the entire file system; it remains in memory and is not written to disk.
# mount -t gfs2 -o context=system_u:object_r:httpd_sys_content_t:s0 /dev/mapper/xyz /mnt/gfs2
If you are not sure whether the file system will contain Apache content, you can use a more general label such as public_content_t, or you could define a new label altogether and define a policy around it.
Note that in a Pacemaker cluster you should always use Pacemaker to manage a GFS2 file system. You can specify the mount options when you create a GFS2 file system resource.
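As a rough illustration of passing mount options through a Pacemaker resource, the sketch below only prints a pcs command for review and does not run it. The function name print_gfs2_resource_cmd and the resource name webfs are hypothetical, the device and mount point are taken from the mount example above, and a real cluster configuration will likely need additional settings that are out of scope here.

```shell
# print_gfs2_resource_cmd NAME DEVICE MOUNTPOINT OPTIONS
# Print (do not run) a pcs command that creates a Filesystem resource for GFS2.
print_gfs2_resource_cmd() {
    printf 'pcs resource create %s ocf:heartbeat:Filesystem device="%s" directory="%s" fstype="gfs2" options="%s"\n' \
        "$1" "$2" "$3" "$4"
}

# Hypothetical resource name; the options reuse the context label shown above.
print_gfs2_resource_cmd webfs /dev/mapper/xyz /mnt/gfs2 \
    'noatime,context=system_u:object_r:httpd_sys_content_t:s0'
```

Printing the command first makes it easy to check the option string before it is applied to the cluster.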
2.4. Setting up NFS over GFS2
Due to the added complexity of the GFS2 locking subsystem and its clustered nature, setting up NFS over GFS2 requires taking many precautions.
If the GFS2 file system is NFS exported, then you must mount the file system with the localflocks option. Because utilizing the localflocks option prevents you from safely accessing the GFS2 file system from multiple locations, and it is not viable to export GFS2 from multiple nodes simultaneously, it is a support requirement that the GFS2 file system be mounted on only one node at a time when using this configuration. The intended effect of this is to force POSIX locks from each server to be local: non-clustered, independent of each other. This is because a number of problems exist if GFS2 attempts to implement POSIX locks from NFS across the nodes of a cluster. For applications running on NFS clients, localized POSIX locks mean that two clients can hold the same lock concurrently if the two clients are mounting from different servers, which could cause data corruption. If all clients mount NFS from one server, then the problem of separate servers granting the same locks independently goes away. If you are not sure whether to mount your file system with the localflocks option, you should not use the option. Contact Red Hat support immediately to discuss the appropriate configuration to avoid data loss. Exporting GFS2 via NFS, while technically supported in some circumstances, is not recommended.
For all other (non-NFS) GFS2 applications, do not mount your file system using localflocks, so that GFS2 will manage the POSIX locks and flocks between all the nodes in the cluster (on a cluster-wide basis). If you specify localflocks and do not use NFS, the other nodes in the cluster will not have knowledge of each other’s POSIX locks and flocks, thus making them unsafe in a clustered environment.
In addition to the locking considerations, you should take the following into account when configuring an NFS service over a GFS2 file system.
Red Hat supports only Red Hat High Availability Add-On configurations using NFSv3 with locking in an active/passive configuration with the following characteristics. This configuration provides High Availability (HA) for the file system and reduces system downtime since a failed node does not result in the requirement to execute the fsck command when failing the NFS server from one node to another.
- The back-end file system is a GFS2 file system running on a 2 to 16 node cluster.
- An NFSv3 server is defined as a service exporting the entire GFS2 file system from a single cluster node at a time.
- The NFS server can fail over from one cluster node to another (active/passive configuration).
- No access to the GFS2 file system is allowed except through the NFS server. This includes both local GFS2 file system access as well as access through Samba or Clustered Samba. Accessing the file system locally via the cluster node from which it is mounted may result in data corruption.
- There is no NFS quota support on the system.
- The fsid= NFS option is mandatory for NFS exports of GFS2.
- If problems arise with your cluster (for example, the cluster becomes inquorate and fencing is not successful), the clustered logical volumes and the GFS2 file system will be frozen and no access is possible until the cluster is quorate. You should consider this possibility when determining whether a simple failover solution such as the one defined in this procedure is the most appropriate for your system.
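The export requirements listed above can be sketched as a helper that renders a line for /etc/exports with an explicit fsid. This is an assumption-laden illustration, not a supported configuration: the function name render_gfs2_export, the client network 192.0.2.0/24, the fsid value 25, and the rw,sync options are all hypothetical choices, and the sketch accepts only numeric fsid values.

```shell
# render_gfs2_export PATH CLIENT FSID
# Print an /etc/exports line for a GFS2 export with an explicit numeric fsid.
render_gfs2_export() {
    case "$3" in
        ''|*[!0-9]*)
            echo "fsid must be a non-negative integer" >&2
            return 1
            ;;
    esac
    printf '%s %s(rw,sync,fsid=%s)\n' "$1" "$2" "$3"
}

# Hypothetical values. The GFS2 file system itself must be mounted with
# -o localflocks on the single node that exports it.
render_gfs2_export /mnt/gfs2 192.0.2.0/24 25
```

Keeping the fsid fixed in one place helps ensure that every node that can take over the NFS service exports the file system with the same identifier.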
2.5. Samba (SMB or Windows) file serving over GFS2
You can use Samba (SMB or Windows) file serving from a GFS2 file system with CTDB, which allows active/active configurations.
Simultaneous access to the data in the Samba share from outside of Samba is not supported. There is currently no support for GFS2 cluster leases, which slows Samba file serving. For further information about support policies for Samba, see Support Policies for RHEL Resilient Storage - ctdb General Policies and Support Policies for RHEL Resilient Storage - Exporting gfs2 contents via other protocols.
2.6. Configuring virtual machines for GFS2
When using a GFS2 file system with a virtual machine, it is important that your VM storage settings on each node be configured properly in order to force the cache off. For example, including these settings for cache and io in the libvirt domain should allow GFS2 to behave as expected.
<driver name='qemu' type='raw' cache='none' io='native'/>
Alternatively, you can configure the shareable attribute within the device element. This indicates that the device is expected to be shared between domains (as long as hypervisor and OS support this). If shareable is used, cache='none' should be used for that device.
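To check an existing guest definition against the driver line shown above, a crude line-based filter over the domain XML can be enough. The sketch below assumes the XML is supplied on standard input (for example from virsh dumpxml); the function name check_disk_cache and the domain name guest1 are hypothetical, and because it matches any line containing a driver element it may also flag drivers of non-disk devices.

```shell
# check_disk_cache: read domain XML on stdin, print every <driver ...> line
# that lacks cache='none', and return 1 if any such line was found.
check_disk_cache() {
    bad=$(grep "<driver " | grep -v "cache=['\"]none['\"]" || true)
    if [ -n "$bad" ]; then
        printf '%s\n' "$bad"
        return 1
    fi
    return 0
}

# Example (guest1 is a hypothetical domain name):
#   virsh dumpxml guest1 | check_disk_cache && echo "no cached disk drivers found"
```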
2.7. Block allocation
Even though applications that only write data typically do not care how or where a block is allocated, some knowledge of how block allocation works can help you optimize performance.
2.7.1. Leave free space in the file system
When a GFS2 file system is nearly full, the block allocator starts to have a difficult time finding space for new blocks to be allocated. As a result, blocks given out by the allocator tend to be squeezed into the end of a resource group or in tiny slices where file fragmentation is much more likely. This file fragmentation can cause performance problems. In addition, when a GFS2 file system is nearly full, the GFS2 block allocator spends more time searching through multiple resource groups, and that adds lock contention that would not necessarily be there on a file system that has ample free space. This also can cause performance problems.
For these reasons, it is recommended that you not run a file system that is more than 85 percent full, although this figure may vary depending on workload.
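The 85 percent guideline can be monitored with a small script. The following sketch assumes a POSIX df and awk; the function names and the /mnt/gfs2 mount point are hypothetical, and the limit is a parameter because the appropriate figure may vary with workload.

```shell
# is_too_full USED_PERCENT [LIMIT]: succeed if usage exceeds LIMIT (default 85).
is_too_full() {
    [ "$1" -gt "${2:-85}" ]
}

# usage_pct PATH: print the used-space percentage that df reports for PATH.
usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_free_space PATH [LIMIT]: warn when the file system holding PATH is too full.
check_free_space() {
    pct=$(usage_pct "$1")
    if is_too_full "$pct" "${2:-85}"; then
        echo "WARNING: $1 is ${pct}% full"
        return 1
    fi
    echo "OK: $1 is ${pct}% full"
}

# Example (hypothetical mount point):
#   check_free_space /mnt/gfs2
```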
2.7.2. Have each node allocate its own files, if possible
When developing applications for use with GFS2 file systems, it is recommended that you have each node allocate its own files, if possible. Due to the way the distributed lock manager (DLM) works, there will be more lock contention if all files are allocated by one node and other nodes need to add blocks to those files.
The term "lock master" has been used historically to denote a node which is currently the coordinator of lock requests, which originate locally or from a remote node in the cluster. This term for the lock request coordinator is slightly misleading because it is really a resource (in DLM terminology) in relation to which lock requests are either queued, granted or declined. In the sense in which the term is used in the DLM, it should be taken to refer to "first among equals", since the DLM is a peer-to-peer system.
In the Linux kernel DLM implementation, the node on which the lock is first used becomes the coordinator of lock requests, and after that point it does not change. This is an implementation detail of the Linux kernel DLM and not a property of DLMs in general. It is possible that a future update may allow the coordination of lock requests for a particular lock to move between nodes.
The location where lock requests are coordinated is transparent to the initiator of the lock request, except by the effect on the latency of the request. One consequence of the current implementation is that if there is an imbalance of the initial workload (for example, one node scans through the whole filesystem before others perform any I/O commands) this can result in higher lock latencies for other nodes in the cluster compared with the node that performed the initial scan of the filesystem.
As in many file systems, the GFS2 allocator tries to keep blocks in the same file close to one another to reduce the movement of disk heads and boost performance. A node that allocates blocks to a file will likely need to use and lock the same resource groups for the new blocks (unless all the blocks in that resource group are in use). The file system will run faster if the lock request coordinator for the resource group containing the file allocates its data blocks (it is faster to have the node that first opened the file do all the writing of new blocks).
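One simple way to follow this recommendation is to give each node its own subdirectory, so that a file is created and extended by the same node. The sketch below shows one possible layout, not a GFS2 requirement: the function name node_path and the /mnt/gfs2/logs location are hypothetical, and the node name is taken from uname -n.

```shell
# node_path BASE NAME: print a per-node location for NAME under BASE.
node_path() {
    printf '%s/%s/%s\n' "$1" "$(uname -n)" "$2"
}

# Example with a hypothetical mount point: each node appends only to its own log.
#   log=$(node_path /mnt/gfs2/logs app.log)
#   mkdir -p "$(dirname "$log")" && echo "started" >> "$log"
```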
2.7.3. Preallocate, if possible
If files are preallocated, block allocations can be avoided altogether and the file system can run more efficiently. GFS2 supports the fallocate(2) system call, which you can use to preallocate blocks of data.
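From the shell, preallocation is available through the fallocate(1) utility from util-linux, which wraps the system call. The following is a minimal sketch: the function name preallocate and the example path are hypothetical, and the size should be chosen to match what the application is expected to write.

```shell
# preallocate PATH SIZE: reserve SIZE bytes for PATH before the application
# starts writing to it. fallocate accepts size suffixes such as M and G.
preallocate() {
    fallocate -l "$2" "$1"
}

# Example (hypothetical path on a GFS2 mount):
#   preallocate /mnt/gfs2/data/table.db 10G
```

Running the preallocation on the node that will later write the file keeps it in line with the advice to have each node allocate its own files.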