Chapter 6. Improving GFS2 performance

This section provides advice for improving GFS2 peformance.

For general recommendations for deploying and upgrading Red Hat Enterprise Linux clusters using the High Availability Add-On and Red Hat Global File System 2 (GFS2) see the article "Red Hat Enterprise Linux Cluster, High Availability, and GFS Deployment Best Practices" on the Red Hat Customer Portal at https://access.redhat.com/kb/docs/DOC-40821.

6.1. GFS2 file system defragmentation

While there is no defragmentation tool for GFS2 on Red Hat Enterprise Linux, you can defragment individual files by identifying them with the filefrag tool, copying them to temporary files, and renaming the temporary files to replace the originals.

6.2. GFS2 Node Locking

In order to get the best performance from a GFS2 file system, it is important to understand some of the basic theory of its operation. A single node file system is implemented alongside a cache, the purpose of which is to eliminate latency of disk accesses when using frequently requested data. In Linux the page cache (and historically the buffer cache) provide this caching function.

With GFS2, each node has its own page cache which may contain some portion of the on-disk data. GFS2 uses a locking mechanism called glocks (pronounced gee-locks) to maintain the integrity of the cache between nodes. The glock subsystem provides a cache management function which is implemented using the distributed lock manager (DLM) as the underlying communication layer.

The glocks provide protection for the cache on a per-inode basis, so there is one lock per inode which is used for controlling the caching layer. If that glock is granted in shared mode (DLM lock mode: PR) then the data under that glock may be cached upon one or more nodes at the same time, so that all the nodes may have local access to the data.

If the glock is granted in exclusive mode (DLM lock mode: EX) then only a single node may cache the data under that glock. This mode is used by all operations which modify the data (such as the write system call).

If another node requests a glock which cannot be granted immediately, then the DLM sends a message to the node or nodes which currently hold the glocks blocking the new request to ask them to drop their locks. Dropping glocks can be (by the standards of most file system operations) a long process. Dropping a shared glock requires only that the cache be invalidated, which is relatively quick and proportional to the amount of cached data.

Dropping an exclusive glock requires a log flush, and writing back any changed data to disk, followed by the invalidation as per the shared glock.

The difference between a single node file system and GFS2, then, is that a single node file system has a single cache and GFS2 has a separate cache on each node. In both cases, latency to access cached data is of a similar order of magnitude, but the latency to access uncached data is much greater in GFS2 if another node has previously cached that same data.

Operations such as read (buffered), stat, and readdir only require a shared glock. Operations such as write (buffered), mkdir, rmdir, and unlink require an exclusive glock. Direct I/O read/write operations require a deferred glock if no allocation is taking place, or an exclusive glock if the write requires an allocation (that is, extending the file, or hole filling).

There are two main performance considerations which follow from this. First, read-only operations parallelize extremely well across a cluster, since they can run independently on every node. Second, operations requiring an exclusive glock can reduce performance, if there are multiple nodes contending for access to the same inode(s). Consideration of the working set on each node is thus an important factor in GFS2 file system performance such as when, for example, you perform a file system backup, as described in Backing up a GFS2 file system.

A further consequence of this is that we recommend the use of the noatime and nodiratime mount options with GFS2 whenever possible. This prevents reads from requiring exclusive locks to update the atime timestamp.

For users who are concerned about the working set or caching efficiency, GFS2 provides tools that allow you to monitor the performance of a GFS2 file system: Performance Co-Pilot and GFS2 tracepoints.

Note

Due to the way in which GFS2’s caching is implemented the best performance is obtained when either of the following takes place:

  • An inode is used in a read-only fashion across all nodes.
  • An inode is written or modified from a single node only.

Note that inserting and removing entries from a directory during file creation and deletion counts as writing to the directory inode.

It is possible to break this rule provided that it is broken relatively infrequently. Ignoring this rule too often will result in a severe performance penalty.

If you mmap() a file on GFS2 with a read/write mapping, but only read from it, this only counts as a read.

If you do not set the noatime mount parameter, then reads will also result in writes to update the file timestamps. We recommend that all GFS2 users should mount with noatime unless they have a specific requirement for atime.

6.3. Issues with Posix locking

When using Posix locking, you should take the following into account:

  • Use of Flocks will yield faster processing than use of Posix locks.
  • Programs using Posix locks in GFS2 should avoid using the GETLK function since, in a clustered environment, the process ID may be for a different node in the cluster.

6.4. Performance tuning with GFS2

It is usually possible to alter the way in which a troublesome application stores its data in order to gain a considerable performance advantage.

A typical example of a troublesome application is an email server. These are often laid out with a spool directory containing files for each user (mbox), or with a directory for each user containing a file for each message (maildir). When requests arrive over IMAP, the ideal arrangement is to give each user an affinity to a particular node. That way their requests to view and delete email messages will tend to be served from the cache on that one node. Obviously if that node fails, then the session can be restarted on a different node.

When mail arrives by means of SMTP, then again the individual nodes can be set up so as to pass a certain user’s mail to a particular node by default. If the default node is not up, then the message can be saved directly into the user’s mail spool by the receiving node. Again this design is intended to keep particular sets of files cached on just one node in the normal case, but to allow direct access in the case of node failure.

This setup allows the best use of GFS2’s page cache and also makes failures transparent to the application, whether imap or smtp.

Backup is often another tricky area. Again, if it is possible it is greatly preferable to back up the working set of each node directly from the node which is caching that particular set of inodes. If you have a backup script which runs at a regular point in time, and that seems to coincide with a spike in the response time of an application running on GFS2, then there is a good chance that the cluster may not be making the most efficient use of the page cache.

Obviously, if you are in the position of being able to stop the application in order to perform a backup, then this will not be a problem. On the other hand, if a backup is run from just one node, then after it has completed a large portion of the file system will be cached on that node, with a performance penalty for subsequent accesses from other nodes. This can be mitigated to a certain extent by dropping the VFS page cache on the backup node after the backup has completed with following command:

echo -n 3 >/proc/sys/vm/drop_caches

However this is not as good a solution as taking care to ensure the working set on each node is either shared, mostly read-only across the cluster, or accessed largely from a single node.

6.5. Troubleshooting GFS2 performance with the GFS2 lock dump

If your cluster performance is suffering because of inefficient use of GFS2 caching, you may see large and increasing I/O wait times. You can make use of GFS2’s lock dump information to determine the cause of the problem.

This section provides an overview of the GFS2 lock dump.

The GFS2 lock dump information can be gathered from the debugfs file which can be found at the following path name, assuming that debugfs is mounted on /sys/kernel/debug/:

/sys/kernel/debug/gfs2/fsname/glocks

The content of the file is a series of lines. Each line starting with G: represents one glock, and the following lines, indented by a single space, represent an item of information relating to the glock immediately before them in the file.

The best way to use the debugfs file is to use the cat command to take a copy of the complete content of the file (it might take a long time if you have a large amount of RAM and a lot of cached inodes) while the application is experiencing problems, and then looking through the resulting data at a later date.

Note

It can be useful to make two copies of the debugfs file, one a few seconds or even a minute or two after the other. By comparing the holder information in the two traces relating to the same glock number, you can tell whether the workload is making progress (it is just slow) or whether it has become stuck (which is always a bug and should be reported to Red Hat support immediately).

Lines in the debugfs file starting with H: (holders) represent lock requests either granted or waiting to be granted. The flags field on the holders line f: shows which: The 'W' flag refers to a waiting request, the 'H' flag refers to a granted request. The glocks which have large numbers of waiting requests are likely to be those which are experiencing particular contention.

Table 6.1, “Glock flags” shows the meanings of the different glock flags and Table 6.2, “Glock holder flags” shows the meanings of the different glock holder flags.

Table 6.1. Glock flags

FlagNameMeaning

b

Blocking

Valid when the locked flag is set, and indicates that the operation that has been requested from the DLM may block. This flag is cleared for demotion operations and for "try" locks. The purpose of this flag is to allow gathering of stats of the DLM response time independent from the time taken by other nodes to demote locks.

d

Pending demote

A deferred (remote) demote request

D

Demote

A demote request (local or remote)

f

Log flush

The log needs to be committed before releasing this glock

F

Frozen

Replies from remote nodes ignored - recovery is in progress. This flag is not related to file system freeze, which uses a different mechanism, but is used only in recovery.

i

Invalidate in progress

In the process of invalidating pages under this glock

I

Initial

Set when DLM lock is associated with this glock

l

Locked

The glock is in the process of changing state

L

LRU

Set when the glock is on the LRU list

o

Object

Set when the glock is associated with an object (that is, an inode for type 2 glocks, and a resource group for type 3 glocks)

p

Demote in progress

The glock is in the process of responding to a demote request

q

Queued

Set when a holder is queued to a glock, and cleared when the glock is held, but there are no remaining holders. Used as part of the algorithm the calculates the minimum hold time for a glock.

r

Reply pending

Reply received from remote node is awaiting processing

y

Dirty

Data needs flushing to disk before releasing this glock

Table 6.2. Glock holder flags

FlagNameMeaning

a

Async

Do not wait for glock result (will poll for result later)

A

Any

Any compatible lock mode is acceptable

c

No cache

When unlocked, demote DLM lock immediately

e

No expire

Ignore subsequent lock cancel requests

E

exact

Must have exact lock mode

F

First

Set when holder is the first to be granted for this lock

H

Holder

Indicates that requested lock is granted

p

Priority

Enqueue holder at the head of the queue

t

Try

A "try" lock

T

Try 1CB

A "try" lock that sends a callback

W

Wait

Set while waiting for request to complete

Having identified a glock which is causing a problem, the next step is to find out which inode it relates to. The glock number (n: on the G: line) indicates this. It is of the form type/number and if type is 2, then the glock is an inode glock and the number is an inode number. To track down the inode, you can then run find -inum number where number is the inode number converted from the hex format in the glocks file into decimal.

Warning

If you run the find command on a file system when it is experiencing lock contention, you are likely to make the problem worse. It is a good idea to stop the application before running the find command when you are looking for contended inodes.

Table 6.3, “Glock types” shows the meanings of the different glock types.

Table 6.3. Glock types

Type numberLock typeUse

1

Trans

Transaction lock

2

Inode

Inode metadata and data

3

Rgrp

Resource group metadata

4

Meta

The superblock

5

Iopen

Inode last closer detection

6

Flock

flock(2) syscall

8

Quota

Quota operations

9

Journal

Journal mutex

If the glock that was identified was of a different type, then it is most likely to be of type 3: (resource group). If you see significant numbers of processes waiting for other types of glock under normal loads, report this to Red Hat support.

If you do see a number of waiting requests queued on a resource group lock there may be a number of reasons for this. One is that there are a large number of nodes compared to the number of resource groups in the file system. Another is that the file system may be very nearly full (requiring, on average, longer searches for free blocks). The situation in both cases can be improved by adding more storage and using the gfs2_grow command to expand the file system.

6.6. Enabling data journaling

Ordinarily, GFS2 writes only metadata to its journal. File contents are subsequently written to disk by the kernel’s periodic sync that flushes file system buffers. An fsync() call on a file causes the file’s data to be written to disk immediately. The call returns when the disk reports that all data is safely written.

Data journaling can result in a reduced fsync() time for very small files because the file data is written to the journal in addition to the metadata. This advantage rapidly reduces as the file size increases. Writing to medium and larger files will be much slower with data journaling turned on.

Applications that rely on fsync() to sync file data may see improved performance by using data journaling. Data journaling can be enabled automatically for any GFS2 files created in a flagged directory (and all its subdirectories). Existing files with zero length can also have data journaling turned on or off.

Enabling data journaling on a directory sets the directory to "inherit jdata", which indicates that all files and directories subsequently created in that directory are journaled. You can enable and disable data journaling on a file with the chattr command.

The following commands enable data journaling on the /mnt/gfs2/gfs2_dir/newfile file and then check whether the flag has been set properly.

# chattr +j /mnt/gfs2/gfs2_dir/newfile
# lsattr /mnt/gfs2/gfs2_dir
---------j--- /mnt/gfs2/gfs2_dir/newfile

The following commands disable data journaling on the /mnt/gfs2/gfs2_dir/newfile file and then check whether the flag has been set properly.

# chattr -j /mnt/gfs2/gfs2_dir/newfile
# lsattr /mnt/gfs2/gfs2_dir
------------- /mnt/gfs2/gfs2_dir/newfile

You can also use the chattr command to set the j flag on a directory. When you set this flag for a directory, all files and directories subsequently created in that directory are journaled. The following set of commands sets the j flag on the gfs2_dir directory, then checks whether the flag has been set properly. After this, the commands create a new file called newfile in the /mnt/gfs2/gfs2_dir directory and then check whether the j flag has been set for the file. Since the j flag is set for the directory, then newfile should also have journaling enabled.

# chattr -j /mnt/gfs2/gfs2_dir
# lsattr /mnt/gfs2
---------j--- /mnt/gfs2/gfs2_dir
# touch /mnt/gfs2/gfs2_dir/newfile
# lsattr /mnt/gfs2/gfs2_dir
---------j--- /mnt/gfs2/gfs2_dir/newfile