Chapter 8. Debugging GFS2 file systems with GFS2 tracepoints and the debugfs glocks file

This section describes both the glock debugfs interface and the GFS2 tracepoints. It is intended for advanced users who are familiar with file system internals who would like to learn more about the design of GFS2 and how to debug GFS2-specific issues.

8.1. GFS2 tracepoint types

There are currently three types of GFS2 tracepoints: glock (pronounced "gee-lock") tracepoints, bmap tracepoints and log tracepoints. These can be used to monitor a running GFS2 file system and give additional information to that which can be obtained with the debugging options supported in previous releases of Red Hat Enterprise Linux. Tracepoints are particularly useful when a problem, such as a hang or performance issue, is reproducible and thus the tracepoint output can be obtained during the problematic operation. In GFS2, glocks are the primary cache control mechanism and they are the key to understanding the performance of the core of GFS2. The bmap (block map) tracepoints can be used to monitor block allocations and block mapping (lookup of already allocated blocks in the on-disk metadata tree) as they happen and check for any issues relating to locality of access. The log tracepoints keep track of the data being written to and released from the journal and can provide useful information on that part of GFS2.

The tracepoints are designed to be as generic as possible. This should mean that it will not be necessary to change the API during the course of Red Hat Enterprise Linux 8. On the other hand, users of this interface should be aware that this is a debugging interface and not part of the normal Red Hat Enterprise Linux 8 API set, and as such Red Hat makes no guarantees that changes in the GFS2 tracepoints interface will not occur.

Tracepoints are a generic feature of Red Hat Enterprise Linux and their scope goes well beyond GFS2. In particular they are used to implement the blktrace infrastructure and the blktrace tracepoints can be used in combination with those of GFS2 to gain a fuller picture of the system performance. Due to the level at which the tracepoints operate, they can produce large volumes of data in a very short period of time. They are designed to put a minimum load on the system when they are enabled, but it is inevitable that they will have some effect. Filtering events by a variety of means can help reduce the volume of data and help focus on obtaining just the information which is useful for understanding any particular situation.

8.2. Tracepoints

The tracepoints can be found under the /sys/kernel/debug/tracing/ directory assuming that debugfs is mounted in the standard place at the /sys/kernel/debug directory. The events subdirectory contains all the tracing events that may be specified and, provided the gfs2 module is loaded, there will be a gfs2 subdirectory containing further subdirectories, one for each GFS2 event. The contents of the /sys/kernel/debug/tracing/events/gfs2 directory should look roughly like the following:

[root@chywoon gfs2]# ls
enable            gfs2_bmap       gfs2_glock_queue         gfs2_log_flush
filter            gfs2_demote_rq  gfs2_glock_state_change  gfs2_pin
gfs2_block_alloc  gfs2_glock_put  gfs2_log_blocks          gfs2_promote

To enable all the GFS2 tracepoints, enter the following command:

[root@chywoon gfs2]# echo -n 1 >/sys/kernel/debug/tracing/events/gfs2/enable

To enable a specific tracepoint, there is an enable file in each of the individual event subdirectories. The same is true of the filter file which can be used to set an event filter for each event or set of events. The meaning of the individual events is explained in more detail below.

The output from the tracepoints is available in ASCII or binary format. This appendix does not currently cover the binary interface. The ASCII interface is available in two ways. To list the current content of the ring buffer, you can enter the following command:

[root@chywoon gfs2]# cat /sys/kernel/debug/tracing/trace

This interface is useful in cases where you are using a long-running process for a certain period of time and, after some event, want to look back at the latest captured information in the buffer. An alternative interface, /sys/kernel/debug/tracing/trace_pipe, can be used when all the output is required. Events are read from this file as they occur; there is no historical information available through this interface. The format of the output is the same from both interfaces and is described for each of the GFS2 events in the later sections of this appendix.

A utility called trace-cmd is available for reading tracepoint data. For more information on this utility, see the link in Section 8.10, “References”. The trace-cmd utility can be used in a similar way to the strace utility, for example to run a command while gathering trace data from various sources.

8.3. Glocks

To understand GFS2, the most important concept to understand, and the one which sets it aside from other file systems, is the concept of glocks. In terms of the source code, a glock is a data structure that brings together the DLM and caching into a single state machine. Each glock has a 1:1 relationship with a single DLM lock, and provides caching for that lock state so that repetitive operations carried out from a single node of the file system do not have to repeatedly call the DLM, and thus they help avoid unnecessary network traffic. There are two broad categories of glocks, those which cache metadata and those which do not. The inode glocks and the resource group glocks both cache metadata, other types of glocks do not cache metadata. The inode glock is also involved in the caching of data in addition to metadata and has the most complex logic of all glocks.

Table 8.1. Glock Modes and DLM Lock Modes

Glock modeDLM lock modeNotes

UN

IV/NL

Unlocked (no DLM lock associated with glock or NL lock depending on I flag)

SH

PR

Shared (protected read) lock

EX

EX

Exclusive lock

DF

CW

Deferred (concurrent write) used for Direct I/O and file system freeze

Glocks remain in memory until either they are unlocked (at the request of another node or at the request of the VM) and there are no local users. At that point they are removed from the glock hash table and freed. When a glock is created, the DLM lock is not associated with the glock immediately. The DLM lock becomes associated with the glock upon the first request to the DLM, and if this request is successful then the 'I' (initial) flag will be set on the glock. Table 8.4, “Glock flags” shows the meanings of the different glock flags. Once the DLM has been associated with the glock, the DLM lock will always remain at least at NL (Null) lock mode until the glock is to be freed. A demotion of the DLM lock from NL to unlocked is always the last operation in the life of a glock.

Each glock can have a number of "holders" associated with it, each of which represents one lock request from the higher layers. System calls relating to GFS2 queue and dequeue holders from the glock to protect the critical section of code.

The glock state machine is based on a work queue. For performance reasons, tasklets would be preferable; however, in the current implementation we need to submit I/O from that context which prohibits their use.

Note

Workqueues have their own tracepoints which can be used in combination with the GFS2 tracepoints.

Table 8.2, “Glock Modes and Data Types” shows what state may be cached under each of the glock modes and whether that cached state may be dirty. This applies to both inode and resource group locks, although there is no data component for the resource group locks, only metadata.

Table 8.2. Glock Modes and Data Types

Glock modeCache DataCache MetadataDirty DataDirty Metadata

UN

No

No

No

No

SH

Yes

Yes

No

No

DF

No

Yes

No

No

EX

Yes

Yes

Yes

Yes

8.4. The glock debugfs interface

The glock debugfs interface allows the visualization of the internal state of the glocks and the holders and it also includes some summary details of the objects being locked in some cases. Each line of the file either begins G: with no indentation (which refers to the glock itself) or it begins with a different letter, indented with a single space, and refers to the structures associated with the glock immediately above it in the file (H: is a holder, I: an inode, and R: a resource group) . Here is an example of what the content of this file might look like:

G:  s:SH n:5/75320 f:I t:SH d:EX/0 a:0 r:3
 H: s:SH f:EH e:0 p:4466 [postmark] gfs2_inode_lookup+0x14e/0x260 [gfs2]
G:  s:EX n:3/258028 f:yI t:EX d:EX/0 a:3 r:4
 H: s:EX f:tH e:0 p:4466 [postmark] gfs2_inplace_reserve_i+0x177/0x780 [gfs2]
 R: n:258028 f:05 b:22256/22256 i:16800
G:  s:EX n:2/219916 f:yfI t:EX d:EX/0 a:0 r:3
 I: n:75661/219916 t:8 f:0x10 d:0x00000000 s:7522/7522
G:  s:SH n:5/127205 f:I t:SH d:EX/0 a:0 r:3
 H: s:SH f:EH e:0 p:4466 [postmark] gfs2_inode_lookup+0x14e/0x260 [gfs2]
G:  s:EX n:2/50382 f:yfI t:EX d:EX/0 a:0 r:2
G:  s:SH n:5/302519 f:I t:SH d:EX/0 a:0 r:3
 H: s:SH f:EH e:0 p:4466 [postmark] gfs2_inode_lookup+0x14e/0x260 [gfs2]
G:  s:SH n:5/313874 f:I t:SH d:EX/0 a:0 r:3
 H: s:SH f:EH e:0 p:4466 [postmark] gfs2_inode_lookup+0x14e/0x260 [gfs2]
G:  s:SH n:5/271916 f:I t:SH d:EX/0 a:0 r:3
 H: s:SH f:EH e:0 p:4466 [postmark] gfs2_inode_lookup+0x14e/0x260 [gfs2]
G:  s:SH n:5/312732 f:I t:SH d:EX/0 a:0 r:3
 H: s:SH f:EH e:0 p:4466 [postmark] gfs2_inode_lookup+0x14e/0x260 [gfs2]

The above example is a series of excerpts (from an approximately 18MB file) generated by the command cat /sys/kernel/debug/gfs2/unity:myfs/glocks >my.lock during a run of the postmark benchmark on a single node GFS2 file system. The glocks in the figure have been selected in order to show some of the more interesting features of the glock dumps.

The glock states are either EX (exclusive), DF (deferred), SH (shared) or UN (unlocked). These states correspond directly with DLM lock modes except for UN which may represent either the DLM null lock state, or that GFS2 does not hold a DLM lock (depending on the I flag as explained above). The s: field of the glock indicates the current state of the lock and the same field in the holder indicates the requested mode. If the lock is granted, the holder will have the H bit set in its flags (f: field). Otherwise, it will have the W wait bit set.

The n: field (number) indicates the number associated with each item. For glocks, that is the type number followed by the glock number so that in the above example, the first glock is n:5/75320; which indicates an iopen glock which relates to inode 75320. In the case of inode and iopen glocks, the glock number is always identical to the inode’s disk block number.

Note

The glock numbers (n: field) in the debugfs glocks file are in hexadecimal, whereas the tracepoints output lists them in decimal. This is for historical reasons; glock numbers were always written in hex, but decimal was chosen for the tracepoints so that the numbers could easily be compared with the other tracepoint output (from blktrace for example) and with output from stat(1).

The full listing of all the flags for both the holder and the glock are set out in Table 8.4, “Glock flags” and Table 8.5, “Glock holder flags”. The content of lock value blocks is not currently available through the glock debugfs interface.

Table 8.3, “Glock Types” shows the meanings of the different glock types.

Table 8.3. Glock Types

Type numberLock typeUse

1

trans

Transaction lock

2

inode

Inode metadata and data

3

rgrp

Resource group metadata

4

meta

The superblock

5

iopen

Inode last closer detection

6

flock

flock(2) syscall

8

quota

Quota operations

9

journal

Journal mutex

One of the more important glock flags is the l (locked) flag. This is the bit lock that is used to arbitrate access to the glock state when a state change is to be performed. It is set when the state machine is about to send a remote lock request through the DLM, and only cleared when the complete operation has been performed. Sometimes this can mean that more than one lock request will have been sent, with various invalidations occurring between times.

Table 8.4, “Glock flags” shows the meanings of the different glock flags.

Table 8.4. Glock flags

FlagNameMeaning

d

Pending demote

A deferred (remote) demote request

D

Demote

A demote request (local or remote)

f

Log flush

The log needs to be committed before releasing this glock

F

Frozen

Replies from remote nodes ignored - recovery is in progress.

i

Invalidate in progress

In the process of invalidating pages under this glock

I

Initial

Set when DLM lock is associated with this glock

l

Locked

The glock is in the process of changing state

L

LRU

Set when the glock is on the LRU list`

o

Object

Set when the glock is associated with an object (that is, an inode for type 2 glocks, and a resource group for type 3 glocks)

p

Demote in progress

The glock is in the process of responding to a demote request

q

Queued

Set when a holder is queued to a glock, and cleared when the glock is held, but there are no remaining holders. Used as part of the algorithm the calculates the minimum hold time for a glock.

r

Reply pending

Reply received from remote node is awaiting processing

y

Dirty

Data needs flushing to disk before releasing this glock

When a remote callback is received from a node that wants to get a lock in a mode that conflicts with that being held on the local node, then one or other of the two flags D (demote) or d (demote pending) is set. In order to prevent starvation conditions when there is contention on a particular lock, each lock is assigned a minimum hold time. A node which has not yet had the lock for the minimum hold time is allowed to retain that lock until the time interval has expired.

If the time interval has expired, then the D (demote) flag will be set and the state required will be recorded. In that case the next time there are no granted locks on the holders queue, the lock will be demoted. If the time interval has not expired, then the d (demote pending) flag is set instead. This also schedules the state machine to clear d (demote pending) and set D (demote) when the minimum hold time has expired.

The I (initial) flag is set when the glock has been assigned a DLM lock. This happens when the glock is first used and the I flag will then remain set until the glock is finally freed (which the DLM lock is unlocked).

8.5. Glock holders

Table 8.5, “Glock holder flags” shows the meanings of the different glock holder flags.

Table 8.5. Glock holder flags

FlagNameMeaning

a

Async

Do not wait for glock result (will poll for result later)

A

Any

Any compatible lock mode is acceptable

c

No cache

When unlocked, demote DLM lock immediately

e

No expire

Ignore subsequent lock cancel requests

E

Exact

Must have exact lock mode

F

First

Set when holder is the first to be granted for this lock

H

Holder

Indicates that requested lock is granted

p

Priority

Enqueue holder at the head of the queue

t

Try

A "try" lock

T

Try 1CB

A "try" lock that sends a callback

W

Wait

Set while waiting for request to complete

The most important holder flags are H (holder) and W (wait) as mentioned earlier, since they are set on granted lock requests and queued lock requests respectively. The ordering of the holders in the list is important. If there are any granted holders, they will always be at the head of the queue, followed by any queued holders.

If there are no granted holders, then the first holder in the list will be the one that triggers the next state change. Since demote requests are always considered higher priority than requests from the file system, that might not always directly result in a change to the state requested.

The glock subsystem supports two kinds of "try" lock. These are useful both because they allow the taking of locks out of the normal order (with suitable back-off and retry) and because they can be used to help avoid resources in use by other nodes. The normal t (try) lock is just what its name indicates; it is a "try" lock that does not do anything special. The T (try 1CB) lock, on the other hand, is identical to the t lock except that the DLM will send a single callback to current incompatible lock holders. One use of the T (try 1CB) lock is with the iopen locks, which are used to arbitrate among the nodes when an inode’s i_nlink count is zero, and determine which of the nodes will be responsible for deallocating the inode. The iopen glock is normally held in the shared state, but when the i_nlink count becomes zero and →evict_inode() is called, it will request an exclusive lock with T (try 1CB) set. It will continue to deallocate the inode if the lock is granted. If the lock is not granted it will result in the node(s) which were preventing the grant of the lock marking their glock(s) with the D (demote) flag, which is checked at →drop_inode() time in order to ensure that the deallocation is not forgotten.

This means that inodes that have zero link count but are still open will be deallocated by the node on which the final close() occurs. Also, at the same time as the inode’s link count is decremented to zero the inode is marked as being in the special state of having zero link count but still in use in the resource group bitmap. This functions like the ext3 file system3’s orphan list in that it allows any subsequent reader of the bitmap to know that there is potentially space that might be reclaimed, and to attempt to reclaim it.

8.6. Glock tracepoints

The tracepoints are also designed to be able to confirm the correctness of the cache control by combining them with the blktrace output and with knowledge of the on-disk layout. It is then possible to check that any given I/O has been issued and completed under the correct lock, and that no races are present.

The gfs2_glock_state_change tracepoint is the most important one to understand. It tracks every state change of the glock from initial creation right through to the final demotion which ends with gfs2_glock_put and the final NL to unlocked transition. The l (locked) glock flag is always set before a state change occurs and will not be cleared until after it has finished. There are never any granted holders (the H glock holder flag) during a state change. If there are any queued holders, they will always be in the W (waiting) state. When the state change is complete then the holders may be granted which is the final operation before the l glock flag is cleared.

The gfs2_demote_rq tracepoint keeps track of demote requests, both local and remote. Assuming that there is enough memory on the node, the local demote requests will rarely be seen, and most often they will be created by umount or by occasional memory reclaim. The number of remote demote requests is a measure of the contention between nodes for a particular inode or resource group.

The gfs2_glock_lock_time tracepoint provides information on the time taken by requests to the DLM. The blocking (b) flag was introduced into the glock specifically to be used in combination with this tracepoint.

When a holder is granted a lock, gfs2_promote is called, this occurs as the final stages of a state change or when a lock is requested which can be granted immediately due to the glock state already caching a lock of a suitable mode. If the holder is the first one to be granted for this glock, then the f (first) flag is set on that holder. This is currently used only by resource groups.

8.7. Bmap tracepoints

Block mapping is a task central to any file system. GFS2 uses a traditional bitmap-based system with two bits per block. The main purpose of the tracepoints in this subsystem is to allow monitoring of the time taken to allocate and map blocks.

The gfs2_bmap tracepoint is called twice for each bmap operation: once at the start to display the bmap request, and once at the end to display the result. This makes it easy to match the requests and results together and measure the time taken to map blocks in different parts of the file system, different file offsets, or even of different files. It is also possible to see what the average extent sizes being returned are in comparison to those being requested.

The gfs2_rs tracepoint traces block reservations as they are created, used, and destroyed in the block allocator.

To keep track of allocated blocks, gfs2_block_alloc is called not only on allocations, but also on freeing of blocks. Since the allocations are all referenced according to the inode for which the block is intended, this can be used to track which physical blocks belong to which files in a live file system. This is particularly useful when combined with blktrace, which will show problematic I/O patterns that may then be referred back to the relevant inodes using the mapping gained by means this tracepoint.

Direct I/O (iomap) is an alternative cache policy which allows file data transfers to happen directly between disk and the user’s buffer. This has benefits in situations where cache hit rate is expected to be low. Both gfs2_iomap_start and gfs2_iomap_end tracepoints trace these operations and can be used to keep track of mapping using Direct I/O, the positions on the file system of the Direct I/O along with the operation type.

8.8. Log tracepoints

The tracepoints in this subsystem track blocks being added to and removed from the journal (gfs2_pin), as well as the time taken to commit the transactions to the log (gfs2_log_flush). This can be very useful when trying to debug journaling performance issues.

The gfs2_log_blocks tracepoint keeps track of the reserved blocks in the log, which can help show if the log is too small for the workload, for example.

The gfs2_ail_flush tracepoint is similar to the gfs2_log_flush tracepoint in that it keeps track of the start and end of flushes of the AIL list. The AIL list contains buffers which have been through the log, but have not yet been written back in place and this is periodically flushed in order to release more log space for use by the file system, or when a process requests a sync or fsync.

8.9. Glock statistics

GFS2 maintains statistics that can help track what is going on within the file system. This allows you to spot performance issues.

GFS2 maintains two counters:

  • dcount, which counts the number of DLM operations requested. This shows how much data has gone into the mean/variance calculations.
  • qcount, which counts the number of syscall level operations requested. Generally qcount will be equal to or greater than dcount.

In addition, GFS2 maintains three mean/variance pairs. The mean/variance pairs are smoothed exponential estimates and the algorithm used is the one used to calculate round trip times in network code.

The mean and variance pairs maintained in GFS2 are not scaled, but are in units of integer nanoseconds.

  • srtt/srttvar: Smoothed round trip time for non-blocking operations
  • srttb/srttvarb: Smoothed round trip time for blocking operations
  • irtt/irttvar: Inter-request time (for example, time between DLM requests)

A non-blocking request is one which will complete right away, whatever the state of the DLM lock in question. That currently means any requests when (a) the current state of the lock is exclusive (b) the requested state is either null or unlocked or (c) the "try lock" flag is set. A blocking request covers all the other lock requests.

Larger times are better for IRTTs, whereas smaller times are better for the RTTs.

Statistics are kept in two sysfs files:

  • The glstats file. This file is similar to the glocks file, except that it contains statistics, with one glock per line. The data is initialized from "per cpu" data for that glock type for which the glock is created (aside from counters, which are zeroed). This file may be very large.
  • The lkstats file. This contains "per cpu" stats for each glock type. It contains one statistic per line, in which each column is a cpu core. There are eight lines per glock type, with types following on from each other.

8.10. References

For more information about tracepoints and the GFS2 glocks file, see the following resources: