Chapter 32. Factors affecting I/O and file system performance

The appropriate settings for storage and file system performance are highly dependent on the storage purpose.

I/O and file system performance can be affected by any of the following factors:

  • Data write or read patterns
  • Sequential or random
  • Buffered or Direct I/O
  • Data alignment with underlying geometry
  • Block size
  • File system size
  • Journal size and location
  • Recording access times
  • Ensuring data reliability
  • Pre-fetching data
  • Pre-allocating disk space
  • File fragmentation
  • Resource contention

32.1. Tools for monitoring and diagnosing I/O and file system issues

The following tools are available in Red Hat Enterprise Linux 9 for monitoring system performance and diagnosing performance problems related to I/O, file systems, and their configuration:

  • vmstat reports on processes, memory, paging, block I/O, interrupts, and CPU activity across the entire system. It can help administrators determine whether the I/O subsystem is responsible for any performance issues. If analysis with vmstat shows that the I/O subsystem is responsible for reduced performance, administrators can use the iostat tool to determine the responsible I/O device.
  • iostat reports on I/O device load in your system. It is provided by the sysstat package.
  • blktrace provides detailed information about how time is spent in the I/O subsystem. The companion utility blkparse reads the raw output from blktrace and produces a human-readable summary of the input and output operations recorded by blktrace. For a workflow that combines vmstat, iostat, and blktrace, see the example after this list.
  • btt analyzes blktrace output and displays the amount of time that data spends in each area of the I/O stack, making it easier to spot bottlenecks in the I/O subsystem. This utility is provided as part of the blktrace package. Some of the important events tracked by the blktrace mechanism and analyzed by btt are:

    • Queuing of the I/O event (Q)
    • Dispatch of the I/O to the driver event (D)
    • Completion of I/O event (C)
  • iowatcher can use the blktrace output to graph I/O over time. It focuses on the Logical Block Address (LBA) of disk I/O, throughput in megabytes per second, the number of seeks per second, and I/O operations per second. This can help to identify when you are hitting the operations-per-second limit of a device.
  • BPF Compiler Collection (BCC) is a library that facilitates the creation of extended Berkeley Packet Filter (eBPF) programs. eBPF programs are triggered by events, such as disk I/O, TCP connections, and process creations. The BCC tools are installed in the /usr/share/bcc/tools/ directory. The following bcc-tools help to analyze performance:

    • biolatency summarizes the latency in block device I/O (disk I/O) as a histogram. This allows the latency distribution to be studied, including two modes for device cache hits and cache misses, and latency outliers.
    • biosnoop is a basic block I/O tracing tool for displaying each I/O event along with the issuing process ID, and the I/O latency. Using this tool, you can investigate disk I/O performance issues.
    • biotop shows the top processes performing block I/O operations in the kernel.
    • filelife traces short-lived files, capturing their creation and deletion.
    • fileslower traces slow synchronous file reads and writes.
    • filetop displays file reads and writes by process.
    • ext4slower, nfsslower, and xfsslower are tools that show file system operations slower than a certain threshold, which defaults to 10ms.

      For more information, see Analyzing system performance with BPF Compiler Collection.

  • bpftrace is a tracing language for eBPF used for analyzing performance issues. Like BCC, it provides tracing utilities for system observation, which is useful for investigating I/O performance issues.
  • The following SystemTap scripts may be useful in diagnosing storage or file system performance problems:

    • disktop.stp: Checks the status of disk reads and writes every 5 seconds and outputs the top ten entries during that period.
    • iotime.stp: Prints the amount of time spent on read and write operations, and the number of bytes read and written.
    • traceio.stp: Prints the top ten executables, based on cumulative I/O traffic observed, every second.
    • traceio2.stp: Prints the executable name and process identifier as reads and writes to the specified device occur.
    • inodewatch.stp: Prints the executable name and process identifier each time a read or write occurs to the specified inode on the specified major/minor device.
    • inodewatch2.stp: Prints the executable name, process identifier, and attributes each time the attributes are changed on the specified inode on the specified major/minor device.
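
For example, a minimal troubleshooting workflow, assuming the suspect device turns out to be /dev/sdb (a placeholder name), might combine these tools as follows. vmstat shows whether the system is spending a large share of time waiting for I/O (the wa column), iostat -xz identifies the busiest device, and blktrace followed by blkparse records and summarizes 30 seconds of individual I/O events on that device:

  # vmstat 1 10
  # iostat -xz 1 10
  # blktrace -d /dev/sdb -w 30
  # blkparse -i sdb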

32.2. Available tuning options for formatting a file system

Some file system configuration decisions cannot be changed after the device is formatted.

The following are the options available before formatting a storage device:

Size
Create an appropriately-sized file system for your workload. Smaller file systems require less time and memory for file system checks. However, if a file system is too small, its performance suffers from high fragmentation.
Block size

The block is the unit of work for the file system. The block size determines how much data can be stored in a single block, and therefore the smallest amount of data that is written or read at one time.

The default block size is appropriate for most use cases. However, your file system performs better and stores data more efficiently if the block size, or the size of a group of blocks, is the same as or slightly larger than the amount of data that is typically read or written at one time. A small file still uses an entire block. Files can be spread across multiple blocks, but this can create additional runtime overhead.

Additionally, some file systems are limited to a certain number of blocks, which in turn limits the maximum size of the file system. Block size is specified as part of the file system options when formatting a device with the mkfs command. The parameter that specifies the block size varies with the file system.
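
For example, the following commands, a minimal sketch in which /dev/sdb1 is a placeholder device, format an XFS and an ext4 file system with an explicit 4 KiB block size (which is also the usual default):

  # mkfs.xfs -b size=4096 /dev/sdb1
  # mkfs.ext4 -b 4096 /dev/sdb1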

Geometry

File system geometry is concerned with the distribution of data across a file system. If your system uses striped storage, like RAID, you can improve performance by aligning data and metadata with the underlying storage geometry when you format the device.

Many devices export recommended geometry, which is then set automatically when the devices are formatted with a particular file system. If your device does not export these recommendations, or you want to change the recommended settings, you must specify geometry manually when you format the device with the mkfs command.

The parameters that specify file system geometry vary with the file system.
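
For example, for a RAID volume with a 64 KiB chunk size and four data-bearing disks, the geometry might be specified as follows. The device name /dev/md0 is a placeholder, and the ext4 values assume a 4 KiB block size, giving a stride of 16 blocks (64 KiB / 4 KiB) and a stripe width of 64 blocks (16 × 4 disks):

  # mkfs.xfs -d su=64k,sw=4 /dev/md0
  # mkfs.ext4 -E stride=16,stripe_width=64 /dev/md0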

External journals
Journaling file systems document the changes that will be made during a write operation in a journal file prior to the operation being executed. This reduces the likelihood that a storage device will become corrupted in the event of a system crash or power failure, and speeds up the recovery process.
Note

Red Hat does not recommend using the external journals option.

Metadata-intensive workloads involve very frequent updates to the journal. A larger journal uses more memory, but reduces the frequency of write operations. Additionally, you can improve the seek time of a device with a metadata-intensive workload by placing its journal on dedicated storage that is as fast as, or faster than, the primary storage.

Warning

Ensure that external journals are reliable. Losing an external journal device causes file system corruption. External journals must be created at format time, with journal devices being specified at mount time.
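
As an illustration only, with /dev/sdb1 as the data device and /dev/sdc1 as the journal device (both placeholders), an XFS file system with an external log could be created and mounted as follows:

  # mkfs.xfs -l logdev=/dev/sdc1 /dev/sdb1
  # mount -o logdev=/dev/sdc1 /dev/sdb1 /mnt/data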

32.3. Available tuning options for mounting a file system

The following are the options available to most file systems and can be specified as the device is mounted:

Access Time

Every time a file is read, its metadata is updated with the time at which access occurred (atime). This involves additional write I/O. The relatime option is the default atime setting for most file systems: it updates atime only if the previous atime is older than the modification time, or if the previous atime is more than 24 hours old.

However, if updating this metadata is time consuming, and if accurate access time data is not required, you can mount the file system with the noatime mount option. This disables updates to metadata when a file is read. It also enables nodiratime behavior, which disables updates to metadata when a directory is read.

Note

Disabling atime updates by using the noatime mount option can break applications that rely on them, for example, backup programs.
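
For example, the following command mounts a file system without access time updates; the device and mount point are placeholders. To make the setting persistent, add noatime to the options field of the corresponding /etc/fstab entry.

  # mount -o noatime /dev/sdb1 /mnt/data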

Read-ahead

Read-ahead behavior speeds up file access by pre-fetching data that is likely to be needed soon and loading it into the page cache, where it can be retrieved more quickly than if it were on disk. The higher the read-ahead value, the further ahead the system pre-fetches data.

Red Hat Enterprise Linux attempts to set an appropriate read-ahead value based on what it detects about your file system. However, accurate detection is not always possible. For example, if a storage array presents itself to the system as a single LUN, the system detects the single LUN, and does not set the appropriate read-ahead value for an array.

Workloads that involve heavy streaming of sequential I/O often benefit from high read-ahead values. The storage-related tuned profiles provided with Red Hat Enterprise Linux raise the read-ahead value, as does using LVM striping, but these adjustments are not always sufficient for all workloads.
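
For example, the blockdev utility reports and sets the read-ahead value in 512-byte sectors. The following commands, with /dev/sda as a placeholder, display the current value and then raise it to 4096 sectors (2 MiB); the change does not persist across reboots:

  # blockdev --getra /dev/sda
  # blockdev --setra 4096 /dev/sda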

Additional resources

  • mount(8), xfs(5), and ext4(5) man pages

32.4. Types of discarding unused blocks

Regularly discarding blocks that are not in use by the file system is a recommended practice for both solid-state disks and thinly-provisioned storage.

The following are the two methods of discarding unused blocks:

Batch discard
This type of discard is run on demand by the fstrim command. It discards all unused blocks in a file system that match criteria specified by the administrator. Red Hat Enterprise Linux 9 supports batch discard on XFS and ext4 formatted devices that support physical discard operations. For sample commands, see the example after these descriptions.
Online discard

This type of discard operation is configured at mount time with the discard option, and runs in real time without user intervention. However, it only discards blocks that are transitioning from used to free. Red Hat Enterprise Linux 9 supports online discard on XFS and ext4 formatted devices.

Red Hat recommends batch discard, except where online discard is required to maintain performance, or where batch discard is not feasible for the system’s workload.
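
For example, batch discard can be run manually against a mounted file system or scheduled with the fstrim timer, while online discard is enabled with a mount option. The mount point and device names are placeholders:

  # fstrim /mnt/data
  # systemctl enable --now fstrim.timer
  # mount -o discard /dev/sdb1 /mnt/data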

Pre-allocation marks disk space as being allocated to a file without writing any data into that space. This can be useful in limiting data fragmentation and poor read performance. Red Hat Enterprise Linux 9 supports pre-allocating space on XFS, ext4, and GFS2 file systems. Applications can also benefit from pre-allocating space by using the fallocate(2) system call.
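
For example, the fallocate utility pre-allocates space from the command line by using the same mechanism; the file name below is a placeholder:

  # fallocate -l 1G /mnt/data/prealloc.img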

Additional resources

  • mount(8) and fallocate(2) man pages

32.5. Solid-state disks tuning considerations

Solid-state disks (SSD) use NAND flash chips rather than rotating magnetic platters to store persistent data. SSDs provide a constant access time for data across their full Logical Block Address range and do not incur measurable seek costs like their rotating counterparts. They are more expensive per gigabyte of storage space and have a lower storage density, but they also have lower latency and greater throughput than HDDs.

Performance generally degrades as the used blocks on an SSD approach the capacity of the disk. The degree of degradation varies by vendor, but all devices experience degradation in this circumstance. Enabling discard behavior can help to alleviate this degradation. For more information, see Types of discarding unused blocks.

The default I/O scheduler and virtual memory options are suitable for use with SSDs. Consider the following factors when configuring settings that can affect SSD performance:

I/O Scheduler

Any I/O scheduler is expected to perform well with most SSDs. However, as with any other storage type, Red Hat recommends benchmarking to determine the optimal configuration for a given workload. When using SSDs, Red Hat advises changing the I/O scheduler only for benchmarking particular workloads. For instructions on how to switch between I/O schedulers, see the /usr/share/doc/kernel-version/Documentation/block/switching-sched.txt file.

For single queue HBA, the default I/O scheduler is deadline. For multiple queue HBA, the default I/O scheduler is none. For information about how to set the I/O scheduler, see Setting the disk scheduler.

Virtual Memory
Like the I/O scheduler, the virtual memory (VM) subsystem requires no special tuning. Given the fast nature of I/O on SSDs, consider turning down the vm.dirty_background_ratio and vm.dirty_ratio settings, because increased write-out activity does not usually have a negative impact on the latency of other operations on the disk. However, this tuning can generate more overall I/O, and is therefore not generally recommended without workload-specific testing. For sample commands, see the example after this list.
Swap
An SSD can also be used as a swap device, and is likely to produce good page-out and page-in performance.
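
For example, the current values of the dirty ratios can be inspected and changed with sysctl. The values shown are only an illustration, and settings changed this way do not persist across reboots:

  # sysctl vm.dirty_background_ratio vm.dirty_ratio
  # sysctl -w vm.dirty_background_ratio=5
  # sysctl -w vm.dirty_ratio=10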

32.6. Generic block device tuning parameters

The generic tuning parameters listed here are available in the /sys/block/sdX/queue/ directory.

The following listed tuning parameters are separate from I/O scheduler tuning, and are applicable to all I/O schedulers:
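
For example, the parameters can be read and changed directly through the sysfs files. The following commands, with sda as a placeholder, display and then disable the add_random setting described below; changes made this way do not persist across reboots:

  # cat /sys/block/sda/queue/add_random
  # echo 0 > /sys/block/sda/queue/add_random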

add_random
Some I/O events contribute to the entropy pool for /dev/random. This parameter can be set to 0 if the overhead of these contributions becomes measurable.
iostats

By default, iostats is enabled and the default value is 1. Setting the iostats value to 0 disables the gathering of I/O statistics for the device, which removes a small amount of overhead in the I/O path. Setting iostats to 0 might slightly improve performance for very high performance devices, such as certain NVMe solid-state storage devices. It is recommended to leave iostats enabled unless the vendor specifies otherwise for the given storage model.

If you disable iostats, the I/O statistics for the device are no longer present in the /proc/diskstats file. The content of the /proc/diskstats file is the source of I/O information for monitoring tools, such as sar or iostat. Therefore, if you disable the iostats parameter for a device, the device is no longer present in the output of I/O monitoring tools.

max_sectors_kb

Specifies the maximum size of an I/O request in kilobytes. The default value is 512 KB. The minimum value for this parameter is determined by the logical block size of the storage device. The maximum value for this parameter is determined by the value of the max_hw_sectors_kb.

Red Hat recommends that max_sectors_kb always be a multiple of the optimal I/O size and the internal erase block size. Use the logical_block_size value for either of these if it is zero or not reported by the storage device.

nomerges
Most workloads benefit from request merging. However, disabling merges can be useful for debugging purposes. By default, the nomerges parameter is set to 0, which enables merging. To disable simple one-hit merging, set nomerges to 1. To disable all types of merging, set nomerges to 2.
nr_requests
The maximum number of I/O requests that can be queued. If the current I/O scheduler is none, this number can only be reduced; otherwise it can be increased or reduced.
optimal_io_size
Some storage devices report an optimal I/O size through this parameter. If this value is reported, Red Hat recommends that applications issue I/O aligned to and in multiples of the optimal I/O size wherever possible.
read_ahead_kb

Defines the maximum number of kilobytes that the operating system may read ahead during a sequential read operation. As a result, the necessary information is already present within the kernel page cache for the next sequential read, which improves read I/O performance.

Device mappers often benefit from a high read_ahead_kb value. 128 KB for each device to be mapped is a good starting point, but increasing the read_ahead_kb value up to the request queue's max_sectors_kb of the disk might improve performance in application environments where sequential reading of large files takes place.

rotational
Some solid-state disks do not correctly advertise their solid-state status, and are treated as traditional rotational disks. Manually set the rotational value to 0 to disable unnecessary seek-reducing logic in the scheduler.
rq_affinity
The default value of rq_affinity is 1, which completes I/O operations on a CPU core in the same CPU group as the core that issued the I/O. To perform completions only on the processor that issued the I/O request, set rq_affinity to 2. To disable both of these behaviors, set it to 0.
scheduler
To set the scheduler or scheduler preference order for a particular storage device, edit the /sys/block/devname/queue/scheduler file, where devname is the name of the device you want to configure.
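
For example, the following commands, with sda as a placeholder, list the available schedulers (the active one is shown in square brackets) and then switch the device to the none scheduler for a benchmark run. The change does not persist across reboots:

  # cat /sys/block/sda/queue/scheduler
  # echo none > /sys/block/sda/queue/scheduler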