8.4. Configuration Tools

Red Hat Enterprise Linux provides a number of tools to assist administrators in configuring the storage and file systems. This section outlines the available tools and provides examples of how they can be used to solve I/O and file system related performance problems in Red Hat Enterprise Linux 7.

8.4.1. Configuring Tuning Profiles for Storage Performance

The Tuned service provides a number of profiles designed to improve performance for specific use cases. The following profiles are particularly useful for improving storage performance.
  • latency-performance
  • throughput-performance (the default)
To configure a profile on your system, run the following command, replacing name with the name of the profile you want to use.
# tuned-adm profile name
The tuned-adm recommend command recommends an appropriate profile for your system.
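For example, you can display the recommended profile, switch to the latency-performance profile, and confirm the active profile with the following commands. The profile name used here is only an illustration; choose the profile that suits your workload.
# tuned-adm recommend
# tuned-adm profile latency-performance
# tuned-adm active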
For further details about these profiles or additional configuration options, see Section A.5, “tuned-adm”.

8.4.2. Setting the Default I/O Scheduler

The default I/O scheduler is the scheduler that is used if no other scheduler is explicitly specified for the device.
If no default scheduler is specified, the cfq scheduler is used for SATA drives, and the deadline scheduler is used for all other drives. If you specify a default scheduler by following the instructions in this section, that default scheduler is applied to all devices.
To set the default I/O scheduler, you can use the Tuned tool, or modify the /etc/default/grub file manually.
Red Hat recommends using the Tuned tool to specify the default I/O scheduler on a booted system. To set the elevator parameter, enable the disk plug-in. For information on the disk plug-in, see Section 3.1.1, “Plug-ins” in the Tuned chapter.
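As a minimal sketch, a custom Tuned profile can inherit an existing profile and use the disk plug-in to set the elevator parameter. The profile name myprofile and the deadline scheduler below are only illustrative:
# cat /etc/tuned/myprofile/tuned.conf
[main]
include=throughput-performance
[disk]
elevator=deadline
Activate the custom profile with:
# tuned-adm profile myprofile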
To modify the default scheduler by using GRUB 2, append the elevator parameter to the kernel command line. You can do this temporarily by editing the kernel command line at boot time, or persistently by modifying the /etc/default/grub file, as described in Procedure 8.1, “Setting the Default I/O Scheduler by Using GRUB 2”.

Procedure 8.1. Setting the Default I/O Scheduler by Using GRUB 2

To set the default I/O Scheduler on a booted system and make the configuration persist after reboot:
  1. Add the elevator parameter to the GRUB_CMDLINE_LINUX line in the /etc/default/grub file.
    # cat /etc/default/grub
    ...
    GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=vg00/lvroot rd.lvm.lv=vg00/lvswap elevator=noop"
    ...
    
    In Red Hat Enterprise Linux 7, the available schedulers are deadline, noop, and cfq. For more information, see the cfq-iosched.txt and deadline-iosched.txt files in the documentation for your kernel, available after installing the kernel-doc package.
  2. Create a new configuration with the elevator parameter added.
    The location of the GRUB 2 configuration file is different on systems with the BIOS firmware and on systems with UEFI. Use one of the following commands to recreate the GRUB 2 configuration file.
    • On a system with the BIOS firmware, use:
      # grub2-mkconfig -o /etc/grub2.cfg
    • On a system with the UEFI firmware, use:
      # grub2-mkconfig -o /etc/grub2-efi.cfg
  3. Reboot the system for the change to take effect.
    For more information on version 2 of the GNU GRand Unified Bootloader (GRUB 2), see the Working with the GRUB 2 Boot Loader chapter of the Red Hat Enterprise Linux 7 System Administrator's Guide.
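After rebooting, you can confirm which scheduler is in effect for a given device; the active scheduler is shown in square brackets in the output. The device name sda is only an example:
# cat /sys/block/sda/queue/scheduler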

8.4.3. Generic Block Device Tuning Parameters

The generic tuning parameters listed in this section are available within the /sys/block/sdX/queue/ directory. The listed tuning parameters are separate from I/O scheduler tuning, and are applicable to all I/O schedulers.
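For example, to inspect the current value of a parameter and change it for a device, read from or write to the corresponding file. The device name sda is only an example, and changes made this way do not persist across reboots:
# cat /sys/block/sda/queue/add_random
# echo 0 > /sys/block/sda/queue/add_random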
add_random
Some I/O events contribute to the entropy pool for /dev/random. This parameter can be set to 0 if the overhead of these contributions becomes measurable.
iostats
The default value is 1 (enabled). Setting iostats to 0 disables the gathering of I/O statistics for the device, which removes a small amount of overhead from the I/O path. Setting iostats to 0 might slightly improve performance for very high performance devices, such as certain NVMe solid-state storage devices. It is recommended to leave iostats enabled unless otherwise specified for the given storage model by the vendor.
If you disable iostats, the I/O statistics for the device are no longer present within the /proc/diskstats file. The content of /proc/diskstats is the source of I/O information for monitoring tools, such as sar or iostat. Therefore, if you disable the iostats parameter for a device, the device is no longer present in the output of I/O monitoring tools.
max_sectors_kb
Specifies the maximum size of an I/O request in kilobytes. The default value is 512 KB. The minimum value for this parameter is determined by the logical block size of the storage device. The maximum value for this parameter is determined by the value of max_hw_sectors_kb.
Certain solid-state disks perform poorly when the I/O requests are larger than the internal erase block size. To determine whether this is the case for the solid-state disk model attached to the system, check with the hardware vendor, and follow their recommendations. Red Hat recommends that max_sectors_kb always be a multiple of the optimal I/O size and the internal erase block size. Use a value of logical_block_size for either parameter if it is zero or not specified by the storage device.
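For example, to check the optimal I/O size and the hardware limit before setting max_sectors_kb to a multiple of the optimal I/O size (the device name sda and the value 1024 are only illustrative):
# cat /sys/block/sda/queue/optimal_io_size
# cat /sys/block/sda/queue/max_hw_sectors_kb
# echo 1024 > /sys/block/sda/queue/max_sectors_kb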
nomerges
Most workloads benefit from request merging. However, disabling merges can be useful for debugging purposes. By default, the nomerges parameter is set to 0, which enables merging. To disable simple one-hit merging, set nomerges to 1. To disable all types of merging, set nomerges to 2.
nr_requests
Specifies the maximum number of read and write requests that can be queued at one time. The default value is 128, which means that 128 read requests and 128 write requests can be queued before the next process to request a read or write is put to sleep.
For latency-sensitive applications, lower the value of this parameter and limit the command queue depth on the storage so that write-back I/O cannot fill the device queue with write requests. When the device queue fills, other processes attempting to perform I/O operations are put to sleep until queue space becomes available. Requests are then allocated in a round-robin manner, which prevents one process from continuously consuming all spots in the queue.
The maximum number of I/O operations within the I/O scheduler is nr_requests*2. As stated, nr_requests is applied separately for reads and writes. Note that nr_requests only applies to the I/O operations within the I/O scheduler and not to I/O operations already dispatched to the underlying device. Therefore, the maximum outstanding limit of I/O operations against a device is (nr_requests*2)+(queue_depth) where queue_depth is /sys/block/sdN/device/queue_depth, sometimes also referred to as the LUN queue depth. You can see this total outstanding number of I/O operations in, for example, the output of iostat in the avgqu-sz column.
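For example, to lower nr_requests for a latency-sensitive workload and check the LUN queue depth that contributes to the total outstanding limit (the device name sda and the value 32 are only illustrative):
# echo 32 > /sys/block/sda/queue/nr_requests
# cat /sys/block/sda/device/queue_depth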
optimal_io_size
Some storage devices report an optimal I/O size through this parameter. If this value is reported, Red Hat recommends that applications issue I/O aligned to and in multiples of the optimal I/O size wherever possible.
read_ahead_kb
Defines the maximum number of kilobytes that the operating system may read ahead during a sequential read operation. As a result, the likely-needed information is already present within the kernel page cache for the next sequential read, which improves read I/O performance.
Device mappers often benefit from a high read_ahead_kb value. 128 KB for each device to be mapped is a good starting point, but increasing the read_ahead_kb value up to 4–8 MB might improve performance in application environments where sequential reading of large files takes place.
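For example, to raise the read-ahead value to 4 MB (4096 KB) on a device-mapper device (the device name dm-0 is only an example):
# echo 4096 > /sys/block/dm-0/queue/read_ahead_kb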
rotational
Some solid-state disks do not correctly advertise their solid-state status, and are mounted as traditional rotational disks. If your solid-state device does not set this parameter to 0 automatically, set it to 0 manually to disable unnecessary seek-reducing logic in the scheduler.
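For example, to check whether a device advertises itself as rotational and mark it as non-rotational (the device name sda is only an example):
# cat /sys/block/sda/queue/rotational
# echo 0 > /sys/block/sda/queue/rotational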
rq_affinity
By default, I/O completions can be processed on a different processor than the processor that issued the I/O request. Set rq_affinity to 1 to disable this ability and perform completions only on the processor that issued the I/O request. This can improve the effectiveness of processor data caching.
scheduler
To set the scheduler or scheduler preference order for a particular storage device, edit the /sys/block/devname/queue/scheduler file, where devname is the name of the device you want to configure.
# echo cfq > /sys/block/hda/queue/scheduler

8.4.4. Tuning the Deadline Scheduler

When deadline is in use, queued I/O requests are sorted into a read or write batch and then scheduled for execution in increasing LBA order. Read batches take precedence over write batches by default, as applications are more likely to block on read I/O. After a batch is processed, deadline checks how long write operations have been starved of processor time and schedules the next read or write batch as appropriate.
The following parameters affect the behavior of the deadline scheduler.
fifo_batch
The number of read or write operations to issue in a single batch. The default value is 16. A higher value can increase throughput, but will also increase latency.
front_merges
If your workload will never generate front merges, this tunable can be set to 0. However, unless you have measured the overhead of this check, Red Hat recommends the default value of 1.
read_expire
The number of milliseconds in which a read request should be scheduled for service. The default value is 500 (0.5 seconds).
write_expire
The number of milliseconds in which a write request should be scheduled for service. The default value is 5000 (5 seconds).
writes_starved
The number of read batches that can be processed before processing a write batch. The higher this value is set, the greater the preference given to read batches.
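When the deadline scheduler is in use, these parameters are set per device under the /sys/block/devname/queue/iosched/ directory. For example (the device name sda and the values are only illustrative):
# echo 32 > /sys/block/sda/queue/iosched/fifo_batch
# echo 250 > /sys/block/sda/queue/iosched/read_expire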

8.4.5. Tuning the CFQ Scheduler

When CFQ is in use, processes are placed into three classes: real time, best effort, and idle. All real time processes are scheduled before any best effort processes, which are scheduled before any idle processes. By default, processes are classed as best effort. You can manually adjust the class of a process with the ionice command.
You can further adjust the behavior of the CFQ scheduler with the following parameters. These parameters are set on a per-device basis by altering the specified files under the /sys/block/devname/queue/iosched directory.
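For example, to check and change one of these parameters for a device (the device name sda is only an example):
# cat /sys/block/sda/queue/iosched/low_latency
# echo 0 > /sys/block/sda/queue/iosched/low_latency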
back_seek_max
The maximum distance in kilobytes that CFQ will perform a backward seek. The default value is 16 KB. Backward seeks typically damage performance, so large values are not recommended.
back_seek_penalty
The multiplier applied to backward seeks when the disk head is deciding whether to move forward or backward. The default value is 2. If the disk head position is at 1024 KB, and there are equidistant requests in the system (1008 KB and 1040 KB, for example), the back_seek_penalty is applied to backward seek distances and the disk moves forward.
fifo_expire_async
The length of time in milliseconds that an asynchronous (buffered write) request can remain unserviced. After this amount of time expires, a single starved asynchronous request is moved to the dispatch list. The default value is 250 milliseconds.
fifo_expire_sync
The length of time in milliseconds that a synchronous (read or O_DIRECT write) request can remain unserviced. After this amount of time expires, a single starved synchronous request is moved to the dispatch list. The default value is 125 milliseconds.
group_idle
This parameter is set to 0 (disabled) by default. When set to 1 (enabled), the cfq scheduler idles on the last process that is issuing I/O in a control group. This is useful when using proportional weight I/O control groups and when slice_idle is set to 0 (on fast storage).
group_isolation
This parameter is set to 0 (disabled) by default. When set to 1 (enabled), it provides stronger isolation between groups, but reduces throughput, as fairness is applied to both random and sequential workloads. When group_isolation is disabled (set to 0), fairness is provided to sequential workloads only. For more information, see the installed documentation in /usr/share/doc/kernel-doc-version/Documentation/cgroups/blkio-controller.txt.
low_latency
This parameter is set to 1 (enabled) by default. When enabled, cfq favors fairness over throughput by providing a maximum wait time of 300 ms for each process issuing I/O on a device. When this parameter is set to 0 (disabled), target latency is ignored and each process receives a full time slice.
quantum
This parameter defines the number of I/O requests that cfq sends to one device at one time, essentially limiting queue depth. The default value is 8 requests. The device being used may support greater queue depth, but increasing the value of quantum will also increase latency, especially for large sequential write workloads.
slice_async
This parameter defines the length of the time slice (in milliseconds) allotted to each process issuing asynchronous I/O requests. The default value is 40 milliseconds.
slice_idle
This parameter specifies the length of time in milliseconds that cfq idles while waiting for further requests. The default value is 0 (no idling at the queue or service tree level). The default value is ideal for throughput on external RAID storage, but can degrade throughput on internal non-RAID storage as it increases the overall number of seek operations.
slice_sync
This parameter defines the length of the time slice (in milliseconds) allotted to each process issuing synchronous I/O requests. The default value is 100 ms.

8.4.5.1. Tuning CFQ for Fast Storage

The cfq scheduler is not recommended for hardware that does not suffer a large seek penalty, such as fast external storage arrays or solid-state disks. If your use case requires cfq to be used on this storage, set the following parameters, as shown in the example commands after this list:
  • Set /sys/block/devname/queue/iosched/slice_idle to 0
  • Set /sys/block/devname/queue/iosched/quantum to 64
  • Set /sys/block/devname/queue/iosched/group_idle to 1
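For example, assuming a device named sda, the corresponding commands are:
# echo 0 > /sys/block/sda/queue/iosched/slice_idle
# echo 64 > /sys/block/sda/queue/iosched/quantum
# echo 1 > /sys/block/sda/queue/iosched/group_idle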

8.4.6. Tuning the noop Scheduler

The noop I/O scheduler is primarily useful for CPU-bound systems that use fast storage. Also, the noop I/O scheduler is commonly, but not exclusively, used on virtual machines when they are performing I/O operations to virtual disks.
There are no tunable parameters specific to the noop I/O scheduler.

8.4.7. Configuring File Systems for Performance

This section covers the tuning parameters specific to each file system supported in Red Hat Enterprise Linux 7. Parameters are divided according to whether their values should be configured when you format the storage device, or when you mount the formatted device.
Where loss in performance is caused by file fragmentation or resource contention, performance can generally be improved by reconfiguring the file system. However, in some cases the application may need to be altered. In this case, Red Hat recommends contacting Customer Support for assistance.

8.4.7.1. Tuning XFS

This section covers some of the tuning parameters available to XFS file systems at format and at mount time.
The default formatting and mount settings for XFS are suitable for most workloads. Red Hat recommends changing them only if specific configuration changes are expected to benefit your workload.
8.4.7.1.1. Formatting Options
For further details about any of these formatting options, see the man page:
$ man mkfs.xfs
Directory block size
The directory block size affects the amount of directory information that can be retrieved or modified per I/O operation. The minimum value for directory block size is the file system block size (4 KB by default). The maximum value for directory block size is 64 KB.
At a given directory block size, a larger directory requires more I/O than a smaller directory. A system with a larger directory block size also consumes more processing power per I/O operation than a system with a smaller directory block size. It is therefore recommended to have as small a directory and directory block size as possible for your workload.
Red Hat recommends the directory block sizes listed in Table 8.1, “Recommended Maximum Directory Entries for Directory Block Sizes” for file systems with no more than the listed number of entries for write-heavy and read-heavy workloads.
For detailed information about the effect of directory block size on read and write workloads in file systems of different sizes, see the XFS documentation.
To configure directory block size, use the mkfs.xfs -n option. See the mkfs.xfs man page for details.
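For example, to format a device with an 8 KB directory block size (the device name is only a placeholder):
# mkfs.xfs -n size=8192 /dev/sdb1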
Allocation groups
An allocation group is an independent structure that indexes free space and allocated inodes across a section of the file system. Each allocation group can be modified independently, allowing XFS to perform allocation and deallocation operations concurrently as long as concurrent operations affect different allocation groups. The number of concurrent operations that can be performed in the file system is therefore equal to the number of allocation groups. However, since the ability to perform concurrent operations is also limited by the number of processors able to perform the operations, Red Hat recommends that the number of allocation groups be greater than or equal to the number of processors in the system.
A single directory cannot be modified by multiple allocation groups simultaneously. Therefore, Red Hat recommends that applications that create and remove large numbers of files do not store all files in a single directory.
To configure allocation groups, use the mkfs.xfs -d option. See the mkfs.xfs man page for details.
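For example, to format a device with 32 allocation groups (the device name and the count are only illustrative):
# mkfs.xfs -d agcount=32 /dev/sdb1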
Growth constraints
If you may need to increase the size of your file system after formatting time (either by adding more hardware or through thin-provisioning), you must carefully consider initial file layout, as allocation group size cannot be changed after formatting is complete.
Allocation groups must be sized according to the eventual capacity of the file system, not the initial capacity. The number of allocation groups in the fully-grown file system should not exceed several hundred, unless allocation groups are at their maximum size (1 TB). Therefore for most file systems, the recommended maximum growth to allow for a file system is ten times the initial size.
Additional care must be taken when growing a file system on a RAID array, as the device size must be aligned to an exact multiple of the allocation group size so that new allocation group headers are correctly aligned on the newly added storage. The new storage must also have the same geometry as the existing storage, since geometry cannot be changed after formatting time, and therefore cannot be optimized for storage of a different geometry on the same block device.
Inode size and inline attributes
If the inode has sufficient space available, XFS can write attribute names and values directly into the inode. These inline attributes can be retrieved and modified up to an order of magnitude faster than retrieving separate attribute blocks, as additional I/O is not required.
The default inode size is 256 bytes. Only around 100 bytes of this is available for attribute storage, depending on the number of data extent pointers stored in the inode. Increasing inode size when you format the file system can increase the amount of space available for storing attributes.
Both attribute names and attribute values are limited to a maximum size of 254 bytes. If either name or value exceeds 254 bytes in length, the attribute is pushed to a separate attribute block instead of being stored inline.
To configure inode parameters, use the mkfs.xfs -i option. See the mkfs.xfs man page for details.
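For example, to format a device with 512-byte inodes, leaving more room for inline attributes (the device name is only a placeholder):
# mkfs.xfs -i size=512 /dev/sdb1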
RAID
If software RAID is in use, mkfs.xfs automatically configures the file system with an appropriate stripe unit and width for the underlying hardware. However, stripe unit and width may need to be manually configured if hardware RAID is in use, as not all hardware RAID devices export this information. To configure stripe unit and width, use the mkfs.xfs -d option. See the mkfs.xfs man page for details.
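For example, to manually specify a 64 KB stripe unit and a stripe width of 4 stripe units for a hardware RAID device (the values and device name are only illustrative; use the geometry of your RAID set):
# mkfs.xfs -d su=64k,sw=4 /dev/sdb1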
Log size
Pending changes are aggregated in memory until a synchronization event is triggered, at which point they are written to the log. The size of the log determines the number of concurrent modifications that can be in-progress at one time. It also determines the maximum amount of change that can be aggregated in memory, and therefore how often logged data is written to disk. A smaller log forces data to be written back to disk more frequently than a larger log. However, a larger log uses more memory to record pending modifications, so a system with limited memory will not benefit from a larger log.
Logs perform better when they are aligned to the underlying stripe unit; that is, they start and end at stripe unit boundaries. To align logs to the stripe unit, use the mkfs.xfs -d option. See the mkfs.xfs man page for details.
To configure the log size, use the following mkfs.xfs option, replacing logsize with the size of the log:
# mkfs.xfs -l size=logsize
For further details, see the mkfs.xfs man page:
$ man mkfs.xfs
Log stripe unit
Log writes on storage devices that use RAID5 or RAID6 layouts may perform better when they start and end at stripe unit boundaries (are aligned to the underlying stripe unit). mkfs.xfs attempts to set an appropriate log stripe unit automatically, but this depends on the RAID device exporting this information.
Setting a large log stripe unit can harm performance if your workload triggers synchronization events very frequently, because smaller writes need to be padded to the size of the log stripe unit, which can increase latency. If your workload is bound by log write latency, Red Hat recommends setting the log stripe unit to 1 block so that log writes can be issued without padding.
The maximum supported log stripe unit is the size of the maximum log buffer size (256 KB). It is therefore possible that the underlying storage may have a larger stripe unit than can be configured on the log. In this case, mkfs.xfs issues a warning and sets a log stripe unit of 32 KB.
To configure the log stripe unit, use one of the following options, where N is the number of blocks to use as the stripe unit, and size is the size of the stripe unit in KB.
mkfs.xfs -l sunit=Nb
mkfs.xfs -l su=size
For further details, see the mkfs.xfs man page:
$ man mkfs.xfs
8.4.7.1.2. Mount Options
Inode allocation
Highly recommended for file systems greater than 1 TB in size. The inode64 parameter configures XFS to allocate inodes and data across the entire file system. This ensures that inodes are not allocated largely at the beginning of the file system, and data is not largely allocated at the end of the file system, improving performance on large file systems.
Log buffer size and number
The larger the log buffer, the fewer I/O operations it takes to write all changes to the log. A larger log buffer can improve performance on systems with I/O-intensive workloads that do not have a non-volatile write cache.
The log buffer size is configured with the logbsize mount option, and defines the maximum amount of information that can be stored in the log buffer; if a log stripe unit is not set, buffer writes can be shorter than the maximum, and therefore there is no need to reduce the log buffer size for synchronization-heavy workloads. The default size of the log buffer is 32 KB. The maximum size is 256 KB and other supported sizes are 64 KB, 128 KB or power of 2 multiples of the log stripe unit between 32 KB and 256 KB.
The number of log buffers is defined by the logbufs mount option. The default value is 8 log buffers (the maximum), but as few as two log buffers can be configured. It is usually not necessary to reduce the number of log buffers, except on memory-bound systems that cannot afford to allocate memory to additional log buffers. Reducing the number of log buffers tends to reduce log performance, especially on workloads sensitive to log I/O latency.
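For example, to mount an XFS file system with the inode64 allocator and a 256 KB log buffer (the device name, mount point, and values are only placeholders):
# mount -o inode64,logbsize=256k /dev/sdb1 /mount/point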
Delay change logging
XFS has the option to aggregate changes in memory before writing them to the log. The delaylog parameter allows frequently modified metadata to be written to the log periodically instead of every time it changes. This option increases the potential number of operations lost in a crash and increases the amount of memory used to track metadata. However, it can also increase metadata modification speed and scalability by an order of magnitude, and does not reduce data or metadata integrity when fsync, fdatasync, or sync are used to ensure data and metadata is written to disk.
For more information on mount options, see the xfs man page:
$ man xfs

8.4.7.2. Tuning ext4

This section covers some of the tuning parameters available to ext4 file systems at format and at mount time.
8.4.7.2.1. Formatting Options
Inode table initialization
Initializing all inodes in the file system can take a very long time on very large file systems. By default, the initialization process is deferred (lazy inode table initialization is enabled). However, if your system does not have an ext4 driver, lazy inode table initialization is disabled by default. It can be enabled by setting lazy_itable_init to 1. In this case, kernel processes continue to initialize the file system after it is mounted.
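For example, to explicitly enable lazy inode table initialization when formatting a device (the device name is only a placeholder):
# mkfs.ext4 -E lazy_itable_init=1 /dev/sdb1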
This section describes only some of the options available at format time. For further formatting parameters, see the mkfs.ext4 man page:
$ man mkfs.ext4
8.4.7.2.2. Mount Options
Inode table initialization rate
When lazy inode table initialization is enabled, you can control the rate at which initialization occurs by specifying a value for the init_itable parameter. The amount of time spent performing background initialization is approximately equal to 1 divided by the value of this parameter. The default value is 10.
Automatic file synchronization
Some applications do not correctly perform an fsync after renaming an existing file, or after truncating and rewriting. By default, ext4 automatically synchronizes files after each of these operations. However, this can be time consuming.
If this level of synchronization is not required, you can disable this behavior by specifying the noauto_da_alloc option at mount time. If noauto_da_alloc is set, applications must explicitly use fsync to ensure data persistence.
Journal I/O priority
By default, journal I/O has a priority of 3, which is slightly higher than the priority of normal I/O. You can control the priority of journal I/O with the journal_ioprio parameter at mount time. Valid values for journal_ioprio range from 0 to 7, with 0 being the highest priority I/O.
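For example, to mount an ext4 file system with a custom background initialization rate, automatic file synchronization disabled, and a lower journal I/O priority (the values, device name, and mount point are only placeholders):
# mount -o init_itable=20,noauto_da_alloc,journal_ioprio=5 /dev/sdb1 /mount/point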
This section describes only some of the options available at mount time. For further mount options, see the mount man page:
$ man mount

8.4.7.3. Tuning Btrfs

Starting with Red Hat Enterprise Linux 7.0, Btrfs is provided as a Technology Preview. Tuning should always be done to optimize the system based on its current workload. For information on creation and mounting options, see the chapter on Btrfs in the Red Hat Enterprise Linux 7 Storage Administration Guide.

Data Compression

The default compression algorithm is zlib, but a specific workload can give a reason to change the compression algorithm. For example, if you have a single thread with heavy file I/O, using the lzo algorithm can be preferable. The following options are available at mount time (an example mount command follows the list):
  • compress=zlib – the default option with a high compression ratio, safe for older kernels.
  • compress=lzo – faster compression than zlib, but with a lower compression ratio.
  • compress=no – disables compression.
  • compress-force=method – enables compression even for files that do not compress well, such as videos and disk images. The available methods are zlib and lzo.
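For example, to mount a Btrfs file system with lzo compression (the device name and mount point are only placeholders):
# mount -o compress=lzo /dev/sdb1 /mount/point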
Only files created or changed after the mount option is added will be compressed. To compress existing files, run the following command after you replace method with either zlib or lzo:
$ btrfs filesystem defragment -cmethod
To re-compress the file using lzo, run:
$ btrfs filesystem defragment -r -v -clzo /

8.4.7.4. Tuning GFS2

This section covers some of the tuning parameters available to GFS2 file systems at format and at mount time.
Directory spacing
All directories created in the top-level directory of the GFS2 mount point are automatically spaced to reduce fragmentation and increase write speed in those directories. To space another directory like a top-level directory, mark that directory with the T attribute, as shown, replacing dirname with the path to the directory you wish to space:
# chattr +T dirname
chattr is provided as part of the e2fsprogs package.
Reduce contention
GFS2 uses a global locking mechanism that can require communication between the nodes of a cluster. Contention for files and directories between multiple nodes lowers performance. You can minimize the risk of cross-cache invalidation by minimizing the areas of the file system that are shared between multiple nodes.