Chapter 8. Storage and File Systems

This chapter outlines supported file systems and configuration options that affect application performance for both I/O and file systems in Red Hat Enterprise Linux 7. Section 8.1, “Considerations” discusses the I/O and file system related factors that affect performance. Section 8.2, “Monitoring and Diagnosing Performance Problems” teaches you how to use Red Hat Enterprise Linux 7 tools to diagnose performance problems related to I/O or file system configuration details. Section 8.4, “Configuration Tools” discusses the tools and strategies you can use to solve I/O and file system related performance problems in Red Hat Enterprise Linux 7.

8.1. Considerations

The appropriate settings for storage and file system performance are highly dependent on the purpose of the storage. I/O and file system performance can be affected by any of the following factors:
  • Data write or read patterns
  • Data alignment with underlying geometry
  • Block size
  • File system size
  • Journal size and location
  • Recording access times
  • Ensuring data reliability
  • Pre-fetching data
  • Pre-allocating disk space
  • File fragmentation
  • Resource contention
Read this chapter to gain an understanding of the formatting and mount options that affect file system throughput, scalability, responsiveness, resource usage, and availability.

8.1.1. I/O Schedulers

The I/O scheduler determines when and for how long I/O operations run on a storage device. It is also known as the I/O elevator.
Red Hat Enterprise Linux 7 provides three I/O schedulers.
deadline
The default I/O scheduler for all block devices, except for SATA disks. Deadline attempts to provide a guaranteed latency for requests from the point at which requests reach the I/O scheduler. This scheduler is suitable for most use cases, but particularly those in which read operations occur more often than write operations.
Queued I/O requests are sorted into a read or write batch and then scheduled for execution in increasing LBA order. Read batches take precedence over write batches by default, as applications are more likely to block on read I/O. After a batch is processed, deadline checks how long write operations have been starved of service and schedules the next read or write batch as appropriate. The number of requests to handle per batch, the number of read batches to issue per write batch, and the amount of time before requests expire are all configurable; see Section 8.4.4, “Tuning the Deadline Scheduler” for details.
cfq
The default scheduler only for devices identified as SATA disks. The Completely Fair Queueing scheduler, cfq, divides processes into three separate classes: real time, best effort, and idle. I/O from processes in the real time class is always serviced before I/O from processes in the best effort class, which is always serviced before I/O from processes in the idle class. This means that processes in the real time class can starve both best effort and idle processes of access to the device. Processes are assigned to the best effort class by default.
cfq uses historical data to anticipate whether an application will issue more I/O requests in the near future. If more I/O is expected, cfq idles to wait for the new I/O, even if I/O from other processes is waiting to be processed.
Because of this tendency to idle, the cfq scheduler should not be used in conjunction with hardware that does not incur a large seek penalty unless it is tuned for this purpose. It should also not be used in conjunction with other non-work-conserving schedulers, such as a host-based hardware RAID controller, as stacking these schedulers tends to cause a large amount of latency.
cfq behavior is highly configurable; see Section 8.4.5, “Tuning the CFQ Scheduler” for details.
noop
The noop I/O scheduler implements a simple FIFO (first-in first-out) scheduling algorithm. Requests are merged at the generic block layer through a simple last-hit cache. This can be the best scheduler for CPU-bound systems using fast storage.
For details on setting a different default I/O scheduler, or specifying a different scheduler for a particular device, see Section 8.4, “Configuration Tools”.
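As a minimal sketch of how to see which scheduler is currently in effect, the following reads the scheduler file that the kernel exposes for each block device under /sys/block. The kernel marks the active scheduler with square brackets (for example, noop [deadline] cfq). The helper name active_scheduler is illustrative and is not part of any Red Hat tool.

```shell
# Extract the active scheduler from the contents of a queue/scheduler file.
# The active entry is the one enclosed in square brackets.
# $1: a line such as "noop [deadline] cfq"
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Report the active scheduler for every block device that exposes one.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    printf '%s: %s\n' "$dev" "$(active_scheduler "$(cat "$f")")"
done
```

If a device lists no bracketed entry, the helper prints an empty string for that device.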

8.1.2. File Systems

Read this section for details about supported file systems in Red Hat Enterprise Linux 7, their recommended use cases, and the format and mount options available to file systems in general. Detailed tuning recommendations for these file systems are available in Section 8.4.7, “Configuring File Systems for Performance”.

8.1.2.1. XFS

XFS is a robust and highly scalable 64-bit file system. It is the default file system in Red Hat Enterprise Linux 7. XFS uses extent-based allocation, and features a number of allocation schemes, including pre-allocation and delayed allocation, both of which reduce fragmentation and aid performance. It also supports metadata journaling, which can facilitate crash recovery. XFS can be defragmented and enlarged while mounted and active, and Red Hat Enterprise Linux 7 supports several XFS-specific backup and restore utilities.
As of Red Hat Enterprise Linux 7.0 GA, XFS is supported to a maximum file system size of 500 TB, and a maximum file offset of 8 EB (sparse files). For details about administering XFS, see the Red Hat Enterprise Linux 7 Storage Administration Guide. For assistance tuning XFS for a specific purpose, see Section 8.4.7.1, “Tuning XFS”.

8.1.2.2. Ext4

Ext4 is a scalable extension of the ext3 file system. Its default behavior is optimal for most workloads. However, it is supported only to a maximum file system size of 50 TB, and a maximum file size of 16 TB. For details about administering ext4, see the Red Hat Enterprise Linux 7 Storage Administration Guide. For assistance tuning ext4 for a specific purpose, see Section 8.4.7.2, “Tuning ext4”.

8.1.2.3. Btrfs (Technology Preview)

The default file system for Red Hat Enterprise Linux 7 is XFS. Btrfs (B-tree file system), a relatively new copy-on-write (COW) file system, is shipped as a Technology Preview. Some of the unique Btrfs features include:
  • the ability to take snapshots of specific files, volumes, or sub-volumes rather than the whole file system;
  • support for several versions of redundant array of inexpensive disks (RAID);
  • back references that map I/O errors to file system objects;
  • transparent compression (all files on the partition are automatically compressed);
  • checksums on data and metadata.
Although Btrfs is considered a stable file system, it is under constant development, so some functionality, such as the repair tools, is basic compared to more mature file systems.
Currently, selecting Btrfs is suitable when advanced features (such as snapshots, compression, and file data checksums) are required, but performance is relatively unimportant. If advanced features are not required, the risk of failure and comparatively weak performance over time make other file systems preferable. Another drawback, compared to other file systems, is the maximum supported file system size of 50 TB.
For more information, see Section 8.4.7.3, “Tuning Btrfs”, and the chapter on Btrfs in the Red Hat Enterprise Linux 7 Storage Administration Guide.

8.1.2.4. GFS2

Global File System 2 (GFS2) is part of the High Availability Add-On that provides clustered file system support to Red Hat Enterprise Linux 7. GFS2 provides a consistent file system image across all servers in a cluster, which allows servers to read from and write to a single shared file system.
GFS2 is supported to a maximum file system size of 100 TB.
For details on administering GFS2, see the Global File System 2 guide or the Red Hat Enterprise Linux 7 Storage Administration Guide. For information on tuning GFS2 for a specific purpose, see Section 8.4.7.4, “Tuning GFS2”.

8.1.3. Generic Tuning Considerations for File Systems

This section covers tuning considerations common to all file systems. For tuning recommendations specific to your file system, see Section 8.4.7, “Configuring File Systems for Performance”.

8.1.3.1. Considerations at Format Time

Some file system configuration decisions cannot be changed after the device is formatted. This section covers the options available to you for decisions that must be made before you format your storage device.
Size
Create an appropriately sized file system for your workload. Smaller file systems have proportionally shorter backup times and require less time and memory for file system checks. However, if your file system is too small, its performance will suffer from high fragmentation.
Block size
The block is the unit of work for the file system. The block size determines how much data can be stored in a single block, and therefore the smallest amount of data that is written or read at one time.
The default block size is appropriate for most use cases. However, your file system will perform better and store data more efficiently if the block size (or the size of multiple blocks) is the same as or slightly larger than the amount of data that is typically read or written at one time. A small file will still use an entire block. Files can be spread across multiple blocks, but this can create additional runtime overhead. Additionally, some file systems are limited to a certain number of blocks, which in turn limits the maximum size of the file system.
Block size is specified as part of the file system options when formatting a device with the mkfs command. The parameter that specifies the block size varies with the file system; see the mkfs man page for your file system for details. For example, to see the options available when formatting an XFS file system, execute the following command.
$ man mkfs.xfs
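The trade-off between block size and space efficiency can be illustrated with simple arithmetic. The following sketch rounds a file size up to a whole number of blocks. It deliberately ignores metadata and any file-system-specific optimizations for small files, and the helper name allocated_bytes is illustrative only.

```shell
# Space a file occupies when rounded up to whole blocks.
# $1: file size in bytes, $2: block size in bytes
allocated_bytes() {
    echo $(( ( ($1 + $2 - 1) / $2 ) * $2 ))
}

allocated_bytes 100 4096     # a 100-byte file still occupies one 4096-byte block
allocated_bytes 10000 4096   # spans three 4096-byte blocks
allocated_bytes 10000 1024   # spans ten 1024-byte blocks
```

In this example the 10000-byte file wastes less space with 1024-byte blocks, but it is spread across more blocks, which is the runtime overhead described above.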
Geometry
File system geometry is concerned with the distribution of data across a file system. If your system uses striped storage, like RAID, you can improve performance by aligning data and metadata with the underlying storage geometry when you format the device.
Many devices export recommended geometry, which is then set automatically when the devices are formatted with a particular file system. If your device does not export these recommendations, or you want to change the recommended settings, you must specify geometry manually when you format the device with mkfs.
The parameters that specify file system geometry vary with the file system; see the mkfs man page for your file system for details. For example, to see the options available when formatting an ext4 file system, execute the following command.
$ man mkfs.ext4
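When geometry must be specified manually for ext4 on striped storage, the commonly used calculation is that stride equals the RAID chunk size divided by the file system block size, and stripe_width equals stride multiplied by the number of data-bearing disks. The following sketch derives both values and prints, but does not run, the resulting mkfs.ext4 command. The array layout, the helper name ext4_geometry_opts, and the device /dev/sdX are assumptions for illustration; check the values against your own storage and the mkfs.ext4 man page.

```shell
# Derive ext4 stride and stripe_width from RAID geometry.
# $1: RAID chunk (stripe unit) size in KiB
# $2: number of data-bearing disks (exclude parity disks)
# $3: file system block size in bytes
ext4_geometry_opts() {
    stride=$(( $1 * 1024 / $3 ))
    stripe_width=$(( stride * $2 ))
    echo "stride=${stride},stripe_width=${stripe_width}"
}

# Hypothetical array: 64 KiB chunks, 4 data disks, 4096-byte blocks.
opts=$(ext4_geometry_opts 64 4 4096)

# Print the command rather than running it; /dev/sdX is a placeholder.
echo "mkfs.ext4 -b 4096 -E ${opts} /dev/sdX"
```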
External journals
Journaling file systems document the changes that will be made during a write operation in a journal file prior to the operation being executed. This reduces the likelihood that a storage device will become corrupted in the event of a system crash or power failure, and speeds up the recovery process.
Metadata-intensive workloads involve very frequent updates to the journal. A larger journal uses more memory, but reduces the frequency of write operations. Additionally, you can improve the seek time of a device with a metadata-intensive workload by placing its journal on dedicated storage that is as fast as, or faster than, the primary storage.

Warning

Ensure that external journals are reliable. Losing an external journal device will cause file system corruption.
External journals must be created at format time, with journal devices being specified at mount time. For details, see the mkfs and mount man pages.
$ man mkfs
$ man mount
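As a sketch of the two steps for XFS, the following prints the format-time and mount-time commands for a file system whose journal (log) is placed on a separate device by using the logdev option. The device names and mount point are placeholders, and the run wrapper is a hypothetical helper that only prints commands unless DRY_RUN=0 is set, because formatting destroys existing data.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

DATA_DEV=/dev/sdX      # placeholder: device that will hold the file system
LOG_DEV=/dev/sdY       # placeholder: dedicated device for the journal
MOUNT_POINT=/mnt/data  # placeholder mount point

# Format time: create the XFS file system with its log on the external device.
run mkfs.xfs -l logdev="$LOG_DEV" "$DATA_DEV"

# Mount time: the external log device must be named again.
run mount -o logdev="$LOG_DEV" "$DATA_DEV" "$MOUNT_POINT"
```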

8.1.3.2. Considerations at Mount Time

This section covers tuning decisions that apply to most file systems and can be specified as the device is mounted.
Barriers
File system barriers ensure that file system metadata is correctly written and ordered on persistent storage, and that data transmitted with fsync persists across a power outage. On previous versions of Red Hat Enterprise Linux, enabling file system barriers could significantly slow applications that relied heavily on fsync, or created and deleted many small files.
In Red Hat Enterprise Linux 7, file system barrier performance has been improved such that the performance effects of disabling file system barriers are negligible (less than 3%).
Access Time
Every time a file is read, its metadata is updated with the time at which access occurred (atime). This involves additional write I/O. In most cases, this overhead is minimal, as by default Red Hat Enterprise Linux 7 updates the atime field only when the previous access time was older than the times of last modification (mtime) or status change (ctime).
However, if updating this metadata is time consuming, and if accurate access time data is not required, you can mount the file system with the noatime mount option. This disables updates to metadata when a file is read. It also enables nodiratime behavior, which disables updates to metadata when a directory is read.
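Before changing the access time behavior, it can help to see which policy each mounted file system currently uses. The following sketch classifies the options listed in /proc/mounts; the helper name atime_mode is illustrative, and a file system that lists neither noatime nor relatime is reported simply as atime.

```shell
# Classify the access-time policy from a comma-separated mount option string.
atime_mode() {
    case ",$1," in
        *,noatime,*)  echo noatime ;;
        *,relatime,*) echo relatime ;;
        *)            echo atime ;;
    esac
}

# /proc/mounts fields: device, mount point, type, options, dump, pass.
if [ -r /proc/mounts ]; then
    while read -r dev mnt fstype opts rest; do
        printf '%-30s %s\n' "$mnt" "$(atime_mode "$opts")"
    done < /proc/mounts
fi
```

An already mounted file system can be switched without unmounting, for example with mount -o remount,noatime followed by the mount point.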
Read-ahead
Read-ahead behavior speeds up file access by pre-fetching data that is likely to be needed soon and loading it into the page cache, where it can be retrieved more quickly than if it were on disk. The higher the read-ahead value, the further ahead the system pre-fetches data.
Red Hat Enterprise Linux attempts to set an appropriate read-ahead value based on what it detects about your file system. However, accurate detection is not always possible. For example, if a storage array presents itself to the system as a single LUN, the system detects the single LUN, and does not set the appropriate read-ahead value for an array.
Workloads that involve heavy streaming of sequential I/O often benefit from high read-ahead values. The storage-related tuned profiles provided with Red Hat Enterprise Linux 7 raise the read-ahead value, as does using LVM striping, but these adjustments are not always sufficient for all workloads.
The parameters that define read-ahead behavior vary with the file system; see the mount man page for details.
$ man mount
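Independently of file system options, the kernel exposes a per-device read-ahead value in /sys/block/devname/queue/read_ahead_kb, and the blockdev command reports the same setting in 512-byte sectors (blockdev --getra). The following sketch lists the current values and converts between the two units. The helper name ra_sectors_to_kb is illustrative, and the sketch assumes that you want to inspect the block-device value rather than a file-system-specific parameter.

```shell
# Convert a read-ahead value in 512-byte sectors (as used by blockdev)
# to KiB (as used by read_ahead_kb).
ra_sectors_to_kb() {
    echo $(( $1 * 512 / 1024 ))
}

# Show the current read-ahead value for every block device.
for f in /sys/block/*/queue/read_ahead_kb; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    printf '%s: %s KiB\n' "$dev" "$(cat "$f")"
done

ra_sectors_to_kb 256    # 256 sectors correspond to 128 KiB
ra_sectors_to_kb 8192   # 8192 sectors correspond to 4096 KiB
```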

8.1.3.3. Maintenance

Regularly discarding blocks that are not in use by the file system is a recommended practice for both solid-state disks and thinly provisioned storage. There are two methods of discarding unused blocks: batch discard and online discard.
Batch discard
This type of discard is part of the fstrim command. It discards all unused blocks in a file system that match criteria specified by the administrator.
Red Hat Enterprise Linux 7 supports batch discard on XFS and ext4 formatted devices that support physical discard operations (that is, on HDD devices where the value of /sys/block/devname/queue/discard_max_bytes is not zero, and SSD devices where the value of /sys/block/devname/queue/discard_granularity is not zero).
Online discard
This type of discard operation is configured at mount time with the discard option, and runs in real time without user intervention. However, online discard only discards blocks that are transitioning from used to free. Red Hat Enterprise Linux 7 supports online discard on XFS and ext4 formatted devices.
Red Hat recommends batch discard except where online discard is required to maintain performance, or where batch discard is not feasible for the system's workload.
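To check whether a device can accept discard requests at all, the two sysfs values mentioned above can be read directly. The following sketch treats a device as supporting discard only when both values are non-zero, which is a conservative reading of the criteria above; the helper name supports_discard is illustrative.

```shell
# Succeeds when both sysfs values indicate discard support.
# $1: discard_max_bytes, $2: discard_granularity
supports_discard() {
    [ "$1" -gt 0 ] && [ "$2" -gt 0 ]
}

# Report discard support for every block device that exposes both values.
for q in /sys/block/*/queue; do
    [ -r "$q/discard_max_bytes" ] && [ -r "$q/discard_granularity" ] || continue
    dev=${q#/sys/block/}
    dev=${dev%%/*}
    if supports_discard "$(cat "$q/discard_max_bytes")" "$(cat "$q/discard_granularity")"; then
        echo "$dev: discard supported"
    else
        echo "$dev: discard not supported"
    fi
done
```

On a supported device, a batch discard of a mounted file system can then be run as root with fstrim -v followed by the mount point.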
Pre-allocation
Pre-allocation marks disk space as being allocated to a file without writing any data into that space. This can be useful in limiting data fragmentation and poor read performance. Red Hat Enterprise Linux 7 supports pre-allocating space on XFS, ext4, and GFS2 devices at mount time; see the mount man page for the appropriate parameter for your file system. Applications can also benefit from pre-allocating space by using the fallocate(2) glibc call.
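From the command line, the effect of pre-allocation can be observed with the fallocate utility, which is assumed here to be installed from the util-linux package, and with stat from GNU coreutils. The following sketch pre-allocates 1 MiB for a scratch file and then shows how many blocks are allocated to it. The helper name preallocate is illustrative, and the call can fail on file systems that do not support the operation.

```shell
# Pre-allocate $2 bytes for file $1 and print its apparent size in bytes.
# Prints nothing and fails if fallocate is unavailable or unsupported.
preallocate() {
    fallocate -l "$2" "$1" 2>/dev/null && stat -c '%s' "$1"
}

f=$(mktemp)   # scratch file; its location depends on TMPDIR
preallocate "$f" 1048576 || echo "fallocate is not supported here"

# %b is the number of allocated blocks, %B the size in bytes of each block.
stat -c 'allocated: %b blocks of %B bytes' "$f"
rm -f "$f"
```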