Chapter 8. Storage and File Systems
- Data write or read patterns
- Data alignment with underlying geometry
- Block size
- File system size
- Journal size and location
- Recording access times
- Ensuring data reliability
- Pre-fetching data
- Pre-allocating disk space
- File fragmentation
- Resource contention
8.1.1. I/O Schedulers
- deadline
- The default I/O scheduler for all block devices, except for SATA disks. Deadline attempts to provide a guaranteed latency for requests from the point at which requests reach the I/O scheduler. This scheduler is suitable for most use cases, but particularly those in which read operations occur more often than write operations.
Queued I/O requests are sorted into a read or write batch and then scheduled for execution in increasing LBA order. Read batches take precedence over write batches by default, as applications are more likely to block on read I/O. After a batch is processed, deadline checks how long write operations have been starved of service and schedules the next read or write batch as appropriate. The number of requests to handle per batch, the number of read batches to issue per write batch, and the amount of time before requests expire are all configurable; see Section 8.4.4, “Tuning the Deadline Scheduler” for details.
- cfq
- The default scheduler only for devices identified as SATA disks. The Completely Fair Queueing scheduler, cfq, divides processes into three separate classes: real time, best effort, and idle. I/O from processes in the real time class is always serviced before I/O from processes in the best effort class, which is always serviced before I/O from processes in the idle class. This means that processes in the real time class can starve both best effort and idle processes of I/O service. Processes are assigned to the best effort class by default.
cfq uses historical data to anticipate whether an application will issue more I/O requests in the near future. If more I/O is expected, cfq idles to wait for the new I/O, even if I/O from other processes is waiting to be processed.
Because of this tendency to idle, the cfq scheduler should not be used in conjunction with hardware that does not incur a large seek penalty unless it is tuned for this purpose. It should also not be used in conjunction with other non-work-conserving schedulers, such as a host-based hardware RAID controller, as stacking these schedulers tends to cause a large amount of latency.
cfq behavior is highly configurable; see Section 8.4.5, “Tuning the CFQ Scheduler” for details.
- noop
- The noop I/O scheduler implements a simple FIFO (first-in first-out) scheduling algorithm. Requests are merged at the generic block layer through a simple last-hit cache. This can be the best scheduler for CPU-bound systems using fast storage.
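To see which of the schedulers described above is active for a device, you can read the scheduler file that the kernel exposes under /sys/block. The following is a minimal sketch, not part of the original guide: `active_scheduler` is a hypothetical helper name, and `sda` in the usage comment is an assumed device name. It relies on the kernel convention of marking the active scheduler with square brackets.

```shell
#!/bin/sh
# Sketch: report the active I/O scheduler for a block device.
# The kernel lists the available schedulers in
# /sys/block/<dev>/queue/scheduler and marks the active one with
# square brackets, for example: "noop [deadline] cfq".

# active_scheduler is a hypothetical helper name. It takes the text
# of a scheduler file as its first argument and prints the bracketed
# entry. It prints nothing if no entry is bracketed.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Example usage (sda is an assumed device name):
#   active_scheduler "$(cat /sys/block/sda/queue/scheduler)"
```

Keeping the parsing separate from the file read makes the helper easy to check against sample strings before running it on a live system.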
8.1.2. File Systems
Btrfs (Technology Preview)
- the ability to take snapshots of specific files, volumes, or sub-volumes rather than the whole file system;
- support for several versions of redundant array of inexpensive disks (RAID);
- back references that map I/O errors to file system objects;
- transparent compression (all files on the partition are automatically compressed);
- checksums on data and metadata.
8.1.3. Generic Tuning Considerations for File Systems
8.1.3.1. Considerations at Format Time
- Size
- Create an appropriately-sized file system for your workload. Smaller file systems have proportionally shorter backup times and require less time and memory for file system checks. However, if your file system is too small, its performance will suffer from high fragmentation.
- Block size
- The block is the unit of work for the file system. The block size determines how much data can be stored in a single block, and therefore the smallest amount of data that is written or read at one time.
The default block size is appropriate for most use cases. However, your file system will perform better and store data more efficiently if the block size (or the size of multiple blocks) is the same as or slightly larger than the amount of data that is typically read or written at one time. A small file will still use an entire block. Files can be spread across multiple blocks, but this can create additional runtime overhead. Additionally, some file systems are limited to a certain number of blocks, which in turn limits the maximum size of the file system.
Block size is specified as part of the file system options when formatting a device with the mkfs command. The parameter that specifies the block size varies with the file system; see the mkfs man page for your file system for details. For example, to see the options available when formatting an XFS file system, execute the following command.
$ man mkfs.xfs
- Geometry
- File system geometry is concerned with the distribution of data across a file system. If your system uses striped storage, like RAID, you can improve performance by aligning data and metadata with the underlying storage geometry when you format the device.
Many devices export recommended geometry, which is then set automatically when the devices are formatted with a particular file system. If your device does not export these recommendations, or you want to change the recommended settings, you must specify geometry manually when you format the device with mkfs.
The parameters that specify file system geometry vary with the file system; see the mkfs man page for your file system for details. For example, to see the options available when formatting an ext4 file system, execute the following command.
$ man mkfs.ext4
- External journals
- Journaling file systems document the changes that will be made during a write operation in a journal file prior to the operation being executed. This reduces the likelihood that a storage device will become corrupted in the event of a system crash or power failure, and speeds up the recovery process.
Metadata-intensive workloads involve very frequent updates to the journal. A larger journal uses more memory, but reduces the frequency of write operations. Additionally, you can improve the seek time of a device with a metadata-intensive workload by placing its journal on dedicated storage that is as fast as, or faster than, the primary storage.
Warning: Ensure that external journals are reliable. Losing an external journal device will cause file system corruption.
External journals must be created at format time, with journal devices being specified at mount time. For details, see the mkfs and mount man pages.
$ man mkfs
$ man mount
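As an illustration of specifying geometry manually, the following sketch derives ext4 stripe values from RAID parameters. It is not taken from the original guide: `ext4_geometry` is a hypothetical helper name, and the chunk size, block size, and disk count in the example are assumed values. The arithmetic follows the usual convention that the stride is the number of file system blocks per RAID chunk and the stripe width is the stride multiplied by the number of data-bearing disks.

```shell
#!/bin/sh
# Sketch: derive ext4 stripe geometry from RAID parameters.
# ext4_geometry is a hypothetical helper name. Its arguments are:
#   $1  RAID chunk size in KiB
#   $2  file system block size in bytes
#   $3  number of data-bearing disks (parity disks excluded)
ext4_geometry() {
    chunk_bytes=$(( $1 * 1024 ))
    stride=$(( chunk_bytes / $2 ))     # file system blocks per chunk
    stripe_width=$(( stride * $3 ))    # blocks per full stripe
    printf 'stride=%s,stripe_width=%s\n' "$stride" "$stripe_width"
}

# Example with assumed values: 64 KiB chunks, 4096-byte blocks,
# and 4 data disks.
ext4_geometry 64 4096 4
# prints: stride=16,stripe_width=64
```

The printed string has the form that mkfs.ext4 accepts for its -E extended options. Confirm the option names against the mkfs.ext4 man page on your system before formatting a device.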
8.1.3.2. Considerations at Mount Time
- Barriers
- File system barriers ensure that file system metadata is correctly written and ordered on persistent storage, and that data transmitted with fsync persists across a power outage. On previous versions of Red Hat Enterprise Linux, enabling file system barriers could significantly slow applications that relied heavily on fsync, or created and deleted many small files.
In Red Hat Enterprise Linux 7, file system barrier performance has been improved such that the performance effects of disabling file system barriers are negligible (less than 3%).
For further information, see the Red Hat Enterprise Linux 7 Storage Administration Guide.
- Access Time
- Every time a file is read, its metadata is updated with the time at which access occurred (atime). This involves additional write I/O. In most cases, this overhead is minimal, as by default Red Hat Enterprise Linux 7 updates the atime field only when the previous access time was older than the times of last modification (mtime) or status change (ctime).
However, if updating this metadata is time consuming, and if accurate access time data is not required, you can mount the file system with the noatime mount option. This disables updates to metadata when a file is read. It also enables nodiratime behavior, which disables updates to metadata when a directory is read.
- Read-ahead
- Read-ahead behavior speeds up file access by pre-fetching data that is likely to be needed soon and loading it into the page cache, where it can be retrieved more quickly than if it were on disk. The higher the read-ahead value, the further ahead the system pre-fetches data.
Red Hat Enterprise Linux attempts to set an appropriate read-ahead value based on what it detects about your file system. However, accurate detection is not always possible. For example, if a storage array presents itself to the system as a single LUN, the system detects the single LUN, and does not set the appropriate read-ahead value for an array.
Workloads that involve heavy streaming of sequential I/O often benefit from high read-ahead values. The storage-related tuned profiles provided with Red Hat Enterprise Linux 7 raise the read-ahead value, as does using LVM striping, but these adjustments are not always sufficient for all workloads.
The parameters that define read-ahead behavior vary with the file system; see the mount man page for details.
$ man mount
- Batch discard
- This type of discard is part of the fstrim command. It discards all unused blocks in a file system that match criteria specified by the administrator.
Red Hat Enterprise Linux 7 supports batch discard on XFS and ext4 formatted devices that support physical discard operations (that is, on HDD devices where the value of /sys/block/devname/queue/discard_max_bytes is not zero, and SSD devices where the value of
- Online discard
- This type of discard operation is configured at mount time with the discard option, and runs in real time without user intervention. However, online discard only discards blocks that are transitioning from used to free. Red Hat Enterprise Linux 7 supports online discard on XFS and ext4 formatted devices.
Red Hat recommends batch discard except where online discard is required to maintain performance, or where batch discard is not feasible for the system's workload.
- Pre-allocation
- Pre-allocation marks disk space as being allocated to a file without writing any data into that space. This can be useful in limiting data fragmentation and poor read performance. Red Hat Enterprise Linux 7 supports pre-allocating space on XFS, ext4, and GFS2 devices at mount time; see the mount man page for the appropriate parameter for your file system. Applications can also benefit from pre-allocating space by using the