One of the first decisions you will need to make is which I/O scheduler to use. This section provides an overview of each of the main schedulers to help you decide which is best for your workload.

6.4.1. Completely Fair Queuing (CFQ)

The default I/O scheduler in Red Hat Enterprise Linux 6. CFQ attempts to provide some fairness in I/O scheduling decisions based on the process which initiated the I/O. Three different scheduling classes are provided: real-time (RT), best-effort (BE), and idle. A scheduling class can be manually assigned to a process with the ionice command, or programmatically assigned via the ioprio_set system call. By default, processes are placed in the best-effort scheduling class. The real-time and best-effort scheduling classes are further subdivided into eight I/O priorities within each class, priority 0 being the highest and 7 the lowest. Processes in the real-time scheduling class are scheduled much more aggressively than processes in either best-effort or idle, so any scheduled real-time I/O is always performed before best-effort or idle I/O. This means that real-time priority I/O can starve out both the best-effort and idle classes. Best effort scheduling is the default scheduling class, and 4 is the default priority within this class. Processes in the idle scheduling class are only serviced when there is no other I/O pending in the system. Thus, it is very important to only set the I/O scheduling class of a process to idle if I/O from the process is not at all required for making forward progress.

CFQ provides fairness by assigning a time slice to each of the processes performing I/O. During its time slice, a process may have (by default) up to 8 requests in flight at a time. The scheduler tries to anticipate whether an application will issue more I/O in the near future based on historical data. If it is expected that a process will issue more I/O, then CFQ will idle, waiting for that I/O, even if there is I/O from other processes waiting to be issued.

Because of the idling performed by CFQ, it is often not a good fit for hardware that does not suffer from a large seek penalty, such as fast external storage arrays or solid state disks. If using CFQ on such storage is a requirement (for example, if you would also like to use the cgroup proportional weight I/O scheduler), you will need to tune some settings to improve CFQ performance. Set the following parameters in the files of the same name located in /sys/block/device/queue/iosched/:

slice_idle = 0
quantum = 64
group_idle = 1

When group_idle is set to 1, there is still the potential for I/O stalls (whereby the back-end storage is not busy due to idling). However, these stalls will be less frequent than idling on every queue in the system.

CFQ is a non-work-conserving I/O scheduler, which means it can be idle even when there are requests pending (as we discussed above). The stacking of non-work-conserving schedulers can introduce large latencies in the I/O path. An example of such stacking is using CFQ on top of a host-based hardware RAID controller. The RAID controller may implement its own non-work-conserving scheduler, thus causing delays at two levels in the stack. Non-work-conserving schedulers operate best when they have as much data as possible to base their decisions on. In the case of stacking such scheduling algorithms, the bottom-most scheduler will only see what the upper scheduler sends down. Thus, the lower layer will see an I/O pattern that is not at all representative of the actual workload.

Tunables

back_seek_max: Backward seeks are typically bad for performance, as they can incur greater delays in repositioning the heads than forward seeks do. However, CFQ will still perform them, if they are small enough. This tunable controls the maximum distance in KB the I/O scheduler will allow backward seeks. The default is 16 KB.
back_seek_penalty: Because of the inefficiency of backward seeks, a penalty is associated with each one. The penalty is a multiplier; for example, consider a disk head position at 1024KB. Assume there are two requests in the queue, one at 1008KB and another at 1040KB. The two requests are equidistant from the current head position. However, after applying the back seek penalty (default: 2), the request at the later position on disk is now twice as close as the earlier request. Thus, the head will move forward.
fifo_expire_async: This tunable controls how long an async (buffered write) request can go unserviced. After the expiration time (in milliseconds), a single starved async request will be moved to the dispatch list. The default is 250 ms.
fifo_expire_sync: This is the same as the fifo_expire_async tunable, for for synchronous (read and O_DIRECT write) requests. The default is 125 ms.
group_idle: When set, CFQ will idle on the last process issuing I/O in a cgroup. This should be set to 1 when using proportional weight I/O cgroups and setting slice_idle to 0 (typically done on fast storage).
group_isolation: If group isolation is enabled (set to 1), it provides a stronger isolation between groups at the expense of throughput. Generally speaking, if group isolation is disabled, fairness is provided for sequential workloads only. Enabling group isolation provides fairness for both sequential and random workloads. The default value is 0 (disabled). Refer to Documentation/cgroups/blkio-controller.txt for further information.
low_latency: When low latency is enabled (set to 1), CFQ attempts to provide a maximum wait time of 300 ms for each process issuing I/O on a device. This favors fairness over throughput. Disabling low latency (setting it to 0) ignores target latency, allowing each process in the system to get a full time slice. Low latency is enabled by default.
quantum: The quantum controls the number of I/Os that CFQ will send to the storage at a time, essentially limiting the device queue depth. By default, this is set to 8. The storage may support much deeper queue depths, but increasing quantum will also have a negative impact on latency, especially in the presence of large sequential write workloads.
slice_async: This tunable controls the time slice allotted to each process issuing asynchronous (buffered write) I/O. By default it is set to 40 ms.
slice_idle: This specifies how long CFQ should idle while waiting for further requests. The default value in Red Hat Enterprise Linux 6.1 and earlier is 8 ms. In Red Hat Enterprise Linux 6.2 and later, the default value is 0. The zero value improves the throughput of external RAID storage by removing all idling at the queue and service tree level. However, a zero value can degrade throughput on internal non-RAID storage, because it increases the overall number of seeks. For non-RAID storage, we recommend a slice_idle value that is greater than 0.
slice_sync: This tunable dictates the time slice allotted to a process issuing synchronous (read or direct write) I/O. The default is 100 ms.