Chapter 3. Subsystems and Tunable Parameters

Subsystems are kernel modules that are aware of cgroups. Typically, they are resource controllers that allocate varying levels of system resources to different cgroups. However, subsystems could be programmed for any other interaction with the kernel where the need exists to treat different groups of processes differently. The application programming interface (API) to develop new subsystems is documented in cgroups.txt in the kernel documentation, installed on your system at /usr/share/doc/kernel-doc-kernel-version/Documentation/cgroups/ (provided by the kernel-doc package). The latest version of the cgroups documentation is also available on line at http://www.kernel.org/doc/Documentation/cgroups/cgroups.txt. Note, however, that the features in the latest documentation might not match those available in the kernel installed on your system.
State objects that contain the subsystem parameters for a cgroup are represented as pseudofiles within the cgroup virtual file system. These pseudo-files can be manipulated by shell commands or their equivalent system calls. For example, cpuset.cpus is a pseudo-file that specifies which CPUs a cgroup is permitted to access. If /cgroup/cpuset/webserver is a cgroup for the web server that runs on a system, and the following command is executed:
~]# echo 0,2 > /cgroup/cpuset/webserver/cpuset.cpus
The value 0,2 is written to the cpuset.cpus pseudofile and therefore limits any tasks whose PIDs are listed in /cgroup/cpuset/webserver/tasks to use only CPU 0 and CPU 2 on the system.

3.1. blkio

The Block I/O (blkio) subsystem controls and monitors access to I/O on block devices by tasks in cgroups. Writing values to some of these pseudofiles limits access or bandwidth, and reading values from some of these pseudofiles provides information on I/O operations.
The blkio subsystem offers two policies for controlling access to I/O:
  • Proportional weight division — implemented in the Completely Fair Queuing I/O scheduler, this policy allows you to set weights to specific cgroups. This means that each cgroup has a set percentage (depending on the weight of the cgroup) of all I/O operations reserved. For more information, refer to Section 3.1.1, “Proportional Weight Division Tunable Parameters”
  • I/O throttling (Upper limit) — this policy is used to set an upper limit for the number of I/O operations performed by a specific device. This means that a device can have a limited rate of read or write operations. For more information, refer to Section 3.1.2, “I/O Throttling Tunable Parameters”

Buffered write operations

Currently, the Block I/O subsystem does not work for buffered write operations. It is primarily targeted at direct I/O, although it works for buffered read operations.

3.1.1. Proportional Weight Division Tunable Parameters

blkio.weight
specifies the relative proportion (weight) of block I/O access available by default to a cgroup, in the range 100 to 1000. This value is overridden for specific devices by the blkio.weight_device parameter. For example, to assign a default weight of 500 to a cgroup for access to block devices, run:
echo 500 > blkio.weight
blkio.weight_device
specifies the relative proportion (weight) of I/O access on specific devices available to a cgroup, in the range 100 to 1000. The value of this parameter overrides the value of the blkio.weight parameter for the devices specified. Values take the format major:minor weight, where major and minor are device types and node numbers specified in Linux Allocated Devices, otherwise known as the Linux Devices List and available from http://www.kernel.org/doc/Documentation/devices.txt. For example, to assign a weight of 500 to a cgroup for access to /dev/sda, run:
echo 8:0 500 > blkio.weight_device
In the Linux Allocated Devices notation, 8:0 represents /dev/sda.

3.1.2. I/O Throttling Tunable Parameters

blkio.throttle.read_bps_device
specifies the upper limit on the number of read operations a device can perform. The rate of the read operations is specified in bytes per second. Entries have three fields: major, minor, and bytes_per_second. Major and minor are device types and node numbers specified in Linux Allocated Devices, and bytes_per_second is the upper limit rate at which read operations can be performed. For example, to allow the /dev/sda device to perform read operations at a maximum of 10 MBps, run:
~]# echo "8:0 10485760" > /cgroup/blkio/test/blkio.throttle.read_bps_device
blkio.throttle.read_iops_device
specifies the upper limit on the number of read operations a device can perform. The rate of the read operations is specified in operations per second. Entries have three fields: major, minor, and operations_per_second. Major and minor are device types and node numbers specified in Linux Allocated Devices, and operations_per_second is the upper limit rate at which read operations can be performed. For example, to allow the /dev/sda device to perform a maximum of 10 read operations per second, run:
~]# echo "8:0 10" > /cgroup/blkio/test/blkio.throttle.read_iops_device
blkio.throttle.write_bps_device
specifies the upper limit on the number of write operations a device can perform. The rate of the write operations is specified in bytes per second. Entries have three fields: major, minor, and bytes_per_second. Major and minor are device types and node numbers specified in Linux Allocated Devices, and bytes_per_second is the upper limit rate at which write operations can be performed. For example, to allow the /dev/sda device to perform write operations at a maximum of 10 MBps, run:
~]# echo "8:0 10485760" > /cgroup/blkio/test/blkio.throttle.write_bps_device
blkio.throttle.write_iops_device
specifies the upper limit on the number of write operations a device can perform. The rate of the write operations is specified in operations per second. Entries have three fields: major, minor, and operations_per_second. Major and minor are device types and node numbers specified in Linux Allocated Devices, and operations_per_second is the upper limit rate at which write operations can be performed. For example, to allow the /dev/sda device to perform a maximum of 10 write operations per second, run:
~]# echo "8:0 10" > /cgroup/blkio/test/blkio.throttle.write_iops_device
blkio.throttle.io_serviced
reports the number of I/O operations performed on specific devices by a cgroup as seen by the throttling policy. Entries have four fields: major, minor, operation, and number. Major and minor are device types and node numbers specified in Linux Allocated Devices, operation represents the type of operation (read, write, sync, or async) and number represents the number of operations.
blkio.throttle.io_service_bytes
reports the number of bytes transferred to or from specific devices by a cgroup. The only difference between blkio.io_service_bytes and blkio.throttle.io_service_bytes is that the former is not updated when the CFQ scheduler is operating on a request queue. Entries have four fields: major, minor, operation, and bytes. Major and minor are device types and node numbers specified in Linux Allocated Devices, operation represents the type of operation (read, write, sync, or async) and bytes is the number of bytes transferred.

3.1.3. blkio Common Tunable Parameters

The following parameters may be used for either of the policies listed in Section 3.1, “blkio”.
blkio.reset_stats
resets the statistics recorded in the other pseudofiles. Write an integer to this file to reset the statistics for this cgroup.
blkio.time
reports the time that a cgroup had I/O access to specific devices. Entries have three fields: major, minor, and time. Major and minor are device types and node numbers specified in Linux Allocated Devices, and time is the length of time in milliseconds (ms).
blkio.sectors
reports the number of sectors transferred to or from specific devices by a cgroup. Entries have three fields: major, minor, and sectors. Major and minor are device types and node numbers specified in Linux Allocated Devices, and sectors is the number of disk sectors.
blkio.avg_queue_size
reports the average queue size for I/O operations by a cgroup, over the entire length of time of the group's existence. The queue size is sampled every time a queue for this cgroup gets a timeslice. Note that this report is available only if CONFIG_DEBUG_BLK_CGROUP=y is set on the system.
blkio.group_wait_time
reports the total time (in nanoseconds — ns) a cgroup spent waiting for a timeslice for one of its queues. The report is updated every time a queue for this cgroup gets a timeslice, so if you read this pseudofile while the cgroup is waiting for a timeslice, the report will not contain time spent waiting for the operation currently queued. Note that this report is available only if CONFIG_DEBUG_BLK_CGROUP=y is set on the system.
blkio.empty_time
reports the total time (in nanoseconds — ns) a cgroup spent without any pending requests. The report is updated every time a queue for this cgroup has a pending request, so if you read this pseudofile while the cgroup has no pending requests, the report will not contain time spent in the current empty state. Note that this report is available only if CONFIG_DEBUG_BLK_CGROUP=y is set on the system.
blkio.idle_time
reports the total time (in nanoseconds — ns) the scheduler spent idling for a cgroup in anticipation of a better request than those requests already in other queues or from other groups. The report is updated every time the group is no longer idling, so if you read this pseudofile while the cgroup is idling, the report will not contain time spent in the current idling state. Note that this report is available only if CONFIG_DEBUG_BLK_CGROUP=y is set on the system.
blkio.dequeue
reports the number of times requests for I/O operations by a cgroup were dequeued by specific devices. Entries have three fields: major, minor, and number. Major and minor are device types and node numbers specified in Linux Allocated Devices, and number is the number of requests the group was dequeued. Note that this report is available only if CONFIG_DEBUG_BLK_CGROUP=y is set on the system.
blkio.io_serviced
reports the number of I/O operations performed on specific devices by a cgroup as seen by the CFQ scheduler. Entries have four fields: major, minor, operation, and number. Major and minor are device types and node numbers specified in Linux Allocated Devices, operation represents the type of operation (read, write, sync, or async) and number represents the number of operations.
blkio.io_service_bytes
reports the number of bytes transferred to or from specific devices by a cgroup as seen by the CFQ scheduler. Entries have four fields: major, minor, operation, and bytes. Major and minor are device types and node numbers specified in Linux Allocated Devices, operation represents the type of operation (read, write, sync, or async) and bytes is the number of bytes transferred.
blkio.io_service_time
reports the total time between request dispatch and request completion for I/O operations on specific devices by a cgroup as seen by the CFQ scheduler. Entries have four fields: major, minor, operation, and time. Major and minor are device types and node numbers specified in Linux Allocated Devices, operation represents the type of operation (read, write, sync, or async) and time is the length of time in nanoseconds (ns). The time is reported in nanoseconds rather than a larger unit so that this report is meaningful even for solid-state devices.
blkio.io_wait_time
reports the total time I/O operations on specific devices by a cgroup spent waiting for service in the scheduler queues. When you interpret this report, note:
  • the time reported can be greater than the total time elapsed, because the time reported is the cumulative total of all I/O operations for the cgroup rather than the time that the cgroup itself spent waiting for I/O operations. To find the time that the group as a whole has spent waiting, use the blkio.group_wait_time parameter.
  • if the device has a queue_depth > 1, the time reported only includes the time until the request is dispatched to the device, not any time spent waiting for service while the device re-orders requests.
Entries have four fields: major, minor, operation, and time. Major and minor are device types and node numbers specified in Linux Allocated Devices, operation represents the type of operation (read, write, sync, or async) and time is the length of time in nanoseconds (ns). The time is reported in nanoseconds rather than a larger unit so that this report is meaningful even for solid-state devices.
blkio.io_merged
reports the number of BIOS requests merged into requests for I/O operations by a cgroup. Entries have two fields: number and operation. Number is the number of requests, and operation represents the type of operation (read, write, sync, or async).
blkio.io_queued
reports the number of requests queued for I/O operations by a cgroup. Entries have two fields: number and operation. Number is the number of requests, and operation represents the type of operation (read, write, sync, or async).

3.1.4. Example Usage

Refer to Example 3.1, “blkio proportional weight division” for a simple test of running two dd threads in two different cgroups with various blkio.weight values.

Example 3.1. blkio proportional weight division

  1. Mount the blkio subsystem:
    ~]# mount -t cgroup -o blkio blkio /cgroup/blkio/
  2. Create two cgroups for the blkio subsystem:
    ~]# mkdir /cgroup/blkio/test1/
    ~]# mkdir /cgroup/blkio/test2/
  3. Set blkio weights in the previously-created cgroups:
    ~]# echo 1000 > /cgroup/blkio/test1/blkio.weight
    ~]# echo 500 > /cgroup/blkio/test2/blkio.weight
  4. Create two large files:
    ~]# dd if=/dev/zero of=file_1 bs=1M count=4000
    ~]# dd if=/dev/zero of=file_2 bs=1M count=4000
    The above commands create two files (file_1 and file_2) of size 4 GB.
  5. For each of the test cgroups, execute a dd command (which reads the contents of a file and outputs it to the null device) on one of the large files:
    ~]# cgexec -g blkio:test1 time dd if=file_1 of=/dev/null
    ~]# cgexec -g blkio:test2 time dd if=file_2 of=/dev/null
    Both commands will output their completion time once they have finished.
  6. Simultaneously with the two running dd threads, you can monitor the performance in real time by using the iotop utility. To install the iotop utility, execute, as root, the yum install iotop command. The following is an example of the output as seen in the iotop utility while running the previously-started dd threads:
    Total DISK READ: 83.16 M/s | Total DISK WRITE: 0.00 B/s
        TIME  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
    15:18:04 15071 be/4 root       27.64 M/s    0.00 B/s  0.00 % 92.30 % dd if=file_2 of=/dev/null
    15:18:04 15069 be/4 root       55.52 M/s    0.00 B/s  0.00 % 88.48 % dd if=file_1 of=/dev/null
    

In order to get the most accurate result in Example 3.1, “blkio proportional weight division”, prior to the execution of the dd commands, flush all file system buffers and free pagecache, dentries and inodes using the following commands:
~]# sync
~]# echo 3 > /proc/sys/vm/drop_caches
Additionally, you can enable group isolation which provides stronger isolation between groups at the expense of throughput. When group isolation is disabled, fairness can be expected only for a sequential workload. By default, group isolation is enabled and fairness can be expected for random I/O workloads as well. To enable group isolation, use the following command:
~]# echo 1 > /sys/block/<disk_device>/queue/iosched/group_isolation
where <disk_device> stands for the name of the desired device, for example sda.