What is the suggested I/O scheduler to improve disk performance when using Red Hat Enterprise Linux with virtualization?


Environment

  • Red Hat Enterprise Linux (RHEL) 4, 5, 6 or 7
  • Virtualization, e.g. KVM, Xen, VMware or Microsoft Hyper-V
  • Virtualization guest or virtualization host
  • Virtual disk

Issue

  • What is the recommended I/O scheduler for Red Hat Enterprise Linux as a virtualization host?

Resolution

Red Hat Enterprise Linux as a virtualization host (Xen, KVM or VMware)

  • When using RHEL as a host for virtualized guests, the default cfq scheduler is usually ideal. This scheduler performs well on nearly all workloads.
  • If, however, minimizing I/O latency is more important than maximizing I/O throughput for the guest workloads, it may be beneficial to use the deadline scheduler. deadline is also the elevator used by the virtual-host tuned profile.
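  • For example, on hosts managed with tuned, the virtual-host profile can be applied and verified as follows (a minimal sketch, assuming the tuned packages are installed):

    # tuned-adm profile virtual-host
    # tuned-adm active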

Red Hat Enterprise Linux as a virtualization guest (any hypervisor technology)

  • RHEL guests often benefit greatly from the noop I/O scheduler, which allows the host/hypervisor to optimize the I/O requests and prioritize them based on incoming guest load. The noop scheduler can still merge small requests from the guest OS into larger requests before handing the I/O to the hypervisor, but otherwise it aims to spend as few CPU cycles as possible on I/O scheduling in the guest. The host/hypervisor has an overview of the requests of all guests and applies its own strategy for handling I/O.
  • Depending on the disk presentation (virtual disk vs. RDM) and the I/O workload, a scheduler such as deadline can be more advantageous. Performance testing is required to verify which scheduler works best for a given configuration.
  • Guests using storage accessed by iSCSI, SR-IOV or physical device pass-through should not use the noop scheduler, since these methods do not allow the host to optimize I/O requests to the underlying physical device.
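  • To see at a glance which scheduler each block device inside a guest is currently using, the sysfs entries can be listed as a quick check (device names and the set of available schedulers vary by release and disk type):

    # grep . /sys/block/*/queue/scheduler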

Root Cause

  • In virtualized environments, it is not advantageous to schedule I/O at both the guest and hypervisor layers. If multiple guests use storage on a filesystem or block device managed by the host operating system, the host will likely schedule I/O more efficiently because it is aware of requests from all guests and knows the physical layout of storage, which may not map linearly to the guests' virtual storage. On the other hand, depending on the workload, it can also be beneficial to use a scheduler like deadline in the guest.

  • All scheduler tuning should be tested under normal operating conditions, as synthetic benchmarks typically do not accurately compare performance of systems using shared resources in virtual environments.

Configuring the I/O scheduler on Red Hat Enterprise Linux 4, 5 and 6

  • The I/O scheduler can be selected at boot time using the elevator kernel parameter. In the following example grub.conf stanza, the system has been configured to use the noop scheduler:

    # cat /boot/grub/grub.conf
    [...]
    title Red Hat Enterprise Linux Server (2.6.18-8.el5)
    root (hd0,0)
    kernel /vmlinuz-2.6.18-8.el5 ro root=/dev/vg0/lv0 elevator=noop
    initrd /initrd-2.6.18-8.el5.img
    
  • The default scheduler in Red Hat Enterprise Linux 4, 5 and 6 is cfq. The available tuned profiles use the deadline elevator. A custom tuned profile can also be used to specify the elevator; more information on creating a custom tuned profile can be found in solution 1305833.
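  • On Red Hat Enterprise Linux 5 and 6, the same boot parameter can typically also be added without editing grub.conf by hand, using grubby (a sketch; verify the resulting grub.conf afterwards):

    # grubby --update-kernel=ALL --args="elevator=noop"
    # grep elevator /boot/grub/grub.conf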

Configuring the I/O scheduler on Red Hat Enterprise Linux 7

  • The default scheduler in Red Hat Enterprise Linux 7 is now deadline.
  • To make the change persistent across reboots, add elevator=noop to GRUB_CMDLINE_LINUX in /etc/default/grub, as shown below.

    # cat /etc/default/grub
    [...]
    GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=vg00/lvroot rhgb quiet elevator=noop"
    [...]
    
    • After the entry has been created/updated, rebuild the /boot/grub2/grub.cfg file to include the new configuration with the added parameter:
      • On BIOS-based machines: ~]# grub2-mkconfig -o /boot/grub2/grub.cfg
      • On UEFI-based machines: ~]# grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
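    • After rebooting, it can be verified that the parameter is active on the running kernel command line, for example:

      ~]# cat /proc/cmdline
      [...] rhgb quiet elevator=noop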
  • Another way to change the default I/O scheduler is to use tuned.

  • More information on creating a custom tuned profile can be found in solution 1305833.
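  • As a sketch of that approach on RHEL 7, a custom profile (the name virtual-guest-noop below is only an example) can inherit an existing profile and set the elevator through the tuned disk plugin:

    # mkdir /etc/tuned/virtual-guest-noop
    # cat /etc/tuned/virtual-guest-noop/tuned.conf
    [main]
    include=virtual-guest
    [disk]
    elevator=noop
    # tuned-adm profile virtual-guest-noop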

Configuring the I/O scheduler at runtime on Red Hat Enterprise Linux

  • In Red Hat Enterprise Linux 5, 6 or 7 it is also possible to change the I/O scheduler for a particular disk after the system has been booted. This makes it possible to use different I/O schedulers for different disks.

    # cat /sys/block/hda/queue/scheduler
    noop anticipatory deadline [cfq]
    
    # echo 'noop' > /sys/block/hda/queue/scheduler
    # cat /sys/block/hda/queue/scheduler
    [noop] anticipatory deadline cfq
    
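  • To apply the same scheduler to several virtual disks at runtime, a small loop can be used (a sketch; adjust the device pattern, for example vd* for virtio disks, to match your environment). Note that runtime changes do not persist across a reboot; use the elevator= parameter or tuned as described above for persistence.

    # for f in /sys/block/vd*/queue/scheduler; do echo 'noop' > "$f"; done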


12 Comments

We've done lots of testing on this with RHEL 5 and found cfq to be best for the guest in an ESXi env.

If I'm not mistaken, there is a race condition when running RHEL 4 with elevator=noop in VMware. Use deadline or cfq.

Typo: one does not edit /etc/grub2.cfg. See /etc/sysconfig/grub.

/etc/grub2.cfg should not be edited; the correct way is to edit /etc/default/grub and then run grub2-mkconfig.

On UEFI-based machines, one must execute grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg instead of grub2-mkconfig -o /boot/grub2/grub.cfg.

For more info, see: Customizing the GRUB 2 Configuration File.

Another option would be to specify the scheduler depending on the disk in use.
Create a new rules file -> sudo nano /etc/udev/rules.d/60-schedulers.rules

Put the following content into the empty file, save it, and reboot the system afterwards:

# set cfq scheduler for rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="cfq"

# set deadline scheduler for non-rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="deadline"

The rule does not differentiate between virtual disks and pass-through disks (for example enterprise SAN storage). So my assumption is that the above is intended for virtual disks only.

Hi Dwight, this method works for physical disks as well - I have tested it on various Linux distributions.
When you are running a RHEL host with KVM virtualization on an SSD, the deadline scheduler will be in use.

Pass-through SAN-based disks are typically marked rotational, but because they are nominally backed by multiple physical disks, deadline is a better scheduler than cfq for those disks. The above rules will set cfq for rotational SAN disks, which is often sub-optimal.

Hi Dwight, thanks for the information - I only wanted to provide an additional option for tweaking the scheduler.
If you want to use the deadline scheduler for rotational disks, you can of course do that by changing the rule:
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="deadline"

If the noop and deadline schedulers disable OS I/O scheduling completely, the ionice command and the automatic I/O prioritization in the OS no longer work.
Since the VMware host has no way of prioritizing I/O between individual processes in the guest OS, there must definitely be times when disabling OS scheduling is a bad idea.
For example, when low-performing or very busy storage is used for the guest and you have a low-priority backup process and a real-time scheduled production application/process.

"If the schedulers noop and deadline disables the OS I/O-scheduling completely ... the automatic I/O-prioritization in the OS does not work anymore"

Clarification is needed for the above. noop and deadline do not disable I/O scheduling completely. They just don't allow significant tuning changes to how the scheduling (queueing) is performed. For example, deadline has two queues, read and write, but each queue is accessed via two methods, location-sorted and time-sorted: so two inputs and four outputs. That's all it's got, and it cannot be tuned further in terms of what is in the queues. It can be tuned to bias the draining of I/O out of the queues more towards reads or writes depending on how the tunables are set -- that is, how many I/Os are stripped from each queue and interleaved/sent to storage, or how long a "deadline" period is used before switching between location-ordered stripping of I/O from the queues to disk and time-ordered draining of I/O from the queues.
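
For reference, those deadline tunables are exposed per device under /sys/block/<device>/queue/iosched/ and can be listed like this (sda is only an example device name):

grep . /sys/block/sda/queue/iosched/*

The entries include read_expire and write_expire (the "deadline" periods), fifo_batch (how many requests are dispatched per batch), writes_starved and front_merges.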

cfq has some additional automatic sorting of I/O (not really "scheduling") but allows ionice and cgroup manipulation of processes' I/O into different queues, and these queues come in sets of idle, best-effort and real-time... and are then further broken down into queues of different priorities within those queue types.
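
To illustrate the cfq side, ionice can place a process's I/O into one of those classes (the command, priority value and PID below are only examples):

ionice -c3 tar czf /tmp/home-backup.tar.gz /home   # idle class, e.g. for a backup job
ionice -c2 -n7 -p 1234                             # best-effort class, lowest priority, for an existing PID

Note that these classes only take effect while cfq is the active scheduler on the disks involved.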

"there must definitely be times when disabling OS scheduling is a bad idea."

Yes there can be, hence the wiggle room with the descriptions of each io scheduler:

  • "...cfq scheduler is usually ideal" which conversely means there will be circumstances where it is not ideal but in general is a reasonable starting point for tuning.
  • "...guests often benefit greatly from the noop I/O scheduler", which conversely measn there will be circumstances where noop in the application environment isn't a benefit.
  • "...schedulers like deadline can be more advantageous", not shall be but can be for certain storage environments and application io loads, especially if those loads are read latency sensitive and less sensitive to and have a major component of buffered writes associated with the application.

This article is about the majority of cases and a reasonable starting point for further tuning. It doesn't cover cases like RT guests, or where there is a significant need for tuning/assigning priorities of individual processes... that is just way too specific to niche configurations. Also see the caveats above, specifically about not applying the same scheduling tuning to paravirt or pass-through devices as to virtual disks.

Tuning outside of the general recommendations tends to address corner cases where the physical resources of the hypervisor being spread over a number of guests are well separated and/or well bounded in terms of the potential collective I/O load from all guests. But if using virtual disk I/O, then a single guest's scheduling of I/O is done in a vacuum, and it all trickles downhill into the hypervisor to be mixed in with other VM guests' I/O going to the same underlying disks - setting up one guest to use RT queue scheduling means nothing to the hypervisor when it blends that I/O in with other guests' I/O. When tuning one VM guest you cannot know the definitive steady-state impact on performance, since you cannot know what I/O load is being presented by the other VM guests at any given moment. Basically it's trial and error, with noop or deadline generally being nominally best over a wide range of VM guests and application loads. If you know the underlying physical storage is only being used by one guest's virtual disks, then you can almost treat it as physical pass-through storage within the guest in terms of tuning... but most of the time you will not know whether you have dedicated physical disks behind the virtual disks of the current VM guest. And if you did, it would be best to present them as pass-through resources to the guest in the first place.

"for example have a low priority backup-process and a real-time scheduled production application/process."

You have to be very careful doing things like that, as you can end up with a hung or really poorly performing system due to priority inversion. For example, the backup process locks a directory on disk as it processes files so things can't change while it is grabbing files. While it holds that filesystem lock, higher-priority RT I/O to the same filesystem is going on, which blocks the backup I/O from being processed (while the backup is holding that fs lock), to the point that the RT process trips over the filesystem lock the low-priority backup process is holding. In truth, I've seen applications tuned with RT to the point that the filesystem's metadata transactions can't be flushed, resulting in some parts of the RT application environment locking up (while other parts continue to read/write data to other parts of the filesystem or to other partitions on the disk), and it ends up locking out the filesystem's metadata transactions to the detriment of all other processes... eventually.

Tuning isn't for the faint of heart.