What is the suggested I/O scheduler to improve disk performance when using Red Hat Enterprise Linux with virtualization?

Environment

  • Red Hat Enterprise Linux (RHEL) 4, 5, 6 or 7
  • Virtualization, e.g. KVM, Xen, VMware or Microsoft Hyper-V
  • Virtualization guest or virtualization host
  • Virtual disk

Issue

  • What is the recommended I/O scheduler for Red Hat Enterprise Linux as a virtualization host or as a virtualization guest?

Resolution

Red Hat Enterprise Linux as a virtualization host (Xen, KVM or VMware)

  • When using RHEL as a host for virtualized guests, the default cfq scheduler is usually ideal. This scheduler performs well on nearly all workloads.
  • If, however, minimizing I/O latency is more important than maximizing I/O throughput on the guest workloads, it may be beneficial to use the deadline scheduler. deadline is also the scheduler used by the tuned profile virtual-host.

Red Hat Enterprise Linux as a virtualization guest (any hypervisor technology)

  • RHEL guests often benefit greatly from the noop I/O scheduler, which allows the host/hypervisor to optimize the I/O requests and prioritize based on incoming guest load. The noop scheduler can still combine small requests from the guest OS into larger requests before handing the I/O to the hypervisor, but it is designed to spend as few CPU cycles as possible in the guest on I/O scheduling. The host/hypervisor has an overview of the requests of all guests and applies its own strategy for handling I/O.
  • Depending on the disk presentation (Virtual Disk vs Raw Device Mapping (RDM)) and I/O workload, schedulers like deadline can be more advantageous. Performance testing is required to verify which scheduler is the most advantageous.
  • Guests using storage accessed by iSCSI, SR-IOV or physical device pass-through should not use the noop scheduler, since these methods do not allow the host to optimize I/O requests to the underlying physical device.
  • The scheduler used by default in new RHEL installations in virtual guests has changed over time: since RHEL 7.5 the deadline scheduler is used, while older versions use none. This should not be read as a hint that deadline performs better than noop in average environments: as explained in this document, the specifics of the environment and workload matter, and testing should be done.

Root Cause

  • In virtualized environments, it is not advantageous to schedule I/O at both the guest and hypervisor layers. If multiple guests use storage on a filesystem or block device managed by the host operating system, the host will likely schedule I/O more efficiently because it is aware of requests from all guests and knows the physical layout of storage, which may not map linearly to the guests' virtual storage. On the other hand, depending on the workload, it can also be beneficial to use a scheduler like deadline in the guest.

  • All scheduler tuning should be tested under normal operating conditions, as synthetic benchmarks typically do not accurately compare performance of systems using shared resources in virtual environments.

Configuring the I/O scheduler on Red Hat Enterprise Linux 4, 5 and 6

  • The I/O scheduler can be selected at boot time using the elevator kernel parameter. In the following example grub.conf stanza, the system has been configured to use the noop scheduler:

    # cat /boot/grub/grub.conf
    [...]
    title Red Hat Enterprise Linux Server (2.6.18-8.el5)
    root (hd0,0)
    kernel /vmlinuz-2.6.18-8.el5 ro root=/dev/vg0/lv0 elevator=noop
    initrd /initrd-2.6.18-8.el5.img
    
  • The default scheduler in Red Hat Enterprise Linux 4, 5 and 6 is CFQ. The available tuned profiles use the deadline elevator. A custom tuned profile can also be used to specify the elevator. More information on creating a custom tuned profile can be found in solution 1305833.
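
After rebooting with the elevator parameter in place, it can be worth confirming that the kernel actually received it. The following is a minimal sketch, not part of the original solution: the function name elevator_from_cmdline is hypothetical, and it only inspects the boot parameters, so it does not prove which scheduler a particular disk is using.

```shell
#!/bin/sh
# Print the value of the elevator= kernel parameter found in a command line
# string, or "unset" if the parameter is absent. If the parameter appears
# more than once, the last occurrence is reported.
# (elevator_from_cmdline is a hypothetical helper name.)
elevator_from_cmdline() {
    result=unset
    for arg in $1; do
        case "$arg" in
            elevator=*) result=${arg#elevator=} ;;
        esac
    done
    printf '%s\n' "$result"
}

# On a running Linux system, inspect the parameters the kernel booted with.
if [ -r /proc/cmdline ]; then
    elevator_from_cmdline "$(cat /proc/cmdline)"
fi
```

If this prints unset, the boot loader entry that was edited is probably not the one that was booted.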

Configuring the I/O scheduler on Red Hat Enterprise Linux 7

  • The default scheduler in later Red Hat Enterprise Linux 7 versions (7.5 and later) is deadline.
  • To make the change persistent across reboots, add elevator=noop to GRUB_CMDLINE_LINUX in /etc/default/grub as shown below.

    # cat /etc/default/grub
    [...]
    GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=vg00/lvroot rhgb quiet elevator=noop"
    [...]
    
    • After the entry has been created or updated, rebuild the GRUB 2 configuration file so that it includes the added parameter:
      • On BIOS-based machines: # grub2-mkconfig -o /boot/grub2/grub.cfg
      • On UEFI-based machines: # grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
  • Another way to change the default I/O scheduler is to use tuned.

  • More information on creating a custom tuned profile can be found in solution 1305833.
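
As a rough illustration of the tuned approach on RHEL 7, the sketch below writes a small custom profile that inherits from virtual-guest and sets the elevator through the disk plugin. The profile name virtual-guest-noop and the function write_tuned_profile are hypothetical, and the demonstration writes into a scratch directory rather than /etc/tuned. Treat solution 1305833 as the authoritative description of custom profiles.

```shell
#!/bin/sh
# Write a minimal custom tuned profile into the given directory.
# $1: profile directory (on a real system: /etc/tuned/<profile-name>)
# $2: I/O scheduler to select, e.g. noop or deadline
# (write_tuned_profile is a hypothetical helper name.)
write_tuned_profile() {
    mkdir -p "$1" || return 1
    cat > "$1/tuned.conf" <<EOF
[main]
include=virtual-guest

[disk]
elevator=$2
EOF
}

# Demonstration in a scratch directory, so nothing under /etc is touched.
demo_dir=$(mktemp -d)
write_tuned_profile "$demo_dir/virtual-guest-noop" noop
cat "$demo_dir/virtual-guest-noop/tuned.conf"
```

On a real system the profile directory would be /etc/tuned/virtual-guest-noop, created as root, and the profile would then be activated with tuned-adm profile virtual-guest-noop.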

Configuring the I/O scheduler online on Red Hat Enterprise Linux

  • In Red Hat Enterprise Linux 5, 6 or 7 it is also possible to change the I/O scheduler for a particular disk after the system has been booted. This makes it possible to use different I/O schedulers for different disks.

    # cat /sys/block/hda/queue/scheduler
    noop anticipatory deadline [cfq]
    
    # echo 'noop' > /sys/block/hda/queue/scheduler
    # cat /sys/block/hda/queue/scheduler
    [noop] anticipatory deadline cfq
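
When a system has many disks, checking each scheduler file by hand becomes tedious. The sketch below loops over all block devices and prints the active scheduler, which is the bracketed entry in the file. The function name active_scheduler is hypothetical. Files that contain no bracketed entry (for example a bare none) are printed unchanged.

```shell
#!/bin/sh
# Extract the active scheduler (the bracketed entry) from the contents of a
# queue/scheduler file, e.g. "noop anticipatory deadline [cfq]" -> "cfq".
# Contents without brackets are returned as they are.
# (active_scheduler is a hypothetical helper name.)
active_scheduler() {
    case "$1" in
        *\[*\]*) printf '%s\n' "$1" | sed 's/.*\[\([^]]*\)\].*/\1/' ;;
        *) printf '%s\n' "$1" ;;
    esac
}

# Report the active scheduler of every block device that exposes one.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    printf '%s: %s\n' "$dev" "$(active_scheduler "$(cat "$f")")"
done
```

Reading these files does not require root. Writing a new scheduler name, as in the echo command above, does.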
    

Testing

In this document, we refer to testing and comparing multiple schedulers. Some hints:

  • All scheduler tuning should be tested under normal operating conditions, as synthetic benchmarks typically do not accurately compare performance of systems using shared resources in virtual environments. Recommendations and defaults are only a place to start.
    • Outside of some specific corner cases, the typical change in performance when comparing different schedulers is in the +/- 5% range. It is very unusual, even in corner cases like all-sequential reads for video streaming, to see more than a 10-20% improvement in I/O performance from a scheduler change alone. A 5-10x improvement from finding the right scheduler is therefore very unlikely.
  • One should first be clear about the goal or goals to optimize for. Do I want as many I/O operations as possible to reach storage? Do I want to optimize an application to provide service in a certain way, for example "this apache webserver should be able to hand out as many static files [fetched from storage] as possible"?
  • With the goal clear, one can decide on the best tool to measure it. Applications can then be started and measured. Keeping all other conditions unchanged, several schedulers can then be tried out and the measurements compared.
  • Special attention should be paid to the mutual influence of the components. A RHEL host might run 10 KVM guests, each of them running various applications. Benchmarking should consider this whole system.
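
The procedure described in these hints can be sketched as a small loop: switch the scheduler, run the same workload, record the elapsed time, and restore the original setting at the end. This is only an illustration under stated assumptions. The function name compare_schedulers is hypothetical, wall-clock seconds are a crude metric, and the workload should be a representative one rather than the placeholder used in the demonstration.

```shell
#!/bin/sh
# Run a workload once per available scheduler and print the elapsed seconds.
# $1: path to a queue/scheduler file (e.g. /sys/block/sda/queue/scheduler)
# remaining arguments: the workload command to time
# (compare_schedulers is a hypothetical helper name.)
compare_schedulers() {
    sched_file=$1
    shift
    contents=$(cat "$sched_file") || return 1
    # Remember the active (bracketed) scheduler so it can be restored.
    original=$(printf '%s\n' "$contents" | sed -n 's/.*\[\([^]]*\)\].*/\1/p')
    for sched in $(printf '%s\n' "$contents" | tr -d '[]'); do
        printf '%s\n' "$sched" > "$sched_file" || continue
        start=$(date +%s)
        "$@" > /dev/null 2>&1
        end=$(date +%s)
        printf '%s: %s s\n' "$sched" "$((end - start))"
    done
    if [ -n "$original" ]; then
        printf '%s\n' "$original" > "$sched_file"
    fi
}

# Demonstration against a scratch file that stands in for sysfs, so the
# sketch can be tried without root. On a real guest, pass the device's
# queue/scheduler path and a representative workload instead.
demo=$(mktemp)
printf 'noop deadline [cfq]\n' > "$demo"
compare_schedulers "$demo" sleep 1
rm -f "$demo"
```

A single run per scheduler is rarely enough. Repeating each measurement several times, under the normal load of the other guests, gives a better picture than one pass.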

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

27 Comments

We've done lots of testing on this with RHEL 5 and found cfq to be best for the guest in an ESXi env.

If I'm not mistaken, there is a race condition when running RHEL 4 with elevator=NOOP in VMware.  Use deadline or cfq.

Typo: one does not edit /etc/grub2.cfg. See /etc/sysconfig/grub.

/etc/grub2.cfg should not be edited; the correct way is to edit /etc/default/grub and then run grub2-mkconfig.

On UEFI-based machines, one must execute grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg instead of grub2-mkconfig -o /boot/grub2/grub.cfg.

For more info, see: Customizing the GRUB 2 Configuration File.

Another option would be to specify the scheduler depending on the disk being in use.
Create this new rule -> sudo nano /etc/udev/rules.d/60-schedulers.rules

Put the following content into the empty file (save it - reboot the system afterwards) :

# set cfq scheduler for rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="cfq"

# set deadline scheduler for non-rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="deadline"

The rule does not differentiate between the virtual disks and pass-through (for example enterprise san storage). So my assumption is the above is for virtual disks only.

Hi Dwight, this method is working for physical disks as well - I have tested it on various Linux distributions.
When you are running a RHEL host with KVM virtualization on a SSD, the deadline scheduler will be in use.

Pass thru san based disks are typically marked rotational, but because they are nominally backed by multiple physical disks, deadline is a better scheduler than cfq for those disks. The above rules will set cfq for rotational based san disks, which is often sub-optimal.

Hi Dwight, thanks for the information - I only wanted to provide the additional option to tweak the scheduler.
If you want to use the deadline scheduler for rotational disks, you of course can do this by changing the rule :
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="deadline"

If the schedulers noop and deadline disables the OS I/O-scheduling completely, the ionice command and the automatic I/O-prioritization in the OS does not work anymore.
Since the VMware host has no way of prioritizing I/O between individual processes in the guest OS, there must definitely be times when disabling OS scheduling is a bad idea.
For example when a low performing or very busy storage is used for the guest, and you for example have a low priority backup-process and a real-time scheduled production application/process.

"If the schedulers noop and deadline disables the OS I/O-scheduling completely ... the automatic I/O-prioritization in the OS does not work anymore"

Clarification is needed for the above. noop and deadline do not disable io scheduling completely. They just don't allow significant tuning changes on how the scheduling (queueing) is performed. For example deadline has two queues: read and write, but the two queues are each accessed via two methods, location sorted and time sorted: so two inputs and 4 outputs. That's all it's got and it cannot be tuned further in terms of what is in the queue. It can be tuned to bias the draining of io out of the queues more for reads or writes depending on how the tunables are set -- that is how many io are stripped from each and interleaved/sent to storage or how long a "deadline" period is used before switching between location ordered stripping of io from queues to disk vs time ordered draining of io from queues.
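
To see which tunables the active scheduler exposes, the queue/iosched directory of each device can be listed. The sketch below is an illustration only, and the function name show_iosched_tunables is hypothetical. For deadline on kernels of this era the directory is expected to contain entries such as read_expire, write_expire, fifo_batch, writes_starved and front_merges, but the exact set depends on the kernel and scheduler in use.

```shell
#!/bin/sh
# Print every tunable in an iosched directory as name=value.
# $1: directory such as /sys/block/sda/queue/iosched
# (show_iosched_tunables is a hypothetical helper name.)
show_iosched_tunables() {
    for t in "$1"/*; do
        [ -f "$t" ] && [ -r "$t" ] || continue
        printf '%s=%s\n' "${t##*/}" "$(cat "$t")"
    done
}

# Show the tunables of every device that exposes an iosched directory.
for d in /sys/block/*/queue/iosched; do
    [ -d "$d" ] || continue
    printf '== %s ==\n' "$d"
    show_iosched_tunables "$d"
done
```

Devices whose scheduler has no tunables may not have an iosched directory at all, or it may be empty, in which case nothing is printed for them.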

The cfq has some additional automatic sorting of io (not really "scheduling") but allows ionice and cGroup manipulation of processes io into different queues and these queues are in sets of idle, best effort, real-time... and then further broken down into queues of different priorities within those queue types.

"there must definitely be times when disabling OS scheduling is a bad idea."

Yes there can be, hence the wiggle room with the descriptions of each io scheduler:

  • "...cfq scheduler is usually ideal" which conversely means there will be circumstances where it is not ideal but in general is a reasonable starting point for tuning.
  • "...guests often benefit greatly from the noop I/O scheduler", which conversely means there will be circumstances where noop in the application environment isn't a benefit.
  • "...schedulers like deadline can be more advantageous", not shall be but can be for certain storage environments and application io loads, especially if those loads are read latency sensitive and less sensitive to and have a major component of buffered writes associated with the application.

This article is about the majority of cases and a reasonable starting point for further tuning. It doesn't cover cases for things like RT guests or where there is significant need to tuning/assigning priorities of individual processes... that is just way too specific to niche configurations. Also see the caveats in above, specifically about not treating paravirt or pass through devices to the same scheduling tuning as virtual disks.

Tuning outside of the general recommendations tends to be for corner cases where the physical resources of the hypervisor, spread out over a number of guests, are well separated and/or well bounded in terms of potential collective io load from all guests. But if using virtual disk io, then a single guest's scheduling of io is done in a vacuum and when it all trickles downhill into the hypervisor to be mixed in with other vm guests' io going to the same underlying disks - setting up one guest to use RT queue scheduling means nothing to the hypervisor when it blends that io in with other guests' io. Trying to tune one vm guest cannot know the definitive steady state impact to performance since it cannot know what io load is being presented from other vm guests at any given moment. Basically it's trial and error with generally noop or deadline being nominally best over a wide range of vm guests and application loads. If you know the underlying physical storage is only being used by one guest's virtual disks, then you can almost treat it as physical pass-thru storage within the guest in terms of tuning... but most of the time you may not know you have dedicated physical disks behind virtual disks just for the current vm guest. And if you did, it would be best to present them as pass-thru resources to the guest in the first place.

"for example have a low priority backup-process and a real-time scheduled production application/process."

You have to be very careful doing things like that as you can end up in a hung or really poorly performing system due to priority inversion. For example the backup process locks a directory on disk as it processes files so things can't change while grabbing files. While it has that filesystem lock, higher-priority RT io to the same filesystem is going on which blocks the backup io from being processed (while backup is holding that fs lock) to the point that the RT process trips across the filesystem lock the low priority backup process is holding. In truth I've seen tuned applications using RT to the point that the filesystem's metadata transactions can't be flushed resulting in some parts of the RT application environment locking up (while other parts continue to read/write data to other parts of the filesystem or to other partitions on the disk) and it ends up locking out the filesystem's metadata transactions to the detriment of all other processes... eventually.

Tuning isn't for the faint of heart.

If the noop scheduler is usually the better choice for virtual guests, why isn't it the rhel default on those systems?

I think our internal discussions did not end with a clear winner. The article currently mentions both noop and deadline. Having a view over the workloads/environments our customers run would be required, and then an investigation whether noop or deadline would be better.

If you have hints on relevant statistics which help to point out that noop would be the better default, then that could be a base to rethink the defaults. That would work best in requesting an RFE via a customer center case.

This article reads like noop is the first choice for VMs and deadline is only better for certain workloads, that's why I asked the question :) if your internal discussions do not support this I would suggest to make that more clear in this article.

+1

I totally agree with @Klaas: Saying one thing and doing another just creates unnecessary confusion and possibly extra work for nothing.

Klaas, Jesper, I looked a bit deeper. RHEL 7.5 and later use deadline in guests by default, RHEL7.4 and earlier do not explicitly set one, they have 'none'.

The Virt tuning guide is also touching the topic, with quite similar wording to this kbase.

I see no clear recommendation towards noop or deadline in both sources, but 'noop' is mentioned before 'deadline'. The virt-tuning-guide also mentions that rhel7 defaults to deadline.

I did not see something as clearly as "noop scheduler is usually the better choice", such a clear recommendation would indeed be in contrast with the current defaults.

I think things could be clearer though. Let me start a mail thread with Dwight, and the authors of the virt-tuning-guide. Thank you for bringing this to our attention.

From this article: "RHEL guests often benefit greatly from the noop I/O scheduler, which allows the host/hypervisor to optimize the I/O requests and prioritize based on incoming guest load. The noop scheduler can still combine small requests from the guest OS into larger requests before handing the I/O to the hypervisor, however noop follows the idea to spend as few CPU cycles as possible in the guest for I/O scheduling. The host/hypervisor will have an overview of the requests of all guests and have a separate strategy for handling I/O." this does (at least to me) read as if noop is usually (to quote "often") the better choice :)

I am reading this solution in the same manner as Klaas. Unless noop is actually the preferred choice, please change the wording on this page.

Happy new year, all.

We had great threads on the topic internally, down to applications where the vendor would recommend schedulers, with comments regarding decision trees for deciding on the best scheduler for a given environment/workload. The common line is that the current default of deadline in guests has been in place since RHEL 7.5 GA, and was not deliberately implemented for virtual guests, but was more a side effect of other changes. So all agree that testing a given environment/workload with noop and deadline is the best recommendation. I hope with the recent kbase modification, this becomes more clear.

appreciate taking this up internally and trying to make it clearer. But unfortunately, I continue to find it confusing or unclear.

a question: are "none" and "noop" the same?

Thanks

'none' is not the same as 'noop'. 'noop' does not sort requests, but it does merge I/O requests. 'none' does not even merge requests. So theoretically there is a difference between none/noop and one might want to test both with typical workloads, but these 2 will probably perform quite similarly to each other. 'deadline' etc. are more likely to result in different performance. Meaning, if I only had time to compare 2 schedulers in my guests, I would rather take deadline/noop or deadline/none, than noop/none.

Thanks for the effort.

But I believe it can still be improved.

Maybe a couple of tidy tables could 1) show the defaults, names and function for every OS, and 2) show pros and cons (recommendations) for every one ("indications/contraindications", if using medical nomenclature).

I read this article as the OS-default is not recommended. Can't we trust the engineers to set the default correctly for most use cases?

So I guess I wish you would/could recommend the OS-default. You could still add that depending on your environment, you may improve performance by tuning those parameters, but that it demands testing.

And regarding the recommended testing: It would be nice if you gave some clues to how. At some level it does become impossible, since everything is dynamic. All the guests share resources with each other, and performance will have to be quantified on both the applications, guests, hosts and storage.

Can't we trust the engineers to set the default correctly for most use cases? Looking at how the defaults changed even inside of major releases, I started to suspect that the default just changed as a side effect of other changes.

Very true for testing. Imagine one database application on a system (and one scheduler optimized for that), and a different application where a different scheduler should do best. I'll start a section on testing.

I would also like to ask if there is a more scientific way to conclude which works best for a given workload. By "scientific" I mean, can I look at some stats/data and then based on the pattern, can I conclude which scheduler is best suited for the workload?

Otherwise, I always have to try all 3 schedulers with each different workload. Instead, let's say I start with the default (deadline), gather data, identify the pattern and then conclude that the current scheduler (deadline) is best or I need to go to CFQ. Then I can at least skip testing with noop.

Thanks

The available schedulers have descriptions which give an idea where they might perform best, but that is often vague, and when one runs various applications (with different I/O characteristics) on a single system, you again will just guess. And then there are even workloads a scheduler is said to be optimized for, where it ends up performing worse than schedulers actually optimized for other workloads. "Just" trying out the workload with various schedulers is still the best idea. It also forces one to think "what are my resources, my bottlenecks, have I looked into performance tuning of the system already, and what do I actually want to optimize for".

So where is RHEL 8.x in all this? The booted scheduler options appear much different than for RHEL 7.x:

# cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none