What's the meaning of `Disk IO Utilisation` and `Disk IO Saturation` in Grafana dashboard?

Solution Verified - Updated -

Environment

  • OpenShift Container Platform

Issue

  • What's the meaning of Disk IO Utilisation and Disk IO Saturation in Grafana dashboard?
  • We know these are performance indicators, but we are not sure what they mean. Can you give us some clarification?

Resolution

"Disk IO Utilisation" is calculated from node_disk_utilisation.
The node-exporter collects these metrics for each of the devices listed. How is it calculated? node_disk_utilisation corresponds to the data shown in the iostat output on the RHCOS nodes.
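To make the calculation more concrete, here is a minimal Python sketch. It is not the node-exporter or iostat source. It assumes the Linux /proc/diskstats layout, in which the tenth statistics field after the device name is the cumulative time, in milliseconds, that the device spent doing I/O. The two sample lines and the 10-second interval are made up for illustration, and the helper names parse_io_ticks and percent_util are hypothetical.

```python
# Illustrative /proc/diskstats lines taken 10 seconds apart (made-up values).
SAMPLE_T0 = "259 0 nvme0n1 1000 0 40000 430 200000 3500 1700000 148000 0 50000 148430"
SAMPLE_T1 = "259 0 nvme0n1 1004 0 40174 432 200905 3516 1715408 148670 0 50724 149102"


def parse_io_ticks(diskstats_text):
    """Return {device: cumulative milliseconds spent doing I/O}."""
    ticks = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or truncated lines
        # fields[0:3] are major, minor, name; fields[12] is the 10th statistic.
        ticks[fields[2]] = int(fields[12])
    return ticks


def percent_util(before_ms, after_ms, interval_seconds):
    """Share of the interval during which the device had I/O in flight."""
    return 100.0 * (after_ms - before_ms) / (interval_seconds * 1000.0)


t0 = parse_io_ticks(SAMPLE_T0)
t1 = parse_io_ticks(SAMPLE_T1)
util = percent_util(t0["nvme0n1"], t1["nvme0n1"], interval_seconds=10)
print(f"nvme0n1 %util: {util:.2f}")  # 724 ms busy out of 10000 ms -> 7.24
```

On a real node the two samples would come from reading /proc/diskstats twice, with the measured elapsed time between the reads as the interval.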

  • Example of iostat
$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-128-190.us-east-2.compute.internal   Ready    worker   6h8m    v1.20.0+93f62ca
ip-10-0-134-153.us-east-2.compute.internal   Ready    master   6h13m   v1.20.0+93f62ca
ip-10-0-164-179.us-east-2.compute.internal   Ready    master   6h13m   v1.20.0+93f62ca
ip-10-0-177-63.us-east-2.compute.internal    Ready    worker   6h8m    v1.20.0+93f62ca
ip-10-0-196-68.us-east-2.compute.internal    Ready    master   6h13m   v1.20.0+93f62ca
ip-10-0-222-37.us-east-2.compute.internal    Ready    worker   6h7m    v1.20.0+93f62ca

$ oc debug node/ip-10-0-134-153.us-east-2.compute.internal --image=docker.io/fedora:latest
Creating debug namespace/openshift-debug-node-d2vr4 ...
Starting pod/ip-10-0-134-153us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.134.153
If you don't see a command prompt, try pressing enter.
sh-5.0#  dnf install -y sysstat
...
Complete!
sh-5.0# iostat -x
Linux 4.18.0-240.15.1.el8_3.x86_64 (ip-10-0-134-153)     03/29/21     _x86_64_    (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.67    0.00    7.79    0.46    0.01   77.06

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme0n1          0.44     17.38     0.00   0.10    0.43    39.53   90.49    770.38     1.61   1.75    0.74     8.51    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.02   **7.24**

The disk nvme0n1 is 7.24% utilised (%util) while serving 0.44 reads and 90.49 writes per second. In iostat, %util is the percentage of elapsed time during which I/O requests were issued to the device, that is, the percentage of time the drive was doing at least one operation. The value does not increase when the device handles several requests at the same time.

For a hard disk, which serves requests one at a time, the amount of time spent busy is a good indication of how busy the device really is, and that is what %util captures. SSDs, RAID arrays, NVMe devices and high-speed storage in general can serve multiple requests in parallel, so %util is much less helpful for them. The iostat documentation carries this warning:

Device saturation occurs when %util is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.
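The effect described in that warning can be sketched with a toy model in Python. This is an illustration only, not how iostat or the kernel accounts for I/O: each request is assumed to be a (start, end) pair in seconds, and the hypothetical helper busy_fraction measures how much of the window had at least one request in flight, which is what %util expresses.

```python
def busy_fraction(requests, window_seconds):
    """Fraction of the window during which at least one request was in flight."""
    busy = 0.0
    current_start = current_end = None
    for start, end in sorted(requests):
        if current_end is None or start > current_end:
            # A gap before this request: close the previous busy period.
            if current_end is not None:
                busy += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or back-to-back request: extend the busy period.
            current_end = max(current_end, end)
    if current_end is not None:
        busy += current_end - current_start
    return busy / window_seconds


# Serial device: four requests served back to back fill the whole second.
serial = [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
# Parallel device: requests overlap completely, each taking the full second.
parallel_4 = [(0.0, 1.0)] * 4
parallel_8 = [(0.0, 1.0)] * 8

for name, reqs in [("serial", serial), ("parallel x4", parallel_4), ("parallel x8", parallel_8)]:
    print(name, len(reqs), "requests, busy fraction:", busy_fraction(reqs, 1.0))
```

All three cases report a busy fraction of 1.0 (100%), even though the last one completes twice as many requests as the others in the same second. For the serial device 100% really is the limit, while for the parallel device the same number says nothing about how much headroom is left.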

node_disk_utilisation therefore has to be interpreted with caution, because its meaning depends on the type of device. Grafana does not take any of this into consideration; it simply shows the %util value that node_disk_utilisation reports.

In short, Prometheus evaluates the queries and keeps the history of the "%util" results, and Grafana plots the result of that expression.

"Disk IO Saturation" is calculated from node_disk_saturation. It is again based on "%util", but it only becomes relevant when the value is close to 100%. Because this percentage reflects the time during which I/O requests were issued to the device, a saturated device means that disk IO utilisation is very high.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
