sar -d and iostat -x report incorrect disk statistics in Red Hat Enterprise Linux 5 and 6


Environment

  • Red Hat Enterprise Linux 5.3 - 5.6
  • Red Hat Enterprise Linux 6.0 - 6.1
  • sysstat

Issue

  • sar -d and/or iostat -x report high utilization and an impossibly high queue length for some partitions, for example:
03:00:25 PM         DEV   tps  rd_sec/s  wr_sec/s  avgrq-sz             avgqu-sz  await  svctm  %util
03:00:30 PM  dev134-448  0.00      0.00      0.00      0.00                 0.00   0.00   0.00   0.00
03:00:30 PM  dev134-449  0.00      0.00      0.00      0.00  3689348814741909.50   0.00   0.00  99.84
03:00:30 PM  dev134-450  0.00      0.00      0.00      0.00                 1.00   0.00   0.00  99.84
  • sar and iostat report unusually high average queue sizes and high utilization for idle devices.
  • Do any side effects of this bug cause slow IO performance?

Resolution

  • A permanent fix requires upgrading to a kernel that includes the fix for this issue.

  • The only known temporary workaround after the issue has been encountered is to reboot the system, which forces the kernel to reset the disk statistics that are causing the problem. Alternatively, fully deleting and then rescanning/recreating the affected disk also works (see the sketch after this list), but a reboot is the preferred and typically easier method.

  • This accounting bug does not impact actual IO performance. It only causes the IO statistics reported by sar and iostat to be incorrect.
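For reference, below is a minimal sketch of the delete/rescan alternative for a SCSI-attached disk. The device name (sdd) and host (host0) are placeholders for the affected disk and its controller, the commands must be run as root, and any filesystems or multipath maps on the disk must be taken out of use before deleting it:

# echo 1 > /sys/block/sdd/device/delete
# echo "- - -" > /sys/class/scsi_host/host0/scan

Deleting the device discards its (broken) kernel statistics; the rescan re-creates the device with fresh counters.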

Root Cause

Hitting the bug requires both of the following:

  • location of IO
    • one IO is at the end of one partition and the other IO is at the beginning of the adjacent partition, and
  • timing
    • both IO requests have to hit the IO scheduler close enough in time to allow them to be merged together.

This results in the IOs being merged into one request and sent to storage. The bug occurs at IO completion time: the completion code does not take into account that different partitions may be involved and subtracts all of the completing IO from the first partition's inflight disk statistics count (field #9 in /proc/diskstats). This leaves a small negative inflight count for that partition, but since the counts are unsigned the result is seen as a huge positive number. Two adjacent partitions therefore end up being affected: the first partition ends up with a huge positive inflight count and the next/adjacent partition with a small positive inflight count, usually 1.
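The huge number is simply 32-bit unsigned wraparound. As a minimal illustration (plain bash arithmetic, not the kernel code), decrementing an unsigned 32-bit counter that is already at 0 wraps to exactly the value seen later in the /proc/diskstats example:

$ printf '%u\n' $(( (0 - 1) & 0xffffffff ))
4294967295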

The inflight count is used in managing two other /proc/diskstats fields:

    void disk_round_stats(struct gendisk *disk)
    {
            ...
            if (disk->in_flight) {
                    __disk_stat_add(disk, time_in_queue,
                                    disk->in_flight * (now - disk->stamp));
                    __disk_stat_add(disk, io_ticks, (now - disk->stamp));
            }

The time_in_queue value above is used by sar, iostat, and other disk statistics utilities to compute avgqu-sz, and io_ticks is used by the same utilities to compute %util (a rough sketch of the arithmetic follows the example below). So when a system hits this bug, you typically see a phantom 100% utilization reported for an idle device AND a huge avgqu-sz on the first affected partition. The 2nd affected partition shows the same phantom 100% utilization but a small average queue size, again usually 1. The resulting signature can be seen in sar or iostat -x while the device is idle:

14:34:30       DEV   tps  rd_sec/s  wr_sec/s  avgrq-sz             avgqu-sz  await  svctm   %util
14:34:30   dev8-49  0.00      0.00      0.00      0.00   633473354179585.12   0.00   0.00  103.07   ...first partition...
14:34:30   dev8-50  0.00      0.00      0.00      0.00                 1.03   0.00   0.00  103.07   ...2nd/adjacent partition
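To make the relationship concrete, here is a rough sketch (bash; not the exact sysstat implementation) of how avgqu-sz and %util relate to the time_in_queue and io_ticks deltas. The device name sdd1 and the 5-second interval are placeholders, and the field positions assume the full 11-field partition format shown in the /proc/diskstats example below:

    # Sample io_ticks ($13) and time_in_queue ($14) twice, INTERVAL seconds apart.
    DEV=sdd1; INTERVAL=5
    read t1 q1 < <(awk -v d="$DEV" '$3 == d { print $13, $14 }' /proc/diskstats)
    sleep "$INTERVAL"
    read t2 q2 < <(awk -v d="$DEV" '$3 == d { print $13, $14 }' /proc/diskstats)
    # Both counters are in milliseconds (and are 32-bit, so they can themselves wrap).
    echo "%util    ~ $(( (t2 - t1) * 100 / (INTERVAL * 1000) ))"
    echo "avgqu-sz ~ $(( (q2 - q1) / (INTERVAL * 1000) ))"

With a wrapped inflight count, both deltas keep growing even while the device is idle, which is why sar and iostat show an idle device pegged near 100% utilization with an enormous queue.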

The inflight count is field #9 in /proc/diskstats (3rd from the end), io_ticks is field #10, and time_in_queue is field #11 (the end of the line). The output from the commands in the Diagnostic Steps section below can be used to confirm that this bug has been hit. For example:

   8   48 sdd  3885 62189 528564 5559 0 0 0 0 0          3312       5557
   8   49 sdd1 3740 61973 525690 5189 0 0 0 0 4294967295 1127694323 3167277003
   8   50 sdd2   40    71    874  110 0 0 0 0 1          1127695523 1127695590
                                              ^
                                              |
                  field #9, inflight count ---+

Note: The field numbers used in the Linux kernel documentation start at the first disk statistics value after the device name. See /usr/share/doc/kernel-doc-*/Documentation/iostats.txt for more information.

The inflight decimal value 4294967295 = 0xFFFFFFFF (-1 if it were a signed value). So both the 1st and 2nd partitions have their inflight counts off by 1.
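A quick way to scan a running system for partitions stuck in this state is to look for impossibly large inflight counts. The following is one hedged example (the 100000 threshold is arbitrary, a wrapped counter normally shows up as a value close to 4294967295, and the field positions assume the 11-field layout shown above):

$ awk '$12 > 100000 { printf "%-8s inflight=%s io_ticks=%s time_in_queue=%s\n", $3, $12, $13, $14 }' /proc/diskstats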

On the plus side, given the two relatively rare requirements to hit this bug, it should not happen very often. On the negative side, once the bug has been encountered the inflight count stays broken within the kernel for the affected disks until either the system is rebooted or the device is deleted and recreated; a reboot is usually the easiest way to recover. The only permanent way to prevent the issue is to upgrade to a kernel which includes the fix.

Diagnostic Steps

Run the following commands while the system, or at least the disk(s) in question, is quiescent. If the iostat output shows the issue and the inflight count in the /proc/diskstats output is wrapped as described above, that confirms this bug is the cause of the high avgqu-sz/%util values.

$ nohup watch -n 1 'echo $(date) >> d.log ; cat /proc/diskstats >> d.log' &
$ iostat -tkx 1 30 > i.log
$ kill %1
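
Once the logs are collected, the suspect partition's raw counters can be pulled out of d.log and compared with the matching iostat samples in i.log. For example (sdd1 is a placeholder for the affected partition; $12, $13, and $14 are the inflight count, io_ticks, and time_in_queue):

$ awk '$3 == "sdd1" { print $12, $13, $14 }' d.log | tail -5
$ grep sdd1 i.log | tail -5

On an idle partition that has hit the bug, the inflight column stays pinned at a huge value across samples while iostat continues to report roughly 100% utilization.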

