NVMe performance degradation on RHEL 6.6

Problem Statement
We are seeing unexpected performance degradation on our NVMe device when using RHEL 6.6.
The scenario is an FIO random read job with a 4k block size; the full job parameters are listed below.
We do not see the problem when using RHEL 6.5 or RHEL 7.0 on the same hardware.

System Details

OS Level     RHEL 6.6
Kernel       2.6.32-504.el6.x86_64
H/W          Supermicro X10SAE motherboard
             16GB DDR3 memory @ 1600MHz
             Intel Xeon CPU E3-1225 v3 @ 3.20GHz, 1 socket, 4 cores
Device       Samsung NVMe SSD Controller 171X (rev 03)
             Dell Express Flash NVMe XS1715 SSD 400GB
             Using a PCIe 3.0 slot. Target Link Speed: 8GT/s (from lspci)
Driver       nvme, as shipped with the kernel
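For reference, the Target Link Speed above can be confirmed with something along these lines (the PCI address 02:00.0 is a placeholder; take the NVMe controller's address from plain lspci output):

# Confirm the negotiated PCIe link speed/width for the NVMe controller
# (02:00.0 is an example address; look for "Non-Volatile memory controller" in lspci)
lspci -vv -s 02:00.0 | grep -i -e 'LnkCap' -e 'LnkSta' -e 'Target Link Speed'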

FIO Test Results

RHEL 6.5    2.6.32-431.23.3.el6.x86_64              750 Kiops @ 55% CPU utilization
RHEL 6.6    2.6.32-504.el6.x86_64                   139 Kiops @ 97% CPU utilization   <----
RHEL 7.0    3.10.0-123.9.3.el7.x86_64               753 Kiops @ 59% CPU utilization
CentOS 6.5  2.6.32-431.29.2.el6.centos.plus.x86_64  753 Kiops @ 58% CPU utilization
CentOS 6.6  2.6.32-504.1.3.el6.x86_64               749 Kiops @ 55% CPU utilization
CentOS 7.0  3.10.0-123.9.2.el7.x86_64               749 Kiops @ 59% CPU utilization

FIO Output Sample (using RHEL 6.6)
Measure_RR_4KB_QD256: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
...
Measure_RR_4KB_QD256: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 4 processes
Jobs: 4 (f=4): [rrrr] [100.0% done] [543.5M/0K/0K /s] [139K/0 /0 iops] [eta 00m:00s]
Measure_RR_4KB_QD256: (groupid=0, jobs=4): err= 0: pid=5510: Mon Dec 8 12:24:54 2014
read : io=161844MB, bw=552427KB/s, iops=138106 , runt=300001msec
slat (usec): min=0 , max=94120 , avg=21.57, stdev=96.18
clat (usec): min=7 , max=96788 , avg=1828.19, stdev=801.42
lat (usec): min=94 , max=96875 , avg=1850.74, stdev=806.36
clat percentiles (usec):
| 1.00th=[ 812], 5.00th=[ 1064], 10.00th=[ 1208], 20.00th=[ 1384],
| 30.00th=[ 1512], 40.00th=[ 1640], 50.00th=[ 1768], 60.00th=[ 1880],
| 70.00th=[ 2024], 80.00th=[ 2192], 90.00th=[ 2448], 95.00th=[ 2672],
| 99.00th=[ 3280], 99.50th=[ 3856], 99.90th=[11840], 99.95th=[13120],
| 99.99th=[21632]
bw (KB/s) : min=76848, max=154416, per=25.01%, avg=138144.24, stdev=7932.83
lat (usec) : 10=0.01%, 50=0.01%, 100=0.01%, 250=0.01%, 500=0.03%
lat (usec) : 750=0.56%, 1000=2.98%
lat (msec) : 2=64.83%, 4=31.12%, 10=0.30%, 20=0.16%, 50=0.01%
lat (msec) : 100=0.01%
cpu : usr=20.09%, sys=77.38%, ctx=165487, majf=0, minf=354
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=41432135/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
READ: io=161844MB, aggrb=552426KB/s, minb=552426KB/s, maxb=552426KB/s, mint=300001msec, maxt=300001msec

FIO Job Parms
;Async Test CPU Utilization
;======================
; -- start job file --
[Measure_RR_4KB_QD256]
ioengine=libaio
direct=1
rw=randread
norandommap
randrepeat=0
iodepth=64
size=25%
numjobs=4
bs=4k
overwrite=1
filename=/dev/nvme1n1
runtime=5m
time_based
group_reporting
stonewall
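
Note that numjobs=4 with iodepth=64 gives the effective queue depth of 256 referenced in the job name. The job file is run against the raw device with something like the following (the job file name is just an example):

# Run the job file above; fio takes the target device from the filename= line
fio randread_4k_qd256.fio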

Responses

Thomas,

Have you tried RHEL 6.6 with the RHEL 6.5 kernel version? That may help narrow the issue down to whether another OS component or the kernel update itself is the culprit.
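
Something along these lines should do it, assuming the 6.5 kernel RPM from your results table is still available in an enabled repository:

# Install the RHEL 6.5 kernel alongside the 6.6 one (install, not upgrade)
yum install kernel-2.6.32-431.23.3.el6.x86_64

# Make it the default boot entry, reboot, and rerun the fio job
grubby --set-default=/boot/vmlinuz-2.6.32-431.23.3.el6.x86_64
reboot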

From there, I think it will definitely need a formal support request through Red Hat to investigate the regression.

Yes, I think this is best investigated in a support case.

Just curious whether a ticket was opened on this, and what the resolution might have been. Same situation here on 6.6. Thanks so much!

Interesting issue (I wish I could look at this first hand).
I am curious about the following (a quick way to gather these on both builds is sketched after this list):
* are you using tuned profiles -- https://access.redhat.com/videos/898563
* did you double-check the disk alignment on the SSD between the 2 builds
* differences between the I/O elevator settings -- https://access.redhat.com/solutions/54164
* different SELinux settings (or nsswitch -- perhaps there is a hang-up doing user lookups? a bit of a stretch here)
* buffer cache tuning the same between hosts (sysctl -a)
* boot params the same between 6.5 and 6.6 ( cat /proc/cmdline)
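
A rough sketch of collecting those data points on each build so they can be diffed afterwards (device name nvme1n1 taken from the job file; adjust as needed):

# Run on both the 6.5 and 6.6 builds, then diff the two output files
tuned-adm active                        > settings-$(uname -r).txt   # requires tuned to be installed
cat /sys/block/nvme1n1/queue/scheduler >> settings-$(uname -r).txt   # may show "none" for NVMe
getenforce                             >> settings-$(uname -r).txt
cat /proc/cmdline                      >> settings-$(uname -r).txt
sysctl -a | sort                       >> settings-$(uname -r).txt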

Along the lines of what PixelDrift asked, I would be curious what happens if you install 6.5, update to 6.6, and see whether the problem returns. If it does, analyze all the files that get updated (using find or something similar). Hopefully it's a tunable (or setup) causing this behavior and not a binary.
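
A minimal sketch of that comparison, assuming you snapshot the RPM list before running the update:

# Before the 6.5 -> 6.6 update
rpm -qa | sort > rpms-6.5.txt

# After the update: see which packages changed and which files were touched recently
rpm -qa | sort > rpms-6.6.txt
diff rpms-6.5.txt rpms-6.6.txt
find /etc /boot /lib/modules -mtime -1 -type f > recently-updated-files.txt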

Thanks, I'll ask the technical team to review the bullets you provided. I won't pretend to understand their spreadsheet of test results dealing with IRQ and CPU affinity.
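
For anyone following along, the IRQ/CPU affinity they are looking at can be inspected with something like this (interrupt names such as nvme0q1 depend on the driver version):

# Show how the nvme queue interrupts are distributed across CPUs
grep nvme /proc/interrupts

# Dump the affinity mask for each nvme interrupt
for irq in $(grep nvme /proc/interrupts | awk -F: '{print $1}'); do
    echo "IRQ $irq -> CPU mask $(cat /proc/irq/$irq/smp_affinity)"
done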

I was given this high-level summary of the workaround:
RHEL 6.5 performance was as expected with the irqbalance service both on and off.
RHEL 6.6 performance was severely degraded until we turned the irqbalance service off (it is on by default).
RHEL 7.0 performance was as expected with irqbalance both on and off.
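
For anyone else who hits this, the workaround above presumably boils down to something like the following on RHEL 6 (RHEL 7 equivalents shown for completeness):

# RHEL 6: stop irqbalance now and keep it off across reboots
service irqbalance stop
chkconfig irqbalance off

# RHEL 7 equivalent
systemctl stop irqbalance
systemctl disable irqbalance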

Perhaps my issue is slightly different, but I'm seeing abysmal read performance with Intel PCIe NVMe drives and EL6.6 with XFS or ext4 filesystems. Tests against the raw unformatted devices with fio are good, but read performance is capped at around 600MB/s sequential when I add a filesystem.

The state of irqbalance didn't affect the results. I'm not in a position to downgrade to EL6.5 to test, nor to move to EL7.1 (because of other incompatibilities).
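
In case it helps with comparisons, this is roughly how I am separating raw-device reads from filesystem reads (the device name, mount point, and file name are placeholders):

# Sequential read straight from the raw device
fio --name=raw_read --filename=/dev/nvme0n1 --rw=read --bs=128k --iodepth=32 \
    --ioengine=libaio --direct=1 --runtime=60 --time_based

# The same read pattern through the filesystem (XFS or ext4 mounted at /mnt/nvme)
fio --name=fs_read --filename=/mnt/nvme/testfile --size=10G --rw=read --bs=128k \
    --iodepth=32 --ioengine=libaio --direct=1 --runtime=60 --time_based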