NVMe performance degradation on RHEL 6.6

Problem Statement
We are seeing unexpected performance degradation on our NVMe device when using RHEL 6.6.
The scenario is an FIO random read job with a 4k block size; the full job parameters are listed below.
We do not see the problem when using RHEL 6.5 or RHEL 7.0 on the same hardware.

System Details

OS Level   RHEL 6.6
Kernel     2.6.32-504.el6.x86_64
H/W        Supermicro X10SAE motherboard
           16GB DDR3 memory @ 1600MHz
           Intel Xeon CPU E3-1225 v3 @ 3.20GHz, 1 socket, 4 cores
Device     Samsung NVMe SSD Controller 171X (rev 03)
           Dell Express Flash NVMe XS1715 SSD 400GB
           PCIe 3.0 slot, Target Link Speed: 8GT/s (from lspci)
Driver     nvme (shipped with the kernel)
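
For reference, the link speed and driver details above can be re-checked with standard tools. A minimal sketch, assuming the device sits at PCI address 01:00.0 (a placeholder; substitute the slot that lspci actually reports for the XS1715):

# placeholder address -- find the real one with: lspci | grep -i nvme
lspci -vv -s 01:00.0 | grep -iE 'LnkCap|LnkSta'   # negotiated PCIe link speed/width (expect 8GT/s)
uname -r                                          # kernel under test
modinfo nvme | head -5                            # confirms the in-kernel nvme driver is in use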

FIO Test Results

RHEL 6.5     2.6.32-431.23.3.el6.x86_64               750 Kiops @ 55% CPU utilization
RHEL 6.6     2.6.32-504.el6.x86_64                    139 Kiops @ 97% CPU utilization   <----
RHEL 7.0     3.10.0-123.9.3.el7.x86_64                753 Kiops @ 59% CPU utilization
CentOS 6.5   2.6.32-431.29.2.el6.centos.plus.x86_64   753 Kiops @ 58% CPU utilization
CentOS 6.6   2.6.32-504.1.3.el6.x86_64                749 Kiops @ 55% CPU utilization
CentOS 7.0   3.10.0-123.9.2.el7.x86_64                749 Kiops @ 59% CPU utilization

FIO Output Sample (RHEL 6.6)
Measure_RR_4KB_QD256: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
...
Measure_RR_4KB_QD256: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 4 processes
Jobs: 4 (f=4): [rrrr] [100.0% done] [543.5M/0K/0K /s] [139K/0 /0 iops] [eta 00m:00s]
Measure_RR_4KB_QD256: (groupid=0, jobs=4): err= 0: pid=5510: Mon Dec 8 12:24:54 2014
read : io=161844MB, bw=552427KB/s, iops=138106 , runt=300001msec
slat (usec): min=0 , max=94120 , avg=21.57, stdev=96.18
clat (usec): min=7 , max=96788 , avg=1828.19, stdev=801.42
lat (usec): min=94 , max=96875 , avg=1850.74, stdev=806.36
clat percentiles (usec):
| 1.00th=[ 812], 5.00th=[ 1064], 10.00th=[ 1208], 20.00th=[ 1384],
| 30.00th=[ 1512], 40.00th=[ 1640], 50.00th=[ 1768], 60.00th=[ 1880],
| 70.00th=[ 2024], 80.00th=[ 2192], 90.00th=[ 2448], 95.00th=[ 2672],
| 99.00th=[ 3280], 99.50th=[ 3856], 99.90th=[11840], 99.95th=[13120],
| 99.99th=[21632]
bw (KB/s) : min=76848, max=154416, per=25.01%, avg=138144.24, stdev=7932.83
lat (usec) : 10=0.01%, 50=0.01%, 100=0.01%, 250=0.01%, 500=0.03%
lat (usec) : 750=0.56%, 1000=2.98%
lat (msec) : 2=64.83%, 4=31.12%, 10=0.30%, 20=0.16%, 50=0.01%
lat (msec) : 100=0.01%
cpu : usr=20.09%, sys=77.38%, ctx=165487, majf=0, minf=354
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=41432135/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
READ: io=161844MB, aggrb=552426KB/s, minb=552426KB/s, maxb=552426KB/s, mint=300001msec, maxt=300001msec
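
The cpu line above (usr=20.09%, sys=77.38%) shows that nearly all of the time on RHEL 6.6 goes to the kernel rather than to fio itself. A minimal profiling sketch, assuming perf is installed from the standard repositories, to see which kernel symbols account for that system time:

# sample the whole system for ~30 seconds while the fio job is running
perf record -a -g -- sleep 30
perf report --stdio | head -40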

FIO Job Parameters
;Async Test CPU Utilization
;======================
; -- start job file --
[Measure_RR_4KB_QD256]
ioengine=libaio
direct=1
rw=randread
norandommap
randrepeat=0
iodepth=64
size=25%
numjobs=4
bs=4k
overwrite=1
filename=/dev/nvme1n1
runtime=5m
time_based
group_reporting
stonewall
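
For completeness, the job above is run by saving it to a file and pointing fio at it; the file name rr4k.fio is just an illustrative choice, and /dev/nvme1n1 must match the device under test:

# fio --version should report fio-2.0.13 to match the output above
fio rr4k.fio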
