Terrible performance of RAID10 on 256GB Dell PE820 KVM host
I have a 4-way Dell PE820 with 256GB of memory that primarily runs CPU-intensive KVM guests. The system has a PERC H710 controller with a battery backup unit attached. I recently added 8 Seagate ST600MM0026 600GB SAS drives and created the following hardware RAID 10:
- 64KB stripe size
- write-through enabled
- disk cache enabled
- read-ahead enabled
The RAID synced fine, and the RHEL 6.5 (2.6.32-431.11.2.el6.x86_64) host sees the device without issue. I created a partition with parted:
Model: DELL PERC H710 (scsi)
Disk /dev/sde: 2398GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Number Start End Size File system Name Flags
1 1049kB 2398GB 2398GB ext4 data
and formatted it ext4
mkfs.ext4 -b 4096 -E stride=16,stripe-width=64 /dev/sde1
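The stride and stripe-width values follow from the array geometry (64KB stripe, 4KB blocks, 8 disks in RAID10 giving 4 data spindles):
# stride       = stripe size / block size = 64KB / 4KB = 16
# stripe-width = stride * data spindles   = 16 * 4     = 64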
I then mounted it without any special flags (mount /dev/sde1 /mnt/new)
The problem is that just about any IO I try is brutally slow or simply hangs. For example, I can't even get bonnie++ to finish a single test; it just hangs on the 'Writing Intelligently' step. Watching iostat and vmstat shows virtually no activity on the device, or on any other device on the system for that matter.
I installed tuned and enabled the enterprise-storage profile, but that did not seem to help.
If I do crude tests like 'badblocks /dev/sde' and watch iostat, I see reasonable values like:
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sde 8134.50 1041216.00 0.00 10412160 0
If I do a simple dd test like 'dd if=/dev/zero of=bogusfile count=5M', I get:
5242880+0 records in
5242880+0 records out
2684354560 bytes (2.7 GB) copied, 27.4337 s, 97.8 MB/s
I see values like this from iostat
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sde 914.20 0.80 365319.20 8 3653192
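(That dd used the default 512-byte block size and went through the page cache; a larger block size with O_DIRECT would give a truer picture of raw array throughput. Something like the following, with the count purely illustrative:)
# write 2.5GB in 1MB blocks, bypassing the page cache
dd if=/dev/zero of=bogusfile bs=1M count=2560 oflag=direct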
If I then attempt to copy this bogusfile with 'time cp bogusfile bogusfile2', it takes a long time!
# time cp bogusfile bogusfile2
real 2m37.567s
user 0m0.032s
sys 2m36.510s
Any suggestions on what I have setup incorrectly? Thank you
Responses
I have encountered a similar situation where ANYTHING on the box seemed to run poorly... and before I describe what we did to resolve it, I will admit I didn't believe it would have helped.
We had to update the PERC controller firmware and the BIOS (we basically did all firmware, the Lifecycle Controller, etc. at the same time). We then rebooted (warm reset). Same thing. So, for giggles, we did a cold reset (complete power cycle) of the host, and it's been fine since.
On another note: we are researching disabling CPU power management, specifically on blades, but likely on all of our systems:
OMCONFIG="/opt/dell/srvadmin/bin/omconfig"
# set the BIOS system profile to Performance Optimized
$OMCONFIG chassis biossetup attribute=SysProfile setting=PerfOptimized
# stop the BIOS from dynamically parking cores
$OMCONFIG chassis biossetup attribute=DynamicCoreAllocation setting=Disabled
A handy tool to see quick, high-level disk stats is "iotop".
To your original question(s): I don't know whether your setup is correct, or not.
Did you use parted to create the partition? If so, did it complain about alignment? People seem to ignore that partition alignment warning. I have found that SAN volumes and VM disks align without the warning when 2048s is used for the first sector:
parted -s /dev/sdb mklabel msdos mkpart primary ext3 2048s 100% set 1 lvm on
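If your parted is new enough to have align-check (2.1 or later, I believe), it can also verify alignment after the fact; in interactive mode it reports "1 aligned" on success:
# check partition 1 against the device's optimal alignment
parted /dev/sde align-check optimal 1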
I have not had to specify stride, etc.. for my filesystems, but it appears you have it correct according to:
http://wiki.centos.org/HowTos/Disk_Optimization
I believe there are some "om" commands, like omreport, that may be able to tell you some statistics about PERC performance as well. You can also make sure that the RAID device has completed its build cycle (an in-progress build would slow things down). This was a cool reference (although the OP there is using XFS):
http://techblog.xsoli.com/2013/05/creating-expandable-raid-10-dell-poweredge-r720xd/
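Back on the omreport idea: assuming OpenManage Server Administrator is installed and the PERC is controller 0, these should show virtual disk state (including any background initialization still running) and the physical disks behind it:
# virtual disk layout, stripe size, state, and background task progress
omreport storage vdisk controller=0
# physical disk state behind the controller
omreport storage pdisk controller=0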
Have you looked into "tuned"? There are a number of optimizations deployed with RHEV that might lend some advantage here. I'll have to do some digging, but tuned and KSM (kernel samepage merging) come to mind:
- tuned (specifically the storage profile)
- numad
- ksm (and ksmtuned)
as well as cgroups (possibly?) and the vm.swappiness kernel setting (see the commands sketched after this list).
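A minimal sketch of turning those knobs, assuming the tuned and ksmtuned packages are installed (the swappiness value is purely illustrative, not a recommendation from this thread):
tuned-adm profile enterprise-storage   # the storage-oriented tuned profile
service ksmtuned start                 # KSM tuning daemon shipped with qemu-kvm
cat /sys/kernel/mm/ksm/run             # 1 = KSM page merging enabled
sysctl -w vm.swappiness=10             # illustrative value only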
When you specifically mentioned KVM again, it jogged my memory about a post I thought I recalled seeing that discussed this same issue. It will likely take me a while to find that as well.
EDIT: I would also investigate mount options for performance (noatime, etc...)
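For example (noatime and nodiratime are standard ext4 mount options; whether they help here is untested):
mount -o remount,noatime,nodiratime /dev/sde1 /mnt/new
# or persistently via /etc/fstab:
# /dev/sde1  /mnt/new  ext4  defaults,noatime,nodiratime  0 2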
"Yup, my original message talked about tuned." -- I apologize... today has been excessively busy at work (but this thread is quite interesting ;-) I have a box that I am trying to do memory analysis on which has identified some "learning opportunities" for me ;-)
I am wondering about file system buffers, etc., which is definitely out of my wheelhouse. Hopefully one of the performance-tuning rock stars will notice this thread activity and join in.
Are you able to see what the IO looks like using iotop?
Ex.
Total DISK READ: 14.30 M/s | Total DISK WRITE: 1810.91 K/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
8477 be/4 qemu 13.67 M/s 0.00 B/s 0.00 % 0.00 % qemu-kvm ~0,addr=0x7
As a reference, I have a RHEV node capable of running over 40 VMs (but I also use multipath SAN access). I think your setup is sound; we just need to figure out which tunables are needed.
Could you also provide a snippet regarding the disk/controller setup on your VMs using grep like below?
# grep \<disk -A1 /etc/libvirt/qemu/*.xml
/etc/libvirt/qemu/PDGLLVNAGIOS10.xml: <disk type='file' device='disk'>
/etc/libvirt/qemu/PDGLLVNAGIOS10.xml- <driver name='qemu' type='raw' cache='none'/>
--
/etc/libvirt/qemu/PDGLLVNAGIOS10.xml: <disk type='file' device='disk'>
/etc/libvirt/qemu/PDGLLVNAGIOS10.xml- <driver name='qemu' type='raw' cache='none'/>
--
/etc/libvirt/qemu/PDGLLVNAGIOS10.xml: <disk type='file' device='disk'>
/etc/libvirt/qemu/PDGLLVNAGIOS10.xml- <driver name='qemu' type='raw' cache='none'/>
--
/etc/libvirt/qemu/PDGLLVNAGIOS10.xml: <disk type='block' device='cdrom'>
/etc/libvirt/qemu/PDGLLVNAGIOS10.xml- <driver name='qemu' type='raw'/>
I'm sure there's a virsh command that would/could tell us as well ;-)
I think they optimally recommend raw devices (vs. qcow) and virtio for performance. (I should have asked for -A4 to see the bus as well.) Although, I don't think qcow should be causing the issues you are describing.
http://www.linux-kvm.org/page/Tuning_KVM
http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/topic/liaat/liaatbestpractices_pdf.pdf
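For the virsh route, something like this should dump the live definition (GUESTNAME is a placeholder; pick a domain from 'virsh list --all'). A raw+virtio disk stanza typically looks like the commented example:
virsh dumpxml GUESTNAME | grep -A4 '<disk'
#   <disk type='file' device='disk'>
#     <driver name='qemu' type='raw' cache='none'/>
#     <source file='/var/lib/libvirt/images/GUESTNAME.img'/>
#     <target dev='vda' bus='virtio'/>
#   </disk>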
Ha! This... is why I love doing this for a living ;-) I tell everyone it's just 1's and 0's and that IT is straight-forward! Been in your spot countless times though. Especially when it comes to firmware and such. Good luck figuring out the culprit!
I'm still not comfortable with Transparent Huge Pages. For DB hosts we have disabled THP; I'm not sure how it would affect an environment such as a hypervisor.
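To check or flip it: on RHEL 6's 2.6.32 kernels the sysfs path carries a redhat_ prefix (upstream kernels use transparent_hugepage instead), and the active setting is shown in brackets:
cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled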
I'm glad you actually responded (I would have had to search for this thread otherwise ;-). I found some RHEV-specific tuning that I thought might be of interest as well, although I have NO clue how they determined the tuning values.
[root@pusgtst91 sysctl.d]# pwd
/etc/sysctl.d
[root@pusgtst91 sysctl.d]# ls -l
total 8
-rw-r--r-- 1 root root 499 Mar 26 06:35 libvirtd
-rw-r--r-- 1 root root 76 Mar 23 06:58 vdsm.conf
[root@pusgtst91 sysctl.d]# cat libvirtd
# The kernel allocates aio memory on demand, and this number limits the
# number of parallel aio requests; the only drawback of a larger limit is
# that a malicious guest could issue parallel requests to cause the kernel
# to set aside memory. Set this number at least as large as
# 128 * (number of virtual disks on the host)
# Libvirt uses a default of 1M requests to allow 8k disks, with at most
# 64M of kernel memory if all disks hit an aio request at the same time.
fs.aio-max-nr = 1048576
[root@pusgtst91 sysctl.d]# cat vdsm.conf
#Set dirty page parameters
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
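To compare a host's current values against those files, plain sysctl queries work:
sysctl vm.dirty_ratio vm.dirty_background_ratio fs.aio-max-nr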
EDIT: Just to let you know I JUST completed a server maintenance that caused the same behavior as you mentioned (slow performance) which seemed to be resolved by a bunch of firmware updates and a cold reset :-( It was not an enjoyable afternoon.
Just my 2 cents:
From the initial posting on the partition configuration: did anyone validate that it was on a hardware stripe boundary? You said the stripe size was 64kb in a 4 x 2 RAID10, so your full stripe width is 4 x 64kb = 256kb.
I normally don't run ext4 on storage arrays, so I am not sure of the interpretation of:
Model: DELL PERC H710 (scsi)
Disk /dev/sde: 2398GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Number Start End Size File system Name Flags
1 1049kB 2398GB 2398GB ext4 data
Does this mean that the ext4 partition starts at 1049 kb into the physical LUN?
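(Since parted's kB column is rounded decimal output, the exact start is easier to judge in sectors; with 512-byte sectors the stripe arithmetic is unambiguous:)
# print the partition table in 512-byte sectors
parted /dev/sde unit s print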
If so ... you are NOT on a full stripe boundary, potentially resulting in extra IOs: 1049 mod 256 = 25, so you are 25kb into the full stripe.
You are also not on a disk boundary: 1049 mod 64 = 25, so you are 25kb into the 64kb stripe segment on an individual disk as well.
Many hypervisors try to optimize IO, and do relatively large IO when possible. Well, if the hypervisor is doing 32kb or 64kb IO, none of this IO will be well aligned.
So ... each of your large reads needs additional IOs, reducing the expected performance gain, and that is if the additional IOs can be done in parallel. If the additional IOs are serialized, you get an outright reduction in performance: less than one disk's worth of read throughput.
You are using RAID10, so the write impact is not as great as it would be with RAID5 or RAID6, but you are probably still doubling the number of writes needed. A single host write of 64kb on a 64kb boundary (from the partition start) crosses two stripe segments, requiring an IO on 2 disks rather than just 1, before mirroring. The mirroring then doubles the work again, so 1 host write of 64kb generates 4 disk writes in total instead of 2.
I suggest that this is the core issue. Yes, all the various settings previously mentioned help, but they just let you do unwise things faster rather than being efficient to begin with.
In my experience, most hypervisor documentation has specific examples of how to properly align partitions on underlying RAID topologies. The issue is also very important for SSDs, which are effectively a stripe of flash chips and have alignment concerns of their own. Most SSDs want at least 4kb alignment, and a 1049kb start would not be SSD friendly either.
Hope this helps.
Dave B
