Identical RHEL clusters perform differently - I/O sizes and IOPS do not match during testing, need help identifying cause
Hi there,
I apologize for the atrocious title and massive post, but this is a difficult problem to nail down and relatively complex. We have been fighting this for over two weeks and have pretty much hit a wall. I am not looking for any silver bullets, but if you have any thoughts that could point us in the right direction, it would be greatly appreciated.
Breakdown:
We have two environments that are identical hardware-wise. Each environment has a 5-node RHEL 5.6 cluster running on UCS B-series blades, connected to MDS SAN switches and dedicated CX4-960 CLARiiONs, with PowerPath 5.5.
The file systems they are running their tests against are GFS2, set up with default parameters on top of a single LVM volume with default parameters. They have been compared using gfs2_tool gettune/getargs/df and they are identical between the systems.
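For reference, the comparison was done roughly like this (hostnames and mount point below are placeholders, not our real names):
# Minimal sketch of the GFS2 tunable comparison; "env1-node1",
# "env2-node1" and /mnt/gfs2data are placeholder names.
for host in env1-node1 env2-node1; do
    ssh "$host" "gfs2_tool gettune /mnt/gfs2data" > "/tmp/gettune.$host"
done
diff /tmp/gettune.env1-node1 /tmp/gettune.env2-node1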
The customer is running a SAS-based test suite to test the throughput of the entire environment. What they noticed is that the suite takes approximately 5-1/2 hours to run in Environment 1 and approximately 3-1/2 hours to run in Environment 2. After gathering tons of iostat data and performance data from the arrays, it was determined that Environment 1 was doing almost 5x the IOPS, and its I/O sizes from the array's perspective were much smaller, so it was working a lot harder to move the same amount of data - therefore causing the job to run longer (that's my theory at least).
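For what it's worth, this is roughly how I was deriving the average request size from the iostat data (the device name is a placeholder, and column positions may differ with other sysstat versions):
# Sample extended iostat for one PowerPath pseudo device and print the
# average request size. "emcpowera" is a placeholder; on RHEL 5's
# sysstat, avgrq-sz is column 8 and is reported in 512-byte sectors.
iostat -xk emcpowera 5 3 | awk '/^emcpower/ {
    printf "%s  avg req size: %.1f KB  r/s: %s  w/s: %s\n", $1, $8*0.5, $4, $5 }'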
Reads were particularly unbalanced. I have looked EVERYWHERE to find differences in the device files (/sys/block/emcpowerBLAH/) and validated that all files were identical between the environments. At one time I had changed the scheduler from cfq to noop, but that has since been changed back. I had also changed the nr_requests value, but that has also been changed back in both environments.
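In case it helps anyone reproduce the comparison, here is a rough sketch of how the block device tunables can be dumped on each node and diffed afterwards (not the exact script I used):
# Dump the queue tunables of every emcpower pseudo device into one file
# per host so the environments can be diffed side by side.
for dev in /sys/block/emcpower*; do
    for f in scheduler nr_requests max_sectors_kb read_ahead_kb; do
        [ -r "$dev/queue/$f" ] && echo "$dev/queue/$f: $(cat $dev/queue/$f)"
    done
done > /tmp/blockdev-tunables.$(hostname)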
What complicates this even further is that I ran my own test using iozone to attempt to replicate the results. I gathered the same stats and did the same analysis, and the iozone tests did NOT exhibit the same issues with regard to I/O size, IOPS, etc. So I started looking at their SAS application parameters (not that I know anything about SAS) but was unable to find any differences (changing BUFSIZE, etc.) between the two, and they swear that the test/environment they are running is identical.
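The iozone run was along these lines (file size, record size and target path are illustrative, not the exact values I used):
# Illustrative iozone invocation: -i 0 = write/rewrite, -i 1 = read/reread,
# -e includes fsync in the timing. Sizes and path are placeholders.
iozone -i 0 -i 1 -e -r 64k -s 4g -f /mnt/gfs2data/iozone.tmp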
We are pretty much positive this is not hardware related and that it is caused somewhere at the OS or application level. Can anyone think of anything from a Red Hat perspective that could possibly account for this type of difference? Thank you very much, and I can provide any additional information that would be of use.
-Michael
Responses
The usual suspects are:
- Array performance
- Disk-path performance (HBAs, transport media, switches, array storage processors)
- Filesystem performance (FS-type, fs-parameters, etc.)
- Network performance
- Platform hardware health/saturation levels (CPU/RAM/system buses/etc.)
- OS parameters (device driver tuning, memory setup, etc.)
- Application setup
The key, obviously, is to eliminate the healthy components from the sick.
If your iozone numbers are truly the same between systems for all I/O patterns, then you've mostly eliminated array performance, disk-path performance, filesystem performance and storage-related OS parameters from the equation.
What have you done, to date, to verify the OS-level network performance (since most applications have a networking component that can sometimes result in performance issues that might make you look in the wrong areas)?
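One simple way to sanity-check node-to-node throughput, assuming the iperf package is available on both nodes (hostnames are placeholders):
# On a node in each environment, start a listener:
iperf -s
# Then from another node in the same environment, run a 30-second test
# with 4 parallel streams and compare the results between environments:
iperf -c env1-node1 -t 30 -P 4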
Also, since these are blade systems, are the two sets of systems in the same chassis? If not, is the overall chassis-load equivalent during your testing windows (i.e., is it possible that one of your I/O subsystems is oversubscribed across the chassis interconnects)?
Once you've eliminated your hardware, OS parameters and the like, have you verified that the application components are the same? Are they using the same software load between servers (e.g., same JVM/JDK, same internal parameters, same database layout/tuning, etc.)? Are they using the same database (and, if not, are the databases truly equivalent, both from a parameters standpoint and in terms of indexing, stored procedures and internal data order)? It might be good to run a DB analyzer against each database to see if it turns up anything "odd".
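A blunt way to rule out configuration drift on the application side is to diff the relevant configuration files between the two environments; the hostnames and SAS config path below are placeholders, so adjust them to wherever the suite's configuration actually lives:
# Pull the same config file from one node in each environment and diff it.
ssh env1-node1 'cat /opt/sas/config/sasv9.cfg' > /tmp/sasv9.cfg.env1
ssh env2-node1 'cat /opt/sas/config/sasv9.cfg' > /tmp/sasv9.cfg.env2
diff /tmp/sasv9.cfg.env1 /tmp/sasv9.cfg.env2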
echo 3 > /proc/sys/vm/drop_caches
should clear the caches and give you a "clean slate" between benchmarking runs. (Please note that running this on a production host will likely negatively impact performance, as frequently used data is cleared from the caches.)
If data is already in the pagecache, it won't be paged in from disk, which could sometimes explain a disparity in read performance.
http://linux-mm.org/Drop_Caches
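For example, between runs:
# Flush dirty pages first, then drop the pagecache, dentries and inodes.
sync
echo 3 > /proc/sys/vm/drop_caches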
Some other ideas:
- Check firmware versions on HBAs and network interfaces (see the sketch after this list).
- Verify battery status on HBAs or disk arrays (if applicable). Dell PERCs frequently place themselves into write-through cache mode if in the middle of a battery charge cycle.
- Cancel any running "self-tests" on the disk arrays. Higher-end disk arrays frequently have a self-test cycle that runs occasionally. If the array is in the middle of testing itself, you will lose throughput as lots of seeks are silently occurring in the background.
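A quick way to collect NIC and FC HBA driver/firmware details for comparison (the interface name is a placeholder, and the sysfs attribute names vary by HBA driver, hence the 2>/dev/null):
# NIC driver and firmware version:
ethtool -i eth0
# FC HBA firmware (attribute names differ between lpfc, qla2xxx, etc.):
cat /sys/class/fc_host/host*/symbolic_name 2>/dev/null
cat /sys/class/scsi_host/host*/fwrev 2>/dev/null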
