identical rhel clusters perform differently - io sizes and iops do not match during testing, need help identifying cause


Hi there,

 

I apologize for the atrocious title and massive post, but this is a difficult problem to nail down and relatively complex. We have been fighting this for over two weeks and have pretty much hit a wall. I am not looking for any silver bullets, but if you have any thoughts that could point us in the right direction, it would be greatly appreciated.

 

Breakdown: 

We have two environments that are identical hardware-wise. Each environment has a 5-node RHEL 5.6 cluster running on UCS B-series blades, connected to MDS SAN switches and a dedicated CX4-960 CLARiiON array, with PowerPath 5.5.

 

The file systems they are running their tests against are GFS2, set up with default parameters, on top of a single LVM volume with default parameters. They have been compared using gfs2_tool gettune/getargs/df and are identical between the systems.
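
For anyone curious how that comparison can be done, this is roughly the kind of check we ran (the mount point and hostnames below are placeholders, not our actual names):

# Dump GFS2 tunables, mount arguments and df output on one node per
# environment, then diff the results (mount point and hosts are examples).
for host in env1-node1 env2-node1; do
    ssh $host "gfs2_tool gettune /sasdata; gfs2_tool getargs /sasdata; gfs2_tool df /sasdata" > /tmp/gfs2-$host.out
done
diff /tmp/gfs2-env1-node1.out /tmp/gfs2-env2-node1.out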

 

The customer is running a SAS-based test suite to measure the throughput of the entire environment. What they noticed is that the suite takes approximately 5-1/2 hours to run in Environment 1 and approximately 3-1/2 hours in Environment 2. After gathering tons of iostat data and performance data from the arrays, it was determined that Environment 1 was doing almost 5x the IOPS, and its I/O sizes from the array's perspective were much smaller, so it was working a lot harder to move the same amount of data - therefore causing the job to run longer (that's my theory at least).
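
For reference, a sketch of the iostat side of that analysis (emcpowerd is one of our actual pseudo devices; the column positions depend on your sysstat version, so adjust if your output differs):

# Capture extended device stats during the SAS run (stop with Ctrl-C after
# the run); r/s + w/s is the IOPS rate and avgrq-sz is the average request
# size in 512-byte sectors.
iostat -xk 10 > /tmp/iostat-sasrun.log

# Average request size in KB for the PowerPath pseudo device (avgrq-sz is
# the 8th column in the RHEL 5 sysstat output).
awk '/emcpowerd /{sum += $8; n++} END {if (n) printf "%.1f KB avg request\n", sum/n*512/1024}' /tmp/iostat-sasrun.log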

 

Reads were particularly unbalanced. I have looked EVERYWHERE to find differences in the device files (/sys/block/emcpowerBLAH/) and validated that all files were identical between the environments. At one time I had changed the scheduler from cfq to noop, but that has since been changed back. I had also changed the nr_requests value, but that has also been changed back in both environments.
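
In case it helps anyone following along, this is the sort of block-layer comparison I mean (emcpowerd is the device in question; run on a node in each environment and diff the output):

# Print the queue settings that affect how requests are sized and merged.
for f in scheduler nr_requests read_ahead_kb max_sectors_kb max_hw_sectors_kb; do
    echo "$f = $(cat /sys/block/emcpowerd/queue/$f)"
done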

 

What complicates this even further is that I ran my own test using iozone to attempt to replicate the results. I gathered the same stats and did the same analysis, and the iozone tests did NOT exhibit the same issues with regard to I/O size/IOPS, etc. So I started looking at their SAS application parameters (not that I know anything about SAS) but was unable to find any differences (BUFSIZE, etc.) between the two, and they swear that the test/environment they are running is identical.
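
A sketch of the kind of iozone run that can be used for this comparison (the path, record size and file size here are illustrative, not the exact parameters I used):

# Sequential write (-i 0) and read (-i 1) test on the GFS2 mount; -e includes
# flush time in the results and -I uses O_DIRECT to bypass the page cache so
# the two environments are compared without caching effects.
iozone -e -I -r 64k -s 4g -i 0 -i 1 -f /sasdata/iozone.tmp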

 

We are pretty much positive this is not hardware related and that it is caused somewhere at the OS or application level. Can anyone think of anything from a Red Hat perspective that could possibly account for this type of difference? Thank you very much, and I can provide any additional information that would be of use.

 

-Michael

Responses

The usual suspects are:

  • Array performance
  • Disk-path performance (HBAs, transport media, switches, array storage processors)
  • Filesystem performance (FS-type, fs-parameters, etc.)
  • Network Performance
  • Platform hardware Health/saturation levels (CPU/RAM/system buses/etc.)
  • OS Parameters (device driver tuning, memory setup, etc.)
  • Application setup

The key, obviously, is to eliminate the healthy components from the sick.

 

If your iozone numbers are truly the same between systems for all I/O patterns, then you've mostly eliminated array performance, disk-path performance, filesystem performance and storage-related OS parameters from the equation.

 

What have you done, to date, to verify the OS's network performance? Most applications have a networking component that can sometimes cause performance issues that make you look in the wrong areas.
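
A couple of quick checks along those lines (interface and host names are just examples, and iperf is an add-on package rather than part of the base RHEL install):

# Link speed/duplex and error counters on each node's interfaces.
ethtool eth0 | egrep 'Speed|Duplex'
ethtool -S eth0 | egrep -i 'err|drop'

# Raw throughput between a node in each environment, if iperf is available.
iperf -s                        # on the Environment 1 node
iperf -c env1-node1 -t 30       # on the Environment 2 node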

 

Also, since these are blade systems, are the two sets of systems in the same chassis? If not, is the overall chassis-load equivalent during your testing windows (i.e., is it possible that one of your I/O subsystems is oversubscribed across the chassis interconnects)?

 

Once you've eliminated your hardware, OS parameters and the like, have you verified that the application components are the same? Are they using the same software load between servers (e.g., same JVM/JDK, same internal parameters, same database layout/tuning, etc.)? Are they using the same database (and, if not, are the databases truly equivalent, both from a parameters standpoint and in terms of indexing, stored procedures and internal data order)? It might be good to run a DB analyzer against each database to see if it turns up anything "odd".
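
One cheap way to spot software-level drift between the two environments (hostnames below are placeholders) is to diff the installed package lists:

# Compare installed packages between one node in each environment.
ssh env1-node1 'rpm -qa | sort' > /tmp/rpms-env1
ssh env2-node1 'rpm -qa | sort' > /tmp/rpms-env2
diff /tmp/rpms-env1 /tmp/rpms-env2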

Thank you for the reply Thomas:

 

For the database part, SAS does not use a DB - it uses flat files, so there is no database component to consider at this time. The blades that are doing most of the work for each environment for this test's purposes are in the same chassis - and they have run simultaneously and experienced the same issue - so we have pretty much eliminated that.

 

For the OS/parameters part, software, etc. - what types of parameters/files, specifically with regard to SAS, do you feel could impact the size of the I/O that the server sends to the array and how it breaks it up?

 

-Michael

Have you validated that the start of the partitions is aligned in accordance with your storage array recommendations? With older versions of RHEL and FC-backed NetApp storage, it was important to align on a 4k block boundary.
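
A quick way to check that (the device name is just an example) is to print the partition table in sectors and compare the start offsets against the array vendor's recommendation:

# List partitions with sizes/offsets in sectors rather than cylinders.
fdisk -lu /dev/emcpowerd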

Yes, we followed the EMC Host Connectivity Guide and started our offset at 128 for ALL of our SAN drives. I tried to paste it in here, but failed miserably. I did an fdisk print and validated that the two devices we're focusing on are identical.

 

-Michael

Disk /dev/emcpowerd: 255 heads, 63 sectors, 69709 cylinders
 
Nr AF  Hd Sec  Cyl  Hd Sec  Cyl     Start      Size ID
 1 00   1   1    0 254  63 1023        128 1119874957 8e
 2 00   0   0    0   0   0    0          0          0 00
 3 00   0   0    0   0   0    0          0          0 00
 4 00   0   0    0   0   0    0          0          0 00
Disk /dev/emcpowerd: 255 heads, 63 sectors, 69709 cylinders
 
Nr AF  Hd Sec  Cyl  Hd Sec  Cyl     Start      Size ID
 1 00   1   1    0 254  63 1023        128 1119874957 8e
 2 00   0   0    0   0   0    0          0          0 00
 3 00   0   0    0   0   0    0          0          0 00
 4 00   0   0    0   0   0    0          0          0 00

echo 3 > /proc/sys/vm/drop_caches
should clear the caches and give you a "clean slate" between benchmarking runs. (Please note that running this on a production host will likely negatively impact performance, as often-used data is cleared from the caches.)

If data is in the page cache, it won't be paged in from disk, which could sometimes explain a disparity in read performance.

http://linux-mm.org/Drop_Caches
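
A typical pre-run reset between benchmark iterations, for what it's worth:

# Flush dirty pages to disk first, then drop the clean page cache plus
# dentries and inodes so every run starts cold.
sync
echo 3 > /proc/sys/vm/drop_caches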

Some other ideas:

  1. Check firmware versions on HBAs and network interfaces (a quick sketch of how to collect these follows this list).
  2. Verify battery status on HBAs or disk arrays (if applicable). Dell PERCs frequently place themselves into write-through cache mode if in the middle of a battery charge cycle.
  3. Cancel any running "self-tests" on the disk arrays. Higher-end disk arrays frequently have a self-test cycle that runs occasionally. If the array is in the middle of testing itself, you will lose throughput as lots of seeks are silently occurring in the background.
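
A rough sketch for item 1 (the interface name is an example, and the HBA firmware attribute name depends on the driver):

# NIC driver and firmware versions.
ethtool -i eth0

# FC HBA firmware: the sysfs attribute varies by driver (e.g. lpfc exposes
# it as fwrev under the scsi_host object), so print whatever is there.
cat /sys/class/scsi_host/host*/fwrev 2>/dev/null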

Thanks for the ideas Phil - I am going to look into the drop-cache stuff for sure. These are blade servers, so none of the HBA stuff really applies, and the arrays are fine (I work for EMC, FYI).

 

As an update on this case, the customer and I came up with some good testing techniques. They first realized that a particular file system seemed to be slowing their jobs down more than the others, so they reorganized their test to utilize a different file system more heavily (a different job slot in the SAS world). I told them to exclude that file system altogether from their testing, and sure enough both of the environments ran their test suite in under 3-1/2 hours.

 

So, we mounted debugfs (mount -t debugfs none /sys/kernel/debug), collected the debug info and sosreports, and sent them to support for analysis. Then we completely blew away the LVM logical volume, volume group, file system, even the LUN on the array, and rebuilt everything. Now the customer is running their test again using the problem file system, and we'll see if that cleared it up. I'll update you on Monday as to whether or not that worked.
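
Roughly what that collection looked like (the clustername:fsname directory under debugfs depends on how the filesystem was created - the one below is a placeholder):

# Mount debugfs and grab the GFS2 glock state for the suspect filesystem,
# then generate an sosreport on each node for support.
mount -t debugfs none /sys/kernel/debug
cat /sys/kernel/debug/gfs2/mycluster:sasfs/glocks > /tmp/glocks-$(hostname).txt
sosreport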

 

Given that information, does anyone have any clue what would cause that? Like I mentioned before, the output from gfs2_tool gettune/getargs/df (filesys name) showed the problem file system to be set up identically to all of the others. Maybe we hit a bug? Some weird caching thing?

Thanks for your help/interest.

 

-Michael

So sure enough, after we rebuilt everything from scratch, both environments perform fine. However, the issue now appears to have flipped: Environment 2 is now doing about 1.5x as much IOPS/read activity as Environment 1, but it does not appear to be impacting the test job nearly as much as the original difference - the run only took about 9 minutes longer (rather than 1.5 hours). We will continue to benchmark.

 

Any ideas as to what could have caused this?