Identical RHEL clusters perform differently - I/O sizes and IOPS do not match during testing, need help identifying cause

Hi there,

I apologize for the atrocious title and the massive post, but this is a difficult problem to nail down and relatively complex. We have been fighting it for over two weeks and have pretty much hit a wall. I am not looking for any silver bullets, but if you have any thoughts that could point us in the right direction, it would be greatly appreciated.


Breakdown: 

We have two environments that are identical hardware-wise: each has a 5-node RHEL 5.6 cluster running on UCS B-series blades, connected to MDS SAN switches and a dedicated CX4-960 CLARiiON, with PowerPath 5.5.


The file systems they are running their tests against are GFS2, set up with default parameters on top of a single LVM volume, also with default parameters. They have been compared using gfs2_tool gettune/getargs/df and they are identical between the systems.
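
For reference, the comparison was essentially a dump-and-diff of the gfs2_tool output from one node in each environment; something along these lines (the mount point and file names below are placeholders, not our actual paths):

    # run on one node in each environment
    gfs2_tool gettune /mnt/gfs2 > /tmp/gettune.$(hostname)
    gfs2_tool getargs /mnt/gfs2 > /tmp/getargs.$(hostname)
    gfs2_tool df /mnt/gfs2 > /tmp/df.$(hostname)
    # copy the output files to one box, then diff, e.g.:
    diff /tmp/gettune.env1-node1 /tmp/gettune.env2-node1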


The customer is running a SAS-based test suite to measure the throughput of the entire environment. What they noticed is that the suite takes approximately 5-1/2 hours to run in Environment 1 and approximately 3-1/2 hours in Environment 2. After gathering tons of iostat data and performance data from the arrays, it was determined that Environment 1 was doing almost 5x the IOPS, and its I/O sizes from the array's perspective were much smaller, so it was working a lot harder to move the same amount of data, therefore causing the job to run longer (that's my theory, at least).
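
To put numbers on that, the per-device IOPS and average request size came straight out of iostat samples taken during the runs; roughly like this (the device name is a placeholder for the actual PowerPath pseudo-device):

    # extended per-device stats every 5 seconds during the test
    iostat -xk emcpowera 5
    # IOPS = r/s + w/s
    # average I/O size = avgrq-sz (in 512-byte sectors, so multiply by 512 for bytes)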


Reads were particularly unbalanced. I have looked EVERYWHERE to find differences in the device files (/sys/block/emcpowerBLAH/) and validated that all files were identical between the environments. At one time I had changed the scheduler from cfq to noop, but that has since been changed back. I had also changed the nr_requests value, but that has also been changed back in both environments.
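
For what it's worth, the block-layer check amounted to dumping the queue tunables for the pseudo-devices on each side and diffing the results; roughly like this (device name is again a placeholder):

    # dump the readable queue tunables for one pseudo-device
    for f in /sys/block/emcpowera/queue/*; do
        [ -f "$f" ] && echo "$f = $(cat "$f" 2>/dev/null)"
    done > /tmp/queue.$(hostname)
    # then diff the files collected from the two environments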


What complicates this even further is that I ran my own test using iozone to attempt to replicate the results. I gathered the same stats and did the same analysis, and the iozone tests did NOT exhibit the same issues with regard to I/O size, IOPS, etc. So I started looking at their SAS application parameters (not that I know anything about SAS) but was unable to find any differences (BUFSIZE and so on) between the two, and they swear that the test/environment they are running is identical.
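
In case anyone wants to try the same comparison, my iozone runs were along these lines; the record size, file size, and thread count here are illustrative rather than the exact values I used:

    # multi-threaded sequential write (-i 0) and read (-i 1) throughput on the GFS2 mount
    iozone -i 0 -i 1 -r 64k -s 4g -t 4 -F /mnt/gfs2/f1 /mnt/gfs2/f2 /mnt/gfs2/f3 /mnt/gfs2/f4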


We are pretty much positive this is not hardware-related and that it is caused somewhere at the OS or application level. Can anyone think of anything from a Red Hat perspective that could possibly account for this type of difference? Thank you very much, and I can provide any additional information that would be of use.


-Michael
