Seaquest cluster application experiences performance regressions after upgrading from RHEL 5 to RHEL 6

Solution Verified - Updated 2024-08-06T06:33:08+00:00

Issue

An 80-node test cluster shows considerable performance regressions after upgrading from RHEL 5.7 to RHEL 6.2.
The Seaquest platform recently moved from the RHEL 5.7 kernel to the RHEL 6.2 kernel. Since the move, all of our benchmarks show considerable performance regressions (20% to 70%) on the RHEL 6.2 kernel relative to the RHEL 5.7 kernel. The degradation may be caused by application processes pegging anywhere from a few to many nodes at 100%, which slows down the whole cluster. Other observations, likely related:
- Application processes are stuck waiting in memory allocation routines for up to 120 seconds at a time
- collectl shows gaps in data collection lasting from 20 seconds to 10 minutes
- Taking gcores of the application processes on the pegged nodes appears to free them up
- dd and messaging traffic show no performance variability between RHEL 5.7 and RHEL 6.2
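
The collection gaps noted above can be located programmatically once sample timestamps have been extracted from the monitoring data. The following is a minimal sketch, not part of the original report: the function name `find_gaps`, the `slack` parameter, and the sample values are hypothetical, and it assumes the timestamps are already available as a sorted list of seconds rather than parsing any particular collectl output format.

```python
from typing import List, Tuple


def find_gaps(
    timestamps: List[float],
    expected_interval: float,
    slack: float = 2.0,
) -> List[Tuple[float, float, float]]:
    """Return (previous_sample, next_sample, gap_seconds) for each pair of
    consecutive samples spaced more than slack * expected_interval apart.

    Assumes `timestamps` is sorted in ascending order and expressed in seconds.
    """
    threshold = slack * expected_interval
    gaps = []
    # Compare each sample with the one that follows it.
    for prev, curr in zip(timestamps, timestamps[1:]):
        delta = curr - prev
        if delta > threshold:
            gaps.append((prev, curr, delta))
    return gaps


if __name__ == "__main__":
    # Hypothetical samples taken every 10 s, with one 25 s and one 600 s gap.
    samples = [0.0, 10.0, 20.0, 45.0, 55.0, 655.0]
    for prev, curr, delta in find_gaps(samples, expected_interval=10.0):
        print(f"gap of {delta:.0f} s between t={prev:.0f} and t={curr:.0f}")
```

Running the same check on every node and comparing the gap windows against the times when nodes were pegged may help confirm whether the two symptoms coincide.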

Environment

- Red Hat Enterprise Linux 6.2
- Mellanox OFED driver
- HP-MPI
- Default kernel configuration settings (with a mix of ext3 and ext4 filesystems)
- HP DL380 G7: 2x 12-core CPUs, 96 GB RAM, Mellanox ConnectX-2 PCI IB card
