Seaquest cluster application experiences performance regressions after upgrading from RHEL 5 to RHEL 6
Issue
- In an 80-node test cluster, considerable performance regressions occur after upgrading from RHEL 5.7 to RHEL 6.2.
- The Seaquest platform has recently moved from the RHEL 5.7 kernel to the RHEL 6.2 kernel. We have observed considerable performance regressions (20% to 70%) in all our benchmarks on the RHEL 6.2 kernel relative to the RHEL 5.7 kernel. The degradation may be due to application processes pegging some to many nodes at 100%, causing slowdowns across the cluster. Other observations, likely related:
- Application processes stuck waiting in memory allocation routines for up to 120 sec at a time
- collectl shows gaps in data collection lasting from 20 sec to 10 min at times
- Taking gcore dumps of the application processes on the pegged nodes appears to free them up
- dd and messaging traffic show no performance variability between RHEL 5.7 and RHEL 6.2
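The collection gaps mentioned above can be located mechanically once sample timestamps have been extracted from the monitoring output. The following is only a minimal sketch under stated assumptions: the function name find_gaps, the 1-second sampling interval, the tolerance factor, and the sample data are all hypothetical, and parsing of actual collectl output is not shown.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, interval_s=1.0, tolerance=2.0):
    """Return (previous, current, seconds) for each pair of consecutive
    samples that are further apart than interval_s * tolerance."""
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        delta = (curr - prev).total_seconds()
        if delta > interval_s * tolerance:
            gaps.append((prev, curr, delta))
    return gaps


if __name__ == "__main__":
    # Synthetic timestamps (not real measurements): one sample per second,
    # with a single stall between the 4th and 5th samples.
    base = datetime(2012, 1, 1, 12, 0, 0)
    offsets = [0, 1, 2, 3, 25, 26, 27]
    samples = [base + timedelta(seconds=s) for s in offsets]
    for start, end, seconds in find_gaps(samples):
        print(f"gap of {seconds:.0f}s between {start.time()} and {end.time()}")
```

Running the same check on samples from each node would show whether the gaps line up in time with the periods when nodes are pegged, assuming the node clocks are synchronized.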
Environment
- Red Hat Enterprise Linux 6.2
- Mellanox OFED driver
- HP-MPI
- Default kernel configuration settings (with a mix of ext3 and ext4 filesystems in use)
- HP DL380 G7: 2x 12-core CPU, 96 GB RAM, Mellanox ConnectX-2 PCI IB card