OS noise caused by snmpd interrupting tasks of a parallel job results in degradation of barrier performance.
Issue
- snmpd as shipped with RHEL 6.2 generates enough OS noise to severely degrade barrier performance in parallel jobs.
- In a parallel job, when there is a barrier, the whole application has to stop and wait for the slowest task.
- Each time the OS interrupts the computation on one CPU of the job, it makes one task slower.
- These delays accumulate rapidly in a very large compute job, because every additional task is another chance for an interruption before each barrier.
- The issue was identified when running a job across 1000 nodes (16,000 cores).
- At full scale (the cluster is 2916 nodes, 46,208 cores), the impact on MPI barrier performance would be about 100x.
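The accumulation described above can be illustrated with a minimal sketch. It assumes that each task is interrupted independently with the same probability before a given barrier; the function name and the probability value are hypothetical and were not measured on the affected cluster.

```python
def prob_barrier_delayed(p_interrupt: float, num_tasks: int) -> float:
    """Probability that at least one of num_tasks tasks is interrupted
    before a given barrier, assuming independent interruptions."""
    return 1.0 - (1.0 - p_interrupt) ** num_tasks


# Hypothetical per-task interruption probability per barrier interval
# (an assumption for illustration, not a measured value).
P_INTERRUPT = 0.0001

# One node, the 1000-node test job, and the full cluster core count.
for tasks in (16, 16_000, 46_208):
    print(f"{tasks:>6} tasks: {prob_barrier_delayed(P_INTERRUPT, tasks):.3f}")
```

Under this simple model, an interruption that is rare for any single task becomes likely to hit some task before every barrier once the job spans tens of thousands of cores, and the barrier then waits for that one delayed task.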
mpiBench -i 1000 -t 0, Barrier times, 5 samples each (tpn = tasks per node):
15tpn, w/ snmpd: 93, 48, 109, 51, 47
15tpn, no snmpd: 35, 49, 52, 42, 42
16tpn, w/ snmpd: 4834, 202, 1530, 16019, 17397
16tpn, no snmpd: 54, 54, 51, 57, 51
- First run: 15 tasks per node (leaving one CPU free) with snmpd running.
- Second run: 15 tasks per node without snmpd. The difference from the first run is small, probably because the OS scheduled snmpd on the free CPU.
- Third run: all 16 CPUs per node in use with snmpd running. Barrier performance was degraded by 100x or more in the worst samples.
- Fourth run: all 16 CPUs per node in use without snmpd. Barrier times stayed in the 51 to 57 range.
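The slowdown can be checked directly from the samples listed above. The following is a small sketch that compares each snmpd run against the median of the matching run without snmpd; the choice of the median as the baseline and the helper name `slowdown` are assumptions made for illustration, and the times are in whatever unit mpiBench reported.

```python
from statistics import median

# Barrier times copied from the mpiBench samples above.
samples = {
    ("15tpn", "snmpd"): [93, 48, 109, 51, 47],
    ("15tpn", "no snmpd"): [35, 49, 52, 42, 42],
    ("16tpn", "snmpd"): [4834, 202, 1530, 16019, 17397],
    ("16tpn", "no snmpd"): [54, 54, 51, 57, 51],
}


def slowdown(tpn: str) -> tuple[float, float]:
    """Return (median, worst-case) slowdown of the snmpd run relative
    to the median of the run without snmpd at the same tasks per node."""
    baseline = median(samples[(tpn, "no snmpd")])
    noisy = samples[(tpn, "snmpd")]
    return median(noisy) / baseline, max(noisy) / baseline


for tpn in ("15tpn", "16tpn"):
    med, worst = slowdown(tpn)
    print(f"{tpn}: median slowdown {med:.1f}x, worst sample {worst:.1f}x")
```

With one CPU left free the slowdown stays within a small factor, while with all CPUs in use the worst sample is several hundred times slower than the baseline.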
Environment
Red Hat Enterprise Linux 6
