OS noise caused by snmpd interrupting tasks of a parallel job results in degradation of barrier performance.
Issue
- snmpd, as shipped with RHEL 6.2, generates enough OS noise to severely degrade barrier performance in parallel jobs.
- In a parallel job, when there is a barrier, the whole application has to stop and wait for the slowest task.
- Each time the OS interrupts the computation on one CPU of the job, it makes one task slower.
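- These delays accumulate rapidly in a very large compute job: the more tasks there are, the more likely it is that at least one of them is interrupted before any given barrier.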
- The issue was identified when running a job across 1000 nodes (16,000 cores).
- At full scale (the cluster has 2916 nodes and 46,208 cores), the issue would slow MPI barrier performance by about 100x.
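The reasoning above can be illustrated with a small sketch. It is a toy model, not a measurement: it assumes each task is interrupted independently with probability `p_single` per barrier interval and that an interruption adds a fixed `noise_delay`. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import random


def prob_any_task_interrupted(p_single: float, n_tasks: int) -> float:
    """Probability that at least one of n_tasks is interrupted in one interval.

    Assumes interruptions are independent across tasks (a simplification).
    """
    return 1.0 - (1.0 - p_single) ** n_tasks


def simulate_barrier_time(n_tasks, compute_time, p_single, noise_delay, rng):
    """Return the time at which a barrier completes in the toy model.

    The barrier releases only when the slowest task arrives, so the
    result is the maximum of the per-task times.
    """
    slowest = compute_time
    for _ in range(n_tasks):
        interrupted = rng.random() < p_single
        task_time = compute_time + (noise_delay if interrupted else 0.0)
        slowest = max(slowest, task_time)
    return slowest


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs are comparable
    p_single = 0.001        # hypothetical per-task interruption probability
    for n_tasks in (16, 1_600, 16_000):
        p_any = prob_any_task_interrupted(p_single, n_tasks)
        t = simulate_barrier_time(n_tasks, 1.0, p_single, 50.0, rng)
        print(f"{n_tasks:>6} tasks: P(any interrupted) = {p_any:.3f}, "
              f"simulated barrier time = {t}")
```

In this model a single delayed task is enough to delay the whole barrier, which is why a rare per-task interruption becomes a near-certain per-barrier delay as the task count grows.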
mpiBench -i 1000 -t 0 Barrier times, 5 samples each (tpn = tasks per node)
15tpn, w/ snmpd: 93, 48, 109, 51, 47
15tpn, no snmpd: 35, 49, 52, 42, 42
16tpn, w/ snmpd: 4834, 202, 1530, 16019, 17397
16tpn, no snmpd: 54, 54, 51, 57, 51
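- The first run used 15 tasks per node (leaving one CPU free) with snmpd running.
- The second run used 15 tasks per node without snmpd. With a free CPU, snmpd has only a small impact, probably because the OS schedules snmpd on that CPU.
- The third run used all 16 CPUs with snmpd running, and barrier performance degraded by 100x or more in some samples.
- The fourth run used all 16 CPUs without snmpd, and barrier times returned to roughly the level of the 15-tasks-per-node runs.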
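The comparison between runs can be made concrete by summarizing the samples listed above. The following is a minimal sketch: the numbers are copied from the benchmark output, the time unit is not stated in the output and is left unspecified here, and the helper names `summarize` and `median_slowdown` are hypothetical.

```python
from statistics import mean, median

# Barrier times copied from the mpiBench samples above (unit not stated).
SAMPLES = {
    ("15tpn", "snmpd"): [93, 48, 109, 51, 47],
    ("15tpn", "no snmpd"): [35, 49, 52, 42, 42],
    ("16tpn", "snmpd"): [4834, 202, 1530, 16019, 17397],
    ("16tpn", "no snmpd"): [54, 54, 51, 57, 51],
}


def summarize(values):
    """Return mean, median and maximum of one run's samples."""
    return {"mean": mean(values), "median": median(values), "max": max(values)}


def median_slowdown(with_snmpd, without_snmpd):
    """Ratio of median barrier time with snmpd to median time without it."""
    return median(with_snmpd) / median(without_snmpd)


if __name__ == "__main__":
    for (tpn, mode), values in SAMPLES.items():
        s = summarize(values)
        print(f"{tpn:>5} {mode:<9} mean={s['mean']:.1f} "
              f"median={s['median']} max={s['max']}")
    for tpn in ("15tpn", "16tpn"):
        ratio = median_slowdown(SAMPLES[(tpn, "snmpd")],
                                SAMPLES[(tpn, "no snmpd")])
        print(f"{tpn}: median slowdown with snmpd = {ratio:.1f}x")
```

The median is used for the ratio because, with only five samples per run, the mean of the 16tpn run with snmpd is dominated by its two largest values.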
Environment
Red Hat Enterprise Linux 6