OS noise caused by snmpd interrupting tasks of a parallel job results in degradation of barrier performance.

Solution Unverified - Updated -

Issue

  • snmpd as shipped with RHEL 6.2 generates enough OS noise to severely degrade barrier performance in parallel jobs.
  • In a parallel job, every task must stop at a barrier and wait for the slowest task to arrive.
  • Each time the OS interrupts the computation on one CPU of the job, the task on that CPU is delayed, and the whole job waits for it at the next barrier.
  • These delays accumulate rapidly in a very large compute job.
  • The issue was identified when running a job across 1000 nodes (16,000 cores).
  • At full scale (the cluster has 2916 nodes, 46,208 cores), the impact on MPI barrier performance would be about 100x.
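
As a rough illustration of why this effect grows with job size, the sketch below assumes that each task is interrupted independently with some small probability during one compute phase between barriers. The probability value and the function names are hypothetical and are not measurements from this cluster; the point is only that a barrier completes at the pace of the slowest task, so the chance that some task is delayed rises quickly with the task count.

```python
import math


def prob_any_task_delayed(p_single: float, n_tasks: int) -> float:
    """Probability that at least one of n_tasks is interrupted in one phase.

    Assumes (hypothetically) that interruptions are independent across tasks
    and that each task is interrupted with probability p_single per phase.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if n_tasks < 0:
        raise ValueError("n_tasks must be non-negative")
    if p_single == 1.0:
        return 1.0 if n_tasks > 0 else 0.0
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n_tasks * math.log1p(-p_single))


def barrier_time(task_times):
    """A barrier completes only when the slowest task arrives."""
    return max(task_times)


if __name__ == "__main__":
    p = 1e-4  # hypothetical per-task interruption probability, not measured
    for n in (16, 1_600, 16_000):
        print(f"{n:>6} tasks: P(some task delayed) = {prob_any_task_delayed(p, n):.4f}")
```

Under these assumptions, an interruption that is rare for any single task becomes likely for the job as a whole once thousands of tasks must meet at every barrier.
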
mpiBench -i 1000 -t 0 Barrier times, 5 samples each (tpn = tasks per node):
  15tpn, w/ snmpd: 93, 48, 109, 51, 47
  15tpn, no snmpd: 35, 49, 52, 42, 42
  16tpn, w/ snmpd: 4834, 202, 1530, 16019, 17397
  16tpn, no snmpd: 54, 54, 51, 57, 51
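  • The first run used 15 tasks per node (leaving one CPU free) with snmpd running.
  • The second run used 15 tasks per node without snmpd. The difference from the first run is small, probably because the OS scheduled snmpd on the free CPU.
  • The third run used all CPUs with snmpd running, and performance was degraded by 100x or more in the worst samples.
  • The fourth run used all CPUs without snmpd, and barrier times returned to the level of the 15 tasks per node runs.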
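
The size of the degradation can be checked directly from the samples listed above. The following is a minimal sketch that computes the median of each run and the slowdown of the snmpd runs relative to the matching runs without snmpd. The helper name `slowdown` is made up for illustration, and the values are used in whatever unit mpiBench reported, since the output above does not state one.

```python
from statistics import median

# Barrier times copied from the mpiBench output above (unit as reported).
samples = {
    ("15tpn", "snmpd"): [93, 48, 109, 51, 47],
    ("15tpn", "no snmpd"): [35, 49, 52, 42, 42],
    ("16tpn", "snmpd"): [4834, 202, 1530, 16019, 17397],
    ("16tpn", "no snmpd"): [54, 54, 51, 57, 51],
}


def slowdown(noisy, baseline):
    """Return (median ratio, worst-sample ratio) against the baseline median."""
    base = median(baseline)
    return median(noisy) / base, max(noisy) / base


if __name__ == "__main__":
    for tpn in ("15tpn", "16tpn"):
        med, worst = slowdown(samples[(tpn, "snmpd")], samples[(tpn, "no snmpd")])
        print(f"{tpn}: median slowdown {med:.1f}x, worst sample {worst:.1f}x")
```

Comparing against the median of the run without snmpd keeps a single unusually fast or slow baseline sample from skewing the ratio. With only five samples per run, the figures are indicative rather than precise.
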

Environment

Red Hat Enterprise Linux 6
