Parallel jobs do not use all available cores with kernel 2.6.32-279 or newer

We have an IBM x3850 X5 dual-node system with 8 CPU sockets. Each CPU has 6 cores, making an eight-node NUMA system with 48 cores.

Our users have noticed that, starting with kernel 2.6.32-279.19.1, the performance of the system for parallel jobs has decreased massively. Multi-threaded jobs are no longer distributed across all available cores, even though most of the cores are idle. The threads seem to cluster within NUMA nodes, and some cores are allocated multiple threads, depending on the number of parallel threads requested. E.g. with 20 threads, the system may use only two NUMA nodes with two or three threads per core. This happens even for jobs whose threads do not need any shared memory regions.

We completely upgraded the system a few days ago to RHEL 6.4 with kernel 2.6.32-358.2.1, but the odd behavior is still the same.

With the old kernel 2.6.32-220.7.1 we do not see this clustering of threads on NUMA nodes, so we have reverted to that kernel. However, this cuts us off from all security fixes coming with the newer kernels.

I tried lowering the tunable kernel.sched_migration_cost, but this did not give reproducible benefits.
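
To illustrate the kind of change I made (the value 100000 below is only an example; I tried a few different values, and a change made this way does not survive a reboot unless it is also put into /etc/sysctl.conf):

# sysctl kernel.sched_migration_cost
kernel.sched_migration_cost = 500000
# sysctl -w kernel.sched_migration_cost=100000
kernel.sched_migration_cost = 100000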

Does anybody have ideas on how we can resolve this issue?

Thanks for any help,

Stefan


# numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 98285 MB
node 0 free: 79328 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 98304 MB
node 1 free: 78438 MB
node 2 cpus: 12 13 14 15 16 17
node 2 size: 98304 MB
node 2 free: 73256 MB
node 3 cpus: 18 19 20 21 22 23
node 3 size: 98304 MB
node 3 free: 84384 MB
node 4 cpus: 24 25 26 27 28 29
node 4 size: 98304 MB
node 4 free: 88010 MB
node 5 cpus: 30 31 32 33 34 35
node 5 size: 98304 MB
node 5 free: 88409 MB
node 6 cpus: 36 37 38 39 40 41
node 6 size: 98304 MB
node 6 free: 82114 MB
node 7 cpus: 42 43 44 45 46 47
node 7 size: 98304 MB
node 7 free: 54779 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  12  11  11  11  12  12  12 
  1:  12  10  11  11  12  12  11  12 
  2:  11  11  10  12  12  11  12  12 
  3:  11  11  12  10  12  12  12  11 
  4:  11  12  12  12  10  12  11  11 
  5:  12  12  11  12  12  10  11  11 
  6:  12  11  12  12  11  11  10  12 
  7:  12  12  12  11  11  11  12  10 
# sysctl -A | grep "sched" | grep -v "domain"
kernel.sched_child_runs_first = 0
kernel.sched_min_granularity_ns = 4000000
kernel.sched_latency_ns = 20000000
kernel.sched_wakeup_granularity_ns = 4000000
kernel.sched_tunable_scaling = 1
kernel.sched_features = 3183
kernel.sched_migration_cost = 500000
kernel.sched_nr_migrate = 32
kernel.sched_time_avg = 1000
kernel.sched_shares_window = 10000000
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_compat_yield = 0
kernel.sched_autogroup_enabled = 0
kernel.sched_cfs_bandwidth_slice_us = 5000

Responses

Hi Stefan, and we appreciate you posting in our Groups! It doesn't look like we've had any takers on this yet, so I've reached out to a Red Hatter to respond. Any updates you can provide would be very helpful.

Hello Stefan, sorry to hear that you're experiencing some unexpected behavior here. There were a few changes to NUMA made between the kernel versions you're noting, and my guess is that one of these alterations is causing these threads to clump together. I have a few questions for you to help us pin down just what's going on here:

  • Is numad running on this server to help balance applications between NUMA nodes? (A quick way to check is sketched after this list.)
  • Alternatively, have you performed any testing using numactl to attempt to pin different processes to specific nodes or groups of nodes?
  • Finally, have any other kernel-level tunables on the system been changed after moving to the new kernel?
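
For the first question, a quick way to check (assuming the numad package is installed at all) would be something like:

# rpm -q numad
# service numad status
# chkconfig --list numad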

Thank you for any additional information you can provide, I'll keep digging to help us get to the bottom of this.

Hello Chris,

Thanks for the follow-up questions.

  • numad is not running. We also do not have cgroups running, which AFAIK is a prerequisite for it.
  • Yes, I tried pinning processes to different nodes / groups of nodes. This seemed to work when I used a subset of the available nodes whose total number of cores was smaller than the number of parallel processes started from the master process. The odd behavior returned when I instructed numactl to use all nodes - though for that I obviously do not need numactl at all. I have to say that I did not have prior experience with numactl - if you have advice on which tests to perform, that would be appreciated. (A rough sketch of these tests follows after this list.)
  • No kernel parameters were changed when we switched to the new kernels, unless some package update did this. Btw., we are using the latency-performance tuned profile.
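
As a rough sketch, the pinning tests looked like this (node numbers and the job command are placeholders, not my exact invocations):

$ numactl --cpunodebind=0,1 --membind=0,1 ./parallel_job
$ numactl --cpunodebind=0-7 --membind=0-7 ./parallel_job

The first form (a subset of nodes) spread the work across the bound cores as expected; the second form (binding to all nodes) behaved just like running the job without numactl.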

The programs we are testing with, and which are relevant to our users, are R and bowtie, plus one that I found at http://stackoverflow.com/questions/2828602/linux-2-6-31-scheduler-and-multithreaded-jobs. bowtie is a program from the bioinformatics field and has an option to define the number of parallel processes (-p option); a placeholder command line is shown after the R example below. With R plus its parallel package, we need just two commands to create a number of parallel processes, 20 in the following example:

$ R
> library(parallel)
> mclapply(1:20, function(i) repeat sqrt(pi), mc.cores=20)
# need Ctrl-C to stop
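
For bowtie the test is just an ordinary run with -p; the index and reads file names below are placeholders:

$ bowtie -p 20 <index_basename> <reads.fastq>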

In 'top', the per-process last-used-CPU column can be switched on with the keystroke sequence 'f' followed by 'j'.

For the R example we then typically see several R processes running on the same CPU, each with a CPU percentage well below 100% - roughly 100% divided by the number of R processes sharing that core; often it was 33%. The busy CPUs cluster within NUMA nodes.

For multi-threaded processes, top does not show individual CPUs but only the total percentage, e.g. 2000% for 20 threads. 'mpstat -P ALL' can then show which CPUs are actually in use.
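
For example, sampling all CPUs every two seconds, five times (interval and count are arbitrary):

$ mpstat -P ALL 2 5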

Btw., we have run the same tests on another system that has two NUMA nodes with 16 CPUs each, using exactly the same binaries and RHEL 6.3 with kernel 2.6.32-279. There we did not see the odd behavior and all processes got free CPUs.

I hope the additional information is useful.

Thanks for any help, Stefan
