Parallel jobs do not use all available cores with kernel 2.6.32-279 or newer


We have an IBM x3850 X5 dual-node system with 8 CPU sockets. Each CPU has 6 cores, giving an eight-node NUMA system with 48 cores.

Our users have noticed that, starting with kernel 2.6.32-279.19.1, the performance of the system has decreased massively for parallel jobs. Multi-threaded jobs are no longer distributed across all available cores, even though most of the cores are idle. Instead, the threads cluster within a few NUMA nodes, and some cores are allocated to multiple threads, depending on the number of parallel threads requested. For example, with 20 threads the system may use only two NUMA nodes, with two or three threads per core. This happens even for jobs whose threads do not need any shared memory regions.
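
To illustrate how the placement can be inspected, something like the following works (the process name "myjob" is only a placeholder for one of our users' codes, not a specific program):

# ps -eLo pid,lwp,psr,pcpu,comm | grep myjob

The psr column shows the CPU each thread last ran on; the clustering described above shows up as psr values from only one or two NUMA nodes, with several threads sharing the same core, while the rest of the machine stays idle.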

We completely upgraded the system to RHEL 6.4 (kernel 2.6.32-358.2.1) a few days ago, but the odd behavior is still the same.

With the old kernel 2.6.32-220.7.1 we do not see this clustering of threads on NUMA nodes, so we have reverted to that kernel. However, this cuts us off from all the security fixes that come with the newer kernels.

I tried lowering the tunable kernel.sched_migration_cost, but this did not yield any reproducible improvement.
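
For completeness, the change was applied with sysctl along these lines (the value 100000 is only an illustration, not necessarily the exact value tried; the default on this system is 500000, as listed in the sysctl output below):

# sysctl -w kernel.sched_migration_cost=100000
kernel.sched_migration_cost = 100000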

Does anybody have an idea how we can resolve this issue?

Thanks for any help,

Stefan


# numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 98285 MB
node 0 free: 79328 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 98304 MB
node 1 free: 78438 MB
node 2 cpus: 12 13 14 15 16 17
node 2 size: 98304 MB
node 2 free: 73256 MB
node 3 cpus: 18 19 20 21 22 23
node 3 size: 98304 MB
node 3 free: 84384 MB
node 4 cpus: 24 25 26 27 28 29
node 4 size: 98304 MB
node 4 free: 88010 MB
node 5 cpus: 30 31 32 33 34 35
node 5 size: 98304 MB
node 5 free: 88409 MB
node 6 cpus: 36 37 38 39 40 41
node 6 size: 98304 MB
node 6 free: 82114 MB
node 7 cpus: 42 43 44 45 46 47
node 7 size: 98304 MB
node 7 free: 54779 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  12  11  11  11  12  12  12 
  1:  12  10  11  11  12  12  11  12 
  2:  11  11  10  12  12  11  12  12 
  3:  11  11  12  10  12  12  12  11 
  4:  11  12  12  12  10  12  11  11 
  5:  12  12  11  12  12  10  11  11 
  6:  12  11  12  12  11  11  10  12 
  7:  12  12  12  11  11  11  12  10 
# sysctl -A | grep "sched" | grep -v "domain"
kernel.sched_child_runs_first = 0
kernel.sched_min_granularity_ns = 4000000
kernel.sched_latency_ns = 20000000
kernel.sched_wakeup_granularity_ns = 4000000
kernel.sched_tunable_scaling = 1
kernel.sched_features = 3183
kernel.sched_migration_cost = 500000
kernel.sched_nr_migrate = 32
kernel.sched_time_avg = 1000
kernel.sched_shares_window = 10000000
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_compat_yield = 0
kernel.sched_autogroup_enabled = 0
kernel.sched_cfs_bandwidth_slice_us = 5000
