Parallel jobs do not use all available cores with kernel 2.6.32-279 or newer
We have an IBM System x3850 X5 dual-node system with 8 CPU sockets. Each CPU has 6 cores, making up an eight-node NUMA system with 48 cores.
Our users have noticed that starting with kernel 2.6.32-279.19.1, the performance of the system has massively decreased for parallel jobs. Multi-threaded jobs are no longer distributed across all available cores, even when most of the cores are idle. Instead, the threads cluster within NUMA nodes, and some cores are allocated to multiple threads, depending on the number of parallel threads requested. E.g. with 20 threads, the system may use only two NUMA nodes, with two or three threads per core. This happens even for jobs whose threads do not need any shared memory regions.
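As a quick check (just a sketch; <PID> is a placeholder for the process ID of an affected job), the per-thread placement can be inspected with ps, where the PSR column shows the processor each thread is currently running on:

# ps -L -o lwp,psr -p <PID>
# ps -L -o psr= -p <PID> | sort -n | uniq -c

The second command counts how many threads sit on each core; on the affected kernels a handful of cores show counts of two or three instead of one thread per core.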
A few days ago we completely upgraded the system to RHEL 6.4 (kernel 2.6.32-358.2.1), but the odd behavior remains the same.
With the old kernel 2.6.32-220.7.1 we do not see this clustering of threads on NUMA nodes, so we have reverted to it. However, this cuts us off from all the security fixes shipped with the newer kernels.
I tried lowering the kernel.sched_migration_cost tunable, but this did not yield reproducible benefits.
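For reference, the change was applied like this (100000 is just one example value below the default of 500000 shown in the sysctl dump further down, not a recommendation):

# sysctl -w kernel.sched_migration_cost=100000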
Does anybody have ideas on how we can resolve this issue?
Thanks for any help,
Stefan
# numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 98285 MB
node 0 free: 79328 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 98304 MB
node 1 free: 78438 MB
node 2 cpus: 12 13 14 15 16 17
node 2 size: 98304 MB
node 2 free: 73256 MB
node 3 cpus: 18 19 20 21 22 23
node 3 size: 98304 MB
node 3 free: 84384 MB
node 4 cpus: 24 25 26 27 28 29
node 4 size: 98304 MB
node 4 free: 88010 MB
node 5 cpus: 30 31 32 33 34 35
node 5 size: 98304 MB
node 5 free: 88409 MB
node 6 cpus: 36 37 38 39 40 41
node 6 size: 98304 MB
node 6 free: 82114 MB
node 7 cpus: 42 43 44 45 46 47
node 7 size: 98304 MB
node 7 free: 54779 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  12  11  11  11  12  12  12
  1:  12  10  11  11  12  12  11  12
  2:  11  11  10  12  12  11  12  12
  3:  11  11  12  10  12  12  12  11
  4:  11  12  12  12  10  12  11  11
  5:  12  12  11  12  12  10  11  11
  6:  12  11  12  12  11  11  10  12
  7:  12  12  12  11  11  11  12  10

# sysctl -A | grep "sched" | grep -v "domain"
kernel.sched_child_runs_first = 0
kernel.sched_min_granularity_ns = 4000000
kernel.sched_latency_ns = 20000000
kernel.sched_wakeup_granularity_ns = 4000000
kernel.sched_tunable_scaling = 1
kernel.sched_features = 3183
kernel.sched_migration_cost = 500000
kernel.sched_nr_migrate = 32
kernel.sched_time_avg = 1000
kernel.sched_shares_window = 10000000
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_compat_yield = 0
kernel.sched_autogroup_enabled = 0
kernel.sched_cfs_bandwidth_slice_us = 5000