Task scheduler won't balance tasks over cores

Solution In Progress

Issue

  • We have an intermittent issue in the cluster where multiple tasks/threads are assigned to only a subset of the available cores, leaving the others idle. This can slow down computations drastically and degrade an entire parallel job spread over a number of nodes.

Once we notice a suspect node during a job run, we can observe the issue in more detail by starting 16 parallel tasks that each sum numbers in a loop for a while (see the sketch below).
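
For reference, the reproducer boils down to something like the following; the ./stress binary here is assumed to be a trivial CPU-bound program that sums numbers for the given number of iterations:

for i in $(seq 1 16); do
    ./stress 50000 &     # 16 independent CPU-bound tasks on a 16-core node
done
wait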

  • Since CPU frequency scaling is turned on, it is clear from the reported clock speeds that some cores are underused while others are fully utilized:
# grep MHz /proc/cpuinfo ; top -b -n 1 |grep stress ; ps axur
cpu MHz         : 1200.000
cpu MHz         : 1200.000
cpu MHz         : 2701.000
cpu MHz         : 1200.000
cpu MHz         : 1200.000
cpu MHz         : 1200.000
cpu MHz         : 1200.000
cpu MHz         : 1200.000
cpu MHz         : 1200.000
cpu MHz         : 1200.000
cpu MHz         : 1200.000
cpu MHz         : 2701.000
cpu MHz         : 1200.000
cpu MHz         : 1200.000
cpu MHz         : 1200.000
cpu MHz         : 1200.000
 2986 root      20   0   424  248   88 R 98.7  0.0   0:05.03 stress
 2990 root      20   0   424  248   88 R  7.9  0.0   0:00.34 stress
 2994 root      20   0   424  248   88 R  7.9  0.0   0:00.34 stress
 2987 root      20   0   424  248   88 R  5.9  0.0   0:00.32 stress
 2988 root      20   0   424  244   88 R  5.9  0.0   0:00.33 stress
 2989 root      20   0   424  248   88 R  5.9  0.0   0:00.33 stress
 2991 root      20   0   424  248   88 R  5.9  0.0   0:00.33 stress
 2992 root      20   0   424  248   88 R  5.9  0.0   0:00.33 stress
 2993 root      20   0   424  248   88 R  5.9  0.0   0:00.33 stress
 2995 root      20   0   424  248   88 R  5.9  0.0   0:00.33 stress
 2996 root      20   0   424  248   88 R  5.9  0.0   0:00.33 stress
 2997 root      20   0   424  248   88 R  5.9  0.0   0:00.33 stress
 2998 root      20   0   424  248   88 R  5.9  0.0   0:00.33 stress
 2999 root      20   0   424  248   88 R  5.9  0.0   0:00.32 stress
 3000 root      20   0   424  248   88 R  5.9  0.0   0:00.32 stress
 3001 root      20   0   424  248   88 R  5.9  0.0   0:00.32 stress
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      2986  101  0.0    424   248 pts/0    R    21:25   0:05 ./stress 50000
root      2987  6.6  0.0    424   248 pts/0    R    21:25   0:00 ./stress 50000
root      2988  6.8  0.0    424   244 pts/0    R    21:25   0:00 ./stress 50000
root      2989  6.6  0.0    424   248 pts/0    R    21:25   0:00 ./stress 50000
root      2990  6.8  0.0    424   248 pts/0    R    21:25   0:00 ./stress 50000
root      2991  6.6  0.0    424   248 pts/0    R    21:25   0:00 ./stress 50000
root      2992  6.6  0.0    424   248 pts/0    R    21:25   0:00 ./stress 50000
root      2993  6.6  0.0    424   248 pts/0    R    21:25   0:00 ./stress 50000
root      2994  6.8  0.0    424   248 pts/0    R    21:25   0:00 ./stress 50000
root      2995  6.6  0.0    424   248 pts/0    R    21:25   0:00 ./stress 50000
root      2996  6.6  0.0    424   248 pts/0    R    21:25   0:00 ./stress 50000
root      2997  6.6  0.0    424   248 pts/0    R    21:25   0:00 ./stress 50000
root      2998  6.8  0.0    424   248 pts/0    R    21:25   0:00 ./stress 50000
root      2999  6.4  0.0    424   248 pts/0    R    21:25   0:00 ./stress 50000
root      3000  6.4  0.0    424   248 pts/0    R    21:25   0:00 ./stress 50000
root      3001  6.4  0.0    424   248 pts/0    R    21:25   0:00 ./stress 50000
  • The CPU affinity mask for each process is ffff, i.e. no process is restricted to a subset of the cores.
  • After using 'taskset' to explicitly distribute the processes across all cores, everything runs as expected (see the snippet after the output below):
# grep MHz /proc/cpuinfo ; top -b -n 1 |grep stress ; ps axur
cpu MHz         : 2701.000
cpu MHz         : 2701.000
cpu MHz         : 2701.000
cpu MHz         : 2701.000
cpu MHz         : 2701.000
cpu MHz         : 2701.000
cpu MHz         : 2701.000
cpu MHz         : 2701.000
cpu MHz         : 2701.000
cpu MHz         : 2701.000
cpu MHz         : 2701.000
cpu MHz         : 2701.000
cpu MHz         : 2701.000
cpu MHz         : 2701.000
cpu MHz         : 2701.000
cpu MHz         : 2701.000
 5185 root      20   0   424  248   88 R 100.0  0.0   0:16.62 stress
 5188 root      20   0   424  244   88 R 100.0  0.0   0:16.60 stress
 5191 root      20   0   424  248   88 R 100.0  0.0   0:16.61 stress
 5192 root      20   0   424  248   88 R 100.0  0.0   0:14.98 stress
 5194 root      20   0   424  248   88 R 100.0  0.0   0:16.61 stress
 5197 root      20   0   424  248   88 R 100.0  0.0   0:14.98 stress
 5198 root      20   0   424  248   88 R 100.0  0.0   0:14.97 stress
 5199 root      20   0   424  244   88 R 100.0  0.0   0:16.61 stress
 5186 root      20   0   424  248   88 R 98.6  0.0   0:16.58 stress
 5187 root      20   0   424  244   88 R 98.6  0.0   0:16.62 stress
 5189 root      20   0   424  252   88 R 98.6  0.0   0:16.58 stress
 5190 root      20   0   424  248   88 R 98.6  0.0   0:14.98 stress
 5193 root      20   0   424  248   88 R 98.6  0.0   0:14.96 stress
 5195 root      20   0   424  248   88 R 98.6  0.0   0:14.97 stress
 5196 root      20   0   424  244   88 R 98.6  0.0   0:14.96 stress
 5200 root      20   0   424  244   88 R 98.6  0.0   0:14.96 stress
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      5185 92.5  0.0    424   248 pts/0    R    21:43   0:16 ./stress 100000
root      5186 92.2  0.0    424   248 pts/0    R    21:43   0:16 ./stress 100000
root      5187 92.5  0.0    424   244 pts/0    R    21:43   0:16 ./stress 100000
root      5188 92.3  0.0    424   244 pts/0    R    21:43   0:16 ./stress 100000
root      5189 92.2  0.0    424   252 pts/0    R    21:43   0:16 ./stress 100000
root      5190 83.3  0.0    424   248 pts/0    R    21:43   0:15 ./stress 100000
root      5191 92.4  0.0    424   248 pts/0    R    21:43   0:16 ./stress 100000
root      5192 83.3  0.0    424   248 pts/0    R    21:43   0:15 ./stress 100000
root      5193 83.3  0.0    424   248 pts/0    R    21:43   0:15 ./stress 100000
root      5194 92.4  0.0    424   248 pts/0    R    21:43   0:16 ./stress 100000
root      5195 83.3  0.0    424   248 pts/0    R    21:43   0:15 ./stress 100000
root      5196 83.2  0.0    424   244 pts/0    R    21:43   0:14 ./stress 100000
root      5197 83.3  0.0    424   248 pts/0    R    21:43   0:15 ./stress 100000
root      5198 83.3  0.0    424   248 pts/0    R    21:43   0:15 ./stress 100000
root      5199 92.4  0.0    424   244 pts/0    R    21:43   0:16 ./stress 100000
root      5200 83.2  0.0    424   244 pts/0    R    21:43   0:14 ./stress 100000
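
For completeness, the affinity check and the explicit pinning for the run above were done roughly as follows (assuming a flat numbering of the 16 cores):

# query the affinity mask of one of the tasks; on the affected node it prints ffff
taskset -p 2986

# pin each stress task to its own core, one core per task
CORE=0
for PID in $(pgrep -x stress); do
    taskset -pc $CORE $PID
    CORE=$((CORE + 1))
done
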
  • The problem is that the kernel schedules independent tasks onto the same core, leaving other cores idle. As we see it, this is a bug in the kernel's behavior, since it leads to a drastic degradation of performance, especially in a parallel environment.
  • We used taskset only to show that the cores themselves are capable of running the same tasks at 100% user utilization, in which case the task completion time drops significantly. It is also telling that this behavior usually goes away once we reboot the node.
  • Using taskset is not a real solution, as the nodes for a job are picked automatically by an external batch scheduler.
  • Please confirm whether this is a kernel bug, or whether the system can be configured so that tasks are scheduled in the optimal way without manual intervention.
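  • If it helps the investigation, we can collect the following from an affected node before rebooting it (standard interfaces only, no third-party tools):

uname -r                               # running kernel version
taskset -p <pid>                       # affinity mask of an affected task
grep Cpus_allowed /proc/<pid>/status   # the same information from procfs
cat /proc/schedstat                    # per-CPU scheduler / load-balancing statistics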

Environment

  • Red Hat Enterprise Linux 6.2
