Task scheduler won't balance tasks over cores
Issue
- We have an intermittent issue in the cluster where multiple tasks/threads are assigned to only a subset of the available cores, leaving the others idle. This can slow down computations drastically and degrade an entire parallel job running across a number of nodes.
- Once we notice a suspect node during a job run, we can observe the issue in more detail by starting 16 parallel tasks, each summing numbers in a loop for a while (a reproducer sketch follows the output below).
- Because CPU frequency scaling is enabled, the per-core clock speeds in /proc/cpuinfo make the imbalance visible: only two cores run at full frequency (2701 MHz) while the rest idle at 1200 MHz, even though all 16 tasks are runnable:
# grep MHz /proc/cpuinfo ; top -b -n 1 |grep stress ; ps axur
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 2701.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 2701.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
2986 root 20 0 424 248 88 R 98.7 0.0 0:05.03 stress
2990 root 20 0 424 248 88 R 7.9 0.0 0:00.34 stress
2994 root 20 0 424 248 88 R 7.9 0.0 0:00.34 stress
2987 root 20 0 424 248 88 R 5.9 0.0 0:00.32 stress
2988 root 20 0 424 244 88 R 5.9 0.0 0:00.33 stress
2989 root 20 0 424 248 88 R 5.9 0.0 0:00.33 stress
2991 root 20 0 424 248 88 R 5.9 0.0 0:00.33 stress
2992 root 20 0 424 248 88 R 5.9 0.0 0:00.33 stress
2993 root 20 0 424 248 88 R 5.9 0.0 0:00.33 stress
2995 root 20 0 424 248 88 R 5.9 0.0 0:00.33 stress
2996 root 20 0 424 248 88 R 5.9 0.0 0:00.33 stress
2997 root 20 0 424 248 88 R 5.9 0.0 0:00.33 stress
2998 root 20 0 424 248 88 R 5.9 0.0 0:00.33 stress
2999 root 20 0 424 248 88 R 5.9 0.0 0:00.32 stress
3000 root 20 0 424 248 88 R 5.9 0.0 0:00.32 stress
3001 root 20 0 424 248 88 R 5.9 0.0 0:00.32 stress
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2986 101 0.0 424 248 pts/0 R 21:25 0:05 ./stress 50000
root 2987 6.6 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2988 6.8 0.0 424 244 pts/0 R 21:25 0:00 ./stress 50000
root 2989 6.6 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2990 6.8 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2991 6.6 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2992 6.6 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2993 6.6 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2994 6.8 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2995 6.6 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2996 6.6 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2997 6.6 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2998 6.8 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2999 6.4 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 3000 6.4 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 3001 6.4 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
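For reference, the 16 tasks above were started with the site's own ./stress summing loop; a minimal sketch of an equivalent launch (the exact wrapper used is not preserved here, and the argument 50000 is the loop count taken from the ps output):
# for i in $(seq 1 16); do ./stress 50000 & done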
- The CPU affinity mask for each process is ffff, i.e. all 16 cores are allowed (a way to verify this is sketched after the output below).
- After using 'taskset' to explicitly distribute the processes across all cores (also sketched after the output), every core runs at full frequency and utilization:
# grep MHz /proc/cpuinfo ; top -b -n 1 |grep stress ; ps axur
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
5185 root 20 0 424 248 88 R 100.0 0.0 0:16.62 stress
5188 root 20 0 424 244 88 R 100.0 0.0 0:16.60 stress
5191 root 20 0 424 248 88 R 100.0 0.0 0:16.61 stress
5192 root 20 0 424 248 88 R 100.0 0.0 0:14.98 stress
5194 root 20 0 424 248 88 R 100.0 0.0 0:16.61 stress
5197 root 20 0 424 248 88 R 100.0 0.0 0:14.98 stress
5198 root 20 0 424 248 88 R 100.0 0.0 0:14.97 stress
5199 root 20 0 424 244 88 R 100.0 0.0 0:16.61 stress
5186 root 20 0 424 248 88 R 98.6 0.0 0:16.58 stress
5187 root 20 0 424 244 88 R 98.6 0.0 0:16.62 stress
5189 root 20 0 424 252 88 R 98.6 0.0 0:16.58 stress
5190 root 20 0 424 248 88 R 98.6 0.0 0:14.98 stress
5193 root 20 0 424 248 88 R 98.6 0.0 0:14.96 stress
5195 root 20 0 424 248 88 R 98.6 0.0 0:14.97 stress
5196 root 20 0 424 244 88 R 98.6 0.0 0:14.96 stress
5200 root 20 0 424 244 88 R 98.6 0.0 0:14.96 stress
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 5185 92.5 0.0 424 248 pts/0 R 21:43 0:16 ./stress 100000
root 5186 92.2 0.0 424 248 pts/0 R 21:43 0:16 ./stress 100000
root 5187 92.5 0.0 424 244 pts/0 R 21:43 0:16 ./stress 100000
root 5188 92.3 0.0 424 244 pts/0 R 21:43 0:16 ./stress 100000
root 5189 92.2 0.0 424 252 pts/0 R 21:43 0:16 ./stress 100000
root 5190 83.3 0.0 424 248 pts/0 R 21:43 0:15 ./stress 100000
root 5191 92.4 0.0 424 248 pts/0 R 21:43 0:16 ./stress 100000
root 5192 83.3 0.0 424 248 pts/0 R 21:43 0:15 ./stress 100000
root 5193 83.3 0.0 424 248 pts/0 R 21:43 0:15 ./stress 100000
root 5194 92.4 0.0 424 248 pts/0 R 21:43 0:16 ./stress 100000
root 5195 83.3 0.0 424 248 pts/0 R 21:43 0:15 ./stress 100000
root 5196 83.2 0.0 424 244 pts/0 R 21:43 0:14 ./stress 100000
root 5197 83.3 0.0 424 248 pts/0 R 21:43 0:15 ./stress 100000
root 5198 83.3 0.0 424 248 pts/0 R 21:43 0:15 ./stress 100000
root 5199 92.4 0.0 424 244 pts/0 R 21:43 0:16 ./stress 100000
root 5200 83.2 0.0 424 244 pts/0 R 21:43 0:14 ./stress 100000
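For completeness, taskset can both read and set affinity; a sketch, where PID 2986 stands in for any of the stress processes above, and core numbering 0-15 is assumed from the 16 entries in /proc/cpuinfo (the exact invocation used in the test is not preserved):
# taskset -p 2986
pid 2986's current affinity mask: ffff
# for i in $(seq 0 15); do taskset -c $i ./stress 100000 & done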
- The problem is that the kernel schedules independent tasks onto the same core while leaving other cores idle. As we see it, this is a bug in kernel behavior, since it leads to a drastic performance degradation, especially in a parallel environment.
- We used taskset only to show that the cores themselves are capable of running the same tasks at 100% user utilization, and that task completion time drops significantly when they do. It is also telling that this behavior usually goes away after a node reboot.
- Using taskset is not a real solution, as nodes for a task are picked automatically by an external batch scheduler.
- Please confirm whether this is a kernel bug, or whether the system can be configured so that tasks are scheduled optimally without manual intervention.
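As a hedged pointer rather than a confirmed root cause for this report: on kernels of this generation (RHEL 6 ships 2.6.32) the scheduler has a multi-core power-savings policy that can deliberately pack runnable tasks onto fewer cores. Values 1 and 2 enable packing; 0 spreads tasks for performance, so it is worth checking that it is disabled:
# cat /sys/devices/system/cpu/sched_mc_power_savings
# echo 0 > /sys/devices/system/cpu/sched_mc_power_savings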
Environment
- Red Hat Enterprise Linux 6.2