Task scheduler won't balance tasks over cores
Issue
- We have an intermittent issue in the cluster where multiple tasks/threads end up assigned to only a subset of the available cores, leaving the others idle. This can slow down computations drastically and degrade an entire parallel job running across a number of nodes.
- Once we notice a suspect node during a job run, we can observe the issue in more detail by running 16 parallel tasks that each sum up numbers in a loop for a while (a sketch of the launch follows the first output block below).
- Since CPU frequency scaling is enabled, it is easy to see that some cores are underused (parked at the minimum frequency) while others are 100% utilized:
# grep MHz /proc/cpuinfo ; top -b -n 1 |grep stress ; ps axur
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 2701.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 2701.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
2986 root 20 0 424 248 88 R 98.7 0.0 0:05.03 stress
2990 root 20 0 424 248 88 R 7.9 0.0 0:00.34 stress
2994 root 20 0 424 248 88 R 7.9 0.0 0:00.34 stress
2987 root 20 0 424 248 88 R 5.9 0.0 0:00.32 stress
2988 root 20 0 424 244 88 R 5.9 0.0 0:00.33 stress
2989 root 20 0 424 248 88 R 5.9 0.0 0:00.33 stress
2991 root 20 0 424 248 88 R 5.9 0.0 0:00.33 stress
2992 root 20 0 424 248 88 R 5.9 0.0 0:00.33 stress
2993 root 20 0 424 248 88 R 5.9 0.0 0:00.33 stress
2995 root 20 0 424 248 88 R 5.9 0.0 0:00.33 stress
2996 root 20 0 424 248 88 R 5.9 0.0 0:00.33 stress
2997 root 20 0 424 248 88 R 5.9 0.0 0:00.33 stress
2998 root 20 0 424 248 88 R 5.9 0.0 0:00.33 stress
2999 root 20 0 424 248 88 R 5.9 0.0 0:00.32 stress
3000 root 20 0 424 248 88 R 5.9 0.0 0:00.32 stress
3001 root 20 0 424 248 88 R 5.9 0.0 0:00.32 stress
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2986 101 0.0 424 248 pts/0 R 21:25 0:05 ./stress 50000
root 2987 6.6 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2988 6.8 0.0 424 244 pts/0 R 21:25 0:00 ./stress 50000
root 2989 6.6 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2990 6.8 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2991 6.6 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2992 6.6 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2993 6.6 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2994 6.8 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2995 6.6 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2996 6.6 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2997 6.6 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2998 6.8 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 2999 6.4 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 3000 6.4 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
root 3001 6.4 0.0 424 248 pts/0 R 21:25 0:00 ./stress 50000
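- For reference, here is a minimal sketch of how the 16 tasks above are launched. ./stress is our own toy binary that sums numbers in a loop (its argument is just the loop bound), and mpstat from the sysstat package is one way to watch the per-core imbalance:
# start 16 CPU-bound tasks and let the scheduler place them freely
for i in $(seq 1 16); do
    ./stress 50000 &
done

# report per-CPU utilization once per second; idle cores show up immediately
mpstat -P ALL 1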
- The CPU affinity mask for each process is ffff, i.e. every process is allowed to run on any of the 16 logical CPUs.
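- This can be verified per PID with taskset (PID taken from the run above):
# print the allowed-CPU mask of one of the stress processes
taskset -p 2986
# illustrative output: pid 2986's current affinity mask: ffff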
- After using 'taskset' to explicitly distribute the processes across all cores, the load balances correctly (a sketch of the pinned launch follows the output below):
# grep MHz /proc/cpuinfo ; top -b -n 1 |grep stress ; ps axur
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
cpu MHz : 2701.000
5185 root 20 0 424 248 88 R 100.0 0.0 0:16.62 stress
5188 root 20 0 424 244 88 R 100.0 0.0 0:16.60 stress
5191 root 20 0 424 248 88 R 100.0 0.0 0:16.61 stress
5192 root 20 0 424 248 88 R 100.0 0.0 0:14.98 stress
5194 root 20 0 424 248 88 R 100.0 0.0 0:16.61 stress
5197 root 20 0 424 248 88 R 100.0 0.0 0:14.98 stress
5198 root 20 0 424 248 88 R 100.0 0.0 0:14.97 stress
5199 root 20 0 424 244 88 R 100.0 0.0 0:16.61 stress
5186 root 20 0 424 248 88 R 98.6 0.0 0:16.58 stress
5187 root 20 0 424 244 88 R 98.6 0.0 0:16.62 stress
5189 root 20 0 424 252 88 R 98.6 0.0 0:16.58 stress
5190 root 20 0 424 248 88 R 98.6 0.0 0:14.98 stress
5193 root 20 0 424 248 88 R 98.6 0.0 0:14.96 stress
5195 root 20 0 424 248 88 R 98.6 0.0 0:14.97 stress
5196 root 20 0 424 244 88 R 98.6 0.0 0:14.96 stress
5200 root 20 0 424 244 88 R 98.6 0.0 0:14.96 stress
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 5185 92.5 0.0 424 248 pts/0 R 21:43 0:16 ./stress 100000
root 5186 92.2 0.0 424 248 pts/0 R 21:43 0:16 ./stress 100000
root 5187 92.5 0.0 424 244 pts/0 R 21:43 0:16 ./stress 100000
root 5188 92.3 0.0 424 244 pts/0 R 21:43 0:16 ./stress 100000
root 5189 92.2 0.0 424 252 pts/0 R 21:43 0:16 ./stress 100000
root 5190 83.3 0.0 424 248 pts/0 R 21:43 0:15 ./stress 100000
root 5191 92.4 0.0 424 248 pts/0 R 21:43 0:16 ./stress 100000
root 5192 83.3 0.0 424 248 pts/0 R 21:43 0:15 ./stress 100000
root 5193 83.3 0.0 424 248 pts/0 R 21:43 0:15 ./stress 100000
root 5194 92.4 0.0 424 248 pts/0 R 21:43 0:16 ./stress 100000
root 5195 83.3 0.0 424 248 pts/0 R 21:43 0:15 ./stress 100000
root 5196 83.2 0.0 424 244 pts/0 R 21:43 0:14 ./stress 100000
root 5197 83.3 0.0 424 248 pts/0 R 21:43 0:15 ./stress 100000
root 5198 83.3 0.0 424 248 pts/0 R 21:43 0:15 ./stress 100000
root 5199 92.4 0.0 424 244 pts/0 R 21:43 0:16 ./stress 100000
root 5200 83.2 0.0 424 244 pts/0 R 21:43 0:14 ./stress 100000
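- For completeness, the pinned run above was launched along these lines, with one process bound to each core:
# bind one stress process to each of the 16 cores explicitly
for cpu in $(seq 0 15); do
    taskset -c "$cpu" ./stress 100000 &
done
wait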
- The problem is that the kernel schedules independent tasks onto the same core, leaving other cores idle. As we see it, this is clearly a bug in the kernel's behavior, since it leads to a drastic degradation of performance, especially in a parallel environment.
- We used taskset only to show that the cores themselves are capable of running the same tasks at 100% user utilization, in which case task completion time drops significantly. It is also telling that this behavior usually goes away once we reboot the node.
- Using taskset is not a real solution, as nodes for a task are picked automatically by an external batch scheduler.
- Please confirm whether this issue is a kernel error, or whether the system can be configured so that tasks are scheduled optimally without manual intervention.
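- For reference, one scheduler tunable present on this kernel line that can bias placement toward packing tasks onto fewer packages is sched_mc_power_savings; checking it is only a hedged guess on our side, not a confirmed cause:
# 0 = balance for performance; non-zero values pack tasks to save power
cat /sys/devices/system/cpu/sched_mc_power_savings
# force plain performance balancing (requires root)
echo 0 > /sys/devices/system/cpu/sched_mc_power_savings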
Environment
- Red Hat Enterprise Linux 6.2
