Kernel 2.6.32-573.22.1.el6.x86_64 and load averages
Anyone else running into any issues with kernel 2.6.32-573.22.1.el6.x86_64?
We recently upgraded a portion of our lab environment to kernel-2.6.32-573.22.1.el6.x86_64 during our routine patching cycle, and after reboot we are noticing a subset of systems reporting high load averages.
Upon closer inspection, after reverting to a previous kernel we observe load averages in line with historical values. I am aware there was a change to the calculation method for load averages in 2.6.32-573.20.1.el6, but this seems like it should have resolved an issue rather than created a new one.
- Fri Jan 22 2016 Frantisek Hrbata fhrbata@redhat.com [2.6.32-573.20.1.el6]
- [sched] kernel: sched: Fix nohz load accounting -- again (Rafael Aquini) [1300349 1167755]
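As background on what the number represents, here is a minimal Python sketch of the classic fixed-point, exponentially damped update behind the 1-, 5- and 15-minute load averages. The constants are the ones defined in mainline kernel headers; the function names `calc_load` and `simulate` are illustrative, the 5-second sampling interval is simplified, and the patched RHEL 6 scheduler may differ in detail (NO_HZ accounting is exactly what the changelog entries touch). Treat it as an illustration, not as the code that shipped.

```python
# Fixed-point constants as defined in mainline kernel headers.
FSHIFT = 11
FIXED_1 = 1 << FSHIFT   # 1.0 in fixed point (2048)
EXP_1 = 1884            # decay factor for the 1-minute average
EXP_5 = 2014            # decay factor for the 5-minute average
EXP_15 = 2037           # decay factor for the 15-minute average


def calc_load(load, exp, active):
    """One sampling step; load and active are fixed-point integers."""
    load = load * exp + active * (FIXED_1 - exp)
    return load >> FSHIFT


def simulate(n_active, seconds, exp=EXP_1):
    """Hypothetical helper: feed a constant number of active tasks
    (runnable plus uninterruptible) into the average, one sample
    every 5 seconds, starting from a load of zero."""
    load = 0
    for _ in range(seconds // 5):
        load = calc_load(load, exp, n_active * FIXED_1)
    return load / FIXED_1


if __name__ == "__main__":
    # An idle box versus one task counted as active at every sample.
    print(simulate(0, 900), simulate(1, 900))
```

In this model a 1-minute figure close to 1.0 on an otherwise idle machine corresponds to roughly one task being counted as active at every sample, which is a useful yardstick when comparing the two kernels.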
This system has no running user processes and yet reports a higher load average on kernel 2.6.32-573.22.1.el6.x86_64.
with kernel 2.6.32-573.22.1.el6.x86_64
#> uptime
07:44:29 up 15 min, 1 user, load average: 0.98, 0.72, 0.46
#> uname -r
2.6.32-573.22.1.el6.x86_64
with kernel 2.6.32-573.18.1.el6.x86_64
#> uptime
08:02:03 up 15 min, 1 user, load average: 0.00, 0.00, 0.00
#> uname -r
2.6.32-573.18.1.el6.x86_64
About to pull the trigger on rolling our lower environment back to 2.6.32-573.18. Have a support case open already too, but waiting in queue.
TIA for any input
Responses
Interestingly, the changelog entry mentions multiple load average fixes in 2.6.32-573.20.1.el6:
2016-01-22 Frantisek Hrbata <fhrbata@redhat.com> [2.6.32-573.20.1.el6]
- [sched] kernel: sched: Fix nohz load accounting -- again (Rafael Aquini) [1300349 1167755]
- [sched] kernel: sched: Move sched_avg_update to update_cpu_load (Rafael Aquini) [1300349 1167755]
- [sched] kernel: sched: Cure more NO_HZ load average woes (Rafael Aquini) [1300349 1167755]
- [sched] kernel: sched: Cure load average vs NO_HZ woes (Rafael Aquini) [1300349 1167755]
Full changelog is here: https://access.redhat.com/downloads/content/rhel---6/x86_64/168/kernel/2.6.32-573.22.1.el6/x86_64/fd431d51/package-changelog
If the bugs were public they may provide some more details.
Have Red Hat Support provided any more insight?
Does the CPU and memory usage look the same across both kernels? Is it only the load number that has changed?
If you have sar (sysstat) installed, it would be worth comparing before/after to see whether the %idle CPU is similar for both periods (also check other CPU and memory metrics for the period).
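A rough way to do that comparison is to average the %idle column from the text output of `sar -u` for each period. The Python sketch below assumes the usual layout, where each sample line has `all` in the CPU column and %idle as the last field; `mean_idle` is a made-up helper name, and the parsing may need adjusting if your sysstat version prints different columns.

```python
def mean_idle(sar_text):
    """Average the last column (%idle) over the 'all' CPU sample lines
    of `sar -u` text output. Returns None if no samples are found."""
    values = []
    for line in sar_text.splitlines():
        fields = line.split()
        # Skip headers, blank lines and the trailing "Average:" summary.
        if "all" not in fields or line.startswith("Average"):
            continue
        try:
            values.append(float(fields[-1]))
        except ValueError:
            continue  # not a numeric sample line
    return sum(values) / len(values) if values else None
```

Run it once on output captured under each kernel (for example from `sar -u -f` pointed at the relevant daily file) and compare the two results. If %idle is essentially unchanged while the load average differs, that points at the accounting rather than at real work.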
Does it mean that the higher load is correct and the displayed load before the patch was not?
If these are VMs, it would be interesting to see the performance metrics before/after from the hypervisor's perspective. vRealize Operations would be well suited to provide this information if you are on the VMware stack.
Appreciate the updates Will!
Thanks again for the update Will.
It's unfortunate that the bug is private (what happened to 'default to open'?). I am hoping it's because the bug report includes customer information, not because Red Hat don't want to share the discussion.
RE: "It's unfortunate that the bug is private (what happened to 'default to open'?). I am hoping it's because the bug report includes customer information, not because Red Hat don't want to share the discussion."
I've worked in Red Hat support for years and many of us have pushed (for years) to have a "default to open" policy; however, unfortunately we still don't due to the risk of newer (untrained) support engineers posting public BZs with sensitive customer data. :(
Same thing observed here on our servers (approx. one hundred). Some are raising monitoring alerts (we will modify the checks temporarily as a workaround). Hoping this will get corrected in the next update.
Same experience here. After updating to 2.6.32-573.22.1.el6.x86_64, load averages multiply whereas CPU usage (measured from atop and from vCenter) stays the same. Tested with a freshly installed, blank Red Hat 6.7 with no further user processes. Downgrading to the previous kernel fixes the issue. Has anyone opened a call?
I'm having the same issue. The 2.6.32-573.26.1 kernel appears to be marginally better, but I again rolled back to 2.6.32-573.18.1 to work around this.
The interesting thing here is that the number of runnable processes plus processes in uninterruptible sleep (which, by definition, is system load) doesn't add up to the load I'm seeing. This appears to be phantom load. I've got several graphs I can lay side by side for each kernel, and the only thing that is elevated is the system load.
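For anyone who wants to repeat that check, here is a minimal Python 3 sketch that counts tasks in state R (runnable) or D (uninterruptible sleep) from /proc and prints the count next to /proc/loadavg. The helper names are invented for this example, and a single snapshot is only a spot check: the load average is smoothed over time, so sample repeatedly before drawing conclusions.

```python
import glob


def task_state(stat_text):
    """Return the one-letter state from the contents of a /proc stat file."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the LAST closing parenthesis.
    return stat_text.rsplit(")", 1)[1].split()[0]


def count_active(stat_texts):
    """Count tasks that are runnable (R) or in uninterruptible sleep (D)."""
    return sum(1 for text in stat_texts if task_state(text) in ("R", "D"))


def read_all_stats():
    """Read the stat file of every thread currently visible in /proc."""
    texts = []
    for path in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
        try:
            with open(path) as fh:
                texts.append(fh.read())
        except OSError:
            pass  # the task exited between listing and reading
    return texts


def report():
    """Print the kernel's load averages next to the current R+D count."""
    with open("/proc/loadavg") as fh:
        print("loadavg:", fh.read().strip())
    print("R+D tasks right now:", count_active(read_all_stats()))
```

Calling `report()` in a loop every few seconds gives a quick picture. Note that the script itself is running while it samples, so it will normally count itself as one R task.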
We've noticed this behavior in 22.1 and 26.1 as well across all our machines. We've reverted back to 18.1 until Red Hat supplies a new kernel patch that fixes it.
I would like to caution that if you have long running processes on your servers, like databases, and do not want to use the latest kernel, I would go further back than .18.1 because there is a potential kernel panic for those workloads: https://access.redhat.com/solutions/2181181
For more about the changes to the way load is calculated in later RHEL 6 kernels, please see the following Knowledge Base article: https://access.redhat.com/solutions/2253721
Jennifer,
This knowledge base article doesn't explain the changes to how load is calculated; it just summarises that the load number output from the kernel has changed from the prior kernel version.
The changelog suggests these changes were made on the 22nd of January 2016. Is there any indication as to why this has taken 4+ months to resolve? Are you able to advise why the related Bugzilla (linked in the knowledge base article) is private?
We have also downgraded to kernel .18 on a handful of servers. Most servers show the problem, but only these appear to have caused users issues.
Still experiencing this issue in 2.6.32-642.el6. After the kernel upgrade from 2.6.32-573.12.1.el6, servers doing virtually nothing show load spikes to 12. Before, the load peaked at 0.8.
I reported this to RH on March 29, 2016.
In April, the support tech provided us with a kernel which had these changes reverted:
- [sched] kernel: sched: Fix nohz load accounting -- again (Rafael Aquini) [1300349 1167755]
- [sched] kernel: sched: Move sched_avg_update to update_cpu_load (Rafael Aquini) [1300349 1167755]
- [sched] kernel: sched: Cure more NO_HZ load average woes (Rafael Aquini) [1300349 1167755]
We ran with this for a while, but our last quarterly maintenance cycle pulled the newer kernel, 2.6.32-642.1.1.el6.x86_64, and the issue has returned.
I presume that these will be reverted in the pending update?
The new kernel released yesterday includes the load average fix:
After upgrading the kernel, CPU load average increased compared to the prior
kernel version due to the modification of the scheduler. The provided patch set
reverts the calculation algorithm of this load average to the previous
version thus resulting in relatively lower values under the same system load.
(BZ#1343015)
see https://access.redhat.com/solutions/2253721
