Kernel 2.6.32-573.22.1.el6.x86_64 and load averages


Anyone else running into any issues with kernel 2.6.32-573.22.1.el6.x86_64?
We recently upgraded a portion of our lab environment to kernel-2.6.32-573.22.1.el6.x86_64 during our routine patching cycle, and after reboot we are noticing a subset of systems reporting high Load Averages.

Upon closer inspection, reverting to the previous kernel, we observe load averages more in line with historical values. I am aware there was a change to the calculation method for load averages in 2.6.32-573.20.1.el6, but that seems like it should have resolved an issue rather than created a new one.

  • Fri Jan 22 2016 Frantisek Hrbata fhrbata@redhat.com [2.6.32-573.20.1.el6]
  • [sched] kernel: sched: Fix nohz load accounting -- again (Rafael Aquini) [1300349 1167755]

This system has no running user processes and yet reports a higher load average on kernel 2.6.32-573.22.1.el6.x86_64.

with kernel 2.6.32-573.22.1.el6.x86_64
#> uptime
07:44:29 up 15 min, 1 user, load average: 0.98, 0.72, 0.46
#> uname -r
2.6.32-573.22.1.el6.x86_64

with kernel 2.6.32-573.18.1.el6.x86_64
#> uptime
08:02:03 up 15 min, 1 user, load average: 0.00, 0.00, 0.00
#> uname -r
2.6.32-573.18.1.el6.x86_64

About to pull the trigger on rolling our lower environment back to 2.6.32-573.18. Have a support case open already too, but waiting in queue.

TIA for any input

Responses

There is a post similar to this one on the CentOS mailing list:

Kernel 2.6.32-573.22.1.el6.x86_64, higher than usual load

Interestingly, the changelog entry mentions multiple load average fixes in 2.6.32-573.20.1.el6:

2016-01-22 Frantisek Hrbata <fhrbata@redhat.com> [2.6.32-573.20.1.el6]

    - [sched] kernel: sched: Fix nohz load accounting -- again (Rafael Aquini) [1300349 1167755]
    - [sched] kernel: sched: Move sched_avg_update to update_cpu_load (Rafael Aquini) [1300349 1167755]
    - [sched] kernel: sched: Cure more NO_HZ load average woes (Rafael Aquini) [1300349 1167755]
    - [sched] kernel: sched: Cure load average vs NO_HZ woes (Rafael Aquini) [1300349 1167755]

Full changelog is here: https://access.redhat.com/downloads/content/rhel---6/x86_64/168/kernel/2.6.32-573.22.1.el6/x86_64/fd431d51/package-changelog

If the bugs were public, they might provide some more details.

Have Red Hat Support provided any more insight?

Does the CPU and memory usage look the same across both kernels? Is it only the load number that has changed?

If you have sar installed, it would be worth comparing before/after and seeing whether the %idle CPU is similar for both periods (also check the other CPU and memory metrics for the period).
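
For example, assuming the default sysstat history under /var/log/sa (the file suffix is the day of month, so pick one day from before the upgrade and one from after; the day numbers below are only illustrative):

#> sar -u -f /var/log/sa/sa14     # CPU breakdown (%user/%system/%iowait/%idle) for a pre-upgrade day
#> sar -u -f /var/log/sa/sa15     # same metrics for a post-upgrade day
#> sar -q -f /var/log/sa/sa15     # run-queue length and the load averages sar itself recorded
#> sar -r -f /var/log/sa/sa15     # memory utilisation for the same period

If %idle and the run-queue figures line up across the two kernels while only the load averages diverge, that points at the accounting change rather than real load.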

Have provided outputs of 'first60' as well as a whole bunch of sar/top/etc output to Red Hat on Friday and haven't heard back since.

All other metrics appear the same with either kernel. The Load Averages are the only numbers that seem to differ.

On Saturday we rolled back about 170 servers to 2.6.32-573.18.1.el6.x86_64, and any alerts we had for high Load Averages cleared almost as soon as the kernel was restored.

I'm highly suspicious of something that was introduced since 2.6.32-573.18.1.el6.x86_64...
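
For anyone planning the same rollback, one way to pin the older kernel as the boot default on RHEL 6 (a sketch only; it assumes the .18 kernel package is still installed):

#> rpm -q kernel                                                    # confirm 2.6.32-573.18.1.el6 is still present
#> grubby --set-default=/boot/vmlinuz-2.6.32-573.18.1.el6.x86_64    # make it the default grub entry
#> grubby --default-kernel                                          # verify before rebooting
#> reboot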

Does it mean that the higher load is correct and the displayed load before the patch was not?

No, the higher load reported by the newer kernel is not correct. The systems used for sampling (about 10 of 180) all reported Load Averages > 0.5 for 1m, 5m and 15m (1-CPU systems) when there were no user processes or other external forces present. When the previous kernel (.18) was reloaded, load averages near zero were reported for 1m, 5m and 15m. I sent some more empirical data to Support yesterday and they are still chewing on it (it's with Engineering now, I believe).

If these are VMs, it would be interesting to see the performance metrics before/after from the hypervisor's perspective. vRealize Operations would be perfect to provide this information if you are on the VMware stack.

Appreciate the updates Will!

So I had the same thought originally. I quickly ruled out the hypervisor since swapping the kernel had an immediate effect on the Load Averages.
But out of curiosity I just went back to vRealize and vSphere and compared the hypervisor stats to the stats we had gathered in the VM, and at no time did the HV register anything near the level of load that the kernel is reporting under .22.

This further supports my claim, and others', that this kernel has a performance metric reporting issue. Not to be confused with a performance issue.

As of this afternoon Red Hat has opened a Private BZ Bug 1326373 for the issue.

Thanks again for the update Will.

It's unfortunate that the bug is private (what happened to 'default to open'?). I am hoping it's because the bug report includes customer information, not because Red Hat don't want to share the discussion.

RE: "It's unfortunate that the bug is private (what happened to 'default to open'?). I am hoping it's because the bug report includes customer information, not because Red Hat don't want to share the discussion."

I've worked in Red Hat support for years and many of us have pushed (for years) to have a "default to open" policy; however, unfortunately we still don't due to the risk of newer (untrained) support engineers posting public BZs with sensitive customer data. :(

Observing the same on hundreds of servers. Also considering going to the previous kernel version.

Good luck Kristof. I've been fortunate it only impacted a small portion of our LAB (and we spread our lab updates out over 2 Thursdays to mitigate risk). We also had a good rollback plan in place from the beginning, so going back was mostly a non-issue.

Same thing observed here on our servers (approximately one hundred). Some are raising monitoring alerts (we will modify the checks temporarily as a workaround). Hoping this will get corrected in the next update.
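
If your checks happen to be Nagios-style check_load calls, relaxing the thresholds is a one-liner while this is outstanding (the plugin path and numbers below are only placeholders, not a recommendation):

#> /usr/lib64/nagios/plugins/check_load -w 2,1.5,1 -c 4,3,2    # warn/critical thresholds for 1/5/15-minute load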

Same experience here. After updating to 2.6.32-573.22.1.el6.x86_64, load averages multiply whereas CPU usage (measured from atop and from vCenter) stays the same. Tested with a freshly installed, blank Red Hat 6.7, no further user processes. Downgrading to the previous kernel fixes the issue. Has anyone opened a call?

I'm having the same issue. The 2.6.32-573.26.1 kernel appears to be marginally better, but I again rolled back to 2.6.32-573.18.1 to work around this.

The interesting thing here is that the number of runnable processes plus uninterruptible-sleep processes (which, by definition, is the system load) doesn't add up to the load I'm seeing. This appears to be phantom load. I've got several graphs I can lay side by side for each kernel, and the only thing that is elevated is the system load.
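
For what it's worth, a quick way to show the mismatch on an affected box is to compare the instantaneous run-queue with what the kernel is averaging (a rough sketch, nothing system-specific assumed):

#> cat /proc/loadavg                 # 1/5/15-minute averages plus currently-runnable/total task counts
#> ps -eo stat= | grep -c '^[RD]'    # tasks that are runnable (R) or in uninterruptible sleep (D) right now
#> vmstat 1 5                        # the 'r' and 'b' columns should track the same thing over a few seconds

On the affected kernels the reported averages sit well above anything those instantaneous counts would justify, which matches the "phantom load" description above.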

To say that I'm disappointed at this point is a bit of an understatement. I haven't received any knowledgeable response to my support case in weeks now. It's obvious this is impacting large and small installations alike, it's obvious that something is wrong, and Red Hat doesn't seem to have even acknowledged that.

So now we are several kernel versions behind, which I think includes at least one security erratum, and I'm having to take extra steps in reporting and management to ensure that we don't deploy these newer errata and to keep our security folks at bay.

[frustrated]
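
For what it's worth, one way to keep the newer kernels out of routine updates (while still taking everything else) is the versionlock plugin. A sketch only, with the NVR being whatever your last known-good kernel is:

#> yum install yum-plugin-versionlock
#> yum versionlock kernel-2.6.32-573.18.1.el6    # exclude every other kernel version from updates
#> yum versionlock list                          # confirm the lock

An exclude=kernel* line in /etc/yum.conf achieves much the same thing if you'd rather not add the plugin.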

So I did just find this https://access.redhat.com/solutions/2253721

You may wish to follow that for information as it becomes available.

We've noticed this behavior in 22.1 and 26.1 as well, across all our machines. We've reverted to 18.1 until Red Hat supplies a new kernel that fixes it.

I would like to caution that if you have long-running processes on your servers, like databases, and do not want to use the latest kernel, I would go further back than .18.1, because there is a potential kernel panic for those workloads: https://access.redhat.com/solutions/2181181

For more about the changes to the way load is calculated in later RHEL 6 kernels, please see the following Knowledge Base article: https://access.redhat.com/solutions/2253721

Jennifer,

This knowledge base article doesn't explain the changes to how load is calculated; it just summarises that the load number output by the kernel has changed from the prior kernel version.

The changelog suggests these changes were made on the 22nd of January 2016; is there any indication as to why this has taken 4+ months to resolve? Are you able to advise why the related Bugzilla (linked in the knowledge base article) is private?

We have also downgraded to kernel .18 on a handful of servers; most servers show the problem, but only these appear to have caused issues for users.

Still experiencing this issue in 2.6.32-642.el6. After the kernel upgrade from 2.6.32-573.12.1.el6, servers doing virtually nothing show load spikes to 12. Before, the load was at most 0.8.

Heard from support late last week and Red Hat will be removing the patches that appear to be causing this.

When?!?!?!?!?!

June 17 was the last update from Red Hat I received.

Status update:

[Private] Bug-1326373 [sched] Update to 2.6.32-573.22.1 shows a mild increase in load average

The Bug has been moved to "ON_QA" status, which means the bug fix is available for the assigned Quality Engineer to test.

It will soon be fixed with a z-stream errata release.

The problem still seems to exist with 2.6.32-642.1.1.el6.x86_64.

I reported this to RH on March 29, 2016.

In April, the support tech provided us with a kernel which had these changes reverted:

   - [sched] kernel: sched: Fix nohz load accounting -- again (Rafael Aquini) [1300349 1167755]
   - [sched] kernel: sched: Move sched_avg_update to update_cpu_load (Rafael Aquini) [1300349 1167755]
   - [sched] kernel: sched: Cure more NO_HZ load average woes (Rafael Aquini) [1300349 1167755]

We ran with this for a while, but our last quarterly maintenance cycle pulled the newer kernel, 2.6.32-642.1.1.el6.x86_64, and the issue has returned.

I presume that these will be reverted in the pending update?

Due to a few applications crashing on various platforms running the previous kernel, we've had to start rolling kernel-2.6.32-642.1.1.el6.x86_64 out.
In all cases the cores from the crashes pointed to bugs that, according to support, are resolved in later releases of the kernel.

After completing about 200 systems this week, we're seeing the same issues as before. Rather than deal with false alarms, we changed the way we alarm on CPU usage (e.g. reading /proc/stat directly) instead of using Load Averages for the time being. It's a cruddy workaround, but it'll do...
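
In case the same workaround helps anyone else: a minimal sketch of deriving overall CPU busy% from two samples of the aggregate "cpu" line in /proc/stat (field order per proc(5) on 2.6.32 is user nice system idle iowait irq softirq steal guest; the one-second interval is arbitrary):

    read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 _ < /proc/stat     # first sample of the aggregate cpu line
    sleep 1
    read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 _ < /proc/stat     # second sample, one second later
    busy=$(( (u2+n2+s2+q2+sq2+st2) - (u1+n1+s1+q1+sq1+st1) ))
    idle=$(( (i2+w2) - (i1+w1) ))
    echo "cpu busy: $(( 100 * busy / (busy + idle) ))%"

Alarming on that number (or on sar's %idle) sidesteps the load average entirely until the fixed kernel lands.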

New Kernel released yesterday includes load average fix.

After upgrading the kernel, CPU load average increased compared to the prior kernel version due to the modification of the scheduler. The provided patch set reverts the calculation algorithm of this load average to the previous version, thus resulting in relatively lower values under the same system load. (BZ#1343015)

see https://access.redhat.com/solutions/2253721
