RHEL 6.1/6.2 System Experiencing Short, Recoverable Hangs

Solution Verified - Updated -

Environment

  • Red Hat Enterprise Linux 6.1
  • Red Hat Enterprise Linux 6.2
  • Nehalem or Westmere CPUs

Issue

  • System becomes unresponsive for 3 to 90 seconds after upgrading from kernel-2.6.32-71.29.1.el6.x86_64 to kernel-2.6.32-131.0.15.el6.x86_64.
  • Verified that moving back to kernel-2.6.32-71.el6.x86_64 makes the problem go away.
  • The problem also goes away if the system is booted with the intel_idle.max_cstate=0 kernel parameter.

Resolution

  • Upgrade to a kernel that contains the fix, as delivered in the errata. The relevant fixes are listed in the changelog of kernel-2.6.32-220.7.1.el6.

  • Workaround if a kernel upgrade is not possible:

    1. The deep C-states have to be disabled in the BIOS first.

    2. Then one of the two sets of kernel parameters given below has to be used as well. Note that the first set consists of two kernel parameters, both of which have to be given.

      • intel_idle.max_cstate=0 processor.max_cstate=1 <-- most stable, and seen to resolve the issue more widely

      • intel_idle.max_cstate=2 <-- resolves the issue on some systems only, but allows lower power consumption
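
After rebooting, it can be useful to confirm that the parameters actually reached the kernel. The sketch below is one possible check, not an official Red Hat tool: the function name cstate_workaround and its output labels (set1, set2, incomplete, none) are made up for illustration. On RHEL 6 the parameters are normally appended to the kernel line in /boot/grub/grub.conf, and the booted command line can be read from /proc/cmdline.

```shell
#!/bin/sh
# Report which workaround set (if any) appears on a kernel command line.
# Hypothetical helper; the labels it prints are illustrative only.
# Usage: cstate_workaround "<contents of /proc/cmdline>"
cstate_workaround() {
    # Pad with spaces so each parameter can be matched as a whole word.
    cmdline=" $1 "
    case "$cmdline" in
        *" intel_idle.max_cstate=0 "*)
            case "$cmdline" in
                *" processor.max_cstate=1 "*) echo "set1" ;;
                *) echo "incomplete" ;;  # the first set needs both parameters
            esac ;;
        *" intel_idle.max_cstate=2 "*) echo "set2" ;;
        *) echo "none" ;;
    esac
}

# Check the running system when /proc/cmdline is available.
if [ -r /proc/cmdline ]; then
    cstate_workaround "$(cat /proc/cmdline)"
fi
```

A result of "incomplete" means intel_idle.max_cstate=0 was given without processor.max_cstate=1, which does not match either of the two sets described above.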

Root Cause

Some Nehalem and Westmere CPUs have a bug in the C3/C6 state transition that can cause system hangs. The intel_idle module, newly introduced in RHEL 6.1, implements these deeper sleep states, which is why the problem shows up after the upgrade.
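
To see which idle driver is active and which idle states it exposes, the cpuidle entries in sysfs can be inspected. The following is a minimal sketch that assumes the standard cpuidle sysfs layout is present on the system; list_cstates is a hypothetical helper name, and the script prints nothing when the files are absent.

```shell
#!/bin/sh
# List the idle states the kernel exposes for one CPU.
# $1: cpuidle directory (defaults to CPU 0's sysfs directory).
# Hypothetical helper for illustration.
list_cstates() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for state in "$dir"/state*; do
        # Skip entries without a readable name (including an unmatched glob).
        [ -r "$state/name" ] || continue
        printf '%s %s\n' "${state##*/}" "$(cat "$state/name")"
    done
}

# Show the active cpuidle driver when the kernel exposes it.
driver=/sys/devices/system/cpu/cpuidle/current_driver
if [ -r "$driver" ]; then
    echo "cpuidle driver: $(cat "$driver")"
fi
list_cstates
```

If the driver is reported as intel_idle and states deeper than C1 are listed, the deeper sleep states described above are in use.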

Investigation of the system hangs associated with the processor deep C-states identified that the issue was related to the kernel's high-resolution timekeeping code being disrupted by an unstable TSC or delayed timer IPIs, which could be caused by SMIs (system management interrupts) or other BIOS code involved with the processor deep C-states.

The bug fixes improve the accuracy of TSC calibration, which in turn is expected to prevent the reported issues associated with the processor deep C-states. These bug fixes are listed in the changelog of kernel-2.6.32-220.7.1.el6, as shown below.

$ rpm -qp --changelog kernel-2.6.32-220.7.1.el6.x86_64.rpm | grep 772884
....
- [x86] hpet: Disable per-cpu hpet timer if ARAT is supported (Prarit Bhargava) [772884 750201]
- [x86] Improve TSC calibration using a delayed workqueue (Prarit Bhargava) [772884 750201]
- [kernel] clocksource: Add clocksource_register_hz/khz interface (Prarit Bhargava) [772884 750201]
- [kernel] clocksource: Provide a generic mult/shift factor calculation (Prarit Bhargava) [772884 750201]
....
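
The same check can be run against the kernel that is currently booted rather than against an RPM file. This is a sketch under the assumption that the running kernel was installed from an RPM named kernel-$(uname -r); fix_entries is a hypothetical helper, and the bug ID defaults to 772884 as used in the command above.

```shell
#!/bin/sh
# Print changelog entries on stdin that reference a Bugzilla ID
# (default 772884). Exit status is non-zero when nothing matches.
# Hypothetical helper for illustration.
fix_entries() {
    grep -w -- "${1:-772884}"
}

# Query the running kernel's changelog; skipped when rpm is not installed.
if command -v rpm >/dev/null 2>&1; then
    if rpm -q --changelog "kernel-$(uname -r)" 2>/dev/null | fix_entries >/dev/null; then
        echo "running kernel changelog mentions bug 772884"
    else
        echo "running kernel changelog does not mention bug 772884"
    fi
fi
```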

Diagnostic Steps

Check /proc/cpuinfo for model name entries, for example:

model name  : Intel(R) Xeon(R) CPU           X5560  @ 2.80GHz

  • Check if your CPU is on the list of those with the issue:
    • Nehalem CPUs
      • 34xx
      • 35xx
      • 55xx
      • 75xx
    • Westmere CPUs
      • 36xx
      • 56xx
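
The lookup against this list can be scripted. The sketch below is a heuristic only: it assumes the model number appears in the model name as one letter followed by four digits (as in X5560 above), and cpu_family is a hypothetical helper name. A result of "not listed" does not prove that a CPU is unaffected.

```shell
#!/bin/sh
# Classify a /proc/cpuinfo "model name" string against the list above.
# Prints "Nehalem", "Westmere", or "not listed".
# Hypothetical helper; the matching rule is an assumption.
cpu_family() {
    # First whole word shaped like one letter plus four digits, e.g. X5560.
    model=$(printf '%s\n' "$1" | grep -ow '[A-Z][0-9][0-9][0-9][0-9]' | head -n 1)
    case "$model" in
        ?34??|?35??|?55??|?75??) echo "Nehalem" ;;
        ?36??|?56??)             echo "Westmere" ;;
        *)                       echo "not listed" ;;
    esac
}

# Classify the first CPU of the running system when /proc/cpuinfo exists.
if [ -r /proc/cpuinfo ]; then
    cpu_family "$(grep -m 1 '^model name' /proc/cpuinfo)"
fi
```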

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
