Maximum CPU frequency is not maintained on Intel CPUs

Solution Verified - Updated -

Environment

  • Red Hat Enterprise Linux 7.2 and higher

Issue

With tuned

  • Maximum CPU frequency is not maintained even though the active tuned profile (throughput-performance, latency-performance, etc.) requests maximum-performance CPU frequencies.

Without tuned

  • Maximum CPU frequency is not maintained even though the kernel parameters include intel_idle.max_cstate=0 processor.max_cstate=1 and/or the BIOS is configured to lock the CPU frequencies at their maximum values.

Resolution

  • Disable tuned
  • Add the following to the end of the GRUB command line:

    intel_idle.max_cstate=0 processor.max_cstate=1 intel_pstate=disable
    
  • Starting with RHEL 7.5, another approach is to use intel_pstate in passive mode:

    intel_idle.max_cstate=0 processor.max_cstate=1 intel_pstate=passive
    
    • In passive mode, intel_pstate behaves like a traditional cpufreq driver: it does not implement or force its own internal governors and instead uses the generic cpufreq governors (such as "performance", "ondemand", "powersave").
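
After the reboot, it can be worth confirming that the parameters actually reached the running kernel. The following is a minimal sketch, assuming a POSIX shell; missing_params is a hypothetical helper name introduced here for illustration and is not part of any Red Hat tool. It prints each expected parameter that is absent from the given command-line file.

```shell
#!/bin/sh
# Hypothetical helper: print every expected kernel parameter that is
# absent from a kernel command line.
# $1: file holding the command line (normally /proc/cmdline)
# remaining arguments: parameters that should be present
missing_params() {
    cmdline_file=$1
    shift
    for param in "$@"; do
        # Pad with spaces so that only whole, space-separated tokens match.
        case " $(tr '\n' ' ' < "$cmdline_file") " in
            *" $param "*) ;;
            *) printf '%s\n' "$param" ;;
        esac
    done
}
```

Called as `missing_params /proc/cmdline intel_idle.max_cstate=0 processor.max_cstate=1 intel_pstate=disable`, it prints nothing when all three parameters are active.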

Root Cause

On Nehalem ("no version"), Sandy Bridge (V1) and Ivy Bridge (V2) processors, p-states were managed by a simple algorithm: when an ACPI-based cpufreq driver was in use (rather than intel_pstate), the user could request a value from a preselected list. The list consists of discrete values that the hardware provides to the driver, for instance acpi-cpufreq.

For example, the acpi-cpufreq driver might show frequencies of 2200MHz, 2100MHz, 2000MHz, 1500MHz, 1200MHz, 800MHz.
Each of these specific frequencies can be selected by the user, and the system is guaranteed to respond with that frequency if operating conditions are normal (no overheating, no thermal throttling, no other limitations enforced by BIOS/firmware).

Starting with some models of Sandy Bridge, Ivy Bridge and Haswell (V3), and continuing with Broadwell (V4), p-states are handled by a new driver, intel_pstate. Depending on the setup, BIOS revision, hardware and specific CPU model, an ACPI-based driver is sometimes used by default on these models instead of intel_pstate.
The intel_pstate driver exposes a frequency range rather than a discrete list, and a specific CPU frequency within that range could be requested and received.

For example, the intel_pstate driver might show a maximum frequency of 2200MHz and a minimum frequency of 1200MHz.
Any value in that range could be requested (for example, 1983MHz), and that frequency, or one very close to it and usually not lower than the requested value, would be guaranteed.
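
Which of the two models applies on a given system can usually be read from sysfs. The sketch below assumes the standard cpufreq sysfs files (scaling_driver, scaling_available_frequencies, cpuinfo_min_freq, cpuinfo_max_freq); describe_cpufreq is a hypothetical helper name. It treats the presence of scaling_available_frequencies as the sign of a discrete list, which holds for acpi-cpufreq but is an assumption for other drivers.

```shell
#!/bin/sh
# Hypothetical helper: summarise how one CPU's cpufreq policy exposes
# its frequencies.
# $1: cpufreq directory, e.g. /sys/devices/system/cpu/cpu0/cpufreq
describe_cpufreq() {
    dir=$1
    driver=$(cat "$dir/scaling_driver")
    printf 'driver: %s\n' "$driver"
    if [ -r "$dir/scaling_available_frequencies" ]; then
        # Drivers such as acpi-cpufreq publish a discrete list (values in kHz).
        printf 'discrete kHz: %s\n' "$(cat "$dir/scaling_available_frequencies")"
    else
        # intel_pstate publishes only the limits of a range (values in kHz).
        printf 'range kHz: %s-%s\n' \
            "$(cat "$dir/cpuinfo_min_freq")" "$(cat "$dir/cpuinfo_max_freq")"
    fi
}
```

For example, `describe_cpufreq /sys/devices/system/cpu/cpu0/cpufreq` reports the driver and frequency model for the first CPU.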

Starting with some later models of Broadwell (V4) and Skylake (V5), Intel introduced a mechanism called Hardware P-States (HWP).
The behavior of HWP is similar to that of Haswell (V3), and early Broadwell (V4), except that the returned frequency of the cpu is no longer guaranteed.
This is because HWP uses hardware information (aperf/mperf counters) to determine the "best" frequency for the cpus.
For example, if a user requests 800MHz, the hardware, using system temperature information, the states of other CPUs and other factors, may determine that a better frequency for the workload is really 900MHz and temporarily raise the frequency to that value.
Similarly, the hardware can choose to drop the frequency of a CPU (or several CPUs) in order to prevent overheating or to provide better performance to another CPU core, according to its internal algorithm.

The above HWP model means that at both low end and high end frequencies the frequency returned by the processor is no longer guaranteed to be the requested frequency.
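
To see whether a processor advertises HWP at all, the CPU flags can be inspected. This is a rough sketch, assuming the kernel lists the feature as the "hwp" flag in /proc/cpuinfo; hwp_status is a hypothetical helper name. It only shows what the CPU advertises, not whether the driver is currently using HWP (for example, when intel_pstate=no_hwp is set).

```shell
#!/bin/sh
# Hypothetical helper: report whether the CPU advertises Hardware P-States.
# $1: cpuinfo file (normally /proc/cpuinfo)
hwp_status() {
    # Match "hwp" only as a whole word on a flags line, so that related
    # flags such as hwp_notify or hwp_epp do not count on their own.
    if grep -E '^flags[[:space:]]*:' "$1" | grep -qw 'hwp'; then
        echo 'HWP advertised'
    else
        echo 'HWP not advertised'
    fi
}
```

Typical use is `hwp_status /proc/cpuinfo`.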

Additionally, when HWP is disabled or the kernel command line parameter intel_pstate=no_hwp is used, the system defaults to one of the two intel_pstate governors, "performance" or "powersave". In this scenario, an idle CPU is not necessarily at maximum frequency, because the P-state is not well defined for idle CPUs. For example, with intel_pstate=no_hwp and the "performance" governor set, a system's CPU frequencies might look like this:

    # turbostat -s Busy%,Bzy_MHz,TSC_MHz -q -i 10 -n 1
    Busy% Bzy_MHz TSC_MHz
    0.31  2038  2297
    0.31  2031  2298
    0.31  2041  2298
    0.30  2053  2298
    0.30  2053  2298
    0.32  2030  2298
    0.30  2055  2298
    0.30  2054  2298
    0.30  2060  2298
    0.30  2052  2297
    0.30  2060  2297
    0.30  2053  2297
    0.30  2060  2297
    0.30  2060  2297
    0.30  2061  2297
    0.34  2188  2297
    0.30  2061  2297
    0.35  1992  2297
    0.46  1877  2297
    0.30  2053  2297
    0.30  2055  2297
    0.30  2054  2297
    0.30  2061  2297
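
Output like the above is easier to compare across runs when reduced to a few numbers. The following is a minimal sketch using awk; summarize_bzy_mhz is a hypothetical helper name, and it assumes whitespace-separated columns with a header row containing Bzy_MHz, as in the sample above. Every numeric row is weighted equally.

```shell
#!/bin/sh
# Hypothetical helper: summarise the Bzy_MHz column of turbostat output
# read from stdin. The column is located by its header name, so its
# position may vary.
summarize_bzy_mhz() {
    awk '
        col == 0 {
            # Still looking for the header row: find the Bzy_MHz field.
            for (i = 1; i <= NF; i++) if ($i == "Bzy_MHz") col = i
            next
        }
        $col ~ /^[0-9]+$/ {
            n++; sum += $col
            if (n == 1 || $col < min) min = $col
            if (n == 1 || $col > max) max = $col
        }
        END {
            if (n) printf "rows=%d min=%d max=%d avg=%.0f\n", n, min, max, sum / n
        }
    '
}
```

It can be fed directly from the command shown above, for example `turbostat -s Busy%,Bzy_MHz,TSC_MHz -q -i 10 -n 1 | summarize_bzy_mhz`.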
