Are hardware power management features causing latency spikes in my application?


Abstract:

Modern CPUs transition aggressively into power-saving states (called C-states). Unfortunately, transitioning from a power-saving state back to the fully powered-up running state takes time, because components must be powered on, caches refilled, and so on, and this can introduce undesired application delays.

Real-time applications can avoid these delays by preventing the system from making C-state transitions. There are two ways to do this:

A system may be prevented from entering power-saving states by booting with the processor.max_cstate=1 command line option. Additionally, the idle=poll option may be added for the fastest time out of the idle state. Unfortunately, both of these options tend to increase power usage significantly.

If more fine-grained control of power-saving states is desired, a latency-sensitive application may use the Power Management Quality of Service (PM QOS) interface, /dev/cpu_dma_latency, to prevent the processor from entering deep sleep states and so avoid the unexpected latencies incurred when exiting them. Opening /dev/cpu_dma_latency and writing a zero to it will prevent transitions to deep sleep states while the file descriptor is held open. Additionally, writing a zero to it emulates the idle=poll behavior.

Details:

Modern processors such as Intel's Nehalem-based processors are very aggressive about transitioning into power-saving states. The various power saving levels are known as C-states and are entered via the operating system's idle routine. C-states are numbered, starting at C0 (every processor component turned on) and moving up where each level has more components of the processor turned off, saving more power. Higher numbered C-states are also known as deeper C-states.
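
To make the numbering concrete, the following is a minimal sketch that lists the idle states the kernel exposes for one CPU, together with the exit latency the kernel reports for each. It assumes a Linux kernel with the cpuidle sysfs directory (/sys/devices/system/cpu/cpuN/cpuidle/stateM/) available; the helper names read_line and print_idle_states are hypothetical and not part of any library. The state names shown there are the kernel's idle states, which correspond to but are not always identical to the hardware C-state numbers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of a small sysfs file into buf; returns 0 on success. */
static int read_line(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */
    return 0;
}

/*
 * Print each idle state of the given CPU with its reported exit latency.
 * Returns the number of states found (0 if cpuidle is not exposed).
 */
int print_idle_states(int cpu)
{
    char path[128], name[64], latency[32];
    int state;

    assert(cpu >= 0);

    for (state = 0; ; state++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/name",
                 cpu, state);
        if (read_line(path, name, sizeof(name)) != 0)
            break;                    /* no more states for this CPU */
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/latency",
                 cpu, state);
        if (read_line(path, latency, sizeof(latency)) != 0)
            strcpy(latency, "?");
        printf("state%d: %-10s exit latency %s us\n", state, name, latency);
    }
    return state;
}
```

Calling print_idle_states(0) from a small main() prints one line per state for CPU 0; a return value of 0 suggests that cpuidle is not exposed on that system.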

Whenever a CPU core is idle, the built-in power-saving logic kicks in and tries to transition the core from the current C-state to a higher C-state, turning off various processor components to save power. When the CPU core is needed to run programs, it is sent an interrupt to wake up and it moves from whatever C-state it is in to C0. Transitioning out of deep C-states back to C0 takes time, because power has to be turned back on to various components of the processor. It also has to be done in an atomic context, so that nothing tries to use the core while it is being powered up.

Periodically on a multicore system, the workloads being run will align so that many of the cores are simultaneously idle and therefore transitioning to deeper C-states. If they all try to wake up at the same time, a large number of Inter-Processor Interrupts (IPIs) can be generated while the cores are powered up out of deep sleep states. Due to locking that is required while processing interrupts, the system can effectively stall for quite some time while handling all the interrupts, which can cause large delays in application response to events (latencies).

Latency sensitive applications do not want the processor to transition into deeper C-states, due to the delays induced by coming out of the C-states back to C0. These delays can range from hundreds of microseconds to milliseconds. There are two main ways to prevent the system from transitioning to deeper C-states.

The first is a big hammer approach. By booting with the kernel command line arguments processor.max_cstate=1 and idle=poll, the system will never enter a C-state other than zero and will not even use the MWAIT mechanism to temporarily halt in the idle routine. While this will provide the fastest scheduler response time, it is very wasteful of power and will generate heat that requires the air conditioning system to run (and waste more power). This method is not recommended for general use, but can be quite handy for testing whether your application is being affected by C-states and idle transition latency. Booting your kernel and adding the command line options:

processor.max_cstate=1 idle=poll

can be a quick way to see if your application's response time smooths out.
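
One rough way to compare response time before and after such a change is to measure how late a short sleep wakes up. The sketch below is an illustration only, not a replacement for a proper latency benchmark: the function names ts_diff_ns and max_wakeup_overshoot_ns are hypothetical, and the approach assumes that the overshoot of clock_nanosleep() on CLOCK_MONOTONIC is an acceptable proxy for your application's wakeup latency.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Nanoseconds elapsed from a to b. */
static int64_t ts_diff_ns(const struct timespec *a, const struct timespec *b)
{
    return (int64_t)(b->tv_sec - a->tv_sec) * 1000000000LL
         + (b->tv_nsec - a->tv_nsec);
}

/*
 * Sleep for sleep_us microseconds, `iterations` times, and return the worst
 * overshoot (actual minus requested sleep time) in nanoseconds.
 * Returns -1 if a clock or sleep call fails.
 */
int64_t max_wakeup_overshoot_ns(int iterations, long sleep_us)
{
    struct timespec req = { sleep_us / 1000000, (sleep_us % 1000000) * 1000 };
    int64_t requested_ns = (int64_t)sleep_us * 1000;
    int64_t worst = 0;

    assert(iterations > 0 && sleep_us >= 0);

    for (int i = 0; i < iterations; i++) {
        struct timespec before, after;
        int64_t overshoot;

        if (clock_gettime(CLOCK_MONOTONIC, &before) != 0)
            return -1;
        /* The core may drop into an idle state while we sleep. */
        if (clock_nanosleep(CLOCK_MONOTONIC, 0, &req, NULL) != 0)
            return -1;
        if (clock_gettime(CLOCK_MONOTONIC, &after) != 0)
            return -1;

        overshoot = ts_diff_ns(&before, &after) - requested_ns;
        if (overshoot > worst)
            worst = overshoot;
    }
    return worst;
}
```

For instance, calling max_wakeup_overshoot_ns(10000, 500) on an otherwise idle machine, once with the default boot options and once with the options above, gives two worst-case figures to compare. Scheduler and timer effects also contribute to the result, so treat the numbers as indicative only.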

The second method, which has a bit more fine-grained control of the power management features, is to use the Power Management Quality of Service interface (PM QOS). The file /dev/cpu_dma_latency is the interface which, when opened, registers a quality-of-service request for latency with the operating system. A program should open /dev/cpu_dma_latency, write a 32-bit number to it representing a maximum response time in microseconds, and then keep the file descriptor open while low-latency operation is desired. Writing a zero means that you want the fastest response time possible.

For example:


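#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int pm_qos_fd = -1;

/* Request zero wakeup latency; the request holds while the fd stays open. */
void start_low_latency(void)
{
    int32_t target = 0;

    if (pm_qos_fd >= 0)
        return;
    pm_qos_fd = open("/dev/cpu_dma_latency", O_RDWR);
    if (pm_qos_fd < 0) {
        fprintf(stderr, "Failed to open PM QOS file: %s\n", strerror(errno));
        exit(errno);
    }
    if (write(pm_qos_fd, &target, sizeof(target)) != sizeof(target))
        fprintf(stderr, "Failed to write PM QOS target: %s\n", strerror(errno));
}

/* Closing the fd drops the request and restores normal power management. */
void stop_low_latency(void)
{
    if (pm_qos_fd >= 0) {
        close(pm_qos_fd);
        pm_qos_fd = -1;   /* allow start_low_latency() to be called again */
    }
}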

Using the above routines, the application would first call start_low_latency(), then do the required latency-sensitive processing, then call stop_low_latency(). Because opening and closing files is a high-overhead operation, the processing time between calling the start and stop routines should be fairly large. That is, call the start routine, do a lot of processing, then call the stop routine.
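
Because the value written is a maximum response time in microseconds, an application that can tolerate a small wakeup delay does not have to request zero. The following sketch is a variation on the routines above that takes the target as a parameter. The names pm_qos_request and pm_qos_release are hypothetical, and the path is passed in as an argument (normally "/dev/cpu_dma_latency") only so that the routine can be exercised against an ordinary file. Which C-states a given non-zero value still permits depends on the exit latencies the kernel knows for the hardware, so treat any specific number as something to verify on your own system.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Register a latency request by writing max_latency_us, as a 32-bit value,
 * to the PM QOS file at `path`. Returns the open descriptor, which the
 * caller must keep open for as long as the request should hold, or -1 on
 * failure.
 */
int pm_qos_request(const char *path, int32_t max_latency_us)
{
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    if (write(fd, &max_latency_us, sizeof(max_latency_us))
            != (ssize_t)sizeof(max_latency_us)) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Drop the request by closing the descriptor. */
void pm_qos_release(int fd)
{
    if (fd >= 0)
        close(fd);
}
```

A caller might use fd = pm_qos_request("/dev/cpu_dma_latency", 20) before a latency-sensitive phase and pm_qos_release(fd) afterwards. Opening the device typically requires root privileges or suitable permissions on the file.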

Summary:

Modern processors' power management features can cause unwanted delays in time-sensitive application processing by transitioning your processors into power-saving C-states.

To see if your application is being affected by power management transitions, boot your kernel with the command line options:

processor.max_cstate=1 idle=poll

This will disable power management in the system (but will also consume more power).

For more fine-grained control of when power management is turned off, use the PM QOS interface in your application to tell the kernel when to disable power-saving state transitions.
