Chapter 4. CPU

The term CPU, which stands for central processing unit, is a misnomer for most systems, since central implies single, whereas most modern systems have more than one processing unit, or core. Physically, CPUs are contained in a package attached to a motherboard in a socket. Each socket on the motherboard has various connections: to other CPU sockets, memory controllers, interrupt controllers, and other peripheral devices. To the operating system, a socket is a logical grouping of CPUs and associated resources. This concept is central to most of our discussions on CPU tuning.
Red Hat Enterprise Linux keeps a wealth of statistics about system CPU events; these statistics are useful in planning out a tuning strategy to improve CPU performance. Section 4.1.2, “Tuning CPU Performance” discusses some of the more useful statistics, where to find them, and how to analyze them for performance tuning.

Topology

Older computers had relatively few CPUs per system, which allowed an architecture known as Symmetric Multi-Processor (SMP). This meant that each CPU in the system had similar (or symmetric) access to available memory. In recent years, CPU count-per-socket has grown to the point that trying to give symmetric access to all RAM in the system has become very expensive. Most high CPU count systems these days have an architecture known as Non-Uniform Memory Access (NUMA) instead of SMP.
AMD processors have had this type of architecture for some time with their HyperTransport (HT) interconnects, while Intel has begun implementing NUMA in their QuickPath Interconnect (QPI) designs. NUMA and SMP are tuned differently, since you need to account for the topology of the system when allocating resources for an application.

Threads

Inside the Linux operating system, the unit of execution is known as a thread. Threads have a register context, a stack, and a segment of executable code which they run on a CPU. It is the job of the operating system (OS) to schedule these threads on the available CPUs.
The OS maximizes CPU utilization by load-balancing the threads across available cores. Since the OS is primarily concerned with keeping CPUs busy, it does not always make optimal decisions with respect to application performance. Moving an application thread to a CPU on another socket can worsen performance more than simply waiting for the current CPU to become available, since memory access operations can slow drastically across sockets. For high-performance applications, it is usually better for the designer to determine where threads are placed. Section 4.2, “CPU Scheduling” discusses how to allocate CPUs and memory so that application threads execute as efficiently as possible.

Interrupts

One of the less obvious (but nonetheless important) system events that can impact application performance is the interrupt (also known as an IRQ in Linux). Interrupts are handled by the operating system, and are used by peripherals to signal the arrival of data or the completion of an operation, such as a network write or a timer event.
The manner in which the OS or CPU that is executing application code handles an interrupt does not affect the application's functionality. However, it can impact the performance of the application. This chapter also discusses tips on preventing interrupts from adversely impacting application performance.

4.1. CPU Topology

4.1.1. CPU and NUMA Topology

The first computer processors were uniprocessors, meaning that the system had a single CPU. The illusion of executing processes in parallel was done by the operating system rapidly switching the single CPU from one thread of execution (process) to another. In the quest for increasing system performance, designers noted that increasing the clock rate to execute instructions faster only worked up to a point (usually the limitations on creating a stable clock waveform with the current technology). In an effort to get more overall system performance, designers added another CPU to the system, allowing two parallel streams of execution. This trend of adding processors has continued over time.
Most early multiprocessor systems were designed so that each CPU had the same logical path to each memory location (usually a parallel bus). This let each CPU access any memory location in the same amount of time as any other CPU in the system. This type of architecture is known as a Symmetric Multi-Processor (SMP) system. SMP is fine for a small number of CPUs, but once the CPU count gets above a certain point (8 or 16), the number of parallel traces required to allow equal access to memory uses too much of the available board real estate, leaving less room for peripherals.
Two new concepts combined to allow for a higher number of CPUs in a system:
  1. Serial buses
  2. NUMA topologies
A serial bus is a single-wire communication path with a very high clock rate, which transfers data as packetized bursts. Hardware designers began to use serial buses as high-speed interconnects between CPUs, and between CPUs and memory controllers and other peripherals. This means that instead of requiring between 32 and 64 traces on the board from each CPU to the memory subsystem, there was now one trace, substantially reducing the amount of space required on the board.
At the same time, hardware designers were packing more transistors into the same space by reducing die sizes. Instead of putting individual CPUs directly onto the main board, they started packing them into a processor package as multi-core processors. Then, instead of trying to provide equal access to memory from each processor package, designers resorted to a Non-Uniform Memory Access (NUMA) strategy, where each package/socket combination has one or more dedicated memory areas for high speed access. Each socket also has an interconnect to other sockets for slower access to the other sockets' memory.
As a simple NUMA example, suppose we have a two-socket motherboard, where each socket has been populated with a quad-core package. This means the total number of CPUs in the system is eight; four in each socket. Each socket also has an attached memory bank with four gigabytes of RAM, for a total system memory of eight gigabytes. For the purposes of this example, CPUs 0-3 are in socket 0, and CPUs 4-7 are in socket 1. Each socket in this example also corresponds to a NUMA node.
It might take three clock cycles for CPU 0 to access memory from bank 0: a cycle to present the address to the memory controller, a cycle to set up access to the memory location, and a cycle to read or write to the location. However, it might take six clock cycles for CPU 4 to access memory from the same location; because it is on a separate socket, it must go through two memory controllers: the local memory controller on socket 1, and then the remote memory controller on socket 0. If memory is contested on that location (that is, if more than one CPU is attempting to access the same location simultaneously), memory controllers need to arbitrate and serialize access to the memory, so memory access will take longer. Adding cache consistency (ensuring that local CPU caches contain the same data for the same memory location) complicates the process further.
The latest high-end processors from both Intel (Xeon) and AMD (Opteron) have NUMA topologies. The AMD processors use an interconnect known as HyperTransport™ or HT, while Intel uses one named QuickPath Interconnect™ or QPI. The interconnects differ in how they physically connect to other interconnects, memory, or peripheral devices, but in effect they are a switch that allows transparent access to one connected device from another connected device. In this case, transparent refers to the fact that there is no special programming API required to use the interconnect, not a "no cost" option.
Because system architectures are so diverse, it is impractical to specifically characterize the performance penalty imposed by accessing non-local memory. We can say that each hop across an interconnect imposes a roughly constant performance penalty, so referencing a memory location that is two interconnects away from the current CPU takes at least 2N + memory cycle time units, where N is the penalty per hop.
Given this performance penalty, performance-sensitive applications should avoid regularly accessing remote memory in a NUMA topology system. The application should be set up so that it stays on a particular node and allocates memory from that node.
To do this, there are a few things that applications will need to know:
  1. What is the topology of the system?
  2. Where is the application currently executing?
  3. Where is the closest memory bank?

4.1.2. Tuning CPU Performance

Read this section to understand how to tune for better CPU performance, and for an introduction to several tools that aid in the process.
NUMA was originally used to connect a single processor to multiple memory banks. As CPU manufacturers refined their processes and die sizes shrank, multiple CPU cores could be included in one package. These CPU cores were clustered so that each had equal access time to a local memory bank, and cache could be shared between the cores; however, each 'hop' across an interconnect between core, memory, and cache involves a small performance penalty.
The example system in Figure 4.1, “Local and Remote Memory Access in NUMA Topology” contains two NUMA nodes. Each node has four CPUs, a memory bank, and a memory controller. Any CPU on a node has direct access to the memory bank on that node. Following the arrows on Node 1, the steps are as follows:
  1. A CPU (any of 0-3) presents the memory address to the local memory controller.
  2. The memory controller sets up access to the memory address.
  3. The CPU performs read or write operations on that memory address.
The CPU icon used in this image is part of the Nuvola 1.0 (KDE 3.x icon set), and is held under the LGPL-2.1: http://www.gnu.org/licenses/lgpl-2.1.html

Figure 4.1. Local and Remote Memory Access in NUMA Topology

However, if a CPU on one node needs to access code that resides on the memory bank of a different NUMA node, the path it has to take is less direct:
  1. A CPU (any of 0-3) presents the remote memory address to the local memory controller.
    1. The CPU's request for that remote memory address is passed to a remote memory controller, local to the node containing that memory address.
  2. The remote memory controller sets up access to the remote memory address.
  3. The CPU performs read or write operations on that remote memory address.
Every action needs to pass through multiple memory controllers, so access can take more than twice as long when attempting to access remote memory addresses. The primary performance concern in a multi-core system is therefore to ensure that information travels as efficiently as possible, via the shortest, or fastest, path.
To configure an application for optimal CPU performance, you need to know:
  • the topology of the system (how its components are connected),
  • the core on which the application executes, and
  • the location of the closest memory bank.
Red Hat Enterprise Linux 6 ships with a number of tools to help you find this information and tune your system according to your findings. The following sections give an overview of useful tools for CPU performance tuning.
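As a minimal sketch of how the second and third questions can be answered from the command line, the following function maps a logical CPU number to its NUMA node by looking for the cpuN entry that the kernel places in each node directory under /sys/devices/system/node. The function name node_of_cpu and its optional directory argument are illustrative assumptions, not part of any shipped tool.

```shell
# node_of_cpu CPU [NODE_DIR]
# Print the number of the NUMA node whose directory contains an entry for
# logical CPU number CPU. NODE_DIR defaults to /sys/devices/system/node; the
# argument exists only so the function can be pointed at another directory.
node_of_cpu() {
    local cpu=$1 node_dir=${2:-/sys/devices/system/node} d
    for d in "$node_dir"/node[0-9]*; do
        if [ -e "$d/cpu$cpu" ]; then
            echo "${d##*/node}"
            return 0
        fi
    done
    return 1   # no node directory lists this CPU
}

# Example: report where the current shell last ran. The psr field of ps is
# the processor that the process was last scheduled on.
if command -v ps >/dev/null 2>&1; then
    cpu=$(ps -o psr= -p $$ 2>/dev/null | tr -d ' ') || cpu=""
    if [ -n "$cpu" ] && node=$(node_of_cpu "$cpu"); then
        echo "shell $$ last ran on CPU $cpu (NUMA node $node)"
    fi
fi
```

The memory bank attached to the reported node is the closest one for that CPU. On a system without NUMA node directories the example prints nothing.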

4.1.2.1. Setting CPU Affinity with taskset

taskset retrieves and sets the CPU affinity of a running process (by process ID). It can also be used to launch a process with a given CPU affinity, which binds the specified process to a specified CPU or set of CPUs. However, taskset will not guarantee local memory allocation. If you require the additional performance benefits of local memory allocation, we recommend numactl over taskset; see Section 4.1.2.2, “Controlling NUMA Policy with numactl” for further details.
CPU affinity is represented as a bitmask. The lowest-order bit corresponds to the first logical CPU, and the highest-order bit corresponds to the last logical CPU. These masks are typically given in hexadecimal, so that 0x00000001 represents processor 0, and 0x00000003 represents processors 0 and 1.
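To make the bitmask encoding concrete, here is a small sketch that converts a list of CPU numbers and ranges into the corresponding hexadecimal mask. The function name cpus_to_mask is hypothetical, and because it relies on 64-bit shell arithmetic it should only be trusted for CPUs 0 through 62.

```shell
# cpus_to_mask LIST
# Convert a comma-delimited list of CPU numbers and ranges (for example
# "0,5,7-9") into a hexadecimal affinity mask in which bit N is set when
# CPU N is in the list.
cpus_to_mask() {
    local mask=0 part lo hi cpu
    for part in $(printf '%s' "$1" | tr ',' ' '); do
        lo=${part%-*}    # text before the dash, or the whole item
        hi=${part#*-}    # text after the dash, or the whole item
        cpu=$lo
        while [ "$cpu" -le "$hi" ]; do
            mask=$(( mask | (1 << cpu) ))
            cpu=$(( cpu + 1 ))
        done
    done
    printf '0x%x\n' "$mask"
}

cpus_to_mask 0      # prints 0x1
cpus_to_mask 0,1    # prints 0x3
```

The result can be passed to taskset wherever a mask is expected.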
To set the CPU affinity of a running process, execute the following command, replacing mask with the mask of the processor or processors you want the process bound to, and pid with the process ID of the process whose affinity you wish to change.
# taskset -p mask pid
To launch a process with a given affinity, run the following command, replacing mask with the mask of the processor or processors you want the process bound to, and program with the program, options, and arguments of the program you want to run.
# taskset mask -- program
Instead of specifying the processors as a bitmask, you can also use the -c option to provide a comma-delimited list of separate processors, or a range of processors, like so:
# taskset -c 0,5,7-9 -- myprogram
Further information about taskset is available from the man page: man taskset.
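One way to confirm that an affinity setting took effect is to read it back from the kernel. The sketch below assumes a Linux /proc file system and uses the Cpus_allowed_list field of /proc/PID/status; the helper name allowed_cpus and the sleep placeholder workload are illustrative only.

```shell
# allowed_cpus PID [PROC_DIR]
# Print the CPUs that a process may run on, taken from the
# Cpus_allowed_list field of PROC_DIR/PID/status (PROC_DIR defaults to /proc).
allowed_cpus() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "${2:-/proc}/$1/status"
}

# Demonstration, run only where taskset is installed: start a placeholder
# workload restricted to CPU 0, then read back what the kernel recorded.
if command -v taskset >/dev/null 2>&1; then
    taskset -c 0 -- sleep 3 &
    pid=$!
    sleep 1
    echo "pid $pid may run on: $(allowed_cpus "$pid" 2>/dev/null)"
    kill "$pid" 2>/dev/null || true
fi
```

If CPU 0 is not available to the calling shell (for example, inside a restricted cpuset), taskset reports an error and the list printed is empty.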

4.1.2.2. Controlling NUMA Policy with numactl

numactl runs processes with a specified scheduling or memory placement policy. The selected policy is set for that process and all of its children. numactl can also set a persistent policy for shared memory segments or files, and set the CPU affinity and memory affinity of a process. It uses the /sys file system to determine system topology.
The /sys file system contains information about how CPUs, memory, and peripheral devices are connected via NUMA interconnects. Specifically, the /sys/devices/system/cpu directory contains information about how a system's CPUs are connected to one another. The /sys/devices/system/node directory contains information about the NUMA nodes in the system, and the relative distances between those nodes.
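The node directory described above can be inspected directly. The following sketch prints each node's CPU list and distance vector from the cpulist and distance files that the kernel provides in every nodeN directory; the function name show_numa_topology and its optional directory argument are assumptions made for illustration.

```shell
# show_numa_topology [NODE_DIR]
# For every NUMA node, print the CPUs that belong to it and its distance to
# each node in the system, in node order. NODE_DIR defaults to
# /sys/devices/system/node.
show_numa_topology() {
    local node_dir=${1:-/sys/devices/system/node} d
    for d in "$node_dir"/node[0-9]*; do
        [ -r "$d/cpulist" ] || continue
        printf 'node %s: cpus=%s distances=%s\n' \
            "${d##*/node}" \
            "$(cat "$d/cpulist")" \
            "$(cat "$d/distance" 2>/dev/null)"
    done
}

show_numa_topology
```

In the distance vector, the entry for the node itself is the smallest value, and larger values indicate memory that is further away from that node's CPUs.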
In a NUMA system, the greater the distance between a processor and a memory bank, the slower the processor's access to that memory bank. Performance-sensitive applications should therefore be configured so that they allocate memory from the closest possible memory bank.
Performance-sensitive applications should also be configured to execute on a set number of cores, particularly in the case of multi-threaded applications. Because first-level caches are usually small, if multiple threads execute on one core, each thread will potentially evict cached data accessed by a previous thread. When the operating system attempts to multitask between these threads, and the threads continue to evict each other's cached data, a large percentage of their execution time is spent on cache line replacement. This issue is referred to as cache thrashing. It is therefore recommended to bind a multi-threaded application to a node rather than a single core, since this allows the threads to share cache lines on multiple levels (first-, second-, and last-level cache) and minimizes the need for cache fill operations. However, binding an application to a single core may be performant if all threads are accessing the same cached data.
numactl allows you to bind an application to a particular core or NUMA node, and to allocate the memory associated with a core or set of cores to that application. Some useful options provided by numactl are:
--show
Displays the NUMA policy settings of the current process. This option takes no further parameters, and can be used like so: numactl --show.
--hardware
Displays an inventory of the available nodes on the system.
--membind
Only allocate memory from the specified nodes. When this is in use, allocation will fail if memory on these nodes is insufficient. Usage for this parameter is numactl --membind=nodes program, where nodes is the list of nodes you want to allocate memory from, and program is the program whose memory requirements should be allocated from that node. Node numbers can be given as a comma-delimited list, a range, or a combination of the two. Further details are available on the numactl man page: man numactl.
--cpunodebind
Only execute a command (and its child processes) on CPUs belonging to the specified node(s). Usage for this parameter is numactl --cpunodebind=nodes program, where nodes is the list of nodes to whose CPUs the specified program (program) should be bound. Node numbers can be given as a comma-delimited list, a range, or a combination of the two. Further details are available on the numactl man page: man numactl.
--physcpubind
Only execute a command (and its child processes) on the specified CPUs. Usage for this parameter is numactl --physcpubind=cpu program, where cpu is a comma-delimited list of physical CPU numbers as displayed in the processor fields of /proc/cpuinfo, and program is the program that should execute only on those CPUs. CPUs can also be specified relative to the current cpuset. Refer to the numactl man page for further information: man numactl.
--localalloc
Specifies that memory should always be allocated on the current node.
--preferred
Where possible, memory is allocated on the specified node. If memory cannot be allocated on the node specified, fall back to other nodes. This option takes only a single node number, like so: numactl --preferred=node. Refer to the numactl man page for further information: man numactl.
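As an illustration of combining these options, the sketch below wraps numactl so that a program runs on the CPUs of one node and allocates memory only from that node. The wrapper name run_on_node is hypothetical; the options it passes are the --cpunodebind and --membind options described above.

```shell
# run_on_node NODE PROGRAM [ARGS...]
# Run PROGRAM with its threads restricted to the CPUs of NUMA node NODE and
# its memory allocated only from that node.
run_on_node() {
    local node=$1
    shift
    numactl --cpunodebind="$node" --membind="$node" "$@"
}

# Show the policy of the current shell when numactl is available.
if command -v numactl >/dev/null 2>&1; then
    numactl --show || true
fi
```

For example, run_on_node 0 myprogram keeps myprogram on node 0. Because --membind makes allocation fail when the node runs out of memory, substituting --preferred in the wrapper may be a better fit for workloads that can tolerate some remote allocation.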
The libnuma library included in the numactl package offers a simple programming interface to the NUMA policy supported by the kernel. It is useful for more fine-grained tuning than the numactl utility. Further information is available on the man page: man 7 numa.

4.1.3. Hardware performance policy (x86_energy_perf_policy)

The cpupowerutils package includes x86_energy_perf_policy, a tool that allows administrators to define the relative importance of performance compared to energy efficiency. This information can then be used to influence processors that support this feature when they are selecting options that trade off between performance and energy efficiency. Processor support is indicated by CPUID.06H.ECX.bit3.
x86_energy_perf_policy requires root privileges, and operates on all CPUs by default.
To view the current policy, run the following command:
# x86_energy_perf_policy -r
To set a new policy, run the following command:
# x86_energy_perf_policy profile_name
Replace profile_name with one of the following profiles.
performance
The processor is unwilling to sacrifice any performance for the sake of saving energy. This is the default value.
normal
The processor tolerates minor performance compromises for potentially significant energy savings. This is a reasonable setting for most desktops and servers.
powersave
The processor accepts potentially significant hits to performance in order to maximize energy efficiency.
For further information about this tool, refer to the man page: man x86_energy_perf_policy.

4.1.4. turbostat

The turbostat tool is part of the cpupowerutils package. It reports processor topology, frequency, idle power-state statistics, temperature, and power usage on Intel 64 processors.
Turbostat can help administrators to identify servers that use more power than necessary, do not enter deep sleep states when expected, or are idle enough to consider virtualizing if a platform is readily available (thus allowing the physical server to be decommissioned). It can also help administrators to identify the rate of system management interrupts (SMIs), and any latency-sensitive applications that may be prompting SMIs unnecessarily. Turbostat can also be used in conjunction with the powertop utility to identify services that may be preventing the processor from entering deep sleep states.
Turbostat requires root privileges to run. It also requires processor support for invariant time stamp counters, and APERF and MPERF model-specific registers.
By default, turbostat prints a summary of counter results for the entire system, followed by counter results every 5 seconds, under the following headings:
pkg
The processor package number.
core
The processor core number.
CPU
The Linux CPU (logical processor) number.
%c0
The percentage of the interval for which the CPU retired instructions.
GHz
The average clock speed while the CPU was in the c0 state. When this number is higher than the value in TSC, the CPU is in turbo mode.
TSC
The average clock speed over the course of the entire interval. When this number is lower than the value in GHz, the CPU is in turbo mode.
%c1, %c3, and %c6
The percentage of the interval for which the processor was in the c1, c3, or c6 state, respectively.
%pc3 or %pc6
The percentage of the interval for which the processor was in the pc3 or pc6 state, respectively.
Specify a different period between counter results with the -i option, for example, run turbostat -i 10 to print results every 10 seconds instead.
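The comparison between the GHz and TSC columns can be automated with a short filter. This is a sketch under the assumption that the output begins with a header row naming the columns listed above and that per-CPU rows carry a value in every column; rows with fewer fields, such as the system summary, are skipped. The function name turbo_cpus is hypothetical.

```shell
# turbo_cpus
# Read turbostat output on standard input and print the logical CPU number
# of each row whose GHz value is higher than its TSC value.
turbo_cpus() {
    awk '
        # Header row: remember the position of every named column.
        $0 ~ /GHz/ && $0 ~ /TSC/ {
            for (i = 1; i <= NF; i++) col[$i] = i
            width = NF
            next
        }
        # Rows with a value in every column are treated as per-CPU rows.
        width && NF == width && $(col["GHz"]) + 0 > $(col["TSC"]) + 0 {
            print $(col["CPU"])
        }
    '
}
```

A possible invocation is turbostat -i 10 2>&1 | turbo_cpus; the redirection is included because some versions of turbostat write their results to standard error.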

Note

Upcoming Intel processors may add additional C-states. As of Red Hat Enterprise Linux 6.5, turbostat provides support for the c7, c8, c9, and c10 states.
For more information about turbostat, refer to the man page: man turbostat.

4.1.5. numastat

Important

Previously, the numastat tool was a Perl script written by Andi Kleen. It has been significantly rewritten for Red Hat Enterprise Linux 6.4.
While the default command (numastat, with no options or parameters) maintains strict compatibility with the previous version of the tool, note that supplying options or parameters to this command significantly changes both the output content and its format.
numastat displays memory statistics (such as allocation hits and misses) for processes and the operating system on a per-NUMA-node basis. By default, running numastat displays how many pages of memory are occupied by the following event categories for each node.
Optimal CPU performance is indicated by low numa_miss and numa_foreign values.
This updated version of numastat also shows whether process memory is spread across a system or centralized on specific nodes using numactl.
Cross-reference numastat output with per-CPU top output to verify that process threads are running on the same nodes to which memory is allocated.

Default Tracking Categories

numa_hit
The number of attempted allocations to this node that were successful.
numa_miss
The number of attempted allocations to another node that were allocated on this node because of low memory on the intended node. Each numa_miss event has a corresponding numa_foreign event on another node.
numa_foreign
The number of allocations initially intended for this node that were allocated to another node instead. Each numa_foreign event has a corresponding numa_miss event on another node.
interleave_hit
The number of attempted interleave policy allocations to this node that were successful.
local_node
The number of times a process on this node successfully allocated memory on this node.
other_node
The number of times a process on another node allocated memory on this node.
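To turn these counters into a quick health indicator, the sketch below reads the default numastat table and prints, for each node, the share of its allocations that were originally intended for a different node, calculated as numa_miss / (numa_hit + numa_miss). The function name numa_miss_pct and the choice of ratio are assumptions for illustration, not a metric that numastat itself reports.

```shell
# numa_miss_pct
# Read default numastat output on standard input and print, per node, the
# percentage of allocations placed on that node that were intended for
# another node.
numa_miss_pct() {
    awk '
        # Header row: node0 node1 ...
        $1 ~ /^node[0-9]+$/ { for (i = 1; i <= NF; i++) name[i] = $i; n = NF; next }
        $1 == "numa_hit"    { for (i = 1; i <= n; i++) hit[i]  = $(i + 1) }
        $1 == "numa_miss"   { for (i = 1; i <= n; i++) miss[i] = $(i + 1) }
        END {
            for (i = 1; i <= n; i++) {
                total = hit[i] + miss[i]
                printf "%s %.1f%%\n", name[i], (total ? 100 * miss[i] / total : 0)
            }
        }
    '
}
```

Running numastat | numa_miss_pct on a well-placed workload should show values close to zero. The counters are cumulative since boot, so comparing two samples taken some time apart gives a better picture of current behavior.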
Supplying any of the following options changes the displayed units to megabytes of memory (rounded to two decimal places), and changes other specific numastat behaviors as described below.
-c
Horizontally condenses the displayed table of information. This is useful on systems with a large number of NUMA nodes, but column width and inter-column spacing are somewhat unpredictable. When this option is used, the amount of memory is rounded to the nearest megabyte.
-m
Displays system-wide memory usage information on a per-node basis, similar to the information found in /proc/meminfo.
-n
Displays the same information as the original numastat command (numa_hit, numa_miss, numa_foreign, interleave_hit, local_node, and other_node), with an updated format, using megabytes as the unit of measurement.
-p pattern
Displays per-node memory information for the specified pattern. If the value for pattern consists only of digits, numastat assumes that it is a numerical process identifier. Otherwise, numastat searches process command lines for the specified pattern.
Command line arguments entered after the value of the -p option are assumed to be additional patterns for which to filter. Additional patterns expand, rather than narrow, the filter.
-s
Sorts the displayed data in descending order so that the biggest memory consumers (according to the total column) are listed first.
Optionally, you can specify a node, and the table will be sorted according to the node column. When using this option, the node value must follow the -s option immediately, as shown here:
numastat -s2
Do not include white space between the option and its value.
-v
Displays more verbose information. In particular, when information is reported for multiple processes, detailed information is displayed for each process.
-V
Displays numastat version information.
-z
Omits table rows and columns with only zero values from the displayed information. Note that some near-zero values that are rounded to zero for display purposes will not be omitted from the displayed output.

4.1.6. NUMA Affinity Management Daemon (numad)

numad is an automatic NUMA affinity management daemon. It monitors NUMA topology and resource usage within a system in order to dynamically improve NUMA resource allocation and management (and therefore system performance).
Depending on system workload, numad can provide benchmark performance improvements of up to 50%. To achieve these performance gains, numad periodically accesses information from the /proc file system to monitor available system resources on a per-node basis. The daemon then attempts to place significant processes on NUMA nodes that have sufficient aligned memory and CPU resources for optimum NUMA performance. Current thresholds for process management are at least 50% of one CPU and at least 300 MB of memory. numad attempts to maintain a resource utilization level, and rebalances allocations when necessary by moving processes between NUMA nodes.
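To get a rough idea of which processes might cross these thresholds, the sketch below filters ps output for processes using at least 50% of one CPU and at least 300 MB of resident memory. This is only an approximation: the function name numad_candidates is hypothetical, the %CPU figure from ps is an average over the life of the process, and numad's own accounting may differ.

```shell
# numad_candidates
# Read the output of "ps -eo pid,pcpu,rss,comm" on standard input and print
# the PID and command name of each process using at least 50% of one CPU and
# at least 300 MB of resident memory (ps reports RSS in kilobytes).
numad_candidates() {
    awk 'NR > 1 && $2 >= 50 && $3 >= 300 * 1024 { print $1, $4 }'
}

# Example: list the candidates on the current system, if ps supports -eo.
ps -eo pid,pcpu,rss,comm 2>/dev/null | numad_candidates || true
```

An empty result suggests that no running process is currently large enough for numad to manage.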
numad also provides a pre-placement advice service that can be queried by various job management systems to provide assistance with the initial binding of CPU and memory resources for their processes. This pre-placement advice service is available regardless of whether numad is running as a daemon on the system. Refer to the man page for further details about using the -w option for pre-placement advice: man numad.

4.1.6.1. Benefits of numad

numad primarily benefits systems with long-running processes that consume significant amounts of resources, particularly when these processes are contained in a subset of the total system resources.
numad may also benefit applications that consume multiple NUMA nodes' worth of resources. However, the benefits that numad provides decrease as the percentage of consumed resources on a system increases.
numad is unlikely to improve performance when processes run for only a few minutes, or do not consume many resources. Systems with continuous unpredictable memory access patterns, such as large in-memory databases, are also unlikely to benefit from numad use.

4.1.6.2. Modes of operation

Note

If KSM is in use, change the /sys/kernel/mm/ksm/merge_nodes tunable to 0 to avoid merging pages across NUMA nodes. Kernel memory accounting statistics can eventually contradict each other after large amounts of cross-node merging. As such, numad can become confused after the KSM daemon merges large amounts of memory. If your system has a large amount of free memory, you may achieve higher performance by turning off and disabling the KSM daemon.
numad can be used in two ways:
  • as a service
  • as an executable
4.1.6.2.1. Using numad as a service
While the numad service runs, it will attempt to dynamically tune the system based on its workload.
To start the service, run:
# service numad start
To make the service persist across reboots, run:
# chkconfig numad on
4.1.6.2.2. Using numad as an executable
To use numad as an executable, just run:
# numad
numad will run until it is stopped. While it runs, its activities are logged in /var/log/numad.log.
To restrict numad management to a specific process, start it with the following options.
# numad -S 0 -p pid
-p pid
Adds the specified pid to an explicit inclusion list. The process specified will not be managed until it meets the numad process significance threshold.
-S mode
The -S parameter specifies the type of process scanning. Setting it to 0 as shown limits numad management to explicitly included processes.
To stop numad, run:
# numad -i 0
Stopping numad does not remove the changes it has made to improve NUMA affinity. If system use changes significantly, running numad again will adjust affinity to improve performance under the new conditions.
For further information about available numad options, refer to the numad man page: man numad.

4.1.7. Dynamic Resource Affinity on Power Architecture

On Power Architecture Platform Reference systems that support logical partitions (LPARs), processing may be transparently moved to either unused CPU or memory resources. The most common causes of this are either new resources being added, or existing resources being taken out of service. When this occurs, the new memory or CPU may be in a different NUMA domain and this may result in memory affinity which is not optimal because the Linux kernel is unaware of the change.
When any CPU or memory is transparently moved, firmware generates a Platform Resource Reassignment Notification (PRRN) event to the LPAR. This event is received in the Linux kernel and then passed out to userspace where tools from the powerpc-utils and ppc64-diag packages process the event and update the system with the new CPU or memory affinity information.