Chapter 20. Configuring an operating system to optimize CPU utilization
This section describes how to configure the operating system to optimize CPU utilization across your workloads.
20.1. Tools for monitoring and diagnosing processor issues
The following are the tools available in Red Hat Enterprise Linux 8 to monitor and diagnose processor-related performance issues:
The turbostat tool prints counter results at specified intervals to help administrators identify unexpected behavior in servers, such as excessive power usage, failure to enter deep sleep states, or system management interrupts (SMIs) being created unnecessarily.
The numactl utility provides a number of options to manage processor and memory affinity. The numactl package includes the libnuma library, which offers a simple programming interface to the NUMA policy supported by the kernel and can be used for more fine-grained tuning than the numactl utility.
The numastat tool displays per-NUMA node memory statistics for the operating system and its processes, and shows administrators whether the process memory is spread throughout a system or is centralized on specific nodes. This tool is provided by the numactl package.
numad is an automatic NUMA affinity management daemon. It monitors NUMA topology and resource usage within a system in order to dynamically improve NUMA resource allocation and management.
The /proc/interrupts file displays the interrupt request (IRQ) number, the number of similar interrupt requests handled by each processor in the system, the type of interrupt sent, and a comma-separated list of devices that respond to the listed interrupt request.
The pqos utility is available in the intel-cmt-cat package. It monitors CPU cache and memory bandwidth on recent Intel processors. It monitors:
- The instructions per cycle (IPC).
- The count of last level cache (LLC) misses.
- The size in kilobytes that the program executing in a given CPU occupies in the LLC.
- The bandwidth to local memory (MBL).
- The bandwidth to remote memory (MBR).
The x86_energy_perf_policy tool allows administrators to define the relative importance of performance and energy efficiency. This information can then be used to influence processors that support this feature when they select options that trade off between performance and energy efficiency.
The taskset tool is provided by the util-linux package. It allows administrators to retrieve and set the processor affinity of a running process, or launch a process with a specified processor affinity.
For more information, see the man pages of the tools listed above.
20.2. Determining system topology
In modern computing, the idea of a CPU is a misleading one, as most modern systems have multiple processors. The topology of the system is the way these processors are connected to each other and to other system resources. This can affect system and application performance, and the tuning considerations for a system.
20.2.1. Types of system topology
The following are the two primary types of topology used in modern computing:
- Symmetric Multi-Processor (SMP) topology
SMP topology allows all processors to access memory in the same amount of time. However, because shared and equal memory access inherently forces serialized memory accesses from all the CPUs, SMP system scaling constraints are now generally viewed as unacceptable. For this reason, practically all modern server systems are NUMA machines.
- Non-Uniform Memory Access (NUMA) topology
NUMA topology was developed more recently than SMP topology. In a NUMA system, multiple processors are physically grouped on a socket. Each socket has a dedicated area of memory and a set of processors with local access to that memory; together, these are referred to as a node. Processors on the same node have high-speed access to that node’s memory bank, and slower access to memory banks not on their node.
Therefore, there is a performance penalty when accessing non-local memory. Thus, performance sensitive applications on a system with NUMA topology should access memory that is on the same node as the processor executing the application, and should avoid accessing remote memory wherever possible.
Multi-threaded applications that are sensitive to performance may benefit from being configured to execute on a specific NUMA node rather than a specific processor. Whether this is suitable depends on your system and the requirements of your application. If multiple application threads access the same cached data, then configuring those threads to execute on the same processor may be suitable. However, if multiple threads that access and cache different data execute on the same processor, each thread may evict cached data accessed by a previous thread. This means that each thread 'misses' the cache and wastes execution time fetching data from memory and replacing it in the cache. Use the perf tool to check for an excessive number of cache misses.
20.2.2. Displaying system topologies
Several commands help you understand the topology of a system. This procedure describes how to use them to determine the system topology.
To display an overview of your system topology:
$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 4 8 12 16 20 24 28 32 36
node 0 size: 65415 MB
node 0 free: 43971 MB
[...]
To gather the information about the CPU architecture, such as the number of CPUs, threads, cores, sockets, and NUMA nodes:
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    1
Core(s) per socket:    10
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz
Stepping:              2
CPU MHz:               2394.204
BogoMIPS:              4787.85
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0,4,8,12,16,20,24,28,32,36
NUMA node1 CPU(s):     2,6,10,14,18,22,26,30,34,38
NUMA node2 CPU(s):     1,5,9,13,17,21,25,29,33,37
NUMA node3 CPU(s):     3,7,11,15,19,23,27,31,35,39
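The NUMA node lines in the lscpu output can also be consumed by a script, for example to find which node a given CPU belongs to. The following is a minimal Python sketch under stated assumptions, not part of the original procedure: the function names parse_cpu_list and numa_nodes_from_lscpu are hypothetical, and the sample text is an excerpt of the output shown above.

```python
import re

def parse_cpu_list(text):
    """Expand a CPU list such as "0,4,8" or "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def numa_nodes_from_lscpu(output):
    """Map each NUMA node number to its CPUs, based on lscpu text output."""
    nodes = {}
    for line in output.splitlines():
        # Matches "NUMA node0 CPU(s): ..." but not the "NUMA node(s):" summary line.
        match = re.match(r"\s*NUMA node(\d+) CPU\(s\):\s*(\S+)", line)
        if match:
            nodes[int(match.group(1))] = parse_cpu_list(match.group(2))
    return nodes

# Excerpt of the lscpu output shown above.
sample = """\
NUMA node(s):          4
NUMA node0 CPU(s):     0,4,8,12,16,20,24,28,32,36
NUMA node1 CPU(s):     2,6,10,14,18,22,26,30,34,38
"""

nodes = numa_nodes_from_lscpu(sample)
for node, cpus in sorted(nodes.items()):
    print(f"node {node}: {cpus}")
```

On a live system, the same function could be given the text captured from the lscpu command instead of the embedded sample.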
To view a graphical representation of your system:
# yum install hwloc-gui
# lstopo
Figure 20.1. The lstopo output
To view the detailed textual output:
# yum install hwloc
# lstopo-no-graphics
Machine (15GB)
  Package L#0 + L3 L#0 (8192KB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#4)
  HostBridge L#0
    PCI 8086:5917
      GPU L#0 "renderD128"
      GPU L#1 "controlD64"
      GPU L#2 "card0"
    PCIBridge
      PCI 8086:24fd
        Net L#3 "wlp61s0"
    PCIBridge
      PCI 8086:f1a6
    PCI 8086:15d7
      Net L#4 "enp0s31f6"
For more information, see the man pages of the commands used in this procedure.
20.3. Tuning scheduling policy
In Red Hat Enterprise Linux, the smallest unit of process execution is called a thread. The system scheduler determines which processor runs a thread, and for how long the thread runs. However, because the scheduler’s primary concern is to keep the system busy, it may not schedule threads optimally for application performance.
For example, suppose an application on a NUMA system is running on Node A when a processor on Node B becomes available. To keep the processor on Node B busy, the scheduler moves one of the application’s threads to Node B. However, the thread still requires access to memory on Node A, and this memory now takes longer to access because it is no longer local to the thread. Thus, it may take longer for the thread to finish running on Node B than it would have taken to wait for a processor on Node A to become available, and then to execute the thread on the original node with local memory access.
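One way to prevent this kind of migration is to restrict a process to the processors of a single node. The following is a minimal Python sketch, assuming a Linux system and that the CPUs of the desired node are already known; the helper name pin_to_cpus is hypothetical. It only restricts where the process may run and does not set a memory policy, which is what the numactl utility is for.

```python
import os

def pin_to_cpus(cpus, pid=0):
    """Restrict a process (0 means the calling process) to the given CPUs.

    Returns the affinity set that is in effect after the change.
    """
    os.sched_setaffinity(pid, cpus)
    return os.sched_getaffinity(pid)

# Remember the CPUs this process is currently allowed to run on.
allowed = os.sched_getaffinity(0)
print("allowed before:", sorted(allowed))

# For illustration only: pin to the lowest-numbered allowed CPU.
# In practice you would pass the CPUs of one NUMA node.
target = {min(allowed)}
print("allowed after: ", sorted(pin_to_cpus(target)))

# Restore the original affinity so the rest of the program is unaffected.
os.sched_setaffinity(0, allowed)
```

This is the programmatic counterpart of what the taskset tool does from the command line.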
Performance sensitive applications often benefit from the designer or administrator determining where threads are run. The Linux scheduler implements a number of scheduling policies which determine where and for how long a thread runs. The following are the two major categories of scheduling policies:
- Normal policies: Normal threads are used for tasks of normal priority.
- Realtime policies: Realtime policies are used for time-sensitive tasks that must complete without interruptions. Realtime threads are not subject to time slicing. This means that they run until they block, exit, voluntarily yield, or are preempted by a higher priority thread. The lowest priority realtime thread is scheduled before any thread with a normal policy. For more information, see Section 20.3.1, “Static priority scheduling with SCHED_FIFO” and Section 20.3.2, “Round robin priority scheduling with SCHED_RR”.
20.3.1. Static priority scheduling with SCHED_FIFO
SCHED_FIFO, also called static priority scheduling, is a realtime policy that defines a fixed priority for each thread. This policy allows administrators to improve event response time and reduce latency. It is recommended not to run time-sensitive tasks under this policy for an extended period of time.
When SCHED_FIFO is in use, the scheduler scans the list of all SCHED_FIFO threads in order of priority and schedules the highest priority thread that is ready to run. The priority level of a SCHED_FIFO thread can be any integer from 1 to 99, where 99 is treated as the highest priority. Red Hat recommends starting with a lower number and increasing priority only when you identify latency issues.
Because realtime threads are not subject to time slicing, Red Hat does not recommend setting a priority of 99. A priority of 99 places your process at the same priority level as migration and watchdog threads; if your thread goes into a computational loop, these threads are blocked and cannot run. Systems with a single processor will eventually hang in this situation.
Administrators can limit SCHED_FIFO bandwidth to prevent realtime application programmers from initiating realtime tasks that monopolize the processor.
The following are some of the parameters used in this policy:
/proc/sys/kernel/sched_rt_period_us
This parameter defines the time period, in microseconds, that is considered to be one hundred percent of the processor bandwidth. The default value is 1000000 μs, or 1 second.
/proc/sys/kernel/sched_rt_runtime_us
This parameter defines the time period, in microseconds, that is devoted to running real-time threads. The default value is 950000 μs, or 0.95 seconds.
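The relationship between the two values above is a simple ratio: the runtime divided by the period is the share of processor time that realtime threads may consume. The following is a minimal Python sketch of that arithmetic, using the default values quoted above; the function name realtime_share is hypothetical.

```python
def realtime_share(period_us=1_000_000, runtime_us=950_000):
    """Return the fraction of each period that realtime threads may use."""
    if period_us <= 0:
        raise ValueError("period must be a positive number of microseconds")
    if runtime_us < 0:
        raise ValueError("runtime must not be negative in this sketch")
    # The runtime cannot usefully exceed the period, so cap the share at 100%.
    return min(runtime_us / period_us, 1.0)

share = realtime_share()
print(f"realtime threads: {share:.0%} of each period")
print(f"left for other threads: {1 - share:.0%} of each period")
```

With the defaults, realtime threads can use 95% of each one-second period, which leaves the remaining 5% for threads with a normal policy.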
20.3.2. Round robin priority scheduling with SCHED_RR
SCHED_RR is a round-robin variant of SCHED_FIFO. This policy is useful when multiple threads need to run at the same priority level.
SCHED_RR is a realtime policy that defines a fixed priority for each thread. The scheduler scans the list of all SCHED_RR threads in order of priority and schedules the highest priority thread that is ready to run. However, unlike SCHED_FIFO, threads that have the same priority are scheduled in a round-robin style within a certain time slice.
You can set the value of this time slice in milliseconds with the
sched_rr_timeslice_ms kernel parameter in the
/proc/sys/kernel/sched_rr_timeslice_ms file. The lowest value is
20.3.3. Normal scheduling with SCHED_OTHER
SCHED_OTHER is the default scheduling policy in Red Hat Enterprise Linux 8. This policy uses the Completely Fair Scheduler (CFS) to allow fair processor access to all threads scheduled with this policy. This policy is most useful when there are a large number of threads or when data throughput is a priority, as it allows more efficient scheduling of threads over time.
When this policy is in use, the scheduler creates a dynamic priority list based partly on the niceness value of each process thread. Administrators can change the niceness value of a process, but cannot change the scheduler’s dynamic priority list directly.
20.3.4. Setting scheduler policies
Check and adjust scheduler policies and priorities by using the chrt command line tool. It can start new processes with the desired properties, or change the properties of a running process. It can also be used for setting the policy at runtime.
View the process ID (PID) of the active processes:
# ps
Use the -p option with the ps command to view the details of the particular PID.
Check the scheduling policy, PID, and priority of a particular process:
# chrt -p 468
pid 468's current scheduling policy: SCHED_FIFO
pid 468's current scheduling priority: 85
# chrt -p 476
pid 476's current scheduling policy: SCHED_OTHER
pid 476's current scheduling priority: 0
Here, 468 and 476 are the PIDs of the processes.
Set the scheduling policy of a process:
For example, to set the process with PID 1000 to SCHED_FIFO, with a priority of 50:
# chrt -f -p 50 1000
For example, to set the process with PID 1000 to SCHED_OTHER, with a priority of 0:
# chrt -o -p 0 1000
For example, to set the process with PID 1000 to SCHED_RR, with a priority of 10:
# chrt -r -p 10 1000
To start a new application with a particular policy and priority, specify the name of the application:
# chrt -f 36 /bin/my-app
- For more information on the policy options, see Policy Options for the chrt command.
- For information on setting the policy in a persistent manner, see Section 20.3.6, “Changing the priority of services during the boot process”.
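The information that chrt -p reports can also be read from a program. The following is a minimal Python sketch, assuming a Linux system, that uses the os.sched_getscheduler and os.sched_getparam functions from the Python standard library; the helper name describe_scheduling and the POLICY_NAMES table are hypothetical. Changing the policy to a realtime one normally requires root privileges, so this sketch only reads values.

```python
import os

# Human-readable names for the three policies discussed in this section.
POLICY_NAMES = {
    os.SCHED_OTHER: "SCHED_OTHER",
    os.SCHED_FIFO: "SCHED_FIFO",
    os.SCHED_RR: "SCHED_RR",
}

def describe_scheduling(pid=0):
    """Return (policy name, priority) for a process; 0 means the caller."""
    policy = os.sched_getscheduler(pid)
    priority = os.sched_getparam(pid).sched_priority
    # Policies not covered in this section are reported by their number.
    return POLICY_NAMES.get(policy, f"policy {policy}"), priority

name, priority = describe_scheduling()
print(f"current process: policy {name}, priority {priority}")

# The valid realtime priority range can be queried instead of hard-coded.
low = os.sched_get_priority_min(os.SCHED_FIFO)
high = os.sched_get_priority_max(os.SCHED_FIFO)
print(f"SCHED_FIFO priorities range from {low} to {high}")
```

A process started without chrt is normally reported as SCHED_OTHER with priority 0, matching the chrt -p output for PID 476 above.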
20.3.5. Policy options for the chrt command
To set the scheduling policy of a process, use the appropriate command option:
Table 20.1. Policy Options for the chrt Command
|Short option|Long option|Description|
|-f|--fifo|Set schedule to SCHED_FIFO|
|-o|--other|Set schedule to SCHED_OTHER|
|-r|--rr|Set schedule to SCHED_RR|
20.3.6. Changing the priority of services during the boot process
Using the systemd service, it is possible to set up real-time priorities for services launched during the boot process. The unit configuration directives are used to change the priority of a service during the boot process.
The boot process priority change is done by using the following directives in the service section:
CPUSchedulingPolicy=
Sets the CPU scheduling policy for executed processes. It is used to set policies such as other, fifo, and rr.
CPUSchedulingPriority=
Sets the CPU scheduling priority for executed processes. The available priority range depends on the selected CPU scheduling policy. For real-time scheduling policies, an integer between 1 (lowest priority) and 99 (highest priority) can be used.
The following procedure describes how to change the priority of a service during the boot process, using the mcelog service as an example.
Install the tuned package:
# yum install tuned
Enable and start the tuned service:
# systemctl enable --now tuned
View the scheduling priorities of running threads:
# tuna --show_threads
                      thread       ctxt_switches
    pid SCHED_ rtpri affinity voluntary nonvoluntary             cmd
  1      OTHER     0     0xff      3181          292         systemd
  2      OTHER     0     0xff       254            0        kthreadd
  3      OTHER     0     0xff         2            0          rcu_gp
  4      OTHER     0     0xff         2            0      rcu_par_gp
  6      OTHER     0        0         9            0 kworker/0:0H-kblockd
  7      OTHER     0     0xff      1301            1 kworker/u16:0-events_unbound
  8      OTHER     0     0xff         2            0    mm_percpu_wq
  9      OTHER     0        0       266            0     ksoftirqd/0
[...]
Create a supplementary mcelog service configuration directory and file, and insert the policy name and priority in this file:
# mkdir -p /etc/systemd/system/mcelog.service.d
# cat <<-EOF > /etc/systemd/system/mcelog.service.d/priority.conf
[Service]
CPUSchedulingPolicy=fifo
CPUSchedulingPriority=20
EOF
Reload the systemd scripts configuration:
# systemctl daemon-reload
Restart the mcelog service:
# systemctl restart mcelog
Display the mcelog priority set by systemd:
# tuna -t mcelog -P
                      thread       ctxt_switches
    pid SCHED_ rtpri affinity voluntary nonvoluntary             cmd
  826     FIFO    20  0,1,2,3        13            0          mcelog
For more information, see the man pages of
- For more information about priority range, see Description of the priority range.
20.3.7. Priority map
Priorities are defined in groups, with some groups dedicated to certain kernel functions.
Table 20.2. Description of the priority range
|Priority|Threads|Description|
|1|Low priority kernel threads|This priority is usually reserved for the tasks that need to be just above SCHED_OTHER.|
|2 - 49|Available for use|The range used for typical application priorities.|
|50|Default hard-IRQ value| |
|51 - 98|High priority threads|Use this range for threads that execute periodically and must have quick response times. Do not use this range for CPU-bound threads as you will starve interrupts.|
|99|Watchdogs and migration|System threads that must run at the highest priority.|
20.3.8. cpu-partitioning profile
The cpu-partitioning profile is used to isolate CPUs from system level interruptions. Once you have isolated these CPUs, you can allocate them for specific applications. This is very useful in low-latency environments or in environments where you wish to extract the maximum performance from your hardware.
This profile also lets you designate housekeeping CPUs. A housekeeping CPU is used to run all services, daemons, shell processes, and kernel threads.
You can configure the cpu-partitioning profile in the /etc/tuned/cpu-partitioning-variables.conf file using the following configuration options:
isolated_cores=cpu-list
Lists CPUs to isolate. The list of isolated CPUs is comma-separated, or you can specify a range using a dash, such as 3-5. This option is mandatory. Any CPU missing from this list is automatically considered a housekeeping CPU.
no_balance_cores=cpu-list
Lists CPUs which are not considered by the kernel during system wide process load-balancing. This option is optional. This is usually the same list as isolated_cores.
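Because every CPU that is missing from the isolated list becomes a housekeeping CPU, the housekeeping set can be worked out before applying the profile. The following is a minimal Python sketch of that rule, assuming CPUs are numbered from 0 and the total number of CPUs is known; the function names expand_cpu_list and housekeeping_cpus are hypothetical.

```python
def expand_cpu_list(spec):
    """Expand a list such as "1,3-5" into a set of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

def housekeeping_cpus(isolated_spec, total_cpus):
    """Return the CPUs that are not isolated, in ascending order."""
    return sorted(set(range(total_cpus)) - expand_cpu_list(isolated_spec))

# Isolating CPUs 3-5 on an 8 CPU system leaves five housekeeping CPUs.
print(housekeeping_cpus("3-5", 8))  # [0, 1, 2, 6, 7]
```

Checking the result first helps confirm that at least one CPU remains for services, daemons, shell processes, and kernel threads.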
20.3.9. Additional resources
For more information, see the man pages of
20.4. Configuring kernel tick time
By default, Red Hat Enterprise Linux 8 uses a tickless kernel, which does not interrupt idle CPUs in order to reduce power usage and allow new processors to take advantage of deep sleep states.
Red Hat Enterprise Linux 8 also offers a dynamic tickless option, which is useful for latency-sensitive workloads, such as high performance computing or realtime computing. By default, the dynamic tickless option is disabled. Red Hat recommends using the cpu-partitioning Tuned profile to enable the dynamic tickless option for cores specified as isolated_cores.
This procedure describes how to manually and persistently enable dynamic tickless behavior.
To enable dynamic tickless behavior in certain cores, specify those cores on the kernel command line with the nohz_full parameter. On a 16 core system, append this parameter to the GRUB_CMDLINE_LINUX option in the /etc/default/grub file:
nohz_full=1-15
This enables dynamic tickless behavior on cores 1 through 15, moving all timekeeping to the only unspecified core (core 0).
To persistently enable the dynamic tickless behavior, regenerate the GRUB2 configuration using the edited default file. On systems with BIOS firmware, execute the following command:
# grub2-mkconfig -o /etc/grub2.cfg
On systems with UEFI firmware, execute the following command:
# grub2-mkconfig -o /etc/grub2-efi.cfg
When the system boots, manually move the rcu threads to the non-latency-sensitive core, in this case core 0:
# for i in `pgrep rcu[^c]` ; do taskset -pc 0 $i ; done
Optional: Use the isolcpus parameter on the kernel command line to isolate certain cores from user-space tasks.
Optional: Set the CPU affinity for the kernel’s write-back bdi-flush threads to the housekeeping core:
# echo 1 > /sys/bus/workqueue/devices/writeback/cpumask
Once the system is rebooted, verify that dynticks are enabled:
# journalctl -xe | grep dynticks
Mar 15 18:34:54 rhel-server kernel: NO_HZ: Full dynticks CPUs: 1-15.
Verify that the dynamic tickless configuration is working correctly:
# perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 sleep 3
This command measures ticks on CPU 1 while telling CPU 1 to sleep for 3 seconds.
The default kernel timer configuration shows around 3100 ticks on a regular CPU:
# perf stat -C 0 -e irq_vectors:local_timer_entry taskset -c 0 sleep 3
 Performance counter stats for 'CPU(s) 0':
             3,107      irq_vectors:local_timer_entry
       3.001342790 seconds time elapsed
With the dynamic tickless kernel configured, you should see around 4 ticks instead:
# perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 sleep 3
 Performance counter stats for 'CPU(s) 1':
                 4      irq_vectors:local_timer_entry
       3.001544078 seconds time elapsed
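To compare the two measurements, it can help to convert each tick count into a rate. The following is a minimal Python sketch that extracts the tick count and the elapsed time from perf stat text like the output above and divides one by the other; the function name timer_tick_rate is hypothetical, and the regular expressions assume the output layout shown in this section.

```python
import re

def timer_tick_rate(perf_output):
    """Return local timer ticks per second from perf stat text output."""
    count = re.search(r"([\d,]+)\s+irq_vectors:local_timer_entry", perf_output)
    elapsed = re.search(r"([\d.]+)\s+seconds time elapsed", perf_output)
    if not count or not elapsed:
        raise ValueError("unexpected perf stat output")
    # perf may print thousands separators, so strip commas before converting.
    ticks = int(count.group(1).replace(",", ""))
    return ticks / float(elapsed.group(1))

# The two results shown above, reduced to the lines that matter.
regular = """
 Performance counter stats for 'CPU(s) 0':
             3,107      irq_vectors:local_timer_entry
       3.001342790 seconds time elapsed
"""
tickless = """
 Performance counter stats for 'CPU(s) 1':
                 4      irq_vectors:local_timer_entry
       3.001544078 seconds time elapsed
"""

print(f"regular CPU:  {timer_tick_rate(regular):.1f} ticks per second")
print(f"tickless CPU: {timer_tick_rate(tickless):.1f} ticks per second")
```

The regular CPU works out to roughly a thousand ticks per second, while the tickless CPU receives only about one.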
For more information, see the man pages of
- All about nohz_full kernel parameter Red Hat Knowledgebase article.
- How to verify the list of "isolated" and "nohz_full" CPU information from sysfs? Red Hat Knowledgebase article.
- Setting a Tuned profile.
20.5. Setting interrupt affinity systems
An interrupt request or IRQ is a signal for immediate attention sent from a piece of hardware to a processor. Each device in a system is assigned one or more IRQ numbers which allow it to send unique interrupts. When interrupts are enabled, a processor that receives an interrupt request immediately pauses execution of the current application thread in order to address the interrupt request.
Because interrupts halt normal operation, high interrupt rates can severely degrade system performance. It is possible to reduce the amount of time taken by interrupts by configuring interrupt affinity or by sending a number of lower priority interrupts in a batch (coalescing a number of interrupts).
Interrupt requests have an associated affinity property, smp_affinity, which defines the processors that handle the interrupt request. To improve application performance, assign interrupt affinity and process affinity to the same processor, or processors on the same core. This allows the specified interrupt and application threads to share cache lines.
On systems that support interrupt steering, modifying the smp_affinity property of an interrupt request sets up the hardware so that the decision to service an interrupt with a particular processor is made at the hardware level with no intervention from the kernel.
20.5.1. Balancing interrupts manually
If your BIOS exports its NUMA topology, the
irqbalance service can automatically serve interrupt requests on the node that is local to the hardware requesting service.
- Check which devices correspond to the interrupt requests that you want to configure.
Find the hardware specification for your platform. Check if the chipset on your system supports distributing interrupts.
- If it does, you can configure interrupt delivery as described in the following steps. Additionally, check which algorithm your chipset uses to balance interrupts. Some BIOSes have options to configure interrupt delivery.
- If it does not, your chipset always routes all interrupts to a single, static CPU. You cannot configure which CPU is used.
Check which Advanced Programmable Interrupt Controller (APIC) mode is in use on your system:
$ journalctl --dmesg | grep APIC
- If your system uses a mode other than flat, you can see a line similar to Setting APIC routing to physical flat.
- If you can see no such message, your system uses flat mode.
If your system uses x2apic mode, you can disable it by adding the nox2apic option to the kernel command line.
Only non-physical flat mode (flat) supports distributing interrupts to multiple CPUs. This mode is available only for systems that have up to 8 CPUs.
- Calculate the smp_affinity mask. For more information on how to calculate the smp_affinity mask, see Section 20.5.2, “Setting the smp_affinity mask”.
20.5.2. Setting the smp_affinity mask
The smp_affinity value is stored as a hexadecimal bit mask representing all processors in the system. Each bit configures a different CPU. The least significant bit is CPU 0. The default value of the mask is f, which means that an interrupt request can be handled on any processor in the system. Setting this value to 1 means that only processor 0 can handle the interrupt.
In binary, use the value 1 for CPUs that handle the interrupts. For example, to set CPU 0 and CPU 7 to handle interrupts, use 0000000010000001 as the binary code:
Table 20.3. Binary Bits for CPUs
CPU    15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
Binary  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  1
Convert the binary code to hexadecimal:
For example, to convert the binary code using Python:
>>> hex(int('0000000010000001', 2))
'0x81'
On systems with more than 32 processors, you must delimit the smp_affinity values for discrete 32 bit groups. For example, if you want only the first 32 processors of a 64 processor system to service an interrupt request, use
The interrupt affinity value for a particular interrupt request is stored in the associated /proc/irq/irq_number/smp_affinity file. Set the smp_affinity mask in this file:
# echo mask > /proc/irq/irq_number/smp_affinity
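The mask can also be built directly from a list of CPU numbers instead of writing out the binary code by hand. The following is a minimal Python sketch, with the hypothetical function name smp_affinity_mask. It sets one bit per CPU, with CPU 0 as the least significant bit, and on systems with more than 32 processors it splits the value into comma-separated 32 bit groups. The sketch assumes that the most significant group is written first; check this against the current content of the smp_affinity file on your system before relying on it.

```python
def smp_affinity_mask(cpus, total_cpus):
    """Build a hexadecimal smp_affinity mask string from CPU numbers."""
    cpus = list(cpus)
    if not cpus:
        raise ValueError("at least one CPU is required")
    value = 0
    for cpu in cpus:
        if not 0 <= cpu < total_cpus:
            raise ValueError(f"CPU {cpu} is outside 0-{total_cpus - 1}")
        value |= 1 << cpu  # CPU 0 is the least significant bit
    groups = max(1, (total_cpus + 31) // 32)
    if groups == 1:
        return format(value, "x")
    # Split into 32 bit words, lowest CPUs first, then emit the
    # highest word first (assumed group order, see the note above).
    words = [(value >> (32 * i)) & 0xFFFFFFFF for i in range(groups)]
    return ",".join(f"{word:08x}" for word in reversed(words))

# CPU 0 and CPU 7 on a 16 CPU system, matching the binary example above.
print(smp_affinity_mask([0, 7], 16))  # 81
```

The printed string is the value that would replace mask in the echo command above.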
20.5.3. Additional resources
For more information, see the man pages of