6.3. Configuration Suggestions

Red Hat Enterprise Linux provides a number of tools to assist administrators in configuring the system. This section outlines the available tools and provides examples of how they can be used to solve processor-related performance problems in Red Hat Enterprise Linux 7.

6.3.1. Configuring Kernel Tick Time

By default, Red Hat Enterprise Linux 7 uses a tickless kernel, which does not interrupt idle CPUs in order to reduce power usage and allow newer processors to take advantage of deep sleep states.
Red Hat Enterprise Linux 7 also offers a dynamic tickless option (disabled by default), which is useful for very latency-sensitive workloads, such as high performance computing or realtime computing.
To enable dynamic tickless behavior on certain cores, specify those cores on the kernel command line with the nohz_full parameter. On a 16-core system, specifying nohz_full=1-15 enables dynamic tickless behavior on cores 1 through 15, moving all timekeeping to the only unspecified core (core 0). This behavior can be enabled either temporarily at boot time, or persistently via the GRUB_CMDLINE_LINUX option in the /etc/default/grub file. For persistent behavior, run the grub2-mkconfig -o /boot/grub2/grub.cfg command to save your configuration.
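For example, to make the nohz_full=1-15 setting from the example above persistent, append it to the GRUB_CMDLINE_LINUX line in /etc/default/grub and regenerate the GRUB configuration. This is only a sketch; the existing options on that line vary between systems and are shown here as a placeholder:
GRUB_CMDLINE_LINUX="<existing options> nohz_full=1-15"
# grub2-mkconfig -o /boot/grub2/grub.cfg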
Enabling dynamic tickless behavior does require some manual administration.
  • When the system boots, you must manually move rcu threads to the non-latency-sensitive core, in this case core 0.
    # for i in `pgrep "rcu[^c]"` ; do taskset -pc 0 $i ; done
  • Use the isolcpus parameter on the kernel command line to isolate certain cores from user-space tasks.
  • Optionally, set CPU affinity for the kernel's write-back bdi-flush threads to the housekeeping core:
    # echo 1 > /sys/bus/workqueue/devices/writeback/cpumask
Verify that the dynamic tickless configuration is working correctly by executing the following command, where stress is a program that spins on the CPU for 1 second.
# perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 stress -t 1 -c 1
One possible replacement for stress is a script that runs something like while :; do d=1; done.
The default kernel timer configuration shows 1000 ticks on a busy CPU:
# perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 stress -t 1 -c 1
1000 irq_vectors:local_timer_entry
With the dynamic tickless kernel configured, you should see 1 tick instead:
# perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 stress -t 1 -c 1
1 irq_vectors:local_timer_entry

6.3.2. Setting Hardware Performance Policy (x86_energy_perf_policy)

The x86_energy_perf_policy tool allows administrators to define the relative importance of performance and energy efficiency. This information can then be used to influence processors that support this feature when they select options that trade off between performance and energy efficiency.
By default, it operates on all processors in performance mode. It requires processor support, which is indicated by the presence of CPUID.06H.ECX.bit3, and must be run with root privileges.
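As a brief sketch (the exact options available depend on the version of the kernel-tools package), the tool accepts a policy name, such as performance, normal, or powersave, and applies it to all processors:
# x86_energy_perf_policy normal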
x86_energy_perf_policy is provided by the kernel-tools package. For details of how to use x86_energy_perf_policy, see Section A.9, “x86_energy_perf_policy” or refer to the man page:
$ man x86_energy_perf_policy

6.3.3. Setting Process Affinity with taskset

The taskset tool is provided by the util-linux package. Taskset allows administrators to retrieve and set the processor affinity of a running process, or launch a process with a specified processor affinity.
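For example, assuming a hypothetical application my_app and process ID 1234, the following commands launch a process on CPUs 0 and 1, change the affinity of a running process to CPUs 0 through 3, and display the current affinity of that process:
# taskset -c 0,1 my_app
# taskset -pc 0-3 1234
# taskset -p 1234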

Important

taskset does not guarantee local memory allocation. If you require the additional performance benefits of local memory allocation, Red Hat recommends using numactl instead of taskset.
For more information about taskset, see Section A.15, “taskset” or the man page:
$ man taskset

6.3.4. Managing NUMA Affinity with numactl

Administrators can use numactl to run a process with a specified scheduling or memory placement policy. Numactl can also set a persistent policy for shared memory segments or files, and set the processor affinity and memory affinity of a process.
In a system with NUMA topology, a processor's memory access slows as the distance between the processor and the memory bank increases. Therefore, it is important to configure applications that are sensitive to performance so that they allocate memory from the closest possible memory bank. It is best to use memory and CPUs that are in the same NUMA node.
Multi-threaded applications that are sensitive to performance may benefit from being configured to execute on a specific NUMA node rather than a specific processor. Whether this is suitable depends on your system and the requirements of your application. If multiple application threads access the same cached data, then configuring those threads to execute on the same processor may be suitable. However, if multiple threads that access and cache different data execute on the same processor, each thread may evict cached data accessed by a previous thread. This means that each thread 'misses' the cache, and wastes execution time fetching data from memory and replacing it in the cache. You can use the perf tool, as documented in Section A.6, “perf”, to check for an excessive number of cache misses.
Numactl provides a number of options to assist you in managing processor and memory affinity. See Section A.11, “numastat” or the man page for details:
$ man numactl
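For example, assuming a hypothetical application my_app, the following commands run it with both its CPUs and its memory restricted to NUMA node 0, and display the NUMA topology of the system (node numbers depend on your hardware):
# numactl --cpunodebind=0 --membind=0 my_app
# numactl --hardware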

Note

The numactl package includes the libnuma library. This library offers a simple programming interface to the NUMA policy supported by the kernel, and can be used for more fine-grained tuning than the numactl application. For more information, see the man page:
$ man numa

6.3.5. Automatic NUMA Affinity Management with numad

numad is an automatic NUMA affinity management daemon. It monitors NUMA topology and resource usage within a system in order to dynamically improve NUMA resource allocation and management.
numad also provides a pre-placement advice service that can be queried by various job management systems to provide assistance with the initial binding of CPU and memory resources for their processes. This pre-placement advice is available regardless of whether numad is running as an executable or a service.
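As a brief sketch, numad can be started as a service, or run as an executable and told to manage a specific process (the process ID below is a placeholder; -S 0 restricts scanning to explicitly included processes, and -p adds the given PID to that inclusion list):
# systemctl start numad
# numad -S 0 -p 1234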
For details of how to use numad, see Section A.13, “numad” or refer to the man page:
$ man numad

6.3.6. Tuning Scheduling Policy

The Linux scheduler implements a number of scheduling policies, which determine where and for how long a thread runs. There are two major categories of scheduling policies: normal policies and realtime policies. Normal policies are used for tasks of normal priority. Realtime policies are used for time-sensitive tasks that must complete without interruptions.
Realtime threads are not subject to time slicing. This means they will run until they block, exit, voluntarily yield, or are pre-empted by a higher priority thread. The lowest priority realtime thread is scheduled before any thread with a normal policy.

6.3.6.1. Scheduling Policies

6.3.6.1.1. Static Priority Scheduling with SCHED_FIFO
SCHED_FIFO (also called static priority scheduling) is a realtime policy that defines a fixed priority for each thread. This policy allows administrators to improve event response time and reduce latency, and is recommended for time-sensitive tasks that do not run for an extended period of time.
When SCHED_FIFO is in use, the scheduler scans the list of all SCHED_FIFO threads in priority order and schedules the highest priority thread that is ready to run. The priority level of a SCHED_FIFO thread can be any integer from 1 to 99, with 99 treated as the highest priority. Red Hat recommends starting at a low number and increasing priority only when you identify latency issues.

Warning

Because realtime threads are not subject to time slicing, Red Hat does not recommend setting a priority of 99. This places your process at the same priority level as migration and watchdog threads; if your thread goes into a computational loop and these threads are blocked, they will not be able to run. Systems with a single processor will eventually hang in this situation.
Administrators can limit SCHED_FIFO bandwidth to prevent realtime application programmers from initiating realtime tasks that monopolize the processor.
/proc/sys/kernel/sched_rt_period_us
This parameter defines the time period in microseconds that is considered to be one hundred percent of processor bandwidth. The default value is 1000000 μs, or 1 second.
/proc/sys/kernel/sched_rt_runtime_us
This parameter defines the time period in microseconds that is devoted to running realtime threads. The default value is 950000 μs, or 0.95 seconds.
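For example, the current bandwidth values can be read directly from these files, and a process (my_app is a placeholder) can be started under SCHED_FIFO at a low priority with the chrt utility from the util-linux package:
# cat /proc/sys/kernel/sched_rt_period_us
1000000
# cat /proc/sys/kernel/sched_rt_runtime_us
950000
# chrt -f 10 my_app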
6.3.6.1.2. Round Robin Priority Scheduling with SCHED_RR
SCHED_RR is a round-robin variant of SCHED_FIFO. This policy is useful when multiple threads need to run at the same priority level.
Like SCHED_FIFO, SCHED_RR is a realtime policy that defines a fixed priority for each thread. The scheduler scans the list of all SCHED_RR threads in priority order and schedules the highest priority thread that is ready to run. However, unlike SCHED_FIFO, threads that have the same priority are scheduled round-robin style within a certain time slice.
You can set the value of this time slice in milliseconds with the sched_rr_timeslice_ms kernel parameter (/proc/sys/kernel/sched_rr_timeslice_ms). The lowest value is 1 millisecond.
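Similarly, a process (my_app is a placeholder) can be started under SCHED_RR with chrt, and the current time slice inspected:
# chrt -r 10 my_app
# cat /proc/sys/kernel/sched_rr_timeslice_ms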
6.3.6.1.3. Normal Scheduling with SCHED_OTHER
SCHED_OTHER is the default scheduling policy in Red Hat Enterprise Linux 7. This policy uses the Completely Fair Scheduler (CFS) to allow fair processor access to all threads scheduled with this policy. This policy is most useful when there are a large number of threads or data throughput is a priority, as it allows more efficient scheduling of threads over time.
When this policy is in use, the scheduler creates a dynamic priority list based partly on the niceness value of each process thread. Administrators can change the niceness value of a process, but cannot change the scheduler's dynamic priority list directly.
For details about changing process niceness, see the Red Hat Enterprise Linux 7 System Administrator's Guide.
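For example, a process (my_app is a placeholder) can be started with a reduced priority, or the niceness of a running process (PID 1234 here) adjusted:
$ nice -n 10 my_app
# renice -n 5 -p 1234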

6.3.6.2. Isolating CPUs

You can isolate one or more CPUs from the scheduler with the isolcpus boot parameter. This prevents the scheduler from scheduling any user-space threads on these CPUs.
Once a CPU is isolated, you must manually assign processes to the isolated CPU, either with the CPU affinity system calls or the numactl command.
To isolate the third and sixth to eighth CPUs on your system, add the following to the kernel command line:
isolcpus=2,5-7
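After booting with this parameter, processes can be assigned to the isolated CPUs, for example with taskset or numactl (my_app is a placeholder application name):
# taskset -c 2 my_app
# numactl --physcpubind=5-7 my_app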
You can also use the Tuna tool to isolate a CPU. Tuna can isolate a CPU at any time, not just at boot time. However, this method of isolation is subtly different from the isolcpus parameter, and does not currently achieve the performance gains associated with isolcpus. See Section 6.3.8, “Configuring CPU, Thread, and Interrupt Affinity with Tuna” for more details about this tool.

6.3.7. Setting Interrupt Affinity on AMD64 and Intel 64

Interrupt requests have an associated affinity property, smp_affinity, which defines the processors that will handle the interrupt request. To improve application performance, assign interrupt affinity and process affinity to the same processor, or processors on the same core. This allows the specified interrupt and application threads to share cache lines.

Important

This section covers only the AMD64 and Intel 64 architectures. Interrupt affinity configuration is significantly different on other architectures.

Procedure 6.1. Balancing Interrupts Automatically

  • If your BIOS exports its NUMA topology, the irqbalance service can automatically serve interrupt requests on the node that is local to the hardware requesting service.
    For details on configuring irqbalance, see Section A.1, “irqbalance”.

Procedure 6.2. Balancing Interrupts Manually

  1. Check which devices correspond to the interrupt requests that you want to configure.
    Starting with Red Hat Enterprise Linux 7.5, the system configures the optimal interrupt affinity for certain devices and their drivers automatically. You can no longer configure their affinity manually. This applies to the following devices:
    • Devices using the be2iscsi driver
    • NVMe PCI devices
  2. Find the hardware specification for your platform. Check if the chipset on your system supports distributing interrupts.
    • If it does, you can configure interrupt delivery as described in the following steps.
      Additionally, check which algorithm your chipset uses to balance interrupts. Some BIOSes have options to configure interrupt delivery.
    • If it does not, your chipset will always route all interrupts to a single, static CPU. You cannot configure which CPU is used.
  3. Check which Advanced Programmable Interrupt Controller (APIC) mode is in use on your system.
    Only non-physical flat mode (flat) supports distributing interrupts to multiple CPUs. This mode is available only for systems that have up to 8 CPUs.
    $ journalctl --dmesg | grep APIC
    In the command output:
    • If your system uses a mode other than flat, you can see a line similar to Setting APIC routing to physical flat.
    • If you see no such message, your system uses flat mode.
    If your system uses x2apic mode, you can disable it by adding the nox2apic option to the kernel command line in the bootloader configuration.
  4. Calculate the smp_affinity mask.
    The smp_affinity value is stored as a hexadecimal bit mask representing all processors in the system. Each bit configures a different CPU. The least significant bit is CPU 0.
    The default value of the mask is f, meaning that an interrupt request can be handled on any processor in the system. Setting this value to 1 means that only processor 0 can handle the interrupt.

    Procedure 6.3. Calculating the Mask

    1. In binary, use the value 1 for CPUs that will handle the interrupts.
      For example, to have CPU 0 and CPU 7 handle interrupts, use 0000000010000001 as the binary code:

      Table 6.1. Binary Bits for CPUs

      CPU        15   14   13   12   11   10    9    8    7    6    5    4    3    2    1    0
      Binary      0    0    0    0    0    0    0    0    1    0    0    0    0    0    0    1
    2. Convert the binary code to hexadecimal.
      For example, to convert the binary code using Python:
      >>> hex(int('0000000010000001', 2))
      '0x81'
    On systems with more than 32 processors, you must delimit smp_affinity values for discrete 32-bit groups. For example, if you want only the first 32 processors of a 64-processor system to service an interrupt request, use 0xffffffff,00000000.
  5. Set the smp_affinity mask.
    The interrupt affinity value for a particular interrupt request is stored in the associated /proc/irq/irq_number/smp_affinity file.
    Write the calculated mask to the associated file:
    # echo mask > /proc/irq/irq_number/smp_affinity
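The following commands tie these steps together for a hypothetical interrupt request number 32 (IRQ numbers and device names differ between systems). They locate the interrupt in /proc/interrupts, write the mask calculated in Procedure 6.3 for CPU 0 and CPU 7 (0x81), and read the setting back:
# grep '32:' /proc/interrupts
# echo 81 > /proc/irq/32/smp_affinity
# cat /proc/irq/32/smp_affinity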

Additional Resources

  • On systems that support interrupt steering, modifying the smp_affinity property of an interrupt request sets up the hardware so that the decision to service an interrupt with a particular processor is made at the hardware level with no intervention from the kernel.
    For more information about interrupt steering, see Chapter 9, Networking.

6.3.8. Configuring CPU, Thread, and Interrupt Affinity with Tuna

Tuna is a tool for tuning running processes. It can control CPU, thread, and interrupt affinity, and provides a number of actions for each type of entity it can control. For information about Tuna, see Chapter 4, Tuna.