Chapter 1. Real-time kernel tuning in RHEL 9
Latency, or response time, is the time between an event and the system's response to it. It is generally measured in microseconds (μs).
For most applications running under a Linux environment, basic performance tuning can improve latency sufficiently. For those industries where latency must be low, accountable, and predictable, Red Hat has a replacement kernel that can be tuned so that latency meets those requirements. RHEL for Real Time 9 provides seamless integration with RHEL 9 and offers clients the opportunity to measure, configure, and record latency times within their organization.
RHEL for Real Time 9 is designed to be used on well-tuned systems, for applications with extremely high determinism requirements. Kernel system tuning offers the vast majority of the improvement in determinism.
Before using RHEL for Real Time 9, perform general system tuning of the standard RHEL 9 system.
Skipping this tuning might prevent you from getting consistent performance from a RHEL for Real Time deployment.
1.1. Tuning guidelines
Real-time tuning is an iterative process; you will almost never be able to tweak a few variables and know that the change is the best that can be achieved. Be prepared to spend days or weeks narrowing down the set of tuning configurations that work best for your system.
Additionally, always make long test runs. Changing some tuning parameters then doing a five minute test run is not a good validation of a set of tunes. Make the length of your test runs adjustable and run them for longer than a few minutes. Try to narrow down to a few different tuning configuration sets with test runs of a few hours, then run those sets for many hours or days at a time to try and catch corner-cases of highest latency or resource exhaustion.
- Build a measurement mechanism into your application, so that you can accurately gauge how a particular set of tuning changes affects the application’s performance. Anecdotal evidence (for example, "The mouse moves more smoothly.") is usually wrong and varies from person to person. Do hard measurements and record them for later analysis.
- It is very tempting to make multiple changes to tuning variables between test runs, but doing so means that you do not have a way to narrow down which tune affected your test results. Keep the tuning changes between test runs as small as you can.
- It is also tempting to make large changes when tuning, but it is almost always better to make incremental changes. You will find that working your way up from the lowest to highest priority values will yield better results in the long run.
- Use the available tools. The tuna tuning tool makes it easy to change processor affinities for threads and interrupts, to change thread priorities, and to isolate processors for application use. The taskset and chrt command line utilities allow you to do most of what Tuna does. If you run into performance problems, the perf utilities can help locate latency issues.
- Rather than hard-coding values into your application, use external tools to change policy, priority and affinity. Using external tools allows you to try many different combinations and simplifies your logic. Once you have found some settings that give good results, you can either add them to your application, or set up startup logic to implement the settings when the application starts.
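As a minimal sketch of keeping policy, priority, and affinity outside the application, the following wrapper reads them from environment variables and builds a taskset and chrt command line. The function name rt_launch and the variables RT_POLICY, RT_PRIO, RT_CPUS, and DRY_RUN are hypothetical names chosen for illustration; ./app stands in for your own program.

```shell
#!/bin/sh
# Hypothetical launch wrapper: scheduling settings come from the
# environment, not from the application source.
rt_launch() {
    policy="${RT_POLICY:-fifo}"   # fifo | rr | other
    prio="${RT_PRIO:-1}"          # start low and raise it between test runs
    cpus="${RT_CPUS:-0}"          # CPU list in the form accepted by taskset -c

    case "$policy" in
        fifo)  flag="--fifo" ;;
        rr)    flag="--rr" ;;
        other) flag="--other"; prio=0 ;;   # SCHED_OTHER only accepts priority 0
        *)     echo "unknown policy: $policy" >&2; return 2 ;;
    esac

    if [ -n "${DRY_RUN:-}" ]; then
        # Print the command instead of running it.
        echo taskset -c "$cpus" chrt "$flag" "$prio" "$@"
    else
        # Setting a real-time policy normally requires root privileges.
        taskset -c "$cpus" chrt "$flag" "$prio" "$@"
    fi
}
```

For example, running DRY_RUN=1 RT_POLICY=rr RT_PRIO=5 RT_CPUS=2,3 rt_launch ./app prints the command taskset -c 2,3 chrt --rr 5 ./app without executing it, so each test run can record exactly which settings were used.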
1.2. Thread scheduling policies
Linux uses three main thread scheduling policies.
SCHED_OTHER
This is the default thread policy and has dynamic priority controlled by the kernel. The priority is changed based on thread activity. Threads with this policy are considered to have a real-time priority of 0 (zero).
SCHED_FIFO (First in, first out)
A real-time policy with a priority range of 1 - 99, with 1 being the lowest and 99 being the highest. SCHED_FIFO threads always have a higher priority than SCHED_OTHER threads (for example, a SCHED_FIFO thread with a priority of 1 will have a higher priority than any SCHED_OTHER thread). Any thread created as a SCHED_FIFO thread has a fixed priority and will run until it is blocked or preempted by a higher priority thread.
SCHED_RR (Round robin)
SCHED_RR is a modification of SCHED_FIFO. Threads with the same priority have a quantum and are round-robin scheduled among all equal priority SCHED_RR threads. This policy is rarely used.
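The priority ranges described above can be written as a small check to run before passing a value to a tool such as chrt. This is an illustrative sketch only; the function name valid_rt_priority is hypothetical.

```shell
#!/bin/sh
# Return 0 if the priority is valid for the policy, following the ranges
# in the text: SCHED_OTHER uses 0, SCHED_FIFO and SCHED_RR use 1 to 99.
valid_rt_priority() {
    policy="$1"
    prio="$2"

    # Reject empty or non-numeric priorities.
    case "$prio" in
        ''|*[!0-9]*) return 1 ;;
    esac

    case "$policy" in
        SCHED_OTHER)         [ "$prio" -eq 0 ] ;;
        SCHED_FIFO|SCHED_RR) [ "$prio" -ge 1 ] && [ "$prio" -le 99 ] ;;
        *)                   return 1 ;;
    esac
}
```

A call such as valid_rt_priority SCHED_FIFO 1 succeeds, while valid_rt_priority SCHED_OTHER 1 fails because SCHED_OTHER threads always have a real-time priority of 0.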
1.3. Balancing logging parameters
The syslog server forwards log messages from programs over a network. The less often this occurs, the larger the pending transaction is likely to be. If the transaction is very large, it can cause an I/O spike. To prevent this, keep the interval reasonably small.
The system logging daemon, syslogd, is used to collect messages from different programs. It also collects information reported by the kernel from the kernel logging daemon, klogd. syslogd logs to a local file, but it can also be configured to log over a network to a remote logging server.
To enable remote logging:
- Configure the machine to which the logs will be sent. For more information, see Remote Syslogging with rsyslog on Red Hat Enterprise Linux.
- Configure each system that will send logs to the remote log server, so that its syslog output is written to the server, rather than to the local file system. To do so, edit the /etc/rsyslog.conf file on each client system. For each of the logging rules defined in that file, replace the local log file with the address of the remote logging server.
# Log all kernel messages to remote logging host.
kern.* @my.remote.logging.server
The example above configures the client system to log all kernel messages to the remote machine at @my.remote.logging.server.
Alternatively, you can configure syslogd to log all locally generated system messages, by adding the following line to the /etc/rsyslog.conf file:
# Log all messages to a remote logging server:
*.* @my.remote.logging.server
The syslogd daemon does not include built-in rate limiting on its generated network traffic. Therefore, Red Hat recommends that on RHEL for Real Time systems you remotely log only the messages that your organization requires to be logged remotely, for example, kernel warnings and authentication requests. Log other messages locally.
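The recommendation above can be sketched as a pair of selective forwarding rules. The helper below only prints the rules; the function name remote_log_rules is hypothetical, and the choice of kern.warning and authpriv.* as the selectors for "kernel warnings" and "authentication requests" is an assumption that you should adapt to your own logging policy.

```shell
#!/bin/sh
# Print rsyslog rules that forward only selected messages to a remote host.
# $1 is the remote logging server, for example my.remote.logging.server.
remote_log_rules() {
    server="$1"
    cat <<EOF
# Forward kernel messages of severity warning or higher to the remote host.
kern.warning    @${server}
# Forward authentication messages to the remote host.
authpriv.*      @${server}
EOF
}
```

As root, the output could be appended to the client configuration with remote_log_rules my.remote.logging.server >> /etc/rsyslog.conf, followed by a restart of the rsyslog service. All other rules in the file keep writing to local log files.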
1.4. Improving performance by avoiding running unnecessary applications
Every running application uses system resources. Ensuring that there are no unnecessary applications running on your system can significantly improve performance.
- You have root permissions on the system.
Do not run the graphical interface where it is not absolutely required, especially on servers.
Check if the system is configured to boot into the GUI by default:
# systemctl get-default
If the output of the command is graphical.target, configure the system to boot to text mode:
# systemctl set-default multi-user.target
Unless you are actively using a Mail Transfer Agent (MTA) on the system you are tuning, disable it. If the MTA is required, ensure it is well-tuned or consider moving it to a dedicated machine.
For more information, refer to the MTA’s documentation.
Important
MTAs are used to send system-generated messages from programs such as cron. This includes reports generated by logging functions like logwatch(). You will not be able to receive these messages if the MTAs on your machine are disabled.
Peripheral devices, such as mice, keyboards, and webcams, send interrupts that may negatively affect latency. If you are not using a graphical interface, remove all unused peripheral devices and disable them.
For more information, refer to the devices' documentation.
Check for automated cron jobs that might impact performance.
# crontab -l
Disable the crond service or any unneeded cron jobs.
- Check your system for third-party applications and any components added by external hardware vendors, and remove any that are unnecessary.
1.5. Non-Uniform Memory Access
The taskset utility only works on CPU affinity and has no knowledge of other NUMA resources such as memory nodes. If you want to perform process binding in conjunction with NUMA, use the numactl command instead of taskset.
For more information about the NUMA API, see Andi Kleen’s whitepaper An NUMA API for Linux.
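As a hedged illustration of binding with numactl rather than taskset, the wrapper below runs a command with both its CPUs and its memory restricted to a single NUMA node. The function name run_on_numa_node is hypothetical; --cpunodebind and --membind are numactl options, and the numactl package must be installed for the command to run.

```shell
#!/bin/sh
# Run a command with CPU and memory allocation bound to one NUMA node.
# $1 is the node number; the remaining arguments are the command to run.
run_on_numa_node() {
    node="$1"
    shift
    # Bind execution to the CPUs of the node and allocate memory only from it.
    numactl --cpunodebind="$node" --membind="$node" "$@"
}
```

For example, run_on_numa_node 0 ./app starts the hypothetical program ./app on node 0. The available nodes on a system can be listed with numactl --hardware.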
1.6. Ensuring that debugfs is mounted
The debugfs file system is specially designed for debugging and making information available to users. It is mounted automatically in RHEL 9 in the /sys/kernel/debug/ directory.
The debugfs file system is mounted using the
To verify that debugfs is mounted:
Run the following command:
# mount | grep ^debugfs
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime,seclabel)
If debugfs is mounted, the command displays the mount point and properties for debugfs.
If debugfs is not mounted, the command returns nothing.
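The same check can be scripted, for example as a precondition in a test harness. The sketch below reads the mount table instead of parsing mount output; the function name debugfs_mount_point is hypothetical, and the optional argument exists only so that the function can be pointed at a different mount table file.

```shell
#!/bin/sh
# Print the debugfs mount point and return 0 if debugfs is mounted;
# print nothing and return 1 otherwise.
# $1 is an optional mount table file (defaults to /proc/mounts).
debugfs_mount_point() {
    mounts="${1:-/proc/mounts}"
    # Field 3 of each mount table line is the file system type,
    # field 2 is the mount point.
    awk '$3 == "debugfs" { print $2; found = 1; exit }
         END { exit !found }' "$mounts"
}
```

A script can then use debugfs_mount_point >/dev/null || echo "debugfs is not mounted" to report a missing mount. If it is missing, debugfs can typically be mounted manually as root with mount -t debugfs nodev /sys/kernel/debug.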
1.7. InfiniBand in RHEL for Real Time
InfiniBand is a type of communications architecture often used to increase bandwidth, improve quality of service (QoS), and provide for failover. It can also be used to improve latency by using the Remote Direct Memory Access (RDMA) mechanism.
The support for InfiniBand on RHEL for Real Time is the same as the support available on Red Hat Enterprise Linux 9. For more information, see Configuring InfiniBand and RDMA networks.
1.8. Using RoCEE and High-Performance Networking
RoCEE (RDMA over Converged Enhanced Ethernet) is a protocol that implements Remote Direct Memory Access (RDMA) over Ethernet networks. It allows you to maintain a consistent, high-speed environment in your data centers, while providing deterministic, low latency data transport for critical transactions.
High Performance Networking (HPN) is a set of shared libraries that provides RoCEE interfaces into the kernel. Instead of going through an independent network infrastructure, HPN places data directly into remote system memory using standard Ethernet infrastructure, resulting in less CPU overhead and reduced infrastructure costs.
Support for HPN under RHEL for Real Time does not differ from the support offered under RHEL 9.
1.9. Reducing CPU performance spikes
The kernel command line skew_tick parameter smooths jitter on moderate to large systems with latency-sensitive applications running. A common source of latency spikes on a real time Linux system is when multiple CPUs contend on common locks in the Linux kernel timer tick handler.
- You have administrator permissions.
skew_tick boot parameter to
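Whichever way the parameter is set, you can confirm what the running kernel was booted with by inspecting its command line. The following is a minimal sketch; the function name skew_tick_value is hypothetical, and the optional argument exists only so that the function can be run against a saved command line instead of /proc/cmdline.

```shell
#!/bin/sh
# Print the value of the skew_tick kernel parameter and return 0,
# or return 1 if the parameter is not on the command line.
# $1 is an optional command line string (defaults to /proc/cmdline).
skew_tick_value() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    # Walk through the space-separated boot parameters.
    for arg in $cmdline; do
        case "$arg" in
            skew_tick=*)
                echo "${arg#skew_tick=}"
                return 0
                ;;
        esac
    done
    return 1
}
```

Calling skew_tick_value with no arguments after a reboot shows whether the change took effect on the running system.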
1.10. Real time scheduling issues and solutions
This section provides information about real time scheduling issues and the available solutions.
Real time scheduling policies
The two real time scheduling policies in RHEL for Real Time share one main characteristic: their threads run until they are preempted by a higher priority thread or until they "wait", either by sleeping or performing I/O. In the case of SCHED_RR, a thread may be preempted by the operating system so that another thread of equal SCHED_RR priority may run. In either of these cases, no provision is made by the POSIX specifications that define the policies for allowing lower priority threads to get any CPU time.
This characteristic of real-time threads means that it is easy to write an application which monopolizes 100% of a given CPU. However, this causes problems for the operating system. For example, the operating system is responsible for managing both system-wide and per-CPU resources and must periodically examine data structures describing these resources and perform housekeeping activities with them. But if a core is monopolized by a SCHED_FIFO thread, the operating system cannot perform its housekeeping tasks on that core. Eventually the entire system becomes unstable, potentially crashing.
On the RHEL for Real Time kernel, interrupt handlers run as threads with a SCHED_FIFO priority. The default priority is 50. A CPU-hog thread with a SCHED_FIFO or SCHED_RR policy and a priority higher than the interrupt handler threads can prevent interrupt handlers from running. This causes programs waiting for data signaled by those interrupts to be starved and fail.
Real time scheduler throttling
RHEL for Real Time comes with a safeguard mechanism that allows the system administrator to allocate bandwidth for use by real time tasks. This safeguard mechanism is known as real time scheduler throttling. Real time scheduler throttling is controlled by two parameters in the /proc file system:
/proc/sys/kernel/sched_rt_period_us
Defines the period in μs (microseconds) to be considered 100% of CPU bandwidth. The default value is 1,000,000 μs (1 second). Changes to the value of the period must be very well thought out, as a period that is too long or too short is equally dangerous.
/proc/sys/kernel/sched_rt_runtime_us
The total bandwidth available for all real-time tasks. The default value is 950,000 μs (0.95 s), which is 95% of the CPU bandwidth. Setting the value to -1 means that real time tasks may use up to 100% of CPU time. This is only adequate when the real time tasks are well engineered and have no obvious caveats, such as unbounded polling loops.
The default values for the real time throttling mechanism define that the real time tasks can use 95% of the CPU time. The remaining 5% will be devoted to non-real time tasks, such as tasks running under SCHED_OTHER and similar scheduling policies. It is important to note that if a single real time task occupies that 95% CPU time slot, the remaining real time tasks on that CPU will not run. Only non-real time tasks use the remaining 5% of CPU time.
The impact of the default values includes the following:
- Rogue real time tasks do not lock up the system by not allowing non-real time tasks to run.
- Real time tasks have at most 95% of CPU time available for them, which can affect their performance.
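The 95% figure is simply the runtime divided by the period. As a small worked example, the sketch below computes that share from the two parameters. The function name rt_bandwidth_percent is hypothetical; with no arguments it reads the live values from /proc, and with two arguments it uses the period and runtime you pass in.

```shell
#!/bin/sh
# Print the share of CPU time available to real time tasks, as an
# integer percentage, or "unlimited" when throttling is disabled.
# $1 = period in microseconds, $2 = runtime in microseconds (both optional).
rt_bandwidth_percent() {
    period="${1:-$(cat /proc/sys/kernel/sched_rt_period_us)}"
    runtime="${2:-$(cat /proc/sys/kernel/sched_rt_runtime_us)}"

    if [ "$runtime" -eq -1 ]; then
        # A runtime of -1 means real time tasks may use 100% of CPU time.
        echo "unlimited"
    else
        echo $(( runtime * 100 / period ))
    fi
}
```

With the default values, rt_bandwidth_percent 1000000 950000 prints 95, which leaves 5% of each period for non-real time tasks.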
Real time thread starvation
Thread starvation occurs when a thread is on a CPU run queue for longer than the starvation threshold and does not make progress. A common cause of thread starvation is to run a fixed-priority polling application, such as a SCHED_FIFO or SCHED_RR application bound to a CPU. Since the polling application does not block for I/O, this can prevent other threads, such as kworkers, from running on that CPU.
An early attempt to reduce thread starvation is called real-time throttling. In real-time throttling, each CPU has a portion of the execution time dedicated to non real-time tasks. The default setting for throttling is on with 95 percent of the CPU for real-time tasks and 5 percent reserved for non real-time tasks. This works if you have a single real-time task causing starvation but does not work if there are multiple real-time tasks assigned to a CPU.
The stalld mechanism is an alternative to real-time throttling and avoids some of the throttling drawbacks. stalld is a daemon that periodically monitors the state of each thread in the system and looks for threads that are on the run queue for a specified length of time without being run. stalld temporarily changes such a thread to use the SCHED_DEADLINE policy and gives the thread a small slice of time on the specified CPU. The thread then runs, and when the time slice is used, the thread returns to its original scheduling policy and stalld continues to monitor thread states.
Housekeeping CPUs are CPUs that run all daemons, shell processes, kernel threads, interrupt handlers, and all work that can be dispatched from isolated CPUs. For housekeeping CPUs with real-time throttling disabled, stalld monitors the CPU running the main workload. It allows the CPU to run as a SCHED_FIFO busy loop, detecting stalled threads and improving their priority when required with a previously defined acceptable added noise.
stalld can be preferable if the real-time throttling mechanism causes unreasonable noise in the main workload. With stalld, you can more precisely control the noise introduced by boosting starved threads.
stalld includes the shell script /usr/bin/throttlectl, which automatically disables RT throttling when stalld is run. You can also list the current throttling values by using this script.
Disabling real-time throttling
The following two files in the /proc file system control real-time throttling:
/proc/sys/kernel/sched_rt_period_us specifies the number of microseconds in a period and defaults to 1 million, which is one second.
/proc/sys/kernel/sched_rt_runtime_us specifies the number of microseconds that real-time tasks can use before throttling occurs. It defaults to 950,000, or 95% of the available CPU cycles. You can disable throttling by passing a value of -1 into the sched_rt_runtime_us file:
# echo -1 > /proc/sys/kernel/sched_rt_runtime_us
Note
The stalld mechanism conflicts with the real-time throttling mechanism. At machine startup, stalld automatically invokes the throttlectl script with a value of off. It saves the current throttling values and writes a -1 to the runtime file. When the stalld service is shut down, the throttlectl script runs with a value of on. It restores the previously saved values for real-time throttling.
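The save-and-restore behavior described in the note can be illustrated with a short sketch. This is not the throttlectl script shipped with stalld; it only mimics the idea. The function names throttle_off and throttle_on, the variables RT_DIR and RT_STATE, and the state file path are all hypothetical. Writing to the real /proc files requires root privileges.

```shell
#!/bin/sh
# Directory that holds sched_rt_runtime_us, and a file for the saved value.
# Both names and the state file location are assumptions for this sketch.
RT_DIR="${RT_DIR:-/proc/sys/kernel}"
RT_STATE="${RT_STATE:-/run/rt-throttle.saved}"

# Save the current runtime value, then disable throttling by writing -1.
throttle_off() {
    cat "$RT_DIR/sched_rt_runtime_us" > "$RT_STATE" &&
        printf '%s\n' '-1' > "$RT_DIR/sched_rt_runtime_us"
}

# Restore the previously saved runtime value, if one exists.
throttle_on() {
    [ -r "$RT_STATE" ] || return 1
    cat "$RT_STATE" > "$RT_DIR/sched_rt_runtime_us"
}
```

Saving the value before overwriting it means that a site-specific runtime, rather than the 950,000 default, is put back when throttling is turned on again.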
1.11. Tuning containers for RHEL for real-time
The main RHEL kernels enable the real time group scheduling feature, CONFIG_RT_GROUP_SCHED, by default. However, for real-time kernels, this feature is disabled.
The CONFIG_RT_GROUP_SCHED feature was developed independently of the PREEMPT_RT patchset used in the kernel-rt package and is intended to operate on real time processes on the main RHEL kernel. The CONFIG_RT_GROUP_SCHED feature might cause latency spikes and is therefore disabled on PREEMPT_RT enabled kernels. Therefore, when testing your workload in a container running on the main RHEL kernel, some real-time bandwidth must be allocated to the container to be able to run the SCHED_FIFO or SCHED_RR tasks inside it.
Configure the following global setting before using podman’s --cpu-rt-runtime command line option:
# echo 950000 > /sys/fs/cgroup/cpu,cpuacct/machine.slice/cpu.rt_runtime_us
- For CPU isolation, use the existing recommendations for setting aside a set of cores for the RT workload.
Run podman run --cpuset-cpus with the list of isolated CPU cores to be used.
Specify the Non-Uniform Memory Access (NUMA) memory nodes to use.
# podman run --cpuset-mems=number-of-memory-nodes
This avoids cross-NUMA node memory access.
To verify that the minimal amount of memory required by the real-time workload running on the container is available at container start time, use the podman run --memory-reservation=limit command.
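The options above are usually combined in a single podman run invocation. The following sketch shows one way to do that. The function name run_rt_container is hypothetical, the 950000 value mirrors the global setting shown earlier, and the 512m reservation is an arbitrary example value, not a recommendation.

```shell
#!/bin/sh
# Start a container with real-time bandwidth, isolated CPUs, a fixed set
# of NUMA memory nodes, and a memory reservation.
# $1 = image, $2 = CPU list, $3 = memory node list, rest = command to run.
run_rt_container() {
    image="$1"
    cpus="$2"
    mems="$3"
    shift 3
    podman run \
        --cpu-rt-runtime=950000 \
        --cpuset-cpus="$cpus" \
        --cpuset-mems="$mems" \
        --memory-reservation=512m \
        "$image" "$@"
}
```

For example, run_rt_container my-rt-image 2,3 0 ./app runs the hypothetical program ./app from the hypothetical image my-rt-image on isolated cores 2 and 3, with memory taken only from NUMA node 0.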