Chapter 1. Before You Start Tuning Your Red Hat Enterprise Linux for Real Time System

Red Hat Enterprise Linux for Real Time is designed to be used on well-tuned systems for applications with extremely high determinism requirements. Kernel system tuning offers the vast majority of the improvement in determinism. For example, in many workloads thorough system tuning improves consistency of results by around 90%. This is why we typically recommend that customers first perform the tuning described in Chapter 2, General System Tuning on standard Red Hat Enterprise Linux before using Red Hat Enterprise Linux for Real Time.

Things to Remember While You Are Tuning Your Red Hat Enterprise Linux for Real Time Kernel

  1. Be Patient
    Real-time tuning is an iterative process; you will almost never be able to tweak a few variables and know that the change is the best that can be achieved. Be prepared to spend days or weeks narrowing down the set of tunings that work best for your system.
    Additionally, always make long test runs. Changing some tuning parameters and then doing a five-minute test run is not a good validation of a set of tuning changes. Make the length of your test runs adjustable and run them for longer than a few minutes. Try to narrow down to a few different tuning sets with test runs of a few hours, then run those sets for many hours or days at a time to catch corner cases of maximum latency or resource exhaustion.
  2. Be Accurate
    Build a measurement mechanism into your application, so that you can accurately gauge how a particular set of tuning changes affects the application's performance. Anecdotal evidence (for example, "The mouse moves more smoothly") is usually wrong and varies from person to person. Do hard measurements and record them for later analysis.
  3. Be Methodical
    It is very tempting to make multiple changes to tuning variables between test runs, but doing so means that you do not have a way to narrow down which tune affected your test results. Keep the tuning changes between test runs as small as you can.
  4. Be Conservative
    It is also tempting to make large changes when tuning, but it is almost always better to make incremental changes. You will find that working your way up from the lowest to highest priority values will yield better results in the long run.
  5. Be Smart
    Use the tools you have available. The Tuna graphical tuning tool makes it easy to change processor affinities for threads and interrupts, to change thread priorities, and to isolate processors for application use. The taskset and chrt command-line utilities allow you to do most of what Tuna does. If you run into performance problems, the ftrace and perf tools can help locate latency issues.
  6. Be Flexible
    Rather than hard-coding values into your application, use external tools to change policy, priority and affinity. This allows you to try many different combinations and simplifies your logic. Once you have found some settings that give good results, you can either add them to your application, or set up some startup logic to implement the settings when the application starts.
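
The "Be Accurate" advice above can be illustrated with a minimal sketch of a measurement mechanism built into an application. The LatencyRecorder class, its method names, and the CSV column layout are hypothetical and chosen only for this example; they are not part of any Red Hat tool. The clock is injectable so that the recorder can be exercised without real timing.

```python
import csv
import time


class LatencyRecorder:
    """Collects per-iteration timings so tuning runs can be compared later.

    Hypothetical helper for illustration; not part of any Red Hat tool.
    """

    def __init__(self, clock=time.perf_counter_ns):
        self._clock = clock          # injectable so the class can be tested
        self.samples_ns = []

    def measure(self, work):
        """Time one call of `work` and store the elapsed nanoseconds."""
        start = self._clock()
        work()
        elapsed = self._clock() - start
        self.samples_ns.append(elapsed)
        return elapsed

    def summary(self):
        """Return count, min, max and mean of the recorded samples."""
        if not self.samples_ns:
            raise ValueError("no samples recorded")
        return {
            "count": len(self.samples_ns),
            "min_ns": min(self.samples_ns),
            "max_ns": max(self.samples_ns),
            "mean_ns": sum(self.samples_ns) / len(self.samples_ns),
        }

    def rows(self, label):
        """Return one (label, index, elapsed_ns) row per sample."""
        return [(label, i, v) for i, v in enumerate(self.samples_ns)]

    def write_csv(self, path, label):
        """Save the raw samples, tagged with the tuning set under test."""
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["label", "sample", "elapsed_ns"])
            writer.writerows(self.rows(label))


if __name__ == "__main__":
    recorder = LatencyRecorder()
    for _ in range(1000):
        recorder.measure(lambda: sum(range(100)))   # stand-in workload
    print(recorder.summary())
```

Tagging each saved file with the tuning set that produced it keeps the results of different test runs comparable when they are analyzed later.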

Scheduling Policies

Linux uses three main scheduling policies:

SCHED_OTHER (sometimes called SCHED_NORMAL)
This is the default thread policy and has dynamic priority controlled by the kernel. The priority is changed based on thread activity. Threads with this policy are considered to have a real-time priority of 0 (zero).
SCHED_FIFO (First in, first out)
A real-time policy with a priority range of 1 to 99, with 1 being the lowest and 99 the highest. SCHED_FIFO threads always have a higher priority than SCHED_OTHER threads (for example, a SCHED_FIFO thread with a priority of 1 will have a higher priority than any SCHED_OTHER thread). Any thread created as a SCHED_FIFO thread has a fixed priority and will run until it is blocked or preempted by a higher priority thread.
SCHED_RR (Round-Robin)
SCHED_RR is a modification of SCHED_FIFO. Threads with the same priority have a quantum and are round-robin scheduled among all equal priority SCHED_RR threads. This policy is rarely used.
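
The priority ranges described above can be checked on a running system. The following is a minimal sketch that uses the scheduler functions in Python's standard os module, which are available on Linux; the helper names priority_range and current_policy_name are made up for this example. On Linux, the two real-time policies are expected to report 1 to 99 and SCHED_OTHER 0 to 0, in line with the descriptions above.

```python
import os

# Policy constants exposed by Python's os module on Linux.
POLICIES = {
    "SCHED_OTHER": os.SCHED_OTHER,
    "SCHED_FIFO": os.SCHED_FIFO,
    "SCHED_RR": os.SCHED_RR,
}


def priority_range(policy):
    """Return the (min, max) static priority the kernel accepts for a policy."""
    return (os.sched_get_priority_min(policy), os.sched_get_priority_max(policy))


def current_policy_name(pid=0):
    """Return the policy name of a process (pid 0 means the caller)."""
    policy = os.sched_getscheduler(pid)
    for name, value in POLICIES.items():
        if value == policy:
            return name
    # Other policies, or a policy combined with extra flags, end up here.
    return f"unknown ({policy})"


if __name__ == "__main__":
    for name, policy in POLICIES.items():
        low, high = priority_range(policy)
        print(f"{name}: priority {low} to {high}")
    print("this process runs under", current_policy_name())
```

Querying the ranges does not require special privileges. Switching a process to SCHED_FIFO or SCHED_RR normally does, which is one reason to leave that step to external tools such as chrt while experimenting.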

1.1. Running Latency Tests and Interpreting Their Results

To verify that the potential hardware platform is suitable for real-time operations, you should run some latency and performance tests with the Real Time kernel. These tests can highlight BIOS or system tuning (including partitioning) issues that might be experienced under a load.

1.1.1. Preliminary Steps

Procedure 1.1. To successfully test your system and interpret the results:

  1. Check the vendor documentation for any tuning steps required for low latency operation.
    This step aims to reduce or remove any System Management Interrupts (SMIs) that would transition the system into System Management Mode (SMM). While a system is in SMM it is running firmware and not running operating system code, meaning any timers that expire while in SMM will have to wait until the system transitions back into normal operation. This can cause unexplained latencies since SMIs cannot be blocked by Linux and the only indication that an SMI actually occurred may be found in vendor-specific performance counter registers.

    Warning

    Red Hat strongly recommends that you do not completely disable SMIs, as it can result in catastrophic hardware failure.
  2. Ensure that RHEL-RT and the rt-tests package are installed.
    This step verifies that you have tuned the system properly.
  3. Run the hwlatdetect program.
    hwlatdetect looks for hardware-firmware induced latencies by polling the clock-source and looking for unexplained gaps.
    Generally, you do not need to run any sort of load on the system while running hwlatdetect, since the program is looking for latencies introduced by hardware architecture or BIOS/EFI firmware.
    A typical output of hwlatdetect looks like this:
    # hwlatdetect --duration=60s
    hwlatdetect:  test duration 60 seconds
    	detector: tracer
    	parameters:
    		Latency threshold: 10us
    		Sample window:     1000000us
    		Sample width:      500000us
    		Non-sampling period:  500000us
    		Output File:       None
    
    Starting test
    test finished
    Max Latency: Below threshold
    Samples recorded: 0
    Samples exceeding threshold: 0
    The above result represents a system that was tuned to minimize system interruptions from firmware.
    However, not all systems can be tuned to minimize system interruptions, as the following output shows:
    # hwlatdetect --duration=10s
    hwlatdetect:  test duration 10 seconds
    	detector: tracer
    	parameters:
    		Latency threshold: 10us
    		Sample window:     1000000us
    		Sample width:      500000us
    		Non-sampling period:  500000us
    		Output File:       None
    
    Starting test
    test finished
    Max Latency: 18us
    Samples recorded: 10
    Samples exceeding threshold: 10
    SMIs during run: 0
    ts: 1519674281.220664736, inner:17, outer:15
    ts: 1519674282.721666674, inner:18, outer:17
    ts: 1519674283.722667966, inner:16, outer:17
    ts: 1519674284.723669259, inner:17, outer:18
    ts: 1519674285.724670551, inner:16, outer:17
    ts: 1519674286.725671843, inner:17, outer:17
    ts: 1519674287.726673136, inner:17, outer:16
    ts: 1519674288.727674428, inner:16, outer:18
    ts: 1519674289.728675721, inner:17, outer:17
    ts: 1519674290.729677013, inner:18, outer:17
    The above result shows that while doing consecutive reads of the system clocksource, there were 10 delays that showed up in the 15-18 us range.
    hwlatdetect was using the tracer mechanism as the detector for unexplained latencies. Previous versions used a kernel module rather than the ftrace tracer.
    The parameters section reports the latency threshold and how the detection was run. The default latency threshold was 10 microseconds (10 us), the sample window was 1 second, and the sample width was 0.5 seconds.
    As a result, the tracer ran a detector thread that ran for one half of each second of the specified duration.
    The detector thread runs a loop that is described by the following pseudocode:
    start = t1 = timestamp()
    	loop:
    		t0 = timestamp()
    		if (t0 - t1) > threshold
    		   outer = (t0 - t1)
    		t1 = timestamp()
    		if (t1 - t0) > threshold
    		   inner = (t1 - t0)
    		if inner or outer:
    		   print
    		if (t1 - start) > duration:
    		   goto out
    		goto loop
    	out:
    The inner comparison checks that t1 - t0, the time between the two consecutive reads inside the loop, does not exceed the specified threshold (10 us by default). The outer comparison checks t0 - t1, the time between the bottom of the loop and the top. The time between consecutive reads of the timestamp register should be dozens of nanoseconds (essentially a register read, a comparison and a conditional jump), so any other delay between consecutive reads is introduced by firmware or by the way the system components were connected.

    Note

    The values printed out by hwlatdetect for inner and outer are the best case maximum latency. The latency values are the deltas between consecutive reads of the current system clocksource (usually the Time Stamp Counter or TSC register, but potentially the HPET or ACPI power management clock), and therefore include any delays introduced between those reads by the hardware-firmware combination.
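
The detector loop described in step 3 can be approximated in user space to see how the inner and outer gaps are derived. This is only a sketch under stated assumptions: the function name detect_gaps and its parameters are made up for this example, and the code is not how hwlatdetect is implemented. The real detector runs inside the kernel with interrupts disabled, whereas a Python loop is also delayed by the scheduler and the interpreter, so it needs a much larger threshold and its results are not comparable to hwlatdetect output.

```python
import time


def detect_gaps(duration_ns, threshold_ns, clock=time.perf_counter_ns):
    """Poll `clock` back to back and report gaps larger than `threshold_ns`.

    Returns a list of (timestamp, inner, outer) tuples, one per iteration in
    which at least one gap exceeded the threshold.
    """
    hits = []
    start = clock()
    t1 = start
    while True:
        t0 = clock()
        outer = t0 - t1          # bottom of previous iteration to top of this one
        t1 = clock()
        inner = t1 - t0          # two consecutive reads inside the iteration
        if inner > threshold_ns or outer > threshold_ns:
            hits.append((t0, inner, outer))
        if t1 - start >= duration_ns:
            return hits


if __name__ == "__main__":
    # Poll for 0.2 seconds and report gaps above 1 millisecond.
    for ts, inner, outer in detect_gaps(200_000_000, 1_000_000):
        print(f"ts: {ts}, inner:{inner // 1000}, outer:{outer // 1000}")
```

Passing the clock as a parameter keeps the loop logic separate from the time source, so the gap arithmetic can be checked with a scripted sequence of timestamps.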

After finding a suitable hardware-firmware combination, the next step is to test the real-time performance of the system while under a load.

1.1.2. Testing the System Real-time Performance under Load

RHEL-RT provides the rteval utility to test the system real-time performance under load. rteval starts a heavy system load of SCHED_OTHER tasks and then measures real-time response on each online CPU. The loads are a parallel make of the Linux kernel tree in a loop and the hackbench synthetic benchmark.
The goal is to bring the system into a state where each core always has a job to schedule. The jobs perform various tasks, such as memory allocation and freeing, disk I/O, computational tasks, memory copies, and others.
Once the loads have started up, rteval then starts the cyclictest measurement program. This program starts a SCHED_FIFO real-time thread on each online core and then measures real-time scheduling response time. Each measurement thread takes a timestamp, sleeps for an interval, then takes another timestamp after waking up. The latency measured is t1 - (t0 + i), which is the difference between the actual wakeup time t1 and the theoretical wakeup time, that is, the first timestamp t0 plus the sleep interval i.
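
The latency formula t1 - (t0 + i) can be illustrated with a short sketch. The function names wakeup_latency_ns and measure are hypothetical, and the code is not how cyclictest is implemented: it runs under whatever scheduling policy the Python interpreter has, so the numbers it prints include interpreter overhead and only demonstrate the arithmetic.

```python
import time


def wakeup_latency_ns(t0_ns, t1_ns, interval_ns):
    """Latency = actual wakeup time t1 minus theoretical wakeup time t0 + i."""
    return t1_ns - (t0_ns + interval_ns)


def measure(cycles, interval_ns, clock=time.monotonic_ns, sleep=time.sleep):
    """Run `cycles` sleep/wake cycles and return the latency of each one."""
    latencies = []
    for _ in range(cycles):
        t0 = clock()
        sleep(interval_ns / 1_000_000_000)   # time.sleep takes seconds
        t1 = clock()
        latencies.append(wakeup_latency_ns(t0, t1, interval_ns))
    return latencies


if __name__ == "__main__":
    results = measure(cycles=100, interval_ns=1_000_000)   # 1 ms interval
    print("max latency (us):", max(results) / 1000)
```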
The details of the rteval run are written to an XML file along with the boot log for the system. Then the rteval-<date>-N.tar.bz2 file is generated, where N is a counter for the Nth run on <date>. A report generated from the XML file, similar to the following, is printed to the screen:
System:  
Statistics: 
	Samples:           1440463955
	Mean:              4.40624790712us
	Median:            0.0us
	Mode:              4us
	Range:             54us
	Min:               2us
	Max:               56us
	Mean Absolute Dev: 1.0776661507us
	Std.dev:           1.81821060672us

CPU core 0       Priority: 95
Statistics: 
	Samples:           36011847
	Mean:              5.46434910711us
	Median:            4us
	Mode:              4us
	Range:             38us
	Min:               2us
	Max:               40us
	Mean Absolute Dev: 2.13785341159us
	Std.dev:           3.50155558554us
The report above includes details on the hardware, length of the run, options used, and the timing results, both per-CPU and system-wide. You can regenerate the report by running the following command:
# rteval --summarize rteval-<date>-N.tar.bz2
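
To make the statistics fields in the report easier to interpret, the following sketch computes the same set of fields from a list of latency samples by using Python's standard statistics module. The function name summarize is hypothetical and this is not rteval's own code. In particular, the use of the sample standard deviation is an assumption; rteval may compute Std.dev differently.

```python
import statistics


def summarize(samples_us):
    """Compute the fields shown in the report for a list of latency samples."""
    mean = statistics.fmean(samples_us)
    deviations = [abs(x - mean) for x in samples_us]
    return {
        "Samples": len(samples_us),
        "Mean": mean,
        "Median": statistics.median(samples_us),
        "Mode": statistics.mode(samples_us),
        "Range": max(samples_us) - min(samples_us),
        "Min": min(samples_us),
        "Max": max(samples_us),
        # Average distance of each sample from the mean.
        "Mean Absolute Dev": statistics.fmean(deviations),
        # Sample standard deviation; an assumption, rteval may use another form.
        "Std.dev": statistics.stdev(samples_us),
    }


if __name__ == "__main__":
    for field, value in summarize([2, 4, 4, 6]).items():
        print(f"{field}: {value}")
```

For determinism, the Max and Range fields usually matter more than the Mean, because a single long delay can violate a deadline even when the average latency is low.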