Reference guide for the Realtime component of Red Hat Enterprise MRG
Legal Notice
Abstract
- Preface
- Preface
- I. Hardware
- II. Application architecture
- III. Library services
- 16. More information
- A. Revision history
Preface
1. Document Conventions
1.1. Typographic Conventions
Mono-spaced Bold
To see the contents of the file my_next_bestselling_novel in your current working directory, enter the cat my_next_bestselling_novel command at the shell prompt and press Enter to execute the command.
Press Enter to execute the command.
Press Ctrl+Alt+F2 to switch to a virtual terminal.
mono-spaced bold. For example:
File-related classes include filesystem for file systems, file for files, and dir for directories. Each class has its own associated set of permissions.
Choose → → from the main menu bar to launch Mouse Preferences. In the Buttons tab, select the Left-handed mouse check box and click to switch the primary mouse button from the left to the right (making the mouse suitable for use in the left hand).To insert a special character into a gedit file, choose → → from the main menu bar. Next, choose → from the Character Map menu bar, type the name of the character in the Search field and click . The character you sought will be highlighted in the Character Table. Double-click this highlighted character to place it in the Text to copy field and then click the button. Now switch back to your document and choose → from the gedit menu bar.
Mono-spaced Bold Italic or Proportional Bold Italic
To connect to a remote machine using ssh, type ssh username@domain.name at a shell prompt. If the remote machine is example.com and your username on that machine is john, type ssh john@example.com.
The mount -o remount file-system command remounts the named file system. For example, to remount the /home file system, the command is mount -o remount /home.
To see the version of a currently installed package, use the rpm -q package command. It will return a result as follows: package-version-release.
Publican is a DocBook publishing system.
1.2. Pull-quote Conventions
mono-spaced roman and presented thus:
books Desktop documentation drafts mss photos stuff svn books_tests Desktop1 downloads images notes scripts svgs
mono-spaced roman but add syntax highlighting as follows:
static int kvm_vm_ioctl_deassign_device(struct kvm *kvm,
struct kvm_assigned_pci_dev *assigned_dev)
{
int r = 0;
struct kvm_assigned_dev_kernel *match;
mutex_lock(&kvm->lock);
match = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
assigned_dev->assigned_dev_id);
if (!match) {
printk(KERN_INFO "%s: device hasn't been assigned before, "
"so cannot be deassigned\n", __func__);
r = -EINVAL;
goto out;
}
kvm_deassign_device(kvm, match);
kvm_free_assigned_device(kvm, match);
out:
mutex_unlock(&kvm->lock);
return r;
}
1.3. Notes and Warnings
Note
Important
Warning
2. Getting Help and Giving Feedback
2.1. Do You Need Help?
- Search or browse through a knowledge base of technical support articles about Red Hat products.
- Submit a support case to Red Hat Global Support Services (GSS).
- Access other product documentation.
2.2. We Need Feedback
Preface
- Messaging — Cross platform, high performance, reliable messaging using the Advanced Message Queuing Protocol (AMQP) standard.
- Realtime — Consistent low-latency and predictable response times for applications that require microsecond latency.
- Grid — Distributed High Throughput (HTC) and High Performance Computing (HPC).
Part I. Hardware
Table of Contents
Chapter 1. Processor cores
1.1. Caches
1.2. Interconnects
Chapter 2. Memory allocation
2.1. Demand paging
pgfault value in the /proc/vmstat file.
/proc directory. For a particular process PID, use the cat command to view the /proc/PID/stat file. The relevant entries in this file are:
- Field 2 - filename of the executable
- Field 10 - number of minor page faults
- Field 12 - number of major page faults
Example 2.1. Using the /proc file to check for page faults
/proc file to check for page faults in a running process.
This example uses the cat command piped through cut to return only the second, tenth, and twelfth fields of the /proc/PID/stat file:
# cat /proc/3366/stat | cut -d' ' -f2,10,12
(bash) 5389 0
bash, and it has reported 5389 minor page faults, and no major page faults.
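The same counters can be read from inside a program. The following is a minimal sketch, assuming the field layout described above (fields 10 and 12 of the stat file); the function name read_page_faults is hypothetical and not part of any library.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the minor (field 10) and major (field 12) page-fault counters
 * from a /proc/PID/stat style file.
 * Returns 0 on success, -1 on failure.
 */
int
read_page_faults(const char *stat_path, unsigned long *minflt,
                 unsigned long *majflt)
{
    char buf[1024];
    char *p;
    unsigned long cminflt;
    FILE *fp = fopen(stat_path, "r");

    if (fp == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    /* field 2 is "(name)" and may contain spaces: skip past the last ')' */
    p = strrchr(buf, ')');
    if (p == NULL)
        return -1;

    /* skip fields 3 to 9, then read minflt (10), cminflt (11), majflt (12) */
    if (sscanf(p + 1, " %*c %*d %*d %*d %*d %*d %*u %lu %lu %lu",
               minflt, &cminflt, majflt) != 3)
        return -1;
    return 0;
}
```

Calling read_page_faults("/proc/self/stat", &minor, &major) before and after a time-sensitive section gives a rough count of the faults taken inside it.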
Note
- Linux System Programming by Robert Love
2.2. Using mlock to avoid memory faults
The mlock and mlockall system calls tell the system to lock a specified memory range, and to not allow that memory to be paged out. This means that once the physical page has been allocated to the page table entry, references to that page will always be fast.
mlock system calls available. The mlock and munlock calls lock and unlock a specific range of addresses. The mlockall and munlockall calls lock or unlock the entire program space.
mlock calls should be examined carefully and used with caution. If the application is large, or if it has a large data domain, the mlock calls can cause thrashing if the system cannot allocate memory for other tasks.
Note
Use mlock with care. Using it excessively can lead to an out of memory (OOM) error. Do not just put an mlockall call at the start of your application. It is recommended that only the data and text of the realtime portion of the application be locked.
mlock will not guarantee that the program will experience no page I/O. It is used to ensure that the data will stay in memory, but cannot ensure that it will stay in the same page. Other functions such as move_pages and memory compactors can move data around despite the mlock.
Important
CAP_IPC_LOCK capability in order to be able to use mlockall or mlock on large buffers. See the capabilities(7) man page for details.
Memory locks do not stack. If pages are locked several times by calls to mlock or mlockall, they will be unlocked by a single call to munlock for the corresponding page, or by munlockall. Thus, the application must be aware of which pages it is unlocking in order to prevent this double-lock/single-unlock problem.
- Tracking the memory areas allocated and locked, and creating a wrapper function that, before unlocking a page, verifies how many users (allocations) that page has. This is the resource counting principle used in device drivers.
- Performing allocations that take the page size and alignment into account, in order to prevent a double-lock in the same page.
mlock depends on the application's needs and system resources. Although there is no single solution for all the applications, the following code example can be used as a starting point for the implementation of a function that will allocate and lock memory buffers.
Example 2.2. Using mlock in an application
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
void *
alloc_workbuf(size_t size)
{
void *ptr;
int retval;
/*
* alloc memory aligned to a page, to prevent two mlock() in the
* same page.
*/
retval = posix_memalign(&ptr, (size_t) sysconf(_SC_PAGESIZE), size);
/* return NULL on failure */
if (retval)
return NULL;
/* lock this buffer into RAM */
if (mlock(ptr, size)) {
free(ptr);
return NULL;
}
return ptr;
}
void
free_workbuf(void *ptr, size_t size)
{
/* unlock the address range */
munlock(ptr, size);
/* free the memory */
free(ptr);
}
The function alloc_workbuf dynamically allocates a memory buffer and locks it. The memory allocation is performed by posix_memalign in order to align the memory area to a page. If size is smaller than the page size, regular malloc allocations will be able to use the remainder of the page. To use this approach safely, however, no mlock calls can be made on regular malloc allocations. This prevents the double-lock/single-unlock problem. The function free_workbuf will unlock and free the memory area.
In addition to mlock and mlockall, it is possible to allocate and lock a memory area using mmap with the MAP_LOCKED flag. The following example is the implementation of the aforementioned code using mmap.
Example 2.3. Using mmap in an application
#include <sys/mman.h>
#include <stdlib.h>
void *
alloc_workbuf(size_t size)
{
void *ptr;
ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
if (ptr == MAP_FAILED)
return NULL;
return ptr;
}
void
free_workbuf(void *ptr, size_t size)
{
munmap(ptr, size);
}
Because mmap allocates memory on a page basis, no two locks fall in the same page, which helps to prevent the double-lock/single-unlock problem. On the other hand, if the size variable is not a multiple of the page size, the rest of the page is wasted. Furthermore, a call to munlockall unlocks the memory locked by mmap.
mlockall prior to entering a time-sensitive region of the code, followed by munlockall at the end of the time-sensitive region. This can reduce paging while in the critical section. Similarly, mlock can be used on a data region that is relatively static or that will grow slowly but needs to be accessed without page faulting.
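The pattern of locking memory only around a time-sensitive region can be sketched as follows. This is an illustration under assumptions, not a definitive implementation: the names critical_fn, run_time_critical, and sample_work are hypothetical, and the sketch simply refuses to run the region if the lock cannot be taken.

```c
#include <sys/mman.h>

/* hypothetical callback type for the time-sensitive work */
typedef void (*critical_fn)(void *arg);

/*
 * Lock all current and future mappings, run fn(arg), then unlock.
 * Returns 0 on success, or -1 if mlockall failed (fn is not run).
 */
int
run_time_critical(critical_fn fn, void *arg)
{
    /* may fail with EPERM or ENOMEM; see mlockall(2) */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return -1;

    fn(arg);

    /* release the locks so that other tasks can reclaim memory */
    munlockall();
    return 0;
}

/* trivial stand-in for real work: increments an int counter */
void
sample_work(void *arg)
{
    ++*(int *)arg;
}
```

Because munlockall releases every lock held by the process, this sketch should not be combined with buffers that are expected to stay locked after the region ends.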
Note
- capabilities(7)
- mlock(2)
- mlock(3)
- mlockall(2)
- mmap(2)
- move_pages(2)
- posix_memalign(3)
- posix_memalign(3p)
Chapter 3. Hardware interrupts
Example 3.1. Viewing interrupts on your system
cat command to view /proc/interrupts:
$ cat /proc/interrupts
            CPU0        CPU1
  0:    13072311           0   IO-APIC-edge      timer
  1:       18351           0   IO-APIC-edge      i8042
  8:         190           0   IO-APIC-edge      rtc0
  9:      118508        5415   IO-APIC-fasteoi   acpi
 12:      747529       86120   IO-APIC-edge      i8042
 14:     1163648           0   IO-APIC-edge      ata_piix
 15:           0           0   IO-APIC-edge      ata_piix
 16:    12681226      126932   IO-APIC-fasteoi   ahci, uhci_hcd:usb2, radeon, yenta, eth0
 17:     3717841           0   IO-APIC-fasteoi   uhci_hcd:usb3, HDA, iwl3945
 18:           0           0   IO-APIC-fasteoi   uhci_hcd:usb4
 19:         577          68   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb5
NMI:           0           0   Non-maskable interrupts
LOC:     3755270     9388684   Local timer interrupts
RES:     1184857     2497600   Rescheduling interrupts
CAL:       12471        2914   function call interrupts
TLB:       14555       15567   TLB shootdowns
TRM:           0           0   Thermal event interrupts
SPU:           0           0   Spurious interrupts
ERR:           0
MIS:           0
3.1. Level-signalled interrupts
3.2. Message-signalled interrupts
pci=nomsi on the kernel command line.
3.3. Non-maskable interrupts
3.4. System management interrupts
Note
hwlatdetect utility, which is available in the rt-tests package. This utility is designed to measure periods of time during which the CPU has been stolen by an SMI handling routine.
3.5. Advanced programmable interrupt controller
Part II. Application architecture
Chapter 4. Threads and processes
- Process
- A UNIX®-style process is an operating system construct that contains:
- Address mappings for virtual memory
- An execution context (PC, stack, registers)
- State/Accounting information
Linux processes started as exactly this style of process. When the concept of more than one process running inside one address space was developed, Linux turned to a process structure that shares an address space with another process. This works well, as long as the process data structure is kept small. For the remainder of this document, the term process refers to an independent address space, potentially containing multiple threads.
- Thread
- Strictly, a thread is a schedulable entity that contains:
- A program counter (PC)
- A register context
- A stack pointer
Multiple threads can exist within a process.
- Use the fork and exec functions to create new processes
- Use the Posix Threads (pthreads) API to create new threads within an already running process
Note
Note
- fork(2)
- exec(2)
- Programming with POSIX Threads, David R. Butenhof, Addison-Wesley, ISBN 0-201-63392-2
- Advanced Programming in the UNIX Environment, 2nd Ed., W. Richard Stevens and Stephen A. Rago, Addison-Wesley, ISBN 0-201-43307-9
- “POSIX Threads Programming”, Blaise Barney, Lawrence Livermore National Laboratory, http://www.llnl.gov/computing/tutorials/pthreads/
Chapter 5. Priorities and policies
- SCHED_OTHER or SCHED_NORMAL: The default policy
- SCHED_BATCH: Similar to SCHED_OTHER, but with a throughput orientation
- SCHED_IDLE: A lower priority than SCHED_OTHER
- SCHED_FIFO: A first in/first out realtime policy
- SCHED_RR: A round-robin realtime policy
SCHED_OTHER, SCHED_FIFO, and SCHED_RR.
SCHED_OTHER or SCHED_NORMAL is the default scheduling policy for Linux threads. It has a dynamic priority that is changed by the system based on the characteristics of the thread. The priority of SCHED_OTHER threads is also affected by their nice value. The nice value is a number between -20 (highest priority) and 19 (lowest priority). By default, SCHED_OTHER threads have a nice value of 0. Adjusting the nice value will change the way the thread is handled.
SCHED_FIFO policy will run ahead of SCHED_OTHER tasks. Instead of using nice values, SCHED_FIFO uses a fixed priority between 1 (lowest) and 99 (highest). A SCHED_FIFO thread with a priority of 1 will always be scheduled ahead of any SCHED_OTHER thread.
SCHED_RR policy is very similar to the SCHED_FIFO policy. In the SCHED_RR policy, threads of equal priority are scheduled in a round-robin fashion. Generally, SCHED_FIFO is preferred over SCHED_RR.
SCHED_FIFO and SCHED_RR threads will run until one of the following events occurs:
- The thread goes to sleep or begins waiting for an event
- A higher-priority realtime thread becomes ready to run
Table 5.1. Policy priorities
| Policy | Default priority value | Lowest priority value | Highest priority value |
|---|---|---|---|
| SCHED_FIFO | | 1 | 99 |
| SCHED_RR | | 1 | 99 |
| SCHED_OTHER | 0 (nice value) | 19 (nice value) | -20 (nice value) |
Chapter 6. Affinity
- Reserve one CPU core for all system processes and allow the application to run on the remainder of the cores, with one CPU core per application thread.
- Allow an application thread and a given kernel thread (such as the network softirq or a driver thread) on the same CPU.
- Pair producer and consumer threads on each CPU.
Tuna tool, or through the use of shell scripts to modify the bitmask value. The taskset command can be used to change the affinity of a process, while modifying the /proc filesystem entry changes the affinity of an interrupt.
Note
- MRG Tuna User Guide
- taskset(1)
6.1. Using the taskset command to set processor affinity
taskset command sets and checks affinity information for a given process. These tasks can also be achieved using the Tuna tool.
-p, or --pid option and the PID of the process to be checked. The -c or --cpu-list displays the information as a numerical list of cores, instead of as a bitmask.
# taskset -p -c 1000
pid 1000's current affinity list: 0,1
# taskset -p -c 1 1000
pid 1000's current affinity list: 0,1
pid 1000's new affinity list: 1
# taskset -p -c 0,1 1000
pid 1000's current affinity list: 1
pid 1000's new affinity list: 0,1
taskset command can also be used to start a new process with a particular affinity. This command will run the /bin/my-app application on CPU 4:
# taskset -c 4 /bin/my-app
/bin/my-app application on CPU 4, with a SCHED_FIFO policy and a priority of 78:
# taskset -c 4 chrt -f 78 /bin/my-app
6.2. Using the sched_setaffinity() system call to set processor affinity
In addition to the taskset command, processor affinity can also be set using the sched_setaffinity() system call. The current affinity can be read back with sched_getaffinity().
int sched_setaffinity(pid_t pid, size_t setsize, const cpu_set_t *set)
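A minimal sketch of changing affinity from C follows. It uses sched_setaffinity(), the call that changes the mask, together with sched_getaffinity(), which only reads it. The helper name pin_to_first_allowed_cpu is hypothetical, and choosing the lowest-numbered allowed CPU is an arbitrary assumption made so that the example works on any machine.

```c
#define _GNU_SOURCE   /* needed for cpu_set_t and the CPU_* macros */
#include <sched.h>

/*
 * Pin the calling thread to the lowest-numbered CPU it is currently
 * allowed to run on. Returns that CPU number, or -1 on failure.
 */
int
pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed, target;
    int cpu;

    /* pid 0 means "the calling thread" */
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            /* build a mask containing only this CPU */
            CPU_ZERO(&target);
            CPU_SET(cpu, &target);
            if (sched_setaffinity(0, sizeof(target), &target) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;
}
```

A real application would normally pass a CPU number chosen by the system designer rather than the first allowed one.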
Note
- sched_getaffinity(2)
- sched_setaffinity(2)
Chapter 7. Thread synchronization
- Mutexes
- Barriers
- Condvars
7.1. Mutexes
Mutexes are created with the pthread_mutex_init library call. A mutex serializes access to each section of code, so that only one thread of an application is running the code at any one time.
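As an illustration of a mutex serializing access to a section of code, the following sketch has several threads increment a shared counter. The POSIX call that initializes a mutex is pthread_mutex_init. The names struct counter, worker, and run_counter_demo are hypothetical, and default mutex attributes are assumed.

```c
#include <pthread.h>
#include <stdlib.h>

struct counter {
    pthread_mutex_t lock;
    long value;
    long iterations;
};

/* each worker increments the shared value under the mutex */
static void *
worker(void *arg)
{
    struct counter *c = arg;
    long i;

    for (i = 0; i < c->iterations; i++) {
        pthread_mutex_lock(&c->lock);
        c->value++;                 /* the serialized section */
        pthread_mutex_unlock(&c->lock);
    }
    return NULL;
}

/* Returns the final counter value, or -1 if setup failed. */
long
run_counter_demo(int nthreads, long iterations)
{
    struct counter c = { .value = 0, .iterations = iterations };
    pthread_t *tids;
    int i, started = 0;

    if (pthread_mutex_init(&c.lock, NULL) != 0)
        return -1;
    tids = calloc((size_t)nthreads, sizeof(*tids));
    if (tids == NULL) {
        pthread_mutex_destroy(&c.lock);
        return -1;
    }
    for (i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[i], NULL, worker, &c) != 0)
            break;
        started++;
    }
    /* wait for every thread that was actually started */
    for (i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    free(tids);
    pthread_mutex_destroy(&c.lock);
    return started == nthreads ? c.value : -1;
}
```

Without the lock and unlock calls, concurrent increments could be lost and the final value would be unpredictable.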
7.2. Barriers
7.3. Condvars
7.4. Other types of synchronization
Chapter 8. Sockets
8.1. Socket options
TCP_NODELAY and TCP_CORK.
TCP_NODELAY
TCP_NODELAY is a socket option that can be used to turn this behavior off. It can be enabled through the setsockopt sockets API, with the following call:
int one = 1;
setsockopt(descriptor, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
TCP_NODELAY can also interact with other optimization heuristics to result in poor overall performance.
TCP_NODELAY enabled.
writev on a socket with TCP_NODELAY enabled.
TCP_CORK
When TCP_CORK is enabled, TCP will delay all packets until the application removes the cork, and allows the stored packets to be sent. This allows applications to build a packet in kernel space, which is useful when different libraries are being used to provide layer abstractions.
The TCP_CORK option can be enabled by using the following call:
int one = 1;
setsockopt(descriptor, SOL_TCP, TCP_CORK, &one, sizeof(one));
Enabling TCP_CORK is often referred to as corking the socket.
The cork is removed by setting the option back to zero:
int zero = 0;
setsockopt(descriptor, SOL_TCP, TCP_CORK, &zero, sizeof(zero));
Once the socket is uncorked, TCP will send the accumulated logical packet immediately, without waiting for further packets from the application.
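The cork and uncork steps can be wrapped around a group of writes so that they leave as one logical packet. The following is a hedged sketch: the names tcp_set_cork and send_corked are hypothetical, a header-plus-body message is assumed purely for illustration, and a short send is simply treated as failure.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Set (on = 1) or remove (on = 0) the cork. Returns 0 or -1. */
int
tcp_set_cork(int fd, int on)
{
    /* IPPROTO_TCP has the same value as SOL_TCP on Linux */
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* Send header and body as one logical packet. Returns 0 or -1. */
int
send_corked(int fd, const void *hdr, size_t hlen,
            const void *body, size_t blen)
{
    int rc = 0;

    if (tcp_set_cork(fd, 1) != 0)
        return -1;

    /* MSG_NOSIGNAL: report a closed peer as an error, not SIGPIPE */
    if (send(fd, hdr, hlen, MSG_NOSIGNAL) != (ssize_t)hlen ||
        send(fd, body, blen, MSG_NOSIGNAL) != (ssize_t)blen)
        rc = -1;

    /* always uncork, so that anything already queued is flushed */
    if (tcp_set_cork(fd, 0) != 0)
        rc = -1;
    return rc;
}
```

Here fd is expected to be a connected TCP socket; the two send calls stand in for writes made by different library layers.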
Example 8.1. Using TCP_NODELAY and TCP_CORK
TCP_NODELAY and TCP_CORK can have on an application.
$ ./tcp_nodelay_server 5001 10000
no_delay option to enable TCP_NODELAY socket options. Use the cork option to enable TCP_CORK. In all cases it will send 15 packets, each of two bytes, and wait for a response from the server.
TCP_NODELAY nor TCP_CORK are in use. This is a baseline measurement. TCP coalesces writes and has to wait to check if the application has more data than can optimally fit in the network packet:
$ ./tcp_nodelay_client localhost 5001 10000
10000 packets of 30 bytes sent in 400129.781250 ms: 0.749757 bytes/ms
TCP_NODELAY only. TCP is instructed not to coalesce small packets, but to send buffers immediately. This improves performance significantly, but creates a large number of network packets for each logical packet.
$ ./tcp_nodelay_client localhost 5001 10000 no_delay
10000 packets of 30 bytes sent in 1649.771240 ms: 181.843399 bytes/ms using TCP_NODELAY
TCP_CORK only. It halves the time required to send the same number of logical packets. This is because TCP coalesces full logical packets in its buffers, and sends fewer overall network packets.
$ ./tcp_nodelay_client localhost 5001 10000 cork
10000 packets of 30 bytes sent in 850.796448 ms: 352.610779 bytes/ms using TCP_CORK
TCP_CORK is the best technique to use. It allows the application to precisely convey the information that a packet is finished and must be sent without delay. When developing programs, if they need to send bulk data from a file, consider using TCP_CORK with sendfile.
Note
- sendfile(2)
- “TCP nagle sample applications”, which are example applications of ghost protocols written in C. To download them, right-click and save from the following links:
Chapter 9. Shared memory
shmem set of calls. These calls are quite capable, but overly complicated and cumbersome for the vast majority of use cases. For this reason, they have been deprecated on the MRG Realtime kernel and should no longer be used.
shm_open and mmap.
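A minimal sketch of the shm_open and mmap combination follows. The function name shm_write_message and the idea of storing a short string are assumptions made for illustration; a second process would open the same name and map it to read the data. On older C libraries the program may need to be linked with -lrt.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create (or open) a POSIX shared memory object, map it, and copy
 * msg into it. Returns 0 on success, -1 on failure.
 */
int
shm_write_message(const char *name, const char *msg)
{
    size_t len = strlen(msg) + 1;   /* include the terminating NUL */
    char *area;
    int fd;

    fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd == -1)
        return -1;

    /* a new object has size zero: give it room for the message */
    if (ftruncate(fd, (off_t)len) == -1) {
        close(fd);
        shm_unlink(name);
        return -1;
    }

    area = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping stays valid */
    if (area == MAP_FAILED) {
        shm_unlink(name);
        return -1;
    }

    memcpy(area, msg, len);
    munmap(area, len);
    return 0;
}
```

The object persists until shm_unlink is called on its name, so the last user is responsible for removing it.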
Note
- shm_open(3)
- shm_overview(7)
- mmap(2)
Chapter 10. Shared libraries
ld.so system loader. From there, they are mapped into the address space of processes that require symbols from the library. A symbol is not evaluated until the first reference to it is encountered. Evaluating the symbol only when it is referenced can be a source of latency, because memory pages can be on disk, and caches can become invalidated. Evaluating symbols in advance is a precaution that can help to improve latency.
LD_BIND_NOW environment variable. Setting LD_BIND_NOW to any value other than null will cause the system loader to lookup all unresolved symbols at program load time.
Note
- ld.so(8)
Part III. Library services
Chapter 11. Setting the scheduler
11.1. Using chrt to set the scheduler
chrt is used to check and adjust scheduler policies and priorities. It can start new processes with the desired properties, or change the properties of a running process.
--pid or -p option alone to specify the process ID (PID):
# chrt -p 468
pid 468's current scheduling policy: SCHED_FIFO
pid 468's current scheduling priority: 85
# chrt -p 476
pid 476's current scheduling policy: SCHED_OTHER
pid 476's current scheduling priority: 0
Table 11.1. Policy options for the chrt command
| Short option | Long option | Description |
|---|---|---|
| -f | --fifo | Set schedule to SCHED_FIFO |
| -o | --other | Set schedule to SCHED_OTHER |
| -r | --rr | Set schedule to SCHED_RR |
SCHED_FIFO, with a priority of 50:
# chrt -f -p 50 1000
SCHED_OTHER, with a priority of 0:
# chrt -o -p 0 1000
SCHED_FIFO and a priority of 36:
# chrt -f 36 /bin/my-app
Note
- Tuna User Guide
- chrt(1)
11.2. Preemption
/proc/PID/status, where PID is the PID of the process. The following command checks the preemption of the process with PID 1000:
# grep voluntary /proc/1000/status
voluntary_ctxt_switches: 194529
nonvoluntary_ctxt_switches: 195338
Note
- Tuna User Guide
- grep(1)
11.3. Using library calls to set priority
- nice
- getpriority
- setpriority
Important
sched.h header file. Ensure you always check the return codes from functions. The appropriate man pages outline the various codes used.
Further reading
- nice(2)
- getpriority(2)
- setpriority(2)
11.3.1. sched_getscheduler
sched_getscheduler() function retrieves the scheduler policy for a given PID:
#include <sched.h>
int policy;
policy = sched_getscheduler(pid_t pid);
SCHED_OTHER, SCHED_RR and SCHED_FIFO are also defined in sched.h. They can be used to check the defined policy or to set the policy:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
int main(int argc, char *argv[])
{
    pid_t pid;
    int policy;
    /* default to the calling process (PID 0) when no argument is given */
    if (argc < 2)
        pid = 0;
    else
        pid = atoi(argv[1]);
    printf("Scheduler Policy for PID: %d -> ", pid);
    policy = sched_getscheduler(pid);
    switch(policy) {
    case SCHED_OTHER: printf("SCHED_OTHER\n"); break;
    case SCHED_RR: printf("SCHED_RR\n"); break;
    case SCHED_FIFO: printf("SCHED_FIFO\n"); break;
    default: printf("Unknown...\n");
    }
    return 0;
}
11.3.2. sched_setscheduler
sched_setscheduler() function. Currently, realtime policies have one parameter, sched_priority. This parameter is used to adjust the priority of the process.
sched_setscheduler function requires three parameters, in the form: sched_setscheduler(pid_t pid, int policy, const struct sched_param *sp);
Note
sched_setscheduler(2) man page lists all possible return values of sched_setscheduler, including the error codes.
pid is zero, the sched_setscheduler() function will act on the calling process.
SCHED_FIFO and the priority to 50:
struct sched_param sp = { .sched_priority = 50 };
int ret;
ret = sched_setscheduler(0, SCHED_FIFO, &sp);
if (ret == -1) {
perror("sched_setscheduler");
return 1;
}
11.3.3. sched_getparam and sched_setparam
sched_setparam() function is used to set the scheduling parameters of a particular process. This can then be verified using the sched_getparam() function.
sched_getscheduler() function, which only returns the scheduling policy, the sched_getparam() function returns all scheduling parameters for the given process.
struct sched_param sp;
int ret;
/* reads priority and increments it by 2 */
ret = sched_getparam(0, &sp);
sp.sched_priority += 2;
/* sets the new priority */
ret = sched_setparam(0, &sp);
Note
Important
11.3.4. sched_get_priority_min and sched_get_priority_max
sched_get_priority_min and sched_get_priority_max functions are used to check the valid priority range for a given scheduler policy.
-1 and errno will be set to EINVAL.
#include <stdio.h>
#include <unistd.h>
#include <sched.h>
int main(void)
{
    printf("Valid priority range for SCHED_OTHER: %d - %d\n",
           sched_get_priority_min(SCHED_OTHER),
           sched_get_priority_max(SCHED_OTHER));
    printf("Valid priority range for SCHED_FIFO: %d - %d\n",
           sched_get_priority_min(SCHED_FIFO),
           sched_get_priority_max(SCHED_FIFO));
    printf("Valid priority range for SCHED_RR: %d - %d\n",
           sched_get_priority_min(SCHED_RR),
           sched_get_priority_max(SCHED_RR));
    return 0;
}
Note
SCHED_FIFO and SCHED_RR can be any number within the range of 1 to 99. POSIX is not guaranteed to honor this range, however, and portable programs should use these calls.
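For portable code, a requested priority can be checked against the range reported by these calls instead of the hard-coded values 1 and 99. The following sketch shows one way to do that; the helper name clamp_priority is hypothetical.

```c
#include <sched.h>

/*
 * Force a requested priority into the valid range for the policy.
 * Returns the clamped priority, or -1 if the policy is invalid
 * (in which case errno is EINVAL).
 */
int
clamp_priority(int policy, int wanted)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);

    if (lo == -1 || hi == -1)
        return -1;
    if (wanted < lo)
        return lo;
    if (wanted > hi)
        return hi;
    return wanted;
}
```

The result can be stored in the sched_priority field of a struct sched_param before it is passed to sched_setscheduler.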
11.3.5. sched_rr_get_interval
SCHED_RR policy differs slightly from the SCHED_FIFO policy. SCHED_RR allocates concurrent processes that have the same priority in a round-robin rotation. In this way, each process is assigned a timeslice. The sched_rr_get_interval() function will report the timeslice that has been allocated to each process.
SCHED_RR processes, the sched_rr_get_interval() function is able to retrieve the timeslice length of any process on Linux.
timespec, or the number of seconds and nanoseconds since the base time of 00:00:00 GMT, 1 January 1970:
struct timespec {
    time_t tv_sec;  /* seconds */
    long tv_nsec;   /* nanoseconds */
};
The sched_rr_get_interval function requires the PID of the process, and a struct timespec:
#include <stdio.h>
#include <sched.h>
int main(void)
{
    struct timespec ts;
    int ret;
    /* real apps must check return values */
    ret = sched_rr_get_interval(0, &ts);
    /* note: tv_nsec is printed without zero padding */
    printf("Timeslice: %lu.%lu\n",
           (unsigned long)ts.tv_sec, (unsigned long)ts.tv_nsec);
    return 0;
}
The following commands run this program, built as sched_03, with varying policies and priorities. Processes with a SCHED_FIFO policy will return a timeslice of 0 seconds and 0 nanoseconds, indicating that it is infinite:
$ chrt -o 0 ./sched_03
Timeslice: 0.38994072
$ chrt -r 10 ./sched_03
Timeslice: 0.99984800
$ chrt -f 10 ./sched_03
Timeslice: 0.0
Chapter 12. Creating threads and processes
Chapter 13. Mmap
mmap system call allows a file (or parts of a file) to be mapped to memory. This allows the file content to be changed with a memory operation, avoiding system calls and input/output operations.
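The following sketch changes one byte of a file through a shared mapping rather than with write. The function name poke_file_byte is hypothetical, and mapping the whole file is an assumption that only suits small files.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Overwrite one byte of a file through a shared mapping.
 * Returns 0 on success, -1 on failure.
 */
int
poke_file_byte(const char *path, size_t offset, char value)
{
    struct stat st;
    char *map;
    int fd, rc;

    fd = open(path, O_RDWR);
    if (fd == -1)
        return -1;
    /* refuse empty files and offsets past the end of the file */
    if (fstat(fd, &st) == -1 || st.st_size <= 0 ||
        offset >= (size_t)st.st_size) {
        close(fd);
        return -1;
    }

    map = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
               MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid */
    if (map == MAP_FAILED)
        return -1;

    map[offset] = value;        /* plain memory store, no write call */

    /* flush the change to the file before unmapping */
    rc = msync(map, (size_t)st.st_size, MS_SYNC);
    munmap(map, (size_t)st.st_size);
    return rc;
}
```

MAP_SHARED is what makes the store visible in the file; with MAP_PRIVATE the change would stay local to the process.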
Note
- mmap(2)
- Linux System Programming by Robert Love
Chapter 14. System calls
14.1. sched_yield
sched_yield function was originally designed to cause a processor to select a different process other than the running one. This type of request is prone to failure when issued from within a poorly-written application.
sched_yield() function is used within processes with realtime priorities, it can display unexpected behavior. The process that has called sched_yield gets moved to the tail of the queue of processes running at that priority. When this occurs in a situation where there are no other processes running at the same priority, the process that called sched_yield continues running. If the priority of that process is high, it can potentially create a busy loop, rendering the machine unusable.
sched_yield on realtime processes.
14.2. getrusage()
getrusage function is used to retrieve important information from a given process or its threads. This will not provide all the information available, but will report on information such as context switches and page faults.
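A minimal sketch of reading the context switch and page fault counters with getrusage follows. The struct usage_counters type and the function name read_usage_counters are hypothetical; the ru_* fields are those documented in getrusage(2).

```c
#include <sys/time.h>
#include <sys/resource.h>

/* hypothetical container for the counters of interest */
struct usage_counters {
    long minor_faults;
    long major_faults;
    long voluntary_switches;
    long involuntary_switches;
};

/* Fill out with the counters for the calling process. 0 or -1. */
int
read_usage_counters(struct usage_counters *out)
{
    struct rusage ru;

    /* RUSAGE_SELF covers all threads of the calling process */
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;

    out->minor_faults = ru.ru_minflt;
    out->major_faults = ru.ru_majflt;
    out->voluntary_switches = ru.ru_nvcsw;
    out->involuntary_switches = ru.ru_nivcsw;
    return 0;
}
```

Taking one snapshot before and one after a section of code, then subtracting, shows how many faults and context switches that section incurred.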
Chapter 15. Timestamping
15.1. Hardware clocks
/sys/devices/system/clocksource/clocksource0/available_clocksource file:
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
/sys/devices/system/clocksource/clocksource0/current_clocksource file:
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
/sys/devices/system/clocksource/clocksource0/available_clocksource file. To do so, write the name of the clock source into the /sys/devices/system/clocksource/clocksource0/current_clocksource file. For example, the following command sets HPET as the clock source in use:
# echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
Important
idle=poll parameter forces the clock to avoid entering the idle state, and the processor.max_cstate=1 parameter prevents the clock from entering deeper C-states. Note however that in both cases there would be an increase in energy consumption, as the system would always run at top speed.
Note
15.1.1. Reading hardware clock sources
Example 15.1. Comparing the cost of reading hardware clock sources
cat command. The time command is used to view the duration required to read the clock source 10 million times.
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
# time ./clock_timing
real 0m0.601s
user 0m0.592s
sys 0m0.002s
# echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
# time ./clock_timing
real 0m12.263s
user 0m12.197s
sys 0m0.001s
# echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
acpi_pm
# time ./clock_timing
real 0m24.461s
user 0m0.504s
sys 0m23.776s
time(1) man page provides detailed information on how to use the command and interpret its output. The example above uses the following categories:
- real: The total time spent beginning from program invocation until the process ends. real includes user and sys times, and will usually be larger than the sum of the latter two. If this process is interrupted by an application with higher priority, or by a system event such as a hardware interrupt (IRQ), this time spent waiting is also computed under real.
- user: The time the process spent in user space, performing tasks that did not require kernel intervention.
- sys: The time spent by the kernel while performing tasks required by the user process. These tasks include opening files, reading and writing to files or I/O ports, memory allocation, thread creation and network related activities.
15.2. POSIX clocks
- CLOCK_REALTIME: represents the time in the real world, also referred to as 'wall time', meaning the time as read from the clock on the wall. This clock is used to timestamp events, and when interfacing with the user. It can be modified by a user with the right privileges. However, user modification should be used with caution, as it can lead to erroneous data if the clock has its value changed between two readings.
- CLOCK_MONOTONIC: represents the time, monotonically increasing, since system boot. This clock cannot be set by any process, and is the preferred clock for calculating the time difference between events. The following examples in this section use CLOCK_MONOTONIC as the POSIX clock.
Note
- clock_gettime()
- Linux System Programming by Robert Love
The function used to read a given POSIX clock is clock_gettime(), which is declared in <time.h>. The clock_gettime() function takes two parameters: the POSIX clock ID and a timespec structure which will be filled with the current value of that clock. The following example shows a program that measures the cost of reading the clock:
Example 15.2. Using clock_gettime() to measure the cost of reading POSIX clocks
#include <time.h>
int main(void)
{
    int rc;
    long i;
    struct timespec ts;
    /* read the clock 10 million times; rc and ts are not checked here */
    for(i=0; i<10000000; i++) {
        rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    }
    return 0;
}
clock_gettime(), to verify the value of the rc variable, or to ensure the content of the ts structure is to be trusted. The clock_gettime() manpage provides more information to help you write more reliable applications.
Important
clock_gettime() function must be linked with the rt library by adding '-lrt' to the gcc command line.
cc clock_timing.c -o clock_timing -lrt
15.2.1. CLOCK_MONOTONIC_COARSE and CLOCK_REALTIME_COARSE
clock_gettime() and gettimeofday() have a counterpart in the kernel, in the form of a system call. When the user process calls clock_gettime(), the corresponding C library (glibc) calls the sys_clock_gettime() system call which performs the requested operation and then returns the result to the user program.
CLOCK_MONOTONIC_COARSE and CLOCK_REALTIME_COARSE POSIX clocks was created in the form of a VDSO library function. The _COARSE variants are faster to read and have a precision (also known as resolution) of one millisecond (ms).
15.2.2. Using clock_getres() to compare clock resolution
clock_getres() function you can check the resolution of a given POSIX clock. clock_getres() uses the same two parameters as clock_gettime(): the ID of the POSIX clock to be used, and a pointer to the timespec structure where the result is returned. The following function enables you to compare the precision between CLOCK_MONOTONIC and CLOCK_MONOTONIC_COARSE:
#include <stdio.h>
#include <time.h>
int main(void)
{
    int rc;
    struct timespec res;
    rc = clock_getres(CLOCK_MONOTONIC, &res);
    if (!rc)
        printf("CLOCK_MONOTONIC: %ldns\n", res.tv_nsec);
    rc = clock_getres(CLOCK_MONOTONIC_COARSE, &res);
    if (!rc)
        printf("CLOCK_MONOTONIC_COARSE: %ldns\n", res.tv_nsec);
    return 0;
}
Example 15.3. Sample output of clock_getres
TSC:
# ./clock_resolution
CLOCK_MONOTONIC: 1ns
CLOCK_MONOTONIC_COARSE: 999848ns (about 1ms)
HPET:
# ./clock_resolution
CLOCK_MONOTONIC: 1ns
CLOCK_MONOTONIC_COARSE: 999848ns (about 1ms)
ACPI_PM:
# ./clock_resolution
CLOCK_MONOTONIC: 1ns
CLOCK_MONOTONIC_COARSE: 999848ns (about 1ms)
15.2.3. Using C code to compare clock resolution
The following code snippet shows the format of the data read from the CLOCK_MONOTONIC POSIX clock. All nine digits in the tv_nsec field of the timespec structure are meaningful, as the clock has a nanosecond resolution. The example program, named clock_test.c, is as follows:
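#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>	/* usleep() */

int main(void)
{
	int i;
	struct timespec ts;

	for (i = 0; i < 5; i++) {
		clock_gettime(CLOCK_MONOTONIC, &ts);
		/* Zero-pad tv_nsec to nine digits so the fraction prints correctly. */
		printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
		usleep(200);
	}
	return 0;
}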
Example 15.4. Sample output of clock_test.c and clock_test_coarse.c
# cc clock_test.c -o clock_test -lrt
# ./clock_test
218449.986980853
218449.987330908
218449.987590716
218449.987849549
218449.988108248
If you save the same code as clock_test_coarse.c and replace CLOCK_MONOTONIC with CLOCK_MONOTONIC_COARSE, the result looks something like:
# ./clock_test_coarse
218550.844862154
218550.844862154
218550.844862154
218550.845862154
218550.845862154
The _COARSE clocks have a one millisecond precision, so only the first three digits of the tv_nsec field of the timespec structure are significant. The result above could be read as:
# ./clock_test_coarse
218550.844
218550.844
218550.844
218550.845
218550.845
The _COARSE variants of the POSIX clocks are particularly useful in cases where timestamping can be performed with millisecond precision. The benefits are more evident on systems that use hardware clocks with high reading costs, such as ACPI_PM.
15.2.4. Using the time command to compare cost of reading clocks
Using the time command to time a program that reads the clock source 10 million times in a row, you can compare the costs of reading the CLOCK_MONOTONIC and CLOCK_MONOTONIC_COARSE representations of the available hardware clocks. The following example uses the TSC, HPET and ACPI_PM hardware clocks. For more information on how to decipher the output of the time command, see Section 15.1.1, “Reading hardware clock sources”.
Example 15.5. Comparing the cost of reading POSIX clocks
TSC:
# time ./clock_timing_monotonic
real 0m0.567s
user 0m0.559s
sys 0m0.002s
# time ./clock_timing_monotonic_coarse
real 0m0.120s
user 0m0.118s
sys 0m0.001s

HPET:
# time ./clock_timing_monotonic
real 0m12.257s
user 0m12.179s
sys 0m0.002s
# time ./clock_timing_monotonic_coarse
real 0m0.119s
user 0m0.118s
sys 0m0.000s

ACPI_PM:
# time ./clock_timing_monotonic
real 0m25.524s
user 0m0.451s
sys 0m24.932s
# time ./clock_timing_monotonic_coarse
real 0m0.119s
user 0m0.117s
sys 0m0.001s
As the output above shows, the sys time (the time spent by the kernel to perform tasks required by the user process) is greatly reduced when the _COARSE clocks are used. This is particularly evident in the ACPI_PM clock timings, which indicate that the _COARSE variants of POSIX clocks yield large performance gains on clocks with high reading costs.
Chapter 16. More information
16.1. Reporting bugs
- Check that you have the latest version of the Red Hat Enterprise Linux 6 kernel, then boot into it from the GRUB menu. Try reproducing the problem with the standard kernel. If the problem still occurs, report a bug against Red Hat Enterprise Linux 6, not MRG Realtime.
- If the problem does not occur when using the standard kernel, then the bug is probably the result of changes introduced in either:
- The upstream kernel on which MRG Realtime is based. For example, Red Hat Enterprise Linux 6 is based on the 2.6.32 kernel and MRG Realtime is based on the 3.8 kernel.
- MRG Realtime specific enhancements Red Hat has applied on top of the baseline (3.8) kernel
To determine which of these is responsible, try to reproduce the problem on an unmodified upstream 3.8 kernel. For this reason, in addition to providing the MRG Realtime kernel, we also provide a vanilla kernel variant. The vanilla kernel is the upstream kernel build without the MRG Realtime additions.
- Create a Bugzilla account.
- Log in and click on Enter A New Bug Report.
- You will need to identify the product the bug occurs in. MRG Realtime appears under Red Hat Enterprise MRG in the Red Hat products list. It is important that you choose the correct product.
- Continue to enter the bug information by assigning an appropriate component and giving a detailed problem description. When entering the problem description, be sure to include details of whether you were able to reproduce the problem on the standard Red Hat Enterprise Linux 6 kernel or the supplied vanilla kernel.
16.2. Further reading
- Red Hat Enterprise MRG product information
- MRG Realtime Tuning Guide and other Red Hat Enterprise MRG documentation
- Mailing list
- To post to the list, send mail to rhemrg-users-list@redhat.com
- Subscribe to the mailing list at: https://www.redhat.com/mailman/listinfo/rhemrg-users-list
Appendix A. Revision history
| Revision | Date |
|---|---|
| Revision 5-0 | Wed Nov 11 2015 |
| Revision 4-2 | Fri May 30 2014 |
| Revision 4-1 | Wed Sep 25 2013 |
| Revision 4-0 | Wed Feb 27 2013 |
| Revision 3-1 | Wed Dec 19 2012 |
| Revision 3-0 | Thu May 31 2012 |
| Revision 2-5 | Tue May 15 2012 |
| Revision 2-4 | Wed May 9 2012 |
| Revision 2-2 | Tue Apr 10 2012 |
| Revision 2-1 | Tue Feb 28 2012 |
| Revision 2-0 | Wed Dec 7 2011 |
| Revision 1-6 | Wed Nov 16 2011 |
| Revision 1-2 | Wed Oct 5 2011 |
| Revision 1-1 | Thu Sep 22 2011 |
| Revision 1-0 | Thu Jun 23 2011 |
| Revision 0.1-3 | Thu May 19 2011 |
| Revision 0.1-2 | Mon May 16 2011 |
| Revision 0.1-1 | Tue Apr 05 2011 |
| Revision 0.1-0 | Wed Feb 23 2011 |
