Chapter 2. Memory allocation

Linux-based operating systems use a virtual memory system. Any address referenced by a user-space application must be translated into a physical address. This is achieved through a combination of page tables and address translation hardware in the underlying computer system.
One consequence of having the translation mechanism in between a program and the actual memory is that the operating system can steal pages when required. This is achieved by marking a previously used page table entry as invalid, so that even under normal memory pressure, the operating system might scavenge pages from one application to give to another. This can have adverse affects on systems that require deterministic behavior. Instructions that normally execute in a fixed amount of time can take longer than normal because a page fault has been triggered.

2.1. Demand paging

Under Linux, all memory addresses generated by a program get passed through an address translation mechanism in the processor. The addresses are converted from a process-specific virtual address to a physical memory address. This is referred to as virtual memory.
The two main components in the translation mechanism are page tables and translation lookaside buffers (TLBs). Page tables are multi-level tables in physical memory that contain mappings for virtual to physical memory. These mappings are readable by the virtual memory translation hardware in the processor. TLBs are caches for page table translations.
When a page table entry has been assigned a physical address, it is referred to as the resident working set. When the operating system needs to free memory for other processes, it can remove pages from the working set. When this happens, any reference to a virtual address within that page will create a page fault, and the page will be reallocated. If the system is extremely low on physical memory, then this process will start to thrash, constantly stealing pages from processes, and never allowing a process to complete. The virtual memory statistics can be monitored by looking for the pgfault value in the /proc/vmstat file.
TLBs are hardware caches of virtual memory translations. Any processor core with a TLB will check the TLB in parallel with initiating a memory read of a page table entry. If the TLB entry for a virtual address is valid, the memory read is aborted and the value in the TLB is used for the address translation.
TLBs operate on the principle of locality of reference. This means that if code stays in one region of memory for a significant period of time (such as loops or call-related functions) then the TLB references avoid the main memory for address translations. This can significantly speed up processing times. When writing deterministic and fast code, use functions that maintain locality of reference. This can mean using loops rather than recursion. If recursion cannot be avoided, place the recursion call at the end of the function. This is called tail-recursion, which makes the code work in a relatively small region of memory and avoid fetching table translations from main memory.
A potential source of memory latency is called a minor page fault. They are created when a process attempts to access a portion of memory before it has been initialized. In this case, the system will need to perform some operations to fill the memory maps or other management structures. The severity of a minor page fault can depend on system load and other factors, but they are usually short and have a negligable impact.
A more severe memory latency is a major page fault. These can occur when the system has to synchronize memory buffers with the disk, swap memory pages belonging to other processes, or undertake any other Input/Output activity to free memory. This occurs when the processor references a virtual memory address that has not had a physical page allocated to it. The reference to an empty page causes the processor to execute a fault, and instructs the kernel code to allocate a page and return, all of which increases latency dramatically.
When writing a multi-threaded application, it is important to consider the machine topology when designing the data decomposition. Topology is the memory heirarchy, and includes CPU caches and the NUMA node. Sharing data information very tightly on CPUs in different cache and NUMA domains can lead to traffic problems and bottlenecks.
Contention can create drastic performance problems. On some hardware, the traffic on the various memory buses are not subject to any fairness rules. Always check the hardware you are using in order to avoid this.
Memory allocation errors can not always be eliminated through the use of CPU affinity, scheduling policies, and priorities. When an application shows a performance drop, it can be beneficial to check if it is being affected by page faults. There are a number of ways of doing this, but a simple method is to look at the process information in the /proc directory. For a particular process PID, use the cat command to view the /proc/PID/stat file. The relevant entries in this file are:
  • Field 2 - filename of the executable
  • Field 10 - number of minor page faults
  • Field 12 - number of major page faults
When a process encounters a page fault all its threads will be frozen until the kernel handles the fault. There are several ways to address this problem, although the best solution is to adjust the source code to avoid page faults.

Example 2.1. Using the /proc file to check for page faults

This example uses the /proc file to check for page faults in a running process.
Use the cat command and a pipe function to return only the second, tenth, and twelfth lines of the /proc/PID/stat file:
# cat /proc/3366/stat | cut -d\  -f2,10,12
(bash) 5389 0

In the above output, PID 3366 is bash, and it has reported 5389 minor page faults, and no major page faults.

Further reading

For more information, or for further reading, the following sources are related to the information given in this section:
  • Linux System Programming by Robert Love