Chapter 35. Configuring an operating system to optimize memory access

You can configure the operating system to optimize memory access across workloads with the tools that are included in RHEL.

35.1. Tools for monitoring and diagnosing system memory issues

The following tools are available in Red Hat Enterprise Linux 8 for monitoring system performance and diagnosing performance problems related to system memory:

  • The vmstat tool, provided by the procps-ng package, displays reports of a system’s processes, memory, paging, block I/O, traps, disks, and CPU activity. It provides an instantaneous report of the average of these events since the machine was last booted, or since the previous report.
  • The valgrind framework provides instrumentation for user-space binaries. Install it using the yum install valgrind command. The framework includes a number of tools that you can use to profile and analyze program performance, such as the following (see the example after this list):

    • The memcheck tool is the default valgrind tool. It detects and reports on a number of memory errors that can be difficult to detect and diagnose, such as:

      • Memory access that should not occur
      • Undefined or uninitialized value use
      • Incorrectly freed heap memory
      • Pointer overlap
      • Memory leaks

        Note

        Memcheck can only report these errors; it cannot prevent them from occurring. However, memcheck logs an error message immediately before the error occurs.

    • The cachegrind tool simulates application interaction with a system’s cache hierarchy and branch predictor. It gathers statistics for the duration of the application’s execution and outputs a summary to the console.
    • The massif tool measures the heap space used by a specified application. It measures both useful space and any additional space allocated for bookkeeping and alignment purposes.
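
For example, to report memory statistics every second for five intervals with vmstat, and to profile a hypothetical binary named ./myapp with each of the valgrind tools:

    # vmstat 1 5
    # valgrind --tool=memcheck ./myapp
    # valgrind --tool=cachegrind ./myapp
    # valgrind --tool=massif ./myapp

Because memcheck is the default tool, running valgrind ./myapp without the --tool option is equivalent to the first valgrind command.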

Additional resources

  • vmstat(8) and valgrind(1) man pages
  • /usr/share/doc/valgrind-version/valgrind_manual.pdf file

35.2. Overview of a system’s memory

The Linux kernel is designed to maximize the utilization of a system’s memory resources (RAM). Due to these design characteristics, and depending on the memory requirements of the workload, part of the system’s memory is in use within the kernel on behalf of the workload, while a small part of the memory is free. This free memory is reserved for special system allocations, and for other low or high priority system services.

The rest of the system’s memory is dedicated to the workload itself, and divided into the following two categories:

File memory

Pages added in this category represent parts of files in permanent storage. These pages, from the page cache, can be mapped or unmapped in an application’s address spaces. Applications can map files into their address space using the mmap system call, or operate on files through the buffered I/O read and write system calls.

Unmapped pages can be reused both by buffered I/O system calls and by applications that map pages directly. As a result, the kernel keeps these pages in the page cache, especially when the system is not running any memory-intensive tasks, to avoid re-issuing costly I/O operations over the same set of pages.

Anonymous memory
Pages in this category are dynamically allocated by processes and are not related to files in permanent storage. These pages back the in-memory control structures of each task, such as the application stack and heap areas.
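
To observe the relative size of these two categories on a running system, you can read /proc/meminfo, where the Cached field approximates file (page cache) memory and the AnonPages field reports anonymous memory:

    # grep -E '^(Cached|AnonPages)' /proc/meminfo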

Figure 35.1. Memory usage patterns

35.3. Virtual memory parameters

The virtual memory parameters are listed in the /proc/sys/vm directory.
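
You can read each parameter either through its file in the /proc/sys/vm directory or with the sysctl command, and you can persist changes across reboots by placing them in a configuration file under /etc/sysctl.d/. For example, the following reads vm.overcommit_memory both ways; the output shown assumes the default value of 0:

    # cat /proc/sys/vm/overcommit_memory
    0
    # sysctl vm.overcommit_memory
    vm.overcommit_memory = 0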

The following are the available virtual memory parameters:

vm.dirty_ratio
Is a percentage value. When this percentage of the total system memory is modified, the system begins writing the modifications to the disk through the kernel flusher threads. The default value is 20 percent.
vm.dirty_background_ratio
Is a percentage value. When this percentage of total system memory is modified, the system begins writing the modifications to the disk in the background. The default value is 10 percent.
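
For example, to inspect the current thresholds and lower the background writeback threshold to 5 percent (an illustrative value; changes made with sysctl -w do not persist across reboots):

    # sysctl vm.dirty_ratio vm.dirty_background_ratio
    # sysctl -w vm.dirty_background_ratio=5
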
vm.overcommit_memory

Defines the conditions that determine whether a large memory request is accepted or denied. The default value is 0.

By default, the kernel checks whether a virtual memory allocation request fits into the present amount of memory (total RAM + swap) and rejects only large requests. Otherwise, virtual memory allocations are granted, which means they allow memory overcommitment.

Setting the overcommit_memory parameter’s value:

  • When this parameter is set to 1, the kernel performs no memory overcommit handling. This increases the possibility of memory overload, but improves performance for memory-intensive tasks.
  • When this parameter is set to 2, the kernel denies requests for memory equal to or larger than the sum of the total available swap space and the percentage of physical RAM specified in the overcommit_ratio. This reduces the risk of overcommitting memory, but is recommended only for systems with swap areas larger than their physical memory.
vm.overcommit_ratio
Specifies the percentage of physical RAM considered when overcommit_memory is set to 2. The default value is 50.
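
For example, to enable strict overcommit accounting and allow memory commitments up to the swap size plus 80 percent of physical RAM, where 80 is an illustrative value:

    # sysctl -w vm.overcommit_memory=2
    # sysctl -w vm.overcommit_ratio=80
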
vm.max_map_count
Defines the maximum number of memory map areas that a process can use. The default value is 65530. Increase this value if your application needs more memory map areas.
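
For example, to raise the limit for an application that needs a large number of memory map areas, where 262144 is an illustrative value:

    # sysctl -w vm.max_map_count=262144
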
vm.min_free_kbytes

Sets the size of the reserved free pages pool. It is also responsible for setting the min_pages, low_pages, and high_pages thresholds that govern the behavior of the Linux kernel’s page reclaim algorithms. It also specifies the minimum number of kilobytes to keep free across the system. The kernel calculates a specific value for each low memory zone, and assigns each zone a number of reserved free pages in proportion to its size.

Setting the vm.min_free_kbytes parameter’s value:

  • Increasing the parameter value effectively reduces the memory usable by the application working set. Therefore, use it only for kernel-driven workloads, where driver buffers need to be allocated in atomic contexts.
  • Decreasing the parameter value might render the kernel unable to service system requests if memory becomes heavily contended in the system.

    Warning

    Extreme values can be detrimental to the system’s performance. Setting the vm.min_free_kbytes to an extremely low value prevents the system from reclaiming memory effectively, which can result in system crashes and failure to service interrupts or other kernel services. However, setting vm.min_free_kbytes too high considerably increases system reclaim activity, causing allocation latency due to a false direct reclaim state. This might cause the system to enter an out-of-memory state immediately.

    The vm.min_free_kbytes parameter also sets a page reclaim watermark, called min_pages. This watermark is used as a factor when determining the two other memory watermarks, low_pages and high_pages, that govern page reclaim algorithms.
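
For example, to check the value that the kernel calculated for your system before considering a change:

    # sysctl vm.min_free_kbytes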

/proc/PID/oom_adj

In the event that a system runs out of memory, and the panic_on_oom parameter is set to 0, the oom_killer function kills processes, starting with the process that has the highest oom_score, until the system recovers.

The oom_adj parameter determines the oom_score of a process. This parameter is set per process identifier. A value of -17 disables the oom_killer for that process. Other valid values range from -16 to 15.

Note

Processes created by an adjusted process inherit the oom_score of that process.
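
For example, to disable the oom_killer function for a process with the hypothetical process identifier 1234:

    # echo -17 > /proc/1234/oom_adj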

vm.swappiness

The swappiness value, ranging from 0 to 200, controls the degree to which the system favors reclaiming memory from the anonymous memory pool, or the page cache memory pool.

Setting the swappiness parameter’s value:

  • Higher values favor file-mapped workloads, while swapping out the less actively accessed anonymous memory of processes from RAM. This is useful for file servers or streaming applications that depend on data from files in storage residing in memory to reduce I/O latency for service requests.
  • Lower values favor anonymous-mapped workloads, while reclaiming the page cache (file-mapped memory). This setting is useful for applications that do not depend heavily on file system information, and that heavily utilize dynamically allocated and private memory, such as mathematical and number-crunching applications, and some hardware virtualization hypervisors, such as QEMU.

    The default value of the vm.swappiness parameter is 60.

    Warning
    • Setting vm.swappiness to 0 aggressively avoids swapping anonymous memory out to disk, which increases the risk of processes being killed by the oom_killer function under memory-intensive or I/O-intensive workloads.
    • If you are using cgroupsV1, the per-cgroup swappiness value, which is exclusive to cgroupsV1, causes the system-wide swappiness configured by the vm.swappiness parameter to have little to no effect on the swap behavior of the system. This issue might lead to unexpected and inconsistent swap behavior.

      In such cases, consider using the vm.force_cgroup_v2_swappiness parameter.

      For more information, see the Premature swapping with swappiness=0 while there is still plenty of pagecache to be reclaimed KCS solution.
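
For example, to favor anonymous memory for a number-crunching workload, lower the value; 10 is an illustrative value:

    # sysctl -w vm.swappiness=10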

force_cgroup_v2_swappiness
This control deprecates the per-cgroup swappiness value, which is available only in cgroupsV1. Most, if not all, system and user processes run within a cgroup, and the per-cgroup swappiness value defaults to 60. As a result, the system-wide swappiness value can have little effect on the swap behavior of the system. If you do not need the per-cgroup swappiness feature, configure your system with force_cgroup_v2_swappiness=1 to get more consistent swappiness behavior across the whole system.
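
For example, to enable this behavior:

    # sysctl -w vm.force_cgroup_v2_swappiness=1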

Additional resources

  • sysctl(8) man page

35.4. File system parameters

The file system parameters are listed in the /proc/sys/fs directory. The following are the available file system parameters:

aio-max-nr
Defines the maximum allowed number of events in all active asynchronous input/output contexts. The default value is 65536, and modifying this value does not pre-allocate or resize any kernel data structures.
file-max

Determines the maximum number of file handles for the entire system. The default value on Red Hat Enterprise Linux 8 is either 8192 or one tenth of the free memory pages available at the time the kernel starts, whichever is higher.

Raising this value can resolve errors caused by a lack of available file handles.
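
For example, to inspect both of these limits and raise file-max, where 1000000 is an illustrative value:

    # sysctl fs.aio-max-nr fs.file-max
    # sysctl -w fs.file-max=1000000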

Additional resources

  • sysctl(8) man page

35.5. Kernel parameters

The default values for the kernel parameters are located in the /proc/sys/kernel/ directory. These are either the default values provided by the kernel or values specified by a user via sysctl.

The following are the available kernel parameters used to set up limits for the msg* and shm* System V IPC (sysvipc) system calls:

msgmax
Defines the maximum allowed size, in bytes, of any single message in a message queue. This value must not exceed the size of the queue (msgmnb). Use the sysctl kernel.msgmax command to determine the current msgmax value on your system.
msgmnb
Defines the maximum size, in bytes, of a single message queue. Use the sysctl kernel.msgmnb command to determine the current msgmnb value on your system.
msgmni
Defines the maximum number of message queue identifiers, and therefore the maximum number of queues. Use the sysctl kernel.msgmni command to determine the current msgmni value on your system.
shmall
Defines the total number of shared memory pages that can be used on the system at one time. For example, a page is 4096 bytes on the AMD64 and Intel 64 architectures. Use the sysctl kernel.shmall command to determine the current shmall value on your system.
shmmax
Defines the maximum size, in bytes, of a single shared memory segment allowed by the kernel. Shared memory segments up to 1 GB are now supported in the kernel. Use the sysctl kernel.shmmax command to determine the current shmmax value on your system.
shmmni
Defines the system-wide maximum number of shared memory segments. The default value is 4096 on all systems.
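
For example, to review all of these limits at once:

    # sysctl kernel.msgmax kernel.msgmnb kernel.msgmni kernel.shmall kernel.shmmax kernel.shmmni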

Additional resources

  • sysvipc(7) and sysctl(8) man pages