Change in swap behavior between RHEL 7 and RHEL 8 kernels
Environment
- RHEL 8+
Issue
- Swap is being used even though vm.swappiness is low and plenty of memory is available in pagecache.
- vm.swappiness is properly configured and the cgroup swappiness issue has already been addressed.
Resolution
- Changes in virtual memory (VM) management between RHEL 7 and RHEL 8 may cause different patterns of swap usage, even under similar workloads and tunings.
- Swap usage or a trend towards 100% swap used is not necessarily indicative of memory pressure.
- Monitor other statistics to determine whether the system is under memory pressure (see Diagnostic Steps).
Root Cause
- Numerous changes to the Linux kernel VM algorithm have occurred in the years between the release of RHEL 7 and 8.
- The primary driver has been the speed advantage of solid state disks over traditional spinning disks.
- Swapping to SSD/NVMe devices has a much lower impact on performance than swapping to traditional spinning disks, and the kernel VM model has been updated to reflect this.
- In general, anonymous memory (memory owned by programs for general use) is now more likely to be considered for writing to swap, even with a low vm.swappiness.
- Pages are now weighted by their actual size. Previously a single transparent huge page (THP, typically equal to 512 normal pages) was weighted the same as 1 pagecache page, so THP users were less likely to put pressure on the file LRU.
- Historically, anonymous pages started on the active LRU. This meant that anonymous memory had to be demoted to inactive before being considered for eviction. Now both anonymous and file pages start out inactive.
- File pages that are faulted into memory, later evicted, and then faulted back in are considered refaulted. A sustained pattern of refaults is called cache thrashing.
- This occurs when the same files are read multiple times and their total size exceeds available memory/pagecache.
- In RHEL 8 the kernel is more aggressive about preventing cache thrashing and may evict/swap inactive anonymous pages in order to grow pagecache and reduce thrashing.
- This balancing does not consider active anonymous pages; that is, pages that are actively used will not be swapped out to grow pagecache.
- This concept is called workingset in the kernel. In /proc/vmstat there are a number of workingset metrics which describe this behavior.
- The vm.swappiness value for RHEL 8 now has a range of 0-200 (RHEL 7 and earlier was 0-100).
- A setting of vm.swappiness=0 will defer swapping nearly completely. We recommend avoiding a value of 0, or testing thoroughly before implementing it.
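The workingset metrics mentioned above can be inspected with a short script. The following is a minimal sketch, not an official tool: the helper names parse_vmstat and workingset_counters are hypothetical, and it assumes only that /proc/vmstat contains "name value" lines and that the relevant counters start with the prefix workingset_ (the exact set of counter names varies between kernel versions).

```python
from pathlib import Path


def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not a simple "name <unsigned integer>" pair.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def workingset_counters(stats):
    """Keep only the counters whose names start with 'workingset_'."""
    return {k: v for k, v in stats.items() if k.startswith("workingset_")}


if __name__ == "__main__":
    vmstat = Path("/proc/vmstat")
    # Only read the file where it exists (Linux systems).
    if vmstat.exists():
        counters = workingset_counters(parse_vmstat(vmstat.read_text()))
        for name, value in sorted(counters.items()):
            print(f"{name} {value}")
```

These counters are cumulative since boot, so comparing two readings taken some time apart is likely to be more informative than a single snapshot.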
Diagnostic Steps
- Swap utilization is not a good indicator of memory shortage. This is especially true for workloads with high volumes of I/O.
- Instead, the MemAvailable field in /proc/meminfo is a better indicator of how much memory a system can allocate before it is low on memory or runs out altogether. If this value trends towards zero, the system may be headed towards Out Of Memory conditions.
- In sar output, pswpin/s and pswpout/s report the pages swapped in and swapped out per second. If both are non-zero at the same time, the system may currently be under memory pressure.
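The checks above can also be approximated without sar by reading /proc directly. The sketch below is illustrative only: the function names are hypothetical, the 1 second sampling interval is an arbitrary choice, and it assumes that the cumulative pswpin and pswpout counters in /proc/vmstat are the source of the per-second rates that sar reports.

```python
import time
from pathlib import Path


def mem_available_kib(meminfo_text):
    """Return the MemAvailable value in KiB, or None if the field is absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    return None


def swap_counters(vmstat_text):
    """Return the cumulative (pswpin, pswpout) page counts."""
    counters = {"pswpin": 0, "pswpout": 0}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in counters:
            counters[parts[0]] = int(parts[1])
    return counters["pswpin"], counters["pswpout"]


def swap_rates(before, after, seconds):
    """Convert two (pswpin, pswpout) samples into pages per second."""
    return ((after[0] - before[0]) / seconds, (after[1] - before[1]) / seconds)


if __name__ == "__main__":
    meminfo, vmstat = Path("/proc/meminfo"), Path("/proc/vmstat")
    # Only sample where the files exist (Linux systems).
    if meminfo.exists() and vmstat.exists():
        interval = 1.0  # assumed sampling interval in seconds
        first = swap_counters(vmstat.read_text())
        time.sleep(interval)
        second = swap_counters(vmstat.read_text())
        rate_in, rate_out = swap_rates(first, second, interval)
        print(f"MemAvailable: {mem_available_kib(meminfo.read_text())} kB")
        print(f"pswpin/s: {rate_in:.1f}  pswpout/s: {rate_out:.1f}")
        if rate_in > 0 and rate_out > 0:
            print("Swap-in and swap-out are both active: possible memory pressure")
```

A single sample is only a snapshot; running this periodically and watching whether MemAvailable trends towards zero is closer to the guidance in the steps above.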
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.