Premature swapping with swappiness=0 while there is still plenty of pagecache to be reclaimed
Red Hat Insights can detect this issue
Environment
- Red Hat Enterprise Linux 8
- Red Hat Enterprise Linux 9
- systems utilizing cgroups v1
Issue
- Even if memory is under pressure with swappiness=0, anonymous memory is swapped out instead of the inactive pagecache being reclaimed.
- The system-wide swappiness value specified at /proc/sys/vm/swappiness has little-to-no effect on the swap characteristics of a system with cgroups v1. This issue may lead to unexpected and inconsistent swap behavior.
Resolution
In the following, various settings for /proc, /sys and sysctl are discussed. These can be applied by various means, e.g. via custom systemd services or /etc/rc.local.
Red Hat Enterprise Linux 8
To address this issue, Red Hat Enterprise Linux Engineering created a new sysctl option: vm.force_cgroup_v2_swappiness. When it is set to 1, every cgroup's memory.swappiness value is deprecated, and all per-cgroup swappiness values mirror the system-wide vm.swappiness sysctl value (i.e. the /proc/sys/vm/swappiness file). As a result, the memory swapping behavior of cgroups is more consistent. This is the recommended solution while using cgroups v1.
The kernel may need to be updated to a version patched with the new sysctl:
- Red Hat Enterprise Linux 8.7 - update to kernel-4.18.0-425.3.1.el8.x86_64 or later, as per Errata: RHSA-2022:7683
- Red Hat Enterprise Linux 8.6.z (EUS) - update to kernel-4.18.0-372.36.1.el8_6.x86_64 or later, as per Errata: RHSA-2022:8809
- Red Hat Enterprise Linux 8.4.z (EUS) - update to kernel-4.18.0-305.76.1.el8_4.x86_64 or later, as per Errata: RHSA-2023:0496
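Once a patched kernel is running, the tunable can be set like any other sysctl. The following is a minimal sketch rather than an official procedure: the helper name enable_forced_swappiness and the drop-in file name 99-cgroup-swappiness.conf are hypothetical, and the PROC_SYS and CONF_DIR variables exist only so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Sketch: enable vm.force_cgroup_v2_swappiness if the running kernel exposes it.
# PROC_SYS and CONF_DIR are overridable (assumption, for dry runs only).
PROC_SYS="${PROC_SYS:-/proc/sys}"
CONF_DIR="${CONF_DIR:-/etc/sysctl.d}"

enable_forced_swappiness() {
    knob="$PROC_SYS/vm/force_cgroup_v2_swappiness"
    # The file is only present on kernels that carry the patch.
    if [ ! -e "$knob" ]; then
        echo "kernel lacks vm.force_cgroup_v2_swappiness; update the kernel first" >&2
        return 1
    fi
    # Apply the setting to the running system ...
    echo 1 > "$knob" || return 1
    # ... and persist it across reboots (hypothetical drop-in file name).
    echo "vm.force_cgroup_v2_swappiness = 1" > "$CONF_DIR/99-cgroup-swappiness.conf"
}
```

Run as root, enable_forced_swappiness should either report that the kernel needs updating or leave the tunable set to 1, after which the per-cgroup values are expected to follow vm.swappiness.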
Red Hat Enterprise Linux 9
Red Hat Enterprise Linux 9 uses cgroups v2 by default, which is not subject to per-cgroup swappiness values. If the system is switched to cgroups v1, it is recommended to check whether the issue is present.
Workarounds and setting adjustments
The following workarounds are also available in case the recommended solution cannot be used.
Workaround #1: Switch to cgroup v2
- A possible workaround to mitigate the issue is to use cgroup v2 (requires a reboot):
# grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"
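After the reboot, it can be useful to confirm which hierarchy is actually mounted. The sketch below relies on the filesystem type reported for /sys/fs/cgroup (cgroup2fs for the unified hierarchy, tmpfs for the v1 layout); both helper names are hypothetical.

```shell
#!/bin/sh
# Sketch: report which cgroup hierarchy is mounted at /sys/fs/cgroup.
# Both function names are illustrative, not part of any Red Hat tooling.
cgroup_mode_from_fstype() {
    case "$1" in
        cgroup2fs) echo "v2" ;;       # unified hierarchy
        tmpfs)     echo "v1" ;;       # tmpfs holding per-controller v1 mounts
        *)         echo "unknown" ;;
    esac
}

current_cgroup_mode() {
    # stat -f -c %T prints the filesystem type name of the given mount point.
    cgroup_mode_from_fstype "$(stat -f -c %T "${1:-/sys/fs/cgroup}")"
}
```

Calling current_cgroup_mode with no arguments inspects /sys/fs/cgroup; a result of v1 after the reboot would suggest the kernel argument did not take effect.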
Workaround #2: Adjust swappiness value of all existing cgroups
- If it is not feasible to migrate to cgroups v2 for any reason, a solution would be to adjust all memory cgroups' swappiness to the desired value, preferably in a pre-order fashion with regard to the filesystem structure.
For example, the following command can be sufficient:
# for cgfile in $(find /sys/fs/cgroup -name '*swappiness'); do echo $(cat /proc/sys/vm/swappiness) > $cgfile; done
NOTE: The find command unfortunately doesn't return the files in a pre-order fashion, hence it is prone to a possible race condition with cgroup creation. It is recommended to confirm afterwards that all existing cgroups have correctly set memory.swappiness values.
Ideally, a service can be crafted to do this at boot time, specifically with After=systemd-sysctl.service.
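One way to reduce the ordering concern is to sort the memory.swappiness files by path depth before writing, so that parent cgroups are updated before their children. This is a minimal sketch under that assumption, not a tested Red Hat procedure; the function name apply_swappiness is hypothetical.

```shell
#!/bin/sh
# Sketch: copy one swappiness value into every cgroup v1 memory.swappiness
# file, shallowest directories first so parents are written before children.
apply_swappiness() {
    root="$1"     # e.g. /sys/fs/cgroup/memory
    value="$2"    # e.g. "$(cat /proc/sys/vm/swappiness)"
    find "$root" -name memory.swappiness |
        awk -F/ '{ print NF "\t" $0 }' |   # prefix each path with its depth
        sort -n |                           # shallowest paths first
        cut -f2- |                          # drop the depth prefix again
        while IFS= read -r f; do
            echo "$value" > "$f" || echo "failed: $f" >&2
        done
}
```

On a cgroups v1 system this might be invoked as root with apply_swappiness /sys/fs/cgroup/memory "$(cat /proc/sys/vm/swappiness)", for example as the ExecStart= command of a oneshot unit. Sorting narrows the race with cgroup creation but does not eliminate it, so the values should still be verified afterwards.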
Workaround #3: Push the desired global swappiness value into initramfs
- To use per-cgroup swappiness and to change the default value from 60, the following change can be done to specify the desired swappiness value in the /etc/sysctl.conf file:
vm.swappiness=##
- After setting this value, the initramfs will need to be refreshed and a reboot of the system will be required. The initramfs can be refreshed with the command dracut -f.
- Note: This will change the default swappiness for the user.slice, init.scope, and machine.slice cgroups; however, this will have no effect on the system.slice cgroup, and may still lead to unexpected swap behavior.
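When the value is changed more than once, appending to /etc/sysctl.conf can leave duplicate vm.swappiness lines behind. The sketch below replaces any existing assignment instead; the function name set_swappiness_conf and the example value 10 are hypothetical.

```shell
#!/bin/sh
# Sketch: set vm.swappiness in a sysctl configuration file, replacing any
# existing vm.swappiness assignment instead of appending a duplicate.
set_swappiness_conf() {
    conf="$1"     # normally /etc/sysctl.conf
    value="$2"    # desired swappiness value
    touch "$conf" || return 1
    tmp="$(mktemp)" || return 1
    # Keep every line except active vm.swappiness assignments
    # (grep exits non-zero when nothing is left, which is not an error here).
    grep -v -E '^[[:space:]]*vm\.swappiness[[:space:]]*=' "$conf" > "$tmp" || true
    echo "vm.swappiness = $value" >> "$tmp"
    # Write back in place so the file keeps its ownership and mode.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}

# On a real system (as root), followed by the steps from the text:
#   set_swappiness_conf /etc/sysctl.conf 10
#   dracut -f
#   reboot
```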
Considerations for virtual guests
- If force_cgroup_v2_swappiness=1 cannot be set and host-side swapping is occurring from memory pressure inside a guest, the swappiness value can be controlled for each guest or all guests.
- To have all virtual machine guests inherit the same swappiness value, the following command can be run before starting the virtual machines with libvirtd:
# echo [value] > /sys/fs/cgroup/memory/machine.slice/memory.swappiness
- To change the value after booting the guest, the following command can be run, specifying the guest name in the appropriate location:
# echo [value] > /sys/fs/cgroup/memory/machine.slice/<GUEST_NAME>/memory.swappiness
- Note: It is recommended to set the swappiness value for every cgroup created on the system if the desired result is for the system to honor the specified swappiness value across all cgroups.
Root Cause
- In cases where there is high memory pressure and page reclamation is needed, users may experience swapping earlier or more aggressively than expected with regard to the swappiness value. This issue is due to the fact that systemd runs its processes within cgroups and the root swappiness value has little-to-no effect on swap heuristics.
- In cgroups v1 there is the per-cgroup swappiness value memory.swappiness. This value controls the swap behavior of the given cgroup. These cgroups, which are initialized at boot, get created before the sysctl service is able to run and properly set the desired swappiness value. This leads to a default swappiness value of 60 for the processes running on the system.
- This is not an issue in cgroups v2 as there is no swappiness parameter available to the memory controller in cgroups v2, and as such, cgroups v2 will utilize the system-wide vm.swappiness value.
Diagnostic Steps
- All the per-cgroup memory.swappiness values are set to 60 (the default value) while the system-wide vm.swappiness is set to 1.
$ find /sys/fs/cgroup/memory/ -name memory.swappiness -exec cat {} \;|uniq -c
      1 1    <-- sys/fs/cgroup/memory/memory.swappiness
    117 60   <-- memory.swappiness under all the .slice/scope
$ sysctl -a | grep vm.swappiness
vm.swappiness = 1
- To list the memory cgroups that are utilized by your system you can run:
# grep . /sys/fs/cgroup/memory/*/memory.swappiness
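The checks above can be combined into a report that names only the cgroups whose value differs from the expected one, at any depth. This is a minimal sketch; the function name report_swappiness_mismatch is hypothetical.

```shell
#!/bin/sh
# Sketch: list cgroup v1 memory cgroups whose memory.swappiness differs
# from an expected value (normally the system-wide vm.swappiness).
report_swappiness_mismatch() {
    root="$1"        # e.g. /sys/fs/cgroup/memory
    expected="$2"    # e.g. "$(cat /proc/sys/vm/swappiness)"
    find "$root" -name memory.swappiness | sort |
        while IFS= read -r f; do
            actual="$(cat "$f")"
            # Print one line per cgroup that does not match.
            [ "$actual" = "$expected" ] || echo "$f: $actual (expected $expected)"
        done
}
```

For example, report_swappiness_mismatch /sys/fs/cgroup/memory "$(cat /proc/sys/vm/swappiness)" prints nothing when every memory cgroup already matches the system-wide setting.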
6 Comments
Could some details regarding Workaround #2 be provided?
user.slice and machine.slice? system-swappiness.slice?
Is the crafted service unit meant to imply I add this to an existing service where the swappiness is not honored? The statement could be interpreted as suggesting the creation of a new service unit which somehow fixes the issue, but not explaining what ExecStart would contain.
Does simply starting the service after systemd-sysctl.service cause any existing vm.swappiness value in the /etc/sysctl.conf or /etc/sysctl.d/ directory to be honored, but requires updating initramfs?
Thanks in advance for clarification.
Any and all cgroups (version 1) under the memory controller inherit the parent cgroup's memory.swappiness value. I'm not quite sure that you can specify a memory.swappiness via a systemd.unit (.slice) option; at least I don't see such in either the systemd.exec or systemd.resource-control man pages.
The statement indeed suggests the creation of a dedicated new system.service unit. As ExecStart you can utilize any command (or custom script) that would do exactly what Workaround #2 suggests: change all existing cgroups' memory.swappiness values to a desired value. For convenience I updated the Solution with an example command and I'll also print here the following for inspiration:
Last, but not least, when you start a service, regardless of any After or Before timing, the new service cgroup inherits its parent's memory.swappiness value (be it some custom slice, or the system.slice or whatever). I think you might have been confused by the mix of information in workaround #2, hence I updated the Solution to separate the information (it seems that over time people have put another workaround in the mix and didn't separate it properly :P). The workaround #3 (now separated) is that you can recreate your booting initramfs such that it already contains a new global default swappiness value. This way the value is applied in the dracut boot phase and hence already changed when systemd creates its cgroup hierarchies. (Disclaimer: I personally haven't tested this one, so this explanation is rather theory-only).
The article is now much clearer with Workaround #2 providing a command to resolve the issue, which can be incorporated into the service unit. The separation of initramfs into Workaround #3 rather than as an extra step in #2 when utilizing systemd was especially helpful in clearing things up.
Thank you very much
Hello
Maybe a silly question, but I can't find any directory /sys/fs/cgroup/memory/init.slice as suggested in Workaround #3. I did find a /sys/fs/cgroup/memory/init.scope directory containing memory.swappiness. A typo?
And I found no machine.slice under /sys/fs/cgroup
Thanks in advance
Eric
Yes, it seems init.slice is a typo and should be init.scope.
Machine.slice, or any other ".slice" for that matter, should be under a specific controller (possibly under each controller). So not under /sys/fs/cgroup, but under /sys/fs/cgroup/<controller_name> (controllers include: memory, cpu, cpuset, ...).
Ultimately, when in doubt, you can always use ls to see what cgroups exist or not - the cgroupfs under (default) /sys/fs/cgroup[/<controller_name>] is quite clearly readable such that directories correspond to existing cgroups. Optionally, I recommend reading up on general overview cgroup docs, for example kernel Docs and/or man pages.
The patch mentioned for RHEL 8.6 and 8.7 kernel has also been applied to RHEL 8.4 kernel 4.18.0-305.76.1. This article hasn't been updated. This is the sysctl value to "force_cgroup_v2_swappiness tuneable to deprecate cgv1 per-cgroup swappiness"