The memory subsystem generates automatic reports on memory resources used by the tasks in a cgroup, and sets limits on memory use of those tasks.

Note

By default, the memory subsystem uses 40 bytes of memory per physical page on x86_64 systems. These resources are consumed even if memory is not used in any hierarchy. If you do not plan to use the memory subsystem, you can disable it to reduce the resource consumption of the kernel.
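To get a feel for what 40 bytes per physical page amounts to, the following sketch multiplies the page count by that figure. The function name memcg_overhead_bytes is hypothetical, and the default page size of 4096 bytes is an assumption that is typical for x86_64 but should be confirmed with getconf PAGESIZE on the target system.

```shell
# Estimate the bookkeeping overhead of the memory subsystem.
# Assumptions: 40 bytes per physical page (figure quoted in the note above)
# and a page size of 4096 bytes unless a second argument is given.
memcg_overhead_bytes() (
    ram_bytes=$1
    page_size=${2:-4096}
    echo $(( ram_bytes / page_size * 40 ))
)

# Example: a machine with 16 GiB of RAM.
memcg_overhead_bytes $(( 16 * 1024 * 1024 * 1024 ))
```

Under these assumptions, a system with 16 GiB of RAM spends 167772160 bytes (160 MiB) on this accounting.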

To permanently disable the memory subsystem, open the /boot/grub/grub.conf configuration file as root and append the following text to the line that starts with the kernel keyword:

cgroup_disable=memory

For more information on working with /boot/grub/grub.conf, see the Configuring the GRUB Boot Loader chapter in the Red Hat Enterprise Linux 6 Deployment Guide.

To temporarily disable the memory subsystem for a single session, perform the following steps when starting the system:

At the GRUB boot screen, press any key to enter the GRUB interactive menu.
Select Red Hat Enterprise Linux with the version of the kernel that you want to boot and press the a key to modify the kernel parameters.
Type cgroup_disable=memory at the end of the line and press Enter to exit GRUB edit mode.

With cgroup_disable=memory enabled, memory is not visible as an individually mountable subsystem and it is not automatically mounted when mounting all cgroups in a single hierarchy. Note that memory is currently the only subsystem that can be effectively disabled with cgroup_disable to save resources. Using this option with other subsystems only disables their usage, but does not cut their resource consumption. However, other subsystems do not consume as many resources as the memory subsystem.
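After rebooting, one way to confirm whether the memory subsystem is active is to inspect /proc/cgroups. The sketch below assumes that file uses the column layout subsys_name, hierarchy, num_cgroups, enabled, and that a disabled subsystem is listed with 0 in the last column. The function name memory_subsys_enabled is hypothetical.

```shell
# Print the "enabled" column (fourth field) of the memory line.
# Assumption: the file has the columns
#   subsys_name  hierarchy  num_cgroups  enabled
# The optional argument lets you point the function at a saved copy.
memory_subsys_enabled() (
    cgroups_file=${1:-/proc/cgroups}
    awk '$1 == "memory" { print $4 }' "$cgroups_file"
)

# Usage on a live system:
#   memory_subsys_enabled
```

An output of 0 would indicate that the subsystem is disabled, and 1 that it is available.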

The following tunable parameters are available for the memory subsystem:

memory.stat

reports a wide range of memory statistics, as described in the following table:

Table 3.2. Values reported by memory.stat
Statistic	Description
`cache`	page cache, including `tmpfs` (`shmem`), in bytes
`rss`	anonymous and swap cache, not including `tmpfs` (`shmem`), in bytes
`mapped_file`	size of memory-mapped files, including `tmpfs` (`shmem`), in bytes
`pgpgin`	number of pages paged into memory
`pgpgout`	number of pages paged out of memory
`swap`	swap usage, in bytes
`active_anon`	anonymous and swap cache on active least-recently-used (LRU) list, including `tmpfs` (`shmem`), in bytes
`inactive_anon`	anonymous and swap cache on inactive LRU list, including `tmpfs` (`shmem`), in bytes
`active_file`	file-backed memory on active LRU list, in bytes
`inactive_file`	file-backed memory on inactive LRU list, in bytes
`unevictable`	memory that cannot be reclaimed, in bytes
`hierarchical_memory_limit`	memory limit for the hierarchy that contains the `memory` cgroup, in bytes
`hierarchical_memsw_limit`	memory plus swap limit for the hierarchy that contains the `memory` cgroup, in bytes

Additionally, each of these statistics other than hierarchical_memory_limit and hierarchical_memsw_limit has a counterpart prefixed total_ that reports not only on the cgroup, but on all its children as well. For example, swap reports the swap usage by a cgroup and total_swap reports the total swap usage by the cgroup and all its child groups.

When you interpret the values reported by memory.stat, note how the various statistics inter-relate:

active_anon + inactive_anon = anonymous memory + file cache for tmpfs + swap cache
Therefore, active_anon + inactive_anon ≠ rss, because rss does not include tmpfs.
active_file + inactive_file = cache - size of tmpfs
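The last relation above can be rearranged to estimate how much of the page cache belongs to tmpfs: cache minus the sum of active_file and inactive_file. The following sketch reads the values from a memory.stat file, which lists one statistic name and its value per line. The function names stat_value and tmpfs_estimate_bytes are hypothetical helpers, not part of any cgroup tool.

```shell
# Print the value of a single statistic from a memory.stat file.
# The exact match on the first field keeps "cache" from matching "total_cache".
stat_value() (
    stat_file=$1
    key=$2
    awk -v key="$key" '$1 == key { print $2 }' "$stat_file"
)

# Estimate tmpfs usage as cache - (active_file + inactive_file),
# following the relation given in the text above.
tmpfs_estimate_bytes() (
    stat_file=$1
    cache=$(stat_value "$stat_file" cache)
    active_file=$(stat_value "$stat_file" active_file)
    inactive_file=$(stat_value "$stat_file" inactive_file)
    echo $(( cache - active_file - inactive_file ))
)

# Usage on a live system (the lab1 cgroup is taken from the examples
# in this section):
#   tmpfs_estimate_bytes /cgroup/memory/lab1/memory.stat
```

Because the counters are sampled at slightly different moments, treat the result as an approximation.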

memory.usage_in_bytes

reports the total current memory usage by processes in the cgroup (in bytes).

memory.memsw.usage_in_bytes

reports the sum of current memory usage plus swap space used by processes in the cgroup (in bytes).

memory.max_usage_in_bytes

reports the maximum memory used by processes in the cgroup (in bytes).

memory.memsw.max_usage_in_bytes

reports the maximum amount of memory and swap space used by processes in the cgroup (in bytes).

memory.limit_in_bytes

sets the maximum amount of user memory (including file cache). If no units are specified, the value is interpreted as bytes. However, it is possible to use suffixes to represent larger units — k or K for kilobytes, m or M for megabytes, and g or G for gigabytes. For example, to set the limit to 1 gigabyte, execute:

~]# echo 1G > /cgroup/memory/lab1/memory.limit_in_bytes

You cannot use memory.limit_in_bytes to limit the root cgroup; you can only apply values to groups lower in the hierarchy.

Write -1 to memory.limit_in_bytes to remove any existing limits.
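When comparing a configured limit with the byte counts reported by other files, it can help to expand the suffix yourself. The sketch below assumes that k, m, and g denote powers of 1024, which is consistent with the 100 MB = 104857600 bytes figure used in the example later in this section. The function name to_bytes is hypothetical.

```shell
# Convert a value with an optional k/K, m/M, or g/G suffix to bytes.
# Assumption: the suffixes are binary multiples (1024, 1024^2, 1024^3).
# Values without a suffix, including -1, are printed unchanged.
to_bytes() (
    value=$1
    case $value in
        *[kK]) echo $(( ${value%?} * 1024 )) ;;
        *[mM]) echo $(( ${value%?} * 1024 * 1024 )) ;;
        *[gG]) echo $(( ${value%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$value" ;;
    esac
)

to_bytes 1G
```

With these assumptions, 1G expands to 1073741824 bytes.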

memory.memsw.limit_in_bytes

sets the maximum amount for the sum of memory and swap usage. If no units are specified, the value is interpreted as bytes. However, it is possible to use suffixes to represent larger units — k or K for kilobytes, m or M for megabytes, and g or G for gigabytes.

You cannot use memory.memsw.limit_in_bytes to limit the root cgroup; you can only apply values to groups lower in the hierarchy.

Write -1 to memory.memsw.limit_in_bytes to remove any existing limits.

Important

It is important to set the memory.limit_in_bytes parameter before setting the memory.memsw.limit_in_bytes parameter: attempting to do so in the reverse order results in an error. This is because memory.memsw.limit_in_bytes becomes available only after all memory limitations (previously set in memory.limit_in_bytes) are exhausted.

Consider the following example: setting memory.limit_in_bytes = 2G and memory.memsw.limit_in_bytes = 4G for a certain cgroup will allow processes in that cgroup to allocate 2 GB of memory and, once exhausted, allocate another 2 GB of swap only. The memory.memsw.limit_in_bytes parameter represents the sum of memory and swap. Processes in a cgroup that does not have the memory.memsw.limit_in_bytes parameter set can potentially use up all the available swap (after exhausting the set memory limitation) and trigger an Out Of Memory situation caused by the lack of available swap.

The order in which the memory.limit_in_bytes and memory.memsw.limit_in_bytes parameters are set in the /etc/cgconfig.conf file is important as well. The following is a correct example of such a configuration:

memory {
    memory.limit_in_bytes = 1G;
    memory.memsw.limit_in_bytes = 1G;
}
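The same ordering applies when you set the limits from the command line. The following sketch wraps the two writes in a helper so that memory.limit_in_bytes is always written first. The function name set_memory_limits is hypothetical, and the sketch assumes the cgroup starts from its default, unlimited state.

```shell
# Write the memory limit first, then the memory-plus-swap limit.
# Arguments: cgroup directory, memory limit, memory+swap limit.
# The function stops at the first write that fails.
set_memory_limits() (
    cgroup_dir=$1
    mem_limit=$2
    memsw_limit=$3
    echo "$mem_limit" > "$cgroup_dir/memory.limit_in_bytes" || exit 1
    echo "$memsw_limit" > "$cgroup_dir/memory.memsw.limit_in_bytes" || exit 1
)

# Usage, matching the 2 GB memory and 4 GB memory-plus-swap example above
# (the lab1 cgroup is taken from the examples in this section):
#   set_memory_limits /cgroup/memory/lab1 2G 4G
```

With these values, tasks in the cgroup can use up to 4 GB - 2 GB = 2 GB of swap once the memory limit is reached.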

memory.failcnt

reports the number of times that memory usage has reached the limit set in memory.limit_in_bytes.

memory.memsw.failcnt

reports the number of times that memory plus swap usage has reached the limit set in memory.memsw.limit_in_bytes.

memory.soft_limit_in_bytes

enables flexible sharing of memory. Under normal circumstances, control groups are allowed to use as much of the memory as needed, constrained only by their hard limits set with the memory.limit_in_bytes parameter. However, when the system detects memory contention or low memory, control groups are forced to restrict their consumption to their soft limits. For example, to set the soft limit to 256 MB, execute:

~]# echo 256M > /cgroup/memory/lab1/memory.soft_limit_in_bytes

This parameter accepts the same suffixes as memory.limit_in_bytes to represent units. To have any effect, the soft limit must be set below the hard limit. If lowering the memory usage to the soft limit does not solve the contention, cgroups are pushed back as much as possible to make sure that one control group does not starve the others of memory. Note that soft limits take effect over a long period of time, since they involve reclaiming memory for balancing between memory cgroups.
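Because a soft limit has no effect unless it is below the hard limit, a small guard can catch mistakes before the value is written. The sketch below is one possible check. The function name set_soft_limit is hypothetical, and it expects the soft limit as a plain byte count so that it can be compared with the byte value read back from memory.limit_in_bytes.

```shell
# Refuse to set a soft limit that is not below the current hard limit.
# Arguments: cgroup directory, soft limit in bytes (no suffix).
set_soft_limit() (
    cgroup_dir=$1
    soft_bytes=$2
    hard_bytes=$(cat "$cgroup_dir/memory.limit_in_bytes") || exit 1
    if [ "$soft_bytes" -ge "$hard_bytes" ]; then
        echo "soft limit $soft_bytes is not below hard limit $hard_bytes" >&2
        exit 1
    fi
    echo "$soft_bytes" > "$cgroup_dir/memory.soft_limit_in_bytes"
)

# Usage (268435456 bytes is 256 MB; lab1 is the cgroup used in the
# examples in this section):
#   set_soft_limit /cgroup/memory/lab1 268435456
```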

memory.force_empty

when set to 0, empties memory of all pages used by tasks in the cgroup. This interface can only be used when the cgroup has no tasks. If memory cannot be freed, it is moved to a parent cgroup if possible. Use the memory.force_empty parameter before removing a cgroup to avoid moving out-of-use page caches to its parent cgroup.
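A typical cleanup sequence is to confirm that the cgroup has no tasks, write 0 to memory.force_empty, and then remove the cgroup directory. The following sketch covers the first two steps. The function name empty_cgroup_memory is hypothetical. It reads the tasks file rather than testing its size, on the assumption that files in the cgroup virtual file system do not report a meaningful size.

```shell
# Empty the memory of a cgroup, but only if it has no tasks.
# Argument: cgroup directory.
empty_cgroup_memory() (
    cgroup_dir=$1
    if [ -n "$(cat "$cgroup_dir/tasks")" ]; then
        echo "cgroup still has tasks: $cgroup_dir" >&2
        exit 1
    fi
    echo 0 > "$cgroup_dir/memory.force_empty"
)

# Usage, followed by removal of the cgroup (lab1 is the cgroup used in
# the examples in this section):
#   empty_cgroup_memory /cgroup/memory/lab1 && rmdir /cgroup/memory/lab1
```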

memory.swappiness

sets the tendency of the kernel to swap out process memory used by tasks in this cgroup instead of reclaiming pages from the page cache. This is the same tendency, calculated the same way, as set in /proc/sys/vm/swappiness for the system as a whole. The default value is 60. Values lower than 60 decrease the kernel's tendency to swap out process memory, values greater than 60 increase the kernel's tendency to swap out process memory, and values greater than 100 permit the kernel to swap out pages that are part of the address space of the processes in this cgroup.

Note that a value of 0 does not prevent process memory from being swapped out; swap out might still happen when there is a shortage of system memory because the global virtual memory management logic does not read the cgroup value. To lock pages completely, use mlock() instead of cgroups.

You cannot change the swappiness of the following groups:

the root cgroup, which uses the swappiness set in /proc/sys/vm/swappiness.
a cgroup that has child groups below it.

memory.move_charge_at_immigrate

allows moving charges associated with a task along with task migration. Charging is a way of giving a penalty to cgroups which access shared pages too often. These penalties, also called charges, are by default not moved when a task migrates from one cgroup to another. The pages allocated from the original cgroup still remain charged to it; the charge is dropped when the page is freed or reclaimed.

With memory.move_charge_at_immigrate enabled, the pages associated with a task are taken from the old cgroup and charged to the new cgroup. The following example shows how to enable memory.move_charge_at_immigrate:

~]# echo 1 > /cgroup/memory/lab1/memory.move_charge_at_immigrate

Charges are moved only when the moved task is a leader of a thread group. If there is not enough memory for the task in the destination cgroup, an attempt to reclaim memory is performed. If the reclaim is not successful, the task migration is aborted.

To disable memory.move_charge_at_immigrate, execute:

~]# echo 0 > /cgroup/memory/lab1/memory.move_charge_at_immigrate

memory.use_hierarchy

contains a flag (0 or 1) that specifies whether memory usage should be accounted for throughout a hierarchy of cgroups. If enabled (1), the memory subsystem reclaims memory from the children of any process that exceeds its memory limit. By default (0), the subsystem does not reclaim memory from a task's children.

memory.oom_control

contains a flag (0 or 1) that enables or disables the Out of Memory killer for a cgroup. If enabled (0), tasks that attempt to consume more memory than they are allowed are immediately killed by the OOM killer. The OOM killer is enabled by default in every cgroup using the memory subsystem; to disable it, write 1 to the memory.oom_control file:

~]# echo 1 > /cgroup/memory/lab1/memory.oom_control

When the OOM killer is disabled, tasks that attempt to use more memory than they are allowed are paused until additional memory is freed.

The memory.oom_control file also reports the OOM status of the current cgroup under the under_oom entry. If the cgroup is out of memory and tasks in it are paused, the under_oom entry reports the value 1.
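Scripts that monitor a cgroup can read individual entries from memory.oom_control instead of parsing the whole file by eye. The sketch below assumes the two-column "name value" layout shown in the example later in this section. The function name oom_field is hypothetical.

```shell
# Print the value of one entry (for example under_oom or oom_kill_disable)
# from a memory.oom_control file.
# Arguments: path to memory.oom_control, entry name.
oom_field() (
    control_file=$1
    key=$2
    awk -v key="$key" '$1 == key { print $2 }' "$control_file"
)

# Usage (lab1 is the cgroup used in the examples in this section):
#   if [ "$(oom_field /cgroup/memory/lab1/memory.oom_control under_oom)" = "1" ]; then
#       echo "lab1 is out of memory"
#   fi
```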

The memory.oom_control file is capable of reporting an occurrence of an OOM situation using the notification API. For more information, refer to Section 2.13, “Using the Notification API” and Example 3.3, “OOM Control and Notifications”.

3.7.1. Example Usage

Example 3.3. OOM Control and Notifications

The following example demonstrates how the OOM killer takes action when a task in a cgroup attempts to use more memory than allowed, and how a notification handler can report OOM situations:

Attach the memory subsystem to a hierarchy and create a cgroup:

~]# mount -t cgroup -o memory memory /cgroup/memory
~]# mkdir /cgroup/memory/blue

Set the amount of memory which tasks in the blue cgroup can use to 100 MB:
```
~]# echo 104857600 > /cgroup/memory/blue/memory.limit_in_bytes
```

Change into the blue directory and make sure the OOM killer is enabled:

~]# cd /cgroup/memory/blue
blue]# cat memory.oom_control
oom_kill_disable 0
under_oom 0

Move the current shell process into the tasks file of the blue cgroup so that all other processes started in this shell are automatically moved to the blue cgroup:
```
blue]# echo $$ > tasks
```

Start a test program that attempts to allocate more memory than the 100 MB limit you set for the blue cgroup. As soon as the blue cgroup runs out of free memory, the OOM killer kills the test program and reports Killed to the standard output:

blue]# ~/mem-hog
Killed

The following is an example of such a test program^[5]:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define KB (1024)
#define MB (1024 * KB)
#define GB (1024 * MB)

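int main(int argc, char *argv[])
{
	char *p;

again:
	/* Allocate and touch memory in 1 GB chunks until malloc() fails. */
	while ((p = (char *)malloc(GB)))
		memset(p, 0, GB);

	/* Fill the remaining space in 1 MB chunks. */
	while ((p = (char *)malloc(MB)))
		memset(p, 0, MB);

	/* Fill what is left in 1 KB chunks. */
	while ((p = (char *)malloc(KB)))
		memset(p, 0, KB);

	sleep(1);

	/* Retry indefinitely; the allocations are intentionally never freed. */
	goto again;

	return 0;
}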

Disable the OOM killer and rerun the test program. This time, the test program remains paused waiting for additional memory to be freed:
```
blue]# echo 1 > memory.oom_control
blue]# ~/mem-hog
```
While the test program is paused, note that the under_oom state of the cgroup has changed to indicate that the cgroup is out of available memory:
```
~]# cat /cgroup/memory/blue/memory.oom_control
oom_kill_disable 1
under_oom 1
```
Reenabling the OOM killer immediately kills the test program.

To receive notifications about every OOM situation, create a program as specified in Section 2.13, “Using the Notification API”. For example^[6]:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/eventfd.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
static inline void die(const char *msg)
{
	fprintf(stderr, "error: %s: %s(%d)\n", msg, strerror(errno), errno);
	exit(EXIT_FAILURE);
}

static inline void usage(void)
{
	fprintf(stderr, "usage: oom_notification <cgroup.event_control> <memory.oom_control>\n");
	exit(EXIT_FAILURE);
}

#define BUFSIZE 256

int main(int argc, char *argv[])
{
	char buf[BUFSIZE];
	int efd, cfd, ofd, wb;
	uint64_t u;

	if (argc != 3)
		usage();

	if ((efd = eventfd(0, 0)) == -1)
		die("eventfd");

	if ((cfd = open(argv[1], O_WRONLY)) == -1)
		die("cgroup.event_control");

	if ((ofd = open(argv[2], O_RDONLY)) == -1)
		die("memory.oom_control");

	if ((wb = snprintf(buf, BUFSIZE, "%d %d", efd, ofd)) >= BUFSIZE)
		die("buffer too small");

	if (write(cfd, buf, wb) == -1)
		die("write cgroup.event_control");

	if (close(cfd) == -1)
		die("close cgroup.event_control");

	for (;;) {
		if (read(efd, &u, sizeof(uint64_t)) != sizeof(uint64_t))
			die("read eventfd");

		printf("mem_cgroup oom event received\n");
	}

	return 0;
}

The above program detects OOM situations in the cgroup whose control files are specified as arguments on the command line, and reports each one by printing the string mem_cgroup oom event received to the standard output.

Run the above notification handler program in a separate console, specifying the blue cgroup's control files as arguments:
```
~]$ ./oom_notification /cgroup/memory/blue/cgroup.event_control /cgroup/memory/blue/memory.oom_control
```
In a different console, run the mem-hog test program to create an OOM situation, and watch the oom_notification program report it on the standard output:
```
blue]# ~/mem-hog
```

^[5] Source code provided by Red Hat Engineer František Hrbata.

^[6] Source code provided by Red Hat Engineer František Hrbata.