Chapter 20. Using systemd to manage resources used by applications

RHEL 9 moves the resource management settings from the process level to the application level by binding the system of cgroup hierarchies with the systemd unit tree. Therefore, you can manage the system resources with the systemctl command, or by modifying the systemd unit files.

To achieve this, systemd takes various configuration options from the unit files or directly via the systemctl command. Then systemd applies those options to specific process groups by utilizing the Linux kernel system calls and features like cgroups and namespaces.

Note

You can review the full set of configuration options for systemd in the following manual pages:

  • systemd.resource-control(5)
  • systemd.exec(5)

20.1. Allocating system resources using systemd

To modify the distribution of system resources, you can apply one or more of the following distribution models:

Weights

You can distribute the resource by adding up the weights of all sub-groups and giving each sub-group the fraction matching its ratio against the sum.

For example, if you have 10 cgroups, each with weight of value 100, the sum is 1000. Each cgroup receives one tenth of the resource.

Weights are usually used to distribute stateless resources. For example, the CPUWeight= option is an implementation of this resource distribution model.
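
The weight arithmetic described above can be sketched in Python. This is an illustrative calculation only, not the kernel's scheduling logic; the function name weight_shares and the cgroup names are hypothetical.

```python
from fractions import Fraction

def weight_shares(weights):
    """Return each sibling cgroup's fraction of a contended resource.

    `weights` maps a (hypothetical) cgroup name to its weight value.
    """
    total = sum(weights.values())
    return {name: Fraction(weight, total) for name, weight in weights.items()}

# Ten sibling cgroups, each with a weight of 100, as in the example above.
siblings = {f"group{i}": 100 for i in range(10)}
shares = weight_shares(siblings)
print(shares["group0"])  # 1/10
```

Raising the weight of one sibling increases its fraction and shrinks the fractions of all the others, because the sum in the denominator grows.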

Limits

A cgroup can consume up to the configured amount of the resource. The sum of sub-group limits can exceed the limit of the parent cgroup. Therefore, it is possible to overcommit resources in this model.

For example, the MemoryMax= option is an implementation of this resource distribution model.
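
A minimal Python sketch of what overcommitting limits means. The helper name is_overcommitted and the byte values are hypothetical and chosen only for illustration.

```python
def is_overcommitted(parent_limit, child_limits):
    """Return True if the children's limits add up to more than the parent's."""
    return sum(child_limits) > parent_limit

# Hypothetical values in bytes: a 1 GiB parent with three 512 MiB children.
GIB = 1024 ** 3
MIB = 1024 ** 2
print(is_overcommitted(1 * GIB, [512 * MIB] * 3))  # True
```

Each child can still use up to its own limit on its own, but all children together remain bounded by the limit of the parent.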

Protections

You can set up a protected amount of a resource for a cgroup. If the resource usage is below the protection boundary, the kernel will try not to penalize this cgroup in favor of other cgroups that compete for the same resource. An overcommit is also possible.

For example, the MemoryLow= option is an implementation of this resource distribution model.

Allocations

Exclusive allocations of an absolute amount of a finite resource. An overcommit is not possible. An example of this resource type in Linux is the real-time budget.

Unit file option

A setting for resource control configuration.

For example, you can configure CPU resource with options like CPUAccounting=, or CPUQuota=. Similarly, you can configure memory or I/O resources with options like AllowedMemoryNodes= and IOAccounting=.

Procedure

To change the required value of the unit file option of your service, you can adjust the value in the unit file, or use the systemctl command:

  1. Check the assigned values for the service of your choice.

    # systemctl show --property <unit file option> <service name>
  2. Set the required value of the unit file option:

    # systemctl set-property <service name> <unit file option>=<value>

Verification steps

  • Check the newly assigned values for the service of your choice.

    # systemctl show --property <unit file option> <service name>
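
When you script the check above, you can turn the Name=value lines that systemctl show prints into a dictionary. The following Python sketch assumes that output format; the function name parse_show_output and the sample values are hypothetical.

```python
def parse_show_output(text):
    """Parse `systemctl show --property ...` output into a dict.

    Each non-empty line is expected to look like `Name=value`.
    """
    props = {}
    for line in text.splitlines():
        if "=" in line:
            name, _, value = line.partition("=")
            props[name] = value
    return props

# Hypothetical output for an illustrative service.
sample = "CPUWeight=200\nMemoryMax=1536000\n"
print(parse_show_output(sample))
```

Comparing the dictionary from before and after the set-property call is one way to confirm that the new value was applied.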

Additional resources

  • systemd.resource-control(5), systemd.exec(5) manual pages

20.2. Role of systemd in resource management

The core function of systemd is service management and supervision. The systemd system and service manager ensures that managed services start at the right time and in the correct order during the boot process. The services have to run smoothly to use the underlying hardware platform optimally. Therefore, systemd also provides capabilities to define resource management policies, and to tune various options, which can improve the performance of the service.

Important

In general, Red Hat recommends that you use systemd for controlling the usage of system resources. You should manually configure the cgroups virtual file system only in special cases, for example, when you need to use cgroup-v1 controllers that have no equivalents in the cgroup-v2 hierarchy.

20.3. Overview of systemd hierarchy for cgroups

On the backend, the systemd system and service manager makes use of the slice, the scope and the service units to organize and structure processes in the control groups. You can further modify this hierarchy by creating custom unit files or using the systemctl command. Also, systemd automatically mounts hierarchies for important kernel resource controllers at the /sys/fs/cgroup/ directory.

Three systemd unit types are used for resource control:

  • Service - A process or a group of processes, which systemd started according to a unit configuration file. Services encapsulate the specified processes so that they can be started and stopped as one set. Services are named in the following way:

    <name>.service
  • Scope - A group of externally created processes. Scopes encapsulate processes that are started and stopped by arbitrary processes through the fork() function and then registered by systemd at runtime. For example, user sessions, containers, and virtual machines are treated as scopes. Scopes are named as follows:

    <name>.scope
  • Slice - A group of hierarchically organized units. Slices organize a hierarchy in which scopes and services are placed. The actual processes are contained in scopes or in services. Every name of a slice unit corresponds to the path to a location in the hierarchy. The dash ("-") character acts as a separator of the path components to a slice from the -.slice root slice. In the following example:

    <parent-name>.slice

    parent-name.slice is a sub-slice of parent.slice, which is a sub-slice of the -.slice root slice. parent-name.slice can have its own sub-slice named parent-name-name2.slice, and so on.

The service, the scope, and the slice units directly map to objects in the control group hierarchy. When these units are activated, they map directly to control group paths built from the unit names.
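
The slice naming rule described above can be sketched in Python: every dash-separated prefix of a slice name is itself a parent slice. The function name slice_to_cgroup_path is hypothetical, and the sketch covers slice units only.

```python
def slice_to_cgroup_path(slice_name):
    """Expand a slice unit name into its nested path below the root slice."""
    if not slice_name.endswith(".slice"):
        raise ValueError("not a slice unit name: " + slice_name)
    stem = slice_name[: -len(".slice")]
    if stem == "-":
        return "/"  # the -.slice root slice
    parts = stem.split("-")
    # Each prefix of the dash-separated name is itself a parent slice.
    prefixes = ["-".join(parts[: i + 1]) + ".slice" for i in range(len(parts))]
    return "/" + "/".join(prefixes)

print(slice_to_cgroup_path("parent-name.slice"))
# /parent.slice/parent-name.slice
```

An escaped dash such as \x2d in a unit name contains no literal dash character, so it is not treated as a path separator.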

The following is an abbreviated example of a control group hierarchy:

Control group /:
-.slice
├─user.slice
│ ├─user-42.slice
│ │ ├─session-c1.scope
│ │ │ ├─ 967 gdm-session-worker [pam/gdm-launch-environment]
│ │ │ ├─1035 /usr/libexec/gdm-x-session gnome-session --autostart /usr/share/gdm/greeter/autostart
│ │ │ ├─1054 /usr/libexec/Xorg vt1 -displayfd 3 -auth /run/user/42/gdm/Xauthority -background none -noreset -keeptty -verbose 3
│ │ │ ├─1212 /usr/libexec/gnome-session-binary --autostart /usr/share/gdm/greeter/autostart
│ │ │ ├─1369 /usr/bin/gnome-shell
│ │ │ ├─1732 ibus-daemon --xim --panel disable
│ │ │ ├─1752 /usr/libexec/ibus-dconf
│ │ │ ├─1762 /usr/libexec/ibus-x11 --kill-daemon
│ │ │ ├─1912 /usr/libexec/gsd-xsettings
│ │ │ ├─1917 /usr/libexec/gsd-a11y-settings
│ │ │ ├─1920 /usr/libexec/gsd-clipboard
…​
├─init.scope
│ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 18
└─system.slice
  ├─rngd.service
  │ └─800 /sbin/rngd -f
  ├─systemd-udevd.service
  │ └─659 /usr/lib/systemd/systemd-udevd
  ├─chronyd.service
  │ └─823 /usr/sbin/chronyd
  ├─auditd.service
  │ ├─761 /sbin/auditd
  │ └─763 /usr/sbin/sedispatch
  ├─accounts-daemon.service
  │ └─876 /usr/libexec/accounts-daemon
  ├─example.service
  │ ├─ 929 /bin/bash /home/jdoe/example.sh
  │ └─4902 sleep 1
  …​

The example above shows that services and scopes contain processes and are placed in slices that do not contain processes of their own.

20.4. Listing systemd units

The following procedure describes how to use the systemd system and service manager to list its units.

Procedure

  • To list all active units on the system, run the systemctl command. The terminal returns output similar to the following example:

    # systemctl
    UNIT                                                LOAD   ACTIVE SUB       DESCRIPTION
    …​
    init.scope                                          loaded active running   System and Service Manager
    session-2.scope                                     loaded active running   Session 2 of user jdoe
    abrt-ccpp.service                                   loaded active exited    Install ABRT coredump hook
    abrt-oops.service                                   loaded active running   ABRT kernel log watcher
    abrt-vmcore.service                                 loaded active exited    Harvest vmcores for ABRT
    abrt-xorg.service                                   loaded active running   ABRT Xorg log watcher
    …​
    -.slice                                             loaded active active    Root Slice
    machine.slice                                       loaded active active    Virtual Machine and Container Slice
    system-getty.slice                                  loaded active active    system-getty.slice
    system-lvm2\x2dpvscan.slice                         loaded active active    system-lvm2\x2dpvscan.slice
    system-sshd\x2dkeygen.slice                         loaded active active    system-sshd\x2dkeygen.slice
    system-systemd\x2dhibernate\x2dresume.slice         loaded active active    system-systemd\x2dhibernate\x2dresume>
    system-user\x2druntime\x2ddir.slice                 loaded active active    system-user\x2druntime\x2ddir.slice
    system.slice                                        loaded active active    System Slice
    user-1000.slice                                     loaded active active    User Slice of UID 1000
    user-42.slice                                       loaded active active    User Slice of UID 42
    user.slice                                          loaded active active    User and Session Slice
    …​
    • UNIT - a name of a unit that also reflects the unit position in a control group hierarchy. The units relevant for resource control are a slice, a scope, and a service.
    • LOAD - indicates whether the unit configuration file was properly loaded. If the unit file failed to load, the field contains the state error instead of loaded. Other unit load states are: stub, merged, and masked.
    • ACTIVE - the high-level unit activation state, which is a generalization of SUB.
    • SUB - the low-level unit activation state. The range of possible values depends on the unit type.
    • DESCRIPTION - the description of the unit content and functionality.
  • To list inactive units, execute:

    # systemctl --all
  • To limit the amount of information in the output, execute:

    # systemctl --type service,masked

    The --type option requires a comma-separated list of unit types such as a service and a slice, or unit load states such as loaded and masked.
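
If you need to post-process the listing, the five columns described above can be split with a few lines of Python. This is a rough sketch: the function name parse_unit_rows is hypothetical, the sample rows are shortened, and the legend lines that systemctl prints after the table are not handled (the --no-legend option of systemctl suppresses them).

```python
def parse_unit_rows(text):
    """Split rows of a systemctl unit listing into the five columns."""
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 4)
        if len(fields) < 4 or fields[0] == "UNIT":
            continue  # skip blank lines, elisions, and the header row
        rows.append({
            "unit": fields[0],
            "load": fields[1],
            "active": fields[2],
            "sub": fields[3],
            "description": fields[4].strip() if len(fields) == 5 else "",
        })
    return rows

sample = """\
UNIT            LOAD   ACTIVE SUB     DESCRIPTION
init.scope      loaded active running System and Service Manager
system.slice    loaded active active  System Slice
"""
slices = [row["unit"] for row in parse_unit_rows(sample)
          if row["unit"].endswith(".slice")]
print(slices)  # ['system.slice']
```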

20.5. Viewing systemd control group hierarchy

The following procedure describes how to display control groups (cgroups) hierarchy and processes running in specific cgroups.

Procedure

  • To display the whole cgroups hierarchy on your system, run the systemd-cgls command:

    # systemd-cgls
    Control group /:
    -.slice
    ├─user.slice
    │ ├─user-42.slice
    │ │ ├─session-c1.scope
    │ │ │ ├─ 965 gdm-session-worker [pam/gdm-launch-environment]
    │ │ │ ├─1040 /usr/libexec/gdm-x-session gnome-session --autostart /usr/share/gdm/greeter/autostart
    …​
    ├─init.scope
    │ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 18
    └─system.slice
      …​
      ├─example.service
      │ ├─6882 /bin/bash /home/jdoe/example.sh
      │ └─6902 sleep 1
      ├─systemd-journald.service
      │ └─629 /usr/lib/systemd/systemd-journald
      …​

    The example output returns the entire cgroups hierarchy, where the highest level is formed by slices.

  • To display the cgroups hierarchy filtered by a resource controller, run the systemd-cgls <resource_controller> command:

    # systemd-cgls memory
    Controller memory; Control group /:
    ├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 18
    ├─user.slice
    │ ├─user-42.slice
    │ │ ├─session-c1.scope
    │ │ │ ├─ 965 gdm-session-worker [pam/gdm-launch-environment]
    …​
    └─system.slice
      |
      …​
      ├─chronyd.service
      │ └─844 /usr/sbin/chronyd
      ├─example.service
      │ ├─8914 /bin/bash /home/jdoe/example.sh
      │ └─8916 sleep 1
      …​

    The example output of the above command lists the services that interact with the selected controller.

  • To display detailed information about a certain unit and its part of the cgroups hierarchy, run the systemctl status <system_unit> command:

    # systemctl status example.service
    ● example.service - My example service
       Loaded: loaded (/usr/lib/systemd/system/example.service; enabled; vendor preset: disabled)
       Active: active (running) since Tue 2019-04-16 12:12:39 CEST; 3s ago
     Main PID: 17737 (bash)
        Tasks: 2 (limit: 11522)
       Memory: 496.0K (limit: 1.5M)
       CGroup: /system.slice/example.service
               ├─17737 /bin/bash /home/jdoe/example.sh
               └─17743 sleep 1
    Apr 16 12:12:39 redhat systemd[1]: Started My example service.
    Apr 16 12:12:39 redhat bash[17737]: The current time is Tue Apr 16 12:12:39 CEST 2019
    Apr 16 12:12:40 redhat bash[17737]: The current time is Tue Apr 16 12:12:40 CEST 2019

20.6. Viewing cgroups of processes

The following procedure describes how to learn which control group (cgroup) a process belongs to. Then you can check the cgroup to learn which controllers and controller-specific configurations it uses.

Procedure

  1. To view which cgroup a process belongs to, run the cat /proc/<PID>/cgroup command:

    # cat /proc/2467/cgroup
    0::/system.slice/example.service

    The example output relates to a process of interest. In this case, it is a process identified by PID 2467, which belongs to the example.service unit. You can determine whether the process was placed in a correct control group as defined by the systemd unit file specifications.

  2. To display what controllers the cgroup utilizes and the respective configuration files, check the cgroup directory:

    # cat /sys/fs/cgroup/system.slice/example.service/cgroup.controllers
    memory pids
    
    # ls /sys/fs/cgroup/system.slice/example.service/
    cgroup.controllers
    cgroup.events
    …​
    cpu.pressure
    cpu.stat
    io.pressure
    memory.current
    memory.events
    …​
    pids.current
    pids.events
    pids.max
Note

The version 1 hierarchy of cgroups uses a per-controller model. Therefore, the output from the /proc/<PID>/cgroup file shows which cgroups under each controller the PID belongs to. You can find the respective cgroups under the controller directories at /sys/fs/cgroup/<controller_name>/.
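
Each line of the /proc/<PID>/cgroup file has three colon-separated fields: a hierarchy ID, a controller list, and the cgroup path. The following Python sketch parses both the version 2 form shown in the procedure and the per-controller version 1 form. The function name parse_proc_cgroup and the version 1 sample line are hypothetical.

```python
def parse_proc_cgroup(text):
    """Parse the contents of /proc/<PID>/cgroup into a list of entries."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        # Split only twice, so a colon inside the path stays in the path.
        hierarchy_id, controllers, path = line.split(":", 2)
        entries.append({
            "hierarchy_id": int(hierarchy_id),
            "controllers": controllers.split(",") if controllers else [],
            "path": path,
        })
    return entries

# cgroups version 2: one line, with an empty controller list.
print(parse_proc_cgroup("0::/system.slice/example.service"))

# cgroups version 1 (hypothetical sample): one line per hierarchy.
print(parse_proc_cgroup("4:memory:/system.slice/example.service"))
```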

Additional resources

  • cgroups(7) manual page
  • What are kernel resource controllers
  • Documentation in the /usr/share/doc/kernel-doc-<kernel_version>/Documentation/admin-guide/cgroup-v2.rst file (after installing the kernel-doc package)

20.7. Monitoring resource consumption

The following procedure describes how to view a list of currently running control groups (cgroups) and their resource consumption in real-time.

Procedure

  1. To see a dynamic account of currently running cgroups, run the systemd-cgtop command:

    # systemd-cgtop
    Control Group                            Tasks   %CPU   Memory  Input/s Output/s
    /                                          607   29.8     1.5G        -        -
    /system.slice                              125      -   428.7M        -        -
    /system.slice/ModemManager.service           3      -     8.6M        -        -
    /system.slice/NetworkManager.service         3      -    12.8M        -        -
    /system.slice/accounts-daemon.service        3      -     1.8M        -        -
    /system.slice/boot.mount                     -      -    48.0K        -        -
    /system.slice/chronyd.service                1      -     2.0M        -        -
    /system.slice/cockpit.socket                 -      -     1.3M        -        -
    /system.slice/colord.service                 3      -     3.5M        -        -
    /system.slice/crond.service                  1      -     1.8M        -        -
    /system.slice/cups.service                   1      -     3.1M        -        -
    /system.slice/dev-hugepages.mount            -      -   244.0K        -        -
    /system.slice/dev-mapper-rhel\x2dswap.swap   -      -   912.0K        -        -
    /system.slice/dev-mqueue.mount               -      -    48.0K        -        -
    /system.slice/example.service                2      -     2.0M        -        -
    /system.slice/firewalld.service              2      -    28.8M        -        -
    ...

    The example output displays currently running cgroups ordered by their resource usage (CPU, memory, disk I/O load). The list refreshes every 1 second by default. Therefore, it offers a dynamic insight into the actual resource usage of each control group.
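
For a one-off snapshot in a script, you can read the memory.current and pids.current files of each cgroup directly. The following Python sketch is not how systemd-cgtop is implemented; it assumes a cgroups version 2 hierarchy mounted at /sys/fs/cgroup, and the function names are hypothetical.

```python
import os

def read_int(path):
    """Return the integer stored in a cgroup interface file, or None."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None  # file missing, unreadable, or not a plain number

def cgroup_usage(root="/sys/fs/cgroup"):
    """Collect memory.current and pids.current for every cgroup below root."""
    rows = []
    for dirpath, _dirs, _files in os.walk(root):
        memory = read_int(os.path.join(dirpath, "memory.current"))
        pids = read_int(os.path.join(dirpath, "pids.current"))
        if memory is None and pids is None:
            continue
        rows.append((os.path.relpath(dirpath, root), memory, pids))
    # Largest memory consumers first; cgroups without a value sort last.
    rows.sort(key=lambda row: row[1] if row[1] is not None else -1, reverse=True)
    return rows

if __name__ == "__main__":
    for path, memory, pids in cgroup_usage()[:5]:
        print(path, memory, pids)
```

Unlike systemd-cgtop, this sketch takes a single sample and therefore cannot show rates such as CPU percentage or I/O per second.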

Additional resources

  • systemd-cgtop(1) manual page

20.8. Using systemd unit files to set limits for applications

Each existing or running unit is supervised by systemd, which also creates control groups for them. The units have configuration files in the /usr/lib/systemd/system/ directory. You can manually modify the unit files to set limits, prioritize, or control access to hardware resources for groups of processes.

Prerequisites

  • You have the root privileges.

Procedure

  1. Modify the /usr/lib/systemd/system/example.service file to limit the memory usage of a service:

    …​
    [Service]
    MemoryMax=1500K
    …​

    The configuration above places a maximum memory limit, which the processes in a control group cannot exceed. The example.service service is part of such a control group which has imposed limitations. You can use suffixes K, M, G, or T to identify Kilobyte, Megabyte, Gigabyte, or Terabyte as a unit of measurement.

  2. Reload all unit configuration files:

    # systemctl daemon-reload
  3. Restart the service:

    # systemctl restart example.service

Verification

  1. Check that the changes took effect:

    # cat /sys/fs/cgroup/system.slice/example.service/memory.max
    1536000

    The example output shows that the memory limit is set to 1536000 bytes, that is, 1500 KB with 1 KB counted as 1024 bytes.
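
The relationship between MemoryMax=1500K and the 1536000 value in memory.max can be checked with a short Python sketch. It assumes base 1024 suffixes, which is consistent with the output above, and handles only plain integers with an optional K, M, G, or T suffix. The function name memory_value_to_bytes is hypothetical.

```python
SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}

def memory_value_to_bytes(value):
    """Convert a value such as "1500K" to bytes, using base 1024 suffixes."""
    value = value.strip()
    if value and value[-1] in SUFFIXES:
        return int(value[:-1]) * SUFFIXES[value[-1]]
    return int(value)

print(memory_value_to_bytes("1500K"))  # 1536000
```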

20.9. Using systemctl command to set limits to applications

CPU affinity settings help you restrict the access of a particular process to some CPUs. Effectively, the CPU scheduler never schedules the process to run on a CPU that is not in the affinity mask of the process.

The default CPU affinity mask applies to all services managed by systemd.

To configure CPU affinity mask for a particular systemd service, systemd provides CPUAffinity= both as a unit file option and a manager configuration option in the /etc/systemd/system.conf file.

The CPUAffinity= unit file option sets a list of CPUs or CPU ranges that are merged and used as the affinity mask.
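
To illustrate how a list of CPUs and CPU ranges merges into one mask, the following Python sketch expands such a list into individual CPU indices. It assumes that items are separated by whitespace or commas and that ranges use a dash, as described in systemd.exec(5). The function name parse_cpu_list is hypothetical.

```python
def parse_cpu_list(spec):
    """Expand a CPU list such as "0-2,5 7" into a sorted list of CPU indices."""
    cpus = set()
    for item in spec.replace(",", " ").split():
        low, sep, high = item.partition("-")
        if sep:
            cpus.update(range(int(low), int(high) + 1))  # inclusive range
        else:
            cpus.add(int(low))
    return sorted(cpus)

print(parse_cpu_list("0-2,5 7"))  # [0, 1, 2, 5, 7]
```

Overlapping entries collapse into a single set, so "0-3 2-4" yields CPUs 0 through 4.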

After configuring CPU affinity mask for a particular systemd service, you must restart the service to apply the changes.

Procedure

To set CPU affinity mask for a particular systemd service using the CPUAffinity unit file option:

  1. Check the values of the CPUAffinity unit file option in the service of your choice:

    $ systemctl show --property <CPU affinity configuration option> <service name>
  2. As root, set the required value of the CPUAffinity unit file option for the CPU ranges used as the affinity mask:

    # systemctl set-property <service name> CPUAffinity=<value>
  3. Restart the service to apply the changes.

    # systemctl restart <service name>

20.10. Setting global default CPU affinity through manager configuration

The CPUAffinity option in the /etc/systemd/system.conf file defines an affinity mask for the process with process identification number (PID) 1 and all processes forked from PID 1. You can then override the CPUAffinity on a per-service basis.

To set default CPU affinity mask for all systemd services using the manager configuration option:

  1. Set the CPU numbers for the CPUAffinity= option in the /etc/systemd/system.conf file.
  2. Save the edited file and reload the systemd service:

    # systemctl daemon-reload
  3. Reboot the server to apply the changes.

20.11. Configuring NUMA policies using systemd

Non-uniform memory access (NUMA) is a computer memory subsystem design, in which the memory access time depends on the physical memory location relative to the processor.

Memory close to the CPU has lower latency (local memory) than memory that is local for a different CPU (foreign memory) or is shared between a set of CPUs.

In terms of the Linux kernel, NUMA policy governs where (for example, on which NUMA nodes) the kernel allocates physical memory pages for the process.

systemd provides unit file options NUMAPolicy and NUMAMask to control memory allocation policies for services.

Procedure

To set the NUMA memory policy through the NUMAPolicy unit file option:

  1. Check the values of the NUMAPolicy unit file option in the service of your choice:

    $ systemctl show --property <NUMA policy configuration option> <service name>
  2. As root, set the required policy type of the NUMAPolicy unit file option:

    # systemctl set-property <service name> NUMAPolicy=<value>
  3. Restart the service to apply the changes.

    # systemctl restart <service name>

To set a global NUMAPolicy setting through the manager configuration option:

  1. Search in the /etc/systemd/system.conf file for the NUMAPolicy option.
  2. Edit the policy type and save the file.
  3. Reload the systemd configuration:

    # systemctl daemon-reload
  4. Reboot the server.
Important

When you configure a strict NUMA policy, for example bind, make sure that you also appropriately set the CPUAffinity= unit file option.

20.12. NUMA policy configuration options for systemd

Systemd provides the following options to configure the NUMA policy:

NUMAPolicy

Controls the NUMA memory policy of the executed processes. The following policy types are possible:

  • default
  • preferred
  • bind
  • interleave
  • local
NUMAMask

Controls the NUMA node list which is associated with the selected NUMA policy.

Note that the NUMAMask option is not required to be specified for the following policies:

  • default
  • local

For the preferred policy, the list specifies only a single NUMA node.
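
The rules above can be expressed as a small consistency check, sketched here in Python. The function name check_numa_settings is hypothetical, and the assumption that the bind and interleave policies need a NUMAMask is inferred from the list of policies that do not require one.

```python
POLICIES = {"default", "preferred", "bind", "interleave", "local"}
NO_MASK_NEEDED = {"default", "local"}

def check_numa_settings(policy, nodes=None):
    """Return a list of problems with a NUMAPolicy/NUMAMask combination.

    `nodes` is a list of NUMA node numbers standing in for NUMAMask.
    """
    problems = []
    nodes = nodes or []
    if policy not in POLICIES:
        problems.append("unknown policy: " + policy)
    elif policy not in NO_MASK_NEEDED and not nodes:
        problems.append(policy + " needs a NUMAMask")
    elif policy == "preferred" and len(nodes) != 1:
        problems.append("preferred takes exactly one NUMA node")
    return problems

print(check_numa_settings("bind", [0, 1]))   # []
print(check_numa_settings("preferred", [0, 1]))
```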

Additional resources

  • systemd.resource-control(5), systemd.exec(5), and set_mempolicy(2) manual pages

20.13. Creating transient cgroups using systemd-run command

The transient cgroups set limits on resources consumed by a unit (service or scope) during its runtime.

Procedure

  • To create a transient control group, use the systemd-run command in the following format:

    # systemd-run --unit=<name> --slice=<name>.slice <command>

    This command creates and starts a transient service or a scope unit and runs a custom command in such a unit.

    • The --unit=<name> option gives a name to the unit. If --unit is not specified, the name is generated automatically.
    • The --slice=<name>.slice option makes your service or scope unit a member of a specified slice. Replace <name>.slice with the name of an existing slice (as shown in the output of systemctl -t slice), or create a new slice by passing a unique name. By default, services and scopes are created as members of the system.slice.
    • Replace <command> with the command you wish to execute in the service or the scope unit.

      The following message is displayed to confirm that you created and started the service or the scope successfully:

      Running as unit <name>.service
  • Optionally, keep the unit running after its processes finished to collect run-time information:

    # systemd-run --unit=<name> --slice=<name>.slice --remain-after-exit <command>

    The command creates and starts a transient service unit and runs a custom command in such a unit. The --remain-after-exit option keeps the service unit active after its processes have finished, so that you can still collect its run-time information.
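
When you start transient units from a script, it can help to assemble the systemd-run command line from its parts. The following Python sketch only builds and prints the argument list, using the options described above. The function name systemd_run_argv and the unit, slice, and command values are hypothetical.

```python
import shlex

def systemd_run_argv(command, unit=None, slice_name=None, remain_after_exit=False):
    """Build the argument list for a systemd-run invocation."""
    argv = ["systemd-run"]
    if unit:
        argv.append("--unit=" + unit)
    if slice_name:
        argv.append("--slice=" + slice_name)
    if remain_after_exit:
        argv.append("--remain-after-exit")
    argv.extend(command)  # the command to execute inside the transient unit
    return argv

argv = systemd_run_argv(["sleep", "60"], unit="demo", slice_name="demo.slice")
print(shlex.join(argv))
# systemd-run --unit=demo --slice=demo.slice sleep 60
```

To execute the command, you can pass the list to subprocess.run(argv, check=True) as root. Passing a list instead of a single string avoids shell quoting problems in the command.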

20.14. Removing transient control groups

You can use the systemd system and service manager to remove transient control groups (cgroups) if you no longer need to limit, prioritize, or control access to hardware resources for groups of processes.

Transient cgroups are automatically released once all the processes that a service or a scope unit contains have finished.

Procedure

  • To stop the service unit with all its processes, execute:

    # systemctl stop <name>.service
  • To terminate one or more of the unit processes, execute:

    # systemctl kill <name>.service --kill-who=<main|control|all> --signal=<signal>

    The command above uses the --kill-who option to select which processes of the unit receive the signal: main for the main process, control for the control process, or all for every process in the control group. If you omit the option, all is used. The --signal option determines the type of POSIX signal to be sent to the selected processes. The default signal is SIGTERM.