Monitoring and managing system status and performance
Optimizing system throughput, latency, and power consumption
Abstract
Making open source more inclusive
Red Hat is committed to replacing problematic language in our code, documentation, and web properties. We are beginning with these four terms: master, slave, blacklist, and whitelist. Because of the scale of this endeavor, these changes will be implemented gradually over several upcoming releases. For more details, see our CTO Chris Wright's message.
Providing feedback on Red Hat documentation
We appreciate your input on our documentation. Please let us know how we could make it better.
For simple comments on specific passages:
- Make sure you are viewing the documentation in the Multi-page HTML format. In addition, ensure you see the Feedback button in the upper right corner of the document.
- Use your mouse cursor to highlight the part of text that you want to comment on.
- Click the Add Feedback pop-up that appears below the highlighted text.
- Follow the displayed instructions.
For submitting feedback via Bugzilla, create a new ticket:
- Go to the Bugzilla website.
- As the Component, use Documentation.
- Fill in the Description field with your suggestion for improvement. Include a link to the relevant part(s) of documentation.
- Click Submit Bug.
Chapter 1. Getting started with TuneD
As a system administrator, you can use the TuneD application to optimize the performance profile of your system for a variety of use cases.
1.1. The purpose of TuneD
TuneD is a service that monitors your system and optimizes the performance under certain workloads. At the core of TuneD are profiles, which tune your system for different use cases.
TuneD is distributed with a number of predefined profiles for use cases such as:
- High throughput
- Low latency
- Saving power
It is possible to modify the rules defined for each profile and customize how to tune a particular device. When you switch to another profile or deactivate TuneD, all changes made to the system settings by the previous profile revert to their original state.
You can also configure TuneD to react to changes in device usage and adjust settings to improve performance of active devices and reduce power consumption of inactive devices.
1.2. TuneD profiles
A detailed analysis of a system can be very time-consuming. TuneD provides a number of predefined profiles for typical use cases. You can also create, modify, and delete profiles.
The profiles provided with TuneD are divided into the following categories:
- Power-saving profiles
- Performance-boosting profiles
The performance-boosting profiles include profiles that focus on the following aspects:
- Low latency for storage and network
- High throughput for storage and network
- Virtual machine performance
- Virtualization host performance
Syntax of profile configuration
The tuned.conf file can contain one [main] section and other sections for configuring plug-in instances. However, all sections are optional.
Lines starting with the hash sign (#) are comments.
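As an illustration, a minimal profile file might look like the following sketch. The profile name and the sysctl value are hypothetical, chosen only to show the layout described above:

```ini
# /etc/tuned/example-profile/tuned.conf (hypothetical profile)
[main]
summary=Example profile layout
include=balanced

# A plug-in instance section; the value below is illustrative only.
[sysctl]
vm.swappiness=10
```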
Additional resources
- tuned.conf(5) man page
1.3. The default TuneD profile
During the installation, the best profile for your system is selected automatically. Currently, the default profile is selected according to the following customizable rules:
Environment | Default profile | Goal |
---|---|---|
Compute nodes | throughput-performance | The best throughput performance |
Virtual machines | virtual-guest | The best performance. If you are not interested in the best performance, you can change it to the balanced or powersave profile. |
Other cases | balanced | Balanced performance and power consumption |
Additional resources
- tuned.conf(5) man page
1.4. Merged TuneD profiles
As an experimental feature, it is possible to select multiple profiles at once. TuneD attempts to merge them during loading.
If there are conflicts, the settings from the last specified profile take precedence.
Example 1.1. Low power consumption in a virtual guest
The following example optimizes the system to run in a virtual machine for the best performance and concurrently tunes it for low power consumption, with low power consumption being the priority:
# tuned-adm profile virtual-guest powersave
Merging is done automatically without checking whether the resulting combination of parameters makes sense. Consequently, the feature might tune some parameters the opposite way, which might be counterproductive: for example, setting the disk for high throughput by using the throughput-performance profile and concurrently setting the disk spindown to a low value by using the spindown-disk profile.
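The last-wins precedence rule can be sketched with a toy shell function, in which later key=value arguments override earlier ones. This is an illustration of the rule only, not TuneD code, and the setting names are invented:

```shell
# Toy model of profile merging: each argument is a key=value setting,
# and later arguments override earlier ones, mirroring the rule that
# the last specified profile takes precedence on conflict.
merge_settings() {
  printf '%s\n' "$@" | awk -F= '{v[$1]=$2} END {for (k in v) print k "=" v[k]}' | sort
}
# prints: governor=powersave, then readahead=4096
merge_settings governor=performance readahead=4096 governor=powersave
```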
Additional resources
- tuned-adm(8) man page
- tuned.conf(5) man page
1.5. The location of TuneD profiles
TuneD stores profiles in the following directories:
/usr/lib/tuned/
- Distribution-specific profiles are stored in this directory. Each profile has its own directory. A profile consists of the main configuration file called tuned.conf, and optionally other files, for example helper scripts.
/etc/tuned/
- If you need to customize a profile, copy the profile directory into this directory, which is used for custom profiles. If there are two profiles of the same name, the custom profile located in /etc/tuned/ is used.
Additional resources
- tuned.conf(5) man page
1.6. TuneD profiles distributed with RHEL
The following is a list of profiles that are installed with TuneD on Red Hat Enterprise Linux.
There might be more product-specific or third-party TuneD profiles available. Such profiles are usually provided by separate RPM packages.
balanced
The default power-saving profile. It is intended to be a compromise between performance and power consumption. It uses auto-scaling and auto-tuning whenever possible. The only drawback is the increased latency. In the current TuneD release, it enables the CPU, disk, audio, and video plugins, and activates the conservative CPU governor. The radeon_powersave option uses the dpm-balanced value if it is supported; otherwise, it is set to auto.
It changes the energy_performance_preference attribute to the normal energy setting. It also changes the scaling_governor policy attribute to either the conservative or powersave CPU governor.
powersave
A profile for maximum power-saving performance. It can throttle performance in order to minimize the actual power consumption. In the current TuneD release, it enables USB autosuspend, WiFi power saving, and Aggressive Link Power Management (ALPM) power savings for SATA host adapters. It also schedules multi-core power savings for systems with a low wakeup rate and activates the ondemand governor. It enables AC97 audio power saving or, depending on your system, HDA-Intel power savings with a 10-second timeout. If your system contains a supported Radeon graphics card with KMS enabled, the profile configures it to automatic power saving. On ASUS Eee PCs, a dynamic Super Hybrid Engine is enabled.
It changes the energy_performance_preference attribute to the powersave or power energy setting. It also changes the scaling_governor policy attribute to either the ondemand or powersave CPU governor.
Note
In certain cases, the
balanced profile is more efficient compared to the powersave profile.
Consider there is a defined amount of work that needs to be done, for example a video file that needs to be transcoded. Your machine might consume less energy if the transcoding is done at full power, because the task is finished quickly, the machine starts to idle, and it can automatically step down to very efficient power-save modes. On the other hand, if you transcode the file with a throttled machine, the machine consumes less power during the transcoding, but the process takes longer and the overall consumed energy can be higher.
That is why the balanced profile is generally a better option.
throughput-performance
A server profile optimized for high throughput. It disables power-saving mechanisms and enables sysctl settings that improve the throughput performance of disk and network I/O. The CPU governor is set to performance.
It changes the energy_performance_preference and scaling_governor attributes to the performance profile.
accelerator-performance
The accelerator-performance profile contains the same tuning as the throughput-performance profile. Additionally, it locks the CPU to low C states so that the latency is less than 100 us. This improves the performance of certain accelerators, such as GPUs.
latency-performance
A server profile optimized for low latency. It disables power-saving mechanisms and enables sysctl settings that improve latency. The CPU governor is set to performance and the CPU is locked to low C states (by PM QoS).
It changes the energy_performance_preference and scaling_governor attributes to the performance profile.
network-latency
A profile for low-latency network tuning. It is based on the latency-performance profile. It additionally disables transparent huge pages and NUMA balancing, and tunes several other network-related sysctl parameters.
It inherits the latency-performance profile, which changes the energy_performance_preference and scaling_governor attributes to the performance profile.
hpc-compute
A profile optimized for high-performance computing. It is based on the latency-performance profile.
network-throughput
A profile for throughput network tuning. It is based on the throughput-performance profile. It additionally increases kernel network buffers.
It inherits either the latency-performance or throughput-performance profile, and changes the energy_performance_preference and scaling_governor attributes to the performance profile.
virtual-guest
A profile designed for Red Hat Enterprise Linux 9 virtual machines and VMware guests, based on the throughput-performance profile, that, among other tasks, decreases virtual memory swappiness and increases disk readahead values. It does not disable disk barriers.
It inherits the throughput-performance profile and changes the energy_performance_preference and scaling_governor attributes to the performance profile.
virtual-host
A profile designed for virtual hosts, based on the throughput-performance profile, that, among other tasks, decreases virtual memory swappiness, increases disk readahead values, and enables a more aggressive value of dirty pages writeback.
It inherits the throughput-performance profile and changes the energy_performance_preference and scaling_governor attributes to the performance profile.
oracle
A profile optimized for Oracle database loads, based on the throughput-performance profile. It additionally disables transparent huge pages and modifies other performance-related kernel parameters. This profile is provided by the tuned-profiles-oracle package.
desktop
A profile optimized for desktops, based on the balanced profile. It additionally enables scheduler autogroups for better response of interactive applications.
optimize-serial-console
A profile that tunes down I/O activity to the serial console by reducing the printk value. This should make the serial console more responsive. This profile is intended to be used as an overlay on other profiles. For example:
# tuned-adm profile throughput-performance optimize-serial-console
mssql
A profile provided for Microsoft SQL Server. It is based on the throughput-performance profile.
intel-sst
A profile optimized for systems with user-defined Intel Speed Select Technology configurations. This profile is intended to be used as an overlay on other profiles. For example:
# tuned-adm profile cpu-partitioning intel-sst
1.7. TuneD cpu-partitioning profile
For tuning Red Hat Enterprise Linux 9 for latency-sensitive workloads, Red Hat recommends using the cpu-partitioning TuneD profile.
Prior to Red Hat Enterprise Linux 9, the low-latency Red Hat documentation described the numerous low-level steps needed to achieve low-latency tuning. In Red Hat Enterprise Linux 9, you can perform low-latency tuning more efficiently by using the cpu-partitioning TuneD profile. This profile is easily customizable according to the requirements for individual low-latency applications.
The following figure is an example that demonstrates how to use the cpu-partitioning profile. This example uses the CPU and node layout.
Figure 1.1. Figure cpu-partitioning

You can configure the cpu-partitioning profile in the /etc/tuned/cpu-partitioning-variables.conf file using the following configuration options:
- Isolated CPUs with load balancing
In the cpu-partitioning figure, the blocks numbered from 4 to 23 are the default isolated CPUs. The kernel scheduler's process load balancing is enabled on these CPUs. It is designed for low-latency processes with multiple threads that need the kernel scheduler load balancing.
You can configure the cpu-partitioning profile in the /etc/tuned/cpu-partitioning-variables.conf file using the isolated_cores=cpu-list option, which lists CPUs to isolate that will use the kernel scheduler load balancing.
The list of isolated CPUs is comma-separated, or you can specify a range using a dash, such as 3-5. This option is mandatory. Any CPU missing from this list is automatically considered a housekeeping CPU.
- Isolated CPUs without load balancing
In the cpu-partitioning figure, the blocks numbered 2 and 3 are the isolated CPUs that do not provide any additional kernel scheduler process load balancing.
You can configure the cpu-partitioning profile in the /etc/tuned/cpu-partitioning-variables.conf file using the no_balance_cores=cpu-list option, which lists CPUs to isolate that will not use the kernel scheduler load balancing.
Specifying the no_balance_cores option is optional; however, any CPUs in this list must be a subset of the CPUs listed in the isolated_cores list.
Application threads using these CPUs need to be pinned individually to each CPU.
- Housekeeping CPUs
- Any CPU not isolated in the cpu-partitioning-variables.conf file is automatically considered a housekeeping CPU. On the housekeeping CPUs, all services, daemons, user processes, movable kernel threads, interrupt handlers, and kernel timers are permitted to execute.
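The cpu-list syntax accepted by isolated_cores and no_balance_cores (comma-separated entries, with a dash denoting a range) can be illustrated with a small helper function. This is a sketch for clarity only, not part of TuneD:

```shell
# Expand a tuned-style CPU list such as "2-4,6" into one CPU per line.
expand_cpulist() {
  old_ifs=$IFS
  IFS=','
  for item in $1; do
    case $item in
      *-*) seq "${item%-*}" "${item#*-}" ;;  # a dash denotes a range
      *)   echo "$item" ;;
    esac
  done
  IFS=$old_ifs
}
# prints 2, 3, 4, and 6, one per line
expand_cpulist "2-4,6"
```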
Additional resources
- tuned-profiles-cpu-partitioning(7) man page
1.8. Using the TuneD cpu-partitioning profile for low-latency tuning
This procedure describes how to tune a system for low latency using TuneD's cpu-partitioning profile. It uses the example of a low-latency application that can use cpu-partitioning and the CPU layout as mentioned in the cpu-partitioning figure.
The application in this case uses:
- One dedicated reader thread that reads data from the network is pinned to CPU 2.
- A large number of threads that process this network data are pinned to CPUs 4-23.
- A dedicated writer thread that writes the processed data to the network is pinned to CPU 3.
Prerequisites
- You have installed the cpu-partitioning TuneD profile by using the dnf install tuned-profiles-cpu-partitioning command as root.
Procedure
Edit the /etc/tuned/cpu-partitioning-variables.conf file and add the following information:

# Isolated CPUs with the kernel's scheduler load balancing:
isolated_cores=2-23
# Isolated CPUs without the kernel's scheduler load balancing:
no_balance_cores=2,3
Set the cpu-partitioning TuneD profile:

# tuned-adm profile cpu-partitioning
Reboot the system.
After rebooting, the system is tuned for low latency, according to the isolation in the cpu-partitioning figure. The application can use taskset to pin the reader and writer threads to CPUs 2 and 3, and the remaining application threads to CPUs 4-23.
Additional resources
- tuned-profiles-cpu-partitioning(7) man page
1.9. Customizing the cpu-partitioning TuneD profile
You can extend the TuneD profile to make additional tuning changes.
For example, the cpu-partitioning profile sets the CPUs to use cstate=1. To use the cpu-partitioning profile but additionally change the CPU cstate from cstate1 to cstate0, the following procedure describes a new TuneD profile named my_profile, which inherits the cpu-partitioning profile and then sets C state 0.
Procedure
Create the /etc/tuned/my_profile directory:

# mkdir /etc/tuned/my_profile
Create a tuned.conf file in this directory, and add the following content:

# vi /etc/tuned/my_profile/tuned.conf
[main]
summary=Customized tuning on top of cpu-partitioning
include=cpu-partitioning
[cpu]
force_latency=cstate.id:0|1
Use the new profile:
# tuned-adm profile my_profile
In this example, a reboot is not required. However, if the changes in the my_profile profile require a reboot to take effect, reboot your machine.
Additional resources
- tuned-profiles-cpu-partitioning(7) man page
1.10. Real-time TuneD profiles distributed with RHEL
Real-time profiles are intended for systems running the real-time kernel. Without a special kernel build, they do not configure the system to be real-time. On RHEL, the profiles are available from additional repositories.
The following real-time profiles are available:
realtime
Use on bare-metal real-time systems.
Provided by the tuned-profiles-realtime package, which is available from the RT or NFV repositories.
realtime-virtual-host
Use in a virtualization host configured for real-time.
Provided by the tuned-profiles-nfv-host package, which is available from the NFV repository.
realtime-virtual-guest
Use in a virtualization guest configured for real-time.
Provided by the tuned-profiles-nfv-guest package, which is available from the NFV repository.
1.11. Static and dynamic tuning in TuneD
This section explains the difference between the two categories of system tuning that TuneD applies: static and dynamic.
- Static tuning
- Mainly consists of the application of predefined sysctl and sysfs settings and one-shot activation of several configuration tools such as ethtool.
- Dynamic tuning
Watches how various system components are used throughout the uptime of your system. TuneD adjusts system settings dynamically based on that monitoring information.
For example, the hard drive is used heavily during startup and login, but is barely used later when the user might mainly work with applications such as web browsers or email clients. Similarly, the CPU and network devices are used differently at different times. TuneD monitors the activity of these components and reacts to the changes in their use.
By default, dynamic tuning is disabled. To enable it, edit the /etc/tuned/tuned-main.conf file and change the dynamic_tuning option to 1. TuneD then periodically analyzes system statistics and uses them to update your system tuning settings. To configure the time interval in seconds between these updates, use the update_interval option.
Currently implemented dynamic tuning algorithms try to balance performance and power saving, and are therefore disabled in the performance profiles. Dynamic tuning for individual plug-ins can be enabled or disabled in the TuneD profiles.
Example 1.2. Static and dynamic tuning on a workstation
On a typical office workstation, the Ethernet network interface is inactive most of the time. Only a few emails go in and out or some web pages might be loaded.
For those kinds of loads, the network interface does not have to run at full speed all the time, as it does by default. TuneD has a monitoring and tuning plug-in for network devices that can detect this low activity and then automatically lower the speed of that interface, typically resulting in a lower power usage.
If the activity on the interface increases for a longer period of time, for example because a DVD image is being downloaded or an email with a large attachment is opened, TuneD detects this and sets the interface speed to maximum to offer the best performance while the activity level is high.
This principle is used for other plug-ins for CPU and disks as well.
1.12. TuneD no-daemon mode
You can run TuneD in no-daemon mode, which does not require any resident memory. In this mode, TuneD applies the settings and exits.
By default, no-daemon mode is disabled because a lot of TuneD functionality is missing in this mode, including:
- D-Bus support
- Hot-plug support
- Rollback support for settings
To enable no-daemon mode, include the following line in the /etc/tuned/tuned-main.conf file:
daemon = 0
1.13. Installing and enabling TuneD
This procedure installs and enables the TuneD application, installs TuneD profiles, and presets a default TuneD profile for your system.
Procedure
Install the tuned package:

# dnf install tuned
Enable and start the tuned service:

# systemctl enable --now tuned
Optionally, install TuneD profiles for real-time systems:
# dnf install tuned-profiles-realtime tuned-profiles-nfv
Verify that a TuneD profile is active and applied:

$ tuned-adm active
Current active profile: balanced

$ tuned-adm verify
Verification succeeded, current system settings match the preset profile.
See tuned log file ('/var/log/tuned/tuned.log') for details.
1.14. Listing available TuneD profiles
This procedure lists all TuneD profiles that are currently available on your system.
Procedure
To list all available TuneD profiles on your system, use:
$ tuned-adm list
Available profiles:
- accelerator-performance - Throughput performance based tuning with disabled higher latency STOP states
- balanced - General non-specialized tuned profile
- desktop - Optimize for the desktop use-case
- latency-performance - Optimize for deterministic performance at the cost of increased power consumption
- network-latency - Optimize for deterministic performance at the cost of increased power consumption, focused on low latency network performance
- network-throughput - Optimize for streaming network throughput, generally only necessary on older CPUs or 40G+ networks
- powersave - Optimize for low power consumption
- throughput-performance - Broadly applicable tuning that provides excellent performance across a variety of common server workloads
- virtual-guest - Optimize for running inside a virtual guest
- virtual-host - Optimize for running KVM guests
Current active profile: balanced
To display only the currently active profile, use:
$ tuned-adm active
Current active profile: balanced
Additional resources
- tuned-adm(8) man page
1.15. Setting a TuneD profile
This procedure activates a selected TuneD profile on your system.
Prerequisites
- The tuned service is running. See Installing and enabling TuneD for details.
Procedure
Optionally, you can let TuneD recommend the most suitable profile for your system:
# tuned-adm recommend
balanced
Activate a profile:
# tuned-adm profile selected-profile
Alternatively, you can activate a combination of multiple profiles:
# tuned-adm profile profile1 profile2
Example 1.3. A virtual machine optimized for low power consumption
The following example optimizes the system to run in a virtual machine for the best performance and concurrently tunes it for low power consumption, with low power consumption being the priority:
# tuned-adm profile virtual-guest powersave
View the current active TuneD profile on your system:
# tuned-adm active
Current active profile: selected-profile
Reboot the system:
# reboot
Verification steps
Verify that the TuneD profile is active and applied:
$ tuned-adm verify
Verification succeeded, current system settings match the preset profile.
See tuned log file ('/var/log/tuned/tuned.log') for details.
Additional resources
- tuned-adm(8) man page
1.16. Disabling TuneD
This procedure disables TuneD and resets all affected system settings to their original state before TuneD modified them.
Procedure
To disable all tunings temporarily:
# tuned-adm off
The tunings are applied again after the tuned service restarts.
Alternatively, to stop and disable the tuned service permanently:

# systemctl disable --now tuned
Additional resources
- tuned-adm(8) man page
Chapter 2. Customizing TuneD profiles
You can create or modify TuneD profiles to optimize system performance for your intended use case.
Prerequisites
- Install and enable TuneD as described in Installing and enabling TuneD.
2.1. TuneD profiles
A detailed analysis of a system can be very time-consuming. TuneD provides a number of predefined profiles for typical use cases. You can also create, modify, and delete profiles.
The profiles provided with TuneD are divided into the following categories:
- Power-saving profiles
- Performance-boosting profiles
The performance-boosting profiles include profiles that focus on the following aspects:
- Low latency for storage and network
- High throughput for storage and network
- Virtual machine performance
- Virtualization host performance
Syntax of profile configuration
The tuned.conf file can contain one [main] section and other sections for configuring plug-in instances. However, all sections are optional.
Lines starting with the hash sign (#) are comments.
Additional resources
- tuned.conf(5) man page
2.2. The default TuneD profile
During the installation, the best profile for your system is selected automatically. Currently, the default profile is selected according to the following customizable rules:
Environment | Default profile | Goal |
---|---|---|
Compute nodes | throughput-performance | The best throughput performance |
Virtual machines | virtual-guest | The best performance. If you are not interested in the best performance, you can change it to the balanced or powersave profile. |
Other cases | balanced | Balanced performance and power consumption |
Additional resources
- tuned.conf(5) man page
2.3. Merged TuneD profiles
As an experimental feature, it is possible to select multiple profiles at once. TuneD attempts to merge them during loading.
If there are conflicts, the settings from the last specified profile take precedence.
Example 2.1. Low power consumption in a virtual guest
The following example optimizes the system to run in a virtual machine for the best performance and concurrently tunes it for low power consumption, with low power consumption being the priority:
# tuned-adm profile virtual-guest powersave
Merging is done automatically without checking whether the resulting combination of parameters makes sense. Consequently, the feature might tune some parameters the opposite way, which might be counterproductive: for example, setting the disk for high throughput by using the throughput-performance profile and concurrently setting the disk spindown to a low value by using the spindown-disk profile.
Additional resources
- tuned-adm(8) man page
- tuned.conf(5) man page
2.4. The location of TuneD profiles
TuneD stores profiles in the following directories:
/usr/lib/tuned/
- Distribution-specific profiles are stored in this directory. Each profile has its own directory. A profile consists of the main configuration file called tuned.conf, and optionally other files, for example helper scripts.
/etc/tuned/
- If you need to customize a profile, copy the profile directory into this directory, which is used for custom profiles. If there are two profiles of the same name, the custom profile located in /etc/tuned/ is used.
Additional resources
- tuned.conf(5) man page
2.5. Inheritance between TuneD profiles
TuneD profiles can be based on other profiles and modify only certain aspects of their parent profile.
The [main] section of TuneD profiles recognizes the include option:
[main]
include=parent
All settings from the parent profile are loaded in this child profile. In the following sections, the child profile can override certain settings inherited from the parent profile or add new settings not present in the parent profile.
You can create your own child profile in the /etc/tuned/ directory based on a pre-installed profile in /usr/lib/tuned/ with only some parameters adjusted.
If the parent profile is updated, such as after a TuneD upgrade, the changes are reflected in the child profile.
Example 2.2. A power-saving profile based on balanced
The following is an example of a custom profile that extends the balanced profile and sets Aggressive Link Power Management (ALPM) for all devices to the maximum power saving.
[main]
include=balanced

[scsi_host]
alpm=min_power
Additional resources
- tuned.conf(5) man page
2.6. Static and dynamic tuning in TuneD
This section explains the difference between the two categories of system tuning that TuneD applies: static and dynamic.
- Static tuning
- Mainly consists of the application of predefined sysctl and sysfs settings and one-shot activation of several configuration tools such as ethtool.
- Dynamic tuning
Watches how various system components are used throughout the uptime of your system. TuneD adjusts system settings dynamically based on that monitoring information.
For example, the hard drive is used heavily during startup and login, but is barely used later when the user might mainly work with applications such as web browsers or email clients. Similarly, the CPU and network devices are used differently at different times. TuneD monitors the activity of these components and reacts to the changes in their use.
By default, dynamic tuning is disabled. To enable it, edit the /etc/tuned/tuned-main.conf file and change the dynamic_tuning option to 1. TuneD then periodically analyzes system statistics and uses them to update your system tuning settings. To configure the time interval in seconds between these updates, use the update_interval option.
Currently implemented dynamic tuning algorithms try to balance performance and power saving, and are therefore disabled in the performance profiles. Dynamic tuning for individual plug-ins can be enabled or disabled in the TuneD profiles.
Example 2.3. Static and dynamic tuning on a workstation
On a typical office workstation, the Ethernet network interface is inactive most of the time. Only a few emails go in and out or some web pages might be loaded.
For those kinds of loads, the network interface does not have to run at full speed all the time, as it does by default. TuneD has a monitoring and tuning plug-in for network devices that can detect this low activity and then automatically lower the speed of that interface, typically resulting in a lower power usage.
If the activity on the interface increases for a longer period of time, for example because a DVD image is being downloaded or an email with a large attachment is opened, TuneD detects this and sets the interface speed to maximum to offer the best performance while the activity level is high.
This principle is used for other plug-ins for CPU and disks as well.
2.7. TuneD plug-ins
Plug-ins are modules in TuneD profiles that TuneD uses to monitor or optimize different devices on the system.
TuneD uses two types of plug-ins:
- Monitoring plug-ins
Monitoring plug-ins are used to get information from a running system. The output of the monitoring plug-ins can be used by tuning plug-ins for dynamic tuning.
Monitoring plug-ins are automatically instantiated whenever their metrics are needed by any of the enabled tuning plug-ins. If two tuning plug-ins require the same data, only one instance of the monitoring plug-in is created and the data is shared.
- Tuning plug-ins
- Each tuning plug-in tunes an individual subsystem and takes several parameters that are populated from the tuned profiles. Each subsystem can have multiple devices, such as multiple CPUs or network cards, that are handled by individual instances of the tuning plug-ins. Specific settings for individual devices are also supported.
Syntax for plug-ins in TuneD profiles
Sections describing plug-in instances are formatted in the following way:
[NAME]
type=TYPE
devices=DEVICES
- NAME
- is the name of the plug-in instance as it is used in the logs. It can be an arbitrary string.
- TYPE
- is the type of the tuning plug-in.
- DEVICES
- is the list of devices that this plug-in instance handles.
The devices line can contain a list, a wildcard (*), and negation (!). If there is no devices line, all devices of the TYPE present or later attached on the system are handled by the plug-in instance. This is the same as using the devices=* option.
Example 2.4. Matching block devices with a plug-in
The following example matches all block devices starting with sd, such as sda or sdb, and does not disable barriers on them:

[data_disk]
type=disk
devices=sd*
disable_barriers=false
The following example matches all block devices except sda1 and sda2:
[data_disk]
type=disk
devices=!sda1, !sda2
disable_barriers=false
If no instance of a plug-in is specified, the plug-in is not enabled.
If the plug-in supports more options, they can also be specified in the plug-in section. If an option is not specified and was not previously set in an included plug-in, the default value is used.
Short plug-in syntax
If you do not need custom names for the plug-in instance and there is only one definition of the instance in your configuration file, TuneD supports the following short syntax:
[TYPE]
devices=DEVICES
In this case, it is possible to omit the type line. The instance is then referred to by a name identical to its type. The previous example could then be rewritten as:
Example 2.5. Matching block devices using the short syntax
[disk]
devices=sdb*
disable_barriers=false
Conflicting plug-in definitions in a profile
If the same section is specified more than once using the include option, the settings are merged. If they cannot be merged due to a conflict, the last conflicting definition overrides the previous settings. If you do not know what was previously defined, you can use the replace Boolean option and set it to true. This causes all previous definitions with the same name to be overwritten, and the merge does not happen.
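For example, a child profile could discard an inherited [sysctl] section instead of merging with it. This is only a sketch; the parent profile name and the sysctl setting are illustrative:

```ini
[main]
include=throughput-performance

[sysctl]
# replace=true discards any [sysctl] settings inherited from the parent
# instead of merging them with the settings below
replace=true
vm.dirty_ratio=30
```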
You can also disable the plug-in by specifying the enabled=false option. This has the same effect as if the instance was never defined. Disabling the plug-in is useful if you are redefining the previous definition from the include option and do not want the plug-in to be active in your custom profile.
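A minimal sketch, assuming a parent profile that defines an [audio] instance you want to keep inactive in your custom profile:

```ini
[main]
include=powersave

[audio]
# Behaves as if the inherited audio instance had never been defined
enabled=false
```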
- NOTE
TuneD includes the ability to run any shell command as part of enabling or disabling a tuning profile. This enables you to extend TuneD profiles with functionality that has not been integrated into TuneD yet.
You can specify arbitrary shell commands using the script plug-in.
Additional resources
-
tuned.conf(5)
man page
2.8. Available TuneD plug-ins
This section lists all monitoring and tuning plug-ins currently available in TuneD.
Monitoring plug-ins
Currently, the following monitoring plug-ins are implemented:
disk
- Gets disk load (number of IO operations) per device and measurement interval.
net
- Gets network load (number of transferred packets) per network card and measurement interval.
load
- Gets CPU load per CPU and measurement interval.
Tuning plug-ins
Currently, the following tuning plug-ins are implemented. Only some of these plug-ins implement dynamic tuning. Options supported by plug-ins are also listed:
cpu
Sets the CPU governor to the value specified by the governor option and dynamically changes the Power Management Quality of Service (PM QoS) CPU Direct Memory Access (DMA) latency according to the CPU load.
If the CPU load is lower than the value specified by the load_threshold option, the latency is set to the value specified by the latency_high option; otherwise, it is set to the value specified by latency_low.
You can also force the latency to a specific value and prevent it from dynamically changing further. To do so, set the force_latency option to the required latency value.
eeepc_she
Dynamically sets the front-side bus (FSB) speed according to the CPU load.
This feature can be found on some netbooks and is also known as the ASUS Super Hybrid Engine (SHE).
If the CPU load is lower than or equal to the value specified by the load_threshold_powersave option, the plug-in sets the FSB speed to the value specified by the she_powersave option. If the CPU load is higher than or equal to the value specified by the load_threshold_normal option, it sets the FSB speed to the value specified by the she_normal option.
Static tuning is not supported, and the plug-in is transparently disabled if TuneD does not detect hardware support for this feature.
net
- Configures the Wake-on-LAN functionality to the value specified by the wake_on_lan option. It uses the same syntax as the ethtool utility. It also dynamically changes the interface speed according to the interface utilization.
sysctl
Sets various sysctl settings specified by the plug-in options.
The syntax is name=value, where name is the same as the name provided by the sysctl utility.
Use the sysctl plug-in if you need to change system settings that are not covered by other plug-ins available in TuneD. If the settings are covered by specific plug-ins, prefer those plug-ins.
usb
Sets the autosuspend timeout of USB devices to the value specified by the autosuspend parameter.
The value 0 means that autosuspend is disabled.
vm
Enables or disables transparent huge pages depending on the value of the transparent_hugepages option.
Valid values of the transparent_hugepages option are:
option are:- "always"
- "never"
- "madvise"
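For example, the following profile fragment disables transparent huge pages; the choice of "never" here is illustrative, not a recommendation:

```ini
[vm]
# Valid values: always, never, madvise
transparent_hugepages=never
```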
audio
Sets the autosuspend timeout for audio codecs to the value specified by the timeout option.
Currently, the snd_hda_intel and snd_ac97_codec codecs are supported. The value 0 means that autosuspend is disabled. You can also enforce a controller reset by setting the Boolean option reset_controller to true.
disk
Sets the disk elevator to the value specified by the elevator option.
It also sets:
- APM to the value specified by the apm option
- Scheduler quantum to the value specified by the scheduler_quantum option
- Disk spindown timeout to the value specified by the spindown option
- Disk readahead to the value specified by the readahead parameter
- The current disk readahead to a value multiplied by the constant specified by the readahead_multiply option
In addition, this plug-in dynamically changes the advanced power management and spindown timeout settings for the drive according to the current drive utilization. The dynamic tuning can be controlled by the Boolean option dynamic and is enabled by default.
scsi_host
Tunes options for SCSI hosts.
It sets Aggressive Link Power Management (ALPM) to the value specified by the alpm option.
mounts
- Enables or disables barriers for mounts according to the Boolean value of the disable_barriers option.
script
Executes an external script or binary when the profile is loaded or unloaded. You can choose an arbitrary executable.
Important: The script plug-in is provided mainly for compatibility with earlier releases. Prefer other TuneD plug-ins if they cover the required functionality.
TuneD calls the executable with one of the following arguments:
- start when loading the profile
- stop when unloading the profile
You need to correctly implement the stop action in your executable and revert all settings that you changed during the start action. Otherwise, the roll-back step after changing your TuneD profile will not work.
Bash scripts can import the /usr/lib/tuned/functions Bash library and use the functions defined there. Use these functions only for functionality that is not natively provided by TuneD. If a function name starts with an underscore, such as _wifi_set_power_level, consider the function private and do not use it in your scripts, because it might change in the future.
Specify the path to the executable using the script parameter in the plug-in configuration.
Example 2.6. Running a Bash script from a profile
To run a Bash script named script.sh that is located in the profile directory, use:
[script]
script=${i:PROFILE_DIR}/script.sh
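The start/stop contract described above can be sketched as follows. This is a hypothetical example: the file path, value, and function name are illustrative and not part of TuneD; a real script would change actual system settings and revert them on stop:

```shell
#!/bin/sh
# Hypothetical TuneD script plug-in executable: it must handle "start"
# and "stop", and stop must revert every change made during start.
SETTING_FILE=${SETTING_FILE:-/tmp/example-tuned-setting}

apply_profile() {
  case "$1" in
    start)
      # Change made when the profile is loaded
      echo "tuned-on" > "$SETTING_FILE"
      ;;
    stop)
      # Revert the change when the profile is unloaded
      rm -f "$SETTING_FILE"
      ;;
  esac
}

apply_profile "${1:-}"
```

TuneD would invoke such a script as script.sh start when activating the profile and script.sh stop when deactivating it.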
sysfs
Sets various sysfs settings specified by the plug-in options.
The syntax is name=value, where name is the sysfs path to use.
Use this plug-in if you need to change settings that are not covered by other plug-ins. Prefer specific plug-ins if they cover the required settings.
video
Sets various powersave levels on video cards. Currently, only the Radeon cards are supported.
The powersave level can be specified by using the radeon_powersave option. Supported values are:
- default
- auto
- low
- mid
- high
- dynpm
- dpm-battery
- dpm-balanced
- dpm-performance
For details, see www.x.org. Note that this plug-in is experimental and the option might change in future releases.
bootloader
Adds options to the kernel command line. This plug-in supports only the GRUB 2 boot loader.
A customized, non-standard location of the GRUB 2 configuration file can be specified by the grub2_cfg_file option.
Switching to another profile or manually stopping the tuned service removes the additional options. If you shut down or reboot the system, the kernel options persist in the grub.cfg file.
cmdline=arg1 arg2 ... argN
Example 2.7. Modifying the kernel command line
For example, to add the quiet kernel option to a TuneD profile, include the following lines in the tuned.conf file:
[bootloader]
cmdline=quiet
The following is an example of a custom profile that adds the isolcpus=2 option to the kernel command line:
[bootloader]
cmdline=isolcpus=2
2.9. Variables in TuneD profiles
Variables expand at run time when a TuneD profile is activated.
Using TuneD variables reduces the amount of necessary typing in TuneD profiles.
There are no predefined variables in TuneD profiles. You can define your own variables by creating the [variables] section in a profile and using the following syntax:
[variables]
variable_name=value
To expand the value of a variable in a profile, use the following syntax:
${variable_name}
Example 2.8. Isolating CPU cores using variables
In the following example, the ${isolated_cores} variable expands to 1,2; hence the kernel boots with the isolcpus=1,2 option:
[variables]
isolated_cores=1,2

[bootloader]
cmdline=isolcpus=${isolated_cores}
The variables can be specified in a separate file. For example, you can add the following lines to tuned.conf:
[variables]
include=/etc/tuned/my-variables.conf
[bootloader]
cmdline=isolcpus=${isolated_cores}
If you add the isolated_cores=1,2 option to the /etc/tuned/my-variables.conf file, the kernel boots with the isolcpus=1,2 option.
Additional resources
-
tuned.conf(5)
man page
2.10. Built-in functions in TuneD profiles
Built-in functions expand at run time when a TuneD profile is activated.
You can:
- Use various built-in functions together with TuneD variables
- Create custom functions in Python and add them to TuneD in the form of plug-ins
To call a function, use the following syntax:
${f:function_name:argument_1:argument_2}
To expand the directory path where the profile and the tuned.conf file are located, use the PROFILE_DIR function, which requires special syntax:
${i:PROFILE_DIR}
Example 2.9. Isolating CPU cores using variables and built-in functions
In the following example, the ${non_isolated_cores} variable expands to 0,3-5, and the cpulist_invert built-in function is called with the 0,3-5 argument:
[variables]
non_isolated_cores=0,3-5

[bootloader]
cmdline=isolcpus=${f:cpulist_invert:${non_isolated_cores}}
The cpulist_invert function inverts the list of CPUs. For a 6-CPU machine, the inversion is 1,2, and the kernel boots with the isolcpus=1,2 command-line option.
Additional resources
-
tuned.conf(5)
man page
2.11. Built-in functions available in TuneD profiles
The following built-in functions are available in all TuneD profiles:
PROFILE_DIR
- Returns the directory path where the profile and the tuned.conf file are located.
exec
- Executes a process and returns its output.
assertion
- Compares two arguments. If they do not match, the function logs text from the first argument and aborts profile loading.
assertion_non_equal
- Compares two arguments. If they match, the function logs text from the first argument and aborts profile loading.
kb2s
- Converts kilobytes to disk sectors.
s2kb
- Converts disk sectors to kilobytes.
strip
- Creates a string from all passed arguments and deletes both leading and trailing white space.
virt_check
Checks whether TuneD is running inside a virtual machine (VM) or on bare metal:
- Inside a VM, the function returns the first argument.
- On bare metal, the function returns the second argument, even in case of an error.
cpulist_invert
- Inverts a list of CPUs to make its complement. For example, on a system with 4 CPUs, numbered from 0 to 3, the inversion of the list 0,2,3 is 1.
cpulist2hex
- Converts a CPU list to a hexadecimal CPU mask.
cpulist2hex_invert
- Converts a CPU list to a hexadecimal CPU mask and inverts it.
hex2cpulist
- Converts a hexadecimal CPU mask to a CPU list.
cpulist_online
- Checks whether the CPUs from the list are online. Returns the list containing only online CPUs.
cpulist_present
- Checks whether the CPUs from the list are present. Returns the list containing only present CPUs.
cpulist_unpack
- Unpacks a CPU list in the form of 1-3,4 to 1,2,3,4.
cpulist_pack
- Packs a CPU list in the form of 1,2,3,5 to 1-3,5.
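As an illustrative sketch of the call syntax from section 2.10, a function can supply a computed value to any plug-in option. The CPU range below is arbitrary, and whether cpulist_online accepts range notation (as cpulist_unpack does) is an assumption:

```ini
[bootloader]
# Hypothetical example: isolate CPUs 1-3, but only those currently online
cmdline=isolcpus=${f:cpulist_online:1-3}
```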
2.12. Creating new TuneD profiles
This procedure creates a new TuneD profile with custom performance rules.
Prerequisites
- The tuned service is running. See Installing and Enabling TuneD for details.
Procedure
In the /etc/tuned/ directory, create a new directory named the same as the profile that you want to create:
# mkdir /etc/tuned/my-profile
In the new directory, create a file named tuned.conf. Add a [main] section and plug-in definitions in it, according to your requirements.
For example, see the configuration of the balanced profile:
[main]
summary=General non-specialized tuned profile

[cpu]
governor=conservative
energy_perf_bias=normal

[audio]
timeout=10

[video]
radeon_powersave=dpm-balanced, auto

[scsi_host]
alpm=medium_power
To activate the profile, use:
# tuned-adm profile my-profile
Verify that the TuneD profile is active and the system settings are applied:
$ tuned-adm active
Current active profile: my-profile
$ tuned-adm verify
Verification succeeded, current system settings match the preset profile. See tuned log file ('/var/log/tuned/tuned.log') for details.
Additional resources
-
tuned.conf(5)
man page
2.13. Modifying existing TuneD profiles
This procedure creates a modified child profile based on an existing TuneD profile.
Prerequisites
- The tuned service is running. See Installing and Enabling TuneD for details.
Procedure
In the /etc/tuned/ directory, create a new directory named the same as the profile that you want to create:
# mkdir /etc/tuned/modified-profile
In the new directory, create a file named tuned.conf, and set the [main] section as follows:
[main]
include=parent-profile
Replace parent-profile with the name of the profile you are modifying.
Include your profile modifications.
Example 2.10. Lowering swappiness in the throughput-performance profile
To use the settings from the throughput-performance profile and change the value of vm.swappiness to 5, instead of the default 10, use:
[main]
include=throughput-performance

[sysctl]
vm.swappiness=5
To activate the profile, use:
# tuned-adm profile modified-profile
Verify that the TuneD profile is active and the system settings are applied:
$ tuned-adm active
Current active profile: modified-profile
$ tuned-adm verify
Verification succeeded, current system settings match the preset profile. See tuned log file ('/var/log/tuned/tuned.log') for details.
Additional resources
-
tuned.conf(5)
man page
2.14. Setting the disk scheduler using TuneD
This procedure creates and enables a TuneD profile that sets a given disk scheduler for selected block devices. The setting persists across system reboots.
In the following commands and configuration, replace:
- device with the name of the block device, for example sdf
- selected-scheduler with the disk scheduler that you want to set for the device, for example bfq
Prerequisites
- The tuned service is installed and enabled. For details, see Installing and enabling TuneD.
Procedure
Optional: Select an existing TuneD profile on which your profile will be based. For a list of available profiles, see TuneD profiles distributed with RHEL.
To see which profile is currently active, use:
$ tuned-adm active
Create a new directory to hold your TuneD profile:
# mkdir /etc/tuned/my-profile
Find the system unique identifier of the selected block device:
$ udevadm info --query=property --name=/dev/device | grep -E '(WWN|SERIAL)'
ID_WWN=0x5002538d00000000_
ID_SERIAL=Generic-_SD_MMC_20120501030900000-0:0
ID_SERIAL_SHORT=20120501030900000
Note: The command in this example returns all values identified as a World Wide Name (WWN) or serial number associated with the specified block device. Although using a WWN is preferred, a WWN is not always available for a given device, and any value returned by the example command is acceptable to use as the device system unique ID.
Create the /etc/tuned/my-profile/tuned.conf configuration file. In the file, set the following options:
Optional: Include an existing profile:
[main]
include=existing-profile
Set the selected disk scheduler for the device that matches the WWN identifier:
[disk]
devices_udev_regex=IDNAME=device system unique id
elevator=selected-scheduler
Here:
- Replace IDNAME with the name of the identifier being used (for example, ID_WWN).
- Replace device system unique id with the value of the chosen identifier (for example, 0x5002538d00000000).
To match multiple devices in the devices_udev_regex option, enclose the identifiers in parentheses and separate them with vertical bars:
devices_udev_regex=(ID_WWN=0x5002538d00000000)|(ID_WWN=0x1234567800000000)
Enable your profile:
# tuned-adm profile my-profile
Verification steps
Verify that the TuneD profile is active and applied:
$ tuned-adm active
Current active profile: my-profile
$ tuned-adm verify
Verification succeeded, current system settings match the preset profile. See tuned log file ('/var/log/tuned/tuned.log') for details.
Read the contents of the /sys/block/device/queue/scheduler file:
# cat /sys/block/device/queue/scheduler
[mq-deadline] kyber bfq none
In the file name, replace device with the block device name, for example sdc.
The active scheduler is listed in square brackets ([]).
Additional resources
Chapter 3. Reviewing a system using tuna interface
Use the tuna tool to adjust scheduler tunables, tune thread priorities and IRQ handlers, and isolate CPU cores and sockets. Tuna reduces the complexity of performing tuning tasks.
The tuna
tool performs the following operations:
- Lists the CPUs on a system
- Lists the interrupt requests (IRQs) currently running on a system
- Changes policy and priority information on threads
- Displays the current policies and priorities of a system
3.1. Installing tuna tool
The tuna
tool is designed to be used on a running system. This allows application-specific measurement tools to see and analyze system performance immediately after changes have been made.
This procedure describes how to install the tuna
tool.
Procedure
Install the
tuna
tool:# dnf install tuna
Verification steps
View the available
tuna
CLI options:# tuna -h
Additional resources
-
tuna(8)
man page
3.2. Viewing the system status using tuna tool
This procedure describes how to view the system status using the tuna
command-line interface (CLI) tool.
Prerequisites
- The tuna tool is installed. For more information, see Installing tuna tool.
Procedure
To view the current policies and priorities:
# tuna --show_threads
thread
pid   SCHED_  rtpri  affinity  cmd
1     OTHER   0      0,1       init
2     FIFO    99     0         migration/0
3     OTHER   0      0         ksoftirqd/0
4     FIFO    99     0         watchdog/0
To view a specific thread corresponding to a PID or matching a command name:
# tuna --threads=pid_or_cmd_list --show_threads
The pid_or_cmd_list argument is a list of comma-separated PIDs or command-name patterns.
- To tune CPUs using the tuna CLI, see Tuning CPUs using tuna tool.
- To tune IRQs using the tuna tool, see Tuning IRQs using tuna tool.
To save the changed configuration:
# tuna --save=filename
This command saves only currently running kernel threads. Processes that are not running are not saved.
Additional resources
-
tuna(8)
man page
3.3. Tuning CPUs using tuna tool
The tuna
tool commands can target individual CPUs.
Using the tuna tool, you can:
Isolate CPUs
- All tasks running on the specified CPU move to the next available CPU. Isolating a CPU makes it unavailable by removing it from the affinity mask of all threads.
Include CPUs
- Allows tasks to run on the specified CPU.
Restore CPUs
- Restores the specified CPU to its previous configuration.
This procedure describes how to tune CPUs using the tuna
CLI.
Prerequisites
- The tuna tool is installed. For more information, see Installing tuna tool.
Procedure
To specify the list of CPUs to be affected by a command:
# tuna --cpus=cpu_list [command]
The cpu_list argument is a list of comma-separated CPU numbers, for example --cpus=0,2. CPU lists can also be specified as a range, for example --cpus="1-3", which selects CPUs 1, 2, and 3.
To add a specific CPU to the current cpu_list, use, for example, --cpus=+0.
Replace [command] with, for example, --isolate.
.To isolate a CPU:
# tuna --cpus=cpu_list --isolate
To include a CPU:
# tuna --cpus=cpu_list --include
On a system with four or more processors, to make all ssh threads run on CPUs 0 and 1 and all http threads on CPUs 2 and 3, use:
# tuna --cpus=0,1 --threads=ssh\* \
--move --cpus=2,3 --threads=http\* --move
This command performs the following operations sequentially:
- Selects CPUs 0 and 1.
- Selects all threads that begin with ssh.
- Moves the selected threads to the selected CPUs. Tuna sets the affinity mask of threads starting with ssh to the appropriate CPUs. The CPUs can be expressed numerically as 0 and 1, as the hex mask 0x3, or in binary as 11.
- Resets the CPU list to 2 and 3.
- Selects all threads that begin with http.
- Moves the selected threads to the specified CPUs. Tuna sets the affinity mask of threads starting with http to the specified CPUs. The CPUs can be expressed numerically as 2 and 3, as the hex mask 0xC, or in binary as 1100.
Verification steps
Display the current configuration and verify that the changes were performed as expected:
# tuna --threads=gnome-sc\* --show_threads \
--cpus=0 --move --show_threads --cpus=1 \
--move --show_threads --cpus=+0 --move --show_threads
thread  ctxt_switches
pid   SCHED_  rtpri  affinity  voluntary  nonvoluntary  cmd
3861  OTHER   0      0,1       33997      58            gnome-screensav
thread  ctxt_switches
pid   SCHED_  rtpri  affinity  voluntary  nonvoluntary  cmd
3861  OTHER   0      0         33997      58            gnome-screensav
thread  ctxt_switches
pid   SCHED_  rtpri  affinity  voluntary  nonvoluntary  cmd
3861  OTHER   0      1         33997      58            gnome-screensav
thread  ctxt_switches
pid   SCHED_  rtpri  affinity  voluntary  nonvoluntary  cmd
3861  OTHER   0      0,1       33997      58            gnome-screensav
This command performs the following operations sequentially:
- Selects all threads that begin with gnome-sc.
threads. - Displays the selected threads to enable the user to verify their affinity mask and RT priority.
- Selects CPU 0.
- Moves the gnome-sc threads to the specified CPU, CPU 0.
threads to the specified CPU, CPU 0. - Shows the result of the move.
- Resets the CPU list to CPU 1.
- Moves the gnome-sc threads to the specified CPU, CPU 1.
threads to the specified CPU, CPU 1. - Displays the result of the move.
- Adds CPU 0 to the CPU list.
- Moves the gnome-sc threads to the specified CPUs, CPUs 0 and 1.
threads to the specified CPUs, CPUs 0 and 1. - Displays the result of the move.
Additional resources
-
/proc/cpuinfo
file -
tuna(8)
man page
3.4. Tuning IRQs using tuna tool
The /proc/interrupts
file records the number of interrupts per IRQ, the type of interrupt, and the name of the device that is located at that IRQ.
This procedure describes how to tune the IRQs using the tuna
tool.
Prerequisites
- The tuna tool is installed. For more information, see Installing tuna tool.
Procedure
To view the current IRQs and their affinity:
# tuna --show_irqs
# users            affinity
0 timer            0
1 i8042            0
7 parport0         0
To specify the list of IRQs to be affected by a command:
# tuna --irqs=irq_list [command]
The irq_list argument is a list of comma-separated IRQ numbers or user-name patterns.
Replace [command] with, for example,
--spread
.To move an interrupt to a specified CPU:
# tuna --irqs=128 --show_irqs
# users            affinity
128 iwlwifi        0,1,2,3

# tuna --irqs=128 --cpus=3 --move
Replace 128 with the irq_list argument and 3 with the cpu_list argument.
The cpu_list argument is a list of comma-separated CPU numbers, for example,
--cpus=0,2
. For more information, see Tuning CPUs using tuna tool.
Verification steps
Compare the state of the selected IRQs before and after moving any interrupt to a specified CPU:
# tuna --irqs=128 --show_irqs
# users            affinity
128 iwlwifi        3
Additional resources
-
/proc/interrupts
file -
tuna(8)
man page
Chapter 4. Monitoring performance using RHEL System Roles
As a system administrator, you can use the Metrics RHEL System Role to monitor the performance of a system.
4.1. Introduction to RHEL System Roles
RHEL System Roles is a collection of Ansible roles and modules. RHEL System Roles provide a configuration interface to remotely manage multiple RHEL systems. The interface enables managing system configurations across multiple versions of RHEL, as well as adopting new major releases.
On Red Hat Enterprise Linux 9, the interface currently consists of the following roles:
- Certificate Issuance and Renewal
- Kernel Settings
- Metrics
- Network Bound Disk Encryption client and Network Bound Disk Encryption server
- Networking
- Postfix
- SSH client
- SSH server
- System-wide Cryptographic Policies
- Terminal Session Recording
All these roles are provided by the rhel-system-roles package available in the AppStream repository.
Additional resources
- Red Hat Enterprise Linux (RHEL) System Roles
- Documentation in the /usr/share/doc/rhel-system-roles/ directory
4.2. RHEL System Roles terminology
You can find the following terms across this documentation:
- Ansible playbook
- Playbooks are Ansible’s configuration, deployment, and orchestration language. They can describe a policy you want your remote systems to enforce, or a set of steps in a general IT process.
- Control node
- Any machine with Ansible installed. You can run commands and playbooks, invoking /usr/bin/ansible or /usr/bin/ansible-playbook, from any control node. You can use any computer that has Python installed on it as a control node - laptops, shared desktops, and servers can all run Ansible. However, you cannot use a Windows machine as a control node. You can have multiple control nodes.
- Inventory
- A list of managed nodes. An inventory file is also sometimes called a “hostfile”. Your inventory can specify information like IP address for each managed node. An inventory can also organize managed nodes, creating and nesting groups for easier scaling. To learn more about inventory, see the Working with Inventory section.
- Managed nodes
- The network devices, servers, or both that you manage with Ansible. Managed nodes are also sometimes called “hosts”. Ansible is not installed on managed nodes.
4.3. Installing RHEL System Roles in your system
To use the RHEL System Roles, install the required packages in your system.
Prerequisites
- The Ansible Core package is installed on the control machine.
- You have Ansible packages installed in the system you want to use as a control node.
Procedure
Install the
rhel-system-roles
package on the system that you want to use as a control node:# dnf install rhel-system-roles
Install the Ansible Core package:
# dnf install ansible-core
The Ansible Core package provides the ansible-playbook
CLI, the Ansible Vault functionality, and the basic modules and filters required by RHEL Ansible content.
As a result, you are able to create an Ansible playbook.
Additional resources
- The Red Hat Enterprise Linux (RHEL) System Roles
-
The
ansible-playbook
man page.
4.4. Applying a role
The following procedure describes how to apply a particular role.
Prerequisites
Ensure that the
rhel-system-roles
package is installed on the system that you want to use as a control node:# dnf install rhel-system-roles
Install the Ansible Core package:
# dnf install ansible-core
The Ansible Core package provides the ansible-playbook CLI, the Ansible Vault functionality, and the basic modules and filters required by RHEL Ansible content.
Ensure that you are able to create an Ansible inventory.
Inventories represent the hosts, host groups, and some of the configuration parameters used by the Ansible playbooks.
Inventories are typically human-readable, and are defined in ini, yaml, json, and other file formats.
, and other file formats.Ensure that you are able to create an Ansible playbook.
Playbooks represent Ansible’s configuration, deployment, and orchestration language. By using playbooks, you can declare and manage configurations of remote machines, deploy multiple remote machines or orchestrate steps of any manual ordered process.
A playbook is a list of one or more plays. Every play can include Ansible variables, tasks, or roles.
Playbooks are human-readable, and are defined in the yaml format.
Procedure
Create the required Ansible inventory containing the hosts and groups that you want to manage. Here is an example using a file called inventory.ini with a group of hosts called webservers:
[webservers]
host1
host2
host3
Create an Ansible playbook including the required role. The following example shows how to use roles through the roles: option for a given play:
---
- hosts: webservers
  roles:
    - rhel-system-roles.network
    - rhel-system-roles.postfix
Note: Every role includes a README file, which documents how to use the role and the supported parameter values. You can also find an example playbook for a particular role under the documentation directory of the role. This documentation directory is provided by default with the rhel-system-roles package and can be found in the following location:
/usr/share/doc/rhel-system-roles/SUBSYSTEM/
Replace SUBSYSTEM with the name of the required role, such as postfix, metrics, network, tlog, or ssh.
.To execute the playbook on specific hosts, you must perform one of the following:
Edit the playbook to use hosts: host1[,host2,…] or hosts: all, and execute the command:
# ansible-playbook name.of.the.playbook
Edit the inventory to ensure that the hosts you want to use are defined in a group, and execute the command:
# ansible-playbook -i name.of.the.inventory name.of.the.playbook
Specify all hosts when executing the ansible-playbook command:
# ansible-playbook -i host1,host2,... name.of.the.playbook
Important: Be aware that the -i flag specifies the inventory of all hosts that are available. If you have multiple targeted hosts but want to select a host against which to run the playbook, you can add a variable in the playbook to select a host. For example:
Ansible Playbook | example-playbook.yml:
- hosts: "{{ target_host }}"
  roles:
    - rhel-system-roles.network
    - rhel-system-roles.postfix
Playbook execution command:
# ansible-playbook -i host1,..hostn -e target_host=host5 example-playbook.yml
4.5. Introduction to the Metrics System Role
RHEL System Roles is a collection of Ansible roles and modules that provide a consistent configuration interface to remotely manage multiple RHEL systems. The Metrics System Role configures performance analysis services for the local system and, optionally, includes a list of remote systems to be monitored by the local system. The Metrics System Role enables you to use pcp to monitor your systems' performance without having to configure pcp separately, as the set-up and deployment of pcp is handled by the playbook.
Table 4.1. Metrics system role variables
| Role variable | Description | Example usage |
|---|---|---|
| metrics_monitored_hosts | List of remote hosts to be analyzed by the target host. These hosts will have metrics recorded on the target host, so ensure enough disk space exists below the directory where the logs are stored. | |
| metrics_retention_days | Configures the number of days for performance data retention before deletion. | |
| metrics_graph_service | A Boolean flag that enables the host to be set up with services for performance data visualization via pcp and Grafana. | |
| metrics_query_service | A Boolean flag that enables the host to be set up with time series query services for querying recorded pcp metrics. | |
| metrics_provider | Specifies which metrics collector to use to provide metrics. Currently, pcp is supported as the metrics provider. | |
For details about the parameters used in metrics_connections
and additional information about the Metrics System Role, see the /usr/share/ansible/roles/rhel-system-roles.metrics/README.md
file.
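A minimal sketch of a playbook that combines the variables from Table 4.1 in one place; the host names and values are illustrative, reusing the examples that appear later in this chapter:

```yaml
---
- hosts: localhost
  vars:
    metrics_graph_service: yes        # provision Grafana for visualization
    metrics_query_service: yes        # provision time series query services
    metrics_retention_days: 10        # keep performance data for 10 days
    metrics_monitored_hosts: ["database.example.com", "webserver.example.com"]
  roles:
    - rhel-system-roles.metrics
```

This is a configuration sketch only; adapt the host list and retention to your environment before running it with ansible-playbook.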
4.6. Using the Metrics System Role to monitor your local system with visualization
This procedure describes how to use the Metrics RHEL System Role to monitor your local system while simultaneously provisioning data visualization via Grafana
.
Prerequisites
- The Ansible Core package is installed on the control machine.
-
You have the
rhel-system-roles
package installed on the machine you want to monitor.
Procedure
Configure localhost in the /etc/ansible/hosts Ansible inventory by adding the following content to the inventory:
localhost ansible_connection=local
Create an Ansible playbook with the following content:
---
- hosts: localhost
  vars:
    metrics_graph_service: yes
  roles:
    - rhel-system-roles.metrics
Run the Ansible playbook:
# ansible-playbook name_of_your_playbook.yml
Note
Since the metrics_graph_service boolean is set to yes, Grafana is automatically installed and provisioned with pcp added as a data source.
- To view visualization of the metrics being collected on your machine, access the grafana web interface as described in Accessing the Grafana web UI.
4.7. Using the Metrics System Role to set up a fleet of individual systems to monitor themselves
This procedure describes how to use the Metrics System Role to set up a fleet of machines to monitor themselves.
Prerequisites
- The Ansible Core package is installed on the control machine.
-
You have the
rhel-system-roles
package installed on the machine you want to use to run the playbook. - You have the SSH connection established.
Procedure
Add the name or IP of the machines you wish to monitor via the playbook to the /etc/ansible/hosts Ansible inventory file under an identifying group name enclosed in brackets:
[remotes]
webserver.example.com
database.example.com
Create an Ansible playbook with the following content:
---
- hosts: remotes
  vars:
    metrics_retention_days: 0
  roles:
    - rhel-system-roles.metrics
Run the Ansible playbook:
# ansible-playbook name_of_your_playbook.yml -k
Where the -k option prompts for a password to connect to the remote system.
4.8. Using the Metrics System Role to monitor a fleet of machines centrally via your local machine
This procedure describes how to use the Metrics System Role to set up your local machine to centrally monitor a fleet of machines while also provisioning visualization of the data via grafana
and querying of the data via redis
.
Prerequisites
- The Ansible Core package is installed on the control machine.
-
You have the
rhel-system-roles
package installed on the machine you want to use to run the playbook.
Procedure
Create an Ansible playbook with the following content:
---
- hosts: localhost
  vars:
    metrics_graph_service: yes
    metrics_query_service: yes
    metrics_retention_days: 10
    metrics_monitored_hosts: ["database.example.com", "webserver.example.com"]
  roles:
    - rhel-system-roles.metrics
Run the Ansible playbook:
# ansible-playbook name_of_your_playbook.yml
Note
Since the metrics_graph_service and metrics_query_service booleans are set to yes, grafana is automatically installed and provisioned with pcp added as a data source, and the pcp data recording is indexed into redis, allowing the pcp querying language to be used for complex querying of the data.
- To view a graphical representation of the metrics being collected centrally by your machine and to query the data, access the grafana web interface as described in Accessing the Grafana web UI.
4.9. Setting up authentication while monitoring a system using the Metrics System Role
PCP supports the scram-sha-256 authentication mechanism through the Simple Authentication Security Layer (SASL) framework. The Metrics RHEL System Role automates the steps to set up authentication using the scram-sha-256 authentication mechanism. This procedure describes how to set up authentication using the Metrics RHEL System Role.
Prerequisites
- The Ansible Core package is installed on the control machine.
-
You have the
rhel-system-roles
package installed on the machine you want to use to run the playbook.
Procedure
Include the following variables in the Ansible playbook for which you want to set up authentication:
---
vars:
  metrics_username: your_username
  metrics_password: your_password
Run the Ansible playbook:
# ansible-playbook name_of_your_playbook.yml
Verification steps
Verify the sasl configuration:
# pminfo -f -h "pcp://ip_address?username=your_username" disk.dev.read
Password:
disk.dev.read
inst [0 or "sda"] value 19540
ip_address should be replaced by the IP address of the host.
4.10. Using the Metrics System Role to configure and enable metrics collection for SQL Server
This procedure describes how to use the Metrics RHEL System Role to automate the configuration and enabling of metrics collection for Microsoft SQL Server via pcp
on your local system.
Prerequisites
- The Ansible Core package is installed on the control machine.
-
You have the
rhel-system-roles
package installed on the machine you want to monitor. - You have installed Microsoft SQL Server for Red Hat Enterprise Linux and established a 'trusted' connection to an SQL server. See Install SQL Server and create a database on Red Hat.
- You have installed the Microsoft ODBC driver for SQL Server for Red Hat Enterprise Linux. See Red Hat Enterprise Server and Oracle Linux.
Procedure
Configure localhost in the /etc/ansible/hosts Ansible inventory by adding the following content to the inventory:
localhost ansible_connection=local
Create an Ansible playbook that contains the following content:
---
- hosts: localhost
  roles:
    - role: rhel-system-roles.metrics
      vars:
        metrics_from_mssql: yes
Run the Ansible playbook:
# ansible-playbook name_of_your_playbook.yml
Verification steps
Use the pcp command to verify that the SQL Server PMDA agent (mssql) is loaded and running:
# pcp
platform: Linux rhel82-2.local 4.18.0-167.el8.x86_64 #1 SMP Sun Dec 15 01:24:23 UTC 2019 x86_64
hardware: 2 cpus, 1 disk, 1 node, 2770MB RAM
timezone: PDT+7
services: pmcd pmproxy
pmcd: Version 5.0.2-1, 12 agents, 4 clients
pmda: root pmcd proc pmproxy xfs linux nfsclient mmv kvm mssql jbd2 dm
pmlogger: primary logger: /var/log/pcp/pmlogger/rhel82-2.local/20200326.16.31
pmie: primary engine: /var/log/pcp/pmie/rhel82-2.local/pmie.log
Chapter 5. Setting up PCP
Performance Co-Pilot (PCP) is a suite of tools, services, and libraries for monitoring, visualizing, storing, and analyzing system-level performance measurements.
This section describes how to install and enable PCP on your system.
5.1. Overview of PCP
You can add performance metrics using Python, Perl, C++, and C interfaces. Analysis tools can use the Python, C++, C client APIs directly, and rich web applications can explore all available performance data using a JSON interface.
You can analyze data patterns by comparing live results with archived data.
Features of PCP:
- Light-weight distributed architecture, which is useful during the centralized analysis of complex systems.
- It allows the monitoring and management of real-time data.
- It allows logging and retrieval of historical data.
PCP has the following components:
-
The Performance Metric Collector Daemon (
pmcd
) collects performance data from the installed Performance Metric Domain Agents (pmda
). PMDAs can be individually loaded or unloaded on the system and are controlled by the PMCD on the same host. -
Various client tools, such as
pminfo
orpmstat
, can retrieve, display, archive, and process this data on the same host or over the network. -
The
pcp
package provides the command-line tools and underlying functionality. -
The
pcp-gui
package provides the graphical application. Install thepcp-gui
package by executing thednf install pcp-gui
command. For more information, see Visually tracing PCP log archives with the PCP Charts application.
Additional resources
-
pcp(1)
man page -
/usr/share/doc/pcp-doc/
directory - Tools distributed with PCP
- Index of Performance Co-Pilot (PCP) articles, solutions, tutorials, and white papers from the Red Hat Customer Portal
- Side-by-side comparison of PCP tools with legacy tools Red Hat Knowledgebase article
- PCP upstream documentation
5.2. Installing and enabling PCP
To begin using PCP, install all the required packages and enable the PCP monitoring services.
This procedure describes how to install PCP using the pcp
package. If you want to automate the PCP installation, install it using the pcp-zeroconf
package. For more information on installing PCP by using pcp-zeroconf
, see Setting up PCP with pcp-zeroconf.
Procedure
Install the
pcp
package:# dnf install pcp
Enable and start the
pmcd
service on the host machine:# systemctl enable pmcd # systemctl start pmcd
Verification steps
Verify if the pmcd process is running on the host:
# pcp
Performance Co-Pilot configuration on workstation:
platform: Linux workstation 4.18.0-80.el8.x86_64 #1 SMP Wed Mar 13 12:02:46 UTC 2019 x86_64
hardware: 12 cpus, 2 disks, 1 node, 36023MB RAM
timezone: CEST-2
services: pmcd
pmcd: Version 4.3.0-1, 8 agents
pmda: root pmcd proc xfs linux mmv kvm jbd2
Additional resources
-
pmcd(1)
man page - Tools distributed with PCP
5.3. Deploying a minimal PCP setup
The minimal PCP setup collects performance statistics on Red Hat Enterprise Linux. The setup involves adding the minimum number of packages on a production system needed to gather data for further analysis.
You can analyze the resulting tar.gz
file and the archive of the pmlogger
output using various PCP tools and compare them with other sources of performance information.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
Procedure
Update the
pmlogger
configuration:# pmlogconf -r /var/lib/pcp/config/pmlogger/config.default
Start the
pmcd
andpmlogger
services:# systemctl start pmcd.service # systemctl start pmlogger.service
- Execute the required operations to record the performance data.
Stop the
pmcd
andpmlogger
services:# systemctl stop pmcd.service # systemctl stop pmlogger.service
Save the output to a tar.gz file named after the host name and the current date and time:
# cd /var/log/pcp/pmlogger/
# tar -czf $(hostname).$(date +%F-%Hh%M).pcp.tar.gz $(hostname)
Extract this file and analyze the data using PCP tools.
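The archive name produced above follows a hostname.date-time.pcp.tar.gz pattern. A minimal sketch of the same naming convention, run in a scratch directory so that no PCP installation or /var/log/pcp access is needed (the archive file name inside is a placeholder):

```shell
# Demonstrate the $(hostname).$(date +%F-%Hh%M).pcp.tar.gz naming convention.
workdir=$(mktemp -d)
cd "$workdir"
host=$(hostname)
mkdir -p "$host"
echo "placeholder archive data" > "$host/20200326.16.31.0"
tarball="$host.$(date +%F-%Hh%M).pcp.tar.gz"
tar -czf "$tarball" "$host"
# List the contents, as you would before extracting for analysis:
tar -tzf "$tarball"
```

The same pattern makes archives from many hosts self-describing when they are gathered in one place.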
Additional resources
-
pmlogconf(1)
,pmlogger(1)
, andpmcd(1)
man pages - Tools distributed with PCP
- System services distributed with PCP
5.4. System services distributed with PCP
The following table describes the roles of various system services that are distributed with PCP.
Table 5.1. Roles of system services distributed with PCP
Name | Description |
pmcd | The Performance Metric Collector Daemon (PMCD). |
pmie | The Performance Metrics Inference Engine. |
pmlogger | The performance metrics logger. |
pmproxy | The realtime and historical performance metrics proxy, time series query and REST API service. |
5.5. Tools distributed with PCP
The following table describes the usage of various tools that are distributed with PCP.
Table 5.2. Usage of tools distributed with PCP
Name | Description |
pcp | Displays the current status of a Performance Co-Pilot installation. |
pcp-atop | Shows the system-level occupation of the most critical hardware resources from the performance point of view: CPU, memory, disk, and network. |
pcp-atopsar | Generates a system-level activity report over a variety of system resource utilization. The report is generated from a raw logfile previously recorded using pmlogger or the -w option of pcp-atop. |
pcp-dmcache | Displays information about configured Device Mapper Cache targets, such as: device IOPs, cache and metadata device utilization, as well as hit and miss rates and ratios for both reads and writes for each cache device. |
pcp-dstat | Displays metrics of one system at a time. To display metrics of multiple systems, use the --host option. |
pcp-free | Reports on free and used memory in a system. |
pcp-ps | Displays all processes running on a system along with their command line arguments in a manner similar to the ps command. |
pcp-ipcs | Displays information on the inter-process communication (IPC) facilities that the calling process has read access for. |
pcp-numastat | Displays NUMA allocation statistics from the kernel memory allocator. |
pcp-pidstat | Displays information about individual tasks or processes running on the system such as: CPU percentage, memory and stack usage, scheduling, and priority. Reports live data for the local host by default. |
pcp-ss | Displays socket statistics collected by the pmdasockets Performance Metrics Domain Agent (PMDA). |
pcp-uptime | Displays how long the system has been running, how many users are currently logged on, and the system load averages for the past 1, 5, and 15 minutes. |
pcp-vmstat | Provides a high-level system performance overview every 5 seconds. Displays information about processes, memory, paging, block IO, traps, and CPU activity. |
pmchart | Plots performance metrics values available through the facilities of the Performance Co-Pilot. |
pmclient | Displays high-level system performance metrics by using the Performance Metrics Application Programming Interface (PMAPI). |
pmconfig | Displays the values of configuration parameters. |
pmdbg | Displays available Performance Co-Pilot debug control flags and their values. |
pmdiff | Compares the average values for every metric in either one or two archives, in a given time window, for changes that are likely to be of interest when searching for performance regressions. |
pmdumplog | Displays control, metadata, index, and state information from a Performance Co-Pilot archive file. |
pmdumptext | Outputs the values of performance metrics collected live or from a Performance Co-Pilot archive. |
pmerr | Displays available Performance Co-Pilot error codes and their corresponding error messages. |
pmfind | Finds PCP services on the network. |
pmie | An inference engine that periodically evaluates a set of arithmetic, logical, and rule expressions. The metrics are collected either from a live system, or from a Performance Co-Pilot archive file. |
pmieconf | Displays or sets configurable pmie variables. |
pmiectl | Manages non-primary instances of pmie. |
pminfo | Displays information about performance metrics. The metrics are collected either from a live system, or from a Performance Co-Pilot archive file. |
pmiostat | Reports I/O statistics for SCSI devices (by default) or device-mapper devices (with the -x dm option). |
pmlc | Interactively configures active pmlogger instances. |
pmlogcheck | Identifies invalid data in a Performance Co-Pilot archive file. |
pmlogconf | Creates and modifies a pmlogger configuration file. |
pmlogctl | Manages non-primary instances of pmlogger. |
pmloglabel | Verifies, modifies, or repairs the label of a Performance Co-Pilot archive file. |
pmlogsummary | Calculates statistical information about performance metrics stored in a Performance Co-Pilot archive file. |
pmprobe | Determines the availability of performance metrics. |
pmrep | Reports on selected, easily customizable, performance metrics values. |
pmsocks | Allows access to Performance Co-Pilot hosts through a firewall. |
pmstat | Periodically displays a brief summary of system performance. |
pmstore | Modifies the values of performance metrics. |
pmtrace | Provides a command line interface to the trace PMDA. |
pmval | Displays the current value of a performance metric. |
5.6. PCP deployment architectures
Performance Co-Pilot (PCP) offers many options to accomplish advanced setups. Of the many possible architectures, this section describes how to scale your PCP deployment based on the deployment setup recommended by Red Hat, sizing factors, and configuration options.
PCP supports multiple deployment architectures, based on the scale of the PCP deployment.
Available scaling deployment setup variants:
Localhost
Each service runs locally on the monitored machine. When you start a service without any configuration changes, this is the default deployment. Scaling beyond the individual node is not possible in this case.
By default, the deployment setup for Redis is standalone, localhost. However, Redis can optionally perform in a highly-available and highly scalable clustered fashion, where data is shared across multiple hosts. Another viable option is to deploy a Redis cluster in the cloud, or to utilize a managed Redis cluster from a cloud vendor.
Decentralized
The only difference between the localhost and decentralized setup is the centralized Redis service. In this model, a pmlogger service runs on each monitored host and retrieves metrics from a local pmcd instance. A local pmproxy service then exports the performance metrics to a central Redis instance.
Figure 5.1. Decentralized logging
Centralized logging - pmlogger farm
When the resource usage on the monitored hosts is constrained, another deployment option is a
pmlogger
farm, which is also known as centralized logging. In this setup, a single logger host executes multiplepmlogger
processes, and each is configured to retrieve performance metrics from a different remotepmcd
host. The centralized logger host is also configured to execute thepmproxy
service, which discovers the resulting PCP archives logs and loads the metric data into a Redis instance.Figure 5.2. Centralized logging - pmlogger farm
Federated - multiple pmlogger farms
For large scale deployments, Red Hat recommends deploying multiple
pmlogger
farms in a federated fashion. For example, onepmlogger
farm per rack or data center. Eachpmlogger
farm loads the metrics into a central Redis instance.Figure 5.3. Federated - multiple pmlogger farms
Additional resources
-
pcp(1)
,pmlogger(1)
,pmproxy(1)
, andpmcd(1)
man pages - Recommended deployment architecture
5.7. Recommended deployment architecture
The following table describes the recommended deployment architectures based on the number of monitored hosts.
Table 5.3. Recommended deployment architecture
Number of hosts (N) | 1-10 | 10-100 | 100-1000 |
---|---|---|---|
pmcd servers | N | N | N |
pmlogger servers | 1 to N | N/10 to N | N/100 to N |
pmproxy servers | 1 to N | 1 to N | N/100 to N |
Redis servers | 1 to N | 1 to N/10 | N/100 to N/10 |
Redis cluster | No | Maybe | Yes |
Recommended deployment setup | Localhost, Decentralized, or Centralized logging | Decentralized, Centralized logging, or Federated | Decentralized or Federated |
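As a worked example of reading the table, for a hypothetical fleet of 500 monitored hosts (the 100-1000 column), the minimum and maximum counts implied by the N/100 and N/10 rows can be computed directly; this is a sketch, not a sizing recommendation:

```shell
# Apply the 100-1000 column of Table 5.3 to a hypothetical fleet.
n=500                           # hypothetical number of monitored hosts
pmlogger_min=$(( n / 100 ))     # pmlogger servers: N/100 to N
pmproxy_min=$(( n / 100 ))      # pmproxy servers: N/100 to N
redis_min=$(( n / 100 ))        # Redis servers: N/100 ...
redis_max=$(( n / 10 ))         # ... to N/10
echo "pmlogger>=$pmlogger_min pmproxy>=$pmproxy_min redis=$redis_min-$redis_max"
```

At this scale the table also recommends a Redis cluster and a Decentralized or Federated setup.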
5.8. Sizing factors
The following are the sizing factors required for scaling:
Remote system size
-
The number of CPUs, disks, network interfaces, and other hardware resources affects the amount of data collected by each pmlogger on the centralized logging host.
Logged Metrics
-
The number and types of logged metrics play an important role. In particular, the per-process proc.* metrics require a large amount of disk space. For example, with the standard pcp-zeroconf setup and a 10s logging interval, archives take 11 MB without proc metrics versus 155 MB with proc metrics - more than 10 times as much. Additionally, the number of instances for each metric, for example the number of CPUs, block devices, and network interfaces, also impacts the required storage capacity.
Logging Interval
-
How often metrics are logged affects the storage requirements. The expected daily PCP archive file sizes are written to the pmlogger.log file for each pmlogger instance. These values are uncompressed estimates. Since PCP archives compress very well, approximately 10:1, the actual long-term disk space requirements can be determined for a particular site.
pmlogrewrite
-
After every PCP upgrade, the pmlogrewrite tool is executed and rewrites old archives if there were changes in the metric metadata between the previous and the new version of PCP. The duration of this process scales linearly with the number of archives stored.
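To turn the uncompressed daily estimate written to pmlogger.log into a long-term disk figure, apply the approximate 10:1 compression ratio described above; the daily size and retention period below are hypothetical:

```shell
# Long-term disk estimate = daily uncompressed size * retention / compression.
daily_uncompressed_mb=155   # hypothetical daily estimate read from pmlogger.log
retention_days=30           # hypothetical retention period
compression_ratio=10        # PCP archives compress approximately 10:1
required_mb=$(( daily_uncompressed_mb * retention_days / compression_ratio ))
echo "${required_mb} MB"
```

Under these assumptions, a month of archives needs roughly 465 MB of compressed storage per monitored configuration.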
Additional resources
-
pmlogrewrite(1)
andpmlogger(1)
man pages
5.9. Configuration options for PCP scaling
The following are the configuration options, which are required for scaling:
sysctl and rlimit settings
-
When archive discovery is enabled,
pmproxy
requires four descriptors for everypmlogger
that it is monitoring or log-tailing, along with the additional file descriptors for the service logs andpmproxy
client sockets, if any. Eachpmlogger
process uses about 20 file descriptors for the remotepmcd
socket, archive files, service logs, and others. In total, this can exceed the default 1024 soft limit on a system running around 200pmlogger
processes. Thepmproxy
service inpcp-5.3.0
and later automatically increases the soft limit to the hard limit. On earlier versions of PCP, tuning is required if a high number ofpmlogger
processes are to be deployed, and this can be accomplished by increasing the soft or hard limits forpmlogger
. For more information, see How to set limits (ulimit) for services run by systemd. Local Archives
-
The pmlogger service stores metrics of local and remote pmcds in the /var/log/pcp/pmlogger/ directory. To control the logging interval of the local system, update the /etc/pcp/pmlogger/control.d/configfile file and add -t X to the arguments, where X is the logging interval in seconds. To configure which metrics should be logged, execute pmlogconf /var/lib/pcp/config/pmlogger/config.clienthostname. This command deploys a configuration file with a default set of metrics, which can optionally be further customized. To specify retention settings, that is, when to purge old PCP archives, update the /etc/sysconfig/pmlogger_timers file and specify PMLOGGER_DAILY_PARAMS="-E -k X", where X is the number of days to keep PCP archives.
Redis
The pmproxy service sends logged metrics from pmlogger to a Redis instance. The following two options are available to specify the retention settings in the /etc/pcp/pmproxy/pmproxy.conf configuration file:
- stream.expire specifies the duration after which stale metrics are removed, that is, metrics which were not updated in a specified amount of time in seconds.
- stream.maxlen specifies the maximum number of metric values for one metric per host. This setting should be the retention time divided by the logging interval, for example 20160 for 14 days of retention and a 60s logging interval (60*60*24*14/60).
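The stream.maxlen arithmetic above can be checked with shell arithmetic:

```shell
# stream.maxlen = retention time / logging interval
retention_days=14
interval_s=60
maxlen=$(( retention_days * 24 * 60 * 60 / interval_s ))
echo "$maxlen"
```

Substituting a different retention period or logging interval gives the corresponding stream.maxlen value for your site.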
Additional resources
-
pmproxy(1)
,pmlogger(1)
, andsysctl(8)
man pages
5.10. Example: Analyzing the centralized logging deployment
The following results were gathered on a centralized logging setup, also known as pmlogger farm deployment, with a default pcp-zeroconf 5.3.0
installation, where each remote host is an identical container instance running pmcd
on a server with 64 CPU cores, 376 GB RAM, and one disk attached.
The logging interval is 10s, proc metrics of remote nodes are not included, and the memory values refer to the Resident Set Size (RSS) value.
Table 5.4. Detailed utilization statistics for 10s logging interval
Number of Hosts | 10 | 50 |
---|---|---|
PCP Archives Storage per Day | 91 MB | 522 MB |
pmlogger Memory | 160 MB | 580 MB |
Network per Day (In/Out) | 2 MB | 9 MB |
pmproxy Memory | 1.4 GB | 6.3 GB |
Redis Memory per Day | 2.6 GB | 12 GB |
Table 5.5. Used resources depending on monitored hosts for 60s logging interval
Number of Hosts | 10 | 50 | 100 |
---|---|---|---|
PCP Archives Storage per Day | 20 MB | 120 MB | 271 MB |
pmlogger Memory | 104 MB | 524 MB | 1049 MB |
Network per Day (In/Out) | 0.38 MB | 1.75 MB | 3.48 MB |
pmproxy Memory | 2.67 GB | 5.5 GB | 9 GB |
Redis Memory per Day | 0.54 GB | 2.65 GB | 5.3 GB |
The pmproxy
queues Redis requests and employs Redis pipelining to speed up Redis queries. This can result in high memory usage. For troubleshooting this issue, see Troubleshooting high memory usage.
5.11. Example: Analyzing the federated setup deployment
The following results were observed on a federated setup, also known as multiple pmlogger
farms, consisting of three centralized logging (pmlogger
farm) setups, where each pmlogger
farm was monitoring 100 remote hosts, that is 300 hosts in total.
This setup of the pmlogger
farms is identical to the configuration mentioned in the Example: Analyzing the centralized logging deployment for 60s logging interval, except that the Redis servers were operating in cluster mode.
Table 5.6. Used resources depending on federated hosts for 60s logging interval
PCP Archives Storage per Day | pmlogger Memory | Network per Day (In/Out) | pmproxy Memory | Redis Memory per Day |
---|---|---|---|---|
277 MB | 1058 MB | 15.6 MB / 12.3 MB | 6-8 GB | 5.5 GB |
Here, all values are per host. The network bandwidth is higher due to the inter-node communication of the Redis cluster.
5.12. Troubleshooting high memory usage
The following scenarios can result in high memory usage:
-
The
pmproxy
process is busy processing new PCP archives and does not have spare CPU cycles to process Redis requests and responses. - The Redis node or cluster is overloaded and cannot process incoming requests on time.
The pmproxy service daemon uses Redis streams and supports configuration parameters, which are PCP tuning parameters and affect Redis memory usage and key retention. The /etc/pcp/pmproxy/pmproxy.conf file lists the available configuration options for pmproxy and the associated APIs.
This section describes how to troubleshoot the high memory usage issue.
Prerequisites
Install the
pcp-pmda-redis
package:# dnf install pcp-pmda-redis
Install the redis PMDA:
# cd /var/lib/pcp/pmdas/redis && ./Install
Procedure
To troubleshoot high memory usage, execute the following command and observe the inflight column:
$ pmrep :pmproxy
         backlog  inflight  reqs/s  resp/s   wait req err  resp err  changed  throttled
           byte     count   count/s count/s  s/s  count/s   count/s  count/s    count/s
14:59:08      0        0       N/A     N/A   N/A      N/A       N/A      N/A        N/A
14:59:09      0        0    2268.9  2268.9    28        0         0      2.0        4.0
14:59:10      0        0       0.0     0.0     0        0         0      0.0        0.0
14:59:11      0        0       0.0     0.0     0        0         0      0.0        0.0
This column shows how many Redis requests are in-flight, which means they are queued or sent, and no reply was received so far.
A high number indicates one of the following conditions:
-
The
pmproxy
process is busy processing new PCP archives and does not have spare CPU cycles to process Redis requests and responses. - The Redis node or cluster is overloaded and cannot process incoming requests on time.
To troubleshoot the high memory usage issue, reduce the number of pmlogger processes for this farm, and add another pmlogger farm. Use the federated - multiple pmlogger farms setup.
If the Redis node is using 100% CPU for an extended amount of time, move it to a host with better performance or use a clustered Redis setup instead.
To view the pmproxy.redis.* metrics, use the following command:
$ pminfo -ftd pmproxy.redis
pmproxy.redis.responses.wait [wait time for responses]
    Data Type: 64-bit unsigned int  InDom: PM_INDOM_NULL 0xffffffff
    Semantics: counter  Units: microsec
    value 546028367374
pmproxy.redis.responses.error [number of error responses]
    Data Type: 64-bit unsigned int  InDom: PM_INDOM_NULL 0xffffffff
    Semantics: counter  Units: count
    value 1164
[...]
pmproxy.redis.requests.inflight.bytes [bytes allocated for inflight requests]
    Data Type: 64-bit int  InDom: PM_INDOM_NULL 0xffffffff
    Semantics: discrete  Units: byte
    value 0
pmproxy.redis.requests.inflight.total [inflight requests]
    Data Type: 64-bit unsigned int  InDom: PM_INDOM_NULL 0xffffffff
    Semantics: discrete  Units: count
    value 0
[...]
To view how many Redis requests are inflight, see the pmproxy.redis.requests.inflight.total metric, and see the pmproxy.redis.requests.inflight.bytes metric to view how many bytes are occupied by all current inflight Redis requests.
In general, the Redis request queue is zero, but it can build up based on the usage of large pmlogger farms, which limits scalability and can cause high latency for pmproxy clients.
Use the pminfo command to view information about performance metrics. For example, to view the redis.* metrics, use the following command:
$ pminfo -ftd redis
redis.redis_build_id [Build ID]
    Data Type: string  InDom: 24.0 0x6000000
    Semantics: discrete  Units: count
    inst [0 or "localhost:6379"] value "87e335e57cffa755"
redis.total_commands_processed [Total number of commands processed by the server]
    Data Type: 64-bit unsigned int  InDom: 24.0 0x6000000
    Semantics: counter  Units: count
    inst [0 or "localhost:6379"] value 595627069
[...]
redis.used_memory_peak [Peak memory consumed by Redis (in bytes)]
    Data Type: 32-bit unsigned int  InDom: 24.0 0x6000000
    Semantics: instant  Units: count
    inst [0 or "localhost:6379"] value 572234920
[...]
To view the peak memory usage, see the
redis.used_memory_peak
metric.
Additional resources
-
pmdaredis(1)
,pmproxy(1)
, andpminfo(1)
man pages - PCP deployment architectures
Chapter 6. Logging performance data with pmlogger
With the PCP tool you can log the performance metric values and replay them later. This allows you to perform a retrospective performance analysis.
Using the pmlogger
tool, you can:
- Create the archived logs of selected metrics on the system
- Specify which metrics are recorded on the system and how often
6.1. Modifying the pmlogger configuration file with pmlogconf
When the pmlogger
service is running, PCP logs a default set of metrics on the host.
Use the pmlogconf utility to check the default configuration. If the pmlogger configuration file does not exist, pmlogconf creates it with default metric values.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
Procedure
Create or modify the
pmlogger
configuration file:# pmlogconf -r /var/lib/pcp/config/pmlogger/config.default
-
Follow
pmlogconf
prompts to enable or disable groups of related performance metrics and to control the logging interval for each enabled group.
Additional resources
-
pmlogconf(1)
andpmlogger(1)
man pages - Tools distributed with PCP
- System services distributed with PCP
6.2. Editing the pmlogger configuration file manually
To create a tailored logging configuration with specific metrics and given intervals, edit the pmlogger
configuration file manually. The default pmlogger
configuration file is /var/lib/pcp/config/pmlogger/config.default
. The configuration file specifies which metrics are logged by the primary logging instance.
In manual configuration, you can:
- Record metrics which are not listed in the automatic configuration.
- Choose custom logging frequencies.
- Add PMDA with the application metrics.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
Procedure
Open and edit the /var/lib/pcp/config/pmlogger/config.default file to add specific metrics:
# It is safe to make additions from here on ...
#
log mandatory on every 5 seconds {
    xfs.write
    xfs.write_bytes
    xfs.read
    xfs.read_bytes
}

log mandatory on every 10 seconds {
    xfs.allocs
    xfs.block_map
    xfs.transactions
    xfs.log
}

[access]
disallow * : all;
allow localhost : enquire;
Additional resources
-
pmlogger(1)
man page - Tools distributed with PCP
- System services distributed with PCP
6.3. Enabling the pmlogger service
The pmlogger
service must be started and enabled to log the metric values on the local machine.
This procedure describes how to enable the pmlogger
service.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
Procedure
Start and enable the
pmlogger
service:# systemctl start pmlogger # systemctl enable pmlogger
Verification steps
Verify if the pmlogger service is enabled:
# pcp
Performance Co-Pilot configuration on workstation:
platform: Linux workstation 4.18.0-80.el8.x86_64 #1 SMP Wed Mar 13 12:02:46 UTC 2019 x86_64
hardware: 12 cpus, 2 disks, 1 node, 36023MB RAM
timezone: CEST-2
services: pmcd
pmcd: Version 4.3.0-1, 8 agents, 1 client
pmda: root pmcd proc xfs linux mmv kvm jbd2
pmlogger: primary logger: /var/log/pcp/pmlogger/workstation/20190827.15.54
Additional resources
-
pmlogger(1)
man page - Tools distributed with PCP
- System services distributed with PCP
-
/var/lib/pcp/config/pmlogger/config.default
file
6.4. Setting up a client system for metrics collection
This procedure describes how to set up a client system so that a central server can collect metrics from clients running PCP.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
Procedure
Install the
pcp-system-tools
package:# dnf install pcp-system-tools
Configure an IP address for
pmcd
:# echo "-i 192.168.4.62" >>/etc/pcp/pmcd/pmcd.options
Replace 192.168.4.62 with the IP address that the client should listen on.
By default,
pmcd
listens on localhost.Permanently open port 44321/tcp in the firewall's public zone:# firewall-cmd --permanent --zone=public --add-port=44321/tcp success # firewall-cmd --reload success
Set an SELinux boolean:
# setsebool -P pcp_bind_all_unreserved_ports on
Enable the
pmcd
andpmlogger
services:# systemctl enable pmcd pmlogger # systemctl restart pmcd pmlogger
Verification steps
Verify that the
pmcd
is correctly listening on the configured IP address:# ss -tlp | grep 44321 LISTEN 0 5 127.0.0.1:44321 0.0.0.0:* users:(("pmcd",pid=151595,fd=6)) LISTEN 0 5 192.168.4.62:44321 0.0.0.0:* users:(("pmcd",pid=151595,fd=0)) LISTEN 0 5 [::1]:44321 [::]:* users:(("pmcd",pid=151595,fd=7))
Additional resources
-
pmlogger(1)
,firewall-cmd(1)
,ss(8)
, andsetsebool(8)
man pages - Tools distributed with PCP
- System services distributed with PCP
-
/var/lib/pcp/config/pmlogger/config.default
file
6.5. Setting up a central server to collect data
This procedure describes how to create a central server to collect metrics from clients running PCP.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
- The client is configured for metrics collection. For more information, see Setting up a client system for metrics collection.
Procedure
Install the
pcp-system-tools
package:# dnf install pcp-system-tools
Create the
/etc/pcp/pmlogger/control.d/remote
file with the following content:# DO NOT REMOVE OR EDIT THE FOLLOWING LINE $version=1.1 192.168.4.13 n n PCP_ARCHIVE_DIR/rhel7u4a -r -T24h10m -c config.rhel7u4a 192.168.4.14 n n PCP_ARCHIVE_DIR/rhel6u10a -r -T24h10m -c config.rhel6u10a 192.168.4.62 n n PCP_ARCHIVE_DIR/rhel8u1a -r -T24h10m -c config.rhel8u1a 192.168.4.69 n n PCP_ARCHIVE_DIR/rhel9u3a -r -T24h10m -c config.rhel9u3a
Replace 192.168.4.13, 192.168.4.14, 192.168.4.62 and 192.168.4.69 with the client IP addresses.
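Each non-comment line in this control file follows the pmlogger control-file layout. A sketch of how the fields read (the annotations are explanatory, not literal syntax; the pmlogger_check(1) and pmlogger_daily(1) man pages describe the authoritative format):

```
# host        primary?  socks?  archive-base-name          pmlogger options
192.168.4.62  n         n       PCP_ARCHIVE_DIR/rhel8u1a   -r -T24h10m -c config.rhel8u1a
#
# primary? = n : this host is logged by a non-primary pmlogger instance
# socks?   = n : do not route the connection through a SOCKS proxy
# -r            : report the archive size and growth rate
# -T24h10m      : stop logging after 24 hours 10 minutes (roughly one archive per day)
# -c config...  : the pmlogger configuration file to use for this host
```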
Enable the
pmcd
andpmlogger
services:# systemctl enable pmcd pmlogger # systemctl restart pmcd pmlogger
Verification steps
Ensure that you can access the latest archive file from each directory:
# for i in /var/log/pcp/pmlogger/rhel*/*.0; do pmdumplog -L $i; done Log Label (Log Format Version 2) Performance metrics from host rhel6u10a.local commencing Mon Nov 25 21:55:04.851 2019 ending Mon Nov 25 22:06:04.874 2019 Archive timezone: JST-9 PID for pmlogger: 24002 Log Label (Log Format Version 2) Performance metrics from host rhel7u4a commencing Tue Nov 26 06:49:24.954 2019 ending Tue Nov 26 07:06:24.979 2019 Archive timezone: CET-1 PID for pmlogger: 10941 [..]
The archive files from the
/var/log/pcp/pmlogger/
directory can be used for further analysis and graphing.
Additional resources
-
pmlogger(1)
man page - Tools distributed with PCP
- System services distributed with PCP
-
/var/lib/pcp/config/pmlogger/config.default
file
6.6. Replaying the PCP log archives with pmrep
After recording the metric data, you can replay the PCP log archives. To export the logs to text files and import them into spreadsheets, use PCP utilities such as pcp2csv
, pcp2xml
, pmrep
or pmlogsummary
.
Using the pmrep
tool, you can:
- View the log files
- Parse the selected PCP log archive and export the values into an ASCII table
- Extract the entire archive log or only select metric values from the log by specifying individual metrics on the command line
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
-
The
pmlogger
service is enabled. For more information, see Enabling the pmlogger service. Install the
pcp-system-tools
package:# dnf install pcp-system-tools
Procedure
Display the data on the metric:
$ pmrep --start @3:00am --archive 20211128 --interval 5seconds --samples 10 --output csv disk.dev.write Time,"disk.dev.write-sda","disk.dev.write-sdb" 2021-11-28 03:00:00,, 2021-11-28 03:00:05,4.000,5.200 2021-11-28 03:00:10,1.600,7.600 2021-11-28 03:00:15,0.800,7.100 2021-11-28 03:00:20,16.600,8.400 2021-11-28 03:00:25,21.400,7.200 2021-11-28 03:00:30,21.200,6.800 2021-11-28 03:00:35,21.000,27.600 2021-11-28 03:00:40,12.400,33.800 2021-11-28 03:00:45,9.800,20.600
This example displays the data for the
disk.dev.write
metric, collected in an archive at a 5-second interval, in comma-separated value (CSV) format.NoteReplace
20211128
in this example with a filename containing thepmlogger
archive you want to display data for.
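Because pmrep writes plain CSV, its output can be post-processed with standard text tools. The following sketch finds the sample with the highest sda write rate; the embedded rows are the illustrative values from the example above, and in practice you would pipe the pmrep output into awk instead:

```shell
# Column 2 is disk.dev.write-sda; skip the header row and the empty first sample,
# then track the largest value and its timestamp.
awk -F, 'NR > 1 && $2 != "" && $2 > max { max = $2; when = $1 } END { print when, max }' <<'EOF'
Time,"disk.dev.write-sda","disk.dev.write-sdb"
2021-11-28 03:00:00,,
2021-11-28 03:00:05,4.000,5.200
2021-11-28 03:00:20,16.600,8.400
2021-11-28 03:00:25,21.400,7.200
EOF
# prints: 2021-11-28 03:00:25 21.400
```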
Additional resources
-
pmlogger(1)
,pmrep(1)
, andpmlogsummary(1)
man pages - Tools distributed with PCP
- System services distributed with PCP
Chapter 7. Monitoring performance with Performance Co-Pilot
Performance Co-Pilot (PCP) is a suite of tools, services, and libraries for monitoring, visualizing, storing, and analyzing system-level performance measurements.
As a system administrator, you can monitor the system’s performance using the PCP application in Red Hat Enterprise Linux 9.
7.1. Monitoring postfix with pmda-postfix
This procedure describes how to monitor performance metrics of the postfix
mail server with pmda-postfix
. It helps to check how many emails are received per second.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
-
The
pmlogger
service is enabled. For more information, see Enabling the pmlogger service.
Procedure
Install the following packages:
Install the
pcp-system-tools
:# dnf install pcp-system-tools
Install the
pmda-postfix
package to monitorpostfix
:# dnf install pcp-pmda-postfix postfix
Install the logging daemon:
# dnf install rsyslog
Install the mail client for testing:
# dnf install mutt
Enable the
postfix
andrsyslog
services:# systemctl enable postfix rsyslog # systemctl restart postfix rsyslog
Enable the SELinux boolean, so that
pmda-postfix
can access the required log files:# setsebool -P pcp_read_generic_logs=on
Install the
PMDA
:# cd /var/lib/pcp/pmdas/postfix/ # ./Install Updating the Performance Metrics Name Space (PMNS) ... Terminate PMDA if already installed ... Updating the PMCD control file, and notifying PMCD ... Waiting for pmcd to terminate ... Starting pmcd ... Check postfix metrics have appeared ... 7 metrics and 58 values
Verification steps
Verify the
pmda-postfix
operation:echo testmail | mutt root
Verify the available metrics:
# pminfo postfix postfix.received postfix.sent postfix.queues.incoming postfix.queues.maildrop postfix.queues.hold postfix.queues.deferred postfix.queues.active
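Metrics such as postfix.received are cumulative counters, so a messages-per-second figure comes from sampling the counter twice and dividing the difference by the interval; PCP tools such as pmval perform this rate conversion automatically for counter metrics. A minimal sketch of the arithmetic with hypothetical counter samples:

```shell
# Two hypothetical samples of the postfix.received counter, taken 5 seconds apart:
earlier=120 later=135 interval=5
awk -v a="$earlier" -v b="$later" -v t="$interval" \
    'BEGIN { printf "%.1f messages/s\n", (b - a) / t }'
# prints: 3.0 messages/s
```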
Additional resources
-
rsyslogd(8)
,postfix(1)
, andsetsebool(8)
man pages - Tools distributed with PCP
- System services distributed with PCP
-
/var/lib/pcp/config/pmlogger/config.default
file
7.2. Visually tracing PCP log archives with the PCP Charts application
After recording metric data, you can replay the PCP log archives as graphs. The metrics can be sourced from one or more live hosts or, alternatively, from PCP log archives as a source of historical data. To customize the PCP Charts application interface to display the data from the performance metrics, you can use line plots, bar graphs, or utilization graphs.
Using the PCP Charts application, you can:
- Replay the data in the PCP Charts application and use graphs to visualize the retrospective data alongside live data of the system.
- Plot performance metric values into graphs.
- Display multiple charts simultaneously.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
-
Logged performance data with the
pmlogger
. For more information, see Logging performance data with pmlogger. Install the
pcp-gui
package:# dnf install pcp-gui
Procedure
Launch the PCP Charts application from the command line:
# pmchart
Figure 7.1. PCP Charts application
The
pmtime
server settings are located at the bottom. The start and pause buttons allow you to control:- The interval in which PCP polls the metric data
- The date and time for the metrics of historical data
- Click File and then New Chart to select metrics from both the local machine and remote machines by specifying their host name or address. Advanced configuration options include the ability to manually set the axis values for the chart, and to manually choose the color of the plots.
Record the views created in the PCP Charts application:
Following are the options to take images or record the views created in the PCP Charts application:
- Click File and then Export to save an image of the current view.
- Click Record and then Start to start a recording. Click Record and then Stop to stop the recording. After stopping the recording, the recorded metrics are archived to be viewed later.
Optional: In the PCP Charts application, the main configuration file, known as the view, allows the metadata associated with one or more charts to be saved. This metadata describes all chart aspects, including the metrics used and the chart columns. Save the custom view configuration by clicking File and then Save View, and load the view configuration later.
The following example of the PCP Charts application view configuration file describes a stacking chart graph showing the total number of bytes read and written to the given XFS file system
loop1
:#kmchart version 1 chart title "Filesystem Throughput /loop1" style stacking antialiasing off plot legend "Read rate" metric xfs.read_bytes instance "loop1" plot legend "Write rate" metric xfs.write_bytes instance "loop1"
Additional resources
-
pmchart(1)
andpmtime(1)
man pages - Tools distributed with PCP
7.3. Collecting data from SQL server using PCP
The SQL Server agent is available in Performance Co-Pilot (PCP), which helps you to monitor and analyze database performance issues.
This procedure describes how to collect data for Microsoft SQL Server via pcp
on your system.
Prerequisites
- You have installed Microsoft SQL Server for Red Hat Enterprise Linux and established a 'trusted' connection to an SQL server.
- You have installed the Microsoft ODBC driver for SQL Server for Red Hat Enterprise Linux.
Procedure
Install PCP:
# dnf install pcp-zeroconf
Install packages required for the
pyodbc
driver:# dnf install python3-pyodbc
Install the
mssql
agent:Install the Microsoft SQL Server domain agent for PCP:
# dnf install pcp-pmda-mssql
Edit the
/etc/pcp/mssql/mssql.conf
file to configure the SQL server account’s username and password for themssql
agent. Ensure that the account you configure has access rights to performance data.username: user_name password: user_password
Replace user_name with the SQL Server account and user_password with the SQL Server user password for this account.
Install the agent:
# cd /var/lib/pcp/pmdas/mssql # ./Install Updating the Performance Metrics Name Space (PMNS) ... Terminate PMDA if already installed ... Updating the PMCD control file, and notifying PMCD ... Check mssql metrics have appeared ... 168 metrics and 598 values [...]
Verification steps
Using the
pcp
command, verify that the SQL Server PMDA (
) is loaded and running:$ pcp Performance Co-Pilot configuration on rhel.local: platform: Linux rhel.local 4.18.0-167.el8.x86_64 #1 SMP Sun Dec 15 01:24:23 UTC 2019 x86_64 hardware: 2 cpus, 1 disk, 1 node, 2770MB RAM timezone: PDT+7 services: pmcd pmproxy pmcd: Version 5.0.2-1, 12 agents, 4 clients pmda: root pmcd proc pmproxy xfs linux nfsclient mmv kvm mssql jbd2 dm pmlogger: primary logger: /var/log/pcp/pmlogger/rhel.local/20200326.16.31 pmie: primary engine: /var/log/pcp/pmie/rhel.local/pmie.log
View the complete list of metrics that PCP can collect from the SQL Server:
# pminfo mssql
After viewing the list of metrics, you can report the rate of transactions. For example, to report on the overall transaction count per second, over a five second time window:
# pmval -t 1 -T 5 mssql.databases.transactions
-
View the graphical chart of these metrics on your system by using the
pmchart
command. For more information, see Visually tracing PCP log archives with the PCP Charts application.
Additional resources
-
pcp(1)
,pminfo(1)
,pmval(1)
,pmchart(1)
, andpmdamssql(1)
man pages - Performance Co-Pilot for Microsoft SQL Server with RHEL 8.2 Red Hat Developers Blog post
7.4. Generating PCP archives from sadc archives
You can use the sadf
tool provided by the sysstat
package to generate PCP archives from native sadc
archives.
Prerequisites
A
sadc
archive has been created:# /usr/lib64/sa/sadc 1 5 -
In this example,
sadc
samples system data five times at a 1-second interval. The outfile is specified as
-
which results insadc
writing the data to the standard system activity daily data file. This file is named saDD and is located in the /var/log/sa directory by default.
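Since the DD in saDD is the day of the month, the default data file path for the current day can be derived from the date. A small sketch, assuming the default /var/log/sa location:

```shell
# Print the path of today's default sadc data file,
# for example /var/log/sa/sa07 on the 7th of the month:
printf '/var/log/sa/sa%s\n' "$(date +%d)"
```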
Procedure
Generate a PCP archive from a
sadc
archive:# sadf -l -O pcparchive=/tmp/recording -2
In this example, using the
-2
option results insadf
generating a PCP archive from asadc
archive recorded 2 days ago.
Verification steps
You can use PCP commands to inspect and analyze the PCP archive generated from a sadc
archive as you would a native PCP archive. For example:
To show a list of metrics in the PCP archive generated from an
sadc
archive, run:$ pminfo --archive /tmp/recording disk.dev.avactive disk.dev.read disk.dev.write disk.dev.blkread [...]
To show the timespan and the hostname of the PCP archive, run:
$ pmdumplog --label /tmp/recording Log Label (Log Format Version 2) Performance metrics from host shard commencing Tue Jul 20 00:10:30.642477 2021 ending Wed Jul 21 00:10:30.222176 2021
To plot performance metrics values into graphs, run:
$ pmchart --archive /tmp/recording
Chapter 8. Performance analysis of XFS with PCP
The XFS PMDA ships as part of the pcp
package and is enabled by default during the installation. It is used to gather performance metric data of XFS file systems in Performance Co-Pilot (PCP).
This section describes how to analyze XFS file system’s performance using PCP.
8.1. Installing XFS PMDA manually
If the XFS PMDA is not listed in the pcp
configuration output, install the PMDA agent manually.
This procedure describes how to manually install the PMDA agent.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
Procedure
Navigate to the xfs directory:
# cd /var/lib/pcp/pmdas/xfs/
Install the XFS PMDA:
# ./Install
Verification steps
Verify that the
pmcd
process is running on the host and the XFS PMDA is listed as enabled in the configuration:# pcp Performance Co-Pilot configuration on workstation: platform: Linux workstation 4.18.0-80.el8.x86_64 #1 SMP Wed Mar 13 12:02:46 UTC 2019 x86_64 hardware: 12 cpus, 2 disks, 1 node, 36023MB RAM timezone: CEST-2 services: pmcd pmcd: Version 4.3.0-1, 8 agents pmda: root pmcd proc xfs linux mmv kvm jbd2
Additional resources
-
pmcd(1)
man page - Tools distributed with PCP
8.2. Examining XFS performance metrics with pminfo
In PCP, the XFS PMDA reports certain XFS metrics for each mounted XFS file system. This makes it easier to pinpoint specific mounted file system issues and evaluate performance.
The pminfo
command provides per-device XFS metrics for each mounted XFS file system.
This procedure displays a list of all available metrics provided by the XFS PMDA.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
Procedure
Display the list of all available metrics provided by the XFS PMDA:
# pminfo xfs
Display information for the individual metrics. The following examples examine specific XFS
read
andwrite
metrics using thepminfo
tool:Display a short description of the
xfs.write_bytes
metric:# pminfo --oneline xfs.write_bytes xfs.write_bytes [number of bytes written in XFS file system write operations]
Display a long description of the
xfs.read_bytes
metric:# pminfo --helptext xfs.read_bytes xfs.read_bytes Help: This is the number of bytes read via read(2) system calls to files in XFS file systems. It can be used in conjunction with the read_calls count to calculate the average size of the read operations to file in XFS file systems.
Obtain the current performance value of the
xfs.read_bytes
metric:# pminfo --fetch xfs.read_bytes xfs.read_bytes value 4891346238
Obtain per-device XFS metrics with
pminfo
:# pminfo --fetch --oneline xfs.perdev.read xfs.perdev.write xfs.perdev.read [number of XFS file system read operations] inst [0 or "loop1"] value 0 inst [0 or "loop2"] value 0 xfs.perdev.write [number of XFS file system write operations] inst [0 or "loop1"] value 86 inst [0 or "loop2"] value 0
Additional resources
-
pminfo(1)
man page - PCP metric groups for XFS
- Per-device PCP metric groups for XFS
8.3. Resetting XFS performance metrics with pmstore
With PCP, you can modify the values of certain metrics, especially if the metric acts as a control variable, such as the xfs.control.reset
metric. To modify a metric value, use the pmstore
tool.
This procedure describes how to reset XFS metrics using the pmstore
tool.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
Procedure
Display the value of a metric:
$ pminfo -f xfs.write xfs.write value 325262
Reset all the XFS metrics:
# pmstore xfs.control.reset 1 xfs.control.reset old value=0 new value=1
Verification steps
View the information after resetting the metric:
$ pminfo --fetch xfs.write xfs.write value 0
Additional resources
-
pmstore(1)
andpminfo(1)
man pages - Tools distributed with PCP
- PCP metric groups for XFS
8.4. PCP metric groups for XFS
The following table describes the available PCP metric groups for XFS.
Table 8.1. Metric groups for XFS
Metric Group | Metrics provided |
| General XFS metrics including the read and write operation counts and the read and write byte counts, along with counters for the number of times inodes are flushed or clustered and the number of failures to cluster. |
| A range of metrics regarding the allocation of objects in the file system, including the number of extent and block creations and frees, allocation tree lookups and compares, and extent record creation and deletion from the btree. |
| Metrics for the number of block map reads, writes, and block deletions, and extent list operations for insertion, deletion, and lookup. Also operation counters for compares, lookups, insertions, and deletions from the block map. |
| Counters for directory operations on XFS file systems for creation, entry deletions, count of “getdent” operations. |
| Counters for the number of metadata transactions, including counts of synchronous, asynchronous, and empty transactions. |
| Counters for the number of times that the operating system looked for an XFS inode in the inode cache with different outcomes. These count cache hits, cache misses, and so on. |
| Counters for the number of log buffer writes over XFS file systems, including the number of blocks written to disk, plus metrics for the number of log flushes and pinning. |
| Counts of the number of bytes of file data flushed out by the XFS flush daemon, along with counters for the number of buffers flushed to contiguous and non-contiguous space on disk. |
| Counts for the number of attribute get, set, remove and list operations over all XFS file systems. |
| Metrics for quota operations over XFS file systems, including counters for the number of quota reclaims, quota cache misses, cache hits, and quota data reclaims. |
| Range of metrics regarding XFS buffer objects. Counters include the number of requested buffer calls, successful buffer locks, waited buffer locks, miss_locks, miss_retries and buffer hits when looking up pages. |
| Metrics regarding the operations of the XFS btree. |
| Configuration metrics which are used to reset the metric counters for the XFS stats. Control metrics are toggled by means of the pmstore tool. |
8.5. Per-device PCP metric groups for XFS
The following table describes the available per-device PCP metric group for XFS.
Table 8.2. Per-device PCP metric groups for XFS
Metric Group | Metrics provided |
| General XFS metrics including the read and write operation counts and the read and write byte counts, along with counters for the number of times inodes are flushed or clustered and the number of failures to cluster. |
| A range of metrics regarding the allocation of objects in the file system, including the number of extent and block creations and frees, allocation tree lookups and compares, and extent record creation and deletion from the btree. |
| Metrics for the number of block map reads, writes, and block deletions, and extent list operations for insertion, deletion, and lookup. Also operation counters for compares, lookups, insertions, and deletions from the block map. |
| Counters for directory operations of XFS file systems for creation, entry deletions, count of “getdent” operations. |
| Counters for the number of metadata transactions, including counts of synchronous, asynchronous, and empty transactions. |
| Counters for the number of times that the operating system looked for an XFS inode in the inode cache with different outcomes. These count cache hits, cache misses, and so on. |
| Counters for the number of log buffer writes over XFS file systems, including the number of blocks written to disk, plus metrics for the number of log flushes and pinning. |
| Counts of the number of bytes of file data flushed out by the XFS flush daemon, along with counters for the number of buffers flushed to contiguous and non-contiguous space on disk. |
| Counts for the number of attribute get, set, remove and list operations over all XFS file systems. |
| Metrics for quota operations over XFS file systems, including counters for the number of quota reclaims, quota cache misses, cache hits, and quota data reclaims. |
| Range of metrics regarding XFS buffer objects. Counters include the number of requested buffer calls, successful buffer locks, waited buffer locks, miss_locks, miss_retries and buffer hits when looking up pages. |
| Metrics regarding the operations of the XFS btree. |
Chapter 9. Setting up graphical representation of PCP metrics
Using a combination of pcp
, grafana
, pcp redis
, pcp bpftrace
, and pcp vector
provides graphs based on live data or on data collected by Performance Co-Pilot (PCP).
This section describes how to set up and access the graphical representation of PCP metrics.
9.1. Setting up PCP with pcp-zeroconf
This procedure describes how to set up PCP on a system with the pcp-zeroconf
package. Once the pcp-zeroconf
package is installed, the system records the default set of metrics into archived files.
Procedure
Install the
pcp-zeroconf
package:# dnf install pcp-zeroconf
Verification steps
Ensure that the
pmlogger
service is active, and starts archiving the metrics:# pcp | grep pmlogger pmlogger: primary logger: /var/log/pcp/pmlogger/localhost.localdomain/20200401.00.12
Additional resources
-
pmlogger
man page - Monitoring performance with Performance Co-Pilot
9.2. Setting up a grafana-server
Grafana generates graphs that are accessible from a browser. The grafana-server
is a back-end server for the Grafana dashboard. It listens, by default, on all interfaces, and provides web services accessed through the web browser. The grafana-pcp
plugin interacts with the pmproxy
protocol in the backend.
This procedure describes how to set up a grafana-server
.
Prerequisites
- PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
Procedure
Install the following packages:
# dnf install grafana grafana-pcp
Restart and enable the following service:
# systemctl restart grafana-server # systemctl enable grafana-server
Open the server’s firewall for network traffic to the Grafana service.
# firewall-cmd --permanent --add-service=grafana success # firewall-cmd --reload success
Verification steps
Ensure that the
grafana-server
is listening and responding to requests:# ss -ntlp | grep 3000 LISTEN 0 128 *:3000 *:* users:(("grafana-server",pid=19522,fd=7))
Ensure that the
grafana-pcp
plugin is installed:# grafana-cli plugins ls | grep performancecopilot-pcp-app performancecopilot-pcp-app @ 3.1.0
Additional resources
-
pmproxy(1)
andgrafana-server
man pages
9.3. Accessing the Grafana web UI
This procedure describes how to access the Grafana web interface.
Using the Grafana web interface, you can:
- add PCP Redis, PCP bpftrace, and PCP Vector data sources
- create dashboards
- view an overview of useful metrics
- create alerts in PCP Redis
Prerequisites
- PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
-
The
grafana-server
is configured. For more information, see Setting up a grafana-server.
Procedure
On the client system, open a browser and access the
grafana-server
on port3000
, using http://192.0.2.0:3000 link.Replace 192.0.2.0 with your machine IP.
For the first login, enter admin in both the Email or username and Password fields.
Grafana prompts you to set a New password to create a secured account. If you want to set it later, click Skip.
-
From the menu, hover over the
Configuration icon and then click Plugins.
- In the Plugins tab, type performance co-pilot in the Search by name or type text box and then click Performance Co-Pilot (PCP) plugin.
- In the Plugins / Performance Co-Pilot pane, click Enable.
Click Grafana
icon. The Grafana Home page is displayed.
Figure 9.1. Home Dashboard
NoteThe top corner of the screen has a similar
icon, but it controls the general Dashboard settings.
In the Grafana Home page, click Add your first data source to add PCP Redis, PCP bpftrace, and PCP Vector data sources. For more information on adding data source, see:
- To add pcp redis data source, view default dashboard, create a panel, and an alert rule, see Creating panels and alert in PCP Redis data source.
- To add pcp bpftrace data source and view the default dashboard, see Viewing the PCP bpftrace System Analysis dashboard.
- To add pcp vector data source, view the default dashboard, and to view the vector checklist, see Viewing the PCP Vector Checklist.
-
Optional: From the menu, hover over the admin profile
icon to change the Preferences including Edit Profile, Change Password, or to Sign out.
Additional resources
-
grafana-cli
andgrafana-server
man pages
9.4. Configuring PCP Redis
This section provides information about configuring the PCP Redis data source.
Use the PCP Redis data source to:
- View data archives
- Query time series using pmseries language
- Analyze data across multiple hosts
Prerequisites
- PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
-
The
grafana-server
is configured. For more information, see Setting up a grafana-server.
Procedure
Install the
redis
package:# dnf install redis
Start and enable the following services:
# systemctl start pmproxy redis # systemctl enable pmproxy redis
-
A mail transfer agent, for example,
sendmail
orpostfix
, is installed and configured.
Ensure that the
allow_loading_unsigned_plugins
parameter is set to the PCP Redis data source in thegrafana.ini
file:# vi /etc/grafana/grafana.ini allow_loading_unsigned_plugins = pcp-redis-datasource
Restart the
grafana-server
:# systemctl restart grafana-server
Verification steps
Ensure that the
pmproxy
andredis
are working:# pmseries disk.dev.read 2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
This command does not return any data if the
redis
package is not installed.
Additional resources
-
pmseries(1)
man page
9.5. Creating panels and alert in PCP Redis data source
After adding the PCP Redis data source, you can view the dashboard with an overview of useful metrics, add a query to visualize the load graph, and create alerts that help you investigate system issues after they occur.
Prerequisites
- The PCP Redis is configured. For more information, see Configuring PCP Redis.
-
The
grafana-server
is accessible. For more information, see Accessing the Grafana web UI.
Procedure
- Log into the Grafana web UI.
- In the Grafana Home page, click Add your first data source.
- In the Add data source pane, type redis in the Filter by name or type text box and then click PCP Redis.
In the Data Sources / PCP Redis pane, perform the following:
-
Add
http://localhost:44322
in the URL field and then click Save & Test. Click Dashboards tab → Import → PCP Redis: Host Overview to see a dashboard with an overview of any useful metrics.
Figure 9.2. PCP Redis: Host Overview
Add a new panel:
-
From the menu, hover over the
Create icon → Dashboard → Add new panel icon to add a panel.
-
In the Query tab, select the PCP Redis from the query list instead of the selected default option and in the text field of A, enter metric, for example,
kernel.all.load
to visualize the kernel load graph. - Optional: Add Panel title and Description, and update other options from the Settings.
- Click Save to apply changes and save the dashboard. Add Dashboard name.
Click Apply to apply changes and go back to the dashboard.
Figure 9.3. PCP Redis query panel
Create an alert rule:
-
In the PCP Redis query panel, click
Alert and then click Create Alert.
- Edit the Name, Evaluate query, and For fields from the Rule, and specify the Conditions for your alert.
Click Save to apply changes and save the dashboard. Click Apply to apply changes and go back to the dashboard.
Figure 9.4. Creating alerts in the PCP Redis panel
- Optional: In the same panel, scroll down and click Delete icon to delete the created rule.
Optional: From the menu, click
Alerting icon to view the created alert rules with different alert statuses, to edit the alert rule, or to pause the existing rule from the Alert Rules tab.
To add a notification channel for the created alert rule to receive an alert notification from Grafana, see Adding notification channels for alerts.
9.6. Adding notification channels for alerts
By adding notification channels, you can receive an alert notification from Grafana whenever the alert rule conditions are met and the system needs further monitoring.
You can receive these alerts after selecting any one type from the supported list of notifiers, which includes DingDing, Discord, Email, Google Hangouts Chat, HipChat, Kafka REST Proxy, LINE, Microsoft Teams, OpsGenie, PagerDuty, Prometheus Alertmanager, Pushover, Sensu, Slack, Telegram, Threema Gateway, VictorOps, and webhook.
Prerequisites
-
The
grafana-server
is accessible. For more information, see Accessing the Grafana web UI. - An alert rule is created. For more information, see Creating panels and alert in PCP Redis data source.
Configure SMTP and add a valid sender’s email address in the
/etc/grafana/grafana.ini
file:# vi /etc/grafana/grafana.ini [smtp] enabled = true from_address = abc@gmail.com
Replace abc@gmail.com with a valid email address.
Procedure
-
From the menu, hover over the
Alerting icon → click Notification channels → Add channel.
In the Add notification channel details pane, perform the following:
- Enter your name in the Name text box
-
Select the communication Type, for example, Email and enter the email address. You can add multiple email addresses using the
;
separator. - Optional: Configure Optional Email settings and Notification settings.
- Click Save.
Select a notification channel in the alert rule:
-
From the menu, hover over the
Alerting icon and then click Alert rules.
- From the Alert Rules tab, click the created alert rule.
- On the Notifications tab, select your notification channel name from the Send to option, and then add an alert message.
- Click Apply.
Additional resources
9.7. Setting up authentication between PCP components
You can set up authentication using the scram-sha-256 authentication mechanism, which is supported by PCP through the Simple Authentication Security Layer (SASL) framework.
Procedure
Install the sasl framework for the scram-sha-256 authentication mechanism:

# dnf install cyrus-sasl-scram cyrus-sasl-lib
Specify the supported authentication mechanism and the user database path in the /etc/sasl2/pmcd.conf file:

# vi /etc/sasl2/pmcd.conf
mech_list: scram-sha-256
sasldb_path: /etc/pcp/passwd.db
Create a new user:
# useradd -r metrics
Replace metrics with your user name.
Add the created user in the user database:
# saslpasswd2 -a pmcd metrics
Password:
Again (for verification):
To add the user, enter the password for the metrics account when prompted.
Set the permissions of the user database:
# chown root:pcp /etc/pcp/passwd.db
# chmod 640 /etc/pcp/passwd.db
Restart the pmcd service:

# systemctl restart pmcd
Verification steps
Verify the sasl configuration:

# pminfo -f -h "pcp://127.0.0.1?username=metrics" disk.dev.read
Password:
disk.dev.read
    inst [0 or "sda"] value 19540
Additional resources
- saslauthd(8), pminfo(1), and sha256 man pages
- How can I setup authentication between PCP components, like PMDAs and pmcd in RHEL 8.2?
9.8. Installing PCP bpftrace
Install the PCP bpftrace agent to introspect a system and to gather metrics from the kernel and user-space tracepoints.

The bpftrace agent uses bpftrace scripts to gather the metrics. The bpftrace scripts use the enhanced Berkeley Packet Filter (eBPF).
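As an illustration, the following is a minimal bpftrace script of the kind the agent runs; the probe syntax is standard bpftrace, but the map name and any PMDA-specific metadata requirements depend on your pcp-pmda-bpftrace version, so treat this as a sketch rather than a drop-in script:

```
// Count fork() system calls; the bpftrace PMDA can export map variables
// such as @forks as PCP metrics while the script is running.
tracepoint:syscalls:sys_enter_fork
{
    @forks = count();
}
```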
This procedure describes how to install the PCP bpftrace agent.
Prerequisites
- PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
- The grafana-server is configured. For more information, see Setting up a grafana-server.
- The scram-sha-256 authentication mechanism is configured. For more information, see Setting up authentication between PCP components.
Procedure
Install the pcp-pmda-bpftrace package:

# dnf install pcp-pmda-bpftrace
Edit the /var/lib/pcp/pmdas/bpftrace/bpftrace.conf file and add the user that you created in Setting up authentication between PCP components:

# vi /var/lib/pcp/pmdas/bpftrace/bpftrace.conf
[dynamic_scripts]
enabled = true
auth_enabled = true
allowed_users = root,metrics
Replace metrics with your user name.
Install the bpftrace PMDA:

# cd /var/lib/pcp/pmdas/bpftrace/
# ./Install
Updating the Performance Metrics Name Space (PMNS) ...
Terminate PMDA if already installed ...
Updating the PMCD control file, and notifying PMCD ...
Check bpftrace metrics have appeared ... 7 metrics and 6 values
The pmda-bpftrace is now installed and can only be used after your user is authenticated. For more information, see Viewing the PCP bpftrace System Analysis dashboard.
Additional resources
- pmdabpftrace(1) and bpftrace man pages
9.9. Viewing the PCP bpftrace System Analysis dashboard
Using the PCP bpftrace data source, you can access live data from sources that are not available as normal data from pmlogger or archives. In the PCP bpftrace data source, you can view the dashboard with an overview of useful metrics.
Prerequisites
- The PCP bpftrace is installed. For more information, see Installing PCP bpftrace.
- The grafana-server is accessible. For more information, see Accessing the Grafana web UI.
Procedure
- Log into the Grafana web UI.
- In the Grafana Home page, click Add your first data source.
- In the Add data source pane, type bpftrace in the Filter by name or type text box and then click PCP bpftrace.
In the Data Sources / PCP bpftrace pane, perform the following:

- Add http://localhost:44322 in the URL field.
- Toggle the Basic Auth option and add the created user credentials in the User and Password fields.
- Click Save & Test.
Figure 9.5. Adding PCP bpftrace in the data source
Click Dashboards tab → Import → PCP bpftrace: System Analysis to see a dashboard with an overview of useful metrics.
Figure 9.6. PCP bpftrace: System Analysis
9.10. Installing PCP Vector
This procedure describes how to install PCP Vector.
Prerequisites
- PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
- The grafana-server is configured. For more information, see Setting up a grafana-server.
Procedure
Install the pcp-pmda-bcc package:

# dnf install pcp-pmda-bcc
Install the bcc PMDA:

# cd /var/lib/pcp/pmdas/bcc
# ./Install
[Wed Apr 1 00:27:48] pmdabcc(22341) Info: Initializing, currently in 'notready' state.
[Wed Apr 1 00:27:48] pmdabcc(22341) Info: Enabled modules:
[Wed Apr 1 00:27:48] pmdabcc(22341) Info: ['biolatency', 'sysfork',
[...]
Updating the Performance Metrics Name Space (PMNS) ...
Terminate PMDA if already installed ...
Updating the PMCD control file, and notifying PMCD ...
Check bcc metrics have appeared ... 1 warnings, 1 metrics and 0 values
Additional resources
- pmdabcc(1) man page
9.11. Viewing the PCP Vector Checklist
The PCP Vector data source displays live metrics and uses the pcp metrics. It analyzes data for individual hosts.
After adding the PCP Vector data source, you can view the dashboard with an overview of useful metrics and view the related troubleshooting or reference links in the checklist.
Prerequisites
- The PCP Vector is installed. For more information, see Installing PCP Vector.
- The grafana-server is accessible. For more information, see Accessing the Grafana web UI.
Procedure
- Log into the Grafana web UI.
- In the Grafana Home page, click Add your first data source.
- In the Add data source pane, type vector in the Filter by name or type text box and then click PCP Vector.
In the Data Sources / PCP Vector pane, perform the following:

- Add http://localhost:44322 in the URL field and then click Save & Test.
- Click Dashboards tab → Import → PCP Vector: Host Overview to see a dashboard with an overview of useful metrics.
Figure 9.7. PCP Vector: Host Overview
From the menu, hover over the Performance Co-Pilot plugin and then click PCP Vector Checklist.
In the PCP checklist, click the help or warning icon to view the related troubleshooting or reference links.
Figure 9.8. Performance Co-Pilot / PCP Vector Checklist
9.12. Troubleshooting Grafana issues
This section describes how to troubleshoot Grafana issues, such as Grafana not displaying any data, the dashboard being blank, or similar issues.
Procedure
Verify that the pmlogger service is up and running:

$ systemctl status pmlogger
Verify whether files are being created or modified on the disk:
$ ls /var/log/pcp/pmlogger/$(hostname)/ -rlt
total 4024
-rw-r--r--. 1 pcp pcp  45996 Oct 13 2019 20191013.20.07.meta.xz
-rw-r--r--. 1 pcp pcp    412 Oct 13 2019 20191013.20.07.index
-rw-r--r--. 1 pcp pcp  32188 Oct 13 2019 20191013.20.07.0.xz
-rw-r--r--. 1 pcp pcp  44756 Oct 13 2019 20191013.20.30-00.meta.xz
[..]
Verify that the pmproxy service is running:

$ systemctl status pmproxy
Verify that pmproxy is running, time series support is enabled, and a connection to Redis is established by viewing the /var/log/pcp/pmproxy/pmproxy.log file and ensuring that it contains the following text:

pmproxy(1716) Info: Redis slots, command keys, schema version setup
Here, 1716 is the PID of pmproxy, which is different for every invocation of pmproxy.

Verify whether the Redis database contains any keys:
$ redis-cli dbsize
(integer) 34837
Verify whether any PCP metrics are in the Redis database and pmproxy is able to access them:

$ pmseries disk.dev.read
2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df

$ pmseries "disk.dev.read[count:10]"
2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
    [Mon Jul 26 12:21:10.085468000 2021] 117971 70e83e88d4e1857a3a31605c6d1333755f2dd17c
    [Mon Jul 26 12:21:00.087401000 2021] 117758 70e83e88d4e1857a3a31605c6d1333755f2dd17c
    [Mon Jul 26 12:20:50.085738000 2021] 116688 70e83e88d4e1857a3a31605c6d1333755f2dd17c
    [...]
$ redis-cli --scan --pattern "*$(pmseries 'disk.dev.read')"
pcp:metric.name:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
pcp:values:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
pcp:desc:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
pcp:labelvalue:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
pcp:instances:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
pcp:labelflags:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
Verify whether there are any errors in the Grafana logs:
$ journalctl -e -u grafana-server
-- Logs begin at Mon 2021-07-26 11:55:10 IST, end at Mon 2021-07-26 12:30:15 IST. --
Jul 26 11:55:17 localhost.localdomain systemd[1]: Starting Grafana instance...
Jul 26 11:55:17 localhost.localdomain grafana-server[1171]: t=2021-07-26T11:55:17+0530 lvl=info msg="Starting Grafana" logger=server version=7.3.6 c>
Jul 26 11:55:17 localhost.localdomain grafana-server[1171]: t=2021-07-26T11:55:17+0530 lvl=info msg="Config loaded from" logger=settings file=/usr/s>
Jul 26 11:55:17 localhost.localdomain grafana-server[1171]: t=2021-07-26T11:55:17+0530 lvl=info msg="Config loaded from" logger=settings file=/etc/g>
[...]
Chapter 10. Optimizing the system performance using the web console
Learn how to set a performance profile in the RHEL web console to optimize the performance of the system for a selected task.
10.1. Performance tuning options in the web console
Red Hat Enterprise Linux 9 provides several performance profiles that optimize the system for the following tasks:
- Systems using the desktop
- Throughput performance
- Latency performance
- Network performance
- Low power consumption
- Virtual machines
The tuned service optimizes system options to match the selected profile.
In the web console, you can set which performance profile your system uses.
Additional resources
10.2. Setting a performance profile in the web console
This procedure uses the web console to optimize the system performance for a selected task.
Prerequisites
- Make sure the web console is installed and accessible. For details, see Installing the web console.
Procedure
- Log into the RHEL web console. For details, see Logging in to the web console.
- Click Overview.
In the Performance Profile field, click the current performance profile.
- In the Change Performance Profile dialog box, change the profile if necessary.
Click Change Profile.
Verification steps
- The Overview tab now shows the selected performance profile.
10.3. Monitoring performance using the web console
Red Hat’s web console uses the Utilization Saturation and Errors (USE) Method for troubleshooting. The new performance metrics page has a historical view of your data organized chronologically with the newest data at the top.
Here, you can view the events, errors, and graphical representation for resource utilization and saturation.
Prerequisites
- Make sure the web console is installed and accessible. For details, see Installing the web console.
Install the cockpit-pcp package, which enables collecting the performance metrics:

# dnf install cockpit-pcp
Procedure
- Log into the RHEL 9 web console. For details, see Logging in to the web console.
Click Overview.
Click View details and history to view the Performance Metrics.
Chapter 11. Setting the disk scheduler
The disk scheduler is responsible for ordering the I/O requests submitted to a storage device.
You can configure the scheduler in several different ways:
- Set the scheduler using TuneD, as described in Setting the disk scheduler using TuneD
- Set the scheduler using udev, as described in Setting the disk scheduler using udev rules
In Red Hat Enterprise Linux 9, block devices support only multi-queue scheduling. This enables the block layer performance to scale well with fast solid-state drives (SSDs) and multi-core systems.
The traditional, single-queue schedulers, which were available in Red Hat Enterprise Linux 7 and earlier versions, have been removed.
11.1. Available disk schedulers
The following multi-queue disk schedulers are supported in Red Hat Enterprise Linux 9:
none
- Implements a first-in first-out (FIFO) scheduling algorithm. It merges requests at the generic block layer through a simple last-hit cache.
mq-deadline
Attempts to provide a guaranteed latency for requests from the point at which requests reach the scheduler.
The mq-deadline scheduler sorts queued I/O requests into a read or write batch and then schedules them for execution in increasing logical block addressing (LBA) order. By default, read batches take precedence over write batches, because applications are more likely to block on read I/O operations. After mq-deadline processes a batch, it checks how long write operations have been starved of processor time and schedules the next read or write batch as appropriate.

This scheduler is suitable for most use cases, but particularly those in which the write operations are mostly asynchronous.
bfq
Targets desktop systems and interactive tasks.
The bfq scheduler ensures that a single application is never using all of the bandwidth. In effect, the storage device is always as responsive as if it were idle. In its default configuration, bfq focuses on delivering the lowest latency rather than achieving the maximum throughput.

bfq is based on cfq code. It does not grant the disk to each process for a fixed time slice but assigns a budget measured in number of sectors to the process.

This scheduler is suitable when copying large files, because the system does not become unresponsive in this case.
kyber
The scheduler tunes itself to achieve a latency goal by calculating the latencies of every I/O request submitted to the block I/O layer. You can configure the target latencies for read, in the case of cache-misses, and synchronous write requests.
This scheduler is suitable for fast devices, for example NVMe, SSD, or other low latency devices.
11.2. Different disk schedulers for different use cases
Depending on the task that your system performs, the following disk schedulers are recommended as a baseline prior to any analysis and tuning tasks:
Table 11.1. Disk schedulers for different use cases
Use case | Disk scheduler
---|---
Traditional HDD with a SCSI interface | Use mq-deadline or bfq.
High-performance SSD or a CPU-bound system with fast storage | Use none, especially when running enterprise applications. Alternatively, use kyber.
Desktop or interactive tasks | Use bfq.
Virtual guest | Use mq-deadline. With a multi-queue capable host bus adapter (HBA) driver, use none.
11.3. The default disk scheduler
Block devices use the default disk scheduler unless you specify another scheduler.
For Non-volatile Memory Express (NVMe) block devices specifically, the default scheduler is none, and Red Hat recommends not changing it.
The kernel selects a default disk scheduler based on the type of device. The automatically selected scheduler is typically the optimal setting. If you require a different scheduler, Red Hat recommends using udev rules or the TuneD application to configure it. Match the selected devices and switch the scheduler only for those devices.
11.4. Determining the active disk scheduler
This procedure determines which disk scheduler is currently active on a given block device.
Procedure
Read the content of the /sys/block/device/queue/scheduler file:

# cat /sys/block/device/queue/scheduler
[mq-deadline] kyber bfq none

In the file name, replace device with the block device name, for example sdc.

The active scheduler is listed in square brackets ([ ]).
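For scripting, the bracketed entry can be extracted from that file's contents. A minimal sketch; the sample line mirrors the output above, and on a real system you would read the sysfs file instead of using a literal string:

```shell
# Sample contents of /sys/block/<device>/queue/scheduler; on a real system,
# use: line=$(cat /sys/block/sdc/queue/scheduler)
line='[mq-deadline] kyber bfq none'

# The active scheduler is the entry enclosed in square brackets.
active=$(printf '%s\n' "$line" | grep -o '\[[^]]*\]' | tr -d '[]')
echo "$active"   # → mq-deadline
```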
11.5. Setting the disk scheduler using TuneD
This procedure creates and enables a TuneD profile that sets a given disk scheduler for selected block devices. The setting persists across system reboots.
In the following commands and configuration, replace:
- device with the name of the block device, for example sdf
- selected-scheduler with the disk scheduler that you want to set for the device, for example bfq
Prerequisites
- The tuned service is installed and enabled. For details, see Installing and enabling TuneD.
Procedure
Optional: Select an existing TuneD profile on which your profile will be based. For a list of available profiles, see TuneD profiles distributed with RHEL.
To see which profile is currently active, use:
$ tuned-adm active
Create a new directory to hold your TuneD profile:
# mkdir /etc/tuned/my-profile
Find the system unique identifier of the selected block device:
$ udevadm info --query=property --name=/dev/device | grep -E '(WWN|SERIAL)'
ID_WWN=0x5002538d00000000
ID_SERIAL=Generic-_SD_MMC_20120501030900000-0:0
ID_SERIAL_SHORT=20120501030900000
Note: The command in this example returns all values identified as a World Wide Name (WWN) or serial number associated with the specified block device. Although it is preferred to use a WWN, the WWN is not always available for a given device, and any values returned by the example command are acceptable to use as the device system unique ID.
Create the /etc/tuned/my-profile/tuned.conf configuration file. In the file, set the following options:

Optional: Include an existing profile:

[main]
include=existing-profile
Set the selected disk scheduler for the device that matches the WWN identifier:
[disk]
devices_udev_regex=IDNAME=device system unique id
elevator=selected-scheduler
Here:
- Replace IDNAME with the name of the identifier being used (for example, ID_WWN).
- Replace device system unique id with the value of the chosen identifier (for example, 0x5002538d00000000).

To match multiple devices in the devices_udev_regex option, enclose the identifiers in parentheses and separate them with vertical bars:

devices_udev_regex=(ID_WWN=0x5002538d00000000)|(ID_WWN=0x1234567800000000)
Enable your profile:
# tuned-adm profile my-profile
Verification steps
Verify that the TuneD profile is active and applied:
$ tuned-adm active
Current active profile: my-profile

$ tuned-adm verify
Verification succeeded, current system settings match the preset profile.
See tuned log file ('/var/log/tuned/tuned.log') for details.
Read the contents of the /sys/block/device/queue/scheduler file:

# cat /sys/block/device/queue/scheduler
[mq-deadline] kyber bfq none

In the file name, replace device with the block device name, for example sdc.

The active scheduler is listed in square brackets ([]).
Additional resources
11.6. Setting the disk scheduler using udev rules
This procedure sets a given disk scheduler for specific block devices using udev rules. The setting persists across system reboots.
In the following commands and configuration, replace:
- device with the name of the block device, for example sdf
- selected-scheduler with the disk scheduler that you want to set for the device, for example bfq
Procedure
Find the system unique identifier of the block device:
$ udevadm info --name=/dev/device | grep -E '(WWN|SERIAL)'
E: ID_WWN=0x5002538d00000000
E: ID_SERIAL=Generic-_SD_MMC_20120501030900000-0:0
E: ID_SERIAL_SHORT=20120501030900000
Note: The command in this example returns all values identified as a World Wide Name (WWN) or serial number associated with the specified block device. Although it is preferred to use a WWN, the WWN is not always available for a given device, and any values returned by the example command are acceptable to use as the device system unique ID.
Configure the udev rule. Create the /etc/udev/rules.d/99-scheduler.rules file with the following content:

ACTION=="add|change", SUBSYSTEM=="block", ENV{IDNAME}=="device system unique id", ATTR{queue/scheduler}="selected-scheduler"
Here:
- Replace IDNAME with the name of the identifier being used (for example, ID_WWN).
- Replace device system unique id with the value of the chosen identifier (for example, 0x5002538d00000000).
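With the placeholders filled in, a rule that sets bfq for the device with the example WWN would read as follows, on one line; the WWN value is illustrative:

```
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_WWN}=="0x5002538d00000000", ATTR{queue/scheduler}="bfq"
```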
Reload udev rules:

# udevadm control --reload-rules
Apply the scheduler configuration:
# udevadm trigger --type=devices --action=change
Verification steps
Verify the active scheduler:
# cat /sys/block/device/queue/scheduler
11.7. Temporarily setting a scheduler for a specific disk
This procedure sets a given disk scheduler for specific block devices. The setting does not persist across system reboots.
Procedure
Write the name of the selected scheduler to the /sys/block/device/queue/scheduler file:

# echo selected-scheduler > /sys/block/device/queue/scheduler

In the file name, replace device with the block device name, for example sdc.
Verification steps
Verify that the scheduler is active on the device:
# cat /sys/block/device/queue/scheduler
Chapter 12. Tuning the performance of a Samba server
This chapter describes what settings can improve the performance of Samba in certain situations, and which settings can have a negative performance impact.
Parts of this section were adapted from the Performance Tuning documentation published in the Samba Wiki. License: CC BY 4.0. Authors and contributors: See the history tab on the Wiki page.
Prerequisites
- Samba is set up as a file or print server
12.1. Setting the SMB protocol version
Each new SMB version adds features and improves the performance of the protocol. Recent Windows and Windows Server operating systems always support the latest protocol version. If Samba also uses the latest protocol version, Windows clients connecting to Samba benefit from the performance improvements. In Samba, the default value of the server max protocol parameter is set to the latest supported stable SMB protocol version.
To always have the latest stable SMB protocol version enabled, do not set the server max protocol parameter. If you set the parameter manually, you will need to modify the setting with each new version of the SMB protocol to keep the latest protocol version enabled.
The following procedure explains how to use the default value of the server max protocol parameter.
Procedure
- Remove the server max protocol parameter from the [global] section in the /etc/samba/smb.conf file.
- Reload the Samba configuration:
# smbcontrol all reload-config
12.2. Tuning shares with directories that contain a large number of files
Linux supports case-sensitive file names. For this reason, Samba needs to scan directories for uppercase and lowercase file names when searching or accessing a file. You can configure a share to create new files only in lowercase or uppercase, which improves the performance.
Prerequisites
- Samba is configured as a file server
Procedure
Rename all files on the share to lowercase.
Note: Using the settings in this procedure, files with names other than in lowercase will no longer be displayed.
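One way to perform the rename is a depth-first find loop. This is a sketch, assuming the hypothetical share path held in the share variable; back up or test on a copy of the data first:

```shell
# Hypothetical share path; adjust to your share's actual location.
share="${SHARE_DIR:-/srv/samba/share}"

# Depth-first traversal so files are renamed before their parent directories.
find "$share" -mindepth 1 -depth 2>/dev/null | while IFS= read -r path; do
    base=$(basename "$path")
    dir=$(dirname "$path")
    lower=$(printf '%s' "$base" | tr '[:upper:]' '[:lower:]')
    if [ "$base" != "$lower" ]; then
        mv -- "$path" "$dir/$lower"
    fi
done
```

Note that on a case-sensitive file system two names differing only in case can collide after lowercasing; the sketch above does not guard against that.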
Set the following parameters in the share’s section:
case sensitive = true
default case = lower
preserve case = no
short preserve case = no
For details about the parameters, see their descriptions in the smb.conf(5) man page.

Verify the /etc/samba/smb.conf file:

# testparm
Reload the Samba configuration:
# smbcontrol all reload-config
After you apply these settings, the names of all newly created files on this share use lowercase. Because of these settings, Samba no longer needs to scan the directory for uppercase and lowercase file names, which improves the performance.
12.3. Settings that can have a negative performance impact
By default, the kernel in Red Hat Enterprise Linux is tuned for high network performance. For example, the kernel uses an auto-tuning mechanism for buffer sizes. Setting the socket options parameter in the /etc/samba/smb.conf file overrides these kernel settings. As a result, setting this parameter decreases the Samba network performance in most cases.
To use the optimized settings from the kernel, remove the socket options parameter from the [global] section in the /etc/samba/smb.conf file.
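For example, if the [global] section contains a line such as the following, remove it so that the kernel auto-tuning applies; the option values shown are illustrative:

```
[global]
        socket options = TCP_NODELAY SO_RCVBUF=131072 SO_SNDBUF=131072
```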
Chapter 13. Optimizing virtual machine performance
Virtual machines (VMs) always experience some degree of performance deterioration in comparison to the host. The following sections explain the reasons for this deterioration and provide instructions on how to minimize the performance impact of virtualization in RHEL 9, so that your hardware infrastructure resources can be used as efficiently as possible.
13.1. What influences virtual machine performance
VMs are run as user-space processes on the host. The hypervisor therefore needs to convert the host’s system resources so that the VMs can use them. As a consequence, a portion of the resources is consumed by the conversion, and the VM therefore cannot achieve the same performance efficiency as the host.
The impact of virtualization on system performance
More specific reasons for VM performance loss include:
- Virtual CPUs (vCPUs) are implemented as threads on the host, handled by the Linux scheduler.
- VMs do not automatically inherit optimization features, such as NUMA or huge pages, from the host kernel.
- Disk and network I/O settings of the host might have a significant performance impact on the VM.
- Network traffic typically travels to a VM through a software-based bridge.
- Depending on the host devices and their models, there might be significant overhead due to emulation of particular hardware.
The severity of the virtualization impact on the VM performance is influenced by a variety of factors, which include:

- The number of concurrently running VMs.
- The number of virtual devices used by each VM.
- The device types used by the VMs.
Reducing VM performance loss
RHEL 9 provides a number of features you can use to reduce the negative performance effects of virtualization. Notably:
- The tuned service can automatically optimize the resource distribution and performance of your VMs.
- Block I/O tuning can improve the performance of the VM’s block devices, such as disks.
- NUMA tuning can increase vCPU performance.
- Virtual networking can be optimized in various ways.
Tuning VM performance can have adverse effects on other virtualization functions. For example, it can make migrating the modified VM more difficult.
13.2. Optimizing virtual machine performance using tuned
The tuned utility is a tuning profile delivery mechanism that adapts RHEL for certain workload characteristics, such as requirements for CPU-intensive tasks or storage-network throughput responsiveness. It provides a number of tuning profiles that are pre-configured to enhance performance and reduce power consumption in a number of specific use cases. You can edit these profiles or create new profiles to create performance solutions tailored to your environment, including virtualized environments.
To optimize RHEL 9 for virtualization, use the following profiles:
- For RHEL 9 virtual machines, use the virtual-guest profile. It is based on the generally applicable throughput-performance profile, but also decreases the swappiness of virtual memory.
- For RHEL 9 virtualization hosts, use the virtual-host profile. This enables more aggressive writeback of dirty memory pages, which benefits the host performance.
Prerequisites
- The tuned service is installed and enabled.
Procedure
To enable a specific tuned profile:
List the available tuned profiles.

# tuned-adm list
Available profiles:
- balanced - General non-specialized tuned profile
- desktop - Optimize for the desktop use-case
[...]
- virtual-guest - Optimize for running inside a virtual guest
- virtual-host - Optimize for running KVM guests
Current active profile: balanced
Optional: Create a new tuned profile or edit an existing tuned profile. For more information, see Customizing tuned profiles.
Activate a tuned profile.

# tuned-adm profile selected-profile
To optimize a virtualization host, use the virtual-host profile.
# tuned-adm profile virtual-host
On a RHEL guest operating system, use the virtual-guest profile.
# tuned-adm profile virtual-guest
Additional resources
13.3. Optimizing libvirt daemons
The libvirt virtualization suite works as a management layer for the RHEL hypervisor, and your libvirt configuration significantly impacts your virtualization host. Notably, RHEL 9 contains two different types of libvirt daemons, monolithic or modular, and which type of daemons you use affects how granularly you can configure individual virtualization drivers.
13.3.1. Types of libvirt daemons
RHEL 9 supports the following libvirt daemon types:
- Monolithic libvirt
The traditional libvirt daemon, libvirtd, controls a wide variety of virtualization drivers, using a single configuration file - /etc/libvirt/libvirtd.conf.

As such, libvirtd allows for centralized hypervisor configuration, but may use system resources inefficiently. Therefore, libvirtd will become unsupported in a future major release of RHEL.

However, if you updated to RHEL 9 from RHEL 8, your host still uses libvirtd by default.
by default.- Modular libvirt
Newly introduced in RHEL 9, modular libvirt provides a specific daemon for each virtualization driver. These include the following:

- virtqemud - A primary daemon for hypervisor management
- virtinterfaced - A secondary daemon for host NIC management
- virtnetworkd - A secondary daemon for virtual network management
- virtnodedevd - A secondary daemon for host physical device management
- virtnwfilterd - A secondary daemon for host firewall management
- virtsecretd - A secondary daemon for host secret management
- virtstoraged - A secondary daemon for storage management
Each of the daemons has a separate configuration file - for example /etc/libvirt/virtqemud.conf. As such, modular libvirt daemons provide better options for fine-tuning libvirt resource management.

If you performed a fresh install of RHEL 9, modular libvirt is configured by default.
Next steps
- If your RHEL 9 system uses libvirtd, Red Hat recommends switching to modular daemons. For instructions, see Enabling modular libvirt daemons.
13.3.2. Enabling modular libvirt daemons
In RHEL 9, the libvirt library uses modular daemons that handle individual virtualization driver sets on your host. For example, the virtqemud daemon handles QEMU drivers.

If you performed a fresh install of a RHEL 9 host, your hypervisor uses modular libvirt daemons by default. However, if you upgraded your host from RHEL 8 to RHEL 9, your hypervisor uses the monolithic libvirtd daemon, which is the default in RHEL 8.

If that is the case, Red Hat recommends enabling the modular libvirt daemons instead, because they provide better options for fine-tuning libvirt resource management. In addition, libvirtd will become unsupported in a future major release of RHEL.
Prerequisites
Your hypervisor is using the monolithic libvirtd service. To learn whether this is the case:

# systemctl is-active libvirtd.service
active

If this command displays active, you are using libvirtd.

- Your virtual machines are shut down.
Procedure
Stop libvirtd and its sockets.

# systemctl stop libvirtd.service
# systemctl stop libvirtd{,-ro,-admin,-tcp,-tls}.socket
Disable libvirtd to prevent it from starting on boot.

# systemctl disable libvirtd.service
# systemctl disable libvirtd{,-ro,-admin,-tcp,-tls}.socket
Enable the modular libvirt daemons.

# for drv in qemu interface network nodedev nwfilter secret storage; do systemctl unmask virt${drv}d.service; systemctl unmask virt${drv}d{,-ro,-admin}.socket; systemctl enable virt${drv}d.service; systemctl enable virt${drv}d{,-ro,-admin}.socket; done
Start the sockets for the modular daemons.
# for drv in qemu network nodedev nwfilter secret storage; do systemctl start virt${drv}d{,-ro,-admin}.socket; done
Optional: If you require connecting to your host from remote hosts, enable and start the virtualization proxy daemon.
# systemctl unmask virtproxyd.service
# systemctl unmask virtproxyd{,-ro,-admin,-tls}.socket
# systemctl enable virtproxyd.service
# systemctl enable virtproxyd{,-ro,-admin,-tls}.socket
# systemctl start virtproxyd{,-ro,-admin,-tls}.socket
Verification
Activate the enabled virtualization daemons.
# virsh uri
qemu:///system
Ensure your host is using the virtqemud modular daemon.

# systemctl is-active virtqemud.service
active
If this command displays active, you have successfully enabled modular libvirt daemons.
13.4. Configuring virtual machine memory
To improve the performance of a virtual machine (VM), you can assign additional host RAM to the VM. Similarly, you can decrease the amount of memory allocated to a VM so the host memory can be allocated to other VMs or tasks.
To perform these actions, you can use the web console or the command-line interface.
13.4.1. Adding and removing virtual machine memory using the web console
To improve the performance of a virtual machine (VM) or to free up the host resources it is using, you can use the web console to adjust the amount of memory allocated to the VM.
Prerequisites
- The guest OS is running the memory balloon drivers. To verify this is the case:

  Ensure the VM’s configuration includes the memballoon device:

  # virsh dumpxml testguest | grep memballoon
  <memballoon model='virtio'>
  </memballoon>

  If this command displays any output and the model is not set to none, the memballoon device is present.

  Ensure the balloon drivers are running in the guest OS:

  - In Windows guests, the drivers are installed as a part of the virtio-win driver package. For instructions, see Installing paravirtualized KVM drivers for Windows virtual machines.
  - In Linux guests, the drivers are generally included by default and activate when the memballoon device is present.
- The web console VM plug-in is installed on your system.
Procedure
Optional: Obtain information about the maximum memory and currently used memory of the VM. This serves as a baseline for your changes, and also for verification:
# virsh dominfo testguest
Max memory:     2097152 KiB
Used memory:    2097152 KiB
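The dominfo values are reported in KiB. The 2097152 KiB shown above corresponds to 2 GiB, which you can confirm with shell arithmetic:

```shell
# Convert the dominfo value from KiB to GiB (1 GiB = 1024 * 1024 KiB).
echo $((2097152 / 1024 / 1024))
```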
In the Virtual Machines interface, click the VM whose information you want to see.
A new page opens with an Overview section with basic information about the selected VM and a Console section to access the VM’s graphical interface.
Click edit next to the Memory line in the Overview pane.

The Memory Adjustment dialog appears.

Configure the memory for the selected VM:
Maximum allocation - Sets the maximum amount of host memory that the VM can use for its processes. You can specify the maximum memory when creating the VM or increase it later. You can specify memory as multiples of MiB or GiB.
Adjusting maximum memory allocation is only possible on a shut-off VM.
Current allocation - Sets the actual amount of memory allocated to the VM. This value can be less than the Maximum allocation but cannot exceed it. You can adjust the value to regulate the memory available to the VM for its processes. You can specify memory as multiples of MiB or GiB.
If you do not specify this value, the default allocation is the Maximum allocation value.
Click Save.
The memory allocation of the VM is adjusted.
13.4.2. Adding and removing virtual machine memory using the command-line interface
To improve the performance of a virtual machine (VM) or to free up the host resources it is using, you can use the CLI to adjust the amount of memory allocated to the VM.
Prerequisites
- The guest OS is running the memory balloon drivers. To verify this is the case:

  Ensure the VM’s configuration includes the memballoon device:

  # virsh dumpxml testguest | grep memballoon
  <memballoon model='virtio'>
  </memballoon>

  If this command displays any output and the model is not set to none, the memballoon device is present.

  Ensure the balloon drivers are running in the guest OS:

  - In Windows guests, the drivers are installed as a part of the virtio-win driver package. For instructions, see Installing paravirtualized KVM drivers for Windows virtual machines.
  - In Linux guests, the drivers are generally included by default and activate when the memballoon device is present.
Procedure
Optional: Obtain information about the maximum memory and currently used memory of the VM. This serves as a baseline for your changes, and also for verification:
# virsh dominfo testguest
Max memory:     2097152 KiB
Used memory:    2097152 KiB
Adjust the maximum memory allocated to a VM. Increasing this value improves the performance potential of the VM, and reducing the value lowers the performance footprint the VM has on your host. Note that this change only takes effect on a shut-off VM; if you adjust a running VM, the change applies only after the VM is fully powered off.
For example, to change the maximum memory that the testguest VM can use to 4096 MiB:
# virt-xml testguest --edit --memory memory=4096,currentMemory=4096
Domain 'testguest' defined successfully.
Changes will take effect after the domain is fully powered off.
To increase the maximum memory of a running VM, you can attach a memory device to the VM. This is also referred to as memory hot plug. For details, see Attaching devices to virtual machines.
Warning: Removing memory devices from a running VM (also referred to as memory hot unplug) is not supported, and is highly discouraged by Red Hat.
Optional: You can also adjust the memory currently used by the VM, up to the maximum allocation. This regulates the memory load that the VM has on the host until the next reboot, without changing the maximum VM allocation.
# virsh setmem testguest --current 2048
Verification
Confirm that the memory used by the VM has been updated:
# virsh dominfo testguest
Max memory:     4194304 KiB
Used memory:    2097152 KiB
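Note that the verification output is again in KiB: the 4096 MiB maximum set earlier with virt-xml appears here as 4194304 KiB (4096 * 1024):

```shell
# 4096 MiB expressed in KiB, matching the Max memory line above
echo $((4096 * 1024))
```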
Optional: If you adjusted the current VM memory, you can obtain the memory balloon statistics of the VM to evaluate how effectively it regulates its memory use.
# virsh domstats --balloon testguest
Domain: 'testguest'
  balloon.current=365624
  balloon.maximum=4194304
  balloon.swap_in=0
  balloon.swap_out=0
  balloon.major_fault=306
  balloon.minor_fault=156117
  balloon.unused=3834448
  balloon.available=4035008
  balloon.usable=3746340
  balloon.last-update=1587971682
  balloon.disk_caches=75444
  balloon.hugetlb_pgalloc=0
  balloon.hugetlb_pgfail=0
  balloon.rss=1005456
13.4.3. Additional resources
- Attaching devices to virtual machines.
13.5. Optimizing virtual machine I/O performance
The input and output (I/O) capabilities of a virtual machine (VM) can significantly limit the VM’s overall efficiency. To address this, you can optimize a VM’s I/O by configuring block I/O parameters.
13.5.1. Tuning block I/O in virtual machines
When multiple block devices are being used by one or more VMs, it might be important to adjust the I/O priority of specific virtual devices by modifying their I/O weights.
Increasing the I/O weight of a device increases its priority for I/O bandwidth, and therefore provides it with more host resources. Similarly, reducing a device’s weight makes it consume less host resources.
Each device’s weight
value must be within the 100
to 1000
range. Alternatively, the value can be 0
, which removes that device from per-device listings.
Procedure
To display and set a VM’s block I/O parameters:
Display the current <blkio> parameters for a VM:

# virsh dumpxml VM-name

<domain>
  [...]
  <blkiotune>
    <weight>800</weight>
    <device>
      <path>/dev/sda</path>
      <weight>1000</weight>
    </device>
    <device>
      <path>/dev/sdb</path>
      <weight>500</weight>
    </device>
  </blkiotune>
  [...]
</domain>
Edit the I/O weight of a specified device:
# virsh blkiotune VM-name --device-weights device,I/O-weight

For example, the following changes the weight of the /dev/sda device in the liftbrul VM to 500:

# virsh blkiotune liftbrul --device-weights /dev/sda,500
13.5.2. Disk I/O throttling in virtual machines
When several VMs are running simultaneously, they can interfere with system performance by using excessive disk I/O. Disk I/O throttling in KVM virtualization provides the ability to set a limit on disk I/O requests sent from the VMs to the host machine. This can prevent a VM from over-utilizing shared resources and impacting the performance of other VMs.
To enable disk I/O throttling, set a limit on disk I/O requests sent from each block device attached to VMs to the host machine.
Procedure
Use the virsh domblklist command to list the names of all the disk devices on a specified VM:

# virsh domblklist rollin-coal
 Target   Source
------------------------------------------------
 vda      /var/lib/libvirt/images/rollin-coal.qcow2
 sda      -
 sdb      /home/horridly-demanding-processes.iso
Find the host block device where the virtual disk that you want to throttle is mounted.
For example, if you want to throttle the sdb virtual disk from the previous step, the following output shows that the disk is mounted on the /dev/nvme0n1p3 partition:

$ lsblk
NAME                                          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
zram0                                         252:0    0     4G  0 disk  [SWAP]
nvme0n1                                       259:0    0 238.5G  0 disk
├─nvme0n1p1                                   259:1    0   600M  0 part  /boot/efi
├─nvme0n1p2                                   259:2    0     1G  0 part  /boot
└─nvme0n1p3                                   259:3    0 236.9G  0 part
  └─luks-a1123911-6f37-463c-b4eb-fxzy1ac12fea 253:0    0 236.9G  0 crypt /home
Set I/O limits for the block device using the virsh blkiotune command:

# virsh blkiotune VM-name --parameter device,limit

The following example throttles the sdb disk on the rollin-coal VM to 1000 read and write I/O operations per second and to 50 MB per second read and write throughput:

# virsh blkiotune rollin-coal --device-read-iops-sec /dev/nvme0n1p3,1000 --device-write-iops-sec /dev/nvme0n1p3,1000 --device-write-bytes-sec /dev/nvme0n1p3,52428800 --device-read-bytes-sec /dev/nvme0n1p3,52428800
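The byte-rate limits are specified in bytes per second, so the 50 MB throughput limit appears in the command as 52428800 (50 * 1024 * 1024):

```shell
# 50 MiB expressed in bytes, as used in the --device-*-bytes-sec limits above
echo $((50 * 1024 * 1024))
```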
Additional information
- Disk I/O throttling can be useful in various situations, for example when VMs belonging to different customers are running on the same host, or when quality of service guarantees are given for different VMs. Disk I/O throttling can also be used to simulate slower disks.
- I/O throttling can be applied independently to each block device attached to a VM and supports limits on throughput and I/O operations.
-
Red Hat does not support using the
virsh blkdeviotune
command to configure I/O throttling in VMs. For more information on unsupported features when using RHEL 9 as a VM host, see Unsupported features in RHEL 9 virtualization.
13.5.3. Enabling multi-queue virtio-scsi
When using virtio-scsi
storage devices in your virtual machines (VMs), the multi-queue virtio-scsi feature provides improved storage performance and scalability. It enables each virtual CPU (vCPU) to have a separate queue and interrupt to use without affecting other vCPUs.
Procedure
To enable multi-queue virtio-scsi support for a specific VM, add the following to the VM’s XML configuration, where N is the total number of vCPU queues:
<controller type='scsi' index='0' model='virtio-scsi'>
  <driver queues='N' />
</controller>
13.6. Optimizing virtual machine CPU performance
Much like physical CPUs in host machines, vCPUs are critical to virtual machine (VM) performance. As a result, optimizing vCPUs can have a significant impact on the resource efficiency of your VMs. To optimize your vCPU:
- Adjust how many host CPUs are assigned to the VM. You can do this using the CLI or the web console.
Ensure that the vCPU model is aligned with the CPU model of the host. For example, to set the testguest1 VM to use the CPU model of the host:
# virt-xml testguest1 --edit --cpu host-model
- Manage kernel same-page merging (KSM).
If your host machine uses Non-Uniform Memory Access (NUMA), you can also configure NUMA for its VMs. This maps the host’s CPU and memory processes onto the CPU and memory processes of the VM as closely as possible. In effect, NUMA tuning provides the vCPU with a more streamlined access to the system memory allocated to the VM, which can improve the vCPU processing effectiveness.
For details, see Configuring NUMA in a virtual machine and Sample vCPU performance tuning scenario.
13.6.1. Adding and removing virtual CPUs using the command-line interface
To increase or optimize the CPU performance of a virtual machine (VM), you can add or remove virtual CPUs (vCPUs) assigned to the VM.
When performed on a running VM, this is also referred to as vCPU hot plugging and hot unplugging. However, note that vCPU hot unplug is not supported in RHEL 9, and Red Hat highly discourages its use.
Prerequisites
Optional: View the current state of the vCPUs in the targeted VM. For example, to display the number of vCPUs on the testguest VM:
# virsh vcpucount testguest
maximum      config         4
maximum      live           2
current      config         2
current      live           1
This output indicates that testguest is currently using 1 vCPU, and 1 more vCPU can be hot plugged to it to increase the VM’s performance. However, after reboot, the number of vCPUs testguest uses will change to 2, and it will be possible to hot plug 2 more vCPUs.
Procedure
Adjust the maximum number of vCPUs that can be attached to a VM, which takes effect on the VM’s next boot.
For example, to increase the maximum vCPU count for the testguest VM to 8:
# virsh setvcpus testguest 8 --maximum --config
Note that the maximum may be limited by the CPU topology, host hardware, the hypervisor, and other factors.
Adjust the current number of vCPUs attached to a VM, up to the maximum configured in the previous step. For example:
To increase the number of vCPUs attached to the running testguest VM to 4:
# virsh setvcpus testguest 4 --live
This increases the VM’s performance and host load footprint of testguest until the VM’s next boot.
To permanently decrease the number of vCPUs attached to the testguest VM to 1:
# virsh setvcpus testguest 1 --config
This decreases the VM’s performance and host load footprint of testguest after the VM’s next boot. However, if needed, additional vCPUs can be hot plugged to the VM to temporarily increase its performance.
Verification
Confirm that the current state of vCPU for the VM reflects your changes.
# virsh vcpucount testguest
maximum      config         8
maximum      live           4
current      config         1
current      live           4
Additional resources
13.6.2. Managing virtual CPUs using the web console
Using the RHEL 9 web console, you can review and configure virtual CPUs used by virtual machines (VMs) to which the web console is connected.
Prerequisites
- The web console VM plug-in is installed on your system.
Procedure
In the Virtual Machines interface, click the VM whose information you want to see.
A new page opens with an Overview section with basic information about the selected VM and a Console section to access the VM’s graphical interface.
Click edit next to the number of vCPUs in the Overview pane.
The vCPU details dialog appears.
Configure the virtual CPUs for the selected VM.
vCPU Count - The number of vCPUs currently in use.
NoteThe vCPU count cannot be greater than the vCPU Maximum.
- vCPU Maximum - The maximum number of virtual CPUs that can be configured for the VM. If this value is higher than the vCPU Count, additional vCPUs can be attached to the VM.
- Sockets - The number of sockets to expose to the VM.
- Cores per socket - The number of cores for each socket to expose to the VM.
Threads per core - The number of threads for each core to expose to the VM.
Note that the Sockets, Cores per socket, and Threads per core options adjust the CPU topology of the VM. This may be beneficial for vCPU performance and may impact the functionality of certain software in the guest OS. If a different setting is not required by your deployment, keep the default values.
Click Apply.
The virtual CPUs for the VM are configured.
NoteChanges to virtual CPU settings only take effect after the VM is restarted.
Additional resources
13.6.3. Configuring NUMA in a virtual machine
The following methods can be used to configure Non-Uniform Memory Access (NUMA) settings of a virtual machine (VM) on a RHEL 9 host.
Prerequisites
The host is a NUMA-compatible machine. To detect whether this is the case, use the virsh nodeinfo command and see the NUMA cell(s) line:

# virsh nodeinfo
CPU model:           x86_64
CPU(s):              48
CPU frequency:       1200 MHz
CPU socket(s):       1
Core(s) per socket:  12
Thread(s) per core:  2
NUMA cell(s):        2
Memory size:         67012964 KiB
If the value of the line is 2 or greater, the host is NUMA-compatible.
Procedure
For ease of use, you can set up a VM’s NUMA configuration using automated utilities and services. However, manual NUMA setup is more likely to yield a significant performance improvement.
Automatic methods
Set the VM’s NUMA policy to Preferred. For example, to do so for the testguest5 VM:

# virt-xml testguest5 --edit --vcpus placement=auto
# virt-xml testguest5 --edit --numatune mode=preferred
Enable automatic NUMA balancing on the host:
# echo 1 > /proc/sys/kernel/numa_balancing
Use the numad command to automatically align the VM CPU with memory resources:

# numad
Manual methods
Pin specific vCPU threads to a specific host CPU or range of CPUs. This is also possible on non-NUMA hosts and VMs, and is recommended as a safe method of vCPU performance improvement.
For example, the following commands pin vCPU threads 0 to 5 of the testguest6 VM to host CPUs 1, 3, 5, 7, 9, and 11, respectively:
# virsh vcpupin testguest6 0 1
# virsh vcpupin testguest6 1 3
# virsh vcpupin testguest6 2 5
# virsh vcpupin testguest6 3 7
# virsh vcpupin testguest6 4 9
# virsh vcpupin testguest6 5 11
Afterwards, you can verify whether this was successful:
# virsh vcpupin testguest6
 VCPU   CPU Affinity
----------------------
 0      1
 1      3
 2      5
 3      7
 4      9
 5      11
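The six pin commands above follow a simple pattern: vCPU i is pinned to host CPU 2*i + 1 (1, 3, 5, 7, 9, 11). As a sketch, the following loop prints the equivalent commands rather than running them; testguest6 is the example VM name from above:

```shell
# Print the virsh vcpupin commands for vCPUs 0-5, pinning each
# vCPU i to host CPU 2*i + 1. Pipe to "sh" (as root) to execute.
for vcpu in 0 1 2 3 4 5; do
  echo "virsh vcpupin testguest6 ${vcpu} $((2 * vcpu + 1))"
done
```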
After pinning vCPU threads, you can also pin QEMU process threads associated with a specified VM to a specific host CPU or range of CPUs. For example, the following commands pin the QEMU process thread of testguest6 to CPUs 13 and 15, and verify this was successful:
# virsh emulatorpin testguest6 13,15
# virsh emulatorpin testguest6
emulator: CPU Affinity
----------------------------------
       *: 13,15
Finally, you can also specify which host NUMA nodes will be assigned specifically to a certain VM. This can improve the host memory usage by the VM’s vCPU. For example, the following commands set testguest6 to use host NUMA nodes 3 to 5, and verify this was successful:
# virsh numatune testguest6 --nodeset 3-5
# virsh numatune testguest6
For best performance results, it is recommended to use all of the manual tuning methods listed above.
Additional resources
- Sample vCPU performance tuning scenario
-
View the current NUMA configuration of your system using the
numastat
utility
13.6.4. Sample vCPU performance tuning scenario
To obtain the best vCPU performance possible, Red Hat recommends using manual vcpupin
, emulatorpin
, and numatune
settings together, for example like in the following scenario.
Starting scenario
Your host has the following hardware specifics:
- 2 NUMA nodes
- 3 CPU cores on each node
- 2 threads on each core
The output of virsh nodeinfo on such a machine would look similar to:

# virsh nodeinfo
CPU model:           x86_64
CPU(s):              12
CPU frequency:       3661 MHz
CPU socket(s):       2
Core(s) per socket:  3
Thread(s) per core:  2
NUMA cell(s):        2
Memory size:         31248692 KiB
You intend to modify an existing VM to have 8 vCPUs, which means that it will not fit in a single NUMA node.
Therefore, you should distribute 4 vCPUs on each NUMA node and make the vCPU topology resemble the host topology as closely as possible. This means that vCPUs that run as sibling threads of a given physical CPU should be pinned to host threads on the same core. For details, see the Solution below:
Solution
Obtain the information on the host topology:
# virsh capabilities
The output should include a section that looks similar to the following:
<topology>
  <cells num="2">
    <cell id="0">
      <memory unit="KiB">15624346</memory>
      <pages unit="KiB" size="4">3906086</pages>
      <pages unit="KiB" size="2048">0</pages>
      <pages unit="KiB" size="1048576">0</pages>
      <distances>
        <sibling id="0" value="10" />
        <sibling id="1" value="21" />
      </distances>
      <cpus num="6">
        <cpu id="0" socket_id="0" core_id="0" siblings="0,3" />
        <cpu id="1" socket_id="0" core_id="1" siblings="1,4" />
        <cpu id="2" socket_id="0" core_id="2" siblings="2,5" />
        <cpu id="3" socket_id="0" core_id="0" siblings="0,3" />
        <cpu id="4" socket_id="0" core_id="1" siblings="1,4" />
        <cpu id="5" socket_id="0" core_id="2" siblings="2,5" />
      </cpus>
    </cell>
    <cell id="1">
      <memory unit="KiB">15624346</memory>
      <pages unit="KiB" size="4">3906086</pages>
      <pages unit="KiB" size="2048">0</pages>
      <pages unit="KiB" size="1048576">0</pages>
      <distances>
        <sibling id="0" value="21" />
        <sibling id="1" value="10" />
      </distances>
      <cpus num="6">
        <cpu id="6" socket_id="1" core_id="3" siblings="6,9" />
        <cpu id="7" socket_id="1" core_id="4" siblings="7,10" />
        <cpu id="8" socket_id="1" core_id="5" siblings="8,11" />
        <cpu id="9" socket_id="1" core_id="3" siblings="6,9" />
        <cpu id="10" socket_id="1" core_id="4" siblings="7,10" />
        <cpu id="11" socket_id="1" core_id="5" siblings="8,11" />
      </cpus>
    </cell>
  </cells>
</topology>
- Optional: Test the performance of the VM using the applicable tools and utilities.
Set up and mount 1 GiB huge pages on the host:
Add the following line to the host’s kernel command line:
default_hugepagesz=1G hugepagesz=1G
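One way to append these arguments to all installed kernel entries on RHEL is with the grubby tool. As a sketch, the following prints the command rather than executing it; run the printed command as root and reboot for it to take effect:

```shell
# Hypothetical helper: print the grubby invocation that appends the
# 1 GiB hugepage arguments to every kernel entry (not executed here).
echo 'grubby --update-kernel=ALL --args="default_hugepagesz=1G hugepagesz=1G"'
```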
Create the /etc/systemd/system/hugetlb-gigantic-pages.service file with the following content:

[Unit]
Description=HugeTLB Gigantic Pages Reservation
DefaultDependencies=no
Before=dev-hugepages.mount
ConditionPathExists=/sys/devices/system/node
ConditionKernelCommandLine=hugepagesz=1G

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/etc/systemd/hugetlb-reserve-pages.sh

[Install]
WantedBy=sysinit.target
Create the /etc/systemd/hugetlb-reserve-pages.sh file with the following content:

#!/bin/sh
nodes_path=/sys/devices/system/node/
if [ ! -d $nodes_path ]; then
    echo "ERROR: $nodes_path does not exist"
    exit 1
fi

reserve_pages()
{
    echo $1 > $nodes_path/$2/hugepages/hugepages-1048576kB/nr_hugepages
}

reserve_pages 4 node1
reserve_pages 4 node2
This reserves four 1GiB huge pages from node1 and four 1GiB huge pages from node2.
Make the script created in the previous step executable:
# chmod +x /etc/systemd/hugetlb-reserve-pages.sh
Enable huge page reservation on boot:
# systemctl enable hugetlb-gigantic-pages
Use the virsh edit command to edit the XML configuration of the VM you wish to optimize, in this example super-vm:

# virsh edit super-vm
Adjust the XML configuration of the VM in the following way:
- Set the VM to use 8 static vCPUs. Use the <vcpu/> element to do this.
- Pin each of the vCPU threads to the corresponding host CPU threads that it mirrors in the topology. To do so, use the <vcpupin/> elements in the <cputune> section.

  Note that, as shown by the virsh capabilities output above, host CPU threads are not ordered sequentially in their respective cores. In addition, the vCPU threads should be pinned to the highest available set of host cores on the same NUMA node. For a table illustration, see the Sample topology section below.

  The XML configuration for these two steps can look similar to:
<cputune>
  <vcpupin vcpu='0' cpuset='1'/>
  <vcpupin vcpu='1' cpuset='4'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='5'/>
  <vcpupin vcpu='4' cpuset='7'/>
  <vcpupin vcpu='5' cpuset='10'/>
  <vcpupin vcpu='6' cpuset='8'/>
  <vcpupin vcpu='7' cpuset='11'/>
  <emulatorpin cpuset='6,9'/>
</cputune>
Set the VM to use 1 GiB huge pages:
<memoryBacking>
  <hugepages>
    <page size='1' unit='GiB'/>
  </hugepages>
</memoryBacking>
Configure the VM’s NUMA nodes to use memory from the corresponding NUMA nodes on the host. To do so, use the <memnode/> elements in the <numatune/> section:

<numatune>
  <memory mode="preferred" nodeset="1"/>
  <memnode cellid="0" mode="strict" nodeset="0"/>
  <memnode cellid="1" mode="strict" nodeset="1"/>
</numatune>
Ensure the CPU mode is set to host-passthrough, and that the CPU uses cache in passthrough mode:

<cpu mode="host-passthrough">
  <topology sockets="2" cores="2" threads="2"/>
  <cache mode="passthrough"/>
Verification
Confirm that the resulting XML configuration of the VM includes a section similar to the following:
[...]
<memoryBacking>
  <hugepages>
    <page size='1' unit='GiB'/>
  </hugepages>
</memoryBacking>
<vcpu placement='static'>8</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='1'/>
  <vcpupin vcpu='1' cpuset='4'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='5'/>
  <vcpupin vcpu='4' cpuset='7'/>
  <vcpupin vcpu='5' cpuset='10'/>
  <vcpupin vcpu='6' cpuset='8'/>
  <vcpupin vcpu='7' cpuset='11'/>
  <emulatorpin cpuset='6,9'/>
</cputune>
<numatune>
  <memory mode="preferred" nodeset="1"/>
  <memnode cellid="0" mode="strict" nodeset="0"/>
  <memnode cellid="1" mode="strict" nodeset="1"/>
</numatune>
<cpu mode="host-passthrough">
  <topology sockets="2" cores="2" threads="2"/>
  <cache mode="passthrough"/>
  <numa>
    <cell id="0" cpus="0-3" memory="2" unit="GiB">
      <distances>
        <sibling id="0" value="10"/>
        <sibling id="1" value="21"/>
      </distances>
    </cell>
    <cell id="1" cpus="4-7" memory="2" unit="GiB">
      <distances>
        <sibling id="0" value="21"/>
        <sibling id="1" value="10"/>
      </distances>
    </cell>
  </numa>
</cpu>
</domain>
- Optional: Test the performance of the VM using the applicable tools and utilities to evaluate the impact of the VM’s optimization.
Sample topology
The following tables illustrate the connections between the vCPUs and the host CPUs they should be pinned to:
Table 13.1. Host topology

CPU threads:  0,3   1,4   2,5   6,9   7,10   8,11
Cores:         0     1     2     3     4      5
Sockets:             0                 1
NUMA nodes:          0                 1
Table 13.2. VM topology

vCPU threads:  0,1   2,3   4,5   6,7
Cores:          0     1     2     3
Sockets:           0           1
NUMA nodes:        0           1
Table 13.3. Combined host and VM topology

vCPU threads:            0,1   2,3         4,5    6,7
Host CPU threads:  0,3   1,4   2,5   6,9   7,10   8,11
Cores:              0     1     2     3     4      5
Sockets:                  0                  1
NUMA nodes:               0                  1
In this scenario, there are 2 NUMA nodes and 8 vCPUs. Therefore, 4 vCPU threads should be pinned to each node.
In addition, Red Hat recommends leaving at least a single CPU thread available on each node for host system operations.
Because in this example, each NUMA node houses 3 cores, each with 2 host CPU threads, the set for node 0 translates as follows:
<vcpupin vcpu='0' cpuset='1'/>
<vcpupin vcpu='1' cpuset='4'/>
<vcpupin vcpu='2' cpuset='2'/>
<vcpupin vcpu='3' cpuset='5'/>
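Applying the same logic to node 1, whose cores hold host CPU threads 7, 10, 8, and 11 after reserving threads 6 and 9 for the emulator, yields the remaining pins:

```xml
<vcpupin vcpu='4' cpuset='7'/>
<vcpupin vcpu='5' cpuset='10'/>
<vcpupin vcpu='6' cpuset='8'/>
<vcpupin vcpu='7' cpuset='11'/>
```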
13.6.5. Managing kernel same-page merging
Kernel Same-Page Merging (KSM) improves memory density by sharing identical memory pages between virtual machines (VMs). However, enabling KSM increases CPU utilization, and might adversely affect overall performance depending on the workload.
Depending on your requirements, you can either enable or disable KSM for a single session or persistently.
In RHEL 9 and later, KSM is disabled by default.
Prerequisites
- Root access to your host system.
Procedure
Disable KSM:
To deactivate KSM for a single session, use the systemctl utility to stop the ksm and ksmtuned services:

# systemctl stop ksm
# systemctl stop ksmtuned
To deactivate KSM persistently, use the systemctl utility to disable the ksm and ksmtuned services:

# systemctl disable ksm
Removed /etc/systemd/system/multi-user.target.wants/ksm.service.
# systemctl disable ksmtuned
Removed /etc/systemd/system/multi-user.target.wants/ksmtuned.service.
Memory pages shared between VMs before deactivating KSM will remain shared. To stop sharing, delete all the PageKSM
pages in the system using the following command:
# echo 2 > /sys/kernel/mm/ksm/run
After anonymous pages replace the KSM pages, the khugepaged
kernel service will rebuild transparent hugepages on the VM’s physical memory.
- Enable KSM:
Enabling KSM increases CPU utilization and affects overall CPU performance.
Install the ksmtuned service:

# yum install ksmtuned
Start the service:
To enable KSM for a single session, use the systemctl utility to start the ksm and ksmtuned services:

# systemctl start ksm
# systemctl start ksmtuned
To enable KSM persistently, use the systemctl utility to enable the ksm and ksmtuned services:

# systemctl enable ksm
Created symlink /etc/systemd/system/multi-user.target.wants/ksm.service → /usr/lib/systemd/system/ksm.service
# systemctl enable ksmtuned
Created symlink /etc/systemd/system/multi-user.target.wants/ksmtuned.service → /usr/lib/systemd/system/ksmtuned.service
13.7. Optimizing virtual machine network performance
Due to the virtual nature of a VM’s network interface card (NIC), the VM loses a portion of its allocated host network bandwidth, which can reduce the overall workload efficiency of the VM. The following tips can minimize the negative impact of virtualization on the virtual NIC (vNIC) throughput.
Procedure
Use any of the following methods and observe if it has a beneficial effect on your VM network performance:
- Enable the vhost_net module
On the host, ensure the vhost_net kernel feature is enabled:

# lsmod | grep vhost
vhost_net      32768  1
vhost          53248  1 vhost_net
tap            24576  1 vhost_net
tun            57344  6 vhost_net
If the output of this command is blank, enable the vhost_net kernel module:

# modprobe vhost_net
- Set up multi-queue virtio-net
To set up the multi-queue virtio-net feature for a VM, use the virsh edit command to edit the XML configuration of the VM. In the XML, add the following to the <devices> section, and replace N with the number of vCPUs in the VM, up to 16:

<interface type='network'>
  <source network='default'/>
  <model type='virtio'/>
  <driver name='vhost' queues='N'/>
</interface>
If the VM is running, restart it for the changes to take effect.
- Batching network packets
In Linux VM configurations with a long transmission path, batching packets before submitting them to the kernel may improve cache utilization. To set up packet batching, use the following command on the host, and replace tap0 with the name of the network interface that the VMs use:
# ethtool -C tap0 rx-frames 64
- SR-IOV
- If your host NIC supports SR-IOV, use SR-IOV device assignment for your vNICs. For more information, see Managing SR-IOV devices.
Additional resources
13.8. Virtual machine performance monitoring tools
To identify what consumes the most VM resources and which aspect of VM performance needs optimization, you can use performance diagnostic tools, both general and VM-specific.
Default OS performance monitoring tools
For standard performance evaluation, you can use the utilities provided by default by your host and guest operating systems:
On your RHEL 9 host, as root, use the top utility or the system monitor application, and look for qemu and virt in the output. This shows how much host system resources your VMs are consuming.

- If the monitoring tool displays that any of the qemu or virt processes consume a large portion of the host CPU or memory capacity, use the perf utility to investigate. For details, see below.
- In addition, if a vhost_net thread process, named for example vhost_net-1234, is displayed as consuming an excessive amount of host CPU capacity, consider using virtual network optimization features, such as multi-queue virtio-net.
On the guest operating system, use performance utilities and applications available on the system to evaluate which processes consume the most system resources.

- On Linux systems, you can use the top utility.
- On Windows systems, you can use the Task Manager application.
perf kvm
You can use the perf
utility to collect and analyze virtualization-specific statistics about the performance of your RHEL 9 host. To do so:
On the host, install the perf package:
# dnf install perf
Use one of the perf kvm stat commands to display perf statistics for your virtualization host:

- For real-time monitoring of your hypervisor, use the perf kvm stat live command.
- To log the perf data of your hypervisor over a period of time, activate the logging using the perf kvm stat record command. After the command is canceled or interrupted, the data is saved in the perf.data.guest file, which can be analyzed using the perf kvm stat report command.
Analyze the perf output for types of VM-EXIT events and their distribution. For example, the PAUSE_INSTRUCTION events should be infrequent, but in the following output, the high occurrence of this event suggests that the host CPUs are not handling the running vCPUs well. In such a scenario, consider shutting down some of your active VMs, removing vCPUs from these VMs, or tuning the performance of the vCPUs.

# perf kvm stat report

Analyze events for all VMs, all VCPUs:

             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

  EXTERNAL_INTERRUPT     365634    31.59%    18.04%      0.42us  58780.59us    204.08us ( +-   0.99% )
           MSR_WRITE     293428    25.35%     0.13%      0.59us  17873.02us      1.80us ( +-   4.63% )
    PREEMPTION_TIMER     276162    23.86%     0.23%      0.51us  21396.03us      3.38us ( +-   5.19% )
   PAUSE_INSTRUCTION     189375    16.36%    11.75%      0.72us  29655.25us    256.77us ( +-   0.70% )
                 HLT      20440     1.77%    69.83%      0.62us  79319.41us  14134.56us ( +-   0.79% )
              VMCALL      12426     1.07%     0.03%      1.02us   5416.25us      8.77us ( +-   7.36% )
       EXCEPTION_NMI         27     0.00%     0.00%      0.69us      1.34us      0.98us ( +-   3.50% )
       EPT_MISCONFIG          5     0.00%     0.00%      5.15us     10.85us      7.88us ( +-  11.67% )

Total Samples:1157497, Total events handled time:413728274.66us.
Other event types that can signal problems in the output of perf kvm stat include:
- INSN_EMULATION - suggests suboptimal VM I/O configuration.
- For more information on using perf to monitor virtualization performance, see the perf-kvm man page.
numastat
To see the current NUMA configuration of your system, you can use the numastat utility, which is provided by the numactl package.
The following shows a host with 4 running VMs, each obtaining memory from multiple NUMA nodes. This is not optimal for vCPU performance, and warrants adjusting:
# numastat -c qemu-kvm
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
--------------- ------ ------ ------ ------ ------ ------ ------ ------ -----
51722 (qemu-kvm) 68 16 357 6936 2 3 147 598 8128
51747 (qemu-kvm) 245 11 5 18 5172 2532 1 92 8076
53736 (qemu-kvm) 62 432 1661 506 4851 136 22 445 8116
53773 (qemu-kvm) 1393 3 1 2 12 0 0 6702 8114
--------------- ------ ------ ------ ------ ------ ------ ------ ------ -----
Total 1769 463 2024 7462 10037 2672 169 7837 32434
In contrast, the following shows memory being provided to each VM by a single node, which is significantly more efficient.
# numastat -c qemu-kvm
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
--------------- ------ ------ ------ ------ ------ ------ ------ ------ -----
51747 (qemu-kvm) 0 0 7 0 8072 0 1 0 8080
53736 (qemu-kvm) 0 0 7 0 0 0 8113 0 8120
53773 (qemu-kvm) 0 0 7 0 0 0 1 8110 8118
59065 (qemu-kvm) 0 0 8050 0 0 0 0 0 8051
--------------- ------ ------ ------ ------ ------ ------ ------ ------ -----
Total 0 0 8072 0 8072 0 8114 8110 32368
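A spread like the one in the first example can be detected programmatically. The following awk sketch is a hypothetical helper, not a numastat feature: it parses numastat -c style rows (the two sample rows are taken from the outputs above) and flags any process whose largest single-node allocation covers less than 90% of its total.

```shell
# Flag qemu-kvm processes whose memory is not concentrated on one NUMA node.
# Input format: PID (name) node0 ... nodeN total, as printed by "numastat -c".
numastat_spread=$(awk '/qemu-kvm/ {
    max = 0
    for (i = 3; i < NF; i++) if ($i > max) max = $i
    if (max < 0.9 * $NF)
        printf "PID %s: memory spread across nodes (max node %d of %d MB)\n", $1, max, $NF
}' <<'EOF'
51722 (qemu-kvm) 68 16 357 6936 2 3 147 598 8128
51747 (qemu-kvm) 0 0 7 0 8072 0 1 0 8080
EOF
)
echo "$numastat_spread"
```

Only the first process is reported, because the second already obtains almost all of its memory from a single node.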
13.9. Additional resources
Chapter 14. Managing power consumption with PowerTOP
As a system administrator, you can use the PowerTOP tool to analyze and manage power consumption.
14.1. The purpose of PowerTOP
PowerTOP is a program that diagnoses issues related to power consumption and provides suggestions on how to extend battery lifetime.
The PowerTOP tool can provide an estimate of the total power usage of the system and also individual power usage for each process, device, kernel worker, timer, and interrupt handler. The tool can also identify specific components of kernel and user-space applications that frequently wake up the CPU.
Red Hat Enterprise Linux 9 uses version 2.x of PowerTOP.
14.2. Using PowerTOP
Prerequisites
To be able to use PowerTOP, make sure that the powertop package has been installed on your system:
# dnf install powertop
14.2.1. Starting PowerTOP
Procedure
To run PowerTOP, use the following command:
# powertop
Laptops should run on battery power when running the powertop command.
14.2.2. Calibrating PowerTOP
Procedure
On a laptop, you can calibrate the power estimation engine by running the following command:
# powertop --calibrate
Let the calibration finish without interacting with the machine during the process.
Calibration takes time because the process performs various tests, cycles through brightness levels and switches devices on and off.
When the calibration process is completed, PowerTOP starts as normal. Let it run for approximately an hour to collect data.
When enough data is collected, power estimation figures will be displayed in the first column of the output table.
Note that powertop --calibrate can only be used on laptops.
14.2.3. Setting the measuring interval
By default, PowerTOP takes measurements in 20-second intervals.
If you want to change this measuring frequency, use the following procedure:
Procedure
Run the powertop command with the --time option:
# powertop --time=time in seconds
14.2.4. Additional resources
For more details on how to use PowerTOP, see the powertop man page.
14.3. PowerTOP statistics
While it runs, PowerTOP gathers statistics from the system.
PowerTOP's output provides multiple tabs:
- Overview
- Idle stats
- Frequency stats
- Device stats
- Tunables
- WakeUp
You can use the Tab and Shift+Tab keys to cycle through these tabs.
14.3.1. The Overview tab
In the Overview tab, you can view a list of the components that either send wakeups to the CPU most frequently or consume the most power. The items within the Overview tab, including processes, interrupts, devices, and other resources, are sorted according to their utilization.
The adjacent columns within the Overview tab provide the following pieces of information:
- Usage
- Power estimation of how the resource is being used.
- Events/s
- Wakeups per second. The number of wakeups per second indicates how efficiently the services or the devices and drivers of the kernel are performing. Fewer wakeups mean that less power is consumed. Components are ordered by how much further their power usage can be optimized.
- Category
- Classification of the component, such as process, device, or timer.
- Description
- Description of the component.
If PowerTOP is properly calibrated, a power consumption estimation for every listed item is also shown in the first column.
Apart from this, the Overview tab includes a line with summary statistics, such as:
- Total power consumption
- Remaining battery life (only if applicable)
- Summary of total wakeups per second, GPU operations per second, and virtual file system operations per second
14.3.2. The Idle stats tab
The Idle stats tab shows usage of C-states for all processors and cores, while the Frequency stats tab shows usage of P-states including the Turbo mode, if applicable, for all processors and cores. The duration of C- or P-states is an indication of how well the CPU usage has been optimized. The longer the CPU stays in the higher C- or P-states (for example C4 is higher than C3), the better the CPU usage optimization is. Ideally, residency is 90% or more in the highest C- or P-state when the system is idle.
14.3.3. The Device stats tab
The Device stats tab provides similar information to the Overview tab but only for devices.
14.3.4. The Tunables tab
The Tunables tab contains PowerTOP's suggestions for optimizing the system for lower power consumption.
Use the up and down keys to move through suggestions, and the enter key to toggle the suggestion on or off.
14.3.5. The WakeUp tab
The WakeUp tab displays the device wakeup settings that users can change as required.
Use the up and down keys to move through the available settings, and the enter key to enable or disable a setting.
Figure 14.1. PowerTOP output

Additional resources
For more details on PowerTOP, see PowerTOP’s home page.
14.4. Why PowerTOP does not display Frequency stats values in some instances
While using the Intel P-State driver, PowerTOP only displays values in the Frequency stats tab if the driver is in passive mode. However, even in this case, the values may be incomplete.
In total, there are three possible modes of the Intel P-State driver:
- Active mode with Hardware P-States (HWP)
- Active mode without HWP
- Passive mode
Switching to the ACPI CPUfreq driver results in complete information being displayed by PowerTOP. However, it is recommended to keep your system on the default settings.
To see what driver is loaded and in what mode, run:
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
- intel_pstate is returned if the Intel P-State driver is loaded and in active mode.
- intel_cpufreq is returned if the Intel P-State driver is loaded and in passive mode.
- acpi-cpufreq is returned if the ACPI CPUfreq driver is loaded.
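The mapping above can be wrapped in a small shell helper. This is a sketch; the pstate_mode function name is hypothetical and not part of any RHEL tooling:

```shell
# Hypothetical helper: translate a scaling_driver value into the
# driver/mode combinations described above.
pstate_mode() {
    case "$1" in
        intel_pstate)  echo "Intel P-State driver, active mode" ;;
        intel_cpufreq) echo "Intel P-State driver, passive mode" ;;
        acpi-cpufreq)  echo "ACPI CPUfreq driver" ;;
        *)             echo "unknown driver: $1" ;;
    esac
}

# On a real system, you would pass the contents of
# /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver; here a sample value:
pstate_mode intel_cpufreq
```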
While using the Intel P-State driver, add the following argument to the kernel boot command line to force the driver to run in passive mode:
intel_pstate=passive
To disable the Intel P-State driver and use the ACPI CPUfreq driver instead, add the following argument to the kernel boot command line:
intel_pstate=disable
14.5. Generating an HTML output
Apart from powertop's output in the terminal, you can also generate an HTML report.
Procedure
Run the powertop command with the --html option:
# powertop --html=htmlfile.html
Replace the htmlfile.html parameter with the required name for the output file.
14.6. Optimizing power consumption
To optimize power consumption, you can use either the powertop service or the powertop2tuned utility.
14.6.1. Optimizing power consumption using the powertop service
You can use the powertop service to automatically enable all of PowerTOP's suggestions from the Tunables tab at boot time:
Procedure
Enable the powertop service:
# systemctl enable powertop
14.6.2. The powertop2tuned utility
The powertop2tuned utility allows you to create custom TuneD profiles from PowerTOP suggestions.
By default, powertop2tuned creates profiles in the /etc/tuned/ directory, and bases the custom profile on the currently selected TuneD profile. For safety reasons, all PowerTOP tunings are initially disabled in the new profile.
To enable the tunings, you can:
- Uncomment them in the /etc/tuned/profile_name/tuned.conf file.
- Use the --enable or -e option to generate a new profile that enables most of the tunings suggested by PowerTOP.
Certain potentially problematic tunings, such as the USB autosuspend, are disabled by default and need to be uncommented manually.
14.6.3. Optimizing power consumption using the powertop2tuned utility
Prerequisites
The powertop2tuned utility is installed on the system:
# dnf install tuned-utils
Procedure
Create a custom profile:
# powertop2tuned new_profile_name
Activate the new profile:
# tuned-adm profile new_profile_name
Additional information
For a complete list of options that powertop2tuned supports, use:
$ powertop2tuned --help
14.6.4. Comparison of powertop.service and powertop2tuned
Optimizing power consumption with powertop2tuned is preferred over powertop.service for the following reasons:
- The powertop2tuned utility integrates PowerTOP into TuneD, which enables you to benefit from the advantages of both tools.
- The powertop2tuned utility allows for fine-grained control of enabled tunings.
- With powertop2tuned, potentially dangerous tunings are not automatically enabled.
- With powertop2tuned, rollback is possible without a reboot.
Chapter 15. Getting started with perf
As a system administrator, you can use the perf tool to collect and analyze performance data of your system.
15.1. Introduction to perf
The perf user-space tool interfaces with the kernel-based subsystem Performance Counters for Linux (PCL). perf is a powerful tool that uses the Performance Monitoring Unit (PMU) to measure, record, and monitor a variety of hardware and software events. perf also supports tracepoints, kprobes, and uprobes.
15.2. Installing perf
This procedure installs the perf user-space tool.
Procedure
Install the perf tool:
# dnf install perf
15.3. Common perf commands
This section provides an overview of commonly used perf commands.
Commonly used perf commands
perf stat
- This command provides overall statistics for common performance events, including instructions executed and clock cycles consumed. Options allow for selection of events other than the default measurement events.
perf record
- This command records performance data into a file, perf.data, which can be later analyzed using the perf report command.
perf report
- This command reads and displays the performance data from the perf.data file created by perf record.
perf list
- This command lists the events available on a particular machine. These events will vary based on performance monitoring hardware and software configuration of the system.
perf top
- This command performs a similar function to the top utility. It generates and displays a performance counter profile in realtime.
perf trace
- This command performs a similar function to the strace tool. It monitors the system calls used by a specified thread or process and all signals received by that application.
perf help
- This command displays a complete list of perf commands.
Additional resources
- Add the --help option to a subcommand to open the man page.
Chapter 16. Profiling memory allocation with numastat
With the numastat tool, you can display statistics about memory allocations in a system.
The numastat tool displays data for each NUMA node separately. You can use this information to investigate the memory performance of your system or the effectiveness of different memory policies on your system.
16.1. Default numastat statistics
By default, the numastat tool displays statistics for these categories of data for each NUMA node:
numa_hit
- The number of pages that were successfully allocated to this node.
numa_miss
- The number of pages that were allocated on this node because of low memory on the intended node. Each numa_miss event has a corresponding numa_foreign event on another node.
numa_foreign
- The number of pages initially intended for this node that were allocated to another node instead. Each numa_foreign event has a corresponding numa_miss event on another node.
interleave_hit
- The number of interleave policy pages successfully allocated to this node.
local_node
- The number of pages successfully allocated on this node by a process on this node.
other_node
- The number of pages allocated on this node by a process on another node.
High numa_hit values and low numa_miss values (relative to each other) indicate optimal performance.
16.2. Viewing memory allocation with numastat
You can view the memory allocation of the system by using the numastat tool.
Prerequisites
Install the numactl package:
# dnf install numactl
Procedure
View the memory allocation of your system:
$ numastat
                           node0           node1
numa_hit                76557759        92126519
numa_miss               30772308        30827638
numa_foreign            30827638        30772308
interleave_hit            106507          103832
local_node              76502227        92086995
other_node              30827840        30867162
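The counters are related: for each node, numa_hit plus numa_miss should normally equal local_node plus other_node, because both pairs count the pages that ended up allocated on that node. A short awk sketch (the variable name is illustrative) verifies this on the sample figures above:

```shell
# Check that numa_hit + numa_miss matches local_node + other_node per node,
# using the sample numastat figures shown above.
numa_check=$(awk '
    $1 == "numa_hit"   { h0 = $2; h1 = $3 }
    $1 == "numa_miss"  { m0 = $2; m1 = $3 }
    $1 == "local_node" { l0 = $2; l1 = $3 }
    $1 == "other_node" { o0 = $2; o1 = $3 }
    END {
        print ((h0 + m0 == l0 + o0) ? "node0 consistent" : "node0 mismatch")
        print ((h1 + m1 == l1 + o1) ? "node1 consistent" : "node1 mismatch")
    }' <<'EOF'
numa_hit 76557759 92126519
numa_miss 30772308 30827638
numa_foreign 30827638 30772308
interleave_hit 106507 103832
local_node 76502227 92086995
other_node 30827840 30867162
EOF
)
echo "$numa_check"
```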
Additional resources
- numastat(8) man page
Chapter 17. Configuring an operating system to optimize CPU utilization
This section describes how to configure the operating system to optimize CPU utilization across workloads.
17.1. Tools for monitoring and diagnosing processor issues
The following are the tools available in Red Hat Enterprise Linux 9 to monitor and diagnose processor-related performance issues:
- The turbostat tool prints counter results at specified intervals to help administrators identify unexpected behavior in servers, such as excessive power usage, failure to enter deep sleep states, or system management interrupts (SMIs) being created unnecessarily.
- The numactl utility provides a number of options to manage processor and memory affinity. The numactl package includes the libnuma library, which offers a simple programming interface to the NUMA policy supported by the kernel and can be used for more fine-grained tuning than the numactl application.
- The numastat tool displays per-NUMA node memory statistics for the operating system and its processes, and shows administrators whether the process memory is spread throughout a system or is centralized on specific nodes. This tool is provided by the numactl package.
- numad is an automatic NUMA affinity management daemon. It monitors NUMA topology and resource usage within a system in order to dynamically improve NUMA resource allocation and management.
- The /proc/interrupts file displays the interrupt request (IRQ) number, the number of similar interrupt requests handled by each processor in the system, the type of interrupt sent, and a comma-separated list of devices that respond to the listed interrupt request.
- The pqos utility is available in the intel-cmt-cat package. It monitors CPU cache and memory bandwidth on recent Intel processors. It monitors:
  - The instructions per cycle (IPC).
  - The count of last level cache MISSES.
  - The size in kilobytes that the program executing in a given CPU occupies in the LLC.
  - The bandwidth to local memory (MBL).
  - The bandwidth to remote memory (MBR).
- The x86_energy_perf_policy tool allows administrators to define the relative importance of performance and energy efficiency. This information can then be used to influence processors that support this feature when they select options that trade off between performance and energy efficiency.
- The taskset tool is provided by the util-linux package. It allows administrators to retrieve and set the processor affinity of a running process, or launch a process with a specified processor affinity.
Additional resources
- turbostat(8), numactl(8), numastat(8), numa(7), numad(8), pqos(8), x86_energy_perf_policy(8), and taskset(1) man pages
17.2. Types of system topology
In modern computing, the idea of a CPU is a misleading one, as most modern systems have multiple processors. The topology of the system is the way these processors are connected to each other and to other system resources. This can affect system and application performance, and the tuning considerations for a system.
The following are the two primary types of topology used in modern computing:
Symmetric Multi-Processor (SMP) topology
- SMP topology allows all processors to access memory in the same amount of time. However, because shared and equal memory access inherently forces serialized memory accesses from all the CPUs, SMP system scaling constraints are now generally viewed as unacceptable. For this reason, practically all modern server systems are NUMA machines.
Non-Uniform Memory Access (NUMA) topology
NUMA topology was developed more recently than SMP topology. In a NUMA system, multiple processors are physically grouped on a socket. Each socket has a dedicated area of memory and processors that have local access to that memory; together, these are referred to as a node. Processors on the same node have high speed access to that node’s memory bank, and slower access to memory banks not on their node.
Therefore, there is a performance penalty when accessing non-local memory. Thus, performance sensitive applications on a system with NUMA topology should access memory that is on the same node as the processor executing the application, and should avoid accessing remote memory wherever possible.
Multi-threaded applications that are sensitive to performance may benefit from being configured to execute on a specific NUMA node rather than a specific processor. Whether this is suitable depends on your system and the requirements of your application. If multiple application threads access the same cached data, then configuring those threads to execute on the same processor may be suitable. However, if multiple threads that access and cache different data execute on the same processor, each thread may evict cached data accessed by a previous thread. This means that each thread 'misses' the cache and wastes execution time fetching data from memory and replacing it in the cache. Use the perf tool to check for an excessive number of cache misses.
17.2.1. Displaying system topologies
There are a number of commands that help you understand the topology of a system. This procedure describes how to determine the system topology.
Procedure
To display an overview of your system topology:
$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 4 8 12 16 20 24 28 32 36
node 0 size: 65415 MB
node 0 free: 43971 MB
[...]
To gather the information about the CPU architecture, such as the number of CPUs, threads, cores, sockets, and NUMA nodes:
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    1
Core(s) per socket:    10
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7- 4870  @ 2.40GHz
Stepping:              2
CPU MHz:               2394.204
BogoMIPS:              4787.85
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0,4,8,12,16,20,24,28,32,36
NUMA node1 CPU(s):     2,6,10,14,18,22,26,30,34,38
NUMA node2 CPU(s):     1,5,9,13,17,21,25,29,33,37
NUMA node3 CPU(s):     3,7,11,15,19,23,27,31,35,39
To view a graphical representation of your system:
# dnf install hwloc-gui
# lstopo
Figure 17.1. The lstopo output
To view the detailed textual output:
# dnf install hwloc
# lstopo-no-graphics
Machine (15GB)
  Package L#0 + L3 L#0 (8192KB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#4)
  HostBridge L#0
    PCI 8086:5917
      GPU L#0 "renderD128"
      GPU L#1 "controlD64"
      GPU L#2 "card0"
    PCIBridge
      PCI 8086:24fd
        Net L#3 "wlp61s0"
    PCIBridge
      PCI 8086:f1a6
    PCI 8086:15d7
      Net L#4 "enp0s31f6"
Additional resources
- numactl(8), lscpu(1), and lstopo(1) man pages
17.3. Configuring kernel tick time
By default, Red Hat Enterprise Linux 9 uses a tickless kernel, which does not interrupt idle CPUs in order to reduce power usage and allow new processors to take advantage of deep sleep states.
Red Hat Enterprise Linux 9 also offers a dynamic tickless option, which is useful for latency-sensitive workloads, such as high performance computing or realtime computing. By default, the dynamic tickless option is disabled. Red Hat recommends using the cpu-partitioning TuneD profile to enable the dynamic tickless option for cores specified as isolated_cores.
This procedure describes how to manually persistently enable dynamic tickless behavior.
Procedure
To enable dynamic tickless behavior in certain cores, specify those cores on the kernel command line with the nohz_full parameter. On a 16 core system, append this parameter to the GRUB_CMDLINE_LINUX option in the /etc/default/grub file:
nohz_full=1-15
This enables dynamic tickless behavior on cores 1 through 15, moving all timekeeping to the only unspecified core (core 0).
To persistently enable the dynamic tickless behavior, regenerate the GRUB2 configuration using the edited default file. On systems with BIOS firmware, execute the following command:
# grub2-mkconfig -o /etc/grub2.cfg
On systems with UEFI firmware, execute the following command:
# grub2-mkconfig -o /etc/grub2-efi.cfg
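The edit to /etc/default/grub from the first step can also be scripted. The following sed sketch appends nohz_full=1-15 to the existing GRUB_CMDLINE_LINUX value; it operates on a sample copy of the file so it is safe to experiment with, and the sample kernel arguments are illustrative:

```shell
# Demonstrate the edit on a sample copy; on a real system you would edit
# /etc/default/grub itself and then regenerate the GRUB configuration.
printf 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX="rhgb quiet"\n' > /tmp/grub.sample
sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 nohz_full=1-15"/' /tmp/grub.sample
grep '^GRUB_CMDLINE_LINUX' /tmp/grub.sample
```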
When the system boots, manually move the rcu threads to the non-latency-sensitive core, in this case core 0:
# for i in `pgrep rcu[^c]` ; do taskset -pc 0 $i ; done
- Optional: Use the isolcpus parameter on the kernel command line to isolate certain cores from user-space tasks.
- Optional: Set the CPU affinity for the kernel's write-back bdi-flush threads to the housekeeping core:
echo 1 > /sys/bus/workqueue/devices/writeback/cpumask
Verification steps
Once the system is rebooted, verify if dynticks are enabled:
# journalctl -xe | grep dynticks
Mar 15 18:34:54 rhel-server kernel: NO_HZ: Full dynticks CPUs: 1-15.
Verify that the dynamic tickless configuration is working correctly:
# perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 sleep 3
This command measures ticks on CPU 1 while telling CPU 1 to sleep for 3 seconds.
The default kernel timer configuration shows around 3100 ticks on a regular CPU:
# perf stat -C 0 -e irq_vectors:local_timer_entry taskset -c 0 sleep 3

 Performance counter stats for 'CPU(s) 0':

             3,107      irq_vectors:local_timer_entry

       3.001342790 seconds time elapsed
With the dynamic tickless kernel configured, you should see around 4 ticks instead:
# perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 sleep 3

 Performance counter stats for 'CPU(s) 1':

                 4      irq_vectors:local_timer_entry

       3.001544078 seconds time elapsed
Additional resources
17.4. Overview of an interrupt request
An interrupt request or IRQ is a signal for immediate attention sent from a piece of hardware to a processor. Each device in a system is assigned one or more IRQ numbers which allow it to send unique interrupts. When interrupts are enabled, a processor that receives an interrupt request immediately pauses execution of the current application thread in order to address the interrupt request.
Because interrupts halt normal operation, high interrupt rates can severely degrade system performance. It is possible to reduce the amount of time taken by interrupts by configuring interrupt affinity or by sending a number of lower priority interrupts in a batch (coalescing a number of interrupts).
Interrupt requests have an associated affinity property, smp_affinity, which defines the processors that handle the interrupt request. To improve application performance, assign interrupt affinity and process affinity to the same processor, or processors on the same core. This allows the specified interrupt and application threads to share cache lines.
On systems that support interrupt steering, modifying the smp_affinity property of an interrupt request sets up the hardware so that the decision to service an interrupt with a particular processor is made at the hardware level with no intervention from the kernel.
17.4.1. Balancing interrupts manually
If your BIOS exports its NUMA topology, the irqbalance service can automatically serve interrupt requests on the node that is local to the hardware requesting service.
Procedure
- Check which devices correspond to the interrupt requests that you want to configure.
Find the hardware specification for your platform. Check if the chipset on your system supports distributing interrupts.
- If it does, you can configure interrupt delivery as described in the following steps. Additionally, check which algorithm your chipset uses to balance interrupts. Some BIOSes have options to configure interrupt delivery.
- If it does not, your chipset always routes all interrupts to a single, static CPU. You cannot configure which CPU is used.
Check which Advanced Programmable Interrupt Controller (APIC) mode is in use on your system:
$ journalctl --dmesg | grep APIC
Here,
- If your system uses a mode other than flat, you can see a line similar to Setting APIC routing to physical flat.
- If you can see no such message, your system uses flat mode.
If your system uses x2apic mode, you can disable it by adding the nox2apic option to the kernel command line in the bootloader configuration.
Only non-physical flat mode (flat) supports distributing interrupts to multiple CPUs. This mode is available only for systems that have up to 8 CPUs.
- Calculate the smp_affinity mask. For more information on how to calculate the smp_affinity mask, see Setting the smp_affinity mask.
Additional resources
- journalctl(1) and taskset(1) man pages
17.4.2. Setting the smp_affinity mask
The smp_affinity value is stored as a hexadecimal bit mask representing all processors in the system. Each bit configures a different CPU. The least significant bit is CPU 0.
The default value of the mask is f, which means that an interrupt request can be handled on any processor in the system. Setting this value to 1 means that only processor 0 can handle the interrupt.
Procedure
In binary, use the value 1 for CPUs that handle the interrupts. For example, to set CPU 0 and CPU 7 to handle interrupts, use 0000000010000001 as the binary code:
Table 17.1. Binary Bits for CPUs
CPU     15  14  13  12  11  10   9   8   7   6   5   4   3   2   1   0
Binary   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   1
Convert the binary code to hexadecimal:
For example, to convert the binary code using Python:
>>> hex(int('0000000010000001', 2))
'0x81'
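The same conversion can be done directly in the shell with bit arithmetic, building the mask from a list of CPU numbers (a sketch; the variable names are illustrative):

```shell
# Build the smp_affinity mask for CPUs 0 and 7 using shell arithmetic.
mask=0
for cpu in 0 7; do
    mask=$(( mask | (1 << cpu) ))
done
printf '%x\n' "$mask"    # prints 81
```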
On systems with more than 32 processors, you must delimit the smp_affinity values for discrete 32 bit groups. For example, if you want only the first 32 processors of a 64 processor system to service an interrupt request, use 0xffffffff,00000000.
The interrupt affinity value for a particular interrupt request is stored in the associated /proc/irq/irq_number/smp_affinity file. Set the smp_affinity mask in this file:
# echo mask > /proc/irq/irq_number/smp_affinity
Additional resources
- journalctl(1), irqbalance(1), and taskset(1) man pages
Chapter 18. Tuning scheduling policy
In Red Hat Enterprise Linux, the smallest unit of process execution is called a thread. The system scheduler determines which processor runs a thread, and for how long the thread runs. However, because the scheduler’s primary concern is to keep the system busy, it may not schedule threads optimally for application performance.
For example, consider an application on a NUMA system that is running on Node A when a processor on Node B becomes available. To keep the processor on Node B busy, the scheduler moves one of the application’s threads to Node B. However, the application thread still requires access to memory on Node A, and this memory takes longer to access because the thread is now running on Node B and Node A memory is no longer local to the thread. Thus, it may take longer for the thread to finish running on Node B than it would have taken to wait for a processor on Node A to become available, and then to execute the thread on the original node with local memory access.
18.1. Categories of scheduling policies
Performance sensitive applications often benefit from the designer or administrator determining where threads are run. The Linux scheduler implements a number of scheduling policies which determine where and for how long a thread runs.
The following are the two major categories of scheduling policies:
Normal policies
- Normal threads are used for tasks of normal priority.
Realtime policies
Realtime policies are used for time-sensitive tasks that must complete without interruptions. Realtime threads are not subject to time slicing. This means that a realtime thread runs until it blocks, exits, voluntarily yields, or is preempted by a higher priority thread.
The lowest priority realtime thread is scheduled before any thread with a normal policy. For more information, see Static priority scheduling with SCHED_FIFO and Round robin priority scheduling with SCHED_RR.
Additional resources
- sched(7), sched_setaffinity(2), sched_getaffinity(2), sched_setscheduler(2), and sched_getscheduler(2) man pages
18.2. Static priority scheduling with SCHED_FIFO
The SCHED_FIFO policy, also called static priority scheduling, is a realtime policy that defines a fixed priority for each thread. This policy allows administrators to improve event response time and reduce latency. It is recommended not to execute this policy for an extended period of time for time sensitive tasks.
When SCHED_FIFO is in use, the scheduler scans the list of all the SCHED_FIFO threads in order of priority and schedules the highest priority thread that is ready to run. The priority level of a SCHED_FIFO thread can be any integer from 1 to 99, where 99 is treated as the highest priority. Red Hat recommends starting with a lower number and increasing priority only when you identify latency issues.
Because realtime threads are not subject to time slicing, Red Hat does not recommend setting a priority of 99. This keeps your process at the same priority level as migration and watchdog threads; if your thread goes into a computational loop and these threads are blocked, they will not be able to run. Systems with a single processor will eventually hang in this situation.
Administrators can limit SCHED_FIFO bandwidth to prevent realtime application programmers from initiating realtime tasks that monopolize the processor.
The following are some of the parameters used in this policy:
/proc/sys/kernel/sched_rt_period_us
- This parameter defines the time period, in microseconds, that is considered to be one hundred percent of the processor bandwidth. The default value is 1000000 μs, or 1 second.
/proc/sys/kernel/sched_rt_runtime_us
- This parameter defines the time period, in microseconds, that is devoted to running real-time threads. The default value is 950000 μs, or 0.95 seconds.
18.3. Round robin priority scheduling with SCHED_RR
The SCHED_RR policy is a round-robin variant of SCHED_FIFO. This policy is useful when multiple threads need to run at the same priority level.
Like SCHED_FIFO, SCHED_RR is a realtime policy that defines a fixed priority for each thread. The scheduler scans the list of all SCHED_RR threads in order of priority and schedules the highest priority thread that is ready to run. However, unlike SCHED_FIFO, threads that have the same priority are scheduled in a round-robin style within a certain time slice.
You can set the value of this time slice in milliseconds with the sched_rr_timeslice_ms
kernel parameter in the /proc/sys/kernel/sched_rr_timeslice_ms
file. The lowest value is 1 millisecond
.
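As a hedged sketch, the kernel rejects slices below the 1 millisecond minimum mentioned above, so it can be useful to validate a value before writing it to the file. The helper name below is illustrative, not part of any tool:

```shell
# Hypothetical helper: check a proposed SCHED_RR time slice (in ms)
# against the documented 1 ms minimum before writing it, as root, to
# /proc/sys/kernel/sched_rr_timeslice_ms.
validate_timeslice() {
  if [ "$1" -ge 1 ] 2>/dev/null; then
    echo "ok: ${1} ms"
  else
    echo "invalid: ${1}"
  fi
}
validate_timeslice 10
validate_timeslice 0
```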
18.4. Normal scheduling with SCHED_OTHER
SCHED_OTHER
is the default scheduling policy in Red Hat Enterprise Linux 9. This policy uses the Completely Fair Scheduler (CFS) to allow fair processor access to all threads scheduled with this policy. This policy is most useful when there are a large number of threads or when data throughput is a priority, as it allows more efficient scheduling of threads over time.
When this policy is in use, the scheduler creates a dynamic priority list based partly on the niceness value of each process thread. Administrators can change the niceness value of a process, but cannot change the scheduler’s dynamic priority list directly.
18.5. Setting scheduler policies
Check and adjust scheduler policies and priorities by using the chrt
command-line tool. It can start new processes with the desired properties, or change the scheduling properties of a running process at runtime.
Procedure
View the process ID (PID) of the active processes:
# ps
Use the
--pid
or -p
option with the ps
command to view the details of a particular PID.
Check the scheduling policy, PID, and priority of a particular process:
# chrt -p 468 pid 468's current scheduling policy: SCHED_FIFO pid 468's current scheduling priority: 85 # chrt -p 476 pid 476's current scheduling policy: SCHED_OTHER pid 476's current scheduling priority: 0
Here, 468 and 476 are the PIDs of two example processes.
Set the scheduling policy of a process:
For example, to set the process with PID 1000 to SCHED_FIFO, with a priority of 50:
# chrt -f -p 50 1000
For example, to set the process with PID 1000 to SCHED_OTHER, with a priority of 0:
# chrt -o -p 0 1000
For example, to set the process with PID 1000 to SCHED_RR, with a priority of 10:
# chrt -r -p 10 1000
To start a new application with a particular policy and priority, specify the name of the application:
# chrt -f 36 /bin/my-app
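You can confirm any of these changes with chrt -p. As a self-contained sketch that needs no root privileges (unlike the realtime policies), the commands below query the current shell and start a command under an explicit SCHED_OTHER policy:

```shell
# Query the scheduling policy of the current shell; on a default system
# this reports SCHED_OTHER with priority 0.
chrt -p $$

# Start a command under an explicitly chosen policy. SCHED_OTHER with
# priority 0 needs no root privileges, unlike the -f and -r policies.
chrt -o 0 true
```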
Additional resources
18.6. Policy options for the chrt command
Using the chrt
command, you can view and set the scheduling policy of a process.
The following table describes the appropriate policy options, which can be used to set the scheduling policy of a process.
Table 18.1. Policy Options for the chrt Command
Short option | Long option | Description |
---|---|---|
-f |
--fifo |
Set schedule to SCHED_FIFO |
-o |
--other |
Set schedule to SCHED_OTHER |
-r |
--rr |
Set schedule to SCHED_RR |
18.7. Changing the priority of services during the boot process
Using the systemd
service, it is possible to set up real-time priorities for services launched during the boot process. The unit configuration directives are used to change the priority of a service during the boot process.
The boot process priority change is done by using the following directives in the service section:
CPUSchedulingPolicy=
-
Sets the CPU scheduling policy for executed processes. It is used to set
other
,fifo
, andrr
policies. CPUSchedulingPriority=
-
Sets the CPU scheduling priority for executed processes. The available priority range depends on the selected CPU scheduling policy. For real-time scheduling policies, an integer between
1
(lowest priority) and99
(highest priority) can be used.
The following procedure describes how to change the priority of a service during the boot process by using the mcelog
service.
Prerequisites
Install the tuned package:
# dnf install tuned
Enable and start the tuned service:
# systemctl enable --now tuned
Procedure
View the scheduling priorities of running threads:
# tuna --show_threads thread ctxt_switches pid SCHED_ rtpri affinity voluntary nonvoluntary cmd 1 OTHER 0 0xff 3181 292 systemd 2 OTHER 0 0xff 254 0 kthreadd 3 OTHER 0 0xff 2 0 rcu_gp 4 OTHER 0 0xff 2 0 rcu_par_gp 6 OTHER 0 0 9 0 kworker/0:0H-kblockd 7 OTHER 0 0xff 1301 1 kworker/u16:0-events_unbound 8 OTHER 0 0xff 2 0 mm_percpu_wq 9 OTHER 0 0 266 0 ksoftirqd/0 [...]
Create a supplementary
mcelog
service configuration directory and a drop-in file, and insert the policy name and priority in this file:# mkdir -p /etc/systemd/system/mcelog.service.d/ # cat <<-EOF > /etc/systemd/system/mcelog.service.d/priority.conf > [Service] CPUSchedulingPolicy=fifo CPUSchedulingPriority=20 EOF
Reload the
systemd
configuration:# systemctl daemon-reload
Restart the
mcelog
service:# systemctl restart mcelog
Verification steps
Display the
mcelog
priority set by systemd
:# tuna -t mcelog -P thread ctxt_switches pid SCHED_ rtpri affinity voluntary nonvoluntary cmd 826 FIFO 20 0,1,2,3 13 0 mcelog
Additional resources
-
systemd(1)
andtuna(8)
man pages - Description of the priority range
18.8. Priority map
Priorities are defined in groups, with some groups dedicated to certain kernel functions. For real-time scheduling policies, an integer between 1
(lowest priority) and 99
(highest priority) can be used.
The following table describes the priority range, which can be used while setting the scheduling policy of a process.
Table 18.2. Description of the priority range
Priority | Threads | Description |
---|---|---|
1 | Low priority kernel threads |
This priority is usually reserved for tasks that need to be just above SCHED_OTHER. |
2 - 49 | Available for use | The range used for typical application priorities. |
50 | Default hard-IRQ value | |
51 - 98 | High priority threads | Use this range for threads that execute periodically and must have quick response times. Do not use this range for CPU-bound threads as you will starve interrupts. |
99 | Watchdogs and migration | System threads that must run at the highest priority. |
18.9. TuneD cpu-partitioning profile
For tuning Red Hat Enterprise Linux 9 for latency-sensitive workloads, Red Hat recommends using the cpu-partitioning
TuneD profile.
Prior to Red Hat Enterprise Linux 9, the low-latency Red Hat documentation described the numerous low-level steps needed to achieve low-latency tuning. In Red Hat Enterprise Linux 9, you can perform low-latency tuning more efficiently by using the cpu-partitioning
TuneD profile. This profile is easily customizable according to the requirements for individual low-latency applications.
The following figure demonstrates how to use the cpu-partitioning
profile. This example uses the following CPU and node layout.
Figure 18.1. cpu-partitioning
You can configure the cpu-partitioning profile in the /etc/tuned/cpu-partitioning-variables.conf
file using the following configuration options:
- Isolated CPUs with load balancing
In the cpu-partitioning figure, the blocks numbered from 4 to 23 are the default isolated CPUs. The kernel scheduler’s process load balancing is enabled on these CPUs. They are intended for low-latency processes with multiple threads that need the kernel scheduler load balancing.
You can configure the cpu-partitioning profile in the
/etc/tuned/cpu-partitioning-variables.conf
file using theisolated_cores=cpu-list
option, which lists CPUs to isolate that will use the kernel scheduler load balancing. The list of isolated CPUs is comma-separated, or you can specify a range using a dash, such as
3-5
. This option is mandatory. Any CPU missing from this list is automatically considered a housekeeping CPU.- Isolated CPUs without load balancing
In the cpu-partitioning figure, the blocks numbered 2 and 3 are the isolated CPUs that do not provide any additional kernel scheduler process load balancing.
You can configure the cpu-partitioning profile in the
/etc/tuned/cpu-partitioning-variables.conf
file using theno_balance_cores=cpu-list
option, which lists CPUs to isolate that will not use the kernel scheduler load balancing.Specifying the
no_balance_cores
option is optional; however, any CPUs in this list must be a subset of the CPUs listed in the isolated_cores
list. Application threads using these CPUs need to be pinned individually to each CPU.
- Housekeeping CPUs
-
Any CPU not isolated in the
cpu-partitioning-variables.conf
file is automatically considered a housekeeping CPU. On the housekeeping CPUs, all services, daemons, user processes, movable kernel threads, interrupt handlers, and kernel timers are permitted to execute.
Additional resources
-
tuned-profiles-cpu-partitioning(7)
man page
18.10. Using the TuneD cpu-partitioning profile for low-latency tuning
This procedure describes how to tune a system for low latency by using the TuneD cpu-partitioning
profile. It uses the example of a low-latency application that can use cpu-partitioning
and the CPU layout as mentioned in the cpu-partitioning figure.
The application in this case uses:
- One dedicated reader thread that reads data from the network will be pinned to CPU 2.
- A large number of threads that process this network data will be pinned to CPUs 4-23.
- A dedicated writer thread that writes the processed data to the network will be pinned to CPU 3.
Prerequisites
-
You have installed the
cpu-partitioning
TuneD profile by using thednf install tuned-profiles-cpu-partitioning
command as root.
Procedure
Edit
/etc/tuned/cpu-partitioning-variables.conf
file and add the following information:# Isolated CPUs with the kernel’s scheduler load balancing: isolated_cores=2-23 # Isolated CPUs without the kernel’s scheduler load balancing: no_balance_cores=2,3
Set the
cpu-partitioning
TuneD profile:# tuned-adm profile cpu-partitioning
Reboot
After rebooting, the system is tuned for low latency, according to the isolation in the cpu-partitioning figure. The application can then use taskset to pin the reader and writer threads to CPUs 2 and 3, and the remaining application threads to CPUs 4-23.
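The pinning step can be sketched with taskset. The snippet below pins the current shell as a stand-in, using CPU 0 so it works on any machine; in the real application you would substitute the reader, writer, and worker thread IDs (obtained, for example, from ps -eLo pid,tid,comm):

```shell
# taskset pins a task to CPUs by PID/TID. Here the current shell ($$)
# stands in for an application thread. For the layout above you would
# pin the reader TID to CPU 2, the writer TID to CPU 3, and each worker
# TID to CPUs 4-23, e.g.:
#   taskset -pc 2 "$reader_tid"
taskset -pc 0 $$   # restrict the shell to CPU 0
taskset -pc $$     # read back the current affinity list
```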
Additional resources
-
tuned-profiles-cpu-partitioning(7)
man page
18.11. Customizing the cpu-partitioning TuneD profile
You can extend the TuneD profile to make additional tuning changes.
For example, the cpu-partitioning
profile sets the CPUs to use cstate=1
. To use the cpu-partitioning
profile but additionally change the CPU cstate from cstate1 to cstate0, the following procedure describes a new TuneD profile named my_profile, which inherits the cpu-partitioning
profile and then sets C state 0.
Procedure
Create the
/etc/tuned/my_profile
directory:# mkdir /etc/tuned/my_profile
Create a
tuned.conf
file in this directory, and add the following content:# vi /etc/tuned/my_profile/tuned.conf [main] summary=Customized tuning on top of cpu-partitioning include=cpu-partitioning [cpu] force_latency=cstate.id:0|1
Use the new profile:
# tuned-adm profile my_profile
In this example, a reboot is not required. However, if the changes in the my_profile profile require a reboot to take effect, reboot your machine.
Additional resources
-
tuned-profiles-cpu-partitioning(7)
man page
Chapter 19. Using systemd to manage resources used by applications
RHEL 9 moves the resource management settings from the process level to the application level by binding the system of cgroup
hierarchies with the systemd
unit tree. Therefore, you can manage the system resources with the systemctl
command, or by modifying the systemd
unit files.
To achieve this, systemd
takes various configuration options from the unit files or directly via the systemctl
command. Then systemd
applies those options to specific process groups by utilizing the Linux kernel system calls and features like cgroups
and namespaces
.
You can review the full set of configuration options for systemd
in the following manual pages:
-
systemd.resource-control(5)
-
systemd.exec(5)
19.1. Allocating system resources using systemd
To modify the distribution of system resources, you can apply one or more of the following distribution models:
- Weights
You can distribute the resource by adding up the weights of all sub-groups and giving each sub-group the fraction matching its ratio against the sum.
For example, if you have 10 cgroups, each with weight of value 100, the sum is 1000. Each cgroup receives one tenth of the resource.
Weight is usually used to distribute stateless resources. For example, the CPUWeight= option is an implementation of this resource distribution model.
- Limits
A cgroup can consume up to the configured amount of the resource. The sum of sub-group limits can exceed the limit of the parent cgroup. Therefore it is possible to overcommit resources in this model.
For example, the MemoryMax= option is an implementation of this resource distribution model.
- Protections
You can set up a protected amount of a resource for a cgroup. If the resource usage is below the protection boundary, the kernel will try not to penalize this cgroup in favor of other cgroups that compete for the same resource. An overcommit is also possible.
For example, the MemoryLow= option is an implementation of this resource distribution model.
- Allocations
- Exclusive allocations of an absolute amount of a finite resource. An overcommit is not possible. An example of this resource type in Linux is the real-time budget.
- unit file option
A setting for resource control configuration.
For example, you can configure CPU resource with options like CPUAccounting=, or CPUQuota=. Similarly, you can configure memory or I/O resources with options like AllowedMemoryNodes= and IOAccounting=.
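These options usually live in a unit file or a drop-in. As an illustrative sketch (the service name and file path are hypothetical), a drop-in combining several such options might look like:

```ini
# /etc/systemd/system/example.service.d/resources.conf (hypothetical)
[Service]
CPUAccounting=yes
CPUQuota=20%
IOAccounting=yes
```

After saving a drop-in like this, reload the systemd configuration with systemctl daemon-reload and restart the service.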
Procedure
To change the required value of the unit file option of your service, you can adjust the value in the unit file, or use systemctl
command:
Check the assigned values for the service of your choice.
# systemctl show --property <unit file option> <service name>
Set the required value of the CPU time allocation policy option:
# systemctl set-property <service name> <unit file option>=<value>
Verification steps
Check the newly assigned values for the service of your choice.
# systemctl show --property <unit file option> <service name>
Additional resources
-
systemd.resource-control(5)
,systemd.exec(5)
manual pages
19.2. Role of systemd in resource management
The core function of systemd
is service management and supervision. The systemd
system and service manager ensures that managed services start at the right time and in the correct order during the boot process. The services have to run smoothly to use the underlying hardware platform optimally. Therefore, systemd
also provides capabilities to define resource management policies, and to tune various options, which can improve the performance of the service.
In general, Red Hat recommends you use systemd
for controlling the usage of system resources. You should manually configure the cgroups
virtual file system only in special cases. For example, when you need to use cgroup-v1
controllers that have no equivalents in cgroup-v2
hierarchy.
19.3. Overview of systemd hierarchy for cgroups
On the backend, the systemd
system and service manager makes use of the slice
, the scope
and the service
units to organize and structure processes in the control groups. You can further modify this hierarchy by creating custom unit files or using the systemctl
command. Also, systemd
automatically mounts hierarchies for important kernel resource controllers at the /sys/fs/cgroup/
directory.
Three systemd
unit types are used for resource control:
Service - A process or a group of processes, which
systemd
started according to a unit configuration file. Services encapsulate the specified processes so that they can be started and stopped as one set. Services are named in the following way:<name>.service
Scope - A group of externally created processes. Scopes encapsulate processes that are started and stopped by the arbitrary processes through the
fork()
function and then registered bysystemd
at runtime. For example, user sessions, containers, and virtual machines are treated as scopes. Scopes are named as follows:<name>.scope
Slice - A group of hierarchically organized units. Slices organize a hierarchy in which scopes and services are placed. The actual processes are contained in scopes or in services. Every name of a slice unit corresponds to the path to a location in the hierarchy. The dash ("-") character acts as a separator of the path components to a slice from the
-.slice
root slice. In the following example:<parent-name>.slice
parent-name.slice
is a sub-slice of parent.slice
, which is a sub-slice of the -.slice
root slice. parent-name.slice
can have its own sub-slice named parent-name-name2.slice
, and so on.
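The dash-splitting rule above can be expressed as a small shell sketch (the function name is illustrative): each dash in a slice unit name adds one level to its control group path.

```shell
# Illustrative helper: map a slice unit name to its control group path
# under the -.slice root, following the dash-separator rule above.
slice_to_path() {
  unit=${1%.slice}      # strip the .slice suffix
  path="" prefix=""
  oldifs=$IFS; IFS='-'
  for part in $unit; do
    prefix=${prefix:+$prefix-}$part       # accumulate name components
    path=${path:+$path/}$prefix.slice     # one hierarchy level per dash
  done
  IFS=$oldifs
  echo "$path"
}
slice_to_path parent-name.slice    # parent.slice/parent-name.slice
```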
The service
, the scope
, and the slice
units directly map to objects in the control group hierarchy. When these units are activated, they map directly to control group paths built from the unit names.
The following is an abbreviated example of a control group hierarchy:
Control group /: -.slice ├─user.slice │ ├─user-42.slice │ │ ├─session-c1.scope │ │ │ ├─ 967 gdm-session-worker [pam/gdm-launch-environment] │ │ │ ├─1035 /usr/libexec/gdm-x-session gnome-session --autostart /usr/share/gdm/greeter/autostart │ │ │ ├─1054 /usr/libexec/Xorg vt1 -displayfd 3 -auth /run/user/42/gdm/Xauthority -background none -noreset -keeptty -verbose 3 │ │ │ ├─1212 /usr/libexec/gnome-session-binary --autostart /usr/share/gdm/greeter/autostart │ │ │ ├─1369 /usr/bin/gnome-shell │ │ │ ├─1732 ibus-daemon --xim --panel disable │ │ │ ├─1752 /usr/libexec/ibus-dconf │ │ │ ├─1762 /usr/libexec/ibus-x11 --kill-daemon │ │ │ ├─1912 /usr/libexec/gsd-xsettings │ │ │ ├─1917 /usr/libexec/gsd-a11y-settings │ │ │ ├─1920 /usr/libexec/gsd-clipboard … ├─init.scope │ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 18 └─system.slice ├─rngd.service │ └─800 /sbin/rngd -f ├─systemd-udevd.service │ └─659 /usr/lib/systemd/systemd-udevd ├─chronyd.service │ └─823 /usr/sbin/chronyd ├─auditd.service │ ├─761 /sbin/auditd │ └─763 /usr/sbin/sedispatch ├─accounts-daemon.service │ └─876 /usr/libexec/accounts-daemon ├─example.service │ ├─ 929 /bin/bash /home/jdoe/example.sh │ └─4902 sleep 1 …
The example above shows that services and scopes contain processes and are placed in slices that do not contain processes of their own.
Additional resources
- Configuring basic system settings in Red Hat Enterprise Linux
- What are kernel resource controllers
-
systemd.resource-control(5)
,systemd.exec(5)
,cgroups(7)
,fork()
,fork(2)
manual pages - Understanding cgroups
19.4. Listing systemd units
The following procedure describes how to use the systemd
system and service manager to list its units.
Procedure
To list all active units on the system, execute the
# systemctl
command and the terminal will return an output similar to the following example:# systemctl UNIT LOAD ACTIVE SUB DESCRIPTION … init.scope loaded active running System and Service Manager session-2.scope loaded active running Session 2 of user jdoe abrt-ccpp.service loaded active exited Install ABRT coredump hook abrt-oops.service loaded active running ABRT kernel log watcher abrt-vmcore.service loaded active exited Harvest vmcores for ABRT abrt-xorg.service loaded active running ABRT Xorg log watcher … -.slice loaded active active Root Slice machine.slice loaded active active Virtual Machine and Container Slice system-getty.slice loaded active active system-getty.slice system-lvm2\x2dpvscan.slice loaded active active system-lvm2\x2dpvscan.slice system-sshd\x2dkeygen.slice loaded active active system-sshd\x2dkeygen.slice system-systemd\x2dhibernate\x2dresume.slice loaded active active system-systemd\x2dhibernate\x2dresume> system-user\x2druntime\x2ddir.slice loaded active active system-user\x2druntime\x2ddir.slice system.slice loaded active active System Slice user-1000.slice loaded active active User Slice of UID 1000 user-42.slice loaded active active User Slice of UID 42 user.slice loaded active active User and Session Slice …
-
UNIT
- a name of a unit that also reflects the unit position in a control group hierarchy. The units relevant for resource control are a slice, a scope, and a service. -
LOAD
- indicates whether the unit configuration file was properly loaded. If the unit file failed to load, the field contains the state error instead of loaded. Other unit load states are: stub, merged, and masked. -
ACTIVE
- the high-level unit activation state, which is a generalization ofSUB
. -
SUB
- the low-level unit activation state. The range of possible values depends on the unit type. -
DESCRIPTION
- the description of the unit content and functionality.
-
To list all units, including inactive units, execute:
# systemctl --all
To limit the amount of information in the output, execute:
# systemctl --type service,masked
The
--type
option requires a comma-separated list of unit types such as a service and a slice, or unit load states such as loaded and masked.
Additional resources
- Configuring basic system settings in RHEL
-
systemd.resource-control(5)
,systemd.exec(5)
manual pages
19.5. Viewing systemd control group hierarchy
The following procedure describes how to display control groups (cgroups
) hierarchy and processes running in specific cgroups
.
Procedure
To display the whole
cgroups
hierarchy on your system, execute# systemd-cgls
:# systemd-cgls Control group /: -.slice ├─user.slice │ ├─user-42.slice │ │ ├─session-c1.scope │ │ │ ├─ 965 gdm-session-worker [pam/gdm-launch-environment] │ │ │ ├─1040 /usr/libexec/gdm-x-session gnome-session --autostart /usr/share/gdm/greeter/autostart … ├─init.scope │ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 18 └─system.slice … ├─example.service │ ├─6882 /bin/bash /home/jdoe/example.sh │ └─6902 sleep 1 ├─systemd-journald.service └─629 /usr/lib/systemd/systemd-journald …
The example output returns the entire
cgroups
hierarchy, where the highest level is formed by slices.To display the
cgroups
hierarchy filtered by a resource controller, execute# systemd-cgls <resource_controller>
:# systemd-cgls memory Controller memory; Control group /: ├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 18 ├─user.slice │ ├─user-42.slice │ │ ├─session-c1.scope │ │ │ ├─ 965 gdm-session-worker [pam/gdm-launch-environment] … └─system.slice | … ├─chronyd.service │ └─844 /usr/sbin/chronyd ├─example.service │ ├─8914 /bin/bash /home/jdoe/example.sh │ └─8916 sleep 1 …
The example output of the above command lists the services that interact with the selected controller.
To display detailed information about a certain unit and its part of the
cgroups
hierarchy, execute# systemctl status <system_unit>
:# systemctl status example.service ● example.service - My example service Loaded: loaded (/usr/lib/systemd/system/example.service; enabled; vendor preset: disabled) Active: active (running) since Tue 2019-04-16 12:12:39 CEST; 3s ago Main PID: 17737 (bash) Tasks: 2 (limit: 11522) Memory: 496.0K (limit: 1.5M) CGroup: /system.slice/example.service ├─17737 /bin/bash /home/jdoe/example.sh └─17743 sleep 1 Apr 16 12:12:39 redhat systemd[1]: Started My example service. Apr 16 12:12:39 redhat bash[17737]: The current time is Tue Apr 16 12:12:39 CEST 2019 Apr 16 12:12:40 redhat bash[17737]: The current time is Tue Apr 16 12:12:40 CEST 2019
Additional resources
- What are kernel resource controllers
-
systemd.resource-control(5)
,cgroups(7)
manual pages
19.6. Viewing cgroups of processes
The following procedure describes how to learn which control group (cgroup
) a process belongs to. Then you can check the cgroup
to learn which controllers and controller-specific configurations it uses.
Procedure
To view which
cgroup
a process belongs to, run the# cat /proc/<PID>/cgroup
command:# cat /proc/2467/cgroup 0::/system.slice/example.service
The example output relates to a process of interest. In this case, it is a process identified by
PID 2467
, which belongs to theexample.service
unit. You can determine whether the process was placed in a correct control group as defined by thesystemd
unit file specifications.To display what controllers the
cgroup
utilizes and the respective configuration files, check thecgroup
directory:# cat /sys/fs/cgroup/system.slice/example.service/cgroup.controllers memory pids # ls /sys/fs/cgroup/system.slice/example.service/ cgroup.controllers cgroup.events … cpu.pressure cpu.stat io.pressure memory.current memory.events … pids.current pids.events pids.max
The version 1 hierarchy of cgroups
uses a per-controller model. Therefore the output from the /proc/PID/cgroup
file shows, which cgroups
under each controller the PID belongs to. You can find the respective cgroups
under the controller directories at /sys/fs/cgroup/<controller_name>/
.
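As a quick self-check that works without picking a particular PID, you can inspect the current shell's own entry:

```shell
# Every process has a /proc/<PID>/cgroup file; $$ is the current shell.
# On a cgroup-v2 system this prints a single line of the form
# "0::/<path>", for example "0::/user.slice/user-1000.slice/session-1.scope".
cat /proc/$$/cgroup
```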
Additional resources
-
cgroups(7)
manual page - What are kernel resource controllers
-
Documentation in the
/usr/share/doc/kernel-doc-<kernel_version>/Documentation/admin-guide/cgroup-v2.rst
file (after installing thekernel-doc
package)
19.7. Monitoring resource consumption
The following procedure describes how to view a list of currently running control groups (cgroups
) and their resource consumption in real-time.
Procedure
To see a dynamic account of currently running
cgroups
, execute the# systemd-cgtop
command:# systemd-cgtop Control Group Tasks %CPU Memory Input/s Output/s / 607 29.8 1.5G - - /system.slice 125 - 428.7M - - /system.slice/ModemManager.service 3 - 8.6M - - /system.slice/NetworkManager.service 3 - 12.8M - - /system.slice/accounts-daemon.service 3 - 1.8M - - /system.slice/boot.mount - - 48.0K - - /system.slice/chronyd.service 1 - 2.0M - - /system.slice/cockpit.socket - - 1.3M - - /system.slice/colord.service 3 - 3.5M - - /system.slice/crond.service 1 - 1.8M - - /system.slice/cups.service 1 - 3.1M - - /system.slice/dev-hugepages.mount - - 244.0K - - /system.slice/dev-mapper-rhel\x2dswap.swap - - 912.0K - - /system.slice/dev-mqueue.mount - - 48.0K - - /system.slice/example.service 2 - 2.0M - - /system.slice/firewalld.service 2 - 28.8M - - ...
The example output displays currently running
cgroups
ordered by their resource usage (CPU, memory, disk I/O load). The list refreshes every 1 second by default. Therefore, it offers a dynamic insight into the actual resource usage of each control group.
Additional resources
-
systemd-cgtop(1)
manual page
19.8. Using systemd unit files to set limits for applications
Each existing or running unit is supervised by systemd
, which also creates a control group for each of them. The units have configuration files in the /usr/lib/systemd/system/
directory. You can manually modify the unit files to set limits, prioritize, or control access to hardware resources for groups of processes.
Prerequisites
-
You have the
root
privileges.
Procedure
Modify the
/usr/lib/systemd/system/example.service
file to limit the memory usage of a service:… [Service] MemoryMax=1500K …
The configuration above places a maximum memory limit, which the processes in the control group cannot exceed. The
example.service
service is part of this control group and subject to the limit. You can use the suffixes K, M, G, or T to specify kilobytes, megabytes, gigabytes, or terabytes as the unit of measurement. Reload all unit configuration files:
# systemctl daemon-reload
Restart the service:
# systemctl restart example.service
You can review the full set of configuration options for systemd
in the following manual pages:
-
systemd.resource-control(5)
-
systemd.exec(5)
Verification
Check that the changes took effect:
# cat /sys/fs/cgroup/system.slice/example.service/memory.max 1536000
The example output shows that memory consumption is limited to 1,536,000 bytes, that is, 1500 K.
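The figure in memory.max follows from the suffix arithmetic: the K suffix is a binary multiplier, so the one-liner below reproduces the value shown in the verification step.

```shell
# MemoryMax=1500K means 1500 * 1024 bytes, which matches the 1536000
# reported by the memory.max file in the verification step above.
echo $(( 1500 * 1024 ))
```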
Additional resources
- Understanding cgroups
- Configuring basic system settings in Red Hat Enterprise Linux
-
systemd.resource-control(5)
,systemd.exec(5)
,cgroups(7)
manual pages
19.9. Using systemctl command to set limits to applications
CPU affinity settings help you restrict the access of a particular process to some CPUs. Effectively, the CPU scheduler never schedules the process to run on a CPU that is not in the affinity mask of the process.
The default CPU affinity mask applies to all services managed by systemd
.
To configure CPU affinity mask for a particular systemd
service, systemd
provides CPUAffinity=
both as a unit file option and a manager configuration option in the /etc/systemd/system.conf
file.
The CPUAffinity=
unit file option sets a list of CPUs or CPU ranges that are merged and used as the affinity mask.
After configuring CPU affinity mask for a particular systemd
service, you must restart the service to apply the changes.
Procedure
To set CPU affinity mask for a particular systemd
service using the CPUAffinity
unit file option:
Check the values of the
CPUAffinity
unit file option in the service of your choice:$ systemctl show --property <CPU affinity configuration option> <service name>
As a root, set the required value of the
CPUAffinity
unit file option for the CPU ranges used as the affinity mask:# systemctl set-property <service name> CPUAffinity=<value>
Restart the service to apply the changes.
# systemctl restart <service name>
You can review the full set of configuration options for systemd
in the following manual pages:
-
systemd.resource-control(5)
-
systemd.exec(5)
19.10. Setting global default CPU affinity through manager configuration
The CPUAffinity
option in the /etc/systemd/system.conf
file defines an affinity mask for the process identification number (PID) 1 and all processes forked off of PID 1. You can then override the CPUAffinity
on a per-service basis.
To set default CPU affinity mask for all systemd services using the manager configuration option:
-
Set the CPU numbers for the
CPUAffinity=
option in the/etc/systemd/system.conf
file. Save the edited file and reload the
systemd
service:# systemctl daemon-reload
- Reboot the server to apply the changes.
You can review the full set of configuration options for systemd
in the following manual pages:
-
systemd.resource-control(5)
-
systemd.exec(5)
19.11. Configuring NUMA policies using systemd
Non-uniform memory access (NUMA) is a computer memory subsystem design, in which the memory access time depends on the physical memory location relative to the processor.
Memory close to the CPU has lower latency (local memory) than memory that is local for a different CPU (foreign memory) or is shared between a set of CPUs.
In terms of the Linux kernel, NUMA policy governs where (for example, on which NUMA nodes) the kernel allocates physical memory pages for the process.
systemd
provides unit file options NUMAPolicy
and NUMAMask
to control memory allocation policies for services.
Procedure
To set the NUMA memory policy through the NUMAPolicy
unit file option:
Check the values of the
NUMAPolicy
unit file option in the service of your choice:$ systemctl show --property <NUMA policy configuration option> <service name>
As a root, set the required policy type of the
NUMAPolicy
unit file option:# systemctl set-property <service name> NUMAPolicy=<value>
Restart the service to apply the changes.
# systemctl restart <service name>
To set a global NUMAPolicy
setting through the manager configuration option:
-
Search in the
/etc/systemd/system.conf
file for theNUMAPolicy
option. - Edit the policy type and save the file.
Reload the
systemd
configuration:# systemctl daemon-reload
- Reboot the server.
When you configure a strict NUMA policy, for example bind
, make sure that you also appropriately set the CPUAffinity=
unit file option.
Additional resources
- Using systemctl command to set limits to applications
-
systemd.resource-control(5)
,systemd.exec(5)
,set_mempolicy(2)
manual pages.
19.12. NUMA policy configuration options for systemd
Systemd provides the following options to configure the NUMA policy:
NUMAPolicy
Controls the NUMA memory policy of the executed processes. The following policy types are possible:
- default
- preferred
- bind
- interleave
- local
NUMAMask
Controls the NUMA node list which is associated with the selected NUMA policy.
Note that the
NUMAMask
option is not required to be specified for the following policies:- default
- local
For the preferred policy, the list specifies only a single NUMA node.
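Combining the two options, a hedged example of a drop-in (the service name, file path, and CPU list are hypothetical) that binds a service's memory allocations to NUMA node 0 and, as the note in the previous section recommends for strict policies, also restricts its CPUs:

```ini
# /etc/systemd/system/example.service.d/numa.conf (hypothetical path)
[Service]
NUMAPolicy=bind
NUMAMask=0
# With a strict policy such as bind, also pin the CPUs. CPUs 0-3 are
# assumed to be on node 0 here; check with "numactl --hardware".
CPUAffinity=0-3
```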
Additional resources
-
systemd.resource-control(5)
,systemd.exec(5)
, andset_mempolicy(2)
manual pages
19.13. Creating transient cgroups using systemd-run command
The transient cgroups
set limits on resources consumed by a unit (service or scope) during its runtime.
Procedure
To create a transient control group, use the systemd-run command in the following format:

# systemd-run --unit=<name> --slice=<name>.slice <command>

This command creates and starts a transient service or scope unit and runs a custom command in that unit.
- The --unit=<name> option gives a name to the unit. If --unit is not specified, the name is generated automatically.
- The --slice=<name>.slice option makes your service or scope unit a member of the specified slice. Replace <name>.slice with the name of an existing slice (as shown in the output of systemctl -t slice), or create a new slice by passing a unique name. By default, services and scopes are created as members of system.slice.
- Replace <command> with the command you wish to execute in the service or scope unit.

The following message is displayed to confirm that you created and started the service or the scope successfully:

# Running as unit <name>.service
Optionally, keep the unit running after its processes have finished, to collect runtime information:
# systemd-run --unit=<name> --slice=<name>.slice --remain-after-exit <command>
The command creates and starts a transient service unit and runs a custom command in that unit. The --remain-after-exit option ensures that the service keeps running after its processes have finished.
Additional resources
- Understanding control groups
- Configuring basic system settings in RHEL
-
systemd-run(1)
manual page
19.14. Removing transient control groups
You can use the systemd system and service manager to remove transient control groups (cgroups) if you no longer need to limit, prioritize, or control access to hardware resources for groups of processes.
Transient cgroups are automatically released once all the processes that a service or a scope unit contains finish.
Procedure
To stop the service unit with all its processes, execute:
# systemctl stop name.service
To terminate one or more of the unit processes, execute:
# systemctl kill name.service --kill-who=<main|control|all> --signal=<signal>

The command above uses the --kill-who option to select which processes from the control group to terminate: the main process of the unit, its control process, or all of its processes. The --signal option determines the type of POSIX signal to be sent to the selected processes. The default signal is SIGTERM.
Additional resources
- Understanding control groups
- What are kernel resource controllers
- systemd.resource-control(5), cgroups(7) manual pages
- Role of systemd in control groups
- Configuring basic system settings in RHEL
Chapter 20. Understanding cgroups
You can use the control groups (cgroups) kernel functionality to set limits on, prioritize, or isolate the hardware resources of processes. This gives you granular control over the resource usage of applications, so that resources are utilized more efficiently.
20.1. Understanding control groups
Control groups is a Linux kernel feature that enables you to organize processes into hierarchically ordered groups - cgroups. The hierarchy (control groups tree) is defined by providing structure to the cgroups virtual file system, mounted by default on the /sys/fs/cgroup/ directory. The systemd system and service manager utilizes cgroups to organize all units and services that it governs. Alternatively, you can manage cgroups hierarchies manually by creating and removing sub-directories in the /sys/fs/cgroup/ directory.
The resource controllers (a kernel component) then modify the behavior of processes in cgroups by limiting, prioritizing, or allocating system resources (such as CPU time, memory, network bandwidth, or various combinations) to those processes.
The added value of cgroups is process aggregation, which enables the division of hardware resources among applications and users, and thereby increases the overall efficiency, stability, and security of the users' environment.
- Control groups version 1
Control groups version 1 (cgroups-v1) provide a per-resource controller hierarchy. This means that each resource, such as CPU, memory, or I/O, has its own control group hierarchy. It is possible to combine different control group hierarchies in a way that one controller can coordinate with another in managing their respective resources. However, when the two controllers belong to different process hierarchies, proper coordination between them is not possible.
The cgroups-v1 controllers were developed across a large time span and, as a result, the behavior and naming of their control files is not uniform.
- Control groups version 2
The problems with controller coordination, which stemmed from hierarchy flexibility, led to the development of control groups version 2.
Control groups version 2 (cgroups-v2) provides a single control group hierarchy against which all resource controllers are mounted. The control file behavior and naming is consistent among different controllers.
RHEL 9 mounts and utilizes cgroups-v2 by default.
This sub-section was based on a Devconf.cz 2019 presentation.[2]
Additional resources
- What are kernel resource controllers
- cgroups(7) manual page
- cgroups-v1
- cgroups-v2
20.2. What are kernel resource controllers
The functionality of control groups is enabled by kernel resource controllers. RHEL 9 supports various controllers for control groups version 1 (cgroups-v1) and control groups version 2 (cgroups-v2).
A resource controller, also called a control group subsystem, is a kernel subsystem that represents a single resource, such as CPU time, memory, network bandwidth, or disk I/O. The Linux kernel provides a range of resource controllers that are mounted automatically by the systemd system and service manager. You can find a list of the currently mounted resource controllers in the /proc/cgroups file.
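For example, /proc/cgroups can be inspected directly. The awk filter below is an illustrative addition that prints only the controllers the kernel reports as enabled:

```shell
# Each line of /proc/cgroups names a v1 controller together with its
# hierarchy ID, the number of cgroups in that hierarchy, and an enabled flag:
cat /proc/cgroups

# Print only the controllers reported as enabled (fourth column equal to 1):
awk '$4 == 1 { print $1 }' /proc/cgroups
```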
The following controllers are available for cgroups-v1:
- blkio - can set limits on input/output access to and from block devices.
- cpu - can adjust the parameters of the Completely Fair Scheduler (CFS) for the tasks of a control group. It is mounted together with the cpuacct controller on the same mount.
- cpuacct - creates automatic reports on the CPU resources used by tasks in a control group. It is mounted together with the cpu controller on the same mount.
- cpuset - can be used to restrict control group tasks to run only on a specified subset of CPUs and to direct the tasks to use memory only on specified memory nodes.
- devices - can control access to devices for tasks in a control group.
- freezer - can be used to suspend or resume tasks in a control group.
- memory - can be used to set limits on memory use by tasks in a control group and generates automatic reports on the memory resources used by those tasks.
- net_cls - tags network packets with a class identifier (classid) that enables the Linux traffic controller (the tc command) to identify packets that originate from a particular control group task. A subsystem of net_cls, the net_filter (iptables), can also use this tag to perform actions on such packets. The net_filter tags network sockets with a firewall identifier (fwid) that allows the Linux firewall (through the iptables command) to identify packets originating from a particular control group task.
- net_prio - sets the priority of network traffic.
- pids - can set limits on the number of processes and their children in a control group.
- perf_event - can group tasks for monitoring by the perf performance monitoring and reporting utility.
- rdma - can set limits on Remote Direct Memory Access/InfiniBand-specific resources in a control group.
- hugetlb - can be used to limit the usage of large size virtual memory pages by tasks in a control group.
The following controllers are available for cgroups-v2:
- io - a follow-up to blkio of cgroups-v1.
- memory - a follow-up to memory of cgroups-v1.
- pids - same as pids in cgroups-v1.
- rdma - same as rdma in cgroups-v1.
- cpu - a follow-up to the cpu and cpuacct controllers of cgroups-v1.
- cpuset - supports only the core functionality (cpus{,.effective}, mems{,.effective}) with a new partition feature.
- perf_event - support is inherent, with no explicit control file. You can specify a v2 cgroup as a parameter to the perf command, which will profile all the tasks within that cgroup.
A resource controller can be used either in a cgroups-v1 hierarchy or in a cgroups-v2 hierarchy, but not simultaneously in both.
Additional resources
- cgroups(7) manual page
- Documentation in the /usr/share/doc/kernel-doc-<kernel_version>/Documentation/cgroups-v1/ directory (after installing the kernel-doc package)
20.3. What are namespaces
Namespaces are one of the most important methods for organizing and identifying software objects.
A namespace wraps a global system resource (for example a mount point, a network device, or a hostname) in an abstraction that makes it appear to processes within the namespace that they have their own isolated instance of the global resource. One of the most common technologies that utilize namespaces are containers.
Changes to a particular global resource are visible only to processes in that namespace and do not affect the rest of the system or other namespaces.
To inspect which namespaces a process is a member of, you can check the symbolic links in the /proc/<PID>/ns/ directory.
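For example, each entry in /proc/<PID>/ns/ is a symbolic link whose target encodes the namespace type and its inode number:

```shell
# List all namespaces of the current shell process:
ls -l /proc/self/ns/

# The link target has the form <type>:[<inode>], for example pid:[4026531836]:
readlink /proc/self/ns/pid
```

Two processes are members of the same namespace exactly when the corresponding links resolve to the same inode number.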
The following table shows supported namespaces and the resources they isolate:

Namespace | Isolates |
---|---|
Mount | Mount points |
UTS | Hostname and NIS domain name |
IPC | System V IPC, POSIX message queues |
PID | Process IDs |
Network | Network devices, stacks, ports, and so on |
User | User and group IDs |
Control groups | Control group root directory |
Additional resources
- namespaces(7) and cgroup_namespaces(7) manual pages
- Understanding control groups
Chapter 21. Improving system performance with zswap
You can improve system performance by enabling the zswap kernel feature.
21.1. What is zswap
This section explains what zswap is and how it can improve system performance.
zswap is a kernel feature that provides a compressed RAM cache for swap pages. The mechanism works as follows: zswap takes pages that are in the process of being swapped out and attempts to compress them into a dynamically allocated RAM-based memory pool. When the pool becomes full or the RAM is exhausted, zswap evicts pages from the compressed cache to the backing swap device on a least recently used (LRU) basis. After the page has been decompressed into the swap cache, zswap frees the compressed version in the pool.
The benefits of zswap:
- Significant I/O reduction
- Significant improvement of workload performance
In Red Hat Enterprise Linux 9, zswap is enabled by default.
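Because it is on by default, you can confirm the current state before changing anything; the module parameter reads Y when zswap is active:

```shell
# Check whether zswap is currently enabled (Y) or disabled (N):
cat /sys/module/zswap/parameters/enabled

# Optionally, also inspect the compressor and pool allocator in use:
grep -r . /sys/module/zswap/parameters/
```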
Additional resources
21.2. Enabling zswap at runtime
You can enable the zswap
feature at system runtime using the sysfs
interface.
Prerequisites
- You have root permissions.
Procedure
Enable zswap:

# echo 1 > /sys/module/zswap/parameters/enabled
Verification step
Verify that zswap is enabled:

# grep -r . /sys/kernel/debug/zswap
duplicate_entry:0
pool_limit_hit:13422200
pool_total_size:6184960 (pool size in total in pages)
reject_alloc_fail:5
reject_compress_poor:0
reject_kmemcache_fail:0
reject_reclaim_fail:13422200
stored_pages:4251 (pool size after compression)
written_back_pages:0
Additional resources
21.3. Enabling zswap permanently
You can enable the zswap feature permanently by providing the zswap.enabled=1 kernel command-line parameter.
Prerequisites
- You have root permissions.
- The grubby or zipl utility is installed on your system.
Procedure
Enable zswap permanently:

# grubby --update-kernel=/boot/vmlinuz-$(uname -r) --args="zswap.enabled=1"
- Reboot the system for the changes to take effect.
Verification steps
Verify that zswap is enabled:

# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-70.5.1.el9_0.x86_64 root=/dev/mapper/rhel-root ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=/dev/mapper/rhel-swap rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet zswap.enabled=1
Additional resources
Chapter 22. Using cgroupfs to manually manage cgroups
You can manage cgroup hierarchies on your system by creating directories on the cgroupfs virtual file system. The file system is mounted by default on the /sys/fs/cgroup/ directory and you can specify the desired configurations in dedicated control files.
In general, Red Hat recommends using systemd for controlling the usage of system resources. You should manually configure the cgroups virtual file system only in special cases, for example, when you need to use cgroup-v1 controllers that have no equivalents in the cgroup-v2 hierarchy.
22.1. Creating cgroups and enabling controllers in cgroups-v2 file system
You can manage control groups (cgroups) by creating or removing directories and by writing to files in the cgroups virtual file system. The file system is mounted by default on the /sys/fs/cgroup/ directory. To use the settings of the cgroups controllers, you also need to enable the desired controllers for child cgroups. The root cgroup has, by default, the memory and pids controllers enabled for its child cgroups. Therefore, Red Hat recommends creating at least two levels of child cgroups inside the /sys/fs/cgroup/ root cgroup. This way, you can optionally remove the memory and pids controllers from the child cgroups and maintain better organizational clarity of cgroup files.
Prerequisites
- You have root permissions.
Procedure
Create the /sys/fs/cgroup/Example/ directory:

# mkdir /sys/fs/cgroup/Example/

The /sys/fs/cgroup/Example/ directory defines a child group. When you create the /sys/fs/cgroup/Example/ directory, some cgroups-v2 interface files are automatically created in the directory. The /sys/fs/cgroup/Example/ directory also contains controller-specific files for the memory and pids controllers.

Optionally, inspect the newly created child control group:
# ll /sys/fs/cgroup/Example/
-r--r--r--. 1 root root 0 Jun 1 10:33 cgroup.controllers
-r--r--r--. 1 root root 0 Jun 1 10:33 cgroup.events
-rw-r--r--. 1 root root 0 Jun 1 10:33 cgroup.freeze
-rw-r--r--. 1 root root 0 Jun 1 10:33 cgroup.procs
…
-rw-r--r--. 1 root root 0 Jun 1 10:33 cgroup.subtree_control
-r--r--r--. 1 root root 0 Jun 1 10:33 memory.events.local
-rw-r--r--. 1 root root 0 Jun 1 10:33 memory.high
-rw-r--r--. 1 root root 0 Jun 1 10:33 memory.low
…
-r--r--r--. 1 root root 0 Jun 1 10:33 pids.current
-r--r--r--. 1 root root 0 Jun 1 10:33 pids.events
-rw-r--r--. 1 root root 0 Jun 1 10:33 pids.max
The example output shows general cgroup control interface files, such as cgroup.procs or cgroup.controllers. These files are common to all control groups, regardless of enabled controllers.

Files such as memory.high and pids.max relate to the memory and pids controllers, which are enabled in the root control group (/sys/fs/cgroup/) by default by systemd.

By default, the newly created child group inherits all settings from the parent cgroup. In this case, there are no limits from the root cgroup.

Verify that the desired controllers are available in the /sys/fs/cgroup/cgroup.controllers file:

# cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma
Enable the desired controllers. In this example, these are the cpu and cpuset controllers:

# echo "+cpu" >> /sys/fs/cgroup/cgroup.subtree_control
# echo "+cpuset" >> /sys/fs/cgroup/cgroup.subtree_control

These commands enable the cpu and cpuset controllers for the immediate child groups of the /sys/fs/cgroup/ root control group, including the newly created Example control group. A child group is where you can specify processes and apply control checks to each of the processes based on your criteria.

Users can read the contents of the cgroup.subtree_control file at any level to learn which controllers are going to be available for enablement in the immediate child groups.

Note: By default, the /sys/fs/cgroup/cgroup.subtree_control file in the root control group contains the memory and pids controllers.

Enable the desired controllers for child cgroups of the Example control group:

# echo "+cpu +cpuset" >> /sys/fs/cgroup/Example/cgroup.subtree_control

This command ensures that the immediate child control groups will have only the controllers relevant to regulating the distribution of CPU time, not the memory or pids controllers.

Create the /sys/fs/cgroup/Example/tasks/ directory:

# mkdir /sys/fs/cgroup/Example/tasks/

The /sys/fs/cgroup/Example/tasks/ directory defines a child group with files that relate purely to the cpu and cpuset controllers. You can now assign processes to this control group and utilize cpu and cpuset controller options for your processes.

Optionally, inspect the child control group:
# ll /sys/fs/cgroup/Example/tasks
-r--r--r--. 1 root root 0 Jun 1 11:45 cgroup.controllers
-r--r--r--. 1 root root 0 Jun 1 11:45 cgroup.events
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.freeze
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.max.depth
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.max.descendants
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.procs
-r--r--r--. 1 root root 0 Jun 1 11:45 cgroup.stat
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.subtree_control
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.threads
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.type
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpu.max
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpu.pressure
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpuset.cpus
-r--r--r--. 1 root root 0 Jun 1 11:45 cpuset.cpus.effective
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpuset.cpus.partition
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpuset.mems
-r--r--r--. 1 root root 0 Jun 1 11:45 cpuset.mems.effective
-r--r--r--. 1 root root 0 Jun 1 11:45 cpu.stat
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpu.weight
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpu.weight.nice
-rw-r--r--. 1 root root 0 Jun 1 11:45 io.pressure
-rw-r--r--. 1 root root 0 Jun 1 11:45 memory.pressure
The cpu controller is activated only if the relevant child control group has at least two processes that compete for time on a single CPU.
Verification steps
Optional: confirm that you have created a new cgroup with only the desired controllers active:

# cat /sys/fs/cgroup/Example/tasks/cgroup.controllers
cpuset cpu
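To make use of the new child group, a hypothetical continuation could cap it and move a process in. The PID (2928) and the limit values below are illustrative:

```shell
# Allow the group 0.2 s of CPU time per 1 s period; cpu.max takes the
# quota and the period in microseconds, separated by a space:
echo "200000 1000000" > /sys/fs/cgroup/Example/tasks/cpu.max

# Move an existing process (illustrative PID) into the group:
echo "2928" > /sys/fs/cgroup/Example/tasks/cgroup.procs

# The process's cgroup membership should now read 0::/Example/tasks:
cat /proc/2928/cgroup
```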
Additional resources
- Understanding control groups
- What are kernel resource controllers
- Mounting cgroups-v1
- cgroups(7), sysfs(5) manual pages
22.2. Controlling distribution of CPU time for applications by adjusting CPU weight
You need to assign values to the relevant files of the cpu controller to regulate the distribution of CPU time to applications under a specific cgroup tree.
Prerequisites
- You have root permissions.
- You have applications for which you want to control distribution of CPU time.
You created a two-level hierarchy of child control groups inside the /sys/fs/cgroup/ root control group, as in the following example:

…
├── Example
│   ├── g1
│   ├── g2
│   └── g3
…

- You enabled the cpu controller in the parent control group and in the child control groups, similarly as described in Creating cgroups and enabling controllers in cgroups-v2 file system.
Procedure
Configure the desired CPU weights to achieve resource restrictions within the control groups:

# echo "150" > /sys/fs/cgroup/Example/g1/cpu.weight
# echo "100" > /sys/fs/cgroup/Example/g2/cpu.weight
# echo "50" > /sys/fs/cgroup/Example/g3/cpu.weight
Add the applications' PIDs to the g1, g2, and g3 child groups:

# echo "33373" > /sys/fs/cgroup/Example/g1/cgroup.procs
# echo "33374" > /sys/fs/cgroup/Example/g2/cgroup.procs
# echo "33377" > /sys/fs/cgroup/Example/g3/cgroup.procs
The example commands ensure that the desired applications become members of the Example/g*/ child cgroups and get their CPU time distributed as configured for those cgroups.

The weights of the child cgroups (g1, g2, g3) that have running processes are summed up at the level of the parent cgroup (Example). The CPU resource is then distributed proportionally based on the respective weights.

As a result, when all processes run at the same time, the kernel allocates to each of them the proportionate CPU time based on their respective cgroup's cpu.weight file:

Child cgroup | cpu.weight file | CPU time allocation |
---|---|---|
g1 | 150 | ~50% (150/300) |
g2 | 100 | ~33% (100/300) |
g3 | 50 | ~16% (50/300) |
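The proportional allocation can be reproduced with shell arithmetic: the weights are summed and each child receives weight/total of the CPU (integer division, hence the approximate values):

```shell
# Sum the weights of the children that have running processes:
total=$((150 + 100 + 50))

# Print each child's approximate share of CPU time:
for w in 150 100 50; do
    echo "cpu.weight=$w => $((w * 100 / total))% of CPU time"
done
```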
The value of the cpu.weight controller file is not a percentage.

If one process stopped running, leaving cgroup g2 with no running processes, the calculation would omit the cgroup g2 and account only for the weights of cgroups g1 and g3:

Child cgroup | cpu.weight file | CPU time allocation |
---|---|---|
g1 | 150 | ~75% (150/200) |
g3 | 50 | ~25% (50/200) |
ImportantIf a child cgroup had multiple running processes, the CPU time allocated to the respective cgroup would be distributed equally to the member processes of that cgroup.
Verification
Verify that the applications run in the specified control groups:

# cat /proc/33373/cgroup /proc/33374/cgroup /proc/33377/cgroup
0::/Example/g1
0::/Example/g2
0::/Example/g3
The command output shows that the processes of the specified applications run in the Example/g*/ child cgroups.

Inspect the current CPU consumption of the throttled applications:
# top
top - 05:17:18 up 1 day, 18:25, 1 user, load average: 3.03, 3.03, 3.00
Tasks: 95 total, 4 running, 91 sleeping, 0 stopped, 0 zombie
%Cpu(s): 18.1 us, 81.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.3 hi, 0.0 si, 0.0 st
MiB Mem : 3737.0 total, 3233.7 free, 132.8 used, 370.5 buff/cache
MiB Swap: 4060.0 total, 4060.0 free, 0.0 used. 3373.1 avail Mem

  PID USER  PR NI   VIRT   RES   SHR S %CPU %MEM     TIME+ COMMAND
33373 root  20  0  18720  1748  1460 R 49.5  0.0 415:05.87 sha1sum
33374 root  20  0  18720  1756  1464 R 32.9  0.0 412:58.33 sha1sum
33377 root  20  0  18720  1860  1568 R 16.3  0.0 411:03.12 sha1sum
  760 root  20  0 416620 28540 15296 S  0.3  0.7   0:10.23 tuned
    1 root  20  0 186328 14108  9484 S  0.0  0.4   0:02.00 systemd
    2 root  20  0      0     0     0 S  0.0  0.0   0:00.01 kthread
...
NoteWe forced all the example processes to run on a single CPU for clearer illustration. The CPU weight applies the same principles also when used on multiple CPUs.
Notice that the CPU resources for PID 33373, PID 33374, and PID 33377 were allocated based on the weights of 150, 100, and 50 that you assigned to the respective child cgroups. The weights correspond to around 50%, 33%, and 16% of CPU time for each application.
Additional resources
22.3. Mounting cgroups-v1
During the boot process, RHEL 9 mounts the cgroup-v2 virtual filesystem by default. To utilize the cgroup-v1 functionality in limiting resources for your applications, manually configure the system.
Both cgroup-v1 and cgroup-v2 are fully enabled in the kernel. There is no default control group version from the kernel point of view; the version to be mounted at startup is decided by systemd.
Prerequisites
- You have root permissions.
Procedure
Configure the system to mount cgroups-v1 by default during system boot by the systemd system and service manager:

# grubby --update-kernel=/boot/vmlinuz-$(uname -r) --args="systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller"

This adds the necessary kernel command-line parameters to the current boot entry.
To add the same parameters to all kernel boot entries:
# grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller"
- Reboot the system for the changes to take effect.
Verification
Optionally, verify that the cgroups-v1 filesystems were mounted:

# mount -l | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,seclabel,size=4096k,nr_inodes=1024,mode=755,inode64)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,perf_event)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,cpu,cpuacct)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,pids)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,cpuset)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,net_cls,net_prio)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,hugetlb)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,memory)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,blkio)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,devices)
cgroup on /sys/fs/cgroup/misc type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,misc)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,freezer)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,rdma)
The cgroups-v1 filesystems that correspond to the various cgroup-v1 controllers were successfully mounted on the /sys/fs/cgroup/ directory.
directory.Optionally, inspect the contents of the
/sys/fs/cgroup/
directory:# ll /sys/fs/cgroup/ dr-xr-xr-x. 10 root root 0 Mar 16 09:34 blkio lrwxrwxrwx. 1 root root 11 Mar 16 09:34 cpu → cpu,cpuacct lrwxrwxrwx. 1 root root 11 Mar 16 09:34 cpuacct → cpu,cpuacct dr-xr-xr-x. 10 root root 0 Mar 16 09:34 cpu,cpuacct dr-xr-xr-x. 2 root root 0 Mar 16 09:34 cpuset dr-xr-xr-x. 10 root root 0 Mar 16 09:34 devices dr-xr-xr-x. 2 root root 0 Mar 16 09:34 freezer dr-xr-xr-x. 2 root root 0 Mar 16 09:34 hugetlb dr-xr-xr-x. 10 root root 0 Mar 16 09:34 memory dr-xr-xr-x. 2 root root 0 Mar 16 09:34 misc lrwxrwxrwx. 1 root root 16 Mar 16 09:34 net_cls → net_cls,net_prio dr-xr-xr-x. 2 root root 0 Mar 16 09:34 net_cls,net_prio lrwxrwxrwx. 1 root root 16 Mar 16 09:34 net_prio → net_cls,net_prio dr-xr-xr-x. 2 root root 0 Mar 16 09:34 perf_event dr-xr-xr-x. 10 root root 0 Mar 16 09:34 pids dr-xr-xr-x. 2 root root 0 Mar 16 09:34 rdma dr-xr-xr-x. 11 root root 0 Mar 16 09:34 systemd
The /sys/fs/cgroup/ directory, also called the root control group, by default contains controller-specific directories, such as cpuset. In addition, there are some directories related to systemd.
Additional resources
- Understanding control groups
- What are kernel resource controllers
- cgroups(7), sysfs(5) manual pages
- cgroup-v2 enabled by default in RHEL 9
22.4. Setting CPU limits to applications using cgroups-v1
Sometimes an application consumes a lot of CPU time, which may negatively impact the overall health of your environment. Use the /sys/fs/ virtual file system to configure CPU limits for an application using control groups version 1 (cgroups-v1).
Prerequisites
- You have root permissions.
- You have an application whose CPU consumption you want to restrict.
You configured the system to mount cgroups-v1 by default during system boot by the systemd system and service manager:

# grubby --update-kernel=/boot/vmlinuz-$(uname -r) --args="systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller"

This adds the necessary kernel command-line parameters to the current boot entry.
Procedure
Identify the process ID (PID) of the application whose CPU consumption you want to restrict:

# top
top - 11:34:09 up 11 min, 1 user, load average: 0.51, 0.27, 0.22
Tasks: 267 total, 3 running, 264 sleeping, 0 stopped, 0 zombie
%Cpu(s): 49.0 us, 3.3 sy, 0.0 ni, 47.5 id, 0.0 wa, 0.2 hi, 0.0 si, 0.0 st
MiB Mem : 1826.8 total, 303.4 free, 1046.8 used, 476.5 buff/cache
MiB Swap: 1536.0 total, 1396.0 free, 140.0 used. 616.4 avail Mem

  PID USER  PR NI    VIRT    RES   SHR S %CPU %MEM   TIME+ COMMAND
 6955 root  20  0  228440   1752  1472 R 99.3  0.1 0:32.71 sha1sum
 5760 jdoe  20  0 3603868 205188 64196 S  3.7 11.0 0:17.19 gnome-shell
 6448 jdoe  20  0  743648  30640 19488 S  0.7  1.6 0:02.73 gnome-terminal-
    1 root  20  0  245300   6568  4116 S  0.3  0.4 0:01.87 systemd
  505 root  20  0       0      0     0 I  0.3  0.0 0:00.75 kworker/u4:4-events_unbound
...
The example output of the top program reveals that PID 6955 (the illustrative application sha1sum) consumes a lot of CPU resources.

Create a sub-directory in the cpu resource controller directory:

# mkdir /sys/fs/cgroup/cpu/Example/
The directory above represents a control group, where you can place specific processes and apply certain CPU limits to those processes. At the same time, some cgroups-v1 interface files and cpu controller-specific files will be created in the directory.
# ll /sys/fs/cgroup/cpu/Example/
-rw-r--r--. 1 root root 0 Mar 11 11:42 cgroup.clone_children
-rw-r--r--. 1 root root 0 Mar 11 11:42 cgroup.procs
-r--r--r--. 1 root root 0 Mar 11 11:42 cpuacct.stat
-rw-r--r--. 1 root root 0 Mar 11 11:42 cpuacct.usage
-r--r--r--. 1 root root 0 Mar 11 11:42 cpuacct.usage_all
-r--r--r--. 1 root root 0 Mar 11 11:42 cpuacct.usage_percpu
-r--r--r--. 1 root root 0 Mar 11 11:42 cpuacct.usage_percpu_sys
-r--r--r--. 1 root root 0 Mar 11 11:42 cpuacct.usage_percpu_user
-r--r--r--. 1 root root 0 Mar 11 11:42 cpuacct.usage_sys
-r--r--r--. 1 root root 0 Mar 11 11:42 cpuacct.usage_user
-rw-r--r--. 1 root root 0 Mar 11 11:42 cpu.cfs_period_us
-rw-r--r--. 1 root root 0 Mar 11 11:42 cpu.cfs_quota_us
-rw-r--r--. 1 root root 0 Mar 11 11:42 cpu.rt_period_us
-rw-r--r--. 1 root root 0 Mar 11 11:42 cpu.rt_runtime_us
-rw-r--r--. 1 root root 0 Mar 11 11:42 cpu.shares
-r--r--r--. 1 root root 0 Mar 11 11:42 cpu.stat
-rw-r--r--. 1 root root 0 Mar 11 11:42 notify_on_release
-rw-r--r--. 1 root root 0 Mar 11 11:42 tasks
The example output shows files, such as cpuacct.usage and cpu.cfs_period_us, that represent specific configurations and limits that can be set for processes in the Example control group. Notice that the respective file names are prefixed with the name of the control group controller to which they belong.
Configure CPU limits for the control group:
# echo "1000000" > /sys/fs/cgroup/cpu/Example/cpu.cfs_period_us # echo "200000" > /sys/fs/cgroup/cpu/Example/cpu.cfs_quota_us
The cpu.cfs_period_us file represents a period of time, in microseconds (µs, represented here as "us"), for how frequently a control group's access to CPU resources should be reallocated. The upper limit is 1 second and the lower limit is 1000 microseconds.

The cpu.cfs_quota_us file represents the total amount of time in microseconds for which all processes collectively in a control group can run during one period (as defined by cpu.cfs_period_us). As soon as the processes in a control group use up all the time specified by the quota during a single period, they are throttled for the remainder of the period and are not allowed to run until the next period. The lower limit is 1000 microseconds.

The example commands above set the CPU time limits so that all processes collectively in the Example control group can run only for 0.2 seconds (defined by cpu.cfs_quota_us) out of every 1 second (defined by cpu.cfs_period_us).
# cat /sys/fs/cgroup/cpu/Example/cpu.cfs_period_us /sys/fs/cgroup/cpu/Example/cpu.cfs_quota_us
1000000
200000
Add the application's PID to the Example control group:

# echo "6955" > /sys/fs/cgroup/cpu/Example/cgroup.procs

or

# echo "6955" > /sys/fs/cgroup/cpu/Example/tasks
The previous command ensures that the desired application becomes a member of the Example control group and therefore does not exceed the CPU limits configured for that group. The PID should represent an existing process in the system. PID 6955 here was assigned to the process sha1sum /dev/zero &, used to illustrate the use case of the cpu controller.

Verify that the application runs in the specified control group:
# cat /proc/6955/cgroup
12:cpuset:/
11:hugetlb:/
10:net_cls,net_prio:/
9:memory:/user.slice/user-1000.slice/user@1000.service
8:devices:/user.slice
7:blkio:/
6:freezer:/
5:rdma:/
4:pids:/user.slice/user-1000.slice/user@1000.service
3:perf_event:/
2:cpu,cpuacct:/Example
1:name=systemd:/user.slice/user-1000.slice/user@1000.service/gnome-terminal-server.service
The example output above shows that the process of the desired application runs in the Example control group, which applies CPU limits to the application's process.

Identify the current CPU consumption of your throttled application:
# top top - 12:28:42 up 1:06, 1 user, load average: 1.02, 1.02, 1.00 Tasks: 266 total, 6 running, 260 sleeping, 0 stopped, 0 zombie %Cpu(s): 11.0 us, 1.2 sy, 0.0 ni, 87.5 id, 0.0 wa, 0.2 hi, 0.0 si, 0.2 st MiB Mem : 1826.8 total, 287.1 free, 1054.4 used, 485.3 buff/cache MiB Swap: 1536.0 total, 1396.7 free, 139.2 used. 608.3 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 6955 root 20 0 228440 1752 1472 R 20.6 0.1 47:11.43 sha1sum 5760 jdoe 20 0 3604956 208832 65316 R 2.3 11.2 0:43.50 gnome-shell 6448 jdoe 20 0 743836 31736 19488 S 0.7 1.7 0:08.25 gnome-terminal- 505 root 20 0 0 0 0 I 0.3 0.0 0:03.39 kworker/u4:4-events_unbound 4217 root 20 0 74192 1612 1320 S 0.3 0.1 0:01.19 spice-vdagentd ...
Notice that the CPU consumption of PID 6955 has decreased from 99% to 20%.
The cgroups-v2 counterpart for cpu.cfs_period_us and cpu.cfs_quota_us is the cpu.max file, which is available through the cpu controller.
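On a system using cgroups-v2, the same limit can be expressed in one step. This is a hedged sketch, assuming a v2 hierarchy mounted at /sys/fs/cgroup and an existing Example group with the cpu controller enabled:

```shell
# cgroups-v2 sketch: a single "$QUOTA $PERIOD" pair in cpu.max replaces the
# two separate cpu.cfs_quota_us and cpu.cfs_period_us files of cgroups-v1.
echo "200000 1000000" > /sys/fs/cgroup/Example/cpu.max
```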
Additional resources
- Understanding control groups
- What kernel resource controllers are
-
cgroups(7)
,sysfs(5)
manual pages
Chapter 23. Analyzing system performance with BPF Compiler Collection
As a system administrator, you can use the BPF Compiler Collection (BCC) library to create tools for analyzing the performance of your Linux operating system and gathering information, which could be difficult to obtain through other interfaces.
23.1. An introduction to BCC
BPF Compiler Collection (BCC) is a library that facilitates the creation of extended Berkeley Packet Filter (eBPF) programs. The main utility of eBPF programs is analyzing operating system and network performance without introducing significant overhead or security risks.
BCC removes the need for users to know deep technical details of eBPF, and provides many out-of-the-box starting points, such as the bcc-tools
package with pre-created eBPF programs.
The eBPF programs are triggered on events, such as disk I/O, TCP connections, and process creations. Because the programs run in a safe virtual machine in the kernel, they are unlikely to cause the kernel to crash, loop, or become unresponsive.
23.2. Installing the bcc-tools package
This section describes how to install the bcc-tools
package, which also installs the BPF Compiler Collection (BCC) library as a dependency.
Procedure
Install bcc-tools:

# dnf install bcc-tools
The BCC tools are installed in the /usr/share/bcc/tools/ directory.

Optionally, inspect the tools:

# ll /usr/share/bcc/tools/
... -rwxr-xr-x. 1 root root 4198 Dec 14 17:53 dcsnoop -rwxr-xr-x. 1 root root 3931 Dec 14 17:53 dcstat -rwxr-xr-x. 1 root root 20040 Dec 14 17:53 deadlock_detector -rw-r--r--. 1 root root 7105 Dec 14 17:53 deadlock_detector.c drwxr-xr-x. 3 root root 8192 Mar 11 10:28 doc -rwxr-xr-x. 1 root root 7588 Dec 14 17:53 execsnoop -rwxr-xr-x. 1 root root 6373 Dec 14 17:53 ext4dist -rwxr-xr-x. 1 root root 10401 Dec 14 17:53 ext4slower ...

The doc directory in the listing above contains documentation for each tool.
23.3. Using selected bcc-tools for performance analyses
This section describes how to use certain pre-created programs from the BPF Compiler Collection (BCC) library to efficiently and securely analyze system performance on a per-event basis. The set of pre-created programs in the BCC library can serve as examples for the creation of additional programs.
Prerequisites
- Installed bcc-tools package
- Root permissions
Using execsnoop to examine the system processes
Execute the execsnoop program in one terminal:

# /usr/share/bcc/tools/execsnoop
In another terminal, execute, for example:
$ ls /usr/share/bcc/tools/doc/
The above creates a short-lived process of the ls command.

The terminal running execsnoop shows output similar to the following:

PCOMM PID PPID RET ARGS ls 8382 8287 0 /usr/bin/ls --color=auto /usr/share/bcc/tools/doc/ ...
The execsnoop program prints a line of output for each new process that consumes system resources. It even detects processes of programs that run very briefly, such as ls, which most monitoring tools would not register.

The execsnoop output displays the following fields:
- PCOMM - The parent process name. (ls)
- PID - The process ID. (8382)
- PPID - The parent process ID. (8287)
- RET - The return value of the exec() system call (0), which loads program code into new processes.
- ARGS - The location of the started program with arguments.
To see more details, examples, and options for execsnoop, refer to the /usr/share/bcc/tools/doc/execsnoop_example.txt file.
For more information about exec(), see the exec(3) manual page.
Using opensnoop to track what files a command opens
Execute the opensnoop program in one terminal:

# /usr/share/bcc/tools/opensnoop -n uname
The above prints output only for files that are opened by the process of the uname command.

In another terminal, execute:
$ uname
The command above opens certain files, which are captured in the next step.
The terminal running opensnoop shows output similar to the following:

PID COMM FD ERR PATH 8596 uname 3 0 /etc/ld.so.cache 8596 uname 3 0 /lib64/libc.so.6 8596 uname 3 0 /usr/lib/locale/locale-archive ...
The opensnoop program watches the open() system call across the whole system, and prints a line of output for each file that uname tried to open along the way.

The opensnoop output displays the following fields:
- PID - The process ID. (8596)
- COMM - The process name. (uname)
- FD - The file descriptor; a value that open() returns to refer to the open file. (3)
- ERR - Any errors.
- PATH - The location of files that open() tried to open.

If a command tries to read a non-existent file, the FD column returns -1 and the ERR column prints a value corresponding to the relevant error. As a result, opensnoop can help you identify an application that does not behave properly.
To see more details, examples, and options for opensnoop, refer to the /usr/share/bcc/tools/doc/opensnoop_example.txt file.
For more information about open(), see the open(2) manual page.
Using biotop to examine the I/O operations on the disk
Execute the biotop program in one terminal:

# /usr/share/bcc/tools/biotop 30
The command enables you to monitor the top processes performing I/O operations on the disk. The argument ensures that the command produces a 30-second summary.

Note: When no argument is provided, the output screen refreshes every 1 second by default.

In another terminal, execute, for example:
# dd if=/dev/vda of=/dev/zero
The command above reads content from the local hard disk device and writes the output to the /dev/zero file. This step generates certain I/O traffic to illustrate biotop.

The terminal running biotop shows output similar to the following:

PID COMM D MAJ MIN DISK I/O Kbytes AVGms 9568 dd R 252 0 vda 16294 14440636.0 3.69 48 kswapd0 W 252 0 vda 1763 120696.0 1.65 7571 gnome-shell R 252 0 vda 834 83612.0 0.33 1891 gnome-shell R 252 0 vda 1379 19792.0 0.15 7515 Xorg R 252 0 vda 280 9940.0 0.28 7579 llvmpipe-1 R 252 0 vda 228 6928.0 0.19 9515 gnome-control-c R 252 0 vda 62 6444.0 0.43 8112 gnome-terminal- R 252 0 vda 67 2572.0 1.54 7807 gnome-software R 252 0 vda 31 2336.0 0.73 9578 awk R 252 0 vda 17 2228.0 0.66 7578 llvmpipe-0 R 252 0 vda 156 2204.0 0.07 9581 pgrep R 252 0 vda 58 1748.0 0.42 7531 InputThread R 252 0 vda 30 1200.0 0.48 7504 gdbus R 252 0 vda 3 1164.0 0.30 1983 llvmpipe-1 R 252 0 vda 39 724.0 0.08 1982 llvmpipe-0 R 252 0 vda 36 652.0 0.06 ...
The biotop output displays the following fields:

- PID - The process ID. (9568)
- COMM - The process name. (dd)
- DISK - The disk performing the read operations. (vda)
- I/O - The number of read operations performed. (16294)
- Kbytes - The amount of Kbytes reached by the read operations. (14,440,636)
- AVGms - The average I/O time of read operations, in milliseconds. (3.69)
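As a rough way to interpret these fields, you can derive the average size of a single read request from the I/O and Kbytes columns. This sketch uses the numbers from the dd row in the example output above:

```shell
io_count=16294      # I/O column for the dd process in the example
total_kb=14440636   # Kbytes column for the dd process in the example

# Integer average size of one read request, in KB.
avg_kb=$(( total_kb / io_count ))
echo "~${avg_kb} KB per read"
```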
To see more details, examples, and options for biotop, refer to the /usr/share/bcc/tools/doc/biotop_example.txt file.
For more information about dd, see the dd(1) manual page.
Using xfsslower to expose unexpectedly slow file system operations
Execute the xfsslower program in one terminal:

# /usr/share/bcc/tools/xfsslower 1
The command above measures the time the XFS file system spends performing read, write, open, or sync (fsync) operations. The 1 argument ensures that the program shows only operations slower than 1 ms.

Note: When no arguments are provided, xfsslower by default displays operations slower than 10 ms.

In another terminal, execute, for example, the following:
$ vim text
The command above creates a text file in the vim editor to initiate certain interaction with the XFS file system.

The terminal running xfsslower shows something similar upon saving the file from the previous step:

TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME 13:07:14 b'bash' 4754 R 256 0 7.11 b'vim' 13:07:14 b'vim' 4754 R 832 0 4.03 b'libgpm.so.2.1.0' 13:07:14 b'vim' 4754 R 32 20 1.04 b'libgpm.so.2.1.0' 13:07:14 b'vim' 4754 R 1982 0 2.30 b'vimrc' 13:07:14 b'vim' 4754 R 1393 0 2.52 b'getscriptPlugin.vim' 13:07:45 b'vim' 4754 S 0 0 6.71 b'text' 13:07:45 b'pool' 2588 R 16 0 5.58 b'text' ...
Each line above represents an operation in the file system that took more time than the given threshold. xfsslower is good at exposing possible file system problems, which can take the form of unexpectedly slow operations.

The xfsslower output displays the following fields:
- COMM - The process name. (b'bash')
- T - The operation type. (R)
  - R - Read
  - W - Write
  - S - Sync
- OFF_KB - The file offset in KB. (0)
- FILENAME - The file being read, written, or synced.
To see more details, examples, and options for xfsslower, refer to the /usr/share/bcc/tools/doc/xfsslower_example.txt file.
For more information about fsync(), see the fsync(2) manual page.
Chapter 24. Configuring huge pages
Physical memory is managed in fixed-size chunks called pages. On the x86_64 architecture, supported by Red Hat Enterprise Linux 9, the default size of a memory page is 4 KB
. This default page size has proved to be suitable for general-purpose operating systems, such as Red Hat Enterprise Linux, which supports many different kinds of workloads.
However, specific applications can benefit from using larger page sizes in certain cases. For example, an application that works with a large and relatively fixed data set of hundreds of megabytes or even dozens of gigabytes can have performance issues when using 4 KB pages. Such data sets can require a huge amount of 4 KB pages, which can lead to overhead in the operating system and the CPU.
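To see why large data sets strain 4 KB paging, compare the number of page-table mappings needed to cover the same region with different page sizes. This is a minimal sketch for a hypothetical 2 GB data set (the size is an illustrative assumption, not from the text):

```shell
dataset_kb=$(( 2 * 1024 * 1024 ))   # hypothetical 2 GB data set, in KB

pages_4k=$(( dataset_kb / 4 ))      # mappings needed with 4 KB pages
pages_2m=$(( dataset_kb / 2048 ))   # mappings needed with 2 MB huge pages

echo "4 KB pages needed: $pages_4k"
echo "2 MB pages needed: $pages_2m"
```

The three-orders-of-magnitude difference in mapping count is the source of the overhead mentioned above.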
This section provides information about huge pages available in RHEL 9 and how you can configure them.
24.1. Available huge page features
With Red Hat Enterprise Linux 9, you can use huge pages for applications that work with big data sets, and improve the performance of such applications.
The following are the huge page methods, which are supported in RHEL 9:
HugeTLB pages
HugeTLB pages are also called static huge pages. There are two ways of reserving HugeTLB pages:
- At boot time: It increases the possibility of success because the memory has not yet been significantly fragmented. However, on NUMA machines, the number of pages is automatically split among the NUMA nodes. For more information on parameters that influence HugeTLB page behavior at boot time, see Parameters for reserving HugeTLB pages at boot time. For how to use these parameters to configure HugeTLB pages at boot time, see Configuring HugeTLB at boot time.
- At run time: It allows you to reserve the huge pages per NUMA node. If the run-time reservation is done as early as possible in the boot process, the probability of memory fragmentation is lower. For more information on parameters that influence HugeTLB page behavior at run time, see Parameters for reserving HugeTLB pages at run time. For how to use these parameters to configure HugeTLB pages at run time, see Configuring HugeTLB at run time.
Transparent HugePages (THP)
With THP, the kernel automatically assigns huge pages to processes, and therefore there is no need to manually reserve the static huge pages. The following are the two modes of operation in THP:
- system-wide: Here, the kernel tries to assign huge pages to a process whenever it is possible to allocate huge pages and the process is using a large contiguous virtual memory area.
- per-process: Here, the kernel only assigns huge pages to the memory areas of individual processes, which you can specify using the madvise() system call.

Note: The THP feature only supports 2 MB pages.

For more information on enabling and disabling THP, see Enabling transparent hugepages and Disabling transparent hugepages.
24.2. Parameters for reserving HugeTLB pages at boot time
Use the following parameters to influence HugeTLB page behavior at boot time.
For more information on how to use these parameters to configure HugeTLB pages at boot time, see Configuring HugeTLB at boot time.
Table 24.1. Parameters used to configure HugeTLB pages at boot time

Parameter | Description | Default value
---|---|---
hugepages | Defines the number of persistent huge pages configured in the kernel at boot time. In a NUMA system, huge pages that have this parameter defined are divided equally between nodes. You can assign huge pages to specific nodes at runtime by changing the value of the nodes in the /sys/devices/system/node/node_id/hugepages/hugepages-size/nr_hugepages file. | The default value is 0. To update this value at boot, change the value of this parameter in the /proc/sys/vm/nr_hugepages file.
hugepagesz | Defines the size of persistent huge pages configured in the kernel at boot time. | Valid values are 2 MB and 1 GB. The default value is 2 MB.
default_hugepagesz | Defines the default size of persistent huge pages configured in the kernel at boot time. | Valid values are 2 MB and 1 GB. The default value is 2 MB.
24.3. Configuring HugeTLB at boot time
The page size, which the HugeTLB subsystem supports, depends on the architecture. The x86_64 architecture supports 2 MB huge pages and 1 GB gigantic pages.

This procedure describes how to reserve a 1 GB page at boot time.
Procedure
Create a HugeTLB pool for 1 GB pages by appending the following line to the kernel command-line options in the /etc/default/grub file as root:

default_hugepagesz=1G hugepagesz=1G
Regenerate the GRUB2 configuration using the edited default file:

If your system uses BIOS firmware, execute the following command:
# grub2-mkconfig -o /boot/grub2/grub.cfg
If your system uses UEFI firmware, execute the following command:
# grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
Create a new file called hugetlb-gigantic-pages.service in the /usr/lib/systemd/system/ directory and add the following content:

[Unit]
Description=HugeTLB Gigantic Pages Reservation
DefaultDependencies=no
Before=dev-hugepages.mount
ConditionPathExists=/sys/devices/system/node
ConditionKernelCommandLine=hugepagesz=1G

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/lib/systemd/hugetlb-reserve-pages.sh

[Install]
WantedBy=sysinit.target
Create a new file called hugetlb-reserve-pages.sh in the /usr/lib/systemd/ directory and add the following content. While adding it, replace number_of_pages with the number of 1 GB pages you want to reserve, and node with the name of the node on which to reserve these pages:

#!/bin/sh
nodes_path=/sys/devices/system/node/
if [ ! -d $nodes_path ]; then
    echo "ERROR: $nodes_path does not exist"
    exit 1
fi

reserve_pages()
{
    echo $1 > $nodes_path/$2/hugepages/hugepages-1048576kB/nr_hugepages
}

reserve_pages number_of_pages node
For example, to reserve two 1 GB pages on node0 and one 1 GB page on node1, replace number_of_pages with 2 for node0 and 1 for node1:

reserve_pages 2 node0
reserve_pages 1 node1
Make the script executable:
# chmod +x /usr/lib/systemd/hugetlb-reserve-pages.sh
Enable early boot reservation:
# systemctl enable hugetlb-gigantic-pages
- You can try reserving more 1 GB pages at runtime by writing to nr_hugepages at any time. However, such reservations can fail due to memory fragmentation. The most reliable way to reserve 1 GB pages is by using this hugetlb-reserve-pages.sh script, which runs early during boot.
script, which runs early during boot. - Reserving static huge pages can effectively reduce the amount of memory available to the system, and prevents it from properly utilizing its full memory capacity. Although a properly sized pool of reserved huge pages can be beneficial to applications that utilize it, an oversized or unused pool of reserved huge pages will eventually be detrimental to overall system performance. When setting a reserved huge page pool, ensure that the system can properly utilize its full memory capacity.
Additional resources
- systemd.service(5) man page
- /usr/share/doc/kernel-doc-kernel_version/Documentation/vm/hugetlbpage.txt file
24.4. Parameters for reserving HugeTLB pages at run time
Use the following parameters to influence HugeTLB page behavior at run time.
For more information on how to use these parameters to configure HugeTLB pages at run time, see Configuring HugeTLB at run time.
Table 24.2. Parameters used to configure HugeTLB pages at run time
Parameter | Description | File name
---|---|---
nr_hugepages | Defines the number of huge pages of a specified size assigned to a specified NUMA node. | /sys/devices/system/node/node_id/hugepages/hugepages-size/nr_hugepages
nr_overcommit_hugepages | Defines the maximum number of additional huge pages that can be created and used by the system through overcommitting memory. Writing any non-zero value into this file indicates that the system obtains that number of huge pages from the kernel's normal page pool if the persistent huge page pool is exhausted. As these surplus huge pages become unused, they are then freed and returned to the kernel's normal page pool. | /proc/sys/vm/nr_overcommit_hugepages
24.5. Configuring HugeTLB at run time
This procedure describes how to add 20 huge pages of size 2048 kB to node2.
To reserve pages based on your requirements, replace:
- 20 with the number of huge pages you wish to reserve,
- 2048kB with the size of the huge pages,
- node2 with the node on which you wish to reserve the pages.
Procedure
Display the memory statistics:
# numastat -cm | egrep 'Node|Huge' Node 0 Node 1 Node 2 Node 3 Total AnonHugePages 0 2 0 8 10 HugePages_Total 0 0 0 0 0 HugePages_Free 0 0 0 0 0 HugePages_Surp 0 0 0 0 0
Add the number of huge pages of a specified size to the node:
# echo 20 > /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
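The amount of memory such a reservation removes from the general pool can be estimated up front. This is a minimal sketch using the values from this procedure:

```shell
n_pages=20      # number of huge pages to reserve (this procedure)
page_kb=2048    # huge page size in KB (this procedure)

# Total memory the reservation takes away from the general pool, in MB.
reserved_mb=$(( n_pages * page_kb / 1024 ))
echo "${reserved_mb} MB reserved on node2"
```

The result matches the HugePages_Total value of 40 (MB) shown for Node 2 in the verification output below.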
Verification steps
Verify that the huge pages were added:
# numastat -cm | egrep 'Node|Huge' Node 0 Node 1 Node 2 Node 3 Total AnonHugePages 0 2 0 8 10 HugePages_Total 0 0 40 0 40 HugePages_Free 0 0 40 0 40 HugePages_Surp 0 0 0 0 0
Additional resources
- numastat(8) man page
24.6. Enabling transparent hugepages
THP is enabled by default in Red Hat Enterprise Linux 9. However, you can enable or disable THP.
This procedure describes how to enable THP.
Procedure
Check the current status of THP:
# cat /sys/kernel/mm/transparent_hugepage/enabled
Enable THP:
# echo always > /sys/kernel/mm/transparent_hugepage/enabled
To prevent applications from allocating more memory resources than necessary, disable system-wide transparent huge pages and enable them only for applications that explicitly request them through madvise():

# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
Sometimes, providing low latency to short-lived allocations has higher priority than immediately achieving the best performance with long-lived allocations. In such cases, you can disable direct compaction while leaving THP enabled.
Direct compaction is a synchronous memory compaction during the huge page allocation. Disabling direct compaction provides no guarantee of saving memory, but can decrease the risk of higher latencies during frequent page faults. Note that if the workload benefits significantly from THP, disabling direct compaction can decrease performance. To disable direct compaction:
# echo madvise > /sys/kernel/mm/transparent_hugepage/defrag
Additional resources
- madvise(2) man page
- Disabling transparent hugepages
24.7. Disabling transparent hugepages
THP is enabled by default in Red Hat Enterprise Linux 9. However, you can enable or disable THP.
This procedure describes how to disable THP.
Procedure
Check the current status of THP:
# cat /sys/kernel/mm/transparent_hugepage/enabled
Disable THP:
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
24.8. Impact of page size on translation lookaside buffer size
Reading address mappings from the page table is time-consuming and resource-expensive, so CPUs are built with a cache for recently-used addresses, called the Translation Lookaside Buffer (TLB). However, the default TLB can only cache a certain number of address mappings.
If a requested address mapping is not in the TLB, called a TLB miss, the system still needs to read the page table to determine the physical to virtual address mapping. Because of the relationship between application memory requirements and the size of pages used to cache address mappings, applications with large memory requirements are more likely to suffer performance degradation from TLB misses than applications with minimal memory requirements. It is therefore important to avoid TLB misses wherever possible.
Both HugeTLB and Transparent Huge Page features allow applications to use pages larger than 4 KB. This allows addresses stored in the TLB to reference more memory, which reduces TLB misses and improves application performance.
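The benefit can be quantified as "TLB reach": the amount of memory a fully populated TLB can cover. This is a sketch assuming a hypothetical TLB with 64 data entries; real entry counts vary by CPU model:

```shell
tlb_entries=64   # hypothetical number of TLB entries; real counts vary by CPU

reach_4k_kb=$(( tlb_entries * 4 ))   # reach with 4 KB pages, in KB
reach_2m_mb=$(( tlb_entries * 2 ))   # reach with 2 MB pages, in MB

echo "Reach with 4 KB pages: ${reach_4k_kb} KB"
echo "Reach with 2 MB pages: ${reach_2m_mb} MB"
```

With the same number of entries, 2 MB pages let the TLB cover 512 times more memory, which is why huge pages reduce TLB misses.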
Chapter 25. Getting started with SystemTap
As a system administrator, you can use SystemTap to identify underlying causes of a bug or performance problem on a running Linux system.
As an application developer, you can use SystemTap to monitor in fine detail how your application behaves within the Linux system.
25.1. The purpose of SystemTap
SystemTap is a tracing and probing tool that you can use to study and monitor the activities of your operating system (particularly, the kernel) in fine detail. SystemTap provides information similar to the output of tools such as netstat, ps, top, and iostat. However, SystemTap provides more filtering and analysis options for collected information. In SystemTap scripts, you specify the information that SystemTap gathers.
SystemTap aims to supplement the existing suite of Linux monitoring tools by providing users with the infrastructure to track kernel activity and combining this capability with two attributes:
- Flexibility
- The SystemTap framework enables you to develop simple scripts for investigating and monitoring a wide variety of kernel functions, system calls, and other events that occur in kernel space. With this, SystemTap is not so much a tool as it is a system that allows you to develop your own kernel-specific forensic and monitoring tools.
- Ease-of-Use
- SystemTap enables you to monitor kernel activity without having to recompile the kernel or reboot the system.
25.2. Installing SystemTap
To begin using SystemTap, install the required packages. To use SystemTap on more than one kernel where a system has multiple kernels installed, install the corresponding required kernel packages for each kernel version.
Prerequisites
- You have enabled debug repositories as described in Enabling debug and source repositories.
Procedure
Install the required SystemTap packages:
# dnf install systemtap
Install the required kernel packages:
Using stap-prep:

# stap-prep
If stap-prep does not work, install the required kernel packages manually:

# dnf install kernel-debuginfo-$(uname -r) kernel-debuginfo-common-$(uname -i)-$(uname -r) kernel-devel-$(uname -r)
$(uname -i) is automatically replaced with the hardware platform of your system and $(uname -r) is automatically replaced with the version of your running kernel.
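You can preview what those substitutions expand to before running dnf. This is a sketch; it uses uname -m for the platform because uname -i prints "unknown" on some systems, which is an assumption about your environment:

```shell
kver=$(uname -r)   # version of the running kernel
arch=$(uname -m)   # hardware platform (uname -i may print "unknown" on some systems)

# Print the package names the dnf command above would resolve to.
echo "kernel-debuginfo-${kver}"
echo "kernel-debuginfo-common-${arch}-${kver}"
echo "kernel-devel-${kver}"
```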
Verification steps
If the kernel to be probed with SystemTap is currently in use, test if your installation was successful:
# stap -v -e 'probe kernel.function("vfs_read") {printf("read performed\n"); exit()}'
A successful SystemTap deployment results in an output similar to the following:
Pass 1: parsed user script and 45 library script(s) in 340usr/0sys/358real ms. Pass 2: analyzed script: 1 probe(s), 1 function(s), 0 embed(s), 0 global(s) in 290usr/260sys/568real ms. Pass 3: translated to C into "/tmp/stapiArgLX/stap_e5886fa50499994e6a87aacdc43cd392_399.c" in 490usr/430sys/938real ms. Pass 4: compiled C into "stap_e5886fa50499994e6a87aacdc43cd392_399.ko" in 3310usr/430sys/3714real ms. Pass 5: starting run. 1 read performed 2 Pass 5: run completed in 10usr/40sys/73real ms. 3
The last three lines of output (beginning with Pass 5) indicate that:
25.3. Privileges to run SystemTap
Running SystemTap scripts requires elevated system privileges but, in some instances, non-privileged users might need to run SystemTap instrumentation on their machine.
To allow users to run SystemTap without root access, add users to both of these user groups:
stapdev

Members of this group can use stap to run SystemTap scripts, or staprun to run SystemTap instrumentation modules.

Running stap involves compiling SystemTap scripts into kernel modules and loading them into the kernel. This requires elevated privileges to the system, which are granted to stapdev members. Unfortunately, such privileges also grant effective root access to stapdev members. As such, only grant stapdev group membership to users who can be trusted with root access.

stapusr

Members of this group can only use staprun to run SystemTap instrumentation modules. In addition, they can only run those modules from the /lib/modules/kernel_version/systemtap/ directory. This directory must be owned only by the root user, and must only be writable by the root user.
25.4. Running SystemTap scripts
You can run SystemTap scripts from standard input or from a file.
Sample scripts that are distributed with the installation of SystemTap can be found in the /usr/share/systemtap/examples
directory.
Prerequisites
- SystemTap and the associated required kernel packages are installed as described in Installing Systemtap.
To run SystemTap scripts as a normal user, add the user to the SystemTap groups:
# usermod --append --groups stapdev,stapusr user-name
Procedure
Run the SystemTap script:
From standard input:
# echo "probe timer.s(1) {exit()}" | stap -
This command instructs stap to run the script passed by echo to standard input. To add stap options, insert them before the - character. For example, to make the results from this command more verbose, the command is:

# echo "probe timer.s(1) {exit()}" | stap -v -
From a file:
# stap file_name
Chapter 26. Cross-instrumentation of SystemTap
Cross-instrumentation of SystemTap is the process of creating SystemTap instrumentation modules from a SystemTap script on one system to be used on another system that does not have SystemTap fully deployed.
26.1. SystemTap cross-instrumentation
When you run a SystemTap script, a kernel module is built out of that script. SystemTap then loads the module into the kernel.
Normally, SystemTap scripts can run only on systems where SystemTap is deployed. To run SystemTap on ten systems, SystemTap needs to be deployed on all those systems. In some cases, this might be neither feasible nor desired. For example, corporate policy might prohibit you from installing packages that provide compilers or debug information on specific machines, which will prevent the deployment of SystemTap.
To work around this, use cross-instrumentation. Cross-instrumentation is the process of generating SystemTap instrumentation modules from a SystemTap script on one system to be used on another system. This process offers the following benefits:
The kernel information packages for various machines can be installed on a single host machine.
Important: Kernel packaging bugs may prevent the installation. In such cases, the kernel-debuginfo and kernel-devel packages for the host system and target system must match. If a bug occurs, report it at https://bugzilla.redhat.com/.

Each target machine needs only one package to be installed to use the generated SystemTap instrumentation module: systemtap-runtime.

Important: The host system must be the same architecture and running the same distribution of Linux as the target system in order for the built instrumentation module to work.
- instrumentation module
- The kernel module built from a SystemTap script; the SystemTap module is built on the host system, and will be loaded on the target kernel of the target system.
- host system
- The system on which the instrumentation modules (from SystemTap scripts) are compiled, to be loaded on target systems.
- target system
- The system on which the instrumentation module (built from SystemTap scripts) is loaded and run.
- target kernel
- The kernel of the target system. This is the kernel that loads and runs the instrumentation module.
26.2. Initializing cross-instrumentation of SystemTap
Initialize cross-instrumentation of SystemTap to build SystemTap instrumentation modules from a SystemTap script on one system and use them on another system that does not have SystemTap fully deployed.
Prerequisites
- SystemTap is installed on the host system as described in Installing Systemtap.
The systemtap-runtime package is installed on each target system:

# dnf install systemtap-runtime
- Both the host system and target system are the same architecture.
- Both the host system and target system are running the same major version of Red Hat Enterprise Linux (such as Red Hat Enterprise Linux 9).
Kernel packaging bugs may prevent multiple kernel-debuginfo and kernel-devel packages from being installed on one system. In such cases, the minor version for the host system and target system must match. If a bug occurs, report it at https://bugzilla.redhat.com/.
Procedure
Determine the kernel running on each target system:
$ uname -r
Repeat this step for each target system.
- On the host system, install the target kernel and related packages for each target system by the method described in Installing Systemtap.
Build an instrumentation module on the host system, copy it to the target system, and run it on the target system by one of the following methods:
Using remote implementation:
# stap --remote target_system script
This command remotely implements the specified script on the target system. You must ensure an SSH connection can be made to the target system from the host system for this to be successful.
Manually:
Build the instrumentation module on the host system:
# stap -r kernel_version script -m module_name -p 4
Here, kernel_version refers to the version of the target kernel determined in step 1, script refers to the script to be converted into an instrumentation module, and module_name is the desired name of the instrumentation module. The -p 4 option tells SystemTap not to load and run the compiled module.

Once the instrumentation module is compiled, copy it to the target system and load it using the following command:
# staprun module_name.ko
Chapter 27. Monitoring network activity with SystemTap
You can use helpful example SystemTap scripts available in the /usr/share/systemtap/testsuite/systemtap.examples/ directory, upon installing the systemtap-testsuite package, to monitor and investigate the network activity of your system.
27.1. Profiling network activity with SystemTap
You can use the nettop.stp
example SystemTap script to profile network activity. The script tracks which processes are generating network traffic on the system, and provides the following information about each process:
- PID
- The ID of the listed process.
- UID
- User ID. A user ID of 0 refers to the root user.
- DEV
- Which ethernet device the process used to send or receive data (for example, eth0, eth1).
- XMIT_PK
- The number of packets transmitted by the process.
- RECV_PK
- The number of packets received by the process.
- XMIT_KB
- The amount of data sent by the process, in kilobytes.
- RECV_KB
- The amount of data received by the process, in kilobytes.
Prerequisites
- You have installed SystemTap as described in Installing SystemTap.
Procedure
Run the nettop.stp script:

# stap --example nettop.stp

The nettop.stp script provides network profile sampling every 5 seconds.

Output of the nettop.stp script looks similar to the following:

[...] PID UID DEV XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND 0 0 eth0 0 5 0 0 swapper 11178 0 eth0 2 0 0 0 synergyc PID UID DEV XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND 2886 4 eth0 79 0 5 0 cups-polld 11362 0 eth0 0 61 0 5 firefox 0 0 eth0 3 32 0 3 swapper 2886 4 lo 4 4 0 0 cups-polld 11178 0 eth0 3 0 0 0 synergyc PID UID DEV XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND 0 0 eth0 0 6 0 0 swapper 2886 4 lo 2 2 0 0 cups-polld 11178 0 eth0 3 0 0 0 synergyc 3611 0 eth0 0 1 0 0 Xorg PID UID DEV XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND 0 0 eth0 3 42 0 2 swapper 11178 0 eth0 43 1 3 0 synergyc 11362 0 eth0 0 7 0 0 firefox 3897 0 eth0 0 1 0 0 multiload-apple
27.2. Tracing functions called in network socket code with SystemTap
You can use the socket-trace.stp
example SystemTap script to trace functions called from the kernel’s net/socket.c file. This helps you identify, in finer detail, how each process interacts with the network at the kernel level.
Prerequisites
- You have installed SystemTap as described in Installing SystemTap.
Procedure
Run the socket-trace.stp script:
# stap --example socket-trace.stp
A 3-second excerpt of the output of the socket-trace.stp script looks similar to the following:
[...]
0 Xorg(3611): -> sock_poll
3 Xorg(3611): <- sock_poll
0 Xorg(3611): -> sock_poll
3 Xorg(3611): <- sock_poll
0 gnome-terminal(11106): -> sock_poll
5 gnome-terminal(11106): <- sock_poll
0 scim-bridge(3883): -> sock_poll
3 scim-bridge(3883): <- sock_poll
0 scim-bridge(3883): -> sys_socketcall
4 scim-bridge(3883):  -> sys_recv
8 scim-bridge(3883):   -> sys_recvfrom
12 scim-bridge(3883):-> sock_from_file
16 scim-bridge(3883):<- sock_from_file
20 scim-bridge(3883):-> sock_recvmsg
24 scim-bridge(3883):<- sock_recvmsg
28 scim-bridge(3883):   <- sys_recvfrom
31 scim-bridge(3883):  <- sys_recv
35 scim-bridge(3883): <- sys_socketcall
[...]
27.3. Monitoring network packet drops with SystemTap
The network stack in Linux can discard packets for various reasons. Some Linux kernels include a tracepoint, kernel.trace("kfree_skb"), which tracks where packets are discarded.
The dropwatch.stp SystemTap script uses kernel.trace("kfree_skb") to trace packet discards; the script summarizes which locations discard packets over every 5-second interval.
Prerequisites
- You have installed SystemTap as described in Installing SystemTap.
Procedure
Run the dropwatch.stp script:
# stap --example dropwatch.stp
Running the dropwatch.stp script for 15 seconds results in output similar to the following:
Monitoring for dropped packets
51 packets dropped at location 0xffffffff8024cd0f
2 packets dropped at location 0xffffffff8044b472
51 packets dropped at location 0xffffffff8024cd0f
1 packets dropped at location 0xffffffff8044b472
97 packets dropped at location 0xffffffff8024cd0f
1 packets dropped at location 0xffffffff8044b472
Stopping dropped packet monitor
Note: To make the location of packet drops more meaningful, see the /boot/System.map-$(uname -r) file. This file lists the starting addresses for each function, enabling you to map the addresses in the output of the dropwatch.stp script to a specific function name. Given the following snippet of the /boot/System.map-$(uname -r) file, the address 0xffffffff8024cd0f maps to the function unix_stream_recvmsg and the address 0xffffffff8044b472 maps to the function arp_rcv:
[...]
ffffffff8024c5cd T unlock_new_inode
ffffffff8024c5da t unix_stream_sendmsg
ffffffff8024c920 t unix_stream_recvmsg
ffffffff8024cea1 t udp_v4_lookup_longway
[...]
ffffffff8044addc t arp_process
ffffffff8044b360 t arp_rcv
ffffffff8044b487 t parp_redo
ffffffff8044b48c t arp_solicit
[...]
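Looking up an address in System.map by eye can be tedious; a small awk helper can perform the nearest-symbol search instead. This is a sketch, not part of the dropwatch.stp script: it relies on System.map being sorted by address and on the addresses being fixed-width lowercase hexadecimal, so that string comparison matches numeric order. The excerpt file below reuses the sample lines from the note; in practice, point the helper at /boot/System.map-$(uname -r).

```shell
# resolve_addr ADDRESS MAPFILE: print the last symbol whose start address
# is at or below ADDRESS. Assumes MAPFILE is sorted by address and that
# addresses are fixed-width lowercase hex, so string comparison works.
resolve_addr() {
    awk -v addr="$1" '$1 <= addr { sym = $3 } END { print sym }' "$2"
}

# Reuse the System.map excerpt from the note above as sample data.
cat > system-map-excerpt <<'EOF'
ffffffff8024c5cd T unlock_new_inode
ffffffff8024c5da t unix_stream_sendmsg
ffffffff8024c920 t unix_stream_recvmsg
ffffffff8024cea1 t udp_v4_lookup_longway
ffffffff8044addc t arp_process
ffffffff8044b360 t arp_rcv
ffffffff8044b487 t parp_redo
ffffffff8044b48c t arp_solicit
EOF

resolve_addr ffffffff8024cd0f system-map-excerpt   # prints unix_stream_recvmsg
resolve_addr ffffffff8044b472 system-map-excerpt   # prints arp_rcv
```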
Chapter 28. Profiling kernel activity with SystemTap
The following sections showcase scripts that profile kernel activity by monitoring function calls.
28.1. Counting function calls with SystemTap
You can use the functioncallcount.stp SystemTap script to count specific kernel function calls. You can also use this script to target multiple kernel functions.
Prerequisites
- You have installed SystemTap as described in Installing SystemTap.
Procedure
Run the functioncallcount.stp script:
# stap --example functioncallcount.stp 'argument'
This script takes the targeted kernel function as an argument. You can use wildcards in the argument to target multiple kernel functions.
The output of the script, in alphabetical order, contains the names of the functions called and how many times each was called during the sample time.
Consider the following example:
# stap -w -v --example functioncallcount.stp "*@mm*.c" -c /bin/true
where:
- -w : Suppresses warnings.
- -v : Makes the output of the starting script visible.
- -c command : Tells SystemTap to count function calls during the execution of a command, in this example /bin/true.
The output should look similar to the following:
[...]
__vma_link 97
__vma_link_file 66
__vma_link_list 97
__vma_link_rb 97
__xchg 103
add_page_to_active_list 102
add_page_to_inactive_list 19
add_to_page_cache 19
add_to_page_cache_lru 7
all_vm_events 6
alloc_pages_node 4630
alloc_slabmgmt 67
anon_vma_alloc 62
anon_vma_free 62
anon_vma_lock 66
anon_vma_prepare 98
anon_vma_unlink 97
anon_vma_unlock 66
arch_get_unmapped_area_topdown 94
arch_get_unmapped_exec_area 3
arch_unmap_area_topdown 97
atomic_add 2
atomic_add_negative 97
atomic_dec_and_test 5153
atomic_inc 470
atomic_inc_and_test 1
[...]
28.2. Tracing function calls with SystemTap
You can use the para-callgraph.stp SystemTap script to trace function calls and function returns.
Prerequisites
- You have installed SystemTap as described in Installing SystemTap.
Procedure
- Run the para-callgraph.stp script:
# stap --example para-callgraph.stp 'argument1' 'argument2'
The para-callgraph.stp script takes two command-line arguments:
- The name of the function(s) whose entry and exit you want to trace.
- An optional trigger function, which enables or disables tracing on a per-thread basis. Tracing continues in each thread as long as the trigger function has not exited.
Consider the following example:
# stap -wv --example para-callgraph.stp 'kernel.function("*@fs/proc.c*")' 'kernel.function("vfs_read")' -c "cat /proc/sys/vm/* || true"
where:
- -w : Suppresses warnings.
- -v : Makes the output of the starting script visible.
- -c command : Tells SystemTap to trace function calls during the execution of a command, in this example cat /proc/sys/vm/* || true.
The output should look similar to the following:
[...]
   267 gnome-terminal(2921): <-do_sync_read return=0xfffffffffffffff5
   269 gnome-terminal(2921):<-vfs_read return=0xfffffffffffffff5
     0 gnome-terminal(2921):->fput file=0xffff880111eebbc0
     2 gnome-terminal(2921):<-fput
     0 gnome-terminal(2921):->fget_light fd=0x3 fput_needed=0xffff88010544df54
     3 gnome-terminal(2921):<-fget_light return=0xffff8801116ce980
     0 gnome-terminal(2921):->vfs_read file=0xffff8801116ce980 buf=0xc86504 count=0x1000 pos=0xffff88010544df48
     4 gnome-terminal(2921): ->rw_verify_area read_write=0x0 file=0xffff8801116ce980 ppos=0xffff88010544df48 count=0x1000
     7 gnome-terminal(2921): <-rw_verify_area return=0x1000
    12 gnome-terminal(2921): ->do_sync_read filp=0xffff8801116ce980 buf=0xc86504 len=0x1000 ppos=0xffff88010544df48
    15 gnome-terminal(2921): <-do_sync_read return=0xfffffffffffffff5
    18 gnome-terminal(2921):<-vfs_read return=0xfffffffffffffff5
     0 gnome-terminal(2921):->fput file=0xffff8801116ce980
28.3. Determining time spent in kernel and user space with SystemTap
You can use the thread-times.stp SystemTap script to determine the amount of time any given thread is spending in either the kernel or user-space.
Prerequisites
- You have installed SystemTap as described in Installing SystemTap.
Procedure
Run the thread-times.stp script:
# stap --example thread-times.stp
This script displays the top 20 processes taking up CPU time during a 5-second period, along with the total number of CPU ticks counted during the sample. The output also notes the percentage of CPU time each process used, as well as whether that time was spent in kernel space or user space.
  tid   %user %kernel (of 20002 ticks)
    0   0.00%  87.88%
32169   5.24%   0.03%
 9815   3.33%   0.36%
 9859   0.95%   0.00%
 3611   0.56%   0.12%
 9861   0.62%   0.01%
11106   0.37%   0.02%
32167   0.08%   0.08%
 3897   0.01%   0.08%
 3800   0.03%   0.00%
 2886   0.02%   0.00%
 3243   0.00%   0.01%
 3862   0.01%   0.00%
 3782   0.00%   0.00%
21767   0.00%   0.00%
 2522   0.00%   0.00%
 3883   0.00%   0.00%
 3775   0.00%   0.00%
 3943   0.00%   0.00%
 3873   0.00%   0.00%
28.4. Monitoring polling applications with SystemTap
You can use the timeout.stp SystemTap script to identify and monitor which applications are polling. Doing so allows you to track unnecessary or excessive polling, which helps you pinpoint areas for improvement in CPU usage and power savings.
Prerequisites
- You have installed SystemTap as described in Installing SystemTap.
Procedure
Run the timeout.stp script:
# stap --example timeout.stp
This script tracks how many times each application uses the following system calls over time:
- poll
- select
- epoll
- itimer
- futex
- nanosleep
- signal
In this example output you can see which process used which system call and how many times.
  uid |   poll  select   epoll  itimer   futex nanosle  signal| process
28937 | 148793       0       0    4727   37288       0       0| firefox
22945 |      0   56949       0       1       0       0       0| scim-bridge
    0 |      0       0       0   36414       0       0       0| swapper
 4275 |  23140       0       0       1       0       0       0| mixer_applet2
 4191 |      0   14405       0       0       0       0       0| scim-launcher
22941 |   7908       1       0      62       0       0       0| gnome-terminal
 4261 |      0       0       0       2       0    7622       0| escd
 3695 |      0       0       0       0       0    7622       0| gdm-binary
 3483 |      0    7206       0       0       0       0       0| dhcdbd
 4189 |   6916       0       0       2       0       0       0| scim-panel-gtk
 1863 |   5767       0       0       0       0       0       0| iscsid
28.5. Tracking most frequently used system calls with SystemTap
You can use the topsys.stp SystemTap script to list the top 20 system calls used by the system per 5-second interval. It also lists how many times each system call was used during that period.
Prerequisites
- You have installed SystemTap as described in Installing SystemTap.
Procedure
Run the topsys.stp script:
# stap --example topsys.stp
Consider the following example:
# stap -v --example topsys.stp
where -v makes the output of the starting script visible.
The output should look similar to the following:
--------------------------------------------------------------
                  SYSCALL      COUNT
             gettimeofday       1857
                     read       1821
                    ioctl       1568
                     poll       1033
                    close        638
                     open        503
                   select        455
                    write        391
                   writev        335
                    futex        303
                  recvmsg        251
                   socket        137
            clock_gettime        124
           rt_sigprocmask        121
                   sendto        120
                setitimer        106
                     stat         90
                     time         81
                sigreturn         72
                    fstat         66
--------------------------------------------------------------
28.6. Tracking system call volume per process with SystemTap
You can use the syscalls_by_proc.stp SystemTap script to see which processes are performing the highest volume of system calls. It displays the 20 processes performing the most system calls.
Prerequisites
- You have installed SystemTap as described in Installing SystemTap.
Procedure
Run the syscalls_by_proc.stp script:
# stap --example syscalls_by_proc.stp
Output of the syscalls_by_proc.stp script looks similar to the following:
Collecting data... Type Ctrl-C to exit and display results
#SysCalls  Process Name
     1577  multiload-apple
      692  synergyc
      408  pcscd
      376  mixer_applet2
      299  gnome-terminal
      293  Xorg
      206  scim-panel-gtk
       95  gnome-power-man
       90  artsd
       85  dhcdbd
       84  scim-bridge
       78  gnome-screensav
       66  scim-launcher
[...]
Chapter 29. Monitoring disk and I/O activity with SystemTap
The following sections showcase scripts that monitor disk and I/O activity.
29.1. Summarizing disk read/write traffic with SystemTap
You can use the disktop.stp SystemTap script to identify which processes are performing the heaviest disk reads and writes to the system.
Prerequisites
- You have installed SystemTap as described in Installing SystemTap.
Procedure
Run the disktop.stp script:
# stap --example disktop.stp
The script displays the top ten processes responsible for the heaviest reads or writes to a disk.
The output includes the following data per listed process:
- UID
- User ID. A user ID of 0 refers to the root user.
- PID
- The ID of the listed process.
- PPID
- The process ID of the listed process's parent process.
- CMD
- The name of the listed process.
- DEVICE
- Which storage device the listed process is reading from or writing to.
- T
- The type of action performed by the listed process, where W refers to write, and R refers to read.
- BYTES
- The amount of data read from or written to disk.
Output of the disktop.stp script looks similar to the following:
[...]
Mon Sep 29 03:38:28 2008 , Average:  19Kb/sec, Read: 7Kb, Write: 89Kb
  UID    PID   PPID              CMD   DEVICE   T   BYTES
    0  26319  26294          firefox     sda5   W   90229
    0   2758   2757  pam_timestamp_c     sda5   R    8064
    0   2885      1            cupsd     sda5   W    1678

Mon Sep 29 03:38:38 2008 , Average:   1Kb/sec, Read: 7Kb, Write:  1Kb
  UID    PID   PPID              CMD   DEVICE   T   BYTES
    0   2758   2757  pam_timestamp_c     sda5   R    8064
    0   2885      1            cupsd     sda5   W    1678
29.2. Tracking I/O time for each file read or write with SystemTap
You can use the iotime.stp SystemTap script to monitor the amount of time it takes for each process to read from or write to any file. This helps you to determine what files are slow to load on a system.
Prerequisites
- You have installed SystemTap as described in Installing SystemTap.
Procedure
Run the iotime.stp script:
# stap --example iotime.stp
The script tracks each time a system call opens, closes, reads from, or writes to a file. For each file any system call accesses, it counts the number of microseconds it takes for any reads or writes to finish, and tracks the amount of data, in bytes, read from or written to the file.
The output contains:
- A timestamp, in microseconds
- Process ID and process name
- An access or iotime flag
- The file accessed
If a process was able to read or write any data, a pair of access and iotime lines should appear together. The access line refers to the time that a given process started accessing a file. The end of the access line shows the amount of data read or written. The iotime line shows the amount of time, in microseconds, that the process took to perform the read or write.
Output of the iotime.stp script looks similar to the following:
[...]
825946 3364 (NetworkManager) access /sys/class/net/eth0/carrier read: 8190 write: 0
825955 3364 (NetworkManager) iotime /sys/class/net/eth0/carrier time: 9
[...]
117061 2460 (pcscd) access /dev/bus/usb/003/001 read: 43 write: 0
117065 2460 (pcscd) iotime /dev/bus/usb/003/001 time: 7
[...]
3973737 2886 (sendmail) access /proc/loadavg read: 4096 write: 0
3973744 2886 (sendmail) iotime /proc/loadavg time: 11
[...]
29.3. Tracking cumulative I/O with SystemTap
You can use the traceio.stp SystemTap script to track the cumulative amount of I/O to the system.
Prerequisites
- You have installed SystemTap as described in Installing SystemTap.
Procedure
Run the traceio.stp script:
# stap --example traceio.stp
The script prints the top ten executables generating I/O traffic over time. It also tracks the cumulative amount of I/O reads and writes done by those executables. This information is tracked and printed out in 1-second intervals, and in descending order.
Output of the traceio.stp script looks similar to the following:
[...]
           Xorg r:   583401 KiB w:        0 KiB
       floaters r:       96 KiB w:     7130 KiB
multiload-apple r:      538 KiB w:      537 KiB
           sshd r:       71 KiB w:       72 KiB
pam_timestamp_c r:      138 KiB w:        0 KiB
        staprun r:       51 KiB w:       51 KiB
          snmpd r:       46 KiB w:        0 KiB
          pcscd r:       28 KiB w:        0 KiB
     irqbalance r:       27 KiB w:        4 KiB
          cupsd r:        4 KiB w:       18 KiB

           Xorg r:   588140 KiB w:        0 KiB
       floaters r:       97 KiB w:     7143 KiB
multiload-apple r:      543 KiB w:      542 KiB
           sshd r:       72 KiB w:       72 KiB
pam_timestamp_c r:      138 KiB w:        0 KiB
        staprun r:       51 KiB w:       51 KiB
          snmpd r:       46 KiB w:        0 KiB
          pcscd r:       28 KiB w:        0 KiB
     irqbalance r:       27 KiB w:        4 KiB
          cupsd r:        4 KiB w:       18 KiB
29.4. Monitoring I/O activity on a specific device with SystemTap
You can use the traceio2.stp SystemTap script to monitor I/O activity on a specific device.
Prerequisites
- You have installed SystemTap as described in Installing SystemTap.
Procedure
- Run the traceio2.stp script.
# stap --example traceio2.stp 'argument'
This script takes the whole device number as an argument. To find this number you can use:
# stat -c "0x%D" directory
Where directory is located on the device you want to monitor.
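For example, to find the whole device number of the filesystem that holds /tmp (the path here is only an illustration; the printed value is system-specific):

```shell
# Print the whole device number, in hexadecimal, of the filesystem
# containing /tmp. %D is the device number in hex (GNU coreutils stat).
stat -c "0x%D" /tmp
```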
The output contains the following:
- The name and ID of any process performing a read or write
- The function it is performing (vfs_read or vfs_write)
- The kernel device number
Consider the following output of # stap traceio2.stp 0x805:
[...]
synergyc(3722) vfs_read 0x800005
synergyc(3722) vfs_read 0x800005
cupsd(2889) vfs_write 0x800005
cupsd(2889) vfs_write 0x800005
cupsd(2889) vfs_write 0x800005
[...]
29.5. Monitoring reads and writes to a file with SystemTap
You can use the inodewatch.stp SystemTap script to monitor reads from and writes to a file in real time.
Prerequisites
- You have installed SystemTap as described in Installing SystemTap.
Procedure
- Run the inodewatch.stp script:
# stap --example inodewatch.stp 'argument1' 'argument2' 'argument3'
The inodewatch.stp script takes three command-line arguments:
- The file’s major device number.
- The file’s minor device number.
- The file’s inode number.
You can get these numbers using:
# stat -c '%D %i' filename
Where filename is an absolute path.
Consider the following example:
# stat -c '%D %i' /etc/crontab
The output should look like:
805 1078319
where:
- 805 is the base-16 (hexadecimal) device number. The last two digits are the minor device number, and the remaining digits are the major number.
- 1078319 is the inode number.
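Splitting such a device number into its major and minor parts can be sketched with shell arithmetic, following the convention described above (the last two hexadecimal digits are the minor number):

```shell
# Split a hexadecimal whole device number into major and minor parts.
# 0x805 is the example value from the stat output above.
dev=0x805
major=$(( dev >> 8 ))    # drop the last two hex digits
minor=$(( dev & 0xff ))  # keep only the last two hex digits
printf 'major=0x%x minor=0x%02x\n' "$major" "$minor"
```

This prints major=0x8 minor=0x05, which matches the two leading arguments passed to inodewatch.stp below.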
To start monitoring /etc/crontab, run:
# stap inodewatch.stp 0x8 0x05 1078319
In the first two arguments you must use 0x prefixes for base-16 numbers.
The output contains the following:
- The name and ID of any process performing a read or write
- The function it is performing (vfs_read or vfs_write)
- The kernel device number
The output of this example should look like:
cat(16437) vfs_read 0x800005/1078319
cat(16437) vfs_read 0x800005/1078319