Red Hat Training

A Red Hat training course is available for RHEL 8

Monitoring and managing system status and performance

Red Hat Enterprise Linux 8

Optimizing system throughput, latency, and power consumption

Red Hat Customer Content Services

Abstract

This documentation collection provides instructions on how to monitor and optimize the throughput, latency, and power consumption of Red Hat Enterprise Linux 8 in different scenarios.

Making open source more inclusive

Red Hat is committed to replacing problematic language in our code, documentation, and web properties. We are beginning with these four terms: master, slave, blacklist, and whitelist. Because of the enormity of this endeavor, these changes will be implemented gradually over several upcoming releases. For more details, see our CTO Chris Wright’s message.

Providing feedback on Red Hat documentation

We appreciate your input on our documentation. Please let us know how we could make it better. To do so:

  • For simple comments on specific passages:

    1. Make sure you are viewing the documentation in the Multi-page HTML format. In addition, ensure you see the Feedback button in the upper right corner of the document.
    2. Use your mouse cursor to highlight the part of text that you want to comment on.
    3. Click the Add Feedback pop-up that appears below the highlighted text.
    4. Follow the displayed instructions.
  • For submitting more complex feedback, create a Bugzilla ticket:

    1. Go to the Bugzilla website.
    2. As the Component, use Documentation.
    3. Fill in the Description field with your suggestion for improvement. Include a link to the relevant part(s) of documentation.
    4. Click Submit Bug.

Chapter 1. Overview of performance monitoring options

The following are some of the performance monitoring and configuration tools available in Red Hat Enterprise Linux 8:

  • Performance Co-Pilot (pcp) is used for monitoring, visualizing, storing, and analyzing system-level performance measurements. It allows the monitoring and management of real-time data, and logging and retrieval of historical data.
  • Red Hat Enterprise Linux 8 provides several tools that can be used from the command line to monitor a system outside run level 5. The following are the built-in command line tools:

    • top is provided by the procps-ng package. It gives a dynamic view of the processes in a running system. It displays a variety of information, including a system summary and a list of tasks currently being managed by the Linux kernel.
    • ps is provided by the procps-ng package. It captures a snapshot of a select group of active processes. By default, the examined group is limited to processes that are owned by the current user and associated with the terminal where the ps command is executed.
    • Virtual memory statistics (vmstat) is provided by the procps-ng package. It provides instant reports of your system’s processes, memory, paging, block input/output, interrupts, and CPU activity.
    • System activity reporter (sar) is provided by the sysstat package. It collects and reports information about system activity that has occurred so far on the current day.
  • perf uses hardware performance counters and kernel trace-points to track the impact of other commands and applications on a system.
  • bcc-tools is used for BPF Compiler Collection (BCC). It provides over 100 eBPF scripts that monitor kernel activities. For more information on each of this tool, see the man page describing how to use it and what functions it performs.
  • turbostat is provided by the kernel-tools package. It reports on processor topology, frequency, idle power-state statistics, temperature, and power usage on the Intel 64 processors.
  • iostat is provided by the sysstat package. It monitors and reports on system IO device loading to help administrators make decisions about how to balance IO load between physical disks.
  • irqbalance distributes hardware interrupts across processors to improve system performance.
  • ss prints statistical information about sockets, allowing administrators to assess device performance over time. Red Hat recommends using ss over netstat in Red Hat Enterprise Linux 8.
  • numastat is provided by the numactl package. By default, numastat displays per-node NUMA hit an miss system statistics from the kernel memory allocator. Optimal performance is indicated by high numa_hit values and low numa_miss values.
  • numad is an automatic NUMA affinity management daemon. It monitors NUMA topology and resource usage within a system that dynamically improves NUMA resource allocation, management, and therefore system performance.
  • SystemTap monitors and analyzes operating system activities, especially the kernel activities.
  • valgrind analyzes applications by running it on a synthetic CPU and instrumenting existing application code as it is executed. It then prints commentary that clearly identifies each process involved in application execution to a user-specified file, file descriptor, or network socket. It is also useful for finding memory leaks.
  • pqos is provided by the the intel-cmt-cat package. It monitors and controls CPU cache and memory bandwidth on recent Intel processors.

Additional resources

Chapter 2. Getting started with Tuned

As a system administrator, you can use the Tuned application to optimize the performance profile of your system for a variety of use cases.

2.1. The purpose of Tuned

Tuned is a service that monitors your system and optimizes the performance under certain workloads. The core of Tuned are profiles, which tune your system for different use cases.

Tuned is distributed with a number of predefined profiles for use cases such as:

  • High throughput
  • Low latency
  • Saving power

It is possible to modify the rules defined for each profile and customize how to tune a particular device. When you switch to another profile or deactivate Tuned, all changes made to the system settings by the previous profile revert back to their original state.

You can also configure Tuned to react to changes in device usage and adjusts settings to improve performance of active devices and reduce power consumption of inactive devices.

2.2. Tuned profiles

A detailed analysis of a system can be very time-consuming. Tuned provides a number of predefined profiles for typical use cases. You can also create, modify, and delete profiles.

The profiles provided with Tuned are divided into the following categories:

  • Power-saving profiles
  • Performance-boosting profiles

The performance-boosting profiles include profiles that focus on the following aspects:

  • Low latency for storage and network
  • High throughput for storage and network
  • Virtual machine performance
  • Virtualization host performance

Syntax of profile configuration

The tuned.conf file can contain one [main] section and other sections for configuring plug-in instances. However, all sections are optional.

Lines starting with the hash sign (#) are comments.

Additional resources

  • tuned.conf(5) man page.

2.3. The default Tuned profile

During the installation, the best profile for your system is selected automatically. Currently, the default profile is selected according to the following customizable rules:

EnvironmentDefault profileGoal

Compute nodes

throughput-performance

The best throughput performance

Virtual machines

virtual-guest

The best performance. If you are not interested in the best performance, you can change it to the balanced or powersave profile.

Other cases

balanced

Balanced performance and power consumption

Additional resources

  • tuned.conf(5) man page.

2.4. Merged Tuned profiles

As an experimental feature, it is possible to select more profiles at once. Tuned will try to merge them during the load.

If there are conflicts, the settings from the last specified profile takes precedence.

Example 2.1. Low power consumption in a virtual guest

The following example optimizes the system to run in a virtual machine for the best performance and concurrently tunes it for low power consumption, while the low power consumption is the priority:

# tuned-adm profile virtual-guest powersave
Warning

Merging is done automatically without checking whether the resulting combination of parameters makes sense. Consequently, the feature might tune some parameters the opposite way, which might be counterproductive: for example, setting the disk for high throughput by using the throughput-performance profile and concurrently setting the disk spindown to the low value by the spindown-disk profile.

Additional resources

  • tuned.conf(5) man page.

2.5. The location of Tuned profiles

Tuned stores profiles in the following directories:

/usr/lib/tuned/
Distribution-specific profiles are stored in the directory. Each profile has its own directory. The profile consists of the main configuration file called tuned.conf, and optionally other files, for example helper scripts.
/etc/tuned/
If you need to customize a profile, copy the profile directory into the directory, which is used for custom profiles. If there are two profiles of the same name, the custom profile located in /etc/tuned/ is used.

Additional resources

  • tuned.conf(5) man page.

2.6. Tuned profiles distributed with RHEL

The following is a list of profiles that are installed with Tuned on Red Hat Enterprise Linux.

Note

There might be more product-specific or third-party Tuned profiles available. Such profiles are usually provided by separate RPM packages.

balanced

The default power-saving profile. It is intended to be a compromise between performance and power consumption. It uses auto-scaling and auto-tuning whenever possible. The only drawback is the increased latency. In the current Tuned release, it enables the CPU, disk, audio, and video plugins, and activates the conservative CPU governor. The radeon_powersave option uses the dpm-balanced value if it is supported, otherwise it is set to auto.

It changes the energy_performance_preference attribute to the normal energy setting. It also changes the scaling_governor policy attribute to either the conservative or powersave CPU governor.

powersave

A profile for maximum power saving performance. It can throttle the performance in order to minimize the actual power consumption. In the current Tuned release it enables USB autosuspend, WiFi power saving, and Aggressive Link Power Management (ALPM) power savings for SATA host adapters. It also schedules multi-core power savings for systems with a low wakeup rate and activates the ondemand governor. It enables AC97 audio power saving or, depending on your system, HDA-Intel power savings with a 10 seconds timeout. If your system contains a supported Radeon graphics card with enabled KMS, the profile configures it to automatic power saving. On ASUS Eee PCs, a dynamic Super Hybrid Engine is enabled.

It changes the energy_performance_preference attribute to the powersave or power energy setting. It also changes the scaling_governor policy attribute to either the ondemand or powersave CPU governor.

Note

In certain cases, the balanced profile is more efficient compared to the powersave profile.

Consider there is a defined amount of work that needs to be done, for example a video file that needs to be transcoded. Your machine might consume less energy if the transcoding is done on the full power, because the task is finished quickly, the machine starts to idle, and it can automatically step-down to very efficient power save modes. On the other hand, if you transcode the file with a throttled machine, the machine consumes less power during the transcoding, but the process takes longer and the overall consumed energy can be higher.

That is why the balanced profile can be generally a better option.

throughput-performance

A server profile optimized for high throughput. It disables power savings mechanisms and enables sysctl settings that improve the throughput performance of the disk and network IO. CPU governor is set to performance.

It changes the energy_performance_preference and scaling_governor attribute to the performance profile.

accelerator-performance
The accelerator-performance profile contains the same tuning as the throughput-performance profile. Additionally, it locks the CPU to low C states so that the latency is less than 100us. This improves the performance of certain accelerators, such as GPUs.
latency-performance

A server profile optimized for low latency. It disables power savings mechanisms and enables sysctl settings that improve latency. CPU governor is set to performance and the CPU is locked to the low C states (by PM QoS).

It changes the energy_performance_preference and scaling_governor attribute to the performance profile.

network-latency

A profile for low latency network tuning. It is based on the latency-performance profile. It additionally disables transparent huge pages and NUMA balancing, and tunes several other network-related sysctl parameters.

It inherits the latency-performance profile which changes the energy_performance_preference and scaling_governor attribute to the performance profile.

hpc-compute
A profile optimized for high-performance computing. It is based on the latency-performance profile.
network-throughput

A profile for throughput network tuning. It is based on the throughput-performance profile. It additionally increases kernel network buffers.

It inherits either the latency-performance or throughput-performance profile, and changes the energy_performance_preference and scaling_governor attribute to the performance profile.

virtual-guest

A profile designed for Red Hat Enterprise Linux 8 virtual machines and VMWare guests based on the throughput-performance profile that, among other tasks, decreases virtual memory swappiness and increases disk readahead values. It does not disable disk barriers.

It inherits the throughput-performance profile and changes the energy_performance_preference and scaling_governor attribute to the performance profile.

virtual-host

A profile designed for virtual hosts based on the throughput-performance profile that, among other tasks, decreases virtual memory swappiness, increases disk readahead values, and enables a more aggressive value of dirty pages writeback.

It inherits the throughput-performance profile and changes the energy_performance_preference and scaling_governor attribute to the performance profile.

oracle
A profile optimized for Oracle databases loads based on throughput-performance profile. It additionally disables transparent huge pages and modifies other performance-related kernel parameters. This profile is provided by the tuned-profiles-oracle package.
desktop
A profile optimized for desktops, based on the balanced profile. It additionally enables scheduler autogroups for better response of interactive applications.
cpu-partitioning

The cpu-partitioning profile partitions the system CPUs into isolated and housekeeping CPUs. To reduce jitter and interruptions on an isolated CPU, the profile clears the isolated CPU from user-space processes, movable kernel threads, interrupt handlers, and kernel timers.

A housekeeping CPU can run all services, shell processes, and kernel threads.

You can configure the cpu-partitioning profile in /etc/tuned/cpu-partitioning-variables.conf file. The configuration options are:

isolated_cores=cpu-list
Lists CPUs to isolate. The list of isolated CPUs is comma-separated or the user can specify the range. You can specify a range using a dash, such as 3-5. This option is mandatory. Any CPU missing from this list is automatically considered a housekeeping CPU.
no_balance_cores=cpu-list
Lists CPUs which are not considered by the kernel during system wide process load-balancing. This option is optional. This is usually the same list as isolated_cores.

For more information on cpu-partitioning, see the tuned-profiles-cpu-partitioning(7) man page.

optimize-serial-console

A profile that tunes down I/O activity to the serial console by reducing the printk value. This should make the serial console more responsive. This profile is intended to be used as an overlay on other profiles. For example:

# tuned-adm profile throughput-performance optimize-serial-console
mssql
A profile provided for Microsoft SQL Server. It is based on the thoguhput-performance profile.
intel-sst

A profile optimized for systems with user-defined Intel Speed Select Technology configurations. This profile is intended to be used as an overlay on other profiles. For example:

# tuned-adm profile cpu-partitioning intel-sst

2.7. Real-time Tuned profiles distributed with RHEL

Real-time profiles are intended for systems running the real-time kernel. Without a special kernel build, they do not configure the system to be real-time. On RHEL, the profiles are available from additional repositories.

The following real-time profiles are available:

realtime

Use on bare-metal real-time systems.

Provided by the tuned-profiles-realtime package, which is available from the RT or NFV repositories.

realtime-virtual-host

Use in a virtualization host configured for real-time.

Provided by the tuned-profiles-nfv-host package, which is available from the NFV repository.

realtime-virtual-guest

Use in a virtualization guest configured for real-time.

Provided by the tuned-profiles-nfv-guest package, which is available from the NFV repository.

2.8. Static and dynamic tuning in Tuned

This section explains the difference between the two categories of system tuning that Tuned applies: static and dynamic.

Static tuning
Mainly consists of the application of predefined sysctl and sysfs settings and one-shot activation of several configuration tools such as ethtool.
Dynamic tuning

Watches how various system components are used throughout the uptime of your system. Tuned adjusts system settings dynamically based on that monitoring information.

For example, the hard drive is used heavily during startup and login, but is barely used later when the user might mainly work with applications such as web browsers or email clients. Similarly, the CPU and network devices are used differently at different times. Tuned monitors the activity of these components and reacts to the changes in their use.

By default, dynamic tuning is disabled. To enable it, edit the /etc/tuned/tuned-main.conf file and change the dynamic_tuning option to 1. Tuned then periodically analyzes system statistics and uses them to update your system tuning settings. To configure the time interval in seconds between these updates, use the update_interval option.

Currently implemented dynamic tuning algorithms try to balance the performance and powersave, and are therefore disabled in the performance profiles. Dynamic tuning for individual plug-ins can be enabled or disabled in the Tuned profiles.

Example 2.2. Static and dynamic tuning on a workstation

On a typical office workstation, the Ethernet network interface is inactive most of the time. Only a few emails go in and out or some web pages might be loaded.

For those kinds of loads, the network interface does not have to run at full speed all the time, as it does by default. Tuned has a monitoring and tuning plug-in for network devices that can detect this low activity and then automatically lower the speed of that interface, typically resulting in a lower power usage.

If the activity on the interface increases for a longer period of time, for example because a DVD image is being downloaded or an email with a large attachment is opened, Tuned detects this and sets the interface speed to maximum to offer the best performance while the activity level is high.

This principle is used for other plug-ins for CPU and disks as well.

2.9. Tuned no-daemon mode

You can run Tuned in no-daemon mode, which does not require any resident memory. In this mode, Tuned applies the settings and exits.

By default, no-daemon mode is disabled because a lot of Tuned functionality is missing in this mode, including:

  • D-Bus support
  • Hot-plug support
  • Rollback support for settings

To enable no-daemon mode, include the following line in the /etc/tuned/tuned-main.conf file:

daemon = 0

2.10. Installing and enabling Tuned

This procedure installs and enables the Tuned application, installs Tuned profiles, and presets a default Tuned profile for your system.

Procedure

  1. Install the tuned package:

    # yum install tuned
  2. Enable and start the tuned service:

    # systemctl enable --now tuned
  3. Optionally, install Tuned profiles for real-time systems:

    # yum install tuned-profiles-realtime tuned-profiles-nfv
  4. Verify that a Tuned profile is active and applied:

    $ tuned-adm active
    
    Current active profile: balanced
    $ tuned-adm verify
    
    Verfication succeeded, current system settings match the preset profile.
    See tuned log file ('/var/log/tuned/tuned.log') for details.

2.11. Listing available Tuned profiles

This procedure lists all Tuned profiles that are currently available on your system.

Procedure

  • To list all available Tuned profiles on your system, use:

    $ tuned-adm list
    
    Available profiles:
    - balanced               - General non-specialized tuned profile
    - desktop                - Optimize for the desktop use-case
    - latency-performance    - Optimize for deterministic performance at the cost of increased power consumption
    - network-latency        - Optimize for deterministic performance at the cost of increased power consumption, focused on low latency network performance
    - network-throughput     - Optimize for streaming network throughput, generally only necessary on older CPUs or 40G+ networks
    - powersave              - Optimize for low power consumption
    - throughput-performance - Broadly applicable tuning that provides excellent performance across a variety of common server workloads
    - virtual-guest          - Optimize for running inside a virtual guest
    - virtual-host           - Optimize for running KVM guests
    Current active profile: balanced
  • To display only the currently active profile, use:

    $ tuned-adm active
    
    Current active profile: balanced

Additional resources

  • tuned-adm(8) man page.

2.12. Setting a Tuned profile

This procedure activates a selected Tuned profile on your system.

Prerequisites

Procedure

  1. Optionally, you can let Tuned recommend the most suitable profile for your system:

    # tuned-adm recommend
    
    balanced
  2. Activate a profile:

    # tuned-adm profile selected-profile

    Alternatively, you can activate a combination of multiple profiles:

    # tuned-adm profile profile1 profile2

    Example 2.3. A virtual machine optimized for low power consumption

    The following example optimizes the system to run in a virtual machine with the best performance and concurrently tunes it for low power consumption, while the low power consumption is the priority:

    # tuned-adm profile virtual-guest powersave
  3. View the current active Tuned profile on your system:

    # tuned-adm active
    
    Current active profile: selected-profile
  4. Reboot the system:

    # reboot

Verification steps

  • Verify that the Tuned profile is active and applied:

    $ tuned-adm verify
    
    Verfication succeeded, current system settings match the preset profile.
    See tuned log file ('/var/log/tuned/tuned.log') for details.

Additional resources

  • tuned-adm(8) man page

2.13. Disabling Tuned

This procedure disables Tuned and resets all affected system settings to their original state before Tuned modified them.

Procedure

  • To disable all tunings temporarily:

    # tuned-adm off

    The tunings are applied again after the tuned service restarts.

  • Alternatively, to stop and disable the tuned service permanently:

    # systemctl disable --now tuned

Additional resources

  • tuned-adm(8) man page

Chapter 3. Customizing Tuned profiles

You can create or modify Tuned profiles to optimize system performance for your intended use case.

Prerequisites

3.1. Tuned profiles

A detailed analysis of a system can be very time-consuming. Tuned provides a number of predefined profiles for typical use cases. You can also create, modify, and delete profiles.

The profiles provided with Tuned are divided into the following categories:

  • Power-saving profiles
  • Performance-boosting profiles

The performance-boosting profiles include profiles that focus on the following aspects:

  • Low latency for storage and network
  • High throughput for storage and network
  • Virtual machine performance
  • Virtualization host performance

Syntax of profile configuration

The tuned.conf file can contain one [main] section and other sections for configuring plug-in instances. However, all sections are optional.

Lines starting with the hash sign (#) are comments.

Additional resources

  • tuned.conf(5) man page.

3.2. The default Tuned profile

During the installation, the best profile for your system is selected automatically. Currently, the default profile is selected according to the following customizable rules:

EnvironmentDefault profileGoal

Compute nodes

throughput-performance

The best throughput performance

Virtual machines

virtual-guest

The best performance. If you are not interested in the best performance, you can change it to the balanced or powersave profile.

Other cases

balanced

Balanced performance and power consumption

Additional resources

  • tuned.conf(5) man page.

3.3. Merged Tuned profiles

As an experimental feature, it is possible to select more profiles at once. Tuned will try to merge them during the load.

If there are conflicts, the settings from the last specified profile takes precedence.

Example 3.1. Low power consumption in a virtual guest

The following example optimizes the system to run in a virtual machine for the best performance and concurrently tunes it for low power consumption, while the low power consumption is the priority:

# tuned-adm profile virtual-guest powersave
Warning

Merging is done automatically without checking whether the resulting combination of parameters makes sense. Consequently, the feature might tune some parameters the opposite way, which might be counterproductive: for example, setting the disk for high throughput by using the throughput-performance profile and concurrently setting the disk spindown to the low value by the spindown-disk profile.

Additional resources

  • tuned.conf(5) man page.

3.4. The location of Tuned profiles

Tuned stores profiles in the following directories:

/usr/lib/tuned/
Distribution-specific profiles are stored in the directory. Each profile has its own directory. The profile consists of the main configuration file called tuned.conf, and optionally other files, for example helper scripts.
/etc/tuned/
If you need to customize a profile, copy the profile directory into the directory, which is used for custom profiles. If there are two profiles of the same name, the custom profile located in /etc/tuned/ is used.

Additional resources

  • tuned.conf(5) man page.

3.5. Inheritance between Tuned profiles

Tuned profiles can be based on other profiles and modify only certain aspects of their parent profile.

The [main] section of Tuned profiles recognizes the include option:

[main]
include=parent

All settings from the parent profile are loaded in this child profile. In the following sections, the child profile can override certain settings inherited from the parent profile or add new settings not present in the parent profile.

You can create your own child profile in the /etc/tuned/ directory based on a pre-installed profile in /usr/lib/tuned/ with only some parameters adjusted.

If the parent profile is updated, such as after a Tuned upgrade, the changes are reflected in the child profile.

Example 3.2. A power-saving profile based on balanced

The following is an example of a custom profile that extends the balanced profile and sets Aggressive Link Power Management (ALPM) for all devices to the maximum powersaving.

[main]
include=balanced

[scsi_host]
alpm=min_power

Additional resources

  • tuned.conf(5) man page

3.6. Static and dynamic tuning in Tuned

This section explains the difference between the two categories of system tuning that Tuned applies: static and dynamic.

Static tuning
Mainly consists of the application of predefined sysctl and sysfs settings and one-shot activation of several configuration tools such as ethtool.
Dynamic tuning

Watches how various system components are used throughout the uptime of your system. Tuned adjusts system settings dynamically based on that monitoring information.

For example, the hard drive is used heavily during startup and login, but is barely used later when the user might mainly work with applications such as web browsers or email clients. Similarly, the CPU and network devices are used differently at different times. Tuned monitors the activity of these components and reacts to the changes in their use.

By default, dynamic tuning is disabled. To enable it, edit the /etc/tuned/tuned-main.conf file and change the dynamic_tuning option to 1. Tuned then periodically analyzes system statistics and uses them to update your system tuning settings. To configure the time interval in seconds between these updates, use the update_interval option.

Currently implemented dynamic tuning algorithms try to balance the performance and powersave, and are therefore disabled in the performance profiles. Dynamic tuning for individual plug-ins can be enabled or disabled in the Tuned profiles.

Example 3.3. Static and dynamic tuning on a workstation

On a typical office workstation, the Ethernet network interface is inactive most of the time. Only a few emails go in and out or some web pages might be loaded.

For those kinds of loads, the network interface does not have to run at full speed all the time, as it does by default. Tuned has a monitoring and tuning plug-in for network devices that can detect this low activity and then automatically lower the speed of that interface, typically resulting in a lower power usage.

If the activity on the interface increases for a longer period of time, for example because a DVD image is being downloaded or an email with a large attachment is opened, Tuned detects this and sets the interface speed to maximum to offer the best performance while the activity level is high.

This principle is used for other plug-ins for CPU and disks as well.

3.7. Tuned plug-ins

Plug-ins are modules in Tuned profiles that Tuned uses to monitor or optimize different devices on the system.

Tuned uses two types of plug-ins:

Monitoring plug-ins

Monitoring plug-ins are used to get information from a running system. The output of the monitoring plug-ins can be used by tuning plug-ins for dynamic tuning.

Monitoring plug-ins are automatically instantiated whenever their metrics are needed by any of the enabled tuning plug-ins. If two tuning plug-ins require the same data, only one instance of the monitoring plug-in is created and the data is shared.

Tuning plug-ins
Each tuning plug-in tunes an individual subsystem and takes several parameters that are populated from the tuned profiles. Each subsystem can have multiple devices, such as multiple CPUs or network cards, that are handled by individual instances of the tuning plug-ins. Specific settings for individual devices are also supported.

Syntax for plug-ins in Tuned profiles

Sections describing plug-in instances are formatted in the following way:

[NAME]
type=TYPE
devices=DEVICES
NAME
is the name of the plug-in instance as it is used in the logs. It can be an arbitrary string.
TYPE
is the type of the tuning plug-in.
DEVICES

is the list of devices that this plug-in instance handles.

The devices line can contain a list, a wildcard (*), and negation (!). If there is no devices line, all devices present or later attached on the system of the TYPE are handled by the plug-in instance. This is same as using the devices=* option.

Example 3.4. Matching block devices with a plug-in

The following example matches all block devices starting with sd, such as sda or sdb, and does not disable barriers on them:

[data_disk]
type=disk
devices=sd*
disable_barriers=false

The following example matches all block devices except sda1 and sda2:

[data_disk]
type=disk
devices=!sda1, !sda2
disable_barriers=false

If no instance of a plug-in is specified, the plug-in is not enabled.

If the plug-in supports more options, they can be also specified in the plug-in section. If the option is not specified and it was not previously specified in the included plug-in, the default value is used.

Short plug-in syntax

If you do not need custom names for the plug-in instance and there is only one definition of the instance in your configuration file, Tuned supports the following short syntax:

[TYPE]
devices=DEVICES

In this case, it is possible to omit the type line. The instance is then referred to with a name, same as the type. The previous example could be then rewritten into:

Example 3.5. Matching block devices using the short syntax

[disk]
devices=sdb*
disable_barriers=false

Conflicting plug-in definitions in a profile

If the same section is specified more than once using the include option, the settings are merged. If they cannot be merged due to a conflict, the last conflicting definition overrides the previous settings. If you do not know what was previously defined, you can use the replace Boolean option and set it to true. This causes all the previous definitions with the same name to be overwritten and the merge does not happen.

You can also disable the plug-in by specifying the enabled=false option. This has the same effect as if the instance was never defined. Disabling the plug-in is useful if you are redefining the previous definition from the include option and do not want the plug-in to be active in your custom profile.

NOTE

Tuned includes the ability to run any shell command as part of enabling or disabling a tuning profile. This enables you to extend Tuned profiles with functionality that has not been integrated into Tuned yet.

You can specify arbitrary shell commands using the script plug-in.

Additional resources

  • tuned.conf(5) man page

3.8. Available Tuned plug-ins

This section lists all monitoring and tuning plug-ins currently available in Tuned.

Monitoring plug-ins

Currently, the following monitoring plug-ins are implemented:

disk
Gets disk load (number of IO operations) per device and measurement interval.
net
Gets network load (number of transferred packets) per network card and measurement interval.
load
Gets CPU load per CPU and measurement interval.

Tuning plug-ins

Currently, the following tuning plug-ins are implemented. Only some of these plug-ins implement dynamic tuning. Options supported by plug-ins are also listed:

cpu

Sets the CPU governor to the value specified by the governor option and dynamically changes the Power Management Quality of Service (PM QoS) CPU Direct Memory Access (DMA) latency according to the CPU load.

If the CPU load is lower than the value specified by the load_threshold option, the latency is set to the value specified by the latency_high option, otherwise it is set to the value specified by latency_low.

You can also force the latency to a specific value and prevent it from dynamically changing further. To do so, set the force_latency option to the required latency value.

eeepc_she

Dynamically sets the front-side bus (FSB) speed according to the CPU load.

This feature can be found on some netbooks and is also known as the ASUS Super Hybrid Engine (SHE).

If the CPU load is lower or equal to the value specified by the load_threshold_powersave option, the plug-in sets the FSB speed to the value specified by the she_powersave option. If the CPU load is higher or equal to the value specified by the load_threshold_normal option, it sets the FSB speed to the value specified by the she_normal option.

Static tuning is not supported and the plug-in is transparently disabled if Tuned does not detect the hardware support for this feature.

net
Configures the Wake-on-LAN functionality to the values specified by the wake_on_lan option. It uses the same syntax as the ethtool utility. It also dynamically changes the interface speed according to the interface utilization.
sysctl

Sets various sysctl settings specified by the plug-in options.

The syntax is name=value, where name is the same as the name provided by the sysctl utility.

Use the sysctl plug-in if you need to change system settings that are not covered by other plug-ins available in Tuned. If the settings are covered by some specific plug-ins, prefer these plug-ins.

usb

Sets autosuspend timeout of USB devices to the value specified by the autosuspend parameter.

The value 0 means that autosuspend is disabled.

vm

Enables or disables transparent huge pages depending on the value of the transparent_hugepages option.

Valid values of the transparent_hugepages option are:

  • "always"
  • "never"
  • "madvise"
audio

Sets the autosuspend timeout for audio codecs to the value specified by the timeout option.

Currently, the snd_hda_intel and snd_ac97_codec codecs are supported. The value 0 means that the autosuspend is disabled. You can also enforce the controller reset by setting the Boolean option reset_controller to true.

disk

Sets the disk elevator to the value specified by the elevator option.

It also sets:

  • APM to the value specified by the apm option
  • Scheduler quantum to the value specified by the scheduler_quantum option
  • Disk spindown timeout to the value specified by the spindown option
  • Disk readahead to the value specified by the readahead parameter
  • The current disk readahead to a value multiplied by the constant specified by the readahead_multiply option

In addition, this plug-in dynamically changes the advanced power management and spindown timeout setting for the drive according to the current drive utilization. The dynamic tuning can be controlled by the Boolean option dynamic and is enabled by default.

scsi_host

Tunes options for SCSI hosts.

It sets Aggressive Link Power Management (ALPM) to the value specified by the alpm option.

mounts
Enables or disables barriers for mounts according to the Boolean value of the disable_barriers option.
script

Executes an external script or binary when the profile is loaded or unloaded. You can choose an arbitrary executable.

Important

The script plug-in is provided mainly for compatibility with earlier releases. Prefer other Tuned plug-ins if they cover the required functionality.

Tuned calls the executable with one of the following arguments:

  • start when loading the profile
  • stop when unloading the profile

You need to correctly implement the stop action in your executable and revert all settings that you changed during the start action. Otherwise, the roll-back step after changing your Tuned profile will not work.

Bash scripts can import the /usr/lib/tuned/functions Bash library and use the functions defined there. Use these functions only for functionality that is not natively provided by Tuned. If a function name starts with an underscore, such as _wifi_set_power_level, consider the function private and do not use it in your scripts, because it might change in the future.

Specify the path to the executable using the script parameter in the plug-in configuration.

Example 3.6. Running a Bash script from a profile

To run a Bash script named script.sh that is located in the profile directory, use:

[script]
script=${i:PROFILE_DIR}/script.sh
sysfs

Sets various sysfs settings specified by the plug-in options.

The syntax is name=value, where name is the sysfs path to use.

Use this plugin in case you need to change some settings that are not covered by other plug-ins. Prefer specific plug-ins if they cover the required settings.

video

Sets various powersave levels on video cards. Currently, only the Radeon cards are supported.

The powersave level can be specified by using the radeon_powersave option. Supported values are:

  • default
  • auto
  • low
  • mid
  • high
  • dynpm
  • dpm-battery
  • dpm-balanced
  • dpm-perfomance

For details, see www.x.org. Note that this plug-in is experimental and the option might change in future releases.

bootloader

Adds options to the kernel command line. This plug-in supports only the GRUB 2 boot loader.

Customized non-standard location of the GRUB 2 configuration file can be specified by the grub2_cfg_file option.

The kernel options are added to the current GRUB configuration and its templates. The system needs to be rebooted for the kernel options to take effect.

Switching to another profile or manually stopping the tuned service removes the additional options. If you shut down or reboot the system, the kernel options persist in the grub.cfg file.

The kernel options can be specified by the following syntax:

cmdline=arg1 arg2 ... argN

Example 3.7. Modifying the kernel command line

For example, to add the quiet kernel option to a Tuned profile, include the following lines in the tuned.conf file:

[bootloader]
cmdline=quiet

The following is an example of a custom profile that adds the isolcpus=2 option to the kernel command line:

[bootloader]
cmdline=isolcpus=2

3.9. Variables in Tuned profiles

Variables expand at run time when a Tuned profile is activated.

Using Tuned variables reduces the amount of necessary typing in Tuned profiles.

There are no predefined variables in Tuned profiles. You can define your own variables by creating the [variables] section in a profile and using the following syntax:

[variables]

variable_name=value

To expand the value of a variable in a profile, use the following syntax:

${variable_name}

Example 3.8. Isolating CPU cores using variables

In the following example, the ${isolated_cores} variable expands to 1,2; hence the kernel boots with the isolcpus=1,2 option:

[variables]
isolated_cores=1,2

[bootloader]
cmdline=isolcpus=${isolated_cores}

The variables can be specified in a separate file. For example, you can add the following lines to tuned.conf:

[variables]
include=/etc/tuned/my-variables.conf

[bootloader]
cmdline=isolcpus=${isolated_cores}

If you add the isolated_cores=1,2 option to the /etc/tuned/my-variables.conf file, the kernel boots with the isolcpus=1,2 option.

Additional resources

  • tuned.conf(5) man page

3.10. Built-in functions in Tuned profiles

Built-in functions expand at run time when a Tuned profile is activated.

You can:

  • Use various built-in functions together with Tuned variables
  • Create custom functions in Python and add them to Tuned in the form of plug-ins

To call a function, use the following syntax:

${f:function_name:argument_1:argument_2}

To expand the directory path where the profile and the tuned.conf file are located, use the PROFILE_DIR function, which requires special syntax:

${i:PROFILE_DIR}

Example 3.9. Isolating CPU cores using variables and built-in functions

In the following example, the ${non_isolated_cores} variable expands to 0,3-5, and the cpulist_invert built-in function is called with the 0,3-5 argument:

[variables]
non_isolated_cores=0,3-5

[bootloader]
cmdline=isolcpus=${f:cpulist_invert:${non_isolated_cores}}

The cpulist_invert function inverts the list of CPUs. For a 6-CPU machine, the inversion is 1,2, and the kernel boots with the isolcpus=1,2 command-line option.

Additional resources

  • tuned.conf(5) man page

3.11. Built-in functions available in Tuned profiles

The following built-in functions are available in all Tuned profiles:

PROFILE_DIR
Returns the directory path where the profile and the tuned.conf file are located.
exec
Executes a process and returns its output.
assertion
Compares two arguments. If they do not match, the function logs text from the first argument and aborts profile loading.
assertion_non_equal
Compares two arguments. If they match, the function logs text from the first argument and aborts profile loading.
kb2s
Converts kilobytes to disk sectors.
s2kb
Converts disk sectors to kilobytes.
strip
Creates a string from all passed arguments and deletes both leading and trailing white space.
virt_check

Checks whether Tuned is running inside a virtual machine (VM) or on bare metal:

  • Inside a VM, the function returns the first argument.
  • On bare metal, the function returns the second argument, even in case of an error.
cpulist_invert
Inverts a list of CPUs to make its complement. For example, on a system with 4 CPUs, numbered from 0 to 3, the inversion of the list 0,2,3 is 1.
cpulist2hex
Converts a CPU list to a hexadecimal CPU mask.
cpulist2hex_invert
Converts a CPU list to a hexadecimal CPU mask and inverts it.
hex2cpulist
Converts a hexadecimal CPU mask to a CPU list.
cpulist_online
Checks whether the CPUs from the list are online. Returns the list containing only online CPUs.
cpulist_present
Checks whether the CPUs from the list are present. Returns the list containing only present CPUs.
cpulist_unpack
Unpacks a CPU list in the form of 1-3,4 to 1,2,3,4.
cpulist_pack
Packs a CPU list in the form of 1,2,3,5 to 1-3,5.

3.12. Creating new Tuned profiles

This procedure creates a new Tuned profile with custom performance rules.

Prerequisites

Procedure

  1. In the /etc/tuned/ directory, create a new directory named the same as the profile that you want to create:

    # mkdir /etc/tuned/my-profile
  2. In the new directory, create a file named tuned.conf. Add a [main] section and plug-in definitions in it, according to your requirements.

    For example, see the configuration of the balanced profile:

    [main]
    summary=General non-specialized tuned profile
    
    [cpu]
    governor=conservative
    energy_perf_bias=normal
    
    [audio]
    timeout=10
    
    [video]
    radeon_powersave=dpm-balanced, auto
    
    [scsi_host]
    alpm=medium_power
  3. To activate the profile, use:

    # tuned-adm profile my-profile
  4. Verify that the Tuned profile is active and the system settings are applied:

    $ tuned-adm active
    
    Current active profile: my-profile
    $ tuned-adm verify
    
    Verfication succeeded, current system settings match the preset profile.
    See tuned log file ('/var/log/tuned/tuned.log') for details.

Additional resources

  • tuned.conf(5) man page

3.13. Modifying existing Tuned profiles

This procedure creates a modified child profile based on an existing Tuned profile.

Prerequisites

Procedure

  1. In the /etc/tuned/ directory, create a new directory named the same as the profile that you want to create:

    # mkdir /etc/tuned/modified-profile
  2. In the new directory, create a file named tuned.conf, and set the [main] section as follows:

    [main]
    include=parent-profile

    Replace parent-profile with the name of the profile you are modifying.

  3. Include your profile modifications.

    Example 3.10. Lowering swappiness in the throughput-performance profile

    To use the settings from the throughput-performance profile and change the value of vm.swappiness to 5, instead of the default 10, use:

    [main]
    include=throughput-performance
    
    [sysctl]
    vm.swappiness=5
  4. To activate the profile, use:

    # tuned-adm profile modified-profile
  5. Verify that the Tuned profile is active and the system settings are applied:

    $ tuned-adm active
    
    Current active profile: my-profile
    $ tuned-adm verify
    
    Verfication succeeded, current system settings match the preset profile.
    See tuned log file ('/var/log/tuned/tuned.log') for details.

Additional resources

  • tuned.conf(5) man page

3.14. Setting the disk scheduler using Tuned

This procedure creates and enables a Tuned profile that sets a given disk scheduler for selected block devices. The setting persists across system reboots.

In the following commands and configuration, replace:

  • device with the name of the block device, for example sdf
  • selected-scheduler with the disk scheduler that you want to set for the device, for example bfq

Prerequisites

Procedure

  1. Optional: Select an existing Tuned profile on which your profile will be based. For a list of available profiles, see Tuned profiles distributed with RHEL.

    To see which profile is currently active, use:

    $ tuned-adm active
  2. Create a new directory to hold your Tuned profile:

    # mkdir /etc/tuned/my-profile
  3. Find the system unique identifier of the selected block device:

    $ udevadm info --query=property --name=/dev/device | grep -E '(WWN|SERIAL)'
    
    ID_WWN=0x5002538d00000000_
    ID_SERIAL=Generic-_SD_MMC_20120501030900000-0:0
    ID_SERIAL_SHORT=20120501030900000
    Note

    The command in the this example will return all values identified as a World Wide Name (WWN) or serial number associated with the specified block device. Although it is preferred to use a WWN, the WWN is not always available for a given device and any values returned by the example command are acceptable to use as the device system unique ID.

  4. Create the /etc/tuned/my-profile/tuned.conf configuration file. In the file, set the following options:

    1. Optional: Include an existing profile:

      [main]
      include=existing-profile
    2. Set the selected disk scheduler for the device that matches the WWN identifier:

      [disk]
      devices_udev_regex=IDNAME=device system unique id
      elevator=selected-scheduler

      Here:

      • Replace IDNAME with the name of the identifier being used (for example, ID_WWN).
      • Replace device system unique id with the value of the chosen identifier (for example, 0x5002538d00000000).

        To match multiple devices in the devices_udev_regex option, enclose the identifiers in parentheses and separate them with vertical bars:

        devices_udev_regex=(ID_WWN=0x5002538d00000000)|(ID_WWN=0x1234567800000000)
  5. Enable your profile:

    # tuned-adm profile my-profile

Verification steps

  • Verify that the Tuned profile is active and applied:

    $ tuned-adm active
    
    Current active profile: my-profile
    $ tuned-adm verify
    
    Verification succeeded, current system settings match the preset profile.
    See tuned log file ('/var/log/tuned/tuned.log') for details.

Additional resources

Chapter 4. Reviewing a system using tuna interface

Use the tuna tool to adjust scheduler tunables, tune thread priority, IRQ handlers, and isolate CPU cores and sockets. Tuna reduces the complexity of performing tuning tasks.

The tuna tool performs the following operations:

  • Lists the CPUs on a system
  • Lists the interrupt requests (IRQs) currently running on a system
  • Changes policy and priority information on threads
  • Displays the current policies and priorities of a system

4.1. Installing tuna tool

The tuna tool is designed to be used on a running system. This allows application-specific measurement tools to see and analyze system performance immediately after changes have been made.

This procedure describes how to install the tuna tool.

Procedure

  • Install the tuna tool:

    # yum install tuna

Verification steps

  • View the available tuna CLI options:

    # tuna -h

Additional resources

  • tuna(8) man page

4.2. Viewing the system status using tuna tool

This procedure describes how to view the system status using the tuna command-line interface (CLI) tool.

Prerequisites

Procedure

  • To view the current policies and priorities:

    # tuna --show_threads
                thread
    pid   SCHED_ rtpri affinity             cmd
    1      OTHER     0      0,1            init
    2       FIFO    99        0     migration/0
    3      OTHER     0        0     ksoftirqd/0
    4       FIFO    99        0      watchdog/0
  • To view a specific thread corresponding to a PID or matching a command name:

    # tuna --threads=pid_or_cmd_list --show_threads

    The pid_or_cmd_list argument is a list of comma-separated PIDs or command-name patterns.

  • To tune CPUs using the tuna CLI, see Tuning CPUs using tuna tool.
  • To tune the IRQs using the tuna tool, see Tuning IRQs using tuna tool.
  • To save the changed configuration:

    # tuna --save=filename

    This command saves only currently running kernel threads. Processes that are not running are not saved.

Additional resources

  • tuna(8) man page

4.3. Tuning CPUs using tuna tool

The tuna tool commands can target individual CPUs.

Using the tuna tool, you can:

Isolate CPUs
All tasks running on the specified CPU move to the next available CPU. Isolating a CPU makes it unavailable by removing it from the affinity mask of all threads.
Include CPUs
Allows tasks to run on the specified CPU
Restore CPUs
Restores the specified CPU to its previous configuration.

This procedure describes how to tune CPUs using the tuna CLI.

Prerequisites

Procedure

  • To specify the list of CPUs to be affected by a command:

    # tuna --cpus=cpu_list [command]

    The cpu_list argument is a list of comma-separated CPU numbers. For example, --cpus=0,2. CPU lists can also be specified in a range, for example --cpus=”1-3, which would select CPUs 1, 2, and 3.

    To add a specific CPU to the current cpu_list, for example, use --cpus=+0.

    Replace [command] with, for example, --isolate.

  • To isolate a CPU:

    # tuna --cpus=cpu_list --isolate
  • To include a CPU:

    # tuna --cpus=cpu_list --include
  • To use a system with four or more processors, display how to make all the ssh threads run on CPU 0 and 1, and all the http threads on CPU 2 and 3:

    # tuna --cpus=0,1 --threads=ssh\* \
    --move --cpus=2,3 --threads=http\* --move

    This command performs the following operations sequentially:

    1. Selects CPUs 0 and 1.
    2. Selects all threads that begin with ssh.
    3. Moves the selected threads to the selected CPUs. Tuna sets the affinity mask of threads starting with ssh to the appropriate CPUs. The CPUs can be expressed numerically as 0 and 1, in hex mask as 0x3, or in binary as 11.
    4. Resets the CPU list to 2 and 3.
    5. Selects all threads that begin with http.
    6. Moves the selected threads to the specified CPUs. Tuna sets the affinity mask of threads starting with http to the specified CPUs. The CPUs can be expressed numerically as 2 and 3, in hex mask as 0xC, or in binary as 1100.

Verification steps

  • Display the current configuration and verify that the changes were performed as expected:

    # tuna --threads=gnome-sc\* --show_threads \
    --cpus=0 --move --show_threads --cpus=1 \
    --move --show_threads --cpus=+0 --move --show_threads
    
                           thread       ctxt_switches
         pid SCHED_ rtpri affinity voluntary nonvoluntary             cmd
       3861   OTHER     0      0,1     33997           58 gnome-screensav
                           thread       ctxt_switches
         pid SCHED_ rtpri affinity voluntary nonvoluntary             cmd
       3861   OTHER     0        0     33997           58 gnome-screensav
                           thread       ctxt_switches
         pid SCHED_ rtpri affinity voluntary nonvoluntary             cmd
       3861   OTHER     0        1     33997           58 gnome-screensav
                           thread       ctxt_switches
         pid SCHED_ rtpri affinity voluntary nonvoluntary             cmd
       3861   OTHER     0      0,1     33997           58 gnome-screensav

    This command performs the following operations sequentially:

    1. Selects all threads that begin with the gnome-sc threads.
    2. Displays the selected threads to enable the user to verify their affinity mask and RT priority.
    3. Selects CPU 0.
    4. Moves the gnome-sc threads to the specified CPU, CPU 0.
    5. Shows the result of the move.
    6. Resets the CPU list to CPU 1.
    7. Moves the gnome-sc threads to the specified CPU, CPU 1.
    8. Displays the result of the move.
    9. Adds CPU 0 to the CPU list.
    10. Moves the gnome-sc threads to the specified CPUs, CPUs 0 and 1.
    11. Displays the result of the move.

Additional resources

  • /proc/cpuinfo file
  • tuna(8) man page

4.4. Tuning IRQs using tuna tool

The /proc/interrupts file records the number of interrupts per IRQ, the type of interrupt, and the name of the device that is located at that IRQ.

This procedure describes how to tune the IRQs using the tuna tool.

Prerequisites

Procedure

  • To view the current IRQs and their affinity:

    # tuna --show_irqs
    # users            affinity
    0 timer                   0
    1 i8042                   0
    7 parport0                0
  • To specify the list of IRQs to be affected by a command:

    # tuna --irqs=irq_list [command]

    The irq_list argument is a list of comma-separated IRQ numbers or user-name patterns.

    Replace [command] with, for example, --isolate.

  • To move an interrupt to a specified CPU:

    # tuna --irqs=128 --show_irqs
       # users            affinity
     128 iwlwifi           0,1,2,3
    
    # tuna --irqs=128 --cpus=3 --move

    Replace 128 with the irq_list argument and 3 with the cpu_list argument.

    The cpu_list argument is a list of comma-separated CPU numbers, for example, --cpus=0,2. For more information, see Tuning CPUs using tuna tool.

Verification steps

  • Compare the state of the selected IRQs before and after moving any interrupt to a specified CPU:

    # tuna --irqs=128 --show_irqs
       # users            affinity
     128 iwlwifi                 3

Additional resources

  • /procs/interrupts file
  • tuna(8) man page

Chapter 5. Monitoring performance using RHEL System Roles

As a system administrator, you can use the metrics RHEL System Role with any Ansible Automation Platform control node to monitor the performance of a system.

5.1. Introduction to RHEL System Roles

RHEL System Roles is a collection of Ansible roles and modules. RHEL System Roles provide a configuration interface to remotely manage multiple RHEL systems. The interface enables managing system configurations across multiple versions of RHEL, as well as adopting new major releases.

On Red Hat Enterprise Linux 8, the interface currently consists of the following roles:

  • kdump
  • network
  • selinux
  • storage
  • certificate
  • kernel_settings
  • logging
  • metrics
  • nbde_client and nbde_server
  • timesync
  • tlog

All these roles are provided by the rhel-system-roles package available in the AppStream repository.

Additional resources

5.2. RHEL System Roles terminology

You can find the following terms across this documentation:

System Roles terminology

Ansible playbook
Playbooks are Ansible’s configuration, deployment, and orchestration language. They can describe a policy you want your remote systems to enforce, or a set of steps in a general IT process.
Control node
Any machine with Ansible installed. You can run commands and playbooks, invoking /usr/bin/ansible or /usr/bin/ansible-playbook, from any control node. You can use any computer that has Python installed on it as a control node - laptops, shared desktops, and servers can all run Ansible. However, you cannot use a Windows machine as a control node. You can have multiple control nodes.
Inventory
A list of managed nodes. An inventory file is also sometimes called a “hostfile”. Your inventory can specify information like IP address for each managed node. An inventory can also organize managed nodes, creating and nesting groups for easier scaling. To learn more about inventory, see the Working with Inventory section.
Managed nodes
The network devices, servers, or both that you manage with Ansible. Managed nodes are also sometimes called “hosts”. Ansible is not installed on managed nodes.

5.3. Installing RHEL System Roles in your system

To use the RHEL System Roles, install the required packages in your system.

Prerequisites

Procedure

  1. Install the rhel-system-roles package on the system that you want to use as a control node:

    # yum install rhel-system-roles

    If you do not have a Red Hat Ansible Engine Subscription, you can use a limited supported version of Red Hat Ansible Engine provided with your Red Hat Enterprise Linux subscription. In this case, follow these steps:

    1. Enable the RHEL Ansible Engine repository:

      # subscription-manager refresh
      
      # subscription-manager repos --enable ansible-2-for-rhel-8-x86_64-rpms
    2. Install Ansible Engine:

      # yum install ansible

As a result, you are able to create an Ansible playbook.

Additional resources

5.4. Applying a role

The following procedure describes how to apply a particular role.

Prerequisites

  • Ensure that the rhel-system-roles package is installed on the system that you want to use as a control node:

    # yum install rhel-system-roles
  • You need the ansible package to run playbooks that use RHEL System Roles. Ensure that the Ansible Engine repository is enabled, and the ansible package is installed on the system that you want to use as a control node.

    • If you do not have a Red Hat Ansible Engine Subscription, you can use a limited supported version of Red Hat Ansible Engine provided with your Red Hat Enterprise Linux subscription. In this case, follow these steps:

      1. Enable the RHEL Ansible Engine repository:

        # subscription-manager refresh
        # subscription-manager repos --enable ansible-2-for-rhel-8-x86_64-rpms
      2. Install Ansible Engine:

        # yum install ansible
    • If you have a Red Hat Ansible Engine Subscription, follow the procedure described in How do I Download and Install Red Hat Ansible Engine?.
  • Ensure that you are able to create an Ansible inventory.

    Inventories represent the hosts, host groups, and some of the configuration parameters used by the Ansible playbooks.

    Playbooks are typically human-readable, and are defined in ini, yaml, json, and other file formats.

  • Ensure that you are able to create an Ansible playbook.

    Playbooks represent Ansible’s configuration, deployment, and orchestration language. By using playbooks, you can declare and manage configurations of remote machines, deploy multiple remote machines or orchestrate steps of any manual ordered process.

    A playbook is a list of one or more plays. Every play can include Ansible variables, tasks, or roles.

    Playbooks are human-readable, and are defined in the yaml format.

Procedure

  1. Create the required Ansible inventory containing the hosts and groups that you want to manage. Here is an example using a file called inventory.ini of a group of hosts called webservers:

    [webservers]
    host1
    host2
    host3
  2. Create an Ansible playbook including the required role. The following example shows how to use roles through the roles: option for a playbook:

    The following example shows how to use roles through the roles: option for a given play:

    ---
    - hosts: webservers
      roles:
         - rhel-system-roles.network
         - rhel-system-roles.timesync
    Note

    Every role includes a README file, which documents how to use the role and supported parameter values. You can also find an example playbook for a particular role under the documentation directory of the role. Such documentation directory is provided by default with the rhel-system-roles package, and can be found in the following location:

    /usr/share/doc/rhel-system-roles/SUBSYSTEM/

    Replace SUBSYSTEM with the name of the required role, such as selinux, kdump, network, timesync, or storage.

  3. To execute the playbook on specific hosts, you must perform one of the following:

    • Edit the playbook to use hosts: host1[,host2,…​], or hosts: all, and execute the command:

      # ansible-playbook name.of.the.playbook
    • Edit the inventory to ensure that the hosts you want to use are defined in a group, and execute the command:

      # ansible-playbook -i name.of.the.inventory name.of.the.playbook
    • Specify all hosts when executing the ansible-playbook command:

      # ansible-playbook -i host1,host2,... name.of.the.playbook
      Important

      Be aware that the -i flag specifies the inventory of all hosts that are available. If you have multiple targeted hosts, but want to select a host against which you want to run the playbook, you can add a variable in the playbook to be able to select a host. For example:

      Ansible Playbook | example-playbook.yml:
      
      - hosts: "{{ target_host }}"
        roles:
           - rhel-system-roles.network
           - rhel-system-roles.timesync

      Playbook execution command:

      # ansible-playbook -i host1,..hostn -e target_host=host5 example-playbook.yml

5.5. Introduction to the metrics System Role

RHEL System Roles is a collection of Ansible roles and modules that provide a consistent configuration interface to remotely manage multiple RHEL systems. The metrics System Role configures performance analysis services for the local system and, optionally, includes a list of remote systems to be monitored by the local system. The metrics System Role enables you to use pcp to monitor your systems performance without having to configure pcp separately, as the set-up and deployment of pcp is handled by the playbook.

Table 5.1. Metrics system role variables

Role variableDescriptionExample usage

metrics_monitored_hosts

List of remote hosts to be analyzed by the target host. These hosts will have metrics recorded on the target host, so ensure enough disk space exists below /var/log for each host.

metrics_monitored_hosts: ["webserver.example.com", "database.example.com"]

metrics_retention_days

Configures the number of days for performance data retention before deletion.

metrics_retention_days: 14

metrics_graph_service

A boolean flag that enables the host to be set up with services for performance data visualization via pcp and grafana. Set to false by default.

metrics_graph_service: false

metrics_query_service

A boolean flag that enables the host to be set up with time series query services for querying recorded pcp metrics via redis. Set to false by default.

metrics_query_service: false

metrics_provider

Specifies which metrics collector to use to provide metrics. Currently, pcp is the only supported metrics provider.

metrics_provider: "pcp"

Note

For details about the parameters used in metrics_connections and additional information about the metrics System Role, see the /usr/share/ansible/roles/rhel-system-roles.metrics/README.md file.

5.6. Using the metrics System Role to monitor your local system with visualization

This procedure describes how to use the metrics RHEL System Role to monitor your local system while simultaneously provisioning data visualization via grafana.

Prerequisites

  • You have Red Hat Ansible Engine installed on the machine you want to monitor.
  • You have the rhel-system-roles package installed on the machine you want to monitor.

Procedure

  1. Configure localhost in the the /etc/ansible/hosts Ansible inventory by adding the following content to the inventory:

    localhost ansible_connection=local
  2. Create an Ansible playbook with the following content:

    ---
    - hosts: localhost
      vars:
        metrics_graph_service: yes
      roles:
        - rhel-system-roles.metrics
  3. Run the Ansible playbook:

    # ansible-playbook name_of_your_playbook.yml
    Note

    Since the metrics_graph_service boolean is set to value="yes", grafana is automatically installed and provisioned with pcp added as a data source.

  4. To view visualization of the metrics being collected on your machine, access the grafana web interface as described in Accessing the Grafana web UI.

5.7. Using the metrics System Role to setup a fleet of individual systems to monitor themselves

This procedure describes how to use the metrics System Role to set up a fleet of machines to monitor themselves.

Prerequisites

  • You have Red Hat Ansible Engine installed on the machine you want to use to run the playbook.
  • You have the rhel-system-roles package installed on the machine you want to use to run the playbook.

Procedure

  1. Add the name or IP of the machines you wish to monitor via the playbook to the /etc/ansible/hosts Ansible inventory file under an identifying group name enclosed in brackets:

    [remotes]
    webserver.example.com
    database.example.com
  2. Create an Ansible playbook with the following content:

    ---
    - hosts: remotes
      vars:
        metrics_retention_days: 0
      roles:
        - rhel-system-roles.metrics
  3. Run the Ansible playbook:

    # ansible-playbook name_of_your_playbook.yml

5.8. Using the metrics System Role to monitor a fleet of machines centrally via your local machine

This procedure describes how to use the metrics System Role to set up your local machine to centrally monitor a fleet of machines while also provisioning visualization of the data via grafana and querying of the data via redis.

Prerequisites

  • You have Red Hat Ansible Engine installed on the machine you want to use to run the playbook.
  • You have the rhel-system-roles package installed on the machine you want to use to run the playbook.

Procedure

  1. Create an Ansible playbook with the following content:

    ---
    - hosts: localhost
      vars:
        metrics_graph_service: yes
        metrics_query_service: yes
        metrics_retention_days: 10
        metrics_monitored_hosts: ["database.example.com", "webserver.example.com"]
      roles:
        - rhel-system-roles.metrics
  2. Run the Ansible playbook:

    # ansible-playbook name_of_your_playbook.yml
    Note

    Since the metrics_graph_service and metrics_query_service booleans are set to value="yes", grafana is automatically installed and provisioned with pcp added as a data source with the pcp data recording indexed into redis, allowing the pcp querying language to be used for complex querying of the data.

  3. To view graphical representation of the metrics being collected centrally by your machine and to query the data, access the grafana web interface as described in Accessing the Grafana web UI.

5.9. Setting up authentication while monitoring a system using the metrics System Role

PCP supports the scram-sha-256 authentication mechanism through the Simple Authentication Security Layer (SASL) framework. The metrics RHEL System Role automates the steps to setup authentication using the scram-sha-256 authentication mechanism. This procedure describes how to setup authentication using the metrics RHEL System Role.

Prerequisites

  • You have Red Hat Ansible Engine installed on the machine you want to use to run the playbook.
  • You have the rhel-system-roles package installed on the machine you want to use to run the playbook.

Procedure

  1. Include the following variables in the Ansible playbook you want to setup authentication for:

    ---
      vars:
        metrics_username: your_username
        metrics_password: your_password
  2. Run the Ansible playbook:

    # ansible-playbook name_of_your_playbook.yml

Verification steps

  • Verify the sasl configuration:

    # pminfo -f -h "pcp://127.0.0.1?username=your_username" disk.dev.read
    Password:
    disk.dev.read
    inst [0 or "sda"] value 19540

5.10. Using the metrics System Role to configure and enable metrics collection for SQL Server

This procedure describes how to use the metrics RHEL System Role to automate the configuration and enabling of metrics collection for Microsoft SQL Server via pcp on your local system.

Prerequisites

  • You have Red Hat Ansible Engine installed on the machine you want to monitor.
  • You have the rhel-system-roles package installed on the machine you want to monitor.
  • You have installed Microsoft SQL Server for Red Hat Enterprise Linux and established a 'trusted' connection to an SQL server.
  • You have installed the Microsoft ODBC driver for SQL Server for Red Hat Enterprise Linux.

Procedure

  1. Configure localhost in the the /etc/ansible/hosts Ansible inventory by adding the following content to the inventory:

    localhost ansible_connection=local
  2. Create an Ansible playbook that contains the following content:

    ---
    - hosts: localhost
      roles:
        - role: rhel-system-roles.metrics
          vars:
            metrics_from_sql: yes
  3. Run the Ansible playbook:

    # ansible-playbook name_of_your_playbook.yml

Verification steps

  • Use the pcp command to verify that SQL Server PMDA agent (mssql) is loaded and running:

    # pcp
    platform: Linux rhel82-2.local 4.18.0-167.el8.x86_64 #1 SMP Sun Dec 15 01:24:23 UTC 2019 x86_64
     hardware: 2 cpus, 1 disk, 1 node, 2770MB RAM
     timezone: PDT+7
     services: pmcd pmproxy
         pmcd: Version 5.0.2-1, 12 agents, 4 clients
         pmda: root pmcd proc pmproxy xfs linux nfsclient mmv kvm mssql
               jbd2 dm
     pmlogger: primary logger: /var/log/pcp/pmlogger/rhel82-2.local/20200326.16.31
         pmie: primary engine: /var/log/pcp/pmie/rhel82-2.local/pmie.log


[1] This documentation is installed automatically with the rhel-system-roles package.

Chapter 6. Setting up PCP

Performance Co-Pilot (PCP) is a suite of tools, services, and libraries for monitoring, visualizing, storing, and analyzing system-level performance measurements.

This section describes how to install and enable PCP on your system.

6.1. Overview of PCP

You can add performance metrics using Python, Perl, C++, and C interfaces. Analysis tools can use the Python, C++, C client APIs directly, and rich web applications can explore all available performance data using a JSON interface.

You can analyze data patterns by comparing live results with archived data.

Features of PCP:

  • Light-weight distributed architecture, which is useful during the centralized analysis of complex systems.
  • It allows the monitoring and management of real-time data.
  • It allows logging and retrieval of historical data.

PCP has the following components:

  • The Performance Metric Collector Daemon (pmcd) collects performance data from the installed Performance Metric Domain Agents (pmda). PMDAs can be individually loaded or unloaded on the system and are controlled by the PMCD on the same host.
  • Various client tools, such as pminfo or pmstat, can retrieve, display, archive, and process this data on the same host or over the network.
  • The pcp package provides the command-line tools and underlying functionality.
  • The pcp-gui package provides the graphical application. Install the pcp-gui package by executing the yum install pcp-gui command. For more information, see Visually tracing PCP log archives with the PCP Charts application.

6.2. Installing and enabling PCP

To begin using PCP, install all the required packages and enable the PCP monitoring services.

This procedure describes how to install PCP using the pcp package. If you want to automate the PCP installation, install it using the pcp-zeroconf package. For more information on installing PCP by using pcp-zeroconf, see Setting up PCP with pcp-zeroconf.

Procedure

  1. Install the pcp package:

    # yum install pcp
  2. Enable and start the pmcd service on the host machine:

    # systemctl enable pmcd
    
    # systemctl start pmcd

Verification steps

  • Verify if the pmcd process is running on the host:

    # pcp
    
    Performance Co-Pilot configuration on workstation:
    
    platform: Linux workstation 4.18.0-80.el8.x86_64 #1 SMP Wed Mar 13 12:02:46 UTC 2019 x86_64
    hardware: 12 cpus, 2 disks, 1 node, 36023MB RAM
    timezone: CEST-2
    services: pmcd
    pmcd: Version 4.3.0-1, 8 agents
    pmda: root pmcd proc xfs linux mmv kvm jbd2

Additional resources

6.3. Deploying a minimal PCP setup

The minimal PCP setup collects performance statistics on Red Hat Enterprise Linux. The setup involves adding the minimum number of packages on a production system needed to gather data for further analysis.

You can analyze the resulting tar.gz file and the archive of the pmlogger output using various PCP tools and compare them with other sources of performance information.

Prerequisites

Procedure

  1. Update the pmlogger configuration:

    # pmlogconf -r /var/lib/pcp/config/pmlogger/config.default
  2. Start the pmcd and pmlogger services:

    # systemctl start pmcd.service
    
    # systemctl start pmlogger.service
  3. Execute the required operations to record the performance data.
  4. Stop the pmcd and pmlogger services:

    # systemctl stop pmcd.service
    
    # systemctl stop pmlogger.service
  5. Save the output and save it to a tar.gz file named based on the host name and the current date and time:

    # cd /var/log/pcp/pmlogger/
    
    # tar -czf $(hostname).$(date +%F-%Hh%M).pcp.tar.gz $(hostname)

    Extract this file and analyze the data using PCP tools.

Additional resources

6.4. System services distributed with PCP

The following table describes roles of various system services, which are distributed with PCP.

Table 6.1. Roles of system services distributed with PCP

Name

Description

pmcd

The Performance Metric Collector Daemon (PMCD).

pmie

The Performance Metrics Inference Engine.

pmlogger

The performance metrics logger.

pmproxy

The realtime and historical performance metrics proxy, time series query and REST API service.

6.5. Tools distributed with PCP

The following table describes usage of various tools, which are distributed with PCP.

Table 6.2. Usage of tools distributed with PCP

Name

Description

pcp

Displays the current status of a Performance Co-Pilot installation.

pcp-atop

Shows the system-level occupation of the most critical hardware resources from the performance point of view: CPU, memory, disk, and network.

pcp-atopsar

Generates a system-level activity report over a variety of system resource utilization. The report is generated from a raw logfile previously recorded using pmlogger or the -w option of pcp-atop.

pcp-dmcache

Displays information about configured Device Mapper Cache targets, such as: device IOPs, cache and metadata device utilization, as well as hit and miss rates and ratios for both reads and writes for each cache device.

pcp-dstat

Displays metrics of one system at a time. To display metrics of multiple systems, use --host option.

pcp-free

Reports on free and used memory in a system.

pcp-htop

Displays all processes running on a system along with their command line arguments in a manner similar to the top command, but allows you to scroll vertically and horizontally as well as interact using a mouse. You can also view processes in a tree format and select and act on multiple processes at once.

pcp-ipcs

Displays information on the inter-process communication (IPC) facilities that the calling process has read access for.

pcp-numastat

Displays NUMA allocation statistics from the kernel memory allocator.

pcp-pidstat

Displays information about individual tasks or processes running on the system such as: CPU percentage, memory and stack usage, scheduling, and priority. Reports live data for the local host by default.

pcp-ss

Displays socket statistics collected by the pmdasockets Performance Metrics Domain Agent (PMDA).

pcp-uptime

Displays how long the system has been running, how many users are currently logged on, and the system load averages for the past 1, 5, and 15 minutes.

pcp-vmstat

Provides a high-level system performance overview every 5 seconds. Displays information about processes, memory, paging, block IO, traps, and CPU activity.

pmchart

Plots performance metrics values available through the facilities of the Performance Co-Pilot.

pmclient

Displays high-level system performance metrics by using the Performance Metrics Application Programming Interface (PMAPI).

pmconfig

Displays the values of configuration parameters.

pmdbg

Displays available Performance Co-Pilot debug control flags and their values.

pmdiff

Compares the average values for every metric in either one or two archives, in a given time window, for changes that are likely to be of interest when searching for performance regressions.

pmdumplog

Displays control, metadata, index, and state information from a Performance Co-Pilot archive file.

pmdumptext

Outputs the values of performance metrics collected live or from a Performance Co-Pilot archive.

pmerr

Displays available Performance Co-Pilot error codes and their corresponding error messages.

pmfind

Finds PCP services on the network.

pmie

An inference engine that periodically evaluates a set of arithmetic, logical, and rule expressions. The metrics are collected either from a live system, or from a Performance Co-Pilot archive file.

pmieconf

Displays or sets configurable pmie variables.

pmiectl

Manages non-primary instances of pmie.

pminfo

Displays information about performance metrics. The metrics are collected either from a live system, or from a Performance Co-Pilot archive file.

pmiostat

Reports I/O statistics for SCSI devices (by default) or device-mapper devices (with the -x dm option).

pmlc

Interactively configures active pmlogger instances.

pmlogcheck

Identifies invalid data in a Performance Co-Pilot archive file.

pmlogconf

Creates and modifies a pmlogger configuration file.

pmlogctl

Manages non-primary instances of pmlogger.

pmloglabel

Verifies, modifies, or repairs the label of a Performance Co-Pilot archive file.

pmlogsummary

Calculates statistical information about performance metrics stored in a Performance Co-Pilot archive file.

pmprobe

Determines the availability of performance metrics.

pmrep

Reports on selected, easily customizable, performance metrics values.

pmsocks

Allows access to a Performance Co-Pilot hosts through a firewall.

pmstat

Periodically displays a brief summary of system performance.

pmstore

Modifies the values of performance metrics.

pmtrace

Provides a command line interface to the trace PMDA.

pmval

Displays the current value of a performance metric.

6.6. PCP deployment architectures

Performance Co-Pilot (PCP) offers many options to accomplish advanced setups. From the huge variety of possible architectures, this section describes how to scale your PCP deployment based on the recommended deployment set up by Red Hat, sizing factors, and configuration options.

PCP supports multiple deployment architectures, based on the scale of the PCP deployment.

Available scaling deployment setup variants:

Note

Since the PCP version 5.3.0 is unavailable in Red Hat Enterprise Linux 8.4 and the prior minor versions of Red Hat Enterprise Linux 8, Red Hat recommends localhost and pmlogger farm architectures.

For more information about known memory leaks in pmproxy in PCP versions before 5.3.0, see Memory leaks in pmproxy in PCP.

Localhost

Each service runs locally on the monitored machine. When you start a service without any configuration changes, this is the default deployment. Scaling beyond the individual node is not possible in this case.

By default, the deployment setup for Redis is standalone, localhost. However, Redis can optionally perform in a highly-available and highly scalable clustered fashion, where data is shared across multiple hosts. Another viable option is to deploy a Redis cluster in the cloud, or to utilize a managed Redis cluster from a cloud vendor.

Decentralized

The only difference between localhost and decentralized setup is the centralized Redis service. In this model, the host executes pmlogger service on each monitored host and retrieves metrics from a local pmcd instance. A local pmproxy service then exports the performance metrics to a central Redis instance.

Figure 6.1. Decentralized logging

Decentralized logging
Centralized logging - pmlogger farm

When the resource usage on the monitored hosts is constrained, another deployment option is a pmlogger farm, which is also known as centralized logging. In this setup, a single logger host executes multiple pmlogger processes, and each is configured to retrieve performance metrics from a different remote pmcd host. The centralized logger host is also configured to execute the pmproxy service, which discovers the resulting PCP archives logs and loads the metric data into a Redis instance.

Figure 6.2. Centralized logging - pmlogger farm

Centralized logging - pmlogger farm
Federated - multiple pmlogger farms

For large scale deployments, Red Hat recommends to deploy multiple pmlogger farms in a federated fashion. For example, one pmlogger farm per rack or data center. Each pmlogger farm loads the metrics into a central Redis instance.

Figure 6.3. Federated - multiple pmlogger farms

Federated - multiple pmlogger farms
Note

By default, the deployment setup for Redis is standalone, localhost. However, Redis can optionally perform in a highly-available and highly scalable clustered fashion, where data is shared across multiple hosts. Another viable option is to deploy a Redis cluster in the cloud, or to utilize a managed Redis cluster from a cloud vendor.

Additional resources

6.8. Sizing factors

The following are the sizing factors required for scaling:

Remote system size
The number of CPUs, disks, network interfaces, and other hardware resources affects the amount of data collected by each pmlogger on the centralized logging host.
Logged Metrics
The number and types of logged metrics play an important role. In particular, the per-process proc.* metrics require a large amount of disk space, for example, with the standard pcp-zeroconf setup, 10s logging interval, 11 MB without proc metrics versus 155 MB with proc metrics - a factor of 10 times more. Additionally, the number of instances for each metric, for example the number of CPUs, block devices, and network interfaces also impacts the required storage capacity.
Logging Interval
The interval how often metrics are logged, affects the storage requirements. The expected daily PCP archive file sizes are written to the pmlogger.log file for each pmlogger instance. These values are uncompressed estimates. Since PCP archives compress very well, approximately 10:1, the actual long term disk space requirements can be determined for a particular site.
pmlogrewrite
After every PCP upgrade, the pmlogrewrite tool is executed and rewrites old archives if there were changes in the metric metadata from the previous version and the new version of PCP. This process duration scales linear with the number of archives stored.

Additional resources

  • pmlogrewrite(1) and pmlogger(1) man pages

6.9. Configuration options for PCP scaling

The following are the configuration options, which are required for scaling:

sysctl and rlimit settings
When archive discovery is enabled, pmproxy requires four descriptors for every pmlogger that it is monitoring or log-tailing, along with the additional file descriptors for the service logs and pmproxy client sockets, if any. Each pmlogger process uses about 20 file descriptors for the remote pmcd socket, archive files, service logs, and others. In total, this can exceed the default 1024 soft limit on a system running around 200 pmlogger processes. The pmproxy service in pcp-5.3.0 and later automatically increases the soft limit to the hard limit. On earlier versions of PCP, tuning is required if a high number of pmlogger processes are to be deployed, and this can be accomplished by increasing the soft or hard limits for pmlogger. For more information, see How to set limits (ulimit) for services run by systemd.
Local Archives
The pmlogger service stores metrics of local and remote pmcds in the /var/log/pcp/pmlogger/ directory. To control the logging interval of the local system, update the /etc/pcp/pmlogger/control.d/configfile file and add -t X in the arguments, where X is the logging interval in seconds. To configure which metrics should be logged, execute pmlogconf /var/lib/pcp/config/pmlogger/config.clienthostname. This command deploys a configuration file with a default set of metrics, which can optionally be further customized. To specify retention settings, that is when to purge old PCP archives, update the /etc/sysconfig/pmlogger_timers file and specify PMLOGGER_DAILY_PARAMS="-E -k X", where X is the amount of days to keep PCP archives.
Redis

The pmproxy service sends logged metrics from pmlogger to a Redis instance. The following are the available two options to specify the retention settings in the /etc/pcp/pmproxy/pmproxy.conf configuration file:

  • stream.expire specifies the duration when stale metrics should be removed, that is metrics which were not updated in a specified amount of time in seconds.
  • stream.maxlen specifies the maximum number of metric values for one metric per host. This setting should be the retention time divided by the logging interval, for example 20160 for 14 days of retention and 60s logging interval (60*60*24*14/60)

Additional resources

  • pmproxy(1), pmlogger(1), and sysctl(8) man pages

Chapter 7. Logging performance data with pmlogger

With the PCP tool you can log the performance metric values and replay them later. This allows you to perform a retrospective performance analysis.

Using the pmlogger tool, you can:

  • Create the archived logs of selected metrics on the system
  • Specify which metrics are recorded on the system and how often

7.1. Modifying the pmlogger configuration file with pmlogconf

When the pmlogger service is running, PCP logs a default set of metrics on the host.

Use the pmlogconf utility to check the default configuration. If the pmlogger configuration file does not exist, pmlogconf creates it with a default metric values.

Prerequisites

Procedure

  1. Create or modify the pmlogger configuration file:

    # pmlogconf -r /var/lib/pcp/config/pmlogger/config.default
  2. Follow pmlogconf prompts to enable or disable groups of related performance metrics and to control the logging interval for each enabled group.

Additional resources

7.2. Editing the pmlogger configuration file manually

To create a tailored logging configuration with specific metrics and given intervals, edit the pmlogger configuration file manually. The default pmlogger configuration file is /var/lib/pcp/config/pmlogger/config.default. The configuration file specifies which metrics are logged by the primary logging instance.

In manual configuration, you can:

  • Record metrics which are not listed in the automatic configuration.
  • Choose custom logging frequencies.
  • Add PMDA with the application metrics.

Prerequisites

Procedure

  • Open and edit the /var/lib/pcp/config/pmlogger/config.default file to add specific metrics:

    # It is safe to make additions from here on ...
    #
    
    log mandatory on every 5 seconds {
        xfs.write
        xfs.write_bytes
        xfs.read
        xfs.read_bytes
    }
    
    log mandatory on every 10 seconds {
        xfs.allocs
        xfs.block_map
        xfs.transactions
        xfs.log
    
    }
    
    [access]
    disallow * : all;
    allow localhost : enquire;

Additional resources

7.3. Enabling the pmlogger service

The pmlogger service must be started and enabled to log the metric values on the local machine.

This procedure describes how to enable the pmlogger service.

Prerequisites

Procedure

  • Start and enable the pmlogger service:

    # systemctl start pmlogger
    
    # systemctl enable pmlogger

Verification steps

  • Verify if the pmlogger service is enabled:

    # pcp
    
    Performance Co-Pilot configuration on workstation:
    
    platform: Linux workstation 4.18.0-80.el8.x86_64 #1 SMP Wed Mar 13 12:02:46 UTC 2019 x86_64
    hardware: 12 cpus, 2 disks, 1 node, 36023MB RAM
    timezone: CEST-2
    services: pmcd
    pmcd: Version 4.3.0-1, 8 agents, 1 client
    pmda: root pmcd proc xfs linux mmv kvm jbd2
    pmlogger: primary logger: /var/log/pcp/pmlogger/workstation/20190827.15.54

Additional resources

7.4. Setting up a client system for metrics collection

This procedure describes how to set up a client system so that a central server can collect metrics from clients running PCP.

Prerequisites

Procedure

  1. Install the pcp-system-tools package:

    # yum install pcp-system-tools
  2. Configure an IP address for pmcd:

    # echo "-i 192.168.4.62" >>/etc/pcp/pmcd/pmcd.options

    Replace 192.168.4.62 with the IP address, the client should listen on.

    By default, pmcd is listening on the localhost.

  3. Configure the firewall to add the public zone permanently:

    # firewall-cmd --permanent --zone=public --add-port=44321/tcp
    success
    
    # firewall-cmd --reload
    success
  4. Set an SELinux boolean:

    # setsebool -P pcp_bind_all_unreserved_ports on
  5. Enable the pmcd and pmlogger services:

    # systemctl enable pmcd pmlogger
    # systemctl restart pmcd pmlogger

Verification steps

  • Verify if the pmcd is correctly listening on the configured IP address:

    # ss -tlp | grep 44321
    LISTEN   0   5     127.0.0.1:44321   0.0.0.0:*   users:(("pmcd",pid=151595,fd=6))
    LISTEN   0   5  192.168.4.62:44321   0.0.0.0:*   users:(("pmcd",pid=151595,fd=0))
    LISTEN   0   5         [::1]:44321      [::]:*   users:(("pmcd",pid=151595,fd=7))

Additional resources

7.5. Setting up a central server to collect data

This procedure describes how to create a central server to collect metrics from clients running PCP.

Prerequisites

Procedure

  1. Install the pcp-system-tools package:

    # yum install pcp-system-tools
  2. Create the /etc/pcp/pmlogger/control.d/remote file with the following content:

    # DO NOT REMOVE OR EDIT THE FOLLOWING LINE
    $version=1.1
    
    192.168.4.13 n n PCP_LOG_DIR/pmlogger/rhel7u4a -r -T24h10m -c config.rhel7u4a
    192.168.4.14 n n PCP_LOG_DIR/pmlogger/rhel6u10a -r -T24h10m -c config.rhel6u10a
    192.168.4.62 n n PCP_LOG_DIR/pmlogger/rhel8u1a -r -T24h10m -c config.rhel8u1a

    Replace 192.168.4.13, 192.168.4.14, and 192.168.4.62 with the client IP addresses.

  3. Enable the pmcd and pmlogger services:

    # systemctl enable pmcd pmlogger
    # systemctl restart pmcd pmlogger

Verification steps

  • Ensure that you can access the latest archive file from each directory:

    # for i in /var/log/pcp/pmlogger/rhel*/*.0; do pmdumplog -L $i; done
    Log Label (Log Format Version 2)
    Performance metrics from host rhel6u10a.local
      commencing Mon Nov 25 21:55:04.851 2019
      ending     Mon Nov 25 22:06:04.874 2019
    Archive timezone: JST-9
    PID for pmlogger: 24002
    Log Label (Log Format Version 2)
    Performance metrics from host rhel7u4a
      commencing Tue Nov 26 06:49:24.954 2019
      ending     Tue Nov 26 07:06:24.979 2019
    Archive timezone: CET-1
    PID for pmlogger: 10941
    [..]

    The archive files from the /var/log/pcp/pmlogger/ directory can be used for further analysis and graphing.

Additional resources

7.6. Replaying the PCP log archives with pmdumptext

After recording the metric data, you can replay the PCP log archives. To export the logs to text files and import them into spreadsheets, use PCP utilities such as pmdumptext, pmrep, or pmlogsummary.

Using the pmdumptext tool, you can:

  • View the log files
  • Parse the selected PCP log archive and export the values into an ASCII table
  • Extract the entire archive log or only select metric values from the log by specifying individual metrics on the command line

Prerequisites

Procedure

  • Display the data on the metric:

    $ pmdumptext -t 5seconds -H -a 20170605 xfs.perdev.log.writes
    
    Time local::xfs.perdev.log.writes["/dev/mapper/fedora-home"] local::xfs.perdev.log.writes["/dev/mapper/fedora-root"]
    ? 0.000 0.000
    none count / second count / second
    Mon Jun 5 12:28:45 ? ?
    Mon Jun 5 12:28:50 0.000 0.000
    Mon Jun 5 12:28:55 0.200 0.200
    Mon Jun 5 12:29:00 6.800 1.000

    The mentioned example displays the data on the xfs.perdev.log metric collected in an archive at a 5 second interval and display all the headers.

Additional resources

Chapter 8. Monitoring performance with Performance Co-Pilot

Performance Co-Pilot (PCP) is a suite of tools, services, and libraries for monitoring, visualizing, storing, and analyzing system-level performance measurements.

As a system administrator, you can monitor the system’s performance using the the PCP application in Red Hat Enterprise Linux 8.

8.1. Monitoring postfix with pmda-postfix

This procedure describes how to monitor performance metrics of the postfix mail server with pmda-postfix. It helps to check how many emails are received per second.

Prerequisites

Procedure

  1. Install the following packages:

    1. Install the pcp-system-tools:

      # yum install pcp-system-tools
    2. Install the pmda-postfix package to monitor postfix:

      # yum install pcp-pmda-postfix postfix
    3. Install the logging daemon:

      # yum install rsyslog
    4. Install the mail client for testing:

      # yum install mutt
  2. Enable the postfix and rsyslog services:

    # systemctl enable postfix rsyslog
    # systemctl restart postfix rsyslog
  3. Enable the SELinux boolean, so that pmda-postfix can access the required log files:

    # setsebool -P pcp_read_generic_logs=on
  4. Install the PMDA:

    # cd /var/lib/pcp/pmdas/postfix/
    
    # ./Install
    
    Updating the Performance Metrics Name Space (PMNS) ...
    Terminate PMDA if already installed ...
    Updating the PMCD control file, and notifying PMCD ...
    Waiting for pmcd to terminate ...
    Starting pmcd ...
    Check postfix metrics have appeared ... 7 metrics and 58 values

Verification steps

  • Verify the pmda-postfix operation:

    echo testmail | mutt root
  • Verify the available metrics:

    # pminfo postfix
    
    postfix.received
    postfix.sent
    postfix.queues.incoming
    postfix.queues.maildrop
    postfix.queues.hold
    postfix.queues.deferred
    postfix.queues.active

Additional resources

8.2. Visually tracing PCP log archives with the PCP Charts application

After recording metric data, you can replay the PCP log archives as graphs. The metrics are sourced from one or more live hosts with alternative options to use metric data from PCP log archives as a source of historical data. To customize the PCP Charts application interface to display the data from the performance metrics, you can use line plot, bar graphs, or utilization graphs.

Using the PCP Charts application, you can:

  • Replay the data in the PCP Charts application application and use graphs to visualize the retrospective data alongside live data of the system.
  • Plot performance metric values into graphs.
  • Display multiple charts simultaneously.

Prerequisites

Procedure

  1. Launch the PCP Charts application from the command line:

    # pmchart

    Figure 8.1. PCP Charts application

    pmchart started

    The pmtime server settings are located at the bottom. The start and pause button allows you to control:

    • The interval in which PCP polls the metric data
    • The date and time for the metrics of historical data
  2. Click File and then New Chart to select metric from both the local machine and remote machines by specifying their host name or address. Advanced configuration options include the ability to manually set the axis values for the chart, and to manually choose the color of the plots.
  3. Record the views created in the PCP Charts application:

    Following are the options to take images or record the views created in the PCP Charts application:

    • Click File and then Export to save an image of the current view.
    • Click Record and then Start to start a recording. Click Record and then Stop to stop the recording. After stopping the recording, the recorded metrics are archived to be viewed later.
  4. Optional: In the PCP Charts application, the main configuration file, known as the view, allows the metadata associated with one or more charts to be saved. This metadata describes all chart aspects, including the metrics used and the chart columns. Save the custom view configuration by clicking File and then Save View, and load the view configuration later.

    The following example of the PCP Charts application view configuration file describes a stacking chart graph showing the total number of bytes read and written to the given XFS file system loop1:

    #kmchart
    version 1
    
    chart title "Filesystem Throughput /loop1" style stacking antialiasing off
        plot legend "Read rate"   metric xfs.read_bytes   instance  "loop1"
        plot legend "Write rate"  metric xfs.write_bytes  instance  "loop1"

Additional resources

8.3. Collecting data from SQL server using PCP

With Red Hat Enterprise Linux 8.2 or later, the SQL Server agent is available in Performance Co-Pilot (PCP), which helps you to monitor and analyze database performance issues.

This procedure describes how to collect data for Microsoft SQL Server via pcp on your system.

Prerequisites

  • You have installed Microsoft SQL Server for Red Hat Enterprise Linux and established a 'trusted' connection to an SQL server.
  • You have installed the Microsoft ODBC driver for SQL Server for Red Hat Enterprise Linux.

Procedure

  1. Install PCP:

    # yum install pcp-zeroconf
  2. Install packages required for the pyodbc driver:

    # yum install gcc-c++ python3-devel unixODBC-devel
    
    $ pip3 install pyodbc
  3. Install the mssql agent:

    1. Install the Microsoft SQL Server domain agent for PCP:

      # yum install pcp-pmda-mssql
    2. Edit the /var/lib/pcp/pmdas/mssql/mssql.conf file to configure the SQL server account’s username and password for the mssql agent. Ensure that the account you configure has access rights to performance data.

      username: user_name
      password: user_password

      Replace user_name with the SQL Server account and user_password with the SQL Server user password for this account.

  4. Add the required permissions:

    # chown root:root /var/lib/pcp/pmdas/mssql/mssql.conf
    # chmod 400 /var/lib/pcp/pmdas/mssql/mssql.conf
  5. Install the agent:

    # cd /var/lib/pcp/pmdas/mssql
    # ./Install
    Updating the Performance Metrics Name Space (PMNS) ...
    Terminate PMDA if already installed ...
    Updating the PMCD control file, and notifying PMCD ...
    Check mssql metrics have appeared ... 168 metrics and 598 values
    [...]

Verification steps

  • Using the pcp command, verify if the SQL Server PMDA (mssql) is loaded and running:

    $ pcp
    Performance Co-Pilot configuration on rhel.local:
    
    platform: Linux rhel.local 4.18.0-167.el8.x86_64 #1 SMP Sun Dec 15 01:24:23 UTC 2019 x86_64
     hardware: 2 cpus, 1 disk, 1 node, 2770MB RAM
     timezone: PDT+7
     services: pmcd pmproxy
         pmcd: Version 5.0.2-1, 12 agents, 4 clients
         pmda: root pmcd proc pmproxy xfs linux nfsclient mmv kvm mssql
               jbd2 dm
     pmlogger: primary logger: /var/log/pcp/pmlogger/rhel.local/20200326.16.31
         pmie: primary engine: /var/log/pcp/pmie/rhel.local/pmie.log
  • View the complete list of metrics that PCP can collect from the SQL Server:

    # pminfo mssql
  • After viewing the list of metrics, you can report the rate of transactions. For example, to report on the overall transaction count per second, over a five second time window:

    # pmval -t 1 -T 5 mssql.databases.transactions
  • View the graphical chart of these metrics on your system by using the pmchart command. For more information, see Visually tracing PCP log archives with the PCP Charts application.

Additional resources

Chapter 9. Performance analysis of XFS with PCP

The XFS PMDA ships as part of the pcp package and is enabled by default during the installation. It is used to gather performance metric data of XFS file systems in Performance Co-Pilot (PCP).

This section describes how to analyze XFS file system’s performance using PCP.

9.1. Installing XFS PMDA manually

If the XFS PMDA is not listed in the pcp configuration output, install the PMDA agent manually.

This procedure describes how to manually install the PMDA agent.

Prerequisites

Procedure

  1. Navigate to the xfs directory:

    # cd /var/lib/pcp/pmdas/xfs/
  2. Install the XFS PMDA manually:

    xfs]# ./Install
    
    You will need to choose an appropriate configuration for install of
    the “xfs” Performance Metrics Domain Agent (PMDA).
    
      collector     collect performance statistics on this system
      monitor       allow this system to monitor local and/or remote systems
      both          collector and monitor configuration for this system
    
    Please enter c(ollector) or m(onitor) or (both) [b]
    Updating the Performance Metrics Name Space (PMNS) ...
    Terminate PMDA if already installed ...
    Updating the PMCD control file, and notifying PMCD ...
    Waiting for pmcd to terminate ...
    Starting pmcd ...
    Check xfs metrics have appeared ... 149 metrics and 149 values
  3. Select the intended PMDA role by entering c for collector, m for monitor, or b for both. The PMDA installation script prompts you to specify one of the following PMDA roles:

    • The collector role allows the collection of performance metrics on the current system
    • The monitor role allows the system to monitor local systems, remote systems, or both

      The default option is both collector and monitor, which allows the XFS PMDA to operate correctly in most scenarios.

Verification steps

  • Verify that the pmcd process is running on the host and the XFS PMDA is listed as enabled in the configuration:

    # pcp
    
    Performance Co-Pilot configuration on workstation:
    
    platform: Linux workstation 4.18.0-80.el8.x86_64 #1 SMP Wed Mar 13 12:02:46 UTC 2019 x86_64
    hardware: 12 cpus, 2 disks, 1 node, 36023MB RAM
    timezone: CEST-2
    services: pmcd
    pmcd: Version 4.3.0-1, 8 agents
    pmda: root pmcd proc xfs linux mmv kvm jbd2

Additional resources

9.2. Examining XFS performance metrics with pminfo

PCP enables XFS PMDA to allow the reporting of certain XFS metrics per each of the mounted XFS file systems. This makes it easier to pinpoint specific mounted file system issues and evaluate performance.

The pminfo command provides per-device XFS metrics for each mounted XFS file system.

This procedure displays a list of all available metrics provided by the XFS PMDA.

Prerequisites

Procedure

  • Display the list of all available metrics provided by the XFS PMDA:

    # pminfo xfs
  • Display information for the individual metrics. The following examples examine specific XFS read and write metrics using the pminfo tool:

    • Display a short description of the xfs.write_bytes metric:

      # pminfo --oneline xfs.write_bytes
      
      xfs.write_bytes [number of bytes written in XFS file system write operations]
    • Display a long description of the xfs.read_bytes metric:

      # pminfo --helptext xfs.read_bytes
      
      xfs.read_bytes
      Help:
      This is the number of bytes read via read(2) system calls to files in
      XFS file systems. It can be used in conjunction with the read_calls
      count to calculate the average size of the read operations to file in
      XFS file systems.
    • Obtain the current performance value of the xfs.read_bytes metric:

      # pminfo --fetch xfs.read_bytes
      
      xfs.read_bytes
          value 4891346238
    • Obtain per-device XFS metrics with pminfo:

      # pminfo --fetch --oneline xfs.perdev.read xfs.perdev.write
      
      xfs.perdev.read [number of XFS file system read operations]
      inst [0 or "loop1"] value 0
      inst [0 or "loop2"] value 0
      
      xfs.perdev.write [number of XFS file system write operations]
      inst [0 or "loop1"] value 86
      inst [0 or "loop2"] value 0

Additional resources

9.3. Resetting XFS performance metrics with pmstore

With PCP, you can modify the values of certain metrics, especially if the metric acts as a control variable, such as the xfs.control.reset metric. To modify a metric value, use the pmstore tool.

This procedure describes how to reset XFS metrics using the pmstore tool.

Prerequisites

Procedure

  1. Display the value of a metric:

    $ pminfo -f xfs.write
    
    xfs.write
        value 325262
  2. Reset all the XFS metrics:

    # pmstore xfs.control.reset 1
    
    xfs.control.reset old value=0 new value=1

Verification steps

  • View the information after resetting the metric:

    $ pminfo --fetch xfs.write
    
    xfs.write
        value 0

Additional resources

9.4. PCP metric groups for XFS

The following table describes the available PCP metric groups for XFS.

Table 9.1. Metric groups for XFS

Metric Group

Metrics provided

xfs.*

General XFS metrics including the read and write operation counts, read and write byte counts. Along with counters for the number of times inodes are flushed, clustered and number of failure to cluster.

xfs.allocs.*

xfs.alloc_btree.*

Range of metrics regarding the allocation of objects in the file system, these include number of extent and block creations/frees. Allocation tree lookup and compares along with extend record creation and deletion from the btree.

xfs.block_map.*

xfs.bmap_tree.*

Metrics include the number of block map read/write and block deletions, extent list operations for insertion, deletions and lookups. Also operations counters for compares, lookups, insertions and deletion operations from the blockmap.

xfs.dir_ops.*

Counters for directory operations on XFS file systems for creation, entry deletions, count of “getdent” operations.

xfs.transactions.*

Counters for the number of meta-data transactions, these include the count for the number of synchronous and asynchronous transactions along with the number of empty transactions.

xfs.inode_ops.*

Counters for the number of times that the operating system looked for an XFS inode in the inode cache with different outcomes. These count cache hits, cache misses, and so on.

xfs.log.*

xfs.log_tail.*

Counters for the number of log buffer writes over XFS file sytems includes the number of blocks written to disk. Metrics also for the number of log flushes and pinning.

xfs.xstrat.*

Counts for the number of bytes of file data flushed out by the XFS flush deamon along with counters for number of buffers flushed to contiguous and non-contiguous space on disk.

xfs.attr.*

Counts for the number of attribute get, set, remove and list operations over all XFS file systems.

xfs.quota.*

Metrics for quota operation over XFS file systems, these include counters for number of quota reclaims, quota cache misses, cache hits and quota data reclaims.

xfs.buffer.*

Range of metrics regarding XFS buffer objects. Counters include the number of requested buffer calls, successful buffer locks, waited buffer locks, miss_locks, miss_retries and buffer hits when looking up pages.

xfs.btree.*

Metrics regarding the operations of the XFS btree.

xfs.control.reset

Configuration metrics which are used to reset the metric counters for the XFS stats. Control metrics are toggled by means of the pmstore tool.

9.5. Per-device PCP metric groups for XFS

The following table describes the available per-device PCP metric group for XFS.

Table 9.2. Per-device PCP metric groups for XFS

Metric Group

Metrics provided

xfs.perdev.*

General XFS metrics including the read and write operation counts, read and write byte counts. Along with counters for the number of times inodes are flushed, clustered and number of failure to cluster.

xfs.perdev.allocs.*

xfs.perdev.alloc_btree.*

Range of metrics regarding the allocation of objects in the file system, these include number of extent and block creations/frees. Allocation tree lookup and compares along with extend record creation and deletion from the btree.

xfs.perdev.block_map.*

xfs.perdev.bmap_tree.*

Metrics include the number of block map read/write and block deletions, extent list operations for insertion, deletions and lookups. Also operations counters for compares, lookups, insertions and deletion operations from the blockmap.

xfs.perdev.dir_ops.*

Counters for directory operations of XFS file systems for creation, entry deletions, count of “getdent” operations.

xfs.perdev.transactions.*

Counters for the number of meta-data transactions, these include the count for the number of synchronous and asynchronous transactions along with the number of empty transactions.

xfs.perdev.inode_ops.*

Counters for the number of times that the operating system looked for an XFS inode in the inode cache with different outcomes. These count cache hits, cache misses, and so on.

xfs.perdev.log.*

xfs.perdev.log_tail.*

Counters for the number of log buffer writes over XFS filesytems includes the number of blocks written to disk. Metrics also for the number of log flushes and pinning.

xfs.perdev.xstrat.*

Counts for the number of bytes of file data flushed out by the XFS flush deamon along with counters for number of buffers flushed to contiguous and non-contiguous space on disk.

xfs.perdev.attr.*

Counts for the number of attribute get, set, remove and list operations over all XFS file systems.

xfs.perdev.quota.*

Metrics for quota operation over XFS file systems, these include counters for number of quota reclaims, quota cache misses, cache hits and quota data reclaims.

xfs.perdev.buffer.*

Range of metrics regarding XFS buffer objects. Counters include the number of requested buffer calls, successful buffer locks, waited buffer locks, miss_locks, miss_retries and buffer hits when looking up pages.

xfs.perdev.btree.*

Metrics regarding the operations of the XFS btree.

Chapter 10. Setting up graphical representation of PCP metrics

Using a combination of pcp, grafana, pcp redis, pcp bpftrace, and pcp vector provides graphs, based on the live data or data collected by Performance Co-Pilot (PCP).

This section describes how to set up and access the graphical representation of PCP metrics.

10.1. Setting up PCP with pcp-zeroconf

This procedure describes how to set up PCP on a system with the pcp-zeroconf package. Once the pcp-zeroconf package is installed, the system records the default set of metrics into archived files.

Procedure

  • Install the pcp-zeroconf package:

    # yum install pcp-zeroconf

Verification steps

  • Ensure that the pmlogger service is active, and starts archiving the metrics:

    # pcp | grep pmlogger
     pmlogger: primary logger: /var/log/pcp/pmlogger/localhost.localdomain/20200401.00.12

Additional resources

10.2. Setting up a grafana-server

Grafana generates graphs that are accessible from a browser. The grafana-server is a back-end server for the Grafana dashboard. It listens, by default, on all interfaces, and provides web services accessed through the web browser. The grafana-pcp plugin interacts with the pmproxy protocol in the backend.

This procedure describes how to set up a grafana-server.

Prerequisites

Procedure

  1. Install the following packages:

    # yum install grafana grafana-pcp
  2. Restart and enable the following service:

    # systemctl restart grafana-server
    # systemctl enable grafana-server

Verification steps

  • Ensure that the grafana-server is listening and responding to requests:

    # ss -ntlp | grep 3000
    LISTEN  0  128  *:3000  *:*  users:(("grafana-server",pid=19522,fd=7))
  • Ensure that the grafana-pcp plugin is installed:

    # grafana-cli plugins ls | grep performancecopilot-pcp-app
    
    performancecopilot-pcp-app @ 3.0.2

Additional resources

  • pmproxy(1) and grafana-server man pages

10.3. Accessing the Grafana web UI

This procedure describes how to access the Grafana web interface.

Using the Grafana web interface, you can:

  • add PCP Redis, PCP bpftrace, and PCP Vector data sources
  • create dashboard
  • view an overview of any useful metrics
  • create alerts in PCP Redis

Prerequisites

  1. PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
  2. The grafana-server is configured. For more information, see Setting up a grafana-server.

Procedure

  1. On the client system, open a browser and access the grafana-server on port 3000, using http://192.0.2.0:3000 link.

    Replace 192.0.2.0 with your machine IP.

  2. For the first login, enter admin in both the Email or username and Password field.

    Grafana prompts to set a New password to create a secured account. If you want to set it later, click Skip.

  3. From the menu, hover over the    grafana gear icon    Configuration icon and then click Plugins.
  4. In the Plugins tab, type performance co-pilot in the Search by name or type text box and then click Performance Co-Pilot (PCP) plugin.
  5. In the Plugins / Performance Co-Pilot pane, click Enable.
  6. Click Grafana    grafana home page whirl icon    icon. The Grafana Home page is displayed.

    Figure 10.1. Home Dashboard

    grafana home dashboard
    Note

    The top corner of the screen has a similar    grafana top corner settings icon    icon, but it controls the general Dashboard settings.

  7. In the Grafana Home page, click Add your first data source to add PCP Redis, PCP bpftrace, and PCP Vector data sources. For more information on adding data source, see:

  8. Optional: From the menu, hover over the admin profile    grafana logout option icon    icon to change the Preferences including Edit Profile, Change Password, or to Sign out.

Additional resources

  • grafana-cli and grafana-server man pages

10.4. Configuring PCP Redis

This section provides information for configuring PCP Redis data source.

Use the PCP Redis data source to:

  • View data archives
  • Query time series using pmseries language
  • Analyze data across multiple hosts

Prerequisites

  1. PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
  2. The grafana-server is configured. For more information, see Setting up a grafana-server.

Procedure

  1. Install the redis package:

    # yum install redis
  2. Start and enable the following services:

    # systemctl start pmproxy redis
    # systemctl enable pmproxy redis
  3. Mail transfer agent, for example, sendmail or postfix is installed and configured.
  4. Ensure that the allow_loading_unsigned_plugins parameter is set to PCP Redis database in the grafana.ini file:

    # vi /etc/grafana/grafana.ini
    
    allow_loading_unsigned_plugins = pcp-redis-datasource
  5. Restart the grafana-server:

    # systemctl restart grafana-server

Verification steps

  • Ensure that the pmproxy and redis are working:

    # pmseries disk.dev.read
    2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df

    This command does not return any data if the redis package is not installed.

Additional resources

  • pmseries(1) man page

10.5. Creating panels and alert in PCP Redis data source

After adding the PCP Redis data source, you can view the dashboard with an overview of useful metrics, add a query to visualize the load graph, and create alerts that help you to view the system issues after they occur.

Prerequisites

  1. The PCP Redis is configured. For more information, see Configuring PCP Redis.
  2. The grafana-server is accessible. For more information, see Accessing the Grafana web UI.

Procedure

  1. Log into the Grafana web UI.
  2. In the Grafana Home page, click Add your first data source.
  3. In the Add data source pane, type redis in the Filter by name or type text box and then click PCP Redis.
  4. In the Data Sources / PCP Redis pane, perform the following:

    1. Add http://localhost:44322 in the URL field and then click Save & Test.
    2. Click Dashboards tabImportPCP Redis: Host Overview to see a dashboard with an overview of any useful metrics.

      Figure 10.2. PCP Redis: Host Overview

      pcp redis host overview
  5. Add a new panel:

    1. From the menu, hover over the    grafana plus sign    Create iconDashboardAdd new panel icon to add a panel.
    2. In the Query tab, select the PCP Redis from the query list instead of the selected default option and in the text field of A, enter metric, for example, kernel.all.load to visualize the kernel load graph.
    3. Optional: Add Panel title and Description, and update other options from the Settings.
    4. Click Save to apply changes and save the dashboard. Add Dashboard name.
    5. Click Apply to apply changes and go back to the dashboard.

      Figure 10.3. PCP Redis query panel

      pcp redis query panel
  6. Create an alert rule:

    1. In the PCP Redis query panel, click    redis alert icon    Alert and then click Create Alert.
    2. Edit the Name, Evaluate query, and For fields from the Rule, and specify the Conditions for your alert.
    3. Click Save to apply changes and save the dashboard. Click Apply to apply changes and go back to the dashboard.

      Figure 10.4. Creating alerts in the PCP Redis panel

      pcp redis query alert panel
    4. Optional: In the same panel, scroll down and click Delete icon to delete the created rule.
    5. Optional: From the menu, click    alerting bell icon    Alerting icon to view the created alert rules with different alert statuses, to edit the alert rule, or to pause the existing rule from the Alert Rules tab.

      To add a notification channel for the created alert rule to receive an alert notification from Grafana, see Adding notification channels for alerts.

10.6. Adding notification channels for alerts

By adding notification channels, you can receive an alert notification from Grafana whenever the alert rule conditions are met and the system needs further monitoring.

You can receive these alerts after selecting any one type from the supported list of notifiers, which includes DingDing, Discord, Email, Google Hangouts Chat, HipChat, Kafka REST Proxy, LINE, Microsoft Teams, OpsGenie, PagerDuty, Prometheus Alertmanager, Pushover, Sensu, Slack, Telegram, Threema Gateway, VictorOps, and webhook.

Prerequisites

  1. The grafana-server is accessible. For more information, see Accessing the Grafana web UI.
  2. An alert rule is created. For more information, see Creating panels and alert in PCP Redis data source.
  3. Configure SMTP and add a valid sender’s email address in the grafana/grafana.ini file:

    # vi /etc/grafana/grafana.ini
    
    [smtp]
    enabled = true
    from_address = abc@gmail.com

    Replace abc@gmail.com by a valid email address.

Procedure

  1. From the menu, hover over the    alerting bell icon    Alerting iconclick Notification channelsAdd channel.
  2. In the Add notification channel details pane, perform the following:

    1. Enter your name in the Name text box
    2. Select the communication Type, for example, Email and enter the email address. You can add multiple email addresses using the ; separator.
    3. Optional: Configure Optional Email settings and Notification settings.
  3. Click Save.
  4. Select a notification channel in the alert rule:

    1. From the menu, hover over the    alerting bell icon    Alerting icon and then click Alert rules.
    2. From the Alert Rules tab, click the created alert rule.
    3. On the Notifications tab, select your notification channel name from the Send to option, and then add an alert message.
    4. Click Apply.

10.7. Setting up authentication between PCP components

You can setup authentication using the scram-sha-256 authentication mechanism, which is supported by PCP through the Simple Authentication Security Layer (SASL) framework.

Note

From Red Hat Enterprise Linux 8.3, PCP supports the scram-sha-256 authentication mechanism.

Procedure

  1. Install the sasl framework for the scram-sha-256 authentication mechanism:

    # yum install cyrus-sasl-scram cyrus-sasl-lib
  2. Specify the supported authentication mechanism and the user database path in the pmcd.conf file:

    # vi /etc/sasl2/pmcd.conf
    
    mech_list: scram-sha-256
    
    sasldb_path: /etc/pcp/passwd.db
  3. Create a new user:

    # useradd -r metrics

    Replace metrics by your user name.

  4. Add the created user in the user database:

    # saslpasswd2 -a pmcd metrics
    
    Password:
    Again (for verification):

    To add the created user, you are required to enter the metrics account password.

  5. Set the permissions of the user database:

    # chown root:pcp /etc/pcp/passwd.db
    # chmod 640 /etc/pcp/passwd.db
  6. Restart the pmcd service:

    # systemctl restart pmcd

Verification steps

  • Verify the sasl configuration:

    # pminfo -f -h "pcp://127.0.0.1?username=metrics" disk.dev.read
    Password:
    disk.dev.read
    inst [0 or "sda"] value 19540

Additional resources

10.8. Installing PCP bpftrace

Install the PCP bpftrace agent to introspect a system and to gather metrics from the kernel and user-space tracepoints.

The bpftrace agent uses bpftrace scripts to gather the metrics. The bpftrace scripts use the enhanced Berkeley Packet Filter (eBPF).

This procedure describes how to install a pcp bpftrace.

Prerequisites

  1. PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
  2. The grafana-server is configured. For more information, see Setting up a grafana-server.
  3. The scram-sha-256 authentication mechanism is configured. For more information, see Setting up authentication between PCP components.

Procedure

  1. Install the pcp-pmda-bpftrace package:

    # yum install pcp-pmda-bpftrace
  2. Edit the bpftrace.conf file and add the user that you have created in the {setting-up-authentication-between-pcp-components}:

    # vi /var/lib/pcp/pmdas/bpftrace/bpftrace.conf
    
    [dynamic_scripts]
    enabled = true
    auth_enabled = true
    allowed_users = root,metrics

    Replace metrics by your user name.

  3. Install bpftrace PMDA:

    # cd /var/lib/pcp/pmdas/bpftrace/
    # ./Install
    Updating the Performance Metrics Name Space (PMNS) ...
    Terminate PMDA if already installed ...
    Updating the PMCD control file, and notifying PMCD ...
    Check bpftrace metrics have appeared ... 7 metrics and 6 values

    The pmda-bpftrace is now installed, and can only be used after authenticating your user. For more information, see Viewing the PCP bpftrace System Analysis dashboard.

Additional resources

  • pmdabpftrace(1) and bpftrace man pages

10.9. Viewing the PCP bpftrace System Analysis dashboard

Using the PCP bpftrace data source, you can access the live data from sources which are not available as normal data from the pmlogger or archives

In the PCP bpftrace data source, you can view the dashboard with an overview of useful metrics.

Prerequisites

  1. The PCP bpftrace is installed. For more information, see Installing PCP bpftrace.
  2. The grafana-server is accessible. For more information, see Accessing the Grafana web UI.

Procedure

  1. Log into the Grafana web UI.
  2. In the Grafana Home page, click Add your first data source.
  3. In the Add data source pane, type bpftrace in the Filter by name or type text box and then click PCP bpftrace.
  4. In the Data Sources / PCP bpftrace pane, perform the following:

    1. Add http://localhost:44322 in the URL field.
    2. Toggle the Basic Auth option and add the created user credentials in the User and Password field.
    3. Click Save & Test.

      Figure 10.5. Adding PCP bpftrace in the data source

      bpftrace auth
    4. Click Dashboards tabImportPCP bpftrace: System Analysis to see a dashboard with an overview of any useful metrics.

      Figure 10.6. PCP bpftrace: System Analysis

      pcp bpftrace bpftrace system analysis

10.10. Installing PCP Vector

This procedure describes how to install a pcp vector.

Prerequisites

  1. PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
  2. The grafana-server is configured. For more information, see Setting up a grafana-server.

Procedure

  1. Install the pcp-pmda-bcc package:

    # yum install pcp-pmda-bcc
  2. Install the bcc PMDA:

    # cd /var/lib/pcp/pmdas/bcc
    # ./Install
    [Wed Apr  1 00:27:48] pmdabcc(22341) Info: Initializing, currently in 'notready' state.
    [Wed Apr  1 00:27:48] pmdabcc(22341) Info: Enabled modules:
    [Wed Apr  1 00:27:48] pmdabcc(22341) Info: ['biolatency', 'sysfork',
    [...]
    Updating the Performance Metrics Name Space (PMNS) ...
    Terminate PMDA if already installed ...
    Updating the PMCD control file, and notifying PMCD ...
    Check bcc metrics have appeared ... 1 warnings, 1 metrics and 0 values

Additional resources

  • pmdabcc(1) man page

10.11. Viewing the PCP Vector Checklist

The PCP Vector data source displays live metrics and uses the pcp metrics. It analyzes data for individual hosts.

After adding the PCP Vector data source, you can view the dashboard with an overview of useful metrics and view the related troubleshooting or reference links in the checklist.

Prerequisites

  1. The PCP Vector is installed. For more information, see Installing PCP Vector.
  2. The grafana-server is accessible. For more information, see Accessing the Grafana web UI.

Procedure

  1. Log into the Grafana web UI.
  2. In the Grafana Home page, click Add your first data source.
  3. In the Add data source pane, type vector in the Filter by name or type text box and then click PCP Vector.
  4. In the Data Sources / PCP Vector pane, perform the following:

    1. Add http://localhost:44322 in the URL field and then click Save & Test.
    2. Click Dashboards tabImportPCP Vector: Host Overview to see a dashboard with an overview of any useful metrics.

      Figure 10.7. PCP Vector: Host Overview

      pcp vector host overview
  5. From the menu, hover over the    pcp plugin in grafana    Performance Co-Pilot plugin and then click PCP Vector Checklist.

    In the PCP checklist, click    pcp vector checklist troubleshooting doc    help or    pcp vector checklist warning    warning icon to view the related troubleshooting or reference links.

    Figure 10.8. Performance Co-Pilot / PCP Vector Checklist

    pcp vector checklist

10.12. Troubleshooting Grafana issues

This section describes how to troubleshoot Grafana issues, such as, Grafana does not display any data, the dashboard is black, or similar issues.

Procedure

  • Verify that the pmlogger service is up and running by executing the following command:

    $ systemctl status pmlogger
  • Verify if files were created or modified to the disk by executing the following command:

    $ ls /var/log/pcp/pmlogger/$(hostname)/ -rlt
    total 4024
    -rw-r--r--. 1 pcp pcp   45996 Oct 13  2019 20191013.20.07.meta.xz
    -rw-r--r--. 1 pcp pcp     412 Oct 13  2019 20191013.20.07.index
    -rw-r--r--. 1 pcp pcp   32188 Oct 13  2019 20191013.20.07.0.xz
    -rw-r--r--. 1 pcp pcp   44756 Oct 13  2019 20191013.20.30-00.meta.xz
    [..]
  • Verify that the pmproxy service is running by executing the following command:

    $ systemctl status pmproxy
  • Verify that pmproxy is running, time series support is enabled, and a connection to Redis is established by viewing the /var/log/pcp/pmproxy/pmproxy.log file and ensure that it contains the following text:

    pmproxy(1716) Info: Redis slots, command keys, schema version setup

    Here, 1716 is the PID of pmproxy, which will be different for every invocation of pmproxy.

  • Verify if the Redis database contains any keys by executing the following command:

    $ redis-cli dbsize
    (integer) 34837
  • Verify if any PCP metrics are in the Redis database and pmproxy is able to access them by executing the following commands:

    $ pmseries disk.dev.read
    2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
    
    $ pmseries "disk.dev.read[count:10]"
    2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
        [Mon Jul 26 12:21:10.085468000 2021] 117971 70e83e88d4e1857a3a31605c6d1333755f2dd17c
        [Mon Jul 26 12:21:00.087401000 2021] 117758 70e83e88d4e1857a3a31605c6d1333755f2dd17c
        [Mon Jul 26 12:20:50.085738000 2021] 116688 70e83e88d4e1857a3a31605c6d1333755f2dd17c
    [...]
    $ redis-cli --scan --pattern "*$(pmseries 'disk.dev.read')"
    
    pcp:metric.name:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
    pcp:values:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
    pcp:desc:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
    pcp:labelvalue:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
    pcp:instances:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
    pcp:labelflags:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
  • Verify if there are any errors in the Grafana logs by executing the following command:

    $ journalctl -e -u grafana-server
    -- Logs begin at Mon 2021-07-26 11:55:10 IST, end at Mon 2021-07-26 12:30:15 IST. --
    Jul 26 11:55:17 localhost.localdomain systemd[1]: Starting Grafana instance...
    Jul 26 11:55:17 localhost.localdomain grafana-server[1171]: t=2021-07-26T11:55:17+0530 lvl=info msg="Starting Grafana" logger=server version=7.3.6 c>
    Jul 26 11:55:17 localhost.localdomain grafana-server[1171]: t=2021-07-26T11:55:17+0530 lvl=info msg="Config loaded from" logger=settings file=/usr/s>
    Jul 26 11:55:17 localhost.localdomain grafana-server[1171]: t=2021-07-26T11:55:17+0530 lvl=info msg="Config loaded from" logger=settings file=/etc/g>
    [...]

Chapter 11. Optimizing the system performance using the web console

Learn how to set a performance profile in the RHEL 8 web console to optimize the performance of the system for a selected task.

11.1. Performance tuning options in the web console

Red Hat Enterprise Linux 8 provides several performance profiles that optimize the system for the following tasks:

  • Systems using the desktop
  • Throughput performance
  • Latency performance
  • Network performance
  • Low power consumption
  • Virtual machines

The tuned service optimizes system options to match the selected profile.

In the web console, you can set which performance profile your system uses.

Additional resources

11.2. Setting a performance profile in the web console

This procedure uses the web console to optimize the system performance for a selected task.

Prerequisites

Procedure

  1. Log into the RHEL 8 web console. For details, see Logging in to the web console.
  2. Click Overview.
  3. In the Performance Profile field, click the current performance profile.

    cockpit performance profile pf4

  4. In the Change Performance Profile dialog box, change the profile if necessary.
  5. Click Change Profile.

    cockpit performance profile change pf4

Verification steps

  • The Overview tab now shows the selected performance profile.

11.3. Monitoring performance using the web console

Red Hat’s web console uses the Utilization Saturation and Errors (USE) Method for troubleshooting. The new performance metrics page has a historical view of your data organized chronologically with the newest data at the top.

Here, you can view the events, errors, and graphical representation for resource utilization and saturation.

Prerequisites

  1. Make sure the web console is installed and accessible. For details, see Installing the web console.
  2. Install the cockpit-pcp package, which enables collecting the performance metrics:

    # yum install cockpit-pcp

Procedure

  1. Log into the RHEL 8 web console. For details, see Logging in to the web console.
  2. Click Overview.

    Web console Overview

  3. Click View details and history to view the Performance Metrics.

    View details and history

    Performance metrics in Web console

Chapter 12. Setting the disk scheduler

The disk scheduler is responsible for ordering the I/O requests submitted to a storage device.

You can configure the scheduler in several different ways:

Note

In Red Hat Enterprise Linux 8, block devices support only multi-queue scheduling. This enables the block layer performance to scale well with fast solid-state drives (SSDs) and multi-core systems.

The traditional, single-queue schedulers, which were available in Red Hat Enterprise Linux 7 and earlier versions, have been removed.

12.1. Available disk schedulers

The following multi-queue disk schedulers are supported in Red Hat Enterprise Linux 8:

none
Implements a first-in first-out (FIFO) scheduling algorithm. It merges requests at the generic block layer through a simple last-hit cache.
mq-deadline

Attempts to provide a guaranteed latency for requests from the point at which requests reach the scheduler.

The mq-deadline scheduler sorts queued I/O requests into a read or write batch and then schedules them for execution in increasing logical block addressing (LBA) order. By default, read batches take precedence over write batches, because applications are more likely to block on read I/O operations. After mq-deadline processes a batch, it checks how long write operations have been starved of processor time and schedules the next read or write batch as appropriate.

This scheduler is suitable for most use cases, but particularly those in which the write operations are mostly asynchronous.

bfq

Targets desktop systems and interactive tasks.

The bfq scheduler ensures that a single application is never using all of the bandwidth. In effect, the storage device is always as responsive as if it was idle. In its default configuration, bfq focuses on delivering the lowest latency rather than achieving the maximum throughput.

bfq is based on cfq code. It does not grant the disk to each process for a fixed time slice but assigns a budget measured in number of sectors to the process.

This scheduler is suitable while copying large files and the system does not become unresponsive in this case.

kyber

The scheduler tunes itself to achieve a latency goal by calculating the latencies of every I/O request submitted to the block I/O layer. You can configure the target latencies for read, in the case of cache-misses, and synchronous write requests.

This scheduler is suitable for fast devices, for example NVMe, SSD, or other low latency devices.

12.2. Different disk schedulers for different use cases

Depending on the task that your system performs, the following disk schedulers are recommended as a baseline prior to any analysis and tuning tasks:

Table 12.1. Disk schedulers for different use cases

Use caseDisk scheduler

Traditional HDD with a SCSI interface

Use mq-deadline or bfq.

High-performance SSD or a CPU-bound system with fast storage

Use none, especially when running enterprise applications. Alternatively, use kyber.

Desktop or interactive tasks

Use bfq.

Virtual guest

Use mq-deadline. With a host bus adapter (HBA) driver that is multi-queue capable, use none.

12.3. The default disk scheduler

Block devices use the default disk scheduler unless you specify another scheduler.

Note

For non-volatile Memory Express (NVMe) block devices specifically, the default scheduler is none and Red Hat recommends not changing this.

The kernel selects a default disk scheduler based on the type of device. The automatically selected scheduler is typically the optimal setting. If you require a different scheduler, Red Hat recommends to use udev rules or the Tuned application to configure it. Match the selected devices and switch the scheduler only for those devices.

12.4. Determining the active disk scheduler

This procedure determines which disk scheduler is currently active on a given block device.

Procedure

  • Read the content of the /sys/block/device/queue/scheduler file:

    # cat /sys/block/device/queue/scheduler
    
    [mq-deadline] kyber bfq none

    In the file name, replace device with the block device name, for example sdc.

    The active scheduler is listed in square brackets ([ ]).

12.5. Setting the disk scheduler using Tuned

This procedure creates and enables a Tuned profile that sets a given disk scheduler for selected block devices. The setting persists across system reboots.

In the following commands and configuration, replace:

  • device with the name of the block device, for example sdf
  • selected-scheduler with the disk scheduler that you want to set for the device, for example bfq

Prerequisites

Procedure

  1. Optional: Select an existing Tuned profile on which your profile will be based. For a list of available profiles, see Tuned profiles distributed with RHEL.

    To see which profile is currently active, use:

    $ tuned-adm active
  2. Create a new directory to hold your Tuned profile:

    # mkdir /etc/tuned/my-profile
  3. Find the system unique identifier of the selected block device:

    $ udevadm info --query=property --name=/dev/device | grep -E '(WWN|SERIAL)'
    
    ID_WWN=0x5002538d00000000_
    ID_SERIAL=Generic-_SD_MMC_20120501030900000-0:0
    ID_SERIAL_SHORT=20120501030900000
    Note

    The command in the this example will return all values identified as a World Wide Name (WWN) or serial number associated with the specified block device. Although it is preferred to use a WWN, the WWN is not always available for a given device and any values returned by the example command are acceptable to use as the device system unique ID.

  4. Create the /etc/tuned/my-profile/tuned.conf configuration file. In the file, set the following options:

    1. Optional: Include an existing profile:

      [main]
      include=existing-profile
    2. Set the selected disk scheduler for the device that matches the WWN identifier:

      [disk]
      devices_udev_regex=IDNAME=device system unique id
      elevator=selected-scheduler

      Here:

      • Replace IDNAME with the name of the identifier being used (for example, ID_WWN).
      • Replace device system unique id with the value of the chosen identifier (for example, 0x5002538d00000000).

        To match multiple devices in the devices_udev_regex option, enclose the identifiers in parentheses and separate them with vertical bars:

        devices_udev_regex=(ID_WWN=0x5002538d00000000)|(ID_WWN=0x1234567800000000)
  5. Enable your profile:

    # tuned-adm profile my-profile

Verification steps

  • Verify that the Tuned profile is active and applied:

    $ tuned-adm active
    
    Current active profile: my-profile
    $ tuned-adm verify
    
    Verification succeeded, current system settings match the preset profile.
    See tuned log file ('/var/log/tuned/tuned.log') for details.

Additional resources

12.6. Setting the disk scheduler using udev rules

This procedure sets a given disk scheduler for specific block devices using udev rules. The setting persists across system reboots.

In the following commands and configuration, replace:

  • device with the name of the block device, for example sdf
  • selected-scheduler with the disk scheduler that you want to set for the device, for example bfq

Procedure

  1. Find the system unique identifier of the block device:

    $ udevadm info --name=/dev/device | grep -E '(WWN|SERIAL)'
    E: ID_WWN=0x5002538d00000000
    E: ID_SERIAL=Generic-_SD_MMC_20120501030900000-0:0
    E: ID_SERIAL_SHORT=20120501030900000
    Note

    The command in the this example will return all values identified as a World Wide Name (WWN) or serial number associated with the specified block device. Although it is preferred to use a WWN, the WWN is not always available for a given device and any values returned by the example command are acceptable to use as the device system unique ID.

  2. Configure the udev rule. Create the /etc/udev/rules.d/99-scheduler.rules file with the following content:

    ACTION=="add|change", SUBSYSTEM=="block", ENV{IDNAME}=="device system unique id", ATTR{queue/scheduler}="selected-scheduler"

    Here:

    • Replace IDNAME with the name of the identifier being used (for example, ID_WWN).
    • Replace device system unique id with the value of the chosen identifier (for example, 0x5002538d00000000).
  3. Reload udev rules:

    # udevadm control --reload-rules
  4. Apply the scheduler configuration:

    # udevadm trigger --type=devices --action=change

Verification steps

  • Verify the active scheduler:

    # cat /sys/block/device/queue/scheduler

12.7. Temporarily setting a scheduler for a specific disk

This procedure sets a given disk scheduler for specific block devices. The setting does not persist across system reboots.

Procedure

  • Write the name of the selected scheduler to the /sys/block/device/queue/scheduler file:

    # echo selected-scheduler > /sys/block/device/queue/scheduler

    In the file name, replace device with the block device name, for example sdc.

Verification steps

  • Verify that the scheduler is active on the device:

    # cat /sys/block/device/queue/scheduler

Chapter 13. Tuning the performance of a Samba server

This chapter describes what settings can improve the performance of Samba in certain situations, and which settings can have a negative performance impact.

Parts of this section were adopted from the Performance Tuning documentation published in the Samba Wiki. License: CC BY 4.0. Authors and contributors: See the history tab on the Wiki page.

Prerequisites

13.1. Setting the SMB protocol version

Each new SMB version adds features and improves the performance of the protocol. The recent Windows and Windows Server operating systems always supports the latest protocol version. If Samba also uses the latest protocol version, Windows clients connecting to Samba benefit from the performance improvements. In Samba, the default value of the server max protocol is set to the latest supported stable SMB protocol version.

Note

To always have the latest stable SMB protocol version enabled, do not set the server max protocol parameter. If you set the parameter manually, you will need to modify the setting with each new version of the SMB protocol, to have the latest protocol version enabled.

The following procedure explains how to use the default value in the server max protocol parameter.

Procedure

  1. Remove the server max protocol parameter from the [global] section in the /etc/samba/smb.conf file.
  2. Reload the Samba configuration

    # smbcontrol all reload-config

13.2. Tuning shares with directories that contain a large number of files

Linux supports case-sensitive file names. For this reason, Samba needs to scan directories for uppercase and lowercase file names when searching or accessing a file. You can configure a share to create new files only in lowercase or uppercase, which improves the performance.

Prerequisites

  • Samba is configured as a file server

Procedure

  1. Rename all files on the share to lowercase.

    Note

    Using the settings in this procedure, files with names other than in lowercase will no longer be displayed.

  2. Set the following parameters in the share’s section:

    case sensitive = true
    default case = lower
    preserve case = no
    short preserve case = no

    For details about the parameters, see their descriptions in the smb.conf(5) man page.

  3. Verify the /etc/samba/smb.conf file:

    # testparm
  4. Reload the Samba configuration:

    # smbcontrol all reload-config

After you applied these settings, the names of all newly created files on this share use lowercase. Because of these settings, Samba no longer needs to scan the directory for uppercase and lowercase, which improves the performance.

13.3. Settings that can have a negative performance impact

By default, the kernel in Red Hat Enterprise Linux is tuned for high network performance. For example, the kernel uses an auto-tuning mechanism for buffer sizes. Setting the socket options parameter in the /etc/samba/smb.conf file overrides these kernel settings. As a result, setting this parameter decreases the Samba network performance in most cases.

To use the optimized settings from the Kernel, remove the socket options parameter from the [global] section in the /etc/samba/smb.conf.

Chapter 14. Optimizing virtual machine performance

Virtual machines (VMs) always experience some degree of performance deterioration in comparison to the host. The following sections explain the reasons for this deterioration and provide instructions on how to minimize the performance impact of virtualization in RHEL 8, so that your hardware infrastructure resources can be used as efficiently as possible.

14.1. What influences virtual machine performance

VMs are run as user-space processes on the host. The hypervisor therefore needs to convert the host’s system resources so that the VMs can use them. As a consequence, a portion of the resources is consumed by the conversion, and the VM therefore cannot achieve the same performance efficiency as the host.

The impact of virtualization on system performance

More specific reasons for VM performance loss include:

  • Virtual CPUs (vCPUs) are implemented as threads on the host, handled by the Linux scheduler.
  • VMs do not automatically inherit optimization features, such as NUMA or huge pages, from the host kernel.
  • Disk and network I/O settings of the host might have a significant performance impact on the VM.
  • Network traffic typically travels to a VM through a software-based bridge.
  • Depending on the host devices and their models, there might be significant overhead due to emulation of particular hardware.

The severity of the virtualization impact on the VM performance is influenced by a variety factors, which include:

  • The number of concurrently running VMs.
  • The amount of virtual devices used by each VM.
  • The device types used by the VMs.

Reducing VM performance loss

RHEL 8 provides a number of features you can use to reduce the negative performance effects of virtualization. Notably:

Important

Tuning VM performance can have adverse effects on other virtualization functions. For example, it can make migrating the modified VM more difficult.

14.2. Optimizing virtual machine performance using tuned

The tuned utility is a tuning profile delivery mechanism that adapts RHEL for certain workload characteristics, such as requirements for CPU-intensive tasks or storage-network throughput responsiveness. It provides a number of tuning profiles that are pre-configured to enhance performance and reduce power consumption in a number of specific use cases. You can edit these profiles or create new profiles to create performance solutions tailored to your environment, including virtualized environments.

Red Hat recommends using the following profiles when using virtualization in RHEL 8:

  • For RHEL 8 virtual machines, use the virtual-guest profile. It is based on the generally applicable throughput-performance profile, but also decreases the swappiness of virtual memory.
  • For RHEL 8 virtualization hosts, use the virtual-host profile. This enables more aggressive writeback of dirty memory pages, which benefits the host performance.

Prerequisites

Procedure

To enable a specific tuned profile:

  1. List the available tuned profiles.

    # tuned-adm list
    
    Available profiles:
    - balanced             - General non-specialized tuned profile
    - desktop              - Optimize for the desktop use-case
    [...]
    - virtual-guest        - Optimize for running inside a virtual guest
    - virtual-host         - Optimize for running KVM guests
    Current active profile: balanced
  2. Optional: Create a new tuned profile or edit an existing tuned profile.

    For more information, see Customizing tuned profiles.

  3. Activate a tuned profile.

    # tuned-adm profile selected-profile
    • To optimize a virtualization host, use the virtual-host profile.

      # tuned-adm profile virtual-host
    • On a RHEL guest operating system, use the virtual-guest profile.

      # tuned-adm profile virtual-guest

Additional resources

14.3. Configuring virtual machine memory

To improve the performance of a virtual machine (VM), you can assign additional host RAM to the VM. Similarly, you can decrease the amount of memory allocated to a VM so the host memory can be allocated to other VMs or tasks.

To perform these actions, you can use the web console or the command-line interface.

14.3.1. Adding and removing virtual machine memory using the web console

To improve the performance of a virtual machine (VM) or to free up the host resources it is using, you can use the web console to adjust amount of memory allocated to the VM.

Prerequisites

  • The guest OS is running the memory balloon drivers. To verify this is the case:

    1. Ensure the VM’s configuration includes the memballoon device:

      # virsh dumpxml testguest | grep memballoon
      <memballoon model='virtio'>
          </memballoon>

      If this commands displays any output and the model is not set to none, the memballoon device is present.

    2. Ensure the ballon drivers are running in the guest OS.

  • To use the web console to manage VMs, install the web console VM plug-in.

Procedure

  1. Optional: Obtain the information about the maximum memory and currently used memory for a VM. This will serve as a baseline for your changes, and also for verification.

    # virsh dominfo testguest
    Max memory:     2097152 KiB
    Used memory:    2097152 KiB
  2. In the Virtual Machines interface, click a row with the name of the VMs for which you want to view and adjust the allocated memory.

    The row expands to reveal the Overview pane with basic information about the selected VMs.

  3. Click the value of the Memory line in the Overview pane.

    The Memory Adjustment dialog appears.

    virt memory cockpit
  4. Configure the virtual CPUs for the selected VM.

    • Maximum allocation - Sets the maximum amount of host memory that the VM can use for its processes. Increasing this value improves the performance potential of the VM, and reducing the value lowers the performance footprint the VM has on your host.

      Adjusting maximum memory allocation is only possible on a shut-off VM.

    • Current allocation - Sets the actual amount of memory allocated to the VM. You can adjust the value to regulate the memory available to the VM for its processes. This value cannot exceed the maximum allocation value.
  5. Click Save.

    The memory allocation of the VM is adjusted.

Additional resources

14.3.2. Adding and removing virtual machine memory using the command-line interface

To improve the performance of a virtual machine (VM) or to free up the host resources it is using, you can use the CLI to adjust amount of memory allocated to the VM.

Prerequisites

  • The guest OS is running the memory balloon drivers. To verify this is the case:

    1. Ensure the VM’s configuration includes the memballoon device:

      # virsh dumpxml testguest | grep memballoon
      <memballoon model='virtio'>
          </memballoon>

      If this commands displays any output and the model is not set to none, the memballoon device is present.

    2. Ensure the ballon drivers are running in the guest OS.

Procedure

  1. Optional: Obtain the information about the maximum memory and currently used memory for a VM. This will serve as a baseline for your changes, and also for verification.

    # virsh dominfo testguest
    Max memory:     2097152 KiB
    Used memory:    2097152 KiB
  2. Adjust the maximum memory allocated to a VM. Increasing this value improves the performance potential of the VM, and reducing the value lowers the performance footprint the VM has on your host. Note that this change can only be performed on a shut-off VM, so adjusting a running VM requires a reboot to take effect.

    For example, to change the maximum memory that the testguest VM can use to 4096 MiB:

    # virt-xml testguest --edit --memory memory=4096,currentMemory=4096
    Domain 'testguest' defined successfully.
    Changes will take effect after the domain is fully powered off.
  1. Optional: You can also adjust the memory currently used by the VM, up to the maximum allocation. This regulates the memory load that the VM has on the host until the next reboot, without changing the maximum VM allocation.

    # virsh setmem testguest --current 2048

Verification

  1. Confirm that the memory used by the VM has been updated:

    # virsh dominfo testguest
    Max memory:     4194304 KiB
    Used memory:    2097152 KiB
  2. Optional: If you adjusted the current VM memory, you can obtain the memory balloon statistics of the VM to evaluate how effectively it regulates its memory use.

     # virsh domstats --balloon testguest
    Domain: 'testguest'
      balloon.current=365624
      balloon.maximum=4194304
      balloon.swap_in=0
      balloon.swap_out=0
      balloon.major_fault=306
      balloon.minor_fault=156117
      balloon.unused=3834448
      balloon.available=4035008
      balloon.usable=3746340
      balloon.last-update=1587971682
      balloon.disk_caches=75444
      balloon.hugetlb_pgalloc=0
      balloon.hugetlb_pgfail=0
      balloon.rss=1005456

Additional resources

14.3.3. Additional resources

  • * To increase the maximum memory of a running VM, you can attach a memory device to the VM. This is also referred to as memory hot plug. For details, see Attaching devices to virtual machines.

    Note that removing a memory device from a VM, also known as memory hot unplug, is not supported in RHEL 8, and Red Hat highly discourages its use.

14.4. Optimizing virtual machine I/O performance

The input and output (I/O) capabilities of a virtual machine (VM) can significantly limit the VM’s overall efficiency. To address this, you can optimize a VM’s I/O by configuring block I/O parameters.

14.4.1. Tuning block I/O in virtual machines

When multiple block devices are being used by one or more VMs, it might be important to adjust the I/O priority of specific virtual devices by modifying their I/O weights.

Increasing the I/O weight of a device increases its priority for I/O bandwidth, and therefore provides it with more host resources. Similarly, reducing a device’s weight makes it consume less host resources.

Note

Each device’s weight value must be within the 100 to 1000 range. Alternatively, the value can be 0, which removes that device from per-device listings.

Procedure

To display and set a VM’s block I/O parameters:

  1. Display the current <blkio> parameters for a VM:

    # virsh dumpxml VM-name

    <domain>
      [...]
      <blkiotune>
        <weight>800</weight>
        <device>
          <path>/dev/sda</path>
          <weight>1000</weight>
        </device>
        <device>
          <path>/dev/sdb</path>
          <weight>500</weight>
        </device>
      </blkiotune>
      [...]
    </domain>
  2. Edit the I/O weight of a specified device:

    # virsh blkiotune VM-name --device-weights device, I/O-weight

    For example, the following changes the weight of the /dev/sda device in the liftrul VM to 500.

    # virsh blkiotune liftbrul --device-weights /dev/sda, 500

14.4.2. Disk I/O throttling in virtual machines

When several VMs are running simultaneously, they can interfere with system performance by using excessive disk I/O. Disk I/O throttling in KVM virtualization provides the ability to set a limit on disk I/O requests sent from the VMs to the host machine. This can prevent a VM from over-utilizing shared resources and impacting the performance of other VMs.

To enable disk I/O throttling, set a limit on disk I/O requests sent from each block device attached to VMs to the host machine.

Procedure

  1. Use the virsh domblklist command to list the names of all the disk devices on a specified VM.

    # virsh domblklist rollin-coal
    Target     Source
    ------------------------------------------------
    vda        /var/lib/libvirt/images/rollin-coal.qcow2
    sda        -
    sdb        /home/horridly-demanding-processes.iso
  2. Find the host block device where the virtual disk that you want to throttle is mounted.

    For example, if you want to throttle the sdb virtual disk from the previous step, the following output shows that the disk is mounted on the /dev/nvme0n1p3 partition.

    $ lsblk
    NAME                                          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
    zram0                                         252:0    0     4G  0 disk  [SWAP]
    nvme0n1                                       259:0    0 238.5G  0 disk
    ├─nvme0n1p1                                   259:1    0   600M  0 part  /boot/efi
    ├─nvme0n1p2                                   259:2    0     1G  0 part  /boot
    └─nvme0n1p3                                   259:3    0 236.9G  0 part
      └─luks-a1123911-6f37-463c-b4eb-fxzy1ac12fea 253:0    0 236.9G  0 crypt /home
  3. Set I/O limits for the block device using the virsh blkiotune command.

    # virsh blkiotune VM-name --parameter device,limit

    The following example throttles the sdb disk on the rollin-coal VM to 1000 read and write I/O operations per second and to 50 MB per second read and write throughput.

    # virsh blkiotune rollin-coal --device-read-iops-sec /dev/nvme0n1p3,1000 --device-write-iops-sec /dev/nvme0n1p3,1000 --device-write-bytes-sec /dev/nvme0n1p3,52428800 --device-read-bytes-sec /dev/nvme0n1p3,52428800

Additional information

  • Disk I/O throttling can be useful in various situations, for example when VMs belonging to different customers are running on the same host, or when quality of service guarantees are given for different VMs. Disk I/O throttling can also be used to simulate slower disks.
  • I/O throttling can be applied independently to each block device attached to a VM and supports limits on throughput and I/O operations.
  • Red Hat does not support using the virsh blkdeviotune command to configure I/O throttling in VMs. For more information on unsupported features when using RHEL 8 as a VM host, see Unsupported features in RHEL 8 virtualization.

14.4.3. Enabling multi-queue virtio-scsi

When using virtio-scsi storage devices in your virtual machines (VMs), the multi-queue virtio-scsi feature provides improved storage performance and scalability. It enables each virtual CPU (vCPU) to have a separate queue and interrupt to use without affecting other vCPUs.

Procedure

  • To enable multi-queue virtio-scsi support for a specific VM, add the following to the VM’s XML configuration, where N is the total number of vCPU queues:

    <controller type='scsi' index='0' model='virtio-scsi'>
       <driver queues='N' />
    </controller>

14.5. Optimizing virtual machine CPU performance

Much like physical CPUs in host machines, vCPUs are critical to virtual machine (VM) performance. As a result, optimizing vCPUs can have a significant impact on the resource efficiency of your VMs. To optimize your vCPU:

  1. Adjust how many host CPUs are assigned to the VM. You can do this using the CLI or the web console.
  2. Ensure that the vCPU model is aligned with the CPU model of the host. For example, to set the testguest1 VM to use the CPU model of the host:

    # virt-xml testguest1 --edit --cpu host-model
  3. Deactivate kernel same-page merging (KSM).
  4. If your host machine uses Non-Uniform Memory Access (NUMA), you can also configure NUMA for its VMs. This maps the host’s CPU and memory processes onto the CPU and memory processes of the VM as closely as possible. In effect, NUMA tuning provides the vCPU with a more streamlined access to the system memory allocated to the VM, which can improve the vCPU processing effectiveness.

    For details, see Section 14.5.3, “Configuring NUMA in a virtual machine” and Section 14.5.4, “Sample vCPU performance tuning scenario”.

14.5.1. Adding and removing virtual CPUs using the command-line interface

To increase or optimize the CPU performance of a virtual machine (VM), you can add or remove virtual CPUs (vCPUs) assigned to the VM.

When performed on a running VM, this is also referred to as vCPU hot plugging and hot unplugging. However, note that vCPU hot unplug is not supported in RHEL 8, and Red Hat highly discourages its use.

Prerequisites

  • Optional: View the current state of the vCPUs in the targeted VM. For example, to display the number of vCPUs on the testguest VM:

    # virsh vcpucount testguest
    maximum      config         4
    maximum      live           2
    current      config         2
    current      live           1

    This output indicates that testguest is currently using 1 vCPU, and 1 more vCPu can be hot plugged to it to increase the VM’s performance. However, after reboot, the number of vCPUs testguest uses will change to 2, and it will be possible to hot plug 2 more vCPUs.

Procedure

  1. Adjust the maximum number of vCPUs that can be attached to a VM, which takes effect on the VM’s next boot.

    For example, to increase the maximum vCPU count for the testguest VM to 8:

    # virsh setvcpus testguest 8 --maximum --config

    Note that the maximum may be limited by the CPU topology, host hardware, the hypervisor, and other factors.

  2. Adjust the current number of vCPUs attached to a VM, up to the maximum configured in the previous step. For example:

    • To increase the number of vCPUs attached to the running testguest VM to 4:

      # virsh setvcpus testguest 4 --live

      This increases the VM’s performance and host load footprint of testguest until the VM’s next boot.

    • To permanently decrease the number of vCPUs attached to the testguest VM to 1:

      # virsh setvcpus testguest 1 --config

      This decreases the VM’s performance and host load footprint of testguest after the VM’s next boot. However, if needed, additional vCPUs can be hot plugged to the VM to temporarily increase its performance.

Verification

  • Confirm that the current state of vCPU for the VM reflects your changes.

    # virsh vcpucount testguest
    maximum      config         8
    maximum      live           4
    current      config         1
    current      live           4

Additional resources

14.5.2. Managing virtual CPUs using the web console

Using the RHEL 8 web console, you can review and configure virtual CPUs used by virtual machines (VMs) to which the web console is connected.

Prerequisites

Procedure

  1. In the Virtual Machines interface, click a row with the name of the VMs for which you want to view and configure virtual CPU parameters.

    The row expands to reveal the Overview pane with basic information about the selected VMs, including the number of virtual CPUs, and controls for shutting down and deleting the VM.

  2. Click the number of vCPUs in the Overview pane.

    The vCPU details dialog appears.

    cockpit configure vCPUs
    Note

    The warning in the vCPU details dialog only appears after the virtual CPU settings are changed.

  3. Configure the virtual CPUs for the selected VM.

    • vCPU Count - The number of vCPUs currently in use.

      Note

      The vCPU count cannot be greater than the vCPU Maximum.

    • vCPU Maximum - The maximum number of virtual CPUs that can be configured for the VM. If this value is higher than the vCPU Count, additional vCPUs can be attached to the VM.
    • Sockets - The number of sockets to expose to the VM.
    • Cores per socket - The number of cores for each socket to expose to the VM.
    • Threads per core - The number of threads for each core to expose to the VM.

      Note that the Sockets, Cores per socket, and Threads per core options adjust the CPU topology of the VM. This may be beneficial for vCPU performance and may impact the functionality of certain software in the guest OS. If a different setting is not required by your deployment, Red Hat recommends keeping the default values.

  4. Click Apply.

    The virtual CPUs for the VM are configured.

    Note

    Changes to virtual CPU settings only take effect after the VM is restarted.

Additional resources:

14.5.3. Configuring NUMA in a virtual machine

The following methods can be used to configure Non-Uniform Memory Access (NUMA) settings of a virtual machine (VM) on a RHEL 8 host.

Prerequisites

  • The host is a NUMA-compatible machine. To detect whether this is the case, use the virsh nodeinfo command and see the NUMA cell(s) line:

    # virsh nodeinfo
    CPU model:           x86_64
    CPU(s):              48
    CPU frequency:       1200 MHz
    CPU socket(s):       1
    Core(s) per socket:  12
    Thread(s) per core:  2
    NUMA cell(s):        2
    Memory size:         67012964 KiB

    If the value of the line is 2 or greater, the host is NUMA-compatible.

Procedure

For ease of use, you can set up a VM’s NUMA configuration using automated utilities and services. However, manual NUMA setup is more likely to yield a significant performance improvement.

Automatic methods

  • Set the VM’s NUMA policy to Preferred. For example, to do so for the testguest5 VM:

    # virt-xml testguest5 --edit --vcpus placement=auto
    # virt-xml testguest5 --edit --numatune mode=preferred
  • Enable automatic NUMA balancing on the host:

    # echo 1 > /proc/sys/kernel/numa_balancing
  • Use the numad command to automatically align the VM CPU with memory resources.

    # numad

Manual methods

  1. Pin specific vCPU threads to a specific host CPU or range of CPUs. This is also possible on non-NUMA hosts and VMs, and is recommended as a safe method of vCPU performance improvement.

    For example, the following commands pin vCPU threads 0 to 5 of the testguest6 VM to host CPUs 1, 3, 5, 7, 9, and 11, respectively:

    # virsh vcpupin testguest6 0 1
    # virsh vcpupin testguest6 1 3
    # virsh vcpupin testguest6 2 5
    # virsh vcpupin testguest6 3 7
    # virsh vcpupin testguest6 4 9
    # virsh vcpupin testguest6 5 11

    Afterwards, you can verify whether this was successful:

    # virsh vcpupin testguest6
    VCPU   CPU Affinity
    ----------------------
    0      1
    1      3
    2      5
    3      7
    4      9
    5      11
  2. After pinning vCPU threads, you can also pin QEMU process threads associated with a specified VM to a specific host CPU or range of CPUs. For example, the following commands pin the QEMU process thread of testguest6 to CPUs 13 and 15, and verify this was successful:

    # virsh emulatorpin testguest6 13,15
    # virsh emulatorpin testguest6
    emulator: CPU Affinity
    ----------------------------------
           *: 13,15
  3. Finally, you can also specify which host NUMA nodes will be assigned specifically to a certain VM. This can improve the host memory usage by the VM’s vCPU. For example, the following commands set testguest6 to use host NUMA nodes 3 to 5, and verify this was successful:

    # virsh numatune testguest6 --nodeset 3-5
    # virsh numatune testguest6

Additional resources

14.5.4. Sample vCPU performance tuning scenario

To obtain the best vCPU performance possible, Red Hat recommends using manual vcpupin, emulatorpin, and numatune settings together, for example like in the following scenario.

Starting scenario

  • Your host has the following hardware specifics:

    • 2 NUMA nodes
    • 3 CPU cores on each node
    • 2 threads on each core

    The output of virsh nodeinfo of such a machine would look similar to:

    # virsh nodeinfo
    CPU model:           x86_64
    CPU(s):              12
    CPU frequency:       3661 MHz
    CPU socket(s):       2
    Core(s) per socket:  3
    Thread(s) per core:  2
    NUMA cell(s):        2
    Memory size:         31248692 KiB
  • You intend to modify an existing VM to have 8 vCPUs, which means that it will not fit in a single NUMA node.

    Therefore, you should distribute 4 vCPUs on each NUMA node and make the vCPU topology resemble the host topology as closely as possible. This means that vCPUs that run as sibling threads of a given physical CPU should be pinned to host threads on the same core. For details, see the Solution below:

Solution

  1. Obtain the information on the host topology:

    # virsh capabilities

    The output should include a section that looks similar to the following:

    <topology>
      <cells num="2">
        <cell id="0">
          <memory unit="KiB">15624346</memory>
          <pages unit="KiB" size="4">3906086</pages>
          <pages unit="KiB" size="2048">0</pages>
          <pages unit="KiB" size="1048576">0</pages>
          <distances>
            <sibling id="0" value="10" />
            <sibling id="1" value="21" />
          </distances>
          <cpus num="6">
            <cpu id="0" socket_id="0" core_id="0" siblings="0,3" />
            <cpu id="1" socket_id="0" core_id="1" siblings="1,4" />
            <cpu id="2" socket_id="0" core_id="2" siblings="2,5" />
            <cpu id="3" socket_id="0" core_id="0" siblings="0,3" />
            <cpu id="4" socket_id="0" core_id="1" siblings="1,4" />
            <cpu id="5" socket_id="0" core_id="2" siblings="2,5" />
          </cpus>
        </cell>
        <cell id="1">
          <memory unit="KiB">15624346</memory>
          <pages unit="KiB" size="4">3906086</pages>
          <pages unit="KiB" size="2048">0</pages>
          <pages unit="KiB" size="1048576">0</pages>
          <distances>
            <sibling id="0" value="21" />
            <sibling id="1" value="10" />
          </distances>
          <cpus num="6">
            <cpu id="6" socket_id="1" core_id="3" siblings="6,9" />
            <cpu id="7" socket_id="1" core_id="4" siblings="7,10" />
            <cpu id="8" socket_id="1" core_id="5" siblings="8,11" />
            <cpu id="9" socket_id="1" core_id="3" siblings="6,9" />
            <cpu id="10" socket_id="1" core_id="4" siblings="7,10" />
            <cpu id="11" socket_id="1" core_id="5" siblings="8,11" />
          </cpus>
        </cell>
      </cells>
    </topology>
  2. Optional: Test the performance of the VM using the applicable tools and utilities.
  3. Set up and mount 1 GiB huge pages on the host:

    1. Add the following line to the host’s kernel command line:

      default_hugepagesz=1G hugepagesz=1G
    2. Create the /etc/systemd/system/hugetlb-gigantic-pages.service file with the following content:

      [Unit]
      Description=HugeTLB Gigantic Pages Reservation
      DefaultDependencies=no
      Before=dev-hugepages.mount
      ConditionPathExists=/sys/devices/system/node
      ConditionKernelCommandLine=hugepagesz=1G
      
      [Service]
      Type=oneshot
      RemainAfterExit=yes
      ExecStart=/etc/systemd/hugetlb-reserve-pages.sh
      
      [Install]
      WantedBy=sysinit.target
    3. Create the /etc/systemd/hugetlb-reserve-pages.sh file with the following content:

      #!/bin/sh
      
      nodes_path=/sys/devices/system/node/
      if [ ! -d $nodes_path ]; then
      	echo "ERROR: $nodes_path does not exist"
      	exit 1
      fi
      
      reserve_pages()
      {
      	echo $1 > $nodes_path/$2/hugepages/hugepages-1048576kB/nr_hugepages
      }
      
      reserve_pages 4 node1
      reserve_pages 4 node2

      This reserves four 1GiB huge pages from node1 and four 1GiB huge pages from node2.

    4. Make the script created in the previous step executable:

      # chmod +x /etc/systemd/hugetlb-reserve-pages.sh
    5. Enable huge page reservation on boot:

      # systemctl enable hugetlb-gigantic-pages
  4. Use the virsh edit command to edit the XML configuration of the VM you wish to optimize, in this example super-VM:

    # virsh edit super-vm
  5. Adjust the XML configuration of the VM in the following way:

    1. Set the VM to use 8 static vCPUs. Use the <vcpu/> element to do this.
    2. Pin each of the vCPU threads to the corresponding host CPU threads that it mirrors in the topology. To do so, use the <vcpupin/> elements in the <cputune> section.

      Note that, as shown by the virsh capabilities utility above, host CPU threads are not ordered sequentially in their respective cores. In addition, the vCPU threads should be pinned to the highest available set of host cores on the same NUMA node. For a table illustration, see the Additional Resources section below.

      The XML configuration for steps a. and b. can look similar to:

      <cputune>
        <vcpupin vcpu='0' cpuset='1'/>
        <vcpupin vcpu='1' cpuset='4'/>
        <vcpupin vcpu='2' cpuset='2'/>
        <vcpupin vcpu='3' cpuset='5'/>
        <vcpupin vcpu='4' cpuset='7'/>
        <vcpupin vcpu='5' cpuset='10'/>
        <vcpupin vcpu='6' cpuset='8'/>
        <vcpupin vcpu='7' cpuset='11'/>
        <emulatorpin cpuset='6,9'/>
      </cputune>
    3. Set the VM to use 1 GiB huge pages:

      <memoryBacking>
        <hugepages>
          <page size='1' unit='GiB'/>
        </hugepages>
      </memoryBacking>
    4. Configure the VM’s NUMA nodes to use memory from the corresponding NUMA nodes on the host. To do so, use the <memnode/> elements in the <numatune/> section:

      <numatune>
        <memory mode="preferred" nodeset="1"/>
        <memnode cellid="0" mode="strict" nodeset="0"/>
        <memnode cellid="1" mode="strict" nodeset="1"/>
      </numatune>
    5. Ensure the CPU mode is set to host-passthrough, and that the CPU uses cache in passthrough mode:

      <cpu mode="host-passthrough">
        <topology sockets="2" cores="2" threads="2"/>
        <cache mode="passthrough"/>
  6. The resulting XML configuration of the VM should include a section similar to the following:

    [...]
      <memoryBacking>
        <hugepages>
          <page size='1' unit='GiB'/>
        </hugepages>
      </memoryBacking>
      <vcpu placement='static'>8</vcpu>
      <cputune>
        <vcpupin vcpu='0' cpuset='1'/>
        <vcpupin vcpu='1' cpuset='4'/>
        <vcpupin vcpu='2' cpuset='2'/>
        <vcpupin vcpu='3' cpuset='5'/>
        <vcpupin vcpu='4' cpuset='7'/>
        <vcpupin vcpu='5' cpuset='10'/>
        <vcpupin vcpu='6' cpuset='8'/>
        <vcpupin vcpu='7' cpuset='11'/>
        <emulatorpin cpuset='6,9'/>
      </cputune>
      <numatune>
        <memory mode="preferred" nodeset="1"/>
        <memnode cellid="0" mode="strict" nodeset="0"/>
        <memnode cellid="1" mode="strict" nodeset="1"/>
      </numatune>
      <cpu mode="host-passthrough">
        <topology sockets="2" cores="2" threads="2"/>
        <cache mode="passthrough"/>
        <numa>
          <cell id="0" cpus="0-3" memory="2" unit="GiB">
            <distances>
              <sibling id="0" value="10"/>
              <sibling id="1" value="21"/>
            </distances>
          </cell>
          <cell id="1" cpus="4-7" memory="2" unit="GiB">
            <distances>
              <sibling id="0" value="21"/>
              <sibling id="1" value="10"/>
            </distances>
          </cell>
        </numa>
      </cpu>
    </domain>
  7. Optional: Test the performance of the VM using the applicable tools and utilities to evaluate the impact of the VM’s optimization.

Additional resources

  • The following tables illustrate the connections between the vCPUs and the host CPUs they should be pinned to:

    Table 14.1. Host topology

    CPU threads

    0

    3

    1

    4

    2

    5

    6

    9

    7

    10

    8

    11

    Cores

    0

    1

    2

    3

    4

    5

    Sockets

    0

    1

    NUMA nodes

    0

    1

    Table 14.2. VM topology

    vCPU threads

    0

    1

    2

    3

    4

    5

    6

    7

    Cores

    0

    1

    2

    3

    Sockets

    0

    1

    NUMA nodes

    0

    1

    Table 14.3. Combined host and VM topology

    vCPU threads

     

    0

    1

    2

    3

     

    4

    5

    6

    7

    Host CPU threads

    0

    3

    1

    4

    2

    5

    6

    9

    7

    10

    8

    11

    Cores

    0

    1

    2

    3

    4

    5

    Sockets

    0

    1

    NUMA nodes

    0

    1

    In this scenario, there are 2 NUMA nodes and 8 vCPUs. Therefore, 4 vCPU threads should be pinned to each node.

    In addition, Red Hat recommends leaving at least a single CPU thread available on each node for host system operations.

    Because in this example, each NUMA node houses 3 cores, each with 2 host CPU threads, the set for node 0 translates as follows:

    <vcpupin vcpu='0' cpuset='1'/>
    <vcpupin vcpu='1' cpuset='4'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='5'/>

14.5.5. Deactivating kernel same-page merging

Although kernel same-page merging (KSM) improves memory density, it increases CPU utilization, and might adversely affect overall performance depending on the workload. In such cases, you can improve the virtual machine (VM) performance by deactivating KSM.

Depending on your requirements, you can either deactivate KSM for a single session or persistently.

Procedure

  • To deactivate KSM for a single session, use the systemctl utility to stop ksm and ksmtuned services.

    # systemctl stop ksm
    
    # systemctl stop ksmtuned
  • To deactivate KSM persistently, use the systemctl utility to disable ksm and ksmtuned services.

    # systemctl disable ksm
    Removed /etc/systemd/system/multi-user.target.wants/ksm.service.
    # systemctl disable ksmtuned
    Removed /etc/systemd/system/multi-user.target.wants/ksmtuned.service.
Note

Memory pages shared between VMs before deactivating KSM will remain shared. To stop sharing, delete all the PageKSM pages in the system using the following command:

# echo 2 > /sys/kernel/mm/ksm/run

After anonymous pages replace the KSM pages, the khugepaged kernel service will rebuild transparent hugepages on the VM’s physical memory.

14.6. Optimizing virtual machine network performance

Due to the virtual nature of a VM’s network interface card (NIC), the VM loses a portion of its allocated host network bandwidth, which can reduce the overall workload efficiency of the VM. The following tips can minimize the negative impact of virtualization on the virtual NIC (vNIC) throughput.

Procedure

Use any of the following methods and observe if it has a beneficial effect on your VM network performance:

Enable the vhost_net module

On the host, ensure the vhost_net kernel feature is enabled:

# lsmod | grep vhost
vhost_net              32768  1
vhost                  53248  1 vhost_net
tap                    24576  1 vhost_net
tun                    57344  6 vhost_net

If the output of this command is blank, enable the vhost_net kernel module:

# modprobe vhost_net
Set up multi-queue virtio-net

To set up the multi-queue virtio-net feature for a VM, use the virsh edit command to edit to the XML configuration of the VM. In the XML, add the following to the <devices> section, and replace N with the number of vCPUs in the VM, up to 16:

<interface type='network'>
      <source network='default'/>
      <model type='virtio'/>
      <driver name='vhost' queues='N'/>
</interface>

If the VM is running, restart it for the changes to take effect.

Batching network packets

In Linux VM configurations with a long transmission path, batching packets before submitting them to the kernel may improve cache utilization. To set up packet batching, use the following command on the host, and replace tap0 with the name of the network interface that the VMs use:

# ethtool -C tap0 rx-frames 128
SR-IOV
If your host NIC supports SR-IOV, use SR-IOV device assignment for your vNICs. For more information, see Managing SR-IOV devices.

Additional resources

14.7. Virtual machine performance monitoring tools

To identify what consumes the most VM resources and which aspect of VM performance needs optimization, performance diagnostic tools, both general and VM-specific, can be used.

Default OS performance monitoring tools

For standard performance evaluation, you can use the utilities provided by default by your host and guest operating systems:

  • On your RHEL 8 host, as root, use the top utility or the system monitor application, and look for qemu and virt in the output. This shows how much host system resources your VMs are consuming.

    • If the monitoring tool displays that any of the qemu or virt processes consume a large portion of the host CPU or memory capacity, use the perf utility to investigate. For details, see below.
    • In addition, if a vhost_net thread process, named for example vhost_net-1234, is displayed as consuming an excessive amount of host CPU capacity, consider using virtual network optimization features, such as multi-queue virtio-net.
  • On the guest operating system, use performance utilities and applications available on the system to evaluate which processes consume the most system resources.

    • On Linux systems, you can use the top utility.
    • On Windows systems, you can use the Task Manager application.

perf kvm

You can use the perf utility to collect and analyze virtualization-specific statistics about the performance of your RHEL 8 host. To do so:

  1. On the host, install the perf package:

    # yum install perf
  2. Use one of the perf kvm stat commands to display perf statistics for your virtualization host:

    • For real-time monitoring of your hypervisor, use the perf kvm stat live command.
    • To log the perf data of your hypervisor over a period of time, activate the logging using the perf kvm stat record command. After the command is canceled or interrupted, the data is saved in the perf.data.guest file, which can be analyzed using the perf kvm stat report command.
  3. Analyze the perf output for types of VM-EXIT events and their distribution. For example, the PAUSE_INSTRUCTION events should be infrequent, but in the following output, the high occurrence of this event suggests that the host CPUs are not handling the running vCPUs well. In such a scenario, consider shutting down some of your active VMs, removing vCPUs from these VMs, or tuning the performance of the vCPUs.

    # perf kvm stat report
    
    Analyze events for all VMs, all VCPUs:
    
    
                 VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time
    
      EXTERNAL_INTERRUPT     365634    31.59%    18.04%      0.42us  58780.59us    204.08us ( +-   0.99% )
               MSR_WRITE     293428    25.35%     0.13%      0.59us  17873.02us      1.80us ( +-   4.63% )
        PREEMPTION_TIMER     276162    23.86%     0.23%      0.51us  21396.03us      3.38us ( +-   5.19% )
       PAUSE_INSTRUCTION     189375    16.36%    11.75%      0.72us  29655.25us    256.77us ( +-   0.70% )
                     HLT      20440     1.77%    69.83%      0.62us  79319.41us  14134.56us ( +-   0.79% )
                  VMCALL      12426     1.07%     0.03%      1.02us   5416.25us      8.77us ( +-   7.36% )
           EXCEPTION_NMI         27     0.00%     0.00%      0.69us      1.34us      0.98us ( +-   3.50% )
           EPT_MISCONFIG          5     0.00%     0.00%      5.15us     10.85us      7.88us ( +-  11.67% )
    
    Total Samples:1157497, Total events handled time:413728274.66us.

    Other event types that can signal problems in the output of perf kvm stat include:

For more information on using perf to monitor virtualization performance, see the perf-kvm man page.

numastat

To see the current NUMA configuration of your system, you can use the numastat utility, which is provided by installing the numactl package.

The following shows a host with 4 running VMs, each obtaining memory from multiple NUMA nodes. This is not optimal for vCPU performance, and warrants adjusting:

# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
51722 (qemu-kvm)     68     16    357   6936      2      3    147    598  8128
51747 (qemu-kvm)    245     11      5     18   5172   2532      1     92  8076
53736 (qemu-kvm)     62    432   1661    506   4851    136     22    445  8116
53773 (qemu-kvm)   1393      3      1      2     12      0      0   6702  8114
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
Total              1769    463   2024   7462  10037   2672    169   7837 32434

In contrast, the following shows memory being provided to each VM by a single node, which is significantly more efficient.

# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
51747 (qemu-kvm)      0      0      7      0   8072      0      1      0  8080
53736 (qemu-kvm)      0      0      7      0      0      0   8113      0  8120
53773 (qemu-kvm)      0      0      7      0      0      0      1   8110  8118
59065 (qemu-kvm)      0      0   8050      0      0      0      0      0  8051
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
Total                 0      0   8072      0   8072      0   8114   8110 32368

Chapter 15. Importance of power management

Reducing the overall power consumption of computer systems helps to save cost. Effectively optimizing energy consumption of each system component includes studying different tasks that your system performs, and configuring each component to ensure that its performance is correct for that job. Lowering the power consumption of a specific component or of the system as a whole leads to lower heat and performance.

Proper power management results in:

  • heat reduction for servers and computing centers
  • reduced secondary costs, including cooling, space, cables, generators, and uninterruptible power supplies (UPS)
  • extended battery life for laptops
  • lower carbon dioxide output
  • meeting government regulations or legal requirements regarding Green IT, for example, Energy Star
  • meeting company guidelines for new systems

This section describes the information regarding power management of your Red Hat Enterprise Linux systems.

15.1. Power management basics

Effective power management is built on the following principles:

An idle CPU should only wake up when needed

Since Red Hat Enterprise Linux 6, the kernel runs tickless, which means the previous periodic timer interrupts have been replaced with on-demand interrupts. Therefore, idle CPUs are allowed to remain idle until a new task is queued for processing, and CPUs that have entered lower power states can remain in these states longer. However, benefits from this feature can be offset if your system has applications that create unnecessary timer events. Polling events, such as checks for volume changes or mouse movement, are examples of such events.

Red Hat Enterprise Linux includes tools using which you can identify and audit applications on the basis of their CPU usage. For more information see, Audit and analysis overview and Tools for auditing.

Unused hardware and devices should be disabled completely
This is true for devices that have moving parts, for example, hard disks. In addition to this, some applications may leave an unused but enabled device "open"; when this occurs, the kernel assumes that the device is in use, which can prevent the device from going into a power saving state.
Low activity should translate to low wattage

In many cases, however, this depends on modern hardware and correct BIOS configuration or UEFI on modern systems, including non-x86 architectures. Make sure that you are using the latest official firmware for your systems and that in the power management or device configuration sections of the BIOS the power management features are enabled. Some features to look for include:

  • Collaborative Processor Performance Controls (CPPC) support for ARM64
  • PowerNV support for IBM Power Systems
  • SpeedStep
  • PowerNow!
  • Cool’n’Quiet
  • ACPI (C-state)
  • Smart

    If your hardware has support for these features and they are enabled in the BIOS, Red Hat Enterprise Linux uses them by default.

Different forms of CPU states and their effects

Modern CPUs together with Advanced Configuration and Power Interface (ACPI) provide different power states. The three different states are:

  • Sleep (C-states)
  • Frequency and voltage (P-states)
  • Heat output (T-states or thermal states)

    A CPU running on the lowest sleep state, consumes the least amount of watts, but it also takes considerably more time to wake it up from that state when needed. In very rare cases this can lead to the CPU having to wake up immediately every time it just went to sleep. This situation results in an effectively permanently busy CPU and loses some of the potential power saving if another state had been used.

A turned off machine uses the least amount of power
One of the best ways to save power is to turn off systems. For example, your company can develop a corporate culture focused on "green IT" awareness with a guideline to turn off machines during lunch break or when going home. You also might consolidate several physical servers into one bigger server and virtualize them using the virtualization technology, which is shipped with Red Hat Enterprise Linux.

15.2. Audit and analysis overview

The detailed manual audit, analysis, and tuning of a single system is usually the exception because the time and cost spent to do so typically outweighs the benefits gained from these last pieces of system tuning.

However, performing these tasks once for a large number of nearly identical systems where you can reuse the same settings for all systems can be very useful. For example, consider the deployment of thousands of desktop systems, or an HPC cluster where the machines are nearly identical. Another reason to do auditing and analysis is to provide a basis for comparison against which you can identify regressions or changes in system behavior in the future. The results of this analysis can be very helpful in cases where hardware, BIOS, or software updates happen regularly and you want to avoid any surprises with regard to power consumption. Generally, a thorough audit and analysis gives you a much better idea of what is really happening on a particular system.

Auditing and analyzing a system with regard to power consumption is relatively hard, even with the most modern systems available. Most systems do not provide the necessary means to measure power use via software. Exceptions exist though:

  • iLO management console of Hewlett Packard server systems has a power management module that you can access through the web.
  • IBM provides a similar solution in their BladeCenter power management module.
  • On some Dell systems, the IT Assistant offers power monitoring capabilities as well.

Other vendors are likely to offer similar capabilities for their server platforms, but as can be seen there is no single solution available that is supported by all vendors. Direct measurements of power consumption are often only necessary to maximize savings as far as possible.

15.3. Tools for auditing

Red Hat Enterprise Linux 8 offers tools using which you can perform system auditing and analysis. Most of them can be used as supplementary sources of information in case you want to verify what you have discovered already or in case you need more in-depth information on certain parts.

Many of these tools are used for performance tuning as well, which include:

PowerTOP
It identifies specific components of kernel and user-space applications that frequently wake up the CPU. Use the powertop command as root to start the PowerTop tool and powertop --calibrate to calibrate the power estimation engine. For more information on PowerTop, see Managing power consumption with PowerTOP.
Diskdevstat and netdevstat

They are SystemTap tools that collect detailed information about the disk activity and network activity of all applications running on a system. Using the collected statistics by these tools, you can identify applications that waste power with many small I/O operations rather than fewer, larger operations. Using the yum install tuned-utils-systemtap kernel-debuginfo command as root, install the diskdevstat and netdevstat tool.

To view the detailed information about the disk and network activity, use:

# diskdevstat

PID   UID   DEV   WRITE_CNT   WRITE_MIN   WRITE_MAX   WRITE_AVG   READ_CNT   READ_MIN   READ_MAX   READ_AVG   COMMAND

3575  1000  dm-2   59          0.000      0.365        0.006        5         0.000        0.000      0.000      mozStorage #5
3575  1000  dm-2    7          0.000      0.000        0.000        0         0.000        0.000      0.000      localStorage DB
[...]


# netdevstat

PID   UID   DEV       XMIT_CNT   XMIT_MIN   XMIT_MAX   XMIT_AVG   RECV_CNT   RECV_MIN   RECV_MAX   RECV_AVG   COMMAND
3572  991  enp0s31f6    40       0.000      0.882       0.108        0         0.000       0.000       0.000     openvpn
3575  1000 enp0s31f6    27       0.000      1.363       0.160        0         0.000       0.000       0.000     Socket Thread
[...]

With these commands, you can specify three parameters: update_interval, total_duration, and display_histogram.

Tuned
It is a profile-based system tuning tool that uses the udev device manager to monitor connected devices, and enables both static and dynamic tuning of system settings. You can use the tuned-adm recommend command to determine which profile Red Hat recommends as the most suitable for a particular product. For more information on Tuned, see Getting started with Tuned and Customizing Tuned profiles. Using the powertop2tuned utility, you can create custom Tuned profiles from PowerTOP suggestions. For information on the powertop2tuned utility, see Optimizing power consumption.
Virtual memory statistics (vmstat)

It is provided by the procps-ng package. Using this tool, you can view the detailed information about processes, memory, paging, block I/O, traps, and CPU activity.

To view this information, use:

$ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r  b  swpd  free    buff   cache   si   so  bi   bo   in  cs  us  sy id  wa  st
1  0   0   5805576 380856 4852848   0    0  119  73  814  640  2   2 96   0   0

Using the vmstat -a command, you can display active and inactive memory. For more information on other vmstat options, see the vmstat man page.

iostat

It is provided by the sysstat package. This tool is similar to vmstat, but only for monitoring I/O on block devices. It also provides more verbose output and statistics.

To monitor the system I/O, use:

$ iostat
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.05    0.46    1.55    0.26    0.00   95.67

Device     tps     kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme0n1    53.54     899.48     616.99      3445229     2363196
dm-0       42.84     753.72     238.71      2886921      914296
dm-1        0.03       0.60       0.00         2292           0
dm-2       24.15     143.12     379.80       548193     1454712
blktrace

It provides detailed information about how time is spent in the I/O subsystem.

To view this information in human readable format, use:

# blktrace -d /dev/dm-0 -o - | blkparse -i -

253,0   1    1   0.000000000  17694  Q   W 76423384 + 8 [kworker/u16:1]
253,0   2    1   0.001926913     0   C   W 76423384 + 8 [0]
[...]

Here, The first column, 253,0 is the device major and minor tuple. The second column, 1, gives information about the CPU, followed by columns for timestamps and PID of the process issuing the IO process.

The sixth column, Q, shows the event type, the 7th column, W for write operation, the 8th column, 76423384, is the block number, and the + 8 is the number of requested blocks.

The last field, [kworker/u16:1], is the process name.

By default, the blktrace command runs forever until the process is explicitly killed. Use the -w option to specify the run-time duration.

turbostat

It is provided by the kernel-tools package. It reports on processor topology, frequency, idle power-state statistics, temperature, and power usage on x86-64 processors.

To view this summary, use:

# turbostat

CPUID(0): GenuineIntel 0x16 CPUID levels; 0x80000008 xlevels; family:model:stepping 0x6:8e:a (6:142:10)
CPUID(1): SSE3 MONITOR SMX EIST TM2 TSC MSR ACPI-TM HT TM
CPUID(6): APERF, TURBO, DTS, PTM, HWP, HWPnotify, HWPwindow, HWPepp, No-HWPpkg, EPB
[...]

By default, turbostat prints a summary of counter results for the entire screen, followed by counter results every 5 seconds. Specify a different period between counter results with the -i option, for example, execute turbostat -i 10 to print results every 10 seconds instead.

Turbostat is also useful for identifying servers that are inefficient in terms of power usage or idle time. It also helps to identify the rate of system management interrupts (SMIs) occurring on the system. It can also be used to verify the effects of power management tuning.

cpupower

IT is a collection of tools to examine and tune power saving related features of processors. Use the cpupower command with the frequency-info, frequency-set, idle-info, idle-set, set, info, and monitor options to display and set processor related values.

For example, to view available cpufreq governors, use:

$ cpupower frequency-info --governors
analyzing CPU 0:
  available cpufreq governors: performance powersave

For more information about cpupower, see Viewing CPU related information.

GNOME Power Manager
It is a daemon that is installed as part of the GNOME desktop environment. GNOME Power Manager notifies you of changes in your system’s power status; for example, a change from battery to AC power. It also reports battery status, and warns you when battery power is low.

Additional resources

  • powertop(1), diskdevstat(8), netdevstat(8), tuned(8), vmstat(8), iostat(1), blktrace(8), blkparse(8), and turbostat(8) man pages
  • cpupower(1), cpupower-set(1), cpupower-info(1), cpupower-idle(1), cpupower-frequency-set(1), cpupower-frequency-info(1), and cpupower-monitor(1) man pages

Chapter 16. Managing power consumption with PowerTOP

As a system administrator, you can use the PowerTOP tool to analyze and manage power consumption.

16.1. The purpose of PowerTOP

PowerTOP is a program that diagnoses issues related to power consumption and provides suggestions on how to extend battery lifetime.

The PowerTOP tool can provide an estimate of the total power usage of the system and also individual power usage for each process, device, kernel worker, timer, and interrupt handler. The tool can also identify specific components of kernel and user-space applications that frequently wake up the CPU.

Red Hat Enterprise Linux 8 uses version 2.x of PowerTOP.

16.2. Using PowerTOP

Prerequisites

  • To be able to use PowerTOP, make sure that the powertop package has been installed on your system:

    # yum install powertop

16.2.1. Starting PowerTOP

Procedure

  • To run PowerTOP, use the following command:

    # powertop
Important

Laptops should run on battery power when running the powertop command.

16.2.2. Calibrating PowerTOP

Procedure

  1. On a laptop, you can calibrate the power estimation engine by running the following command:

    # powertop --calibrate
  2. Let the calibration finish without interacting with the machine during the process.

    Calibration takes time because the process performs various tests, cycles through brightness levels and switches devices on and off.

  3. When the calibration process is completed, PowerTOP starts as normal. Let it run for approximately an hour to collect data.

    When enough data is collected, power estimation figures will be displayed in the first column of the output table.

Note

Note that powertop --calibrate can only be used on laptops.

16.2.3. Setting the measuring interval

By default, PowerTOP takes measurements in 20 seconds intervals.

If you want to change this measuring frequency, use the following procedure:

Procedure

  • Run the powertop command with the --time option:

    # powertop --time=time in seconds

16.3. PowerTOP statistics

While it runs, PowerTOP gathers statistics from the system.

PowerTOP's output provides multiple tabs:

  • Overview
  • Idle stats
  • Frequency stats
  • Device stats
  • Tunables

You can use the Tab and Shift+Tab keys to cycle through these tabs.

16.3.1. The Overview tab

In the Overview tab, you can view a list of the components that either send wakeups to the CPU most frequently or consume the most power. The items within the Overview tab, including processes, interrupts, devices, and other resources, are sorted according to their utilization.

The adjacent columns within the Overview tab provide the following pieces of information:

Usage
Power estimation of how the resource is being used.
Events/s
Wakeups per second. The number of wakeups per second indicates how efficiently the services or the devices and drivers of the kernel are performing. Less wakeups means that less power is consumed. Components are ordered by how much further their power usage can be optimized.
Category
Classification of the component; such as process, device, or timer.
Description
Description of the component.

If properly calibrated, a power consumption estimation for every listed item in the first column is shown as well.

Apart from this, the Overview tab includes the line with summary statistics such as:

  • Total power consumption
  • Remaining battery life (only if applicable)
  • Summary of total wakeups per second, GPU operations per second, and virtual file system operations per second

16.3.2. The Idle stats tab

The Idle stats tab shows usage of C-states for all processors and cores, while the Frequency stats tab shows usage of P-states including the Turbo mode, if applicable, for all processors and cores. The duration of C- or P-states is an indication of how well the CPU usage has been optimized. The longer the CPU stays in the higher C- or P-states (for example C4 is higher than C3), the better the CPU usage optimization is. Ideally, residency is 90% or more in the highest C- or P-state when the system is idle.

16.3.3. The Device stats tab

The Device stats tab provides similar information to the Overview tab but only for devices.

16.3.4. The Tunables tab

The Tunables tab contains PowerTOP's suggestions for optimizing the system for lower power consumption.

Use the up and down keys to move through suggestions, and the enter key to toggle the suggestion on or off.

Figure 16.1. PowerTOP output

powertop output n

Additional resources

For more details on PowerTOP, see PowerTOP’s home page.

16.4. Why Powertop does not display Frequency stats values in some instances

While using the Intel P-State driver, PowerTOP only displays values in the Frequency Stats tab if the driver is in passive mode. But, even in this case, the values may be incomplete.

In total, there are three possible modes of the Intel P-State driver:

  • Active mode with Hardware P-States (HWP)
  • Active mode without HWP
  • Passive mode

Switching to the ACPI CPUfreq driver results in complete information being displayed by PowerTOP. However, it is recommended to keep your system on the default settings.

To see what driver is loaded and in what mode, run:

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
  • intel_pstate is returned if the Intel P-State driver is loaded and in active mode.
  • intel_cpufreq is returned if the Intel P-State driver is loaded and in passive mode.
  • acpi-cpufreq is returned if the ACPI CPUfreq driver is loaded.

While using the Intel P-State driver, add the following argument to the kernel boot command line to force the driver to run in passive mode:

intel_pstate=passive

To disable the Intel P-State driver and use, instead, the ACPI CPUfreq driver, add the following argument to the kernel boot command line:

intel_pstate=disable

16.5. Generating an HTML output

Apart from the powertop’s output in terminal, you can also generate an HTML report.

Procedure

  • Run the powertop command with the --html option:

    # powertop --html=htmlfile.html

    Replace the htmlfile.html parameter with the required name for the output file.

16.6. Optimizing power consumption

To optimize power consumption, you can use either the powertop service or the powertop2tuned utility.

16.6.1. Optimizing power consumption using the powertop service

You can use the powertop service to automatically enable all PowerTOP's suggestions from the Tunables tab on the boot:

Procedure

  • Enable the powertop service:

    # systemctl enable powertop

16.6.2. The powertop2tuned utility

The powertop2tuned utility allows you to create custom Tuned profiles from PowerTOP suggestions.

By default, powertop2tuned creates profiles in the /etc/tuned/ directory, and bases the custom profile on the currently selected Tuned profile. For safety reasons, all PowerTOP tunings are initially disabled in the new profile.

To enable the tunings, you can:

  • Uncomment them in the /etc/tuned/profile_name/tuned.conf file.
  • Use the --enable or -e option to generate a new profile that enables most of the tunings suggested by PowerTOP.

    Certain potentially problematic tunings, such as the USB autosuspend, are disabled by default and need to be uncommented manually.

16.6.3. Optimizing power consumption using the powertop2tuned utility

Prerequisites

  • The powertop2tuned utility is installed on the system:

    # yum install tuned-utils

Procedure

  1. Create a custom profile:

    # powertop2tuned new_profile_name
  2. Activate the new profile:

    # tuned-adm profile new_profile_name

Additional information

  • For a complete list of options that powertop2tuned supports, use:

    $ powertop2tuned --help

16.6.4. Comparison of powertop.service and powertop2tuned

Optimizing power consumption with powertop2tuned is preferred over powertop.service for the following reasons:

  • The powertop2tuned utility represents integration of PowerTOP into Tuned, which enables to benefit of advantages of both tools.
  • The powertop2tuned utility allows for fine-grained control of enabled tuning.
  • With powertop2tuned, potentially dangerous tuning are not automatically enabled.
  • With powertop2tuned, rollback is possible without reboot.

Chapter 17. Tuning CPU frequency to optimize energy consumption

This section describes how to optimize the power consumption of your system by using the available cpupower commands to set CPU speed on a system as per your requirements after setting up the required CPUfreq governor.

17.1. Supported cpupower tool commands

The cpupower tool is a collection of tools to examine and tune power saving related features of processors.

The cpupower tool supports the following commands:

idle-info
Displays the available idle states and other statistics for the CPU idle driver using the cpupower idle-info command. For more information, see CPU Idle States.
idle-set
Enables or disables specific CPU idle state using the cpupower idle-set command as root. Use -d to disable and -e to enable a specific CPU idle state.
frequency-info
Displays the current cpufreq driver and available cpufreq governors using the cpupower frequency-info command. For more information, see CPUfreq drivers, Core CPUfreq Governors, and Intel P-state CPUfreq governors.
frequency-set
Sets the cpufreq and governors using the cpupower frequency-set command as root. For more information, see Setting up CPUfreq governor.
set

Sets processor power saving policies using the cpupower set command as root.

Using the --perf-bias option, you can enable software on supported Intel processors to determine the balance between optimum performance and saving power. Assigned values range from 0 to 15, where 0 is optimum performance and 15 is optimum power efficiency. By default, the --perf-bias option applies to all cores. To apply it only to individual cores, add the --cpu cpulist option.

info

Displays processor power related and hardware configurations, which you have enabled using the cpupower set command. For example, if you assign the --perf-bias value as 5:

# cpupower set --perf-bias 5
# cpupower info
analyzing CPU 0:
perf-bias: 5
monitor

Displays the idle statistics and CPU demands using the cpupower monitor command.

# cpupower monitor
 | Nehalem       || Mperf    ||Idle_Stats
 CPU| C3   | C6   | PC3  | PC6  || C0   | Cx   | Freq || POLL | C1   | C1E  | C3   | C6   | C7s  | C8   | C9   | C10
   0|  1.95| 55.12|  0.00|  0.00||  4.21| 95.79|  3875||  0.00|  0.68|  2.07|  3.39| 88.77|  0.00|  0.00|  0.00| 0.00
[...]

Using the -l option, you can list all available monitors on your system and the -m option to display information related to specific monitors. For example, to monitor information related to the Mperf monitor, use the cpupower monitor -m Mperf command as root.

Additional resources

  • cpupower(1), cpupower-idle-info(1), cpupower-idle-set(1), cpupower-frequency-set(1), cpupower-frequency-info(1), cpupower-set(1), cpupower-info(1), and cpupower-monitor(1) man pages

17.2. CPU Idle States

CPUs with the x86 architecture support various states, such as, few parts of the CPU are deactivated or using lower performance settings, known as C-states.

With this state, you can save power by partially deactivating CPUs that are not in use. There is no need to configure the C-state, unlike P-states that require a governor and potentially some set up to avoid undesirable power or performance issues. C-states are numbered from C0 upwards, with higher numbers representing decreased CPU functionality and greater power saving. C-states of a given number are broadly similar across processors, although the exact details of the specific feature sets of the state may vary between processor families. C-states 0–3 are defined as follows:

C0
In this state, the CPU is working and not idle at all.
C1, Halt
In this state, the processor is not executing any instructions but is typically not in a lower power state. The CPU can continue processing with practically no delay. All processors offering C-states need to support this state. Pentium 4 processors support an enhanced C1 state called C1E that actually is a state for lower power consumption.
C2, Stop-Clock
In this state, the clock is frozen for this processor but it keeps the complete state for its registers and caches, so after starting the clock again it can immediately start processing again. This is an optional state.
C3, Sleep
In this state, the processor goes to sleep and does not need to keep its cache up to date. Due to this reason, waking up from this state needs considerably more time than from the C2 state. This is an optional state.

You can view the available idle states and other statistics for the CPUidle driver using the following command:

$ cpupower idle-info
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 9
Available idle states: POLL C1 C1E C3 C6 C7s C8 C9 C10
[...]

Intel CPUs with the "Nehalem" microarchitecture features a C6 state, which can reduce the voltage supply of a CPU to zero, but typically reduces power consumption by between 80% and 90%. The kernel in Red Hat Enterprise Linux 8 includes optimizations for this new C-state.

Additional resources

  • cpupower(1) and cpupower-idle(1) man pages

17.3. Overview of CPUfreq

One of the most effective ways to reduce power consumption and heat output on your system is CPUfreq, which is supported by x86 and ARM64 architectures in Red Hat Enterprise Linux 8. CPUfreq, also referred to as CPU speed scaling, is the infrastructure in the Linux kernel that enables it to scale the CPU frequency in order to save power.

CPU scaling can be done automatically depending on the system load, in response to Advanced Configuration and Power Interface (ACPI) events, or manually by user-space programs, and it allows the clock speed of the processor to be adjusted on the fly. This enables the system to run at a reduced clock speed to save power. The rules for shifting frequencies, whether to a faster or slower clock speed and when to shift frequencies, are defined by the CPUfreq governor.

You can view the cpufreq information using the cpupower frequency-info command as root.

17.3.1. CPUfreq drivers

Using the cpupower frequency-info --driver command as root, you can view the current CPUfreq driver.

The following are the two available drivers for CPUfreq that can be used:

ACPI CPUfreq
Advanced Configuration and Power Interface (ACPI) CPUfreq driver is a kernel driver that controls the frequency of a particular CPU through ACPI, which ensures the communication between the kernel and the hardware.
Intel P-state

In Red Hat Enterprise Linux 8, Intel P-state driver is supported. The driver provides an interface for controlling the P-state selection on processors based on the Intel Xeon E series architecture or newer architectures.

Currently, Intel P-state is used by default for supported CPUs. You can switch to using ACPI CPUfreq by adding the intel_pstate=disable command to the kernel command line.

Intel P-state implements the setpolicy() callback. The driver decides what P-state to use based on the policy requested from the cpufreq core. If the processor is capable of selecting its next P-state internally, the driver offloads this responsibility to the processor. If not, the driver implements algorithms to select the next P-state.

Intel P-state provides its own sysfs files to control the P-state selection. These files are located in the /sys/devices/system/cpu/intel_pstate/ directory. Any changes made to the files are applicable to all CPUs.

This directory contains the following files that are used for setting P-state parameters:

  • max_perf_pct limits the maximum P-state requested by the driver expressed in a percentage of available performance. The available P-state performance can be reduced by the no_turbo setting.
  • min_perf_pct limits the minimum P-state requested by the driver, expressed in a percentage of the maximum no-turbo performance level.
  • no_turbo limits the driver to selecting P-state below the turbo frequency range.
  • turbo_pct displays the percentage of the total performance supported by hardware that is in the turbo range. This number is independent of whether turbo has been disabled or not.
  • num_pstates displays the number of P-states that are supported by hardware. This number is independent of whether turbo has been disabled or not.

Additional resources

  • cpupower-frequency-info(1) man page

17.3.2. Core CPUfreq governors

A CPUfreq governor defines the power characteristics of the system CPU, which in turn affects the CPU performance. Each governor has its own unique behavior, purpose, and suitability in terms of workload. Using the cpupower frequency-info --governor command as root, you can view the available CPUfreq governors.

Red Hat Enterprise Linux 8 includes multiple core CPUfreq governors:

cpufreq_performance
It forces the CPU to use the highest possible clock frequency. This frequency is statically set and does not change. As such, this particular governor offers no power saving benefit. It is only suitable for hours of a heavy workload, and only during times wherein the CPU is rarely or never idle.
cpufreq_powersave
It forces the CPU to use the lowest possible clock frequency. This frequency is statically set and does not change. This governor offers maximum power savings, but at the cost of the lowest CPU performance. The term "powersave" can sometimes be deceiving though, since in principle a slow CPU on full load consumes more power than a fast CPU that is not loaded. As such, while it may be advisable to set the CPU to use the powersave governor during times of expected low activity, any unexpected high loads during that time can cause the system to actually consume more power. The Powersave governor is more of a speed limiter for the CPU than a power saver. It is most useful in systems and environments where overheating can be a problem.
cpufreq_ondemand
It is a dynamic governor, using which you can enable the CPU to achieve maximum clock frequency when the system load is high, and also minimum clock frequency when the system is idle. While this allows the system to adjust power consumption accordingly with respect to system load, it does so at the expense of latency between frequency switching. As such, latency can offset any performance or power saving benefits offered by the ondemand governor if the system switches between idle and heavy workloads too often. For most systems, the ondemand governor can provide the best compromise between heat emission, power consumption, performance, and manageability. When the system is only busy at specific times of the day, the ondemand governor automatically switches between maximum and minimum frequency depending on the load without any further intervention.
cpufreq_userspace
It allows user-space programs, or any process running as root, to set the frequency. Of all the governors, userspace is the most customizable and depending on how it is configured, it can offer the best balance between performance and consumption for your system.
cpufreq_conservative
Similar to the ondemand governor, the conservative governor also adjusts the clock frequency according to usage. However, the conservative governor switches between frequencies more gradually. This means that the conservative governor adjusts to a clock frequency that it considers best for the load, rather than simply choosing between maximum and minimum. While this can possibly provide significant savings in power consumption, it does so at an ever greater latency than the ondemand governor.
Note

You can enable a governor using cron jobs. This allows you to automatically set specific governors during specific times of the day. As such, you can specify a low-frequency governor during idle times, for example, after work hours, and return to a higher-frequency governor during hours of heavy workload.

For instructions on how to enable a specific governor, see Setting up CPUfreq governor.

17.3.3. Intel P-state CPUfreq governors

By default, the Intel P-state driver operates in active mode with or without Hardware p-state (HWP) depending on whether the CPU supports HWP.

Using the cpupower frequency-info --governor command as root, you can view the available CPUfreq governors.

Note

The functionality of performance and powersave Intel P-state CPUfreq governors is different compared to core CPUfreq governors of the same names.

The Intel P-state driver can operate in the following three different modes:

Active mode with hardware-managed P-states

When active mode with HWP is used, the Intel P-state driver instructs the CPU to perform the P-state selection. The driver can provide frequency hints. However, the final selection depends on CPU internal logic. In active mode with HWP, the Intel P-state driver provides two P-state selection algorithms:

  • performance: With the performance governor, the driver instructs internal CPU logic to be performance-oriented. The range of allowed P-states is restricted to the upper boundary of the range that the driver is allowed to use.
  • powersave: With the powersave governor, the driver instructs internal CPU logic to be powersave-oriented.
Active mode without hardware-managed P-states

When active mode without HWP is used, the Intel P-state driver provides two P-state selection algorithms:

  • performance: With the performance governor, the driver chooses the maximum P-state it is allowed to use.
  • powersave: With the powersave governor, the driver chooses P-states proportional to the current CPU utilization. The behavior is similar to the ondemand CPUfreq core governor.
Passive mode
When the passive mode is used, the Intel P-state driver functions the same as the traditional CPUfreq scaling driver. All available generic CPUFreq core governors can be used.

17.3.4. Setting up CPUfreq governor

All CPUfreq drivers are built in as part of the kernel-tools package, and selected automatically. To set up CPUfreq, you need to select a governor.

Prerequisites

  • To use cpupower, install the kernel-tools package:

    # yum install kernel-tools

Procedure

  1. View which governors are available for use for a specific CPU:

    # cpupower frequency-info --governors
    analyzing CPU 0:
      available cpufreq governors: performance powersave
  2. Enable one of the governors on all CPUs:

    # cpupower frequency-set --governor performance

    Replace the performance governor with the cpufreq governor name as per your requirement.

    To only enable a governor on specific cores, use -c with a range or comma-separated list of CPU numbers. For example, to enable the userspace governor for CPUs 1-3 and 5, use:

    # cpupower -c 1-3,5 frequency-set --governor cpufreq_userspace
Note

If the kernel-tools package is not installed, the CPUfreq settings can be viewed in the /sys/devices/system/cpu/cpuid/cpufreq/ directory. Settings and values can be changed by writing to these tunables. For example, to set the minimum clock speed of cpu0 to 360 MHz, use:

# echo 360000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq

Verification

  • Verify that the governor is enabled:

    # cpupower frequency-info
    analyzing CPU 0:
      driver: intel_pstate
      CPUs which run at the same hardware frequency: 0
      CPUs which need to have their frequency coordinated by software: 0
      maximum transition latency:  Cannot determine or is not supported.
      hardware limits: 400 MHz - 4.20 GHz
      available cpufreq governors: performance powersave
      current policy: frequency should be within 400 MHz and 4.20 GHz.
            The governor "performance" may decide which speed to use within this range.
      current CPU frequency: Unable to call hardware
      current CPU frequency: 3.88 GHz (asserted by call to kernel)
      boost state support:
        Supported: yes
        Active: yes

    The current policy displays the recently enabled cpufreq governor. In this case, it is performance.

Additional resources

  • cpupower-frequency-info(1) and cpupower-frequency-set(1) man pages

Chapter 18. Getting started with perf

As a system administrator, you can use the perf tool to collect and analyze performance data of your system.

18.1. Introduction to perf

The perf user-space tool interfaces with the kernel-based subsystem Performance Counters for Linux (PCL). perf is a powerful tool that uses the Performance Monitoring Unit (PMU) to measure, record, and monitor a variety of hardware and software events. perf also supports tracepoints, kprobes, and uprobes.

18.2. Installing perf

This procedure installs the perf user-space tool.

Procedure

  • Install the perf tool:

    # yum install perf

18.3. Common perf commands

This section provides an overview of commonly used perf commands.

Commonly used perf commands

perf stat
This command provides overall statistics for common performance events, including instructions executed and clock cycles consumed. Options allow for selection of events other than the default measurement events.
perf record
This command records performance data into a file, perf.data, which can be later analyzed using the perf report command.
perf report
This command reads and displays the performance data from the perf.data file created by perf record.
perf list
This command lists the events available on a particular machine. These events will vary based on performance monitoring hardware and software configuration of the system.
perf top
This command performs a similar function to the top utility. It generates and displays a performance counter profile in realtime.
perf trace
This command performs a similar function to the strace tool. It monitors the system calls used by a specified thread or process and all signals received by that application.
perf help
This command displays a complete list of perf commands.

Additional resources

  • Add the --help option to a subcommand to open the man page.

Chapter 19. Profiling CPU usage in real time with perf top

You can use the perf top command to measure CPU usage of different functions in real time.

Prerequisites

  • You have the perf user space tool installed as described in Installing perf.

19.1. The purpose of perf top

The perf top command is used for real time system profiling and functions similarly to the top utility. However, where the top utility generally shows you how much CPU time a given process or thread is using, perf top shows you how much CPU time each specific function uses. In its default state, perf top tells you about functions being used across all CPUs in both the user-space and the kernel-space. To use perf top you need root access.

19.2. Profiling CPU usage with perf top

This procedure activates perf top and profiles CPU usage in real time.

Prerequisites

  • You have the perf user space tool installed as described in Installing perf.
  • You have root access

Procedure

  • Start the perf top monitoring interface:

    # perf top

    The monitoring interface looks similar to the following:

    --------------------------------------------------------------------
    PerfTop:   20806 irqs/sec  kernel:57.3%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 8 CPUs)
    ---------------------------------------------------------------------
    Overhead  Shared Object       Symbol
       2.20%  [kernel]            [k] do_syscall_64
       2.17%  [kernel]            [k] module_get_kallsym
       1.49%  [kernel]            [k] copy_user_enhanced_fast_string
       1.37%  libpthread-2.29.so  [.] pthread_mutex_lock 1.31% [unknown] [.] 0000000000000000 1.07% [kernel] [k] psi_task_change 1.04% [kernel] [k] switch_mm_irqs_off 0.94% [kernel] [k] fget
       0.74%  [kernel]            [k] entry_SYSCALL_64
       0.69%  [kernel]            [k] syscall_return_via_sysret
       0.69%  libxul.so           [.] 0x000000000113f9b0
       0.67%  [kernel]            [k] kallsyms_expand_symbol.constprop.0
       0.65%  firefox             [.] moz_xmalloc
       0.65%  libpthread-2.29.so  [.] __pthread_mutex_unlock_usercnt
       0.60%  firefox             [.] free
       0.60%  libxul.so           [.] 0x000000000241d1cd
       0.60%  [kernel]            [k] do_sys_poll
       0.58%  [kernel]            [k] menu_select
       0.56%  [kernel]            [k] _raw_spin_lock_irqsave
       0.55%  perf                [.] 0x00000000002ae0f3

    In this example, the kernel function do_syscall_64 is using the most CPU time.

Additional resources

  • perf-top(1) man page

19.3. Interpretation of perf top output

The perf top monitoring interface displays the data in several columns:

The "Overhead" column
Displays the percent of CPU a given function is using.
The "Shared Object" column
Displays name of the program or library which is using the function.
The "Symbol" column
Displays the function name or symbol. Functions executed in the kernel-space are identified by [k] and functions executed in the user-space are identified by [.].

19.4. Why perf displays some function names as raw function addresses

For kernel functions, perf uses the information from the /proc/kallsyms file to map the samples to their respective function names or symbols. For functions executed in the user space, however, you might see raw function addresses because the binary is stripped.

The debuginfo package of the executable must be installed or, if the executable is a locally developed application, the application must be compiled with debugging information turned on (the -g option in GCC) to display the function names or symbols in such a situation.

Note

It is not necessary to re-run the perf record command after installing the debuginfo associated with an executable. Simply re-run the perf report command.

19.5. Enabling debug and source repositories

A standard installation of Red Hat Enterprise Linux does not enable the debug and source repositories. These repositories contain information needed to debug the system components and measure their performance.

Procedure

  • Enable the source and debug information package channels:

    # subscription-manager repos --enable rhel-8-for-$(uname -i)-baseos-debug-rpms
    # subscription-manager repos --enable rhel-8-for-$(uname -i)-baseos-source-rpms
    # subscription-manager repos --enable rhel-8-for-$(uname -i)-appstream-debug-rpms
    # subscription-manager repos --enable rhel-8-for-$(uname -i)-appstream-source-rpms

    The $(uname -i) part is automatically replaced with a matching value for architecture of your system:

    Architecture nameValue

    64-bit Intel and AMD

    x86_64

    64-bit ARM

    aarch64

    IBM POWER

    ppc64le

    64-bit IBM Z

    s390x

19.6. Getting debuginfo packages for an application or library using GDB

Debugging information is required to debug code. For code that is installed from a package, the GNU Debugger (GDB) automatically recognizes missing debug information, resolves the package name and provides concrete advice on how to get the package.

Prerequisites

  • The application or library you want to debug must be installed on the system.
  • GDB and the debuginfo-install tool must be installed on the system. For details, see Setting up to debug applications.
  • Channels providing debuginfo and debugsource packages must be configured and enabled on the system.

Procedure

  1. Start GDB attached to the application or library you want to debug. GDB automatically recognizes missing debugging information and suggests a command to run.

    $ gdb -q /bin/ls
    Reading symbols from /bin/ls...Reading symbols from .gnu_debugdata for /usr/bin/ls...(no debugging symbols found)...done.
    (no debugging symbols found)...done.
    Missing separate debuginfos, use: dnf debuginfo-install coreutils-8.30-6.el8.x86_64
    (gdb)
  2. Exit GDB: type q and confirm with Enter.

    (gdb) q
  3. Run the command suggested by GDB to install the required debuginfo packages:

    # dnf debuginfo-install coreutils-8.30-6.el8.x86_64

    The dnf package management tool provides a summary of the changes, asks for confirmation and once you confirm, downloads and installs all the necessary files.

  4. In case GDB is not able to suggest the debuginfo package, follow the procedure described in Getting debuginfo packages for an application or library manually.

Chapter 20. Counting events during process execution with perf stat

You can use the perf stat command to count hardware and software events during process execution.

Prerequisites

  • You have the perf user space tool installed as described in Installing perf.

20.1. The purpose of perf stat

The perf stat command executes a specified command, keeps a running count of hardware and software event occurrences during the commands execution, and generates statistics of these counts. If you do not specify any events, then perf stat counts a set of common hardware and software events.

20.2. Counting events with perf stat

You can use perf stat to count hardware and software event occurrences during command execution and generate statistics of these counts. By default, perf stat operates in per-thread mode.

Prerequisites

  • You have the perf user space tool installed as described in Installing perf.

Procedure

  • Count the events.

    • Running the perf stat command without root access will only count events occurring in the user space:

      $ perf stat ls

      Example 20.1. Output of perf stat ran without root access

      Desktop  Documents  Downloads  Music  Pictures  Public  Templates  Videos
      
       Performance counter stats for 'ls':
      
                    1.28 msec task-clock:u               #    0.165 CPUs utilized
                       0      context-switches:u         #    0.000 M/sec
                       0      cpu-migrations:u           #    0.000 K/sec
                     104      page-faults:u              #    0.081 M/sec
               1,054,302      cycles:u                   #    0.823 GHz
               1,136,989      instructions:u             #    1.08  insn per cycle
                 228,531      branches:u                 #  178.447 M/sec
                  11,331      branch-misses:u            #    4.96% of all branches
      
             0.007754312 seconds time elapsed
      
             0.000000000 seconds user
             0.007717000 seconds sys

      As you can see in the previous example, when perf stat runs without root access the event names are followed by :u, indicating that these events were counted only in the user-space.

    • To count both user-space and kernel-space events, you must have root access when running perf stat:

      # perf stat ls

      Example 20.2. Output of perf stat ran with root access

      Desktop  Documents  Downloads  Music  Pictures  Public  Templates  Videos
      
       Performance counter stats for 'ls':
      
                    3.09 msec task-clock                #    0.119 CPUs utilized
                      18      context-switches          #    0.006 M/sec
                       3      cpu-migrations            #    0.969 K/sec
                     108      page-faults               #    0.035 M/sec
               6,576,004      cycles                    #    2.125 GHz
               5,694,223      instructions              #    0.87  insn per cycle
               1,092,372      branches                  #  352.960 M/sec
                  31,515      branch-misses             #    2.89% of all branches
      
             0.026020043 seconds time elapsed
      
             0.000000000 seconds user
             0.014061000 seconds sys
      • By default, perf stat operates in per-thread mode. To change to CPU-wide event counting, pass the -a option to perf stat. To count CPU-wide events, you need root access:

        # perf stat -a ls

Additional resources

  • perf-stat(1) man page

20.3. Interpretation of perf stat output

perf stat executes a specified command and counts event occurrences during the commands execution and displays statistics of these counts in three columns:

  1. The number of occurrences counted for a given event
  2. The name of the event that was counted
  3. When related metrics are available, a ratio or percentage is displayed after the hash sign (#) in the right-most column.

    For example, when running in default mode, perf stat counts both cycles and instructions and, therefore, calculates and displays instructions per cycle in the right-most column. You can see similar behavior with regard to branch-misses as a percent of all branches since both events are counted by default.

20.4. Attaching perf stat to a running process

You can attach perf stat to a running process. This will instruct perf stat to count event occurrences only in the specified processes during the execution of a command.

Prerequisites

  • You have the perf user space tool installed as described in Installing perf.

Procedure

  • Attach perf stat to a running process:

    $ perf stat -p ID1,ID2 sleep seconds

    The previous example counts events in the processes with the IDs of ID1 and ID2 for a time period of seconds seconds as dictated by using the sleep command.

Additional resources

  • perf-stat(1) man page

Chapter 21. Recording and analyzing performance profiles with perf

The perf tool allows you to record performance data and analyze it at a later time.

Prerequisites

  • You have the perf user space tool installed as described in Installing perf.

21.1. The purpose of perf record

The perf record command samples performance data and stores it in a file, perf.data, which can be read and visualized with other perf commands. perf.data is generated in the current directory and can be accessed at a later time, possibly on a different machine.

If you do not specify a command for perf record to record during, it will record until you manually stop the process by pressing Ctrl+C. You can attach perf record to specific processes by passing the -p option followed by one or more process IDs. You can run perf record without root access, however, doing so will only sample performance data in the user space. In the default mode, perf record uses CPU cycles as the sampling event and operates in per-thread mode with inherit mode enabled.

21.2. Recording a performance profile without root access

You can use perf record without root access to sample and record performance data in the user-space only.

Prerequisites

  • You have the perf user space tool installed as described in Installing perf.

Procedure

  • Sample and record the performance data:

    $ perf record command

    Replace command with the command you want to sample data during. If you do not specify a command, then perf record will sample data until you manually stop it by pressing Ctrl+C.

Additonal resources

  • perf-record(1) man page

21.3. Recording a performance profile with root access

You can use perf record with root access to sample and record performance data in both the user-space and the kernel-space simultaneously.

Prerequisites

  • You have the perf user space tool installed as described in Installing perf.
  • You have root access.

Procedure

  • Sample and record the performance data:

    # perf record command

    Replace command with the command you want to sample data during. If you do not specify a command, then perf record will sample data until you manually stop it by pressing Ctrl+C.

Additonal resources

  • perf-record(1) man page

21.4. Recording a performance profile in per-CPU mode

You can use perf record in per-CPU mode to sample and record performance data in both and user-space and the kernel-space simultaneously across all threads on a monitored CPU. By default, per-CPU mode monitors all online CPUs.

Prerequisites

  • You have the perf user space tool installed as described in Installing perf.

Procedure

  • Sample and record the performance data:

    # perf record -a command

    Replace command with the command you want to sample data during. If you do not specify a command, then perf record will sample data until you manually stop it by pressing Ctrl+C.

Additional resources

  • perf-record(1) man page

21.5. Capturing call graph data with perf record

You can configure the perf record tool so that it records which function is calling other functions in the performance profile. This helps to identify a bottleneck if several processes are calling the same function.

Prerequisites

  • The perf user space tool is installed. For more information, see Installing perf.

Procedure

  • Sample and record performance data with the --call-graph option:

    $ perf record --call-graph method command
    • Replace command with the command you want to sample data during. If you do not specify a command, then perf record will sample data until you manually stop it by pressing Ctrl+C.
    • Replace method with one of the following unwinding methods:

      fp
      Uses the frame pointer method. Depending on compiler optimization, such as with binaries built with the GCC option --fomit-frame-pointer, this may not be able to unwind the stack.
      dwarf
      Uses DWARF Call Frame Information to unwind the stack.
      lbr
      Uses the last branch record hardware on Intel processors.

Additional resources

  • perf-record(1) man page

21.6. Analyzing perf.data with perf report

You can use perf report to display and analyze a perf.data file.

Prerequisites

  • The perf user space tool is installed. For more information, see Installing perf.
  • There is a perf.data file in the current directory.
  • If the perf.data file was created with root access, you need to run perf report with root access too.

Procedure

  • Display the contents of the perf.data file for further analysis:

    # perf report

    This command displays output similar to the following:

    Samples: 2K of event 'cycles', Event count (approx.): 235462960
    Overhead  Command          Shared Object                     Symbol
       2.36%  kswapd0          [kernel.kallsyms]                 [k] page_vma_mapped_walk
       2.13%  sssd_kcm         libc-2.28.so                      [.] memset_avx2_erms 2.13% perf [kernel.kallsyms] [k] smp_call_function_single 1.53% gnome-shell libc-2.28.so [.] strcmp_avx2
       1.17%  gnome-shell      libglib-2.0.so.0.5600.4           [.] g_hash_table_lookup
       0.93%  Xorg             libc-2.28.so                      [.] memmove_avx_unaligned_erms 0.89% gnome-shell libgobject-2.0.so.0.5600.4 [.] g_object_unref 0.87% kswapd0 [kernel.kallsyms] [k] page_referenced_one 0.86% gnome-shell libc-2.28.so [.] memmove_avx_unaligned_erms
       0.83%  Xorg             [kernel.kallsyms]                 [k] alloc_vmap_area
       0.63%  gnome-shell      libglib-2.0.so.0.5600.4           [.] g_slice_alloc
       0.53%  gnome-shell      libgirepository-1.0.so.1.0.0      [.] g_base_info_unref
       0.53%  gnome-shell      ld-2.28.so                        [.] _dl_find_dso_for_object
       0.49%  kswapd0          [kernel.kallsyms]                 [k] vma_interval_tree_iter_next
       0.48%  gnome-shell      libpthread-2.28.so                [.] pthread_getspecific 0.47% gnome-shell libgirepository-1.0.so.1.0.0 [.] 0x0000000000013b1d 0.45% gnome-shell libglib-2.0.so.0.5600.4 [.] g_slice_free1 0.45% gnome-shell libgobject-2.0.so.0.5600.4 [.] g_type_check_instance_is_fundamentally_a 0.44% gnome-shell libc-2.28.so [.] malloc 0.41% swapper [kernel.kallsyms] [k] apic_timer_interrupt 0.40% gnome-shell ld-2.28.so [.] _dl_lookup_symbol_x 0.39% kswapd0 [kernel.kallsyms] [k] raw_callee_save___pv_queued_spin_unlock

Additonal resources

  • perf-report(1) man page

21.7. Interpretation of perf report output

The table displayed by running the perf report command sorts the data into several columns:

The 'Overhead' column
Indicates what percentage of overall samples were collected in that particular function.
The 'Command' column
Tells you which process the samples were collected from.
The 'Shared Object' column
Displays the name of the ELF image where the samples come from (the name [kernel.kallsyms] is used when the samples come from the kernel).
The 'Symbol' column
Displays the function name or symbol.

In default mode, the functions are sorted in descending order with those with the highest overhead displayed first.

21.8. Generating a perf.data file that is readable on a different device

You can use the perf tool to record performance data into a perf.data file to be analyzed on a different device.

Prerequisites

Procedure

  1. Capture performance data you are interested in investigating further:

    # perf record -a --call-graph fp sleep seconds

    This example would generate a perf.data over the entire system for a period of seconds seconds as dictated by the use of the sleep command. It would also capture call graph data using the frame pointer method.

  2. Generate an archive file containing debug symbols of the recorded data:

    # perf archive

Verification steps

  • Verify that the archive file has been generated in your current active directory:

    # ls perf.data*

    The output will display every file in your current directory that begins with perf.data. The archive file will be named either:

    perf.data.tar.gz

    or

    perf data.tar.bz2

21.9. Analyzing a perf.data file that was created on a different device

You can use the perf tool to analyze a perf.data file that was generated on a different device.

Prerequisites

  • You have the perf user space tool installed as described in Installing perf.
  • A perf.data file and associated archive file generated on a different device are present on the current device being used.

Procedure

  1. Copy both the perf.data file and the archive file into your current active directory.
  2. Extract the archive file into ~/.debug:

    # mkdir -p ~/.debug
    # tar xf perf.data.tar.bz2 -C ~/.debug
    Note

    The archive file might also be named perf.data.tar.gz.

  3. Open the perf.data file for further analysis:

    # perf report

21.10. Why perf displays some function names as raw function addresses

For kernel functions, perf uses the information from the /proc/kallsyms file to map the samples to their respective function names or symbols. For functions executed in the user space, however, you might see raw function addresses because the binary is stripped.

The debuginfo package of the executable must be installed or, if the executable is a locally developed application, the application must be compiled with debugging information turned on (the -g option in GCC) to display the function names or symbols in such a situation.

Note

It is not necessary to re-run the perf record command after installing the debuginfo associated with an executable. Simply re-run the perf report command.

21.11. Enabling debug and source repositories

A standard installation of Red Hat Enterprise Linux does not enable the debug and source repositories. These repositories contain information needed to debug the system components and measure their performance.

Procedure

  • Enable the source and debug information package channels:

    # subscription-manager repos --enable rhel-8-for-$(uname -i)-baseos-debug-rpms
    # subscription-manager repos --enable rhel-8-for-$(uname -i)-baseos-source-rpms
    # subscription-manager repos --enable rhel-8-for-$(uname -i)-appstream-debug-rpms
    # subscription-manager repos --enable rhel-8-for-$(uname -i)-appstream-source-rpms

    The $(uname -i) part is automatically replaced with a matching value for architecture of your system:

    Architecture nameValue

    64-bit Intel and AMD

    x86_64

    64-bit ARM

    aarch64

    IBM POWER

    ppc64le

    64-bit IBM Z

    s390x

21.12. Getting debuginfo packages for an application or library using GDB

Debugging information is required to debug code. For code that is installed from a package, the GNU Debugger (GDB) automatically recognizes missing debug information, resolves the package name and provides concrete advice on how to get the package.

Prerequisites

  • The application or library you want to debug must be installed on the system.
  • GDB and the debuginfo-install tool must be installed on the system. For details, see Setting up to debug applications.
  • Channels providing debuginfo and debugsource packages must be configured and enabled on the system.

Procedure

  1. Start GDB attached to the application or library you want to debug. GDB automatically recognizes missing debugging information and suggests a command to run.

    $ gdb -q /bin/ls
    Reading symbols from /bin/ls...Reading symbols from .gnu_debugdata for /usr/bin/ls...(no debugging symbols found)...done.
    (no debugging symbols found)...done.
    Missing separate debuginfos, use: dnf debuginfo-install coreutils-8.30-6.el8.x86_64
    (gdb)
  2. Exit GDB: type q and confirm with Enter.

    (gdb) q
  3. Run the command suggested by GDB to install the required debuginfo packages:

    # dnf debuginfo-install coreutils-8.30-6.el8.x86_64

    The dnf package management tool provides a summary of the changes, asks for confirmation and once you confirm, downloads and installs all the necessary files.

  4. In case GDB is not able to suggest the debuginfo package, follow the procedure described in Getting debuginfo packages for an application or library manually.

Chapter 22. Investigating busy CPUs with perf

When investigating performance issues on a system, you can use the perf tool to identify and monitor the busiest CPUs in order to focus your efforts.

22.1. Displaying which CPU events were counted on with perf stat

You can use perf stat to display which CPU events were counted on by disabling CPU count aggregation. You must count events in system-wide mode by using the -a flag in order to use this functionality.

Prerequisites

  • You have the perf user space tool installed as described in Installing perf.

Procedure

  • Count the events with CPU count aggregation disabled:

    # perf stat -a -A sleep seconds

    The previous example displays counts of a default set of common hardware and software events recorded over a time period of seconds seconds, as dictated by using the sleep command, over each individual CPU in ascending order, starting with CPU0. As such, it may be useful to specify an event such as cycles:

    # perf stat -a -A -e cycles sleep seconds

22.2. Displaying which CPU samples were taken on with perf report

The perf record command samples performance data and stores this data in a perf.data file which can be read with the perf report command. The perf record command always records which CPU samples were taken on. You can configure perf report to display this information.

Prerequisites

  • You have the perf user space tool installed as described in Installing perf.
  • There is a perf.data file created with perf record in the current directory. If the perf.data file was created with root access, you need to run perf report with root access too.

Procedure

  • Display the contents of the perf.data file for further analysis while sorting by CPU:

    # perf report --sort cpu
    • You can sort by CPU and command to display more detailed information about where CPU time is being spent:

      # perf report --sort cpu,comm

      This example will list commands from all monitored CPUs by total overhead in descending order of overhead usage and identify the CPU the command was executed on.

22.3. Displaying specific CPUs during profiling with perf top

You can configure perf top to display specific CPUs and their relative usage while profiling your system in real time.

Prerequisites

  • You have the perf user space tool installed as described in Installing perf.

Procedure

  • Start the perf top interface while sorting by CPU:

    # perf top --sort cpu

    This example will list CPUs and their respective overhead in descending order of overhead usage in real time.

    • You can sort by CPU and command for more detailed information of where CPU time is being spent:

      # perf top --sort cpu,comm

      This example will list commands by total overhead in descending order of overhead usage and identify the CPU the command was executed on in real time.

22.4. Monitoring specific CPUs with perf record and perf report

You can configure perf record to only sample specific CPUs of interest and analyze the generated perf.data file with perf report for further analysis.

Prerequisites

  • You have the perf user space tool installed as described in Installing perf.

Procedure

  1. Sample and record the performance data in the specific CPU’s, generating a perf.data file:

    • Using a comma separated list of CPUs:

      # perf record -C 0,1 sleep seconds

      The previous example samples and records data in CPUs 0 and 1 for a period of seconds seconds as dictated by the use of the sleep command.

    • Using a range of CPUs:

      # perf record -C 0-2 sleep seconds

      The previous example samples and records data in all CPUs from CPU 0 to 2 for a period of seconds seconds as dictated by the use of the sleep command.

  2. Display the contents of the perf.data file for further analysis:

    # perf report

    This example will display the contents of perf.data. If you are monitoring several CPUs and want to know which CPU data was sampled on, see Displaying which CPU samples were taken on with perf report.

Chapter 23. Monitoring application performance with perf

This section describes how to use the perf tool to monitor application performance.

23.1. Attaching perf record to a running process

You can attach perf record to a running process. This will instruct perf record to only sample and record performance data in the specified processes.

Prerequisites

  • The perf user space tool is installed. For more information, see Installing perf.

Procedure

  • Attach perf record to a running process:

    $ perf record -p ID1,ID2 sleep seconds

    The previous example samples and records performance data of the processes with the process ID’s ID1 and ID2 for a time period of seconds seconds as dictated by using the sleep command. You can also configure perf to record events in specific threads:

    $ perf record -t ID1,ID2 sleep seconds
    Note

    When using the -t flag and stipulating thread ID’s, perf disables inheritance by default. You can enable inheritance by adding the --inherit option.

23.2. Capturing call graph data with perf record

You can configure the perf record tool so that it records which function is calling other functions in the performance profile. This helps to identify a bottleneck if several processes are calling the same function.

Prerequisites

  • The perf user space tool is installed. For more information, see Installing perf.

Procedure

  • Sample and record performance data with the --call-graph option:

    $ perf record --call-graph method command
    • Replace command with the command you want to sample data during. If you do not specify a command, then perf record will sample data until you manually stop it by pressing Ctrl+C.
    • Replace method with one of the following unwinding methods:

      fp
      Uses the frame pointer method. Depending on compiler optimization, such as with binaries built with the GCC option --fomit-frame-pointer, this may not be able to unwind the stack.
      dwarf
      Uses DWARF Call Frame Information to unwind the stack.
      lbr
      Uses the last branch record hardware on Intel processors.

Additional resources

  • perf-record(1) man page

23.3. Analyzing perf.data with perf report

You can use perf report to display and analyze a perf.data file.

Prerequisites

  • The perf user space tool is installed. For more information, see Installing perf.
  • There is a perf.data file in the current directory.
  • If the perf.data file was created with root access, you need to run perf report with root access too.

Procedure

  • Display the contents of the perf.data file for further analysis:

    # perf report

    This command displays output similar to the following:

    Samples: 2K of event 'cycles', Event count (approx.): 235462960
    Overhead  Command          Shared Object                     Symbol
       2.36%  kswapd0          [kernel.kallsyms]                 [k] page_vma_mapped_walk
       2.13%  sssd_kcm         libc-2.28.so                      [.] memset_avx2_erms 2.13% perf [kernel.kallsyms] [k] smp_call_function_single 1.53% gnome-shell libc-2.28.so [.] strcmp_avx2
       1.17%  gnome-shell      libglib-2.0.so.0.5600.4           [.] g_hash_table_lookup
       0.93%  Xorg             libc-2.28.so                      [.] memmove_avx_unaligned_erms 0.89% gnome-shell libgobject-2.0.so.0.5600.4 [.] g_object_unref 0.87% kswapd0 [kernel.kallsyms] [k] page_referenced_one 0.86% gnome-shell libc-2.28.so [.] memmove_avx_unaligned_erms
       0.83%  Xorg             [kernel.kallsyms]                 [k] alloc_vmap_area
       0.63%  gnome-shell      libglib-2.0.so.0.5600.4           [.] g_slice_alloc
       0.53%  gnome-shell      libgirepository-1.0.so.1.0.0      [.] g_base_info_unref
       0.53%  gnome-shell      ld-2.28.so                        [.] _dl_find_dso_for_object
       0.49%  kswapd0          [kernel.kallsyms]                 [k] vma_interval_tree_iter_next
       0.48%  gnome-shell      libpthread-2.28.so                [.] pthread_getspecific 0.47% gnome-shell libgirepository-1.0.so.1.0.0 [.] 0x0000000000013b1d 0.45% gnome-shell libglib-2.0.so.0.5600.4 [.] g_slice_free1 0.45% gnome-shell libgobject-2.0.so.0.5600.4 [.] g_type_check_instance_is_fundamentally_a 0.44% gnome-shell libc-2.28.so [.] malloc 0.41% swapper [kernel.kallsyms] [k] apic_timer_interrupt 0.40% gnome-shell ld-2.28.so [.] _dl_lookup_symbol_x 0.39% kswapd0 [kernel.kallsyms] [k] raw_callee_save___pv_queued_spin_unlock

Additonal resources

  • perf-report(1) man page

Chapter 24. Profiling memory accesses with perf mem

You can use the perf mem command to sample memory accesses on your system.

24.1. The purpose of perf mem

The mem subcommand of the perf tool enables the sampling of memory accesses (loads and stores). The perf mem command provides information about memory latency, types of memory accesses, functions causing cache hits and misses, and, by recording the data symbol, the memory locations where these hits and misses occur.

24.2. Sampling memory access with perf mem

This procedure describes how to use the perf mem command to sample memory accesses on your system. The command takes the same options as perf record and perf report as well as some options exclusive to the mem subcommand. The recorded data is stored in a perf.data file in the current directory for later analysis.

Prerequisites

  • The perf user space tool is installed. For more information, see Installing perf.

Procedure

  1. Sample the memory accesses:

    # perf mem record -a sleep seconds

    This example samples memory accesses across all CPUs for a period of seconds seconds as dictated by the sleep command. You can replace the sleep command for any command during which you want to sample memory access data. By default, perf mem samples both memory loads and stores. You can select only one memory operation by using the -t option and specifying either "load" or "store" between perf mem and record. For loads, information over the memory hierarchy level, TLB memory accesses, bus snoops, and memory locks is captured.

  2. Open the perf.data file for analysis:

    # perf mem report

    If you have used the example commands, the output is:

    Available samples
    35k cpu/mem-loads,ldlat=30/P
    54k cpu/mem-stores/P

    The cpu/mem-loads,ldlat=30/P line denotes data collected over memory loads and the cpu/mem-stores/P line denotes data collected over memory stores. Highlight the category of interest and press Enter to view the data:

    Samples: 35K of event 'cpu/mem-loads,ldlat=30/P', Event count (approx.): 4067062
    Overhead       Samples  Local Weight  Memory access             Symbol                                                                 Shared Object                 Data Symbol                                                     Data Object                            Snoop         TLB access              Locked
       0.07%            29  98            L1 or L1 hit              [.] 0x000000000000a255                                                 libspeexdsp.so.1.5.0          [.] 0x00007f697a3cd0f0                                          anon                                   None          L1 or L2 hit            No
       0.06%            26  97            L1 or L1 hit              [.] 0x000000000000a255                                                 libspeexdsp.so.1.5.0          [.] 0x00007f697a3cd0f0                                          anon                                   None          L1 or L2 hit            No
       0.06%            25  96            L1 or L1 hit              [.] 0x000000000000a255                                                 libspeexdsp.so.1.5.0          [.] 0x00007f697a3cd0f0                                          anon                                   None          L1 or L2 hit            No
       0.06%             1  2325          Uncached or N/A hit       [k] pci_azx_readl                                                      [kernel.kallsyms]             [k] 0xffffb092c06e9084                                          [kernel.kallsyms]                      None          L1 or L2 hit            No
       0.06%             1  2247          Uncached or N/A hit       [k] pci_azx_readl                                                      [kernel.kallsyms]             [k] 0xffffb092c06e8164                                          [kernel.kallsyms]                      None          L1 or L2 hit            No
       0.05%             1  2166          L1 or L1 hit              [.] 0x00000000038140d6                                                 libxul.so                     [.] 0x00007ffd7b84b4a8                                          [stack]                                None          L1 or L2 hit            No
       0.05%             1  2117          Uncached or N/A hit       [k] check_for_unclaimed_mmio                                           [kernel.kallsyms]             [k] 0xffffb092c1842300                                          [kernel.kallsyms]                      None          L1 or L2 hit            No
       0.05%            22  95            L1 or L1 hit              [.] 0x000000000000a255                                                 libspeexdsp.so.1.5.0          [.] 0x00007f697a3cd0f0                                          anon                                   None          L1 or L2 hit            No
       0.05%             1  1898          L1 or L1 hit              [.] 0x0000000002a30e07                                                 libxul.so                     [.] 0x00007f610422e0e0                                          anon                                   None          L1 or L2 hit            No
       0.05%             1  1878          Uncached or N/A hit       [k] pci_azx_readl                                                      [kernel.kallsyms]             [k] 0xffffb092c06e8164                                          [kernel.kallsyms]                      None          L2 miss                 No
       0.04%            18  94            L1 or L1 hit              [.] 0x000000000000a255                                                 libspeexdsp.so.1.5.0          [.] 0x00007f697a3cd0f0                                          anon                                   None          L1 or L2 hit            No
       0.04%             1  1593          Local RAM or RAM hit      [.] 0x00000000026f907d                                                 libxul.so                     [.] 0x00007f3336d50a80                                          anon                                   Hit           L2 miss                 No
       0.03%             1  1399          L1 or L1 hit              [.] 0x00000000037cb5f1                                                 libxul.so                     [.] 0x00007fbe81ef5d78                                          libxul.so                              None          L1 or L2 hit            No
       0.03%             1  1229          LFB or LFB hit            [.] 0x0000000002962aad                                                 libxul.so                     [.] 0x00007fb6f1be2b28                                          anon                                   None          L2 miss                 No
       0.03%             1  1202          LFB or LFB hit            [.] __pthread_mutex_lock                                               libpthread-2.29.so            [.] 0x00007fb75583ef20                                          anon                                   None          L1 or L2 hit            No
       0.03%             1  1193          Uncached or N/A hit       [k] pci_azx_readl                                                      [kernel.kallsyms]             [k] 0xffffb092c06e9164                                          [kernel.kallsyms]                      None          L2 miss                 No
       0.03%             1  1191          L1 or L1 hit              [k] azx_get_delay_from_lpib                                            [kernel.kallsyms]             [k] 0xffffb092ca7efcf0                                          [kernel.kallsyms]                      None          L1 or L2 hit            No

    Alternatively, you can sort your results to investigate different aspects of interest when displaying the data. For example, to sort data over memory loads by type of memory accesses occurring during the sampling period in descending order of overhead they account for:

    # perf mem -t load report --sort=mem

    For example, the output can be:

    Samples: 35K of event 'cpu/mem-loads,ldlat=30/P', Event count (approx.): 40670
    Overhead       Samples  Memory access
      31.53%          9725  LFB or LFB hit
      29.70%         12201  L1 or L1 hit
      23.03%          9725  L3 or L3 hit
      12.91%          2316  Local RAM or RAM hit
       2.37%           743  L2 or L2 hit
       0.34%             9  Uncached or N/A hit
       0.10%            69  I/O or N/A hit
       0.02%           825  L3 miss

Additional resources

  • perf-mem(1) man page.

24.3. Interpretation of perf mem report output

The table displayed by running the perf mem report command without any modifiers sorts the data into several columns:

The 'Overhead' column
Indicates percentage of overall samples collected in that particular function.
The 'Samples' column
Displays the number of samples accounted for by that row.
The 'Local Weight' column
Displays the access latency in processor core cycles.
The 'Memory Access' column
Displays the type of memory access that occurred.
The 'Symbol' column
Displays the function name or symbol.
The 'Shared Object' column
Displays the name of the ELF image where the samples come from (the name [kernel.kallsyms] is used when the samples come from the kernel).
The 'Data Symbol' column
Displays the address of the memory location that row was targeting.
Important

Oftentimes, due to dynamic allocation of memory or stack memory being accessed, the 'Data Symbol' column will display a raw address.

The "Snoop" column
Displays bus transactions.
The 'TLB Access' column
Displays TLB memory accesses.
The 'Locked' column
Indicates if a function was or was not memory locked.

In default mode, the functions are sorted in descending order with those with the highest overhead displayed first.

Chapter 25. Detecting false sharing

False sharing occurs when a processor core on a Symmetric Multi Processing (SMP) system modifies data items on the same cache line that is in use by other processors to access other data items that are not being shared between the processors.

This initial modification requires that the other processors using the cache line invalidate their copy and request an updated one despite the processors not needing, or even necessarily having access to, an updated version of the modified data item.

You can use the perf c2c command to detect false sharing.

25.1. The purpose of perf c2c

The c2c subcommand of the perf tool enables Shared Data Cache-to-Cache (C2C) analysis. You can use the perf c2c command to inspect cache-line contention to detect both true and false sharing.

Cache-line contention occurs when a processor core on a Symmetric Multi Processing (SMP) system modifies data items on the same cache line that is in use by other processors. All other processors using this cache-line must then invalidate their copy and request an updated one. This can lead to degraded performance.

The perf c2c command provides the following information:

  • Cache lines where contention has been detected
  • Processes reading and writing the data
  • Instructions causing the contention
  • The Non-Uniform Memory Access (NUMA) nodes involved in the contention

25.2. Detecting cache-line contention with perf c2c

Use the perf c2c command to detect cache-line contention in a system.

The perf c2c command supports the same options as perf record as well as some options exclusive to the c2c subcommand. The recorded data is stored in a perf.data file in the current directory for later analysis.

Prerequisites

  • The perf user space tool is installed. For more information, see installing perf.

Procedure

  • Use perf c2c to detect cache-line contention:

    # perf c2c record -a sleep seconds

    This example samples and records cache-line contention data across all CPU’s for a period of seconds as dictated by the sleep command. You can replace the sleep command with any command you want to collect cache-line contention data over.

Additional resources

  • perf-c2c(1) man page

25.3. Visualizing a perf.data file recorded with perf c2c record

This procedure describes how to visualize the perf.data file, which is recorded using the perf c2c command.

Prerequisites

Procedure

  1. Open the perf.data file for further analysis:

    # perf c2c report --stdio

    This command visualizes the perf.data file into several graphs within the terminal:

    =================================================
               Trace Event Information
    =================================================
     Total records                     :     329219
     Locked Load/Store Operations      :      14654
     Load Operations                   :      69679
     Loads - uncacheable               :          0
     Loads - IO                        :          0
     Loads - Miss                      :       3972
     Loads - no mapping                :          0
     Load Fill Buffer Hit              :      11958
     Load L1D hit                      :      17235
     Load L2D hit                      :         21
     Load LLC hit                      :      14219
     Load Local HITM                   :       3402
     Load Remote HITM                  :      12757
     Load Remote HIT                   :       5295
     Load Local DRAM                   :        976
     Load Remote DRAM                  :       3246
     Load MESI State Exclusive         :       4222
     Load MESI State Shared            :          0
     Load LLC Misses                   :      22274
     LLC Misses to Local DRAM          :        4.4%
     LLC Misses to Remote DRAM         :       14.6%
     LLC Misses to Remote cache (HIT)  :       23.8%
     LLC Misses to Remote cache (HITM) :       57.3%
     Store Operations                  :     259539
     Store - uncacheable               :          0
     Store - no mapping                :         11
     Store L1D Hit                     :     256696
     Store L1D Miss                    :       2832
     No Page Map Rejects               :       2376
     Unable to parse data source       :          1
    
    =================================================
       Global Shared Cache Line Event Information
    =================================================
     Total Shared Cache Lines          :         55
     Load HITs on shared lines         :      55454
     Fill Buffer Hits on shared lines  :      10635
     L1D hits on shared lines          :      16415
     L2D hits on shared lines          :          0
     LLC hits on shared lines          :       8501
     Locked Access on shared lines     :      14351
     Store HITs on shared lines        :     109953
     Store L1D hits on shared lines    :     109449
     Total Merged records              :     126112
    
    =================================================
                     c2c details
    =================================================
     Events                            : cpu/mem-loads,ldlat=30/P
    	                                    : cpu/mem-stores/P
     Cachelines sort on                : Remote HITMs
     Cacheline data groupping          : offset,pid,iaddr
    
    =================================================
    	   Shared Data Cache Line Table
    =================================================
    #
    #                              Total      Rmt  ----- LLC Load Hitm -----  ---- Store Reference ----  --- Load Dram ----      LLC    Total  ----- Core Load Hit -----  -- LLC Load Hit --
    # Index           Cacheline  records     Hitm    Total      Lcl      Rmt    Total    L1Hit   L1Miss       Lcl       Rmt  Ld Miss    Loads       FB       L1       L2       Llc       Rmt
    # .....  ..................  .......  .......  .......  .......  .......  .......  .......  .......  ........  ........  .......  .......  .......  .......  .......  ........  ........
    #
          0            0x602180   149904   77.09%    12103     2269     9834   109504   109036      468       727      2657    13747    40400     5355    16154        0      2875       529
          1            0x602100    12128   22.20%     3951     1119     2832        0        0        0        65       200     3749    12128     5096      108        0      2056       652
          2  0xffff883ffb6a7e80      260    0.09%       15        3       12      161      161        0         1         1       15       99       25       50        0         6         1
          3  0xffffffff81aec000      157    0.07%        9        0        9        1        0        1         0         7       20      156       50       59        0        27         4
          4  0xffffffff81e3f540      179    0.06%        9        1        8      117       97       20         0        10       25       62       11        1        0        24         7
    
    =================================================
          Shared Cache Line Distribution Pareto
    =================================================
    #
    #        ----- HITM -----  -- Store Refs --        Data address                               ---------- cycles ----------       cpu                                     Shared
    #   Num      Rmt      Lcl   L1 Hit  L1 Miss              Offset      Pid        Code address  rmt hitm  lcl hitm      load       cnt               Symbol                Object                  Source:Line  Node{cpu list}
    # .....  .......  .......  .......  .......  ..................  .......  ..................  ........  ........  ........  ........  ...................  ....................  ...........................  ....
    #
      -------------------------------------------------------------
          0     9834     2269   109036      468            0x602180
      -------------------------------------------------------------
              65.51%   55.88%   75.20%    0.00%                 0x0    14604            0x400b4f     27161     26039     26017         9  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:144   0{0-1,4}  1{24-25,120}  2{48,54}  3{169}
    	   0.41%    0.35%    0.00%    0.00%                 0x0    14604            0x400b56     18088     12601     26671         9  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:145   0{0-1,4}  1{24-25,120}  2{48,54}  3{169}
    	   0.00%    0.00%   24.80%  100.00%                 0x0    14604            0x400b61         0         0         0         9  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:145   0{0-1,4}  1{24-25,120}  2{48,54}  3{169}
    	   7.50%    9.92%    0.00%    0.00%                0x20    14604            0x400ba7      2470      1729      1897         2  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:154   1{122}  2{144}
    	  17.61%   20.89%    0.00%    0.00%                0x28    14604            0x400bc1      2294      1575      1649         2  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:158   2{53}  3{170}
    	   8.97%   12.96%    0.00%    0.00%                0x30    14604            0x400bdb      2325      1897      1828         2  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:162   0{96}  3{171}
    
      -------------------------------------------------------------
          1     2832     1119        0        0            0x602100
      -------------------------------------------------------------
    	  29.13%   36.19%    0.00%    0.00%                0x20    14604            0x400bb3      1964      1230      1788         2  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:155   1{122}  2{144}
    	  43.68%   34.41%    0.00%    0.00%                0x28    14604            0x400bcd      2274      1566      1793         2  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:159   2{53}  3{170}
    	  27.19%   29.40%    0.00%    0.00%                0x30    14604            0x400be7      2045      1247      2011         2  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:163   0{96}  3{171}

25.4. Interpretation of perf c2c report output

This section describes how to interpret the output of the perf c2c report command.

The visualization displayed by running the perf c2c report --stdio command sorts the data into several tables:

Trace Events Information
This table provides a high level summary of all the load and store samples, which are collected by the perf c2c record command.
Global Shared Cache Line Event Information
This table provides statistics over the shared cache lines.
c2c Details
This table provides information about what events were sampled and how the perf c2c report data is organized within the visualization.
Shared Data Cache Line Table
This table provides a one line summary for the hottest cache lines where false sharing is detected and is sorted in descending order by the amount of remote Hitm detected per cache line by default.
Shared Cache Line Distribution Pareto

This tables provides a variety of information about each cache line experiencing contention:

  • The cache lines are numbered in the NUM column, starting at 0.
  • The virtual address of each cache line is contained in the Data address Offset column and followed subsequently by the offset into the cache line where different accesses occurred.
  • The Pid column contains the process ID.
  • The Code Address column contains the instruction pointer code address.
  • The columns under the cycles label show average load latencies.
  • The cpu cnt column displays how many different CPUs samples came from (essentially, how many different CPUs were waiting for the data indexed at that given location).
  • The Symbol column displays the function name or symbol.
  • The Shared Object column displays the name of the ELF image where the samples come from (the name [kernel.kallsyms] is used when the samples come from the kernel).
  • The Source:Line column displays the source file and line number.
  • The Node{cpu list} column displays which specific CPUs samples came from for each node.

25.5. Detecting false sharing with perf c2c

This procedure describes how to detect false sharing using the perf c2c command.

Prerequisites

Procedure

  1. Open the perf.data file for further analysis:

    # perf c2c report --stdio

    This opens the perf.data file in the terminal.

  2. In the "Trace Event Information" table, locate the row containing the values for LLC Misses to Remote Cache (HITM):

    The percentage in the value column of the LLC Misses to Remote Cache (HITM) row represents the percentage of LLC misses that were occurring across NUMA nodes in modified cache-lines and is a key indicator false sharing has occurred.

    =================================================
                Trace Event Information
    =================================================
      Total records                     :     329219
      Locked Load/Store Operations      :      14654
      Load Operations                   :      69679
      Loads - uncacheable               :          0
      Loads - IO                        :          0
      Loads - Miss                      :       3972
      Loads - no mapping                :          0
      Load Fill Buffer Hit              :      11958
      Load L1D hit                      :      17235
      Load L2D hit                      :         21
      Load LLC hit                      :      14219
      Load Local HITM                   :       3402
      Load Remote HITM                  :      12757
      Load Remote HIT                   :       5295
      Load Local DRAM                   :        976
      Load Remote DRAM                  :       3246
      Load MESI State Exclusive         :       4222
      Load MESI State Shared            :          0
      Load LLC Misses                   :      22274
      LLC Misses to Local DRAM          :        4.4%
      LLC Misses to Remote DRAM         :       14.6%
      LLC Misses to Remote cache (HIT)  :       23.8%
      LLC Misses to Remote cache (HITM) : 57.3%
      Store Operations                  :     259539
      Store - uncacheable               :          0
      Store - no mapping                :         11
      Store L1D Hit                     :     256696
      Store L1D Miss                    :       2832
      No Page Map Rejects               :       2376
      Unable to parse data source       :          1
  3. Inspect the Rmt column of the LLC Load Hitm field of the the Shared Data Cache Line Table:

      =================================================
                 Shared Data Cache Line Table
      =================================================
      #
      #                              Total      Rmt  ----- LLC Load Hitm -----  ---- Store Reference ----  --- Load Dram ----      LLC    Total  ----- Core Load Hit -----  -- LLC Load Hit --
      # Index           Cacheline  records     Hitm    Total      Lcl      Rmt    Total    L1Hit   L1Miss       Lcl       Rmt  Ld Miss    Loads       FB       L1       L2       Llc       Rmt
      # .....  ..................  .......  .......  .......  .......  .......  .......  .......  .......  ........  ........  .......  .......  .......  .......  .......  ........  ........
      #
            0            0x602180   149904   77.09%    12103     2269     9834   109504   109036      468       727      2657    13747    40400     5355    16154        0      2875       529
            1            0x602100    12128   22.20%     3951     1119     2832        0        0        0        65       200     3749    12128     5096      108        0      2056       652
            2  0xffff883ffb6a7e80      260    0.09%       15        3       12      161      161        0         1         1       15       99       25       50        0         6         1
            3  0xffffffff81aec000      157    0.07%        9        0        9        1        0        1         0         7       20      156       50       59        0        27         4
            4  0xffffffff81e3f540      179    0.06%        9        1        8      117       97       20         0        10       25       62       11        1        0        24         7

    This table is sorted in descending order by the amount of remote Hitm detected per cache line. A high number in the Rmt column of the LLC Load Hitm section indicates false sharing and requires further inspection of the cache line on which it occurred to debug the false sharing activity.

Chapter 26. Getting started with flamegraphs

As a system administrator, you can use flamegraphs to to create visualizations of system performance data recorded with the perf tool. As a software developer, you can use flamegraphs to create visualizations of application performance data recorded with the perf tool.

Sampling stack traces is a common technique for profiling CPU performance with the perf tool. Unfortunately, the results of profiling stack traces with perf can be extremely verbose and labor-intensive to analyze. flamegraphs are visualizations created from data recorded with perf to make identifying hot code-paths faster and easier.

26.1. Installing flamegraphs

To begin using flamegraphs, install the required package.

Procedure

  • Install the flamegraphs package:

    # yum install js-d3-flame-graph

26.2. Creating flamegraphs over the entire system

This procedure describes how to visualize performance data recorded over an entire system using flamegraphs.

Prerequisites

Procedure

  • Record the data and create the visualization:

    # perf script flamegraph -a -F 99 sleep 60

    This command samples and records performance data over the entire system for 60 seconds, as stipulated by use of the sleep command, and then constructs the visualization which will be stored in the current active directory as flamegraph.html. The command samples call-graph data by default and takes the same arguments as the perf tool, in this particular case:

    -a
    Stipulates to record data over the entire system.
    -F
    To set the sampling frequency per second.

Verification steps

  • For analysis, view the generated visualization:

    # xdg-open flamegraph.html

    This command opens the visualization in the default browser:

    flamegraph allcpus

26.3. Creating flamegraphs over specific processes

You can use flamegraphs to visualize performance data recorded over specific running processes.

Prerequisites

Procedure

  • Record the data and create the visualization:

    # perf script flamegraph -a -F 99 -p ID1,ID2 sleep 60

    This command samples and records performance data of the processes with the process ID’s ID1 and ID2 for 60 seconds, as stipulated by use of the sleep command, and then constructs the visualization which will be stored in the current active directory as flamegraph.html. The command samples call-graph data by default and takes the same arguments as the perf tool, in this particular case:

    -a
    Stipulates to record data over the entire system.
    -F
    To set the sampling frequency per second.
    -p
    To stipulate specific process ID’s to sample and record data over.

Verification steps

  • For analysis, view the generated visualization:

    # xdg-open flamegraph.html

    This command opens the visualization in the default browser:

    flamegraph

26.4. Interpreting flamegraphs

Each box in the flamegraph represents a different function in the stack. The y-axis shows the depth of the stack with the topmost box in each stack being the function that was actually on-CPU and everything below it being ancestry. The x-axis displays the population of the sampled call-graph data.

The children of a stack in a given row are displayed based on the number of samples taken of each respective function in descending order along the x-axis; the x-axis does not represent the passing of time. The wider an individual box is, the more frequent it was on-CPU or part of an on-CPU ancestry at the time the data was being sampled.

Procedure

  • To reveal the names of functions which may have not been displayed previously and further investigate the data click on a box within the flamegraph to zoom into the stack at that given location:

    zoomed in flamegraph

  • To return to the default view of the flamegraph, click Reset Zoom.
Important

Boxes representing user-space functions may be labeled as Unknown in flamegraphs because the binary of the function is stripped. The debuginfo package of the executable must be installed or, if the executable is a locally developed application, the application must be compiled with debugging information. Use the -g option in GCC, to display the function names or symbols in such a situation.

flamegraph

Chapter 27. Monitoring processes for performance bottlenecks using perf circular buffers

You can create circular buffers that take event-specific snapshots of data with the perf tool in order to monitor performance bottlenecks in specific processes or parts of applications running on your system. In such cases, perf only writes data to a perf.data file for later analysis if a specified event is detected.

27.1. Circular buffers and event-specific snapshots with perf

When investigating performance issues in a process or application with perf, it may not be affordable or appropriate to record data for hours preceding a specific event of interest occurring. In such cases, you can use perf record to create custom circular buffers that take snapshots after specific events.

The --overwrite option makes perf record store all data in an overwritable circular buffer. When the buffer gets full, perf record automatically overwrites the oldest records which, therefore, never get written to a perf.data file.

Using the --overwrite and --switch-output-event options together configures a circular buffer that records and dumps data continuously until it detects the --switch-output-event trigger event. The trigger event signals to perf record that something of interest to the user has occurred and to write the data in the circular buffer to a perf.data file. This collects specific data you are interested in while simultaneously reducing the overhead of the running perf process by not writing data you do not want to a perf.data file.

27.2. Collecting specific data to monitor for performance bottlenecks using perf circular buffers

With the perf tool, you can create circular buffers that are triggered by events you specify in order to only collect data you are interested in. To create circular buffers that collect event-specific data, use the --overwrite and --switch-output-event options for perf.

Prerequisites

  • The perf user space tool is installed. For more information, see Installing perf.
  • You have placed a uprobe in the process or application you are interested in monitoring at a location of interest within the process or application:

    # perf probe -x /path/to/executable -a function
    Added new event:
      probe_executable:function   (on function in /path/to/executable)
    
    You can now use it in all perf tools, such as:
    
            perf record -e probe_executable:function -aR sleep 1

Procedure

  • Create the circular buffer with the uprobe as the trigger event:

    # perf record --overwrite -e cycles --switch-output-event probe_executable:function ./executable
    [ perf record: dump data: Woken up 1 times ]
    [ perf record: Dump perf.data.2021021012231959 ]
    [ perf record: dump data: Woken up 1 times ]
    [ perf record: Dump perf.data.2021021012232008 ]
    ^C[ perf record: dump data: Woken up 1 times ]
    [ perf record: Dump perf.data.2021021012232082 ]
    [ perf record: Captured and wrote 5.621 MB perf.data.<timestamp> ]

    This example initiates the executable and collects cpu cycles, specified after the -e option, until perf detects the uprobe, the trigger event specified after the --switch-output-event option. At that point, perf takes a snapshot of all the data in the circular buffer and stores it in a unique perf.data file identified by timestamp. This example produced a total of 2 snapshots, the last perf.data file was forced by pressing Ctrl+c.

Chapter 28. Adding and removing tracepoints from a running perf collector without stopping or restarting perf

By using the control pipe interface to enable and disable different tracepoints in a running perf collector, you can dynamically adjust what data you are collecting without having to stop or restart perf. This ensures you do not lose performance data that would have otherwise been recorded during the stopping or restarting process.

28.1. Adding tracepoints to a running perf collector without stopping or restarting perf

Add tracepoints to a running perf collector using the control pipe interface to adjust the data you are recording without having to stop perf and losing performance data.

Prerequisites

  • The perf user space tool is installed. For more information, see Installing perf.

Procedure

  1. Configure the control pipe interface:

    # mkfifo control ack perf.pipe
  2. Run perf record with the control file setup and events you are interested in enabling:

    # perf record --control=fifo:control,ack -D -1 --no-buffering -e 'sched:*' -o - > perf.pipe

    In this example, declaring 'sched:*' after the -e option starts perf record with scheduler events.

  3. In a second terminal, start the read side of the control pipe:

    # cat perf.pipe | perf --no-pager script -i -

    Starting the read side of the control pipe triggers the following message in the first terminal:

    Events disabled
  4. In a third terminal, enable a tracepoint using the control file:

    # echo 'enable sched:sched_process_fork' > control

    This command triggers perf to scan the current event list in the control file for the declared event. If the event is present, the tracepoint is enabled and the following message appears in the first terminal:

    event sched:sched_process_fork enabled

    Once the tracepoint is enabled, the second terminal displays the output from perf detecting the tracepoint:

    bash 33349 [034] 149587.674295: sched:sched_process_fork: comm=bash pid=33349 child_comm=bash child_pid=34056

28.2. Removing tracepoints from a running perf collector without stopping or restarting perf

Remove tracepoints from a running perf collector using the control pipe interface to reduce the scope of data you are collecting without having to stop perf and losing performance data.

Prerequisites

Procedure

  • Remove the tracepoint:

    # echo 'disable sched:sched_process_fork' > control
    Note

    This example assumes you have previously loaded scheduler events into the control file and enabled the tracepoint sched:sched_process_fork.

    This command triggers perf to scan the current event list in the control file for the declared event. If the event is present, the tracepoint is disabled and the following message appears in the terminal used to configure the control pipe:

    event sched:sched_process_fork disabled

Chapter 29. Profiling memory allocation with numastat

With the numastat tool, you can display statistics over memory allocations in a system.

The numastat tool displays data for each NUMA node separately. You can use this information to investigate memory performance of your system or the effectiveness of different memory policies on your system.

29.1. Default numastat statistics

By default, the numastat tool displays statistics over these categories of data for each NUMA node:

numa_hit
The number of pages that were successfully allocated to this node.
numa_miss
The number of pages that were allocated on this node because of low memory on the intended node. Each numa_miss event has a corresponding numa_foreign event on another node.
numa_foreign
The number of pages initially intended for this node that were allocated to another node instead. Each numa_foreign event has a corresponding numa_miss event on another node.
interleave_hit
The number of interleave policy pages successfully allocated to this node.
local_node
The number of pages successfully allocated on this node by a process on this node.
other_node
The number of pages allocated on this node by a process on another node.
Note

High numa_hit values and low numa_miss values (relative to each other) indicate optimal performance.

29.2. Viewing memory allocation with numastat

You can view the memory allocation of the system by using the numastat tool.

Prerequisites

  • Install the numactl package:

    # yum install numactl

Procedure

  • View the memory allocation of your system:

    $ numastat
                                 node0         node1
    numa_hit                  76557759      92126519
    numa_miss                 30772308      30827638
    numa_foreign              30827638      30772308
    interleave_hit              106507        103832
    local_node                76502227      92086995
    other_node                30827840      30867162

Additional resources

  • numastat(8) man page

Chapter 30. Configuring an operating system to optimize CPU utilization

This section describes how to configure the operating system to optimize CPU utilization across their workloads.

30.1. Tools for monitoring and diagnosing processor issues

The following are the tools available in Red Hat Enterprise Linux 8 to monitor and diagnose processor-related performance issues:

  • turbostat tool prints counter results at specified intervals to help administrators identify unexpected behavior in servers, such as excessive power usage, failure to enter deep sleep states, or system management interrupts (SMIs) being created unnecessarily.
  • numactl utility provides a number of options to manage processor and memory affinity. The numactl package includes the libnuma library which offers a simple programming interface to the NUMA policy supported by the kernel, and can be used for more fine-grained tuning than the numactl application.
  • numastat tool displays per-NUMA node memory statistics for the operating system and its processes, and shows administrators whether the process memory is spread throughout a system or is centralized on specific nodes. This tool is provided by the numactl package.
  • numad is an automatic NUMA affinity management daemon. It monitors NUMA topology and resource usage within a system in order to dynamically improve NUMA resource allocation and management.
  • /proc/interrupts file displays the interrupt request (IRQ) number, the number of similar interrupt requests handled by each processor in the system, the type of interrupt sent, and a comma-separated list of devices that respond to the listed interrupt request.
  • pqos utility is available in the intel-cmt-cat package. It monitors CPU cache and memory bandwidth on recent Intel processors. It monitors:

    • The instructions per cycle (IPC).
    • The count of last level cache MISSES.
    • The size in kilobytes that the program executing in a given CPU occupies in the LLC.
    • The bandwidth to local memory (MBL).
    • The bandwidth to remote memory (MBR).
  • x86_energy_perf_policy tool allows administrators to define the relative importance of performance and energy efficiency. This information can then be used to influence processors that support this feature when they select options that trade off between performance and energy efficiency.
  • taskset tool is provided by the util-linux package. It allows administrators to retrieve and set the processor affinity of a running process, or launch a process with a specified processor affinity.

Additional resources

  • turbostat(8), numactl(8), numastat(8), numa(7), numad(8), pqos(8), x86_energy_perf_policy(8), and taskset(1) man pages

30.2. Types of system topology

In modern computing, the idea of a CPU is a misleading one, as most modern systems have multiple processors. The topology of the system is the way these processors are connected to each other and to other system resources. This can affect system and application performance, and the tuning considerations for a system.

The following are the two primary types of topology used in modern computing:

Symmetric Multi-Processor (SMP) topology
SMP topology allows all processors to access memory in the same amount of time. However, because shared and equal memory access inherently forces serialized memory accesses from all the CPUs, SMP system scaling constraints are now generally viewed as unacceptable. For this reason, practically all modern server systems are NUMA machines.
Non-Uniform Memory Access (NUMA) topology

NUMA topology was developed more recently than SMP topology. In a NUMA system, multiple processors are physically grouped on a socket. Each socket has a dedicated area of memory and processors that have local access to that memory, these are referred to collectively as a node. Processors on the same node have high speed access to that node’s memory bank, and slower access to memory banks not on their node.

Therefore, there is a performance penalty when accessing non-local memory. Thus, performance sensitive applications on a system with NUMA topology should access memory that is on the same node as the processor executing the application, and should avoid accessing remote memory wherever possible.

Multi-threaded applications that are sensitive to performance may benefit from being configured to execute on a specific NUMA node rather than a specific processor. Whether this is suitable depends on your system and the requirements of your application. If multiple application threads access the same cached data, then configuring those threads to execute on the same processor may be suitable. However, if multiple threads that access and cache different data execute on the same processor, each thread may evict cached data accessed by a previous thread. This means that each thread 'misses' the cache and wastes execution time fetching data from memory and replacing it in the cache. Use the perf tool to check for an excessive number of cache misses.

30.2.1. Displaying system topologies

There are a number of commands that help understand the topology of a system. This procedure describes how to determine the system topology.

Procedure

  • To display an overview of your system topology:

    $ numactl --hardware
    available: 4 nodes (0-3)
    node 0 cpus: 0 4 8 12 16 20 24 28 32 36
    node 0 size: 65415 MB
    node 0 free: 43971 MB
    [...]
  • To gather the information about the CPU architecture, such as the number of CPUs, threads, cores, sockets, and NUMA nodes:

    $ lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                40
    On-line CPU(s) list:   0-39
    Thread(s) per core:    1
    Core(s) per socket:    10
    Socket(s):             4
    NUMA node(s):          4
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 47
    Model name:            Intel(R) Xeon(R) CPU E7- 4870  @ 2.40GHz
    Stepping:              2
    CPU MHz:               2394.204
    BogoMIPS:              4787.85
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              30720K
    NUMA node0 CPU(s):     0,4,8,12,16,20,24,28,32,36
    NUMA node1 CPU(s):     2,6,10,14,18,22,26,30,34,38
    NUMA node2 CPU(s):     1,5,9,13,17,21,25,29,33,37
    NUMA node3 CPU(s):     3,7,11,15,19,23,27,31,35,39
  • To view a graphical representation of your system:

    # yum install hwloc-gui
    # lstopo

    Figure 30.1. The lstopo output

    lstopo
  • To view the detailed textual output:

    # yum install hwloc
    # lstopo-no-graphics
    Machine (15GB)
      Package L#0 + L3 L#0 (8192KB)
        L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
            PU L#0 (P#0)
            PU L#1 (P#4)
           HostBridge L#0
        PCI 8086:5917
            GPU L#0 "renderD128"
            GPU L#1 "controlD64"
            GPU L#2 "card0"
        PCIBridge
            PCI 8086:24fd
              Net L#3 "wlp61s0"
        PCIBridge
            PCI 8086:f1a6
        PCI 8086:15d7
            Net L#4 "enp0s31f6"

Additional resources

  • numactl(8), lscpu(1), and lstopo(1) man pages

30.3. Configuring kernel tick time

By default, Red Hat Enterprise Linux 8 uses a tickless kernel, which does not interrupt idle CPUs in order to reduce power usage and allow new processors to take advantage of deep sleep states.

Red Hat Enterprise Linux 8 also offers a dynamic tickless option, which is useful for latency-sensitive workloads, such as high performance computing or realtime computing. By default, the dynamic tickless option is disabled. Red Hat recommends using the cpu-partitioning Tuned profile to enable the dynamic tickless option for cores specified as isolated_cores.

This procedure describes how to manually persistently enable dynamic tickless behavior.

Procedure

  1. To enable dynamic tickless behavior in certain cores, specify those cores on the kernel command line with the nohz_full parameter. On a 16 core system, append this parameter on the GRUB_CMDLINE_LINUX option in the /etc/default/grub file:

    nohz_full=1-15

    This enables dynamic tickless behavior on cores 1 through 15, moving all timekeeping to the only unspecified core (core 0).

  2. To persistently enable the dynamic tickless behavior, regenerate the GRUB2 configuration using the edited default file. On systems with BIOS firmware, execute the following command:

    # grub2-mkconfig -o /etc/grub2.cfg

    On systems with UEFI firmware, execute the following command:

    # grub2-mkconfig -o /etc/grub2-efi.cfg
  3. When the system boots, manually move the rcu threads to the non-latency-sensitive core, in this case core 0:

    # for i in `pgrep rcu[^c]` ; do taskset -pc 0 $i ; done
  4. Optional: Use the isolcpus parameter on the kernel command line to isolate certain cores from user-space tasks.
  5. Optional: Set the CPU affinity for the kernel’s write-back bdi-flush threads to the housekeeping core:

    echo 1 > /sys/bus/workqueue/devices/writeback/cpumask

Verification steps

  • Once the system is rebooted, verify if dynticks are enabled:

    # journalctl -xe | grep dynticks
    Mar 15 18:34:54 rhel-server kernel: NO_HZ: Full dynticks CPUs: 1-15.
  • Verify that the dynamic tickless configuration is working correctly:

    # perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 sleep 3

    This command measures ticks on CPU 1 while telling CPU 1 to sleep for 3 seconds.

  • The default kernel timer configuration shows around 3100 ticks on a regular CPU:

    # perf stat -C 0 -e irq_vectors:local_timer_entry taskset -c 0 sleep 3
    
     Performance counter stats for 'CPU(s) 0':
    
                 3,107      irq_vectors:local_timer_entry
    
           3.001342790 seconds time elapsed
  • With the dynamic tickless kernel configured, you should see around 4 ticks instead:

    # perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 sleep 3
    
     Performance counter stats for 'CPU(s) 1':
    
                     4      irq_vectors:local_timer_entry
    
           3.001544078 seconds time elapsed

30.4. Overview of an interrupt request

An interrupt request or IRQ is a signal for immediate attention sent from a piece of hardware to a processor. Each device in a system is assigned one or more IRQ numbers which allow it to send unique interrupts. When interrupts are enabled, a processor that receives an interrupt request immediately pauses execution of the current application thread in order to address the interrupt request.

Because interrupt halts normal operation, high interrupt rates can severely degrade system performance. It is possible to reduce the amount of time taken by interrupts by configuring interrupt affinity or by sending a number of lower priority interrupts in a batch (coalescing a number of interrupts).

Interrupt requests have an associated affinity property, smp_affinity, which defines the processors that handle the interrupt request. To improve application performance, assign interrupt affinity and process affinity to the same processor, or processors on the same core. This allows the specified interrupt and application threads to share cache lines.

On systems that support interrupt steering, modifying the smp_affinity property of an interrupt request sets up the hardware so that the decision to service an interrupt with a particular processor is made at the hardware level with no intervention from the kernel.

30.4.1. Balancing interrupts manually

If your BIOS exports its NUMA topology, the irqbalance service can automatically serve interrupt requests on the node that is local to the hardware requesting service.

Procedure

  1. Check which devices correspond to the interrupt requests that you want to configure.
  2. Find the hardware specification for your platform. Check if the chipset on your system supports distributing interrupts.

    1. If it does, you can configure interrupt delivery as described in the following steps. Additionally, check which algorithm your chipset uses to balance interrupts. Some BIOSes have options to configure interrupt delivery.
    2. If it does not, your chipset always routes all interrupts to a single, static CPU. You cannot configure which CPU is used.
  3. Check which Advanced Programmable Interrupt Controller (APIC) mode is in use on your system:

    $ journalctl --dmesg | grep APIC

    Here,

    • If your system uses a mode other than flat, you can see a line similar to Setting APIC routing to physical flat.
    • If you can see no such message, your system uses flat mode.

      If your system uses x2apic mode, you can disable it by adding the nox2apic option to the kernel command line in the bootloader configuration.

      Only non-physical flat mode (flat) supports distributing interrupts to multiple CPUs. This mode is available only for systems that have up to 8 CPUs.

  4. Calculate the smp_affinity mask. For more information on how to calculate the smp_affinity mask, see Setting the smp_affinity mask.

Additional resources

  • journalctl(1) and taskset(1) man pages

30.4.2. Setting the smp_affinity mask

The smp_affinity value is stored as a hexadecimal bit mask representing all processors in the system. Each bit configures a different CPU. The least significant bit is CPU 0.

The default value of the mask is f, which means that an interrupt request can be handled on any processor in the system. Setting this value to 1 means that only processor 0 can handle the interrupt.

Procedure

  1. In binary, use the value 1 for CPUs that handle the interrupts. For example, to set CPU 0 and CPU 7 to handle interrupts, use 0000000010000001 as the binary code:

    Table 30.1. Binary Bits for CPUs

    CPU

    15

    14

    13

    12

    11

    10

    9

    8

    7

    6

    5

    4

    3

    2

    1

    0

    Binary

    0

    0

    0

    0

    0

    0

    0

    0

    1

    0

    0

    0

    0

    0

    0

    1

  2. Convert the binary code to hexadecimal:

    For example, to convert the binary code using Python:

    >>> hex(int('0000000010000001', 2))
    
    '0x81'

    On systems with more than 32 processors, you must delimit the smp_affinity values for discrete 32 bit groups. For example, if you want only the first 32 processors of a 64 processor system to service an interrupt request, use 0xffffffff,00000000.

  3. The interrupt affinity value for a particular interrupt request is stored in the associated /proc/irq/irq_number/smp_affinity file. Set the smp_affinity mask in this file:

    # echo mask > /proc/irq/irq_number/smp_affinity

Additional resources

  • journalctl(1), irqbalance(1), and taskset(1) man pages

Chapter 31. Tuning scheduling policy

In Red Hat Enterprise Linux, the smallest unit of process execution is called a thread. The system scheduler determines which processor runs a thread, and for how long the thread runs. However, because the scheduler’s primary concern is to keep the system busy, it may not schedule threads optimally for application performance.

For example, say an application on a NUMA system is running on Node A when a processor on Node B becomes available. To keep the processor on Node B busy, the scheduler moves one of the application’s threads to Node B. However, the application thread still requires access to memory on Node A. But, this memory will take longer to access because the thread is now running on Node B and Node A memory is no longer local to the thread. Thus, it may take longer for the thread to finish running on Node B than it would have taken to wait for a processor on Node A to become available, and then to execute the thread on the original node with local memory access.

31.1. Categories of scheduling policies

Performance sensitive applications often benefit from the designer or administrator determining where threads are run. The Linux scheduler implements a number of scheduling policies which determine where and for how long a thread runs.

The following are the two major categories of scheduling policies:

Normal policies
Normal threads are used for tasks of normal priority.
Realtime policies

Realtime policies are used for time-sensitive tasks that must complete without interruptions. Realtime threads are not subject to time slicing. This means the thread runs until they block, exit, voluntarily yield, or are preempted by a higher priority thread.

The lowest priority realtime thread is scheduled before any thread with a normal policy. For more information, see Static priority scheduling with SCHED_FIFO and Round robin priority scheduling with SCHED_RR.

Additional resources

  • sched(7), sched_setaffinity(2), sched_getaffinity(2), sched_setscheduler(2), and sched_getscheduler(2) man pages

31.2. Static priority scheduling with SCHED_FIFO

The SCHED_FIFO, also called static priority scheduling, is a realtime policy that defines a fixed priority for each thread. This policy allows administrators to improve event response time and reduce latency. It is recommended to not execute this policy for an extended period of time for time sensitive tasks.

When SCHED_FIFO is in use, the scheduler scans the list of all the SCHED_FIFO threads in order of priority and schedules the highest priority thread that is ready to run. The priority level of a SCHED_FIFO thread can be any integer from 1 to 99, where 99 is treated as the highest priority. Red Hat recommends starting with a lower number and increasing priority only when you identify latency issues.

Warning

Because realtime threads are not subject to time slicing, Red Hat does not recommend setting a priority as 99. This keeps your process at the same priority level as migration and watchdog threads; if your thread goes into a computational loop and these threads are blocked, they will not be able to run. Systems with a single processor will eventually hang in this situation.

Administrators can limit SCHED_FIFO bandwidth to prevent realtime application programmers from initiating realtime tasks that monopolize the processor.

The following are some of the parameters used in this policy:

/proc/sys/kernel/sched_rt_period_us
This parameter defines the time period, in microseconds, that is considered to be one hundred percent of the processor bandwidth. The default value is 1000000 μs, or 1 second.
/proc/sys/kernel/sched_rt_runtime_us
This parameter defines the time period, in microseconds, that is devoted to running real-time threads. The default value is 950000 μs, or 0.95 seconds.

31.3. Round robin priority scheduling with SCHED_RR

The SCHED_RR is a round-robin variant of the SCHED_FIFO. This policy is useful when multiple threads need to run at the same priority level.

Like SCHED_FIFO, SCHED_RR is a realtime policy that defines a fixed priority for each thread. The scheduler scans the list of all SCHED_RR threads in order of priority and schedules the highest priority thread that is ready to run. However, unlike SCHED_FIFO, threads that have the same priority are scheduled in a round-robin style within a certain time slice.

You can set the value of this time slice in milliseconds with the sched_rr_timeslice_ms kernel parameter in the /proc/sys/kernel/sched_rr_timeslice_ms file. The lowest value is 1 millisecond.

31.4. Normal scheduling with SCHED_OTHER

The SCHED_OTHER is the default scheduling policy in Red Hat Enterprise Linux 8. This policy uses the Completely Fair Scheduler (CFS) to allow fair processor access to all threads scheduled with this policy. This policy is most useful when there are a large number of threads or when data throughput is a priority, as it allows more efficient scheduling of threads over time.

When this policy is in use, the scheduler creates a dynamic priority list based partly on the niceness value of each process thread. Administrators can change the niceness value of a process, but cannot change the scheduler’s dynamic priority list directly.

31.5. Setting scheduler policies

Check and adjust scheduler policies and priorities by using the chrt command line tool. It can start new processes with the desired properties, or change the properties of a running process. It can also be used for setting the policy at runtime.

Procedure

  1. View the process ID (PID) of the active processes:

    # ps

    Use the --pid or -p option with the ps command to view the details of the particular PID.

  2. Check the scheduling policy, PID, and priority of a particular process:

    # chrt -p 468
    pid 468's current scheduling policy: SCHED_FIFO
    pid 468's current scheduling priority: 85
    
    # chrt -p 476
    pid 476's current scheduling policy: SCHED_OTHER
    pid 476's current scheduling priority: 0

    Here, 468 and 476 are PID of a process.

  3. Set the scheduling policy of a process:

    1. For example, to set the process with PID 1000 to SCHED_FIFO, with a priority of 50:

      # chrt -f -p 50 1000
    2. For example, to set the process with PID 1000 to SCHED_OTHER, with a priority of 0:

      # chrt -o -p 0 1000
    3. For example, to set the process with PID 1000 to SCHED_RR, with a priority of 10:

      # chrt -r -p 10 1000
    4. To start a new application with a particular policy and priority, specify the name of the application:

      # chrt -f 36 /bin/my-app

31.6. Policy options for the chrt command

Using the chrt command, you can view and set the scheduling policy of a process.

The following table describes the appropriate policy options, which can be used to set the scheduling policy of a process.

Table 31.1. Policy Options for the chrt Command

Short optionLong optionDescription

-f

--fifo

Set schedule to SCHED_FIFO

-o

--other

Set schedule to SCHED_OTHER

-r

--rr

Set schedule to SCHED_RR

31.7. Changing the priority of services during the boot process

Using the systemd service, it is possible to set up real-time priorities for services launched during the boot process. The unit configuration directives are used to change the priority of a service during the boot process.

The boot process priority change is done by using the following directives in the service section:

CPUSchedulingPolicy=
Sets the CPU scheduling policy for executed processes. It is used to set other, fifo, and rr policies.
CPUSchedulingPriority=
Sets the CPU scheduling priority for executed processes. The available priority range depends on the selected CPU scheduling policy. For real-time scheduling policies, an integer between 1 (lowest priority) and 99 (highest priority) can be used.

The following procedure describes how to change the priority of a service, during the boot process, using the mcelog service.

Prerequisites

  1. Install the tuned package:

    # yum install tuned
  2. Enable and start the tuned service:

    # systemctl enable --now tuned

Procedure

  1. View the scheduling priorities of running threads:

    # tuna --show_threads
                          thread       ctxt_switches
        pid SCHED_ rtpri affinity voluntary nonvoluntary             cmd
      1      OTHER     0     0xff      3181          292         systemd
      2      OTHER     0     0xff       254            0        kthreadd
      3      OTHER     0     0xff         2            0          rcu_gp
      4      OTHER     0     0xff         2            0      rcu_par_gp
      6      OTHER     0        0         9            0 kworker/0:0H-kblockd
      7      OTHER     0     0xff      1301            1 kworker/u16:0-events_unbound
      8      OTHER     0     0xff         2            0    mm_percpu_wq
      9      OTHER     0        0       266            0     ksoftirqd/0
    [...]
  2. Create a supplementary mcelog service configuration directory file and insert the policy name and priority in this file:

    # cat <<-EOF > /etc/systemd/system/mcelog.system.d/priority.conf
    
    >
    [SERVICE]
    CPUSchedulingPolicy=_fifo_
    CPUSchedulingPriority=_20_
    EOF
  3. Reload the systemd scripts configuration:

    # systemctl daemon-reload
  4. Restart the mcelog service:

    # systemctl restart mcelog

Verification steps

  • Display the mcelog priority set by systemd issue:

    # tuna -t mcelog -P
    thread       ctxt_switches
      pid SCHED_ rtpri affinity voluntary nonvoluntary             cmd
    826     FIFO    20  0,1,2,3        13            0          mcelog

Additional resources

31.8. Priority map

Priorities are defined in groups, with some groups dedicated to certain kernel functions. For real-time scheduling policies, an integer between 1 (lowest priority) and 99 (highest priority) can be used.

The following table describes the priority range, which can be used while setting the scheduling policy of a process.

Table 31.2. Description of the priority range

PriorityThreadsDescription

1

Low priority kernel threads

This priority is usually reserved for the tasks that need to be just above SCHED_OTHER.

2 - 49

Available for use

The range used for typical application priorities.

50

Default hard-IRQ value

 

51 - 98

High priority threads

Use this range for threads that execute periodically and must have quick response times. Do not use this range for CPU-bound threads as you will starve interrupts.

99

Watchdogs and migration

System threads that must run at the highest priority.

31.9. cpu-partitioning profile

The cpu-partitioning profile is used to isolate CPUs from system level interruptions. Once you have isolated these CPUs, you can allocate them for specific applications. This is very useful in low-latency environments or in environments where you wish to extract the maximum performance from your hardware.

This profile also lets you designate housekeeping CPUs. A housekeeping CPU is used to run all services, daemons, shell processes, and kernel threads.

You can configure the cpu-partitioning profile in the /etc/tuned/cpu-partitioning-variables.conf file using the following configuration options:

isolated_cores=cpu-list
Lists CPUs to isolate. The list of isolated CPUs is comma-separated or you can specify a range using a dash, such as 3-5. This option is mandatory. Any CPU missing from this list is automatically considered a housekeeping CPU.
no_balance_cores=cpu-list
Lists CPUs which are not considered by the kernel during system wide process load-balancing. This option is optional. This is usually the same list as isolated_cores.

Additional resources

  • tuned-profiles-cpu-partitioning(7) man page

Chapter 32. Factors affecting I/O and file system performance

The appropriate settings for storage and file system performance are highly dependent on the storage purpose.

I/O and file system performance can be affected by any of the following factors:

  • Data write or read patterns
  • Sequential or random
  • Buffered or Direct IO
  • Data alignment with underlying geometry
  • Block size
  • File system size
  • Journal size and location
  • Recording access times
  • Ensuring data reliability
  • Pre-fetching data
  • Pre-allocating disk space
  • File fragmentation
  • Resource contention

32.1. Tools for monitoring and diagnosing I/O and file system issues

The following tools are available in Red Hat Enterprise Linux 8 for monitoring system performance and diagnosing performance problems related to I/O, file systems, and their configuration:

  • vmstat tool reports on processes, memory, paging, block I/O, interrupts, and CPU activity across the entire system. It can help administrators determine whether the I/O subsystem is responsible for any performance issues. If analysis with vmstat shows that the I/O subsystem is responsible for reduced performance, administrators can use the iostat tool to determine the responsible I/O device.
  • iostat reports on I/O device load in your system. It is provided by the sysstat package.
  • blktrace provides detailed information about how time is spent in the I/O subsystem. The companion utility blkparse reads the raw output from blktrace and produces a human readable summary of input and output operations recorded by blktrace.
  • btt analyzes blktrace output and displays the amount of time that data spends in each area of the I/O stack, making it easier to spot bottlenecks in the I/O subsystem. This utility is provided as part of the blktrace package. Some of the important events tracked by the blktrace mechanism and analyzed by btt are:

    • Queuing of the I/O event (Q)
    • Dispatch of the I/O to the driver event (D)
    • Completion of I/O event (C)
  • iowatcher can use the blktrace output to graph I/O over time. It focuses on the Logical Block Address (LBA) of disk I/O, throughput in megabytes per second, the number of seeks per second, and I/O operations per second. This can help to identify when you are hitting the operations-per-second limit of a device.
  • BPF Compiler Collection (BCC) is a library, which facilitates the creation of the extended Berkeley Packet Filter (eBPF) programs. The eBPF programs are triggered on events, such as disk I/O, TCP connections, and process creations. The BCC tools are installed in the /usr/share/bcc/tools/ directory. The following bcc-tools helps to analyze performance:

    • biolatency summarizes the latency in block device I/O (disk I/O) in histogram. This allows the distribution to be studied, including two modes for device cache hits and for cache misses, and latency outliers.
    • biosnoop is a basic block I/O tracing tool for displaying each I/O event along with the issuing process ID, and the I/O latency. Using this tool, you can investigate disk I/O performance issues.
    • biotop is used for block i/o operations in the kernel.
    • filelife tool traces the stat() syscalls.
    • fileslower traces slow synchronous file reads and writes.
    • filetop displays file reads and writes by process.
    • ext4slower, nfsslower, and xfsslower are tools that show file system operations slower than a certain threshold, which defaults to 10ms.

      For more information, see the Analyzing system performance with BPF Compiler Collection.

  • bpftace is a tracing language for eBPF used for analyzing performance issues. It also provides trace utilities like BCC for system observation, which is useful for investigating I/O performance issues.
  • The following SystemTap scripts may be useful in diagnosing storage or file system performance problems:

    • disktop.stp: Checks the status of reading or writing disk every 5 seconds and outputs the top ten entries during that period.
    • iotime.stp: Prints the amount of time spent on read and write operations, and the number of bytes read and written.
    • traceio.stp: Prints the top ten executable based on cumulative I/O traffic observed, every second.
    • traceio2.stp: Prints the executable name and process identifier as reads and writes to the specified device occur.
    • Inodewatch.stp: Prints the executable name and process identifier each time a read or write occurs to the specified inode on the specified major or minor device.
    • inodewatch2.stp: Prints the executable name, process identifier, and attributes each time the attributes are changed on the specified inode on the specified major or minor device.

Additional resources

32.2. Available tuning options for formatting a file system

Some file system configuration decisions cannot be changed after the device is formatted.

The following are the options available before formatting a storage device:

Size
Create an appropriately-sized file system for your workload. Smaller file systems require less time and memory for file system checks. However, if a file system is too small, its performance suffers from high fragmentation.
Block size

The block is the unit of work for the file system. The block size determines how much data can be stored in a single block, and therefore the smallest amount of data that is written or read at one time.

The default block size is appropriate for most use cases. However, your file system performs better and stores data more efficiently if the block size or the size of multiple blocks is the same as or slightly larger than the amount of data that is typically read or written at one time. A small file still uses an entire block. Files can be spread across multiple blocks, but this can create additional runtime overhead.

Additionally, some file systems are limited to a certain number of blocks, which in turn limits the maximum size of the file system. Block size is specified as part of the file system options when formatting a device with the mkfs command. The parameter that specifies the block size varies with the file system.

Geometry

File system geometry is concerned with the distribution of data across a file system. If your system uses striped storage, like RAID, you can improve performance by aligning data and metadata with the underlying storage geometry when you format the device.

Many devices export recommended geometry, which is then set automatically when the devices are formatted with a particular file system. If your device does not export these recommendations, or you want to change the recommended settings, you must specify geometry manually when you format the device with the mkfs command.

The parameters that specify file system geometry vary with the file system.

External journals
Journaling file systems document the changes that will be made during a write operation in a journal file prior to the operation being executed. This reduces the likelihood that a storage device will become corrupted in the event of a system crash or power failure, and speeds up the recovery process.
Note

Red Hat does not recommend using the external journals option.

Metadata-intensive workloads involve very frequent updates to the journal. A larger journal uses more memory, but reduces the frequency of write operations. Additionally, you can improve the seek time of a device with a metadata-intensive workload by placing its journal on dedicated storage that is as fast as, or faster than, the primary storage.

Warning

Ensure that external journals are reliable. Losing an external journal device causes file system corruption. External journals must be created at format time, with journal devices being specified at mount time.

Additional resources

32.3. Available tuning options for mounting a file system

The following are the options available to most file systems and can be specified as the device is mounted:

Access Time

Every time a file is read, its metadata is updated with the time at which access occurred (atime). This involves additional write I/O. The relatime is the default atime setting for most file systems.

However, if updating this metadata is time consuming, and if accurate access time data is not required, you can mount the file system with the noatime mount option. This disables updates to metadata when a file is read. It also enables nodiratime behavior, which disables updates to metadata when a directory is read.

Note

Disabling atime updates by using the noatime mount option can break applications that rely on them, for example, backup programs.

Read-ahead

Read-ahead behavior speeds up file access by pre-fetching data that is likely to be needed soon and loading it into the page cache, where it can be retrieved more quickly than if it were on disk. The higher the read-ahead value, the further ahead the system pre-fetches data.

Red Hat Enterprise Linux attempts to set an appropriate read-ahead value based on what it detects about your file system. However, accurate detection is not always possible. For example, if a storage array presents itself to the system as a single LUN, the system detects the single LUN, and does not set the appropriate read-ahead value for an array.

Workloads that involve heavy streaming of sequential I/O often benefit from high read-ahead values. The storage-related tuned profiles provided with Red Hat Enterprise Linux raise the read-ahead value, as does using LVM striping, but these adjustments are not always sufficient for all workloads.

Additional resources

  • mount(8), xfs(5), and ext4(5) man pages

32.4. Types of discarding unused blocks

Regularly discarding blocks that are not in use by the file system is a recommended practice for both solid-state disks and thinly-provisioned storage.

The following are the two methods of discarding unused blocks:

Batch discard
This type of discard is part of the fstrim command. It discards all unused blocks in a file system that match criteria specified by the administrator. Red Hat Enterprise Linux 8 supports batch discard on XFS and ext4 formatted devices that support physical discard operations.
Online discard

This type of discard operation is configured at mount time with the discard option, and runs in real time without user intervention. However, it only discards blocks that are transitioning from used to free. Red Hat Enterprise Linux 8 supports online discard on XFS and ext4 formatted devices.

Red Hat recommends batch discard, except where online discard is required to maintain performance, or where batch discard is not feasible for the system’s workload.

Pre-allocation marks disk space as being allocated to a file without writing any data into that space. This can be useful in limiting data fragmentation and poor read performance. Red Hat Enterprise Linux 8 supports pre-allocating space on XFS, ext4, and GFS2 file systems. Applications can also benefit from pre-allocating space by using the fallocate(2) glibc call.

Additional resources

  • mount(8) and fallocate(2) man pages

32.5. Solid-state disks tuning considerations

Solid-state disks (SSD) use NAND flash chips rather than rotating magnetic platters to store persistent data. SSD provides a constant access time for data across their full Logical Block Address range, and does not incur measurable seek costs like their rotating counterparts. They are more expensive per gigabyte of storage space and have a lesser storage density, but they also have lower latency and greater throughput than HDDs.

Performance generally degrades as the used blocks on an SSD approach the capacity of the disk. The degree of degradation varies by vendor, but all devices experience degradation in this circumstance. Enabling discard behavior can help to alleviate this degradation. For more information, see Types of discarding unused blocks.

The default I/O scheduler and virtual memory options are suitable for use with SSDs. Consider the following factors when configuring settings that can affect SSD performance:

I/O Scheduler

Any I/O scheduler is expected to perform well with most SSDs. However, as with any other storage type, Red Hat recommends benchmarking to determine the optimal configuration for a given workload. When using SSDs, Red Hat advises changing the I/O scheduler only for benchmarking particular workloads. For instructions on how to switch between I/O schedulers, see the /usr/share/doc/kernel-version/Documentation/block/switching-sched.txt file.

For single queue HBA, the default I/O scheduler is deadline. For multiple queue HBA, the default I/O scheduler is none. For information on how to set the I/O scheduler, see Setting the disk scheduler.

Virtual Memory
Like the I/O scheduler, virtual memory (VM) subsystem requires no special tuning. Given the fast nature of I/O on SSD, try turning down the vm_dirty_background_ratio and vm_dirty_ratio settings, as increased write-out activity does not usually have a negative impact on the latency of other operations on the disk. However, this tuning can generate more overall I/O, and is therefore not generally recommended without workload-specific testing.
Swap
An SSD can also be used as a swap device, and is likely to produce good page-out and page-in performance.

32.6. Generic block device tuning parameters

The generic tuning parameters listed in this section are available in the /sys/block/sdX/queue/ directory.

The following listed tuning parameters are separate from I/O scheduler tuning, and are applicable to all I/O schedulers:

add_random
Some I/O events contribute to the entropy pool for the /dev/random. This parameter can be set to 0 if the overhead of these contributions become measurable.
iostats

By default, iostats is enabled and the default value is 1. Setting iostats value to 0 disables the gathering of I/O statistics for the device, which removes a small amount of overhead with the I/O path. Setting iostats to 0 might slightly improve performance for very high performance devices, such as certain NVMe solid-state storage devices. It is recommended to leave iostats enabled unless otherwise specified for the given storage model by the vendor.

If you disable iostats, the I/O statistics for the device are no longer present within the /proc/diskstats file. The content of /sys/diskstats file is the source of I/O information for monitoring I/O tools, such as sar or iostats. Therefore, if you disable the iostats parameter for a device, the device is no longer present in the output of I/O monitoring tools.

max_sectors_kb

Specifies the maximum size of an I/O request in kilobytes. The default value is 512 KB. The minimum value for this parameter is determined by the logical block size of the storage device. The maximum value for this parameter is determined by the value of the max_hw_sectors_kb.

Red Hat recommends max_sectors_kb to always be a multiple of the optimal I/O size and the internal erase block size. Use a value of logical_block_size for either parameter if they are zero or not specified by the storage device.

nomerges
Most workloads benefit from request merging. However, disabling merges can be useful for debugging purposes. By default, the nomerges parameter is set to 0, which enables merging. To disable simple one-hit merging, set nomerges to 1. To disable all types of merging, set nomerges to 2.
nr_requests
It is the maximum allowed number of the queued I/O. If the current I/O scheduler is none, this number can only be reduced; otherwise the number can be increased or reduced.
optimal_io_size
Some storage devices report an optimal I/O size through this parameter. If this value is reported, Red Hat recommends that applications issue I/O aligned to and in multiples of the optimal I/O size wherever possible.
read_ahead_kb

Defines the maximum number of kilobytes that the operating system may read ahead during a sequential read operation. As a result, the necessary information is already present within the kernel page cache for the next sequential read, which improves read I/O performance.

Device mappers often benefit from a high read_ahead_kb value. 128 KB for each device to be mapped is a good starting point, but increasing the read_ahead_kb value up to request queue’s max_sectors_kb of the disk might improve performance in application environments where sequential reading of large files takes place.

rotational
Some solid-state disks do not correctly advertise their solid-state status, and are mounted as traditional rotational disks. Manually set the rotational value to 0 to disable unnecessary seek-reducing logic in the scheduler.
rq_affinity
The default value of the rq_affinity is 1. It completes the I/O operations on one cpu core, which is in the same cpu group of the issued cpu core. To perform completions only on the processor that issued the I/O request, set the rq_affinity to 2. To disable the mentioned two abilities, set it to 0.
scheduler
To set the scheduler or scheduler preference order for a particular storage device, edit the /sys/block/devname/queue/scheduler file, where devname is the name of the device you want to configure.

Chapter 33. Configuring an operating system to optimize access to network resources

This section describes how to configure the operating system to present optimized access to network resources across their workloads. Network performance problems are sometimes the result of hardware malfunction or faulty infrastructure. Resolving these issues is beyond the scope of this document.

The Tuned service provides a number of different profiles to improve performance in a number of specific use cases:

  • latency-performance
  • network-latency
  • network-throughput

33.1. Tools for monitoring and diagnosing performance issues

The following are the available tools in Red Hat Enterprise Linux 8, which are used for monitoring system performance and diagnosing performance problems related to the networking subsystem:

  • ss utility prints statistical information about sockets, enables administrators to assess device performance over time. By default, ss displays open non-listening TCP sockets that have established connections. Using command-line options, administrators can filter out statistics about specific sockets. Red Hat recommends ss over the deprecated netstat in Red Hat Enterprise Linux
  • ip utility lets administrators manage and monitor routes, devices, routing policies, and tunnels. The ip monitor command can continuously monitor the state of devices, addresses, and routes. Use the -j option to display the output in JSON format, which can be further provided to other utilities to automate information processing.
  • dropwatch is an interactive tool, provided by the dropwatch package. It monitors and records packets that are dropped by the kernel.
  • ethtool utility enables administrators to view and edit network interface card settings. Use this tool to observe the statistics, such as the number of packets dropped by that device, of certain devices. Using the ethtool -S device name command, view the status of a specified device’s counters of the device you want to monitor.
  • /proc/net/snmp file displays data that the snmp agent uses for IP, ICMP, TCP and UDP monitoring and management. Examining this file on a regular basis helps administrators to identify unusual values and thereby identify potential performance problems. For example, an increase in UDP input errors (InErrors) in the /proc/net/snmp file can indicate a bottleneck in a socket receive queue.
  • nstat tool monitors kernel SNMP and network interface statistics. This tool reads data from the /proc/net/snmp file and prints the information in a human readable format.
  • By default, the SystemTap scripts, provided by the systemtap-client package are installed in the /usr/share/systemtap/examples/network directory:

    • nettop.stp: Every 5 seconds, the script displays a list of processes (process identifier and command) with the number of packets sent and received and the amount of data sent and received by the process during that interval.
    • socket-trace.stp: Instruments each of the functions in the Linux kernel’s net/socket.c file, and displays trace data.
    • dropwatch.stp: Every 5 seconds, the script displays the number of socket buffers freed at locations in the kernel. Use the --all-modules option to see symbolic names.
    • latencytap.stp: This script records the effect that different types of latency have on one or more processes. It prints a list of latency types every 30 seconds, sorted in descending order by the total time the process or processes spent waiting. This can be useful for identifying the cause of both storage and network latency.

    Red Hat recommends using the --all-modules option with this script to better enable the mapping of latency events. By default, this script is installed in the /usr/share/systemtap/examples/profiling directory.

  • BPF Compiler Collection (BCC) is a library, which facilitates the creation of the extended Berkeley Packet Filter (eBPF) programs. The main utility of the eBPF programs is analyzing OS performance and network performance without experiencing overhead or security issues.

Additional resources

33.2. Bottlenecks in a packet reception

While the network stack is largely self-optimizing, there are a number of points during network packet processing that can become bottlenecks and reduce the performance.

The following are the issues that can cause bottleneck:

The buffer or ring buffer of the network card
The hardware buffer can be a bottleneck if the kernel drops a large number of packets. Use the ethtool utility for monitoring a system for dropped packets.
The hardware or software interrupt queues
Interrupts can increase latency and processor contention. For information on how the processor handles interrupts, see Overview of an interrupt request, Balancing interrupts manually, and Setting the smp_affinity mask.
The socket receive queue of the application
A large number of packets that are not copied or by an increase in the UDP input errors (InErrors) in the /proc/net/snmp file, indicates a bottleneck in an application’s receive queue.

If a the hardware buffer drops a large number of packets, the following are the few potential solutions:

Slow the input traffic
Filter the incoming traffic, reduce the number of joined multicast groups, or reduce the amount of broadcast traffic to decrease the rate at which the queue fills.
Resize the hardware buffer queue

Resize the hardware buffer queue: Reduce the number of packets being dropped by increasing the size of the queue so that it does not overflow as easily. You can modify the rx/tx parameters of the network device with the ethtool command:

ethtool --set-ring device-name value

Change the drain rate of the queue
  • Decrease the rate at which the queue fills by filtering or dropping packets before they reach the queue, or by lowering the weight of the device. Filter incoming traffic or lower the network interface card’s device weight to slow incoming traffic.

    The device weight refers to the number of packets a device can receive at one time in a single scheduled processor access. You can increase the rate at which a queue is drained by increasing its device weight that is controlled by the dev_weight kernel setting. To temporarily alter this parameter, change the contents of the /proc/sys/net/core/dev_weight file, or to permanently alter, use the sysctl command, which is provided by the procps-ng package.

  • Increase the length of the application’s socket queue: This is typically the easiest way to improve the drain rate of a socket queue, but it is unlikely to be a long-term solution. If a socket queue receives a limited amount of traffic in bursts, increasing the depth of the socket queue to match the size of the bursts of traffic may prevent packets from being dropped. To increase the depth of a queue, increase the size of the socket receive buffer by making either of the following changes:

    • Increase the value of the /proc/sys/net/core/rmem_default parameter: This parameter controls the default size of the receive buffer used by sockets. This value must be smaller than or equal to the value of the proc/sys/net/core/rmem_max parameter.
    • Use the setsockopt to configure a larger SO_RCVBUF value: This parameter controls the maximum size in bytes of a socket’s receive buffer. Use the getsockopt system call to determine the current value of the buffer.

Altering the drain rate of a queue is usually the simplest way to mitigate poor network performance. However, increasing the number of packets that a device can receive at one time uses additional processor time, during which no other processes can be scheduled, so this can cause other performance problems.

Additional resources

  • ss(8), socket(7), and ethtool(8) man pages
  • /proc/net/snmp file

33.3. Busy polling

If analysis reveals high latency, your system may benefit from the poll-based rather than interrupt-based packet receipt.

Busy polling helps to reduce latency in the network receive path by allowing socket layer code to poll the receive queue of a network device, and disables network interrupts. This removes delays caused by the interrupt and the resultant context switch. However, it also increases CPU utilization. Busy polling also prevents the CPU from sleeping, which can incur additional power consumption. Busy polling behavior is supported by all the device drivers.

Additional resources

33.3.1. Enabling busy polling

By default, the busy polling is disabled. This procedure describes how to enable busy polling.

Procedure

  1. Ensure if the CONFIG_NET_RX_BUSY_POLL compilation option is enabled:

    # cat /boot/config-$(uname -r) | grep CONFIG_NET_RX_BUSY_POLL
    CONFIG_NET_RX_BUSY_POLL=y
  2. Enable busy polling

    1. To enable busy polling on specific sockets, set the sysctl.net.core.busy_poll kernel value to a value other than 0:

      # echo "net.core.busy_poll=50" > /etc/sysctl.d/95-enable-busy-polling-for-sockets.conf
      # sysctl -p /etc/sysctl.d/95-enable-busy-polling-for-sockets.conf

      This parameter controls the number of microseconds to wait for packets on the socket poll and select syscalls. Red Hat recommends a value of 50.

    2. Add the SO_BUSY_POLL socket option to the socket.
    3. To enable busy polling globally, set the sysctl.net.core.busy_read to a value other than 0:

      # echo "net.core.busy_read=50" > /etc/sysctl.d/95-enable-busy-polling-globally.conf
      # sysctl -p /etc/sysctl.d/95-enable-busy-polling-globally.conf

      The net.core.busy_read parameter controls the number of microseconds to wait for packets on the device queue for socket reads. It also sets the default value of the SO_BUSY_POLL option. Red Hat recommends a value of 50 for a small number of sockets, and a value of 100 for large numbers of sockets. For extremely large numbers of sockets, for example more than several hundred, use the epoll system call instead.

Verification steps

  • Verify if the busy poll is enabled

    # ethtool -k device | grep "busy-poll"
    busy-poll: on [fixed]
    
    # cat /proc/sys/net/core/busy_read
    50

Additional resources

33.4. Receive-Side Scaling

Receive-Side Scaling (RSS), also known as multi-queue receive, distributes network receive processing across several hardware-based receive queues, allowing inbound network traffic to be processed by multiple CPUs. RSS can be used to relieve bottlenecks in receive interrupt processing caused by overloading a single CPU, and to reduce network latency. By default, RSS is enabled.

The number of queues or the CPUs that should process network activity for RSS are configured in the appropriate network device driver:

  • For the bnx2x driver, it is configured in the num_queues parameter.
  • For the sfc driver, it is configured in the rss_cpus parameter.

Regardless, it is typically configured in the /sys/class/net/device/queues/rx-queue/ directory, where device is the name of the network device (such as enp1s0) and rx-queue is the name of the appropriate receive queue.

The irqbalance daemon can be used in conjunction with RSS to reduce the likelihood of cross-node memory transfers and cache line bouncing. This lowers the latency of processing network packets.

33.4.1. Viewing the interrupt request queues

When configuring Receive-Side Scaling (RSS), Red Hat recommends limiting the number of queues to one per physical CPU core. Hyper-threads are often represented as separate cores in analysis tools, but configuring queues for all cores including logical cores such as hyper-threads has not proven beneficial to network performance.

When enabled, RSS distributes network processing equally between available CPUs based on the amount of processing each CPU has queued. However, use the --show-rxfh-indir and --set-rxfh-indir parameters of the ethtool utility, to modify how RHEL distributes network activity, and weigh certain types of network activity as more important than others.

This procedure describes how to view the interrupt request queues.

Procedure

  • To determine whether your network interface card supports RSS, check whether multiple interrupt request queues are associated with the interface in /proc/interrupts:

    # egrep 'CPU|p1p1' /proc/interrupts
     CPU0    CPU1    CPU2    CPU3    CPU4    CPU5
    89:   40187       0       0       0       0       0   IR-PCI-MSI-edge   p1p1-0
    90:       0     790       0       0       0       0   IR-PCI-MSI-edge   p1p1-1
    91:       0       0     959       0       0       0   IR-PCI-MSI-edge   p1p1-2
    92:       0       0       0    3310       0       0   IR-PCI-MSI-edge   p1p1-3
    93:       0       0       0       0     622       0   IR-PCI-MSI-edge   p1p1-4
    94:       0       0       0       0       0    2475   IR-PCI-MSI-edge   p1p1-5

    The output shows that the NIC driver created 6 receive queues for the p1p1 interface (p1p1-0 through p1p1-5). It also shows how many interrupts were processed by each queue, and which CPU serviced the interrupt. In this case, there are 6 queues because by default, this particular NIC driver creates one queue per CPU, and this system has 6 CPUs. This is a fairly common pattern among NIC drivers.

  • To list the interrupt request queue for a PCI device with the address 0000:01:00.0:

    # ls -1 /sys/devices/*/*/0000:01:00.0/msi_irqs
    101
    102
    103
    104
    105
    106
    107
    108
    109

33.5. Receive Packet Steering

Receive Packet Steering (RPS) is similar to Receive-Side Scaling (RSS) in that it is used to direct packets to specific CPUs for processing. However, RPS is implemented at the software level, and helps to prevent the hardware queue of a single network interface card from becoming a bottleneck in network traffic.

RPS has several advantages over hardware-based RSS:

  • RPS can be used with any network interface card.
  • It is easy to add software filters to RPS to deal with new protocols.
  • RPS does not increase the hardware interrupt rate of the network device. However, it does introduce inter-processor interrupts.

RPS is configured per network device and receive queue, in the /sys/class/net/device/queues/rx-queue/rps_cpus file, where device is the name of the network device, such as enp1s0 and rx-queue is the name of the appropriate receive queue, such as rx-0.

The default value of the rps_cpus file is 0. This disables RPS, and the CPU handles the network interrupt and also processes the packet. To enable RPS, configure the appropriate rps_cpus file with the CPUs that should process packets from the specified network device and receive queue.

The rps_cpus files use comma-delimited CPU bitmaps. Therefore, to allow a CPU to handle interrupts for the receive queue on an interface, set the value of their positions in the bitmap to 1. For example, to handle interrupts with CPUs 0, 1, 2, and 3, set the value of the rps_cpus to f, which is the hexadecimal value for 15. In binary representation, 15 is 00001111 (1+2+4+8).

For network devices with single transmit queues, best performance can be achieved by configuring RPS to use CPUs in the same memory domain. On non-NUMA systems, this means that all available CPUs can be used. If the network interrupt rate is extremely high, excluding the CPU that handles network interrupts may also improve performance.

For network devices with multiple queues, there is typically no benefit to configure both RPS and RSS, as RSS is configured to map a CPU to each receive queue by default. However, RPS can still be beneficial if there are fewer hardware queues than CPUs, and RPS is configured to use CPUs in the same memory domain.

33.6. Receive Flow Steering

Receive Flow Steering (RFS) extends Receive Packet Steering (RPS) behavior to increase the CPU cache hit rate and thereby reduce network latency. Where RPS forwards packets based solely on queue length, RFS uses the RPS back end to calculate the most appropriate CPU, then forwards packets based on the location of the application consuming the packet. This increases CPU cache efficiency.

Data received from a single sender is not sent to more than one CPU. If the amount of data received from a single sender is greater than a single CPU can handle, configure a larger frame size to reduce the number of interrupts and therefore the amount of processing work for the CPU. Alternatively, consider NIC offload options or faster CPUs.

Consider using numactl or taskset in conjunction with RFS to pin applications to specific cores, sockets, or NUMA nodes. This can help prevent packets from being processed out of order.

33.6.1. Enabling Receive Flow Steering

By default, Receive Flow Steering (RFS) is disabled. This procedure describes how to enable RFS.

Procedure

  1. Set the value of the net.core.rps_sock_flow_entries kernel value to the maximum expected number of concurrently active connections:

    # echo "net.core.rps_sock_flow_entries=32768" > /etc/sysctl.d/95-enable-rps.conf
    Note

    Red Hat recommends a value of 32768 for moderate server loads. All values entered are rounded up to the nearest power of 2 in practice.

  2. Persistently set the value of the net.core.rps_sock_flow_entries:

    # sysctl -p /etc/sysctl.d/95-enable-rps.conf
  3. To temporarily set the value of the sys/class/net/device/queues/rx-queue/rps_flow_cnt file to the value of the (rps_sock_flow_entries/N), where N is the number of receive queues on a device:

    # echo 2048 > /sys/class/net/device/queues/rx-queue/rps_flow_cnt

    Replace device with the name of the network device you wish to configure (for example, enp1s0), and rx-queue with the receive queue you wish to configure (for example, rx-0).

    Replace N with the number of configured receive queues. For example, if the rps_flow_entries is set to 32768 and there are 16 configured receive queues, the rps_flow_cnt = 32786/16= 2048 (that is, rps_flow_cnt = rps_flow_enties/N ).

    For single-queue devices, the value of rps_flow_cnt is the same as the value of rps_sock_flow_entries.

  4. Persistently enable RFS on all network devices, create the /etc/udev/rules.d/99-persistent-net.rules file, and add the following content:

    SUBSYSTEM=="net", ACTION=="add", RUN{program}+="/bin/bash -c 'for x in /sys/$DEVPATH/queues/rx-*; do echo 2048 > $x/rps_flow_cnt;  done'"
  5. Optional: To enable RPS on a specific network device:

    SUBSYSTEM=="net", ACTION=="move", NAME="device name" RUN{program}+="/bin/bash -c 'for x in /sys/$DEVPATH/queues/rx-*; do echo 2048 > $x/rps_flow_cnt; done'"

    Replace device name with the actual network device name.

Verification steps

  • Verify if RFS is enabled:

    # cat /proc/sys/net/core/rps_sock_flow_entries
    32768
    
    # cat /sys/class/net/device/queues/rx-queue/rps_flow_cnt
    2048

Additional resources

  • sysctl(8) man page

33.7. Accelerated RFS

Accelerated RFS boosts the speed of Receive Flow Steering (RFS) by adding hardware assistance. Like RFS, packets are forwarded based on the location of the application consuming the packet.

Unlike traditional RFS, however, packets are sent directly to a CPU that is local to the thread consuming the data:

  • either the CPU that is executing the application
  • or a CPU local to that CPU in the cache hierarchy

Accelerated RFS is only available if the following conditions are met:

  • NIC must support the accelerated RFS. Accelerated RFS is supported by cards that export the ndo_rx_flow_steer() net_device function. Check the NIC’s data sheet to ensure if this feature is supported.
  • ntuple filtering must be enabled. For information on how to enable these filters, see Enabling the ntuple filters.

Once these conditions are met, CPU to queue mapping is deduced automatically based on traditional RFS configuration. That is, CPU to queue mapping is deduced based on the IRQ affinities configured by the driver for each receive queue. For more information on enabling the traditional RFS, see Enabling Receive Flow Steering.

33.7.1. Enabling the ntuple filters

The ntuple filtering must be enabled. Use the ethtool -k command to enable the ntuple filters.

Procedure

  1. Display the current status of the ntuple filter:

    # ethtool -k enp1s0 | grep ntuple-filters
    
    ntuple-filters: off
  2. Enable the ntuple filters:

    # ethtool -k enp1s0 ntuple on
Note

If the output is ntuple-filters: off [fixed], then the ntuple filtering is disabled and you cannot configure it:

# ethtool -k enp1s0 | grep ntuple-filters
ntuple-filters: off [fixed]

Verification steps

  • Ensure if ntuple filters are enabled:

    # ethtool -k enp1s0 | grep ntuple-filters
    ntuple-filters: on

Additional resources

  • ethtool(8) man page

Chapter 34. Configuring an operating system to optimize memory access

This section describes how to configure the operating system to optimize memory access across workloads, and the tools you can use to do so.

34.1. Tools for monitoring and diagnosing system memory issues

The following tools are available in Red Hat Enterprise Linux 8 for monitoring system performance and diagnosing performance problems related to system memory:

  • vmstat tool, provided by the procps-ng package, displays reports of a system’s processes, memory, paging, block I/O, traps, disks, and CPU activity. It provides an instantaneous report of the average of these events since the machine was last turned on, or since the previous report.
  • valgrind framework provides instrumentation to user-space binaries. Install this tool, using the yum install valgrind command. It includes a number of tools, that you can use to profile and analyze program performance, such as:

    • memcheck option is the default valgrind tool. It detects and reports on a number of memory errors that can be difficult to detect and diagnose, such as:

      • Memory access that should not occur
      • Undefined or uninitialized value use
      • Incorrectly freed heap memory
      • Pointer overlap
      • Memory leaks

        Note

        Memcheck can only report these errors, it cannot prevent them from occurring. However, memcheck logs an error message immediately before the error occurs.

    • cachegrind option simulates application interaction with a system’s cache hierarchy and branch predictor. It gathers statistics for the duration of application’s execution and outputs a summary to the console.
    • massif option measures the heap space used by a specified application. It measures both useful space and any additional space allocated for bookkeeping and alignment purposes.

Additional resources

  • vmstat(8) and valgrind(1) man pages
  • /usr/share/doc/valgrind-version/valgrind_manual.pdf file

34.2. Overview of a system’s memory

The Linux Kernel is designed to maximize the utilization of a system’s memory resources (RAM). Due to these design characteristics, and depending on the memory requirements of the workload, part of the system’s memory is in use within the kernel on behalf of the workload, while a small part of the memory is free. This free memory is reserved for special system allocations, and for other low or high priority system services.

The rest of the system’s memory is dedicated to the workload itself, and divided into the following two categories:

File memory

Pages added in this category represent parts of files in permanent storage. These pages, from the page cache, can be mapped or unmapped in an application’s address spaces. You can use applications to map files into their address space using the mmap system calls, or to operate on files via the buffered I/O read or write system calls.

Buffered I/O system calls, as well as applications that map pages directly, can re-utilize unmapped pages. As a result, these pages are stored in the cache by the kernel, especially when the system is not running any memory intensive tasks, to avoid re-issuing costly I/O operations over the same set of pages.

Anonymous memory
Pages in this category are in use by a dynamically allocated process, or are not related to files in permanent storage. This set of pages back up the in-memory control structures of each task, such as the application stack and heap areas.

Figure 34.1. Memory usage patterns

RHEL Memory Usage Patterns

34.3. Virtual memory parameters

The virtual memory parameters are listed in the /proc/sys/vm directory.

The following are the available virtual memory parameters:

vm.dirty_ratio
Is a percentage value. When this percentage of the total system memory is modified, the system begins writing the modifications to the disk with the pdflush operation. The default value is 20 percent.
vm.dirty_background_ratio
A percentage value. When this percentage of total system memory is modified, the system begins writing the modifications to the disk in the background. The default value is 10 percent.
vm.overcommit_memory

Defines the conditions that determine whether a large memory request is accepted or denied.The default value is 0.

By default, the kernel performs heuristic memory overcommit handling by estimating the amount of memory available and failing requests that are too large. However, since memory is allocated using a heuristic rather than a precise algorithm, overloading memory is possible with this setting.

Setting the overcommit_memory parameter’s value:

  • When this parameter is set to 1, the kernel performs no memory overcommit handling. This increases the possibility of memory overload, but improves performance for memory-intensive tasks.
  • When this parameter is set to 2, the kernel denies requests for memory equal to or larger than the sum of the total available swap space and the percentage of physical RAM specified in the overcommit_ratio. This reduces the risk of overcommitting memory, but is recommended only for systems with swap areas larger than their physical memory.
vm.overcommit_ratio
Specifies the percentage of physical RAM considered when overcommit_memory is set to 2. The default value is 50.
vm.max_map_count
Defines the maximum number of memory map areas that a process can use. The default value is 65530. Increase this value if your application needs more memory map areas.
vm.min_free_kbytes

Sets the size of the reserved free pages pool. It is also responsible for setting the min_page, low_page, and high_page thresholds that govern the behavior of the Linux kernel’s page reclaim algorithms. It also specifies the minimum number of kilobytes to keep free across the system. This calculates a specific value for each low memory zone, each of which is assigned a number of reserved free pages in proportion to their size.

Setting the vm.min_free_kbytes parameter’s value:

  • Increasing the parameter value effectively reduces the application working set usable memory. Therefore, you might want to use it for only kernel-driven workloads, where driver buffers need to be allocated in atomic contexts.
  • Decreasing the parameter value might render the kernel unable to service system requests, if memory becomes heavily contended in the system.

    Warning

    Extreme values can be detrimental to the system’s performance. Setting the vm.min_free_kbytes to an extremely low value prevents the system from reclaiming memory effectively, which can result in system crashes and failure to service interrupts or other kernel services. However, setting vm.min_free_kbytes too high considerably increases system reclaim activity, causing allocation latency due to a false direct reclaim state. This might cause the system to enter an out-of-memory state immediately.

    The vm.min_free_kbytes parameter also sets a page reclaim watermark, called min_pages. This watermark is used as a factor when determining the two other memory watermarks, low_pages, and high_pages, that govern page reclaim algorithms.

/proc/PID/oom_adj

In the event that a system runs out of memory, and the panic_on_oom parameter is set to 0, the oom_killer function kills processes, starting with the process that has the highest oom_score, until the system recovers.

The oom_adj parameter determines the oom_score of a process. This parameter is set per process identifier. A value of -17 disables the oom_killer for that process. Other valid values range from -16 to 15.

Note

Processes created by an adjusted process inherit the oom_score of that process.

vm.swappiness

The swappiness value, ranging from 0 to 100, controls the degree to which the system favors reclaiming memory from the anonymous memory pool, or the page cache memory pool.

Setting the swappiness parameter’s value:

  • Higher values favor file-mapped driven workloads while swapping out the less actively accessed processes’ anonymous mapped memory of RAM. This is useful for file-servers or streaming applications that depend on data, from files in the storage, to reside on memory to reduce I/O latency for the service requests.
  • Low values favor anonymous-mapped driven workloads while reclaiming the page cache (file mapped memory). This setting is useful for applications that do not depend heavily on the file system information, and heavily utilize dynamically allocated and private memory, such as mathematical and number crunching applications, and few hardware virtualization supervisors like QEMU.

    The default value of the vm.swappiness parameter is 30.

    Warning

    Setting the vm.swappiness to 0 aggressively avoids swapping anonymous memory out to a disk, this increases the risk of processes being killed by the oom_killer function when under memory or I/O intensive workloads.

Additional resources

34.4. File system parameters

The file system parameters are listed in the /proc/sys/fs directory. The following are the available file system parameters:

aio-max-nr
Defines the maximum allowed number of events in all active asynchronous input/output contexts. The default value is 65536, and modifying this value does not pre-allocate or resize any kernel data structures.
file-max

Determines the maximum number of file handles for the entire system. The default value on Red Hat Enterprise Linux 8 is either 8192 or one tenth of the free memory pages available at the time the kernel starts, whichever is higher.

Raising this value can resolve errors caused by a lack of available file handles.

Additional resources

  • sysctl(8) man page

34.5. Kernel parameters

The default values for the kernel parameters are located in the /proc/sys/kernel/ directory. These values are calculated by the kernel at boot time depending on the available system resources.

The following are the available kernel parameters used to set up limits for the msg* and shm* System V IPC (sysvipc) system calls:

msgmax
Defines the maximum allowed size in bytes of any single message in a message queue. This value must not exceed the size of the queue (msgmnb). Use the sysctl msgmax command to determine the current msgmax value on your system.
msgmnb
Defines the maximum size in bytes of a single message queue. Use the sysctl msgmnb command to determine the current msgmnb value on your system.
msgmni
Defines the maximum number of message queue identifiers, and therefore the maximum number of queues. Use the sysctl msgmni command to determine the current msgmni value on your system.
shmall
Defines the total amount of shared memory pages that can be used on the system at one time. For example, a page is 4096 bytes on the AMD64 and Intel 64 architecture. Use the sysctl shmall command to determine the current shmall value on your system.
shmmax
Defines the maximum size in bytes of a single shared memory segment allowed by the kernel. Use the sysctl shmmax command to determine the current shmmax value on your system.
shmmni
Defines the system-wide maximum number of shared memory segments. The default value is 4096 on all systems.

Additional resources

  • sysvipc(7) and sysctl(8) man pages

Chapter 35. Configuring huge pages

Physical memory is managed in fixed-size chunks called pages. On the x86_64 architecture, supported by Red Hat Enterprise Linux 8, the default size of a memory page is 4 KB. This default page size has proved to be suitable for general-purpose operating systems, such as Red Hat Enterprise Linux, which supports many different kinds of workloads.

However, specific applications can benefit from using larger page sizes in certain cases. For example, an application that works with a large and relatively fixed data set of hundreds of megabytes or even dozens of gigabytes can have performance issues when using 4 KB pages. Such data sets can require a huge amount of 4 KB pages, which can lead to overhead in the operating system and the CPU.

This section provides information about huge pages available in Red Hat Enterprise Linux 8 and how you can configure them.

35.1. Available huge page features

With Red Hat Enterprise Linux 8, you can use huge pages for applications that work with big data sets, and improve the performance of such applications.

The following are the huge page methods, which are supported in Red Hat Enterprise Linux 8:

HugeTLB pages

HugeTLB pages are also called static huge pages. There are two ways of reserving HugeTLB pages:

  • At boot time: It increases the possibility of success because the memory has not yet been significantly fragmented. However, on NUMA machines, the number of pages is automatically split among the NUMA nodes. For more information on parameters that influence HugeTLB page behavior at boot time, see Parameters for reserving HugeTLB pages at boot time and how to use these parameters to configure HugeTLB pages at boot time, see Configuring HugeTLB at boot time.
  • At run time: It allows you to reserve the huge pages per NUMA node. If the run-time reservation is done as early as possible in the boot process, the probability of memory fragmentation is lower. For more information on parameters that influence HugeTLB page behavior at boot time, see Parameters for reserving HugeTLB pages at run time and how to use these parameters to configure HugeTLB pages at boot time, see Configuring HugeTLB at run time.
Transparent HugePages (THP)

With THP, the kernel automatically assigns huge pages to processes, and therefore there is no need to manually reserve the static huge pages. The following are the two modes of operation in THP:

  • system-wide: Here, the kernel tries to assign huge pages to a process whenever it is possible to allocate the huge pages and the process is using a large contiguous virtual memory area.
  • per-process: Here, the kernel only assigns huge pages to the memory areas of individual processes which you can specify using the madvise() system call.

    Note

    The THP feature only supports 2 MB pages.

    For information on enabling and disabling THP, see For more information on parameters that influence HugeTLB page behavior at boot time, see Enabling transparent hugepages and Disabling transparent hugepages.

35.2. Parameters for reserving HugeTLB pages at boot time

Use the following parameters to influence HugeTLB page behavior at boot time.

Table 35.1. Parameters used to configure HugeTLB pages at boot time

ParameterDescriptionDefault value

hugepages

Defines the number of persistent huge pages configured in the kernel at boot time.

In a NUMA system, huge pages, that have this parameter defined, are divided equally between nodes.

You can assign huge pages to specific nodes at runtime by changing the value of the nodes in the /sys/devices/system/node/node_id/hugepages/hugepages-size/nr_hugepages file.

The default value is 0.

To update this value at boot, change the value of this parameter in the /proc/sys/vm/nr_hugepages file.

hugepagesz

Defines the size of persistent huge pages configured in the kernel at boot time.

Valid values are 2 MB and 1 GB. The default value is 2 MB.

default_hugepagesz

Defines the default size of persistent huge pages configured in the kernel at boot time.

Valid values are 2 MB and 1 GB. The default value is 2 MB.

35.3. Configuring HugeTLB at boot time

The page size, which the HugeTLB subsystem supports, depends on the architecture. The x86_64 architecture supports 2 MB huge pages and 1 GB gigantic pages.

This procedure describes how to reserve a 1 GB page at boot time.

Procedure

  1. Create a HugeTLB pool for 1 GB pages by appending the following line to the kernel command-line options in the /etc/default/grub file as root:

    default_hugepagesz=1G hugepagesz=1G
  2. Regenerate the GRUB2 configuration using the edited default file:

    1. If your system uses BIOS firmware, execute the following command:

      # grub2-mkconfig -o /boot/grub2/grub.cfg
    2. If your system uses UEFI framework, execute the following command:

      # grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
  3. Create a new file called hugetlb-gigantic-pages.service in the /usr/lib/systemd/system/ directory and add the following content:

    [Unit]
    Description=HugeTLB Gigantic Pages Reservation
    DefaultDependencies=no
    Before=dev-hugepages.mount
    ConditionPathExists=/sys/devices/system/node
    ConditionKernelCommandLine=hugepagesz=1G
    
    [Service]
    Type=oneshot
    RemainAfterExit=yes
    ExecStart=/usr/lib/systemd/hugetlb-reserve-pages.sh
    
    [Install]
    WantedBy=sysinit.target
  4. Create a new file called hugetlb-reserve-pages.sh in the /usr/lib/systemd/ directory and add the following content:

    While adding the following content, replace number_of_pages with the number of 1GB pages you want to reserve, and node with the name of the node on which to reserve these pages.

    #!/bin/sh
    
    nodes_path=/sys/devices/system/node/
    if [ ! -d $nodes_path ]; then
        echo "ERROR: $nodes_path does not exist"
        exit 1
    fi
    
    reserve_pages()
    {
        echo $1 > $nodes_path/$2/hugepages/hugepages-1048576kB/nr_hugepages
    }
    
    reserve_pages number_of_pages node

    For example, to reserve two 1 GB pages on node0 and one 1GB page on node1, replace the number_of_pages with 2 for node0 and 1 for node1:

    reserve_pages 2 node0
    reserve_pages 1 node1
  5. Create an executable script:

    # chmod +x /usr/lib/systemd/hugetlb-reserve-pages.sh
  6. Enable early boot reservation:

    # systemctl enable hugetlb-gigantic-pages
Note
  • You can try reserving more 1GB pages at runtime by writing to nr_hugepages at any time. However, such reservations can fail due to memory fragmentation. The most reliable way to reserve 1 GB pages is by using this hugetlb-reserve-pages.sh script, which runs early during boot.
  • Reserving static huge pages can effectively reduce the amount of memory available to the system, and prevents it from properly utilizing its full memory capacity. Although a properly sized pool of reserved huge pages can be beneficial to applications that utilize it, an oversized or unused pool of reserved huge pages will eventually be detrimental to overall system performance. When setting a reserved huge page pool, ensure that the system can properly utilize its full memory capacity.

Additional resources

  • systemd.service(5) man page
  • /usr/share/doc/kernel-doc-kernel_version/Documentation/vm/hugetlbpage.txt file

35.4. Parameters for reserving HugeTLB pages at run time

Use the following parameters to influence HugeTLB page behavior at run time.

Table 35.2. Parameters used to configure HugeTLB pages at run time

ParameterDescriptionFile name

nr_hugepages

Defines the number of huge pages of a specified size assigned to a specified NUMA node.

/sys/devices/system/node/node_id/hugepages/hugepages-size/nr_hugepages

nr_overcommit_hugepages

Defines the maximum number of additional huge pages that can be created and used by the system through overcommitting memory.

Writing any non-zero value into this file indicates that the system obtains that number of huge pages from the kernel’s normal page pool if the persistent huge page pool is exhausted. As these surplus huge pages become unused, they are then freed and returned to the kernel’s normal page pool.

/proc/sys/vm/nr_overcommit_hugepages

35.5. Configuring HugeTLB at run time

This procedure describes how to add 20 2048 kB huge pages to node2.

To reserve pages based on your requirements, replace:

  • 20 with the number of huge pages you wish to reserve,
  • 2048kB with the size of the huge pages,
  • node2 with the node on which you wish to reserve the pages.

Procedure

  1. Display the memory statistics:

    # numastat -cm | egrep 'Node|Huge'
                     Node 0 Node 1 Node 2 Node 3  Total add
    AnonHugePages         0      2      0      8     10
    HugePages_Total       0      0      0      0      0
    HugePages_Free        0      0      0      0      0
    HugePages_Surp        0      0      0      0      0
  2. Add the number of huge pages of a specified size to the node:

    # echo 20 > /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

Verification steps

  • Ensure that the number of huge pages are added:

    # numastat -cm | egrep 'Node|Huge'
                     Node 0 Node 1 Node 2 Node 3  Total
    AnonHugePages         0      2      0      8     10
    HugePages_Total       0      0     40      0     40
    HugePages_Free        0      0     40      0     40
    HugePages_Surp        0      0      0      0      0

Additional resources

  • numastat(8) man page

35.6. Enabling transparent hugepages

THP is enabled by default in Red Hat Enterprise Linux 8. However, you can enable or disable THP.

This procedure describes how to enable THP.

Procedure

  1. Check the current status of THP:

    # cat /sys/kernel/mm/transparent_hugepage/enabled
  2. Enable THP:

    # echo always > /sys/kernel/mm/transparent_hugepage/enabled
  3. To prevent applications from allocating more memory resources than necessary, disable the system-wide transparent huge pages and only enable them for the applications that explicitly request it through the madvise:

    # echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
Note

Sometimes, providing low latency to short-lived allocations has higher priority than immediately achieving the best performance with long-lived allocations. In such cases, you can disable direct compaction while leaving THP enabled.

Direct compaction is a synchronous memory compaction during the huge page allocation. Disabling direct compaction provides no guarantee of saving memory, but can decrease the risk of higher latencies during frequent page faults. Note that if the workload benefits significantly from THP, the performance decreases. Disable direct compaction:

# echo madvise > /sys/kernel/mm/transparent_hugepage/defrag

Additional resources

35.7. Disabling transparent hugepages

THP is enabled by default in Red Hat Enterprise Linux 8. However, you can enable or disable THP.

This procedure describes how to disable THP.

Procedure

  1. Check the current status of THP:

    # cat /sys/kernel/mm/transparent_hugepage/enabled
  2. Disable THP:

    # echo never > /sys/kernel/mm/transparent_hugepage/enabled

35.8. Impact of page size on translation lookaside buffer size

Reading address mappings from the page table is time-consuming and resource-expensive, so CPUs are built with a cache for recently-used addresses, called the Translation Lookaside Buffer (TLB). However, the default TLB can only cache a certain number of address mappings.

If a requested address mapping is not in the TLB, called a TLB miss, the system still needs to read the page table to determine the physical to virtual address mapping. Because of the relationship between application memory requirements and the size of pages used to cache address mappings, applications with large memory requirements are more likely to suffer performance degradation from TLB misses than applications with minimal memory requirements. It is therefore important to avoid TLB misses wherever possible.

Both HugeTLB and Transparent Huge Page features allow applications to use pages larger than 4 KB. This allows addresses stored in the TLB to reference more memory, which reduces TLB misses and improves application performance.

Chapter 36. Getting started with SystemTap

As a system administrator, you can use SystemTap to identify underlying causes of a bug or performance problem on a running Linux system.

As an application developer, you can use SystemTap to monitor in fine detail how your application behaves within the Linux system.

36.1. The purpose of SystemTap

SystemTap is a tracing and probing tool that you can use to study and monitor the activities of your operating system (particularly, the kernel) in fine detail. SystemTap provides information similar to the output of tools such as netstat, ps, top, and iostat. However, SystemTap provides more filtering and analysis options for collected information. In SystemTap scripts, you specify the information that SystemTap gathers.

SystemTap aims to supplement the existing suite of Linux monitoring tools by providing users with the infrastructure to track kernel activity and combining this capability with two attributes:

Flexibility
the SystemTap framework enables you to develop simple scripts for investigating and monitoring a wide variety of kernel functions, system calls, and other events that occur in kernel space. With this, SystemTap is not so much a tool as it is a system that allows you to develop your own kernel-specific forensic and monitoring tools.
Ease-of-Use
SystemTap enables you to monitor kernel activity without having to recompile the kernel or reboot the system.

36.2. Installing SystemTap

To begin using SystemTap, install the required packages. To use SystemTap on more than one kernel where a system has multiple kernels installed, install the corresponding required kernel packages for each kernel version.

Prerequisites

Procedure

  1. Install the required SystemTap packages:

    # yum install systemtap
  2. Install the required kernel packages:

    1. Using stap-prep:

      # stap-prep
    2. If stap-prep does not work, install the required kernel packages manually:

      # yum install kernel-debuginfo-$(uname -r) kernel-debuginfo-common-$(uname -i)-$(uname -r) kernel-devel-$(uname -r)

      $(uname -i) is automatically replaced with the hardware platform of your system and $(uname -r) is automatically replaced with the version of your running kernel.

Verification steps

  • If the kernel to be probed with SystemTap is currently in use, test if your installation was successful:

    # stap -v -e 'probe kernel.function("vfs_read") {printf("read performed\n"); exit()}'

    A successful SystemTap deployment results in an output similar to the following:

    Pass 1: parsed user script and 45 library script(s) in 340usr/0sys/358real ms.
    Pass 2: analyzed script: 1 probe(s), 1 function(s), 0 embed(s), 0 global(s) in 290usr/260sys/568real ms.
    Pass 3: translated to C into "/tmp/stapiArgLX/stap_e5886fa50499994e6a87aacdc43cd392_399.c" in 490usr/430sys/938real ms.
    Pass 4: compiled C into "stap_e5886fa50499994e6a87aacdc43cd392_399.ko" in 3310usr/430sys/3714real ms.
    Pass 5: starting run. 1
    read performed 2
    Pass 5: run completed in 10usr/40sys/73real ms. 3

    The last three lines of output (beginning with Pass 5) indicate that:

    1
    SystemTap successfully created the instrumentation to probe the kernel and ran the instrumentation.
    2
    SystemTap detected the specified event (in this case, A VFS read).
    3
    SystemTap executed a valid handler (printed text and then closed it with no errors).

36.3. Privileges to run SystemTap

Running SystemTap scripts requires elevated system privileges but, in some instances, non-privileged users might need to run SystemTap instrumentation on their machine.

To allow users to run SystemTap without root access, add users to both of these user groups:

stapdev

Members of this group can use stap to run SystemTap scripts, or staprun to run SystemTap instrumentation modules.

Running stap involves compiling SystemTap scripts into kernel modules and loading them into the kernel. This requires elevated privileges to the system, which are granted to stapdev members. Unfortunately, such privileges also grant effective root access to stapdev members. As such, only grant stapdev group membership to users who can be trusted with root access.

stapusr
Members of this group can only use staprun to run SystemTap instrumentation modules. In addition, they can only run those modules from the /lib/modules/kernel_version/systemtap/ directory. This directory must be owned only by the root user, and must only be writable by the root user.

36.4. Running SystemTap scripts

You can run SystemTap scripts from standard input or from a file.

Sample scripts that are distributed with the installation of SystemTap can be found in the /usr/share/systemtap/examples directory.

Prerequisites

  1. SystemTap and the associated required kernel packages are installed as described in Installing Systemtap.
  2. To run SystemTap scripts as a normal user, add the user to the SystemTap groups:

    # usermod --append --groups
    stapdev,stapusr user-name

Procedure

  • Run the SystemTap script:

    • From standard input:

      # echo "probe timer.s(1) {exit()}" | stap -

      This command instructs stap to run the script passed by echo to standard input. To add stap options, insert them before the - character. For example, to make the results from this command more verbose, the command is:

      # echo "probe timer.s(1) {exit()}" | stap -v -
    • From a file:

      # stap file_name

Chapter 37. Cross-instrumentation of SystemTap

Cross-instrumentation of SystemTap is creating SystemTap instrumentation modules from a SystemTap script on one system to be used on another system that does not have SystemTap fully deployed.

37.1. SystemTap cross-instrumentation

When you run a SystemTap script, a kernel module is built out of that script. SystemTap then loads the module into the kernel.

Normally, SystemTap scripts can run only on systems where SystemTap is deployed. To run SystemTap on ten systems, SystemTap needs to be deployed on all those systems. In some cases, this might be neither feasible nor desired. For example, corporate policy might prohibit you from installing packages that provide compilers or debug information on specific machines, which will prevent the deployment of SystemTap.

To work around this, use cross-instrumentation. Cross-instrumentation is the process of generating SystemTap instrumentation modules from a SystemTap script on one system to be used on another system. This process offers the following benefits:

  • The kernel information packages for various machines can be installed on a single host machine.
Caution

Kernel packaging bugs may prevent this. In such cases, the kernel-debuginfo and kernel-devel packages for the host system and target system must match. If this occurs, report the bug at https://bugzilla.redhat.com/.

  • Each target machine needs only one package to be installed to use the generated SystemTap instrumentation module: systemtap-runtime.
Important

The host system must be the same architecture and running the same distribution of Linux as the target system in order for the built instrumentation module to work.

Terminology
instrumentation module
The kernel module built from a SystemTap script; the SystemTap module is built on the host system, and will be loaded on the target kernel of the target system.
host system
The system on which the instrumentation modules (from SystemTap scripts) are compiled, to be loaded on target systems.
target system
The system in which the instrumentation module is being built (from SystemTap scripts).
target kernel
The kernel of the target system. This is the kernel that loads and runs the instrumentation module.

37.2. Initializing cross-instrumentation of SystemTap

Initialize cross-instrumentation of SystemTap to build SystemTap instrumentation modules from a SystemTap script on one system and use them on another system that does not have SystemTap fully deployed.

Prerequisites

  • SystemTap is installed on the host system as described in Installing Systemtap.
  • The systemtap-runtime package is installed on each target system:

    # yum install systemtap-runtime
  • Both the host system and target system are the same architecture.
  • Both the host system and target system are running the same major version of Red Hat Enterprise Linux (such as Red Hat Enterprise Linux 8), they can be running different minor versions (such as 8.1 and 8.2).
Caution

Kernel packaging bugs may prevent multiple kernel-debuginfo and kernel-devel packages from being installed on one system. In such cases, the minor version for the host system and target system must match. If this occurs, report the bug at https://bugzilla.redhat.com/.

Procedure

  1. Determine the kernel running on each target system:

    $ uname -r

    Repeat this step for each target system.

  2. On the host system, install the target kernel and related packages for each target system by the method described in Installing Systemtap.
  3. Build an instrumentation module on the host system, copy this module to and run this module on on the target system either:

    1. Using remote implementation:

      # stap --remote target_system script

      This command remotely implements the specified script on the target system. You must ensure an SSH connection can be made to the target system from the host system for this to be successful.

    2. Manually:

      1. Build the instrumentation module on the host system:

        # stap -r kernel_version script -m module_name -p 4

        Here, kernel_version refers to the version of the target kernel determined in step 1, script refers to the script to be converted into an instrumentation module, and module_name is the desired name of the instrumentation module. The -p4 option tells SystemTap to not load and run the compiled module.

      2. Once the instrumentation module is compiled, copy it to the target system and load it using the following command:

        # staprun module_name.ko

Chapter 38. Monitoring network activity with SystemTap

You can use helpful example SystemTap scripts available in the /usr/share/systemtap/testsuite/systemtap.examples/ directory, upon installing the systemtap-testsuite package, to monitor and investigate the network activity of your system.

38.1. Profiling network activity with SystemTap

You can use the nettop.stp example SystemTap script to profile network activity. The script tracks which processes are generating network traffic on the system, and provides the following information about each process:

PID
The ID of the listed process.
UID
User ID. A user ID of 0 refers to the root user.
DEV
Which ethernet device the process used to send or receive data (for example, eth0, eth1).
XMIT_PK
The number of packets transmitted by the process.
RECV_PK
The number of packets received by the process.
XMIT_KB
The amount of data sent by the process, in kilobytes.
RECV_KB
The amount of data received by the service, in kilobytes.

Prerequisites

Procedure

  • Run the nettop.stp script:

    # stap  --example nettop.stp

    The nettop.stp script provides network profile sampling every 5 seconds.

    Output of the nettop.stp script looks similar to the following:

    [...]
      PID   UID DEV     XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
        0     0 eth0          0       5       0       0 swapper
    11178     0 eth0          2       0       0       0 synergyc
      PID   UID DEV     XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
     2886     4 eth0         79       0       5       0 cups-polld
    11362     0 eth0          0      61       0       5 firefox
        0     0 eth0          3      32       0       3 swapper
     2886     4 lo            4       4       0       0 cups-polld
    11178     0 eth0          3       0       0       0 synergyc
      PID   UID DEV     XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
        0     0 eth0          0       6       0       0 swapper
     2886     4 lo            2       2       0       0 cups-polld
    11178     0 eth0          3       0       0       0 synergyc
     3611     0 eth0          0       1       0       0 Xorg
      PID   UID DEV     XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
        0     0 eth0          3      42       0       2 swapper
    11178     0 eth0         43       1       3       0 synergyc
    11362     0 eth0          0       7       0       0 firefox
     3897     0 eth0          0       1       0       0 multiload-apple

38.2. Tracing functions called in network socket code with SystemTap

You can use the socket-trace.stp example SystemTap script to trace functions called from the kernel’s net/socket.c file. This helps you identify, in finer detail, how each process interacts with the network at the kernel level.

Prerequisites

Procedure

  • Run the socket-trace.stp script:

    # stap  --example socket-trace.stp

    A 3-second excerpt of the output of the socket-trace.stp script looks similar to the following:

    [...]
    0 Xorg(3611): -> sock_poll
    3 Xorg(3611): <- sock_poll
    0 Xorg(3611): -> sock_poll
    3 Xorg(3611): <- sock_poll
    0 gnome-terminal(11106): -> sock_poll
    5 gnome-terminal(11106): <- sock_poll
    0 scim-bridge(3883): -> sock_poll
    3 scim-bridge(3883): <- sock_poll
    0 scim-bridge(3883): -> sys_socketcall
    4 scim-bridge(3883):  -> sys_recv
    8 scim-bridge(3883):   -> sys_recvfrom
    12 scim-bridge(3883):-> sock_from_file
    16 scim-bridge(3883):<- sock_from_file
    20 scim-bridge(3883):-> sock_recvmsg
    24 scim-bridge(3883):<- sock_recvmsg
    28 scim-bridge(3883):   <- sys_recvfrom
    31 scim-bridge(3883):  <- sys_recv
    35 scim-bridge(3883): <- sys_socketcall
    [...]

38.3. Monitoring network packet drops with SystemTap

The network stack in Linux can discard packets for various reasons. Some Linux kernels include a tracepoint, kernel.trace("kfree_skb")`, which tracks where packets are discarded.

The dropwatch.stp SystemTap script uses kernel.trace("kfree_skb") to trace packet discards; the script summarizes what locations discard packets in every 5-second interval.

Prerequisites

Procedure

  • Run the dropwatch.stp script:

    # stap  --example dropwatch.stp

    Running the dropwatch.stp script for 15 seconds results in output similar to the following:

    Monitoring for dropped packets
    51 packets dropped at location 0xffffffff8024cd0f
    2 packets dropped at location 0xffffffff8044b472
    51 packets dropped at location 0xffffffff8024cd0f
    1 packets dropped at location 0xffffffff8044b472
    97 packets dropped at location 0xffffffff8024cd0f
    1 packets dropped at location 0xffffffff8044b472
    Stopping dropped packet monitor
    Note

    To make the location of packet drops more meaningful, see the /boot/System.map-$(uname -r) file. This file lists the starting addresses for each function, enabling you to map the addresses in the output of the dropwatch.stp script to a specific function name. Given the following snippet of the /boot/System.map-$(uname -r) file, the address 0xffffffff8024cd0f maps to the function unix_stream_recvmsg and the address 0xffffffff8044b472 maps to the function arp_rcv:

    [...]
    ffffffff80