Scalability and performance
Scaling your OpenShift Container Platform cluster and tuning performance in production environments
Abstract
Chapter 1. Recommended performance and scalability practices
1.1. Recommended control plane practices
This topic provides recommended performance and scalability practices for control planes in OpenShift Container Platform.
1.1.1. Recommended practices for scaling the cluster
The guidance in this section is only relevant for installations with cloud provider integration.
Apply the following best practices to scale the number of worker machines in your OpenShift Container Platform cluster. You scale the worker machines by increasing or decreasing the number of replicas that are defined in the worker machine set.
When scaling up the cluster to higher node counts:
- Spread nodes across all of the available zones for higher availability.
- Scale up by no more than 25 to 50 machines at once.
- Consider creating new compute machine sets in each available zone with alternative instance types of similar size to help mitigate any periodic provider capacity constraints. For example, on AWS, use m5.large and m5d.large.
Cloud providers might implement a quota for API services. Therefore, gradually scale the cluster.
The controller might not be able to create the machines if the replicas in the compute machine sets are set to higher numbers all at one time. The number of requests the cloud platform, which OpenShift Container Platform is deployed on top of, is able to handle impacts the process. The controller will start to query more while trying to create, check, and update the machines with the status. The cloud platform on which OpenShift Container Platform is deployed has API request limits; excessive queries might lead to machine creation failures due to cloud platform limitations.
Enable machine health checks when scaling to large node counts. In case of failures, the health checks monitor the condition and automatically repair unhealthy machines.
When scaling large and dense clusters to lower node counts, it might take large amounts of time because the process involves draining or evicting the objects running on the nodes being terminated in parallel. Also, the client might start to throttle the requests if there are too many objects to evict. The default client queries per second (QPS) and burst rates are currently set to 5
and 10
respectively. These values cannot be modified in OpenShift Container Platform.
1.1.2. Control plane node sizing
The control plane node resource requirements depend on the number and type of nodes and objects in the cluster. The following control plane node size recommendations are based on the results of a control plane density focused testing, or Cluster-density. This test creates the following objects across a given number of namespaces:
- 1 image stream
- 1 build
-
5 deployments, with 2 pod replicas in a
sleep
state, mounting 4 secrets, 4 config maps, and 1 downward API volume each - 5 services, each one pointing to the TCP/8080 and TCP/8443 ports of one of the previous deployments
- 1 route pointing to the first of the previous services
- 10 secrets containing 2048 random string characters
- 10 config maps containing 2048 random string characters
Number of worker nodes | Cluster-density (namespaces) | CPU cores | Memory (GB) |
---|---|---|---|
24 | 500 | 4 | 16 |
120 | 1000 | 8 | 32 |
252 | 4000 | 16, but 24 if using the OVN-Kubernetes network plug-in | 64, but 128 if using the OVN-Kubernetes network plug-in |
501, but untested with the OVN-Kubernetes network plug-in | 4000 | 16 | 96 |
The data from the table above is based on an OpenShift Container Platform running on top of AWS, using r5.4xlarge instances as control-plane nodes and m5.2xlarge instances as worker nodes.
On a large and dense cluster with three control plane nodes, the CPU and memory usage will spike up when one of the nodes is stopped, rebooted, or fails. The failures can be due to unexpected issues with power, network, underlying infrastructure, or intentional cases where the cluster is restarted after shutting it down to save costs. The remaining two control plane nodes must handle the load in order to be highly available, which leads to increase in the resource usage. This is also expected during upgrades because the control plane nodes are cordoned, drained, and rebooted serially to apply the operating system updates, as well as the control plane Operators update. To avoid cascading failures, keep the overall CPU and memory resource usage on the control plane nodes to at most 60% of all available capacity to handle the resource usage spikes. Increase the CPU and memory on the control plane nodes accordingly to avoid potential downtime due to lack of resources.
The node sizing varies depending on the number of nodes and object counts in the cluster. It also depends on whether the objects are actively being created on the cluster. During object creation, the control plane is more active in terms of resource usage compared to when the objects are in the running
phase.
Operator Lifecycle Manager (OLM ) runs on the control plane nodes and its memory footprint depends on the number of namespaces and user installed operators that OLM needs to manage on the cluster. Control plane nodes need to be sized accordingly to avoid OOM kills. Following data points are based on the results from cluster maximums testing.
Number of namespaces | OLM memory at idle state (GB) | OLM memory with 5 user operators installed (GB) |
---|---|---|
500 | 0.823 | 1.7 |
1000 | 1.2 | 2.5 |
1500 | 1.7 | 3.2 |
2000 | 2 | 4.4 |
3000 | 2.7 | 5.6 |
4000 | 3.8 | 7.6 |
5000 | 4.2 | 9.02 |
6000 | 5.8 | 11.3 |
7000 | 6.6 | 12.9 |
8000 | 6.9 | 14.8 |
9000 | 8 | 17.7 |
10,000 | 9.9 | 21.6 |
You can modify the control plane node size in a running OpenShift Container Platform 4.12 cluster for the following configurations only:
- Clusters installed with a user-provisioned installation method.
- AWS clusters installed with an installer-provisioned infrastructure installation method.
- Clusters that use a control plane machine set to manage control plane machines.
For all other configurations, you must estimate your total node count and use the suggested control plane node size during installation.
The recommendations are based on the data points captured on OpenShift Container Platform clusters with OpenShift SDN as the network plugin.
In OpenShift Container Platform 4.12, half of a CPU core (500 millicore) is now reserved by the system by default compared to OpenShift Container Platform 3.11 and previous versions. The sizes are determined taking that into consideration.
1.1.2.1. Selecting a larger Amazon Web Services instance type for control plane machines
If the control plane machines in an Amazon Web Services (AWS) cluster require more resources, you can select a larger AWS instance type for the control plane machines to use.
The procedure for clusters that use a control plane machine set is different from the procedure for clusters that do not use a control plane machine set.
If you are uncertain about the state of the ControlPlaneMachineSet
CR in your cluster, you can verify the CR status.
1.1.2.1.1. Changing the Amazon Web Services instance type by using a control plane machine set
You can change the Amazon Web Services (AWS) instance type that your control plane machines use by updating the specification in the control plane machine set custom resource (CR).
Prerequisites
- Your AWS cluster uses a control plane machine set.
Procedure
Edit your control plane machine set CR by running the following command:
$ oc --namespace openshift-machine-api edit controlplanemachineset.machine.openshift.io cluster
Edit the following line under the
providerSpec
field:providerSpec: value: ... instanceType: <compatible_aws_instance_type> 1
- 1
- Specify a larger AWS instance type with the same base as the previous selection. For example, you can change
m6i.xlarge
tom6i.2xlarge
orm6i.4xlarge
.
Save your changes.
-
For clusters that use the default
RollingUpdate
update strategy, the Operator automatically propagates the changes to your control plane configuration. -
For clusters that are configured to use the
OnDelete
update strategy, you must replace your control plane machines manually.
-
For clusters that use the default
Additional resources
1.1.2.1.2. Changing the Amazon Web Services instance type by using the AWS console
You can change the Amazon Web Services (AWS) instance type that your control plane machines use by updating the instance type in the AWS console.
Prerequisites
- You have access to the AWS console with the permissions required to modify the EC2 Instance for your cluster.
-
You have access to the OpenShift Container Platform cluster as a user with the
cluster-admin
role.
Procedure
- Open the AWS console and fetch the instances for the control plane machines.
Choose one control plane machine instance.
- For the selected control plane machine, back up the etcd data by creating an etcd snapshot. For more information, see "Backing up etcd".
- In the AWS console, stop the control plane machine instance.
- Select the stopped instance, and click Actions → Instance Settings → Change instance type.
-
Change the instance to a larger type, ensuring that the type is the same base as the previous selection, and apply changes. For example, you can change
m6i.xlarge
tom6i.2xlarge
orm6i.4xlarge
. - Start the instance.
-
If your OpenShift Container Platform cluster has a corresponding
Machine
object for the instance, update the instance type of the object to match the instance type set in the AWS console.
- Repeat this process for each control plane machine.
Additional resources
1.2. Recommended infrastructure practices
This topic provides recommended performance and scalability practices for infrastructure in OpenShift Container Platform.
1.2.1. Infrastructure node sizing
Infrastructure nodes are nodes that are labeled to run pieces of the OpenShift Container Platform environment. The infrastructure node resource requirements depend on the cluster age, nodes, and objects in the cluster, as these factors can lead to an increase in the number of metrics or time series in Prometheus. The following infrastructure node size recommendations are based on the results observed in cluster-density testing detailed in the Control plane node sizing section, where the monitoring stack and the default ingress-controller were moved to these nodes.
Number of worker nodes | Cluster density, or number of namespaces | CPU cores | Memory (GB) |
---|---|---|---|
27 | 500 | 4 | 24 |
120 | 1000 | 8 | 48 |
252 | 4000 | 16 | 128 |
501 | 4000 | 32 | 128 |
In general, three infrastructure nodes are recommended per cluster.
These sizing recommendations should be used as a guideline. Prometheus is a highly memory intensive application; the resource usage depends on various factors including the number of nodes, objects, the Prometheus metrics scraping interval, metrics or time series, and the age of the cluster. In addition, the router resource usage can also be affected by the number of routes and the amount/type of inbound requests.
These recommendations apply only to infrastructure nodes hosting Monitoring, Ingress and Registry infrastructure components installed during cluster creation.
In OpenShift Container Platform 4.12, half of a CPU core (500 millicore) is now reserved by the system by default compared to OpenShift Container Platform 3.11 and previous versions. This influences the stated sizing recommendations.
1.2.2. Scaling the Cluster Monitoring Operator
OpenShift Container Platform exposes metrics that the Cluster Monitoring Operator collects and stores in the Prometheus-based monitoring stack. As an administrator, you can view dashboards for system resources, containers, and components metrics in the OpenShift Container Platform web console by navigating to Observe → Dashboards.
1.2.3. Prometheus database storage requirements
Red Hat performed various tests for different scale sizes.
The Prometheus storage requirements below are not prescriptive. Higher resource consumption might be observed in your cluster depending on workload activity and resource use.
Table 1.1. Prometheus Database storage requirements based on number of nodes/pods in the cluster
Number of Nodes | Number of pods | Prometheus storage growth per day | Prometheus storage growth per 15 days | RAM Space (per scale size) | Network (per tsdb chunk) |
---|---|---|---|---|---|
50 | 1800 | 6.3 GB | 94 GB | 6 GB | 16 MB |
100 | 3600 | 13 GB | 195 GB | 10 GB | 26 MB |
150 | 5400 | 19 GB | 283 GB | 12 GB | 36 MB |
200 | 7200 | 25 GB | 375 GB | 14 GB | 46 MB |
Approximately 20 percent of the expected size was added as overhead to ensure that the storage requirements do not exceed the calculated value.
The above calculation is for the default OpenShift Container Platform Cluster Monitoring Operator.
CPU utilization has minor impact. The ratio is approximately 1 core out of 40 per 50 nodes and 1800 pods.
Recommendations for OpenShift Container Platform
- Use at least three infrastructure (infra) nodes.
- Use at least three openshift-container-storage nodes with non-volatile memory express (NVMe) drives.
1.2.4. Configuring cluster monitoring
You can increase the storage capacity for the Prometheus component in the cluster monitoring stack.
Procedure
To increase the storage capacity for Prometheus:
Create a YAML configuration file,
cluster-monitoring-config.yaml
. For example:apiVersion: v1 kind: ConfigMap data: config.yaml: | prometheusK8s: retention: {{PROMETHEUS_RETENTION_PERIOD}} 1 nodeSelector: node-role.kubernetes.io/infra: "" volumeClaimTemplate: spec: storageClassName: {{STORAGE_CLASS}} 2 resources: requests: storage: {{PROMETHEUS_STORAGE_SIZE}} 3 alertmanagerMain: nodeSelector: node-role.kubernetes.io/infra: "" volumeClaimTemplate: spec: storageClassName: {{STORAGE_CLASS}} 4 resources: requests: storage: {{ALERTMANAGER_STORAGE_SIZE}} 5 metadata: name: cluster-monitoring-config namespace: openshift-monitoring
- 1
- A typical value is
PROMETHEUS_RETENTION_PERIOD=15d
. Units are measured in time using one of these suffixes: s, m, h, d. - 2 4
- The storage class for your cluster.
- 3
- A typical value is
PROMETHEUS_STORAGE_SIZE=2000Gi
. Storage values can be a plain integer or as a fixed-point integer using one of these suffixes: E, P, T, G, M, K. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki. - 5
- A typical value is
ALERTMANAGER_STORAGE_SIZE=20Gi
. Storage values can be a plain integer or as a fixed-point integer using one of these suffixes: E, P, T, G, M, K. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki.
- Add values for the retention period, storage class, and storage sizes.
- Save the file.
Apply the changes by running:
$ oc create -f cluster-monitoring-config.yaml
1.2.5. Additional resources
1.3. Recommended etcd practices
This topic provides recommended performance and scalability practices for etcd in OpenShift Container Platform.
1.3.1. Recommended etcd practices
Because etcd writes data to disk and persists proposals on disk, its performance depends on disk performance. Although etcd is not particularly I/O intensive, it requires a low latency block device for optimal performance and stability. Because etcd’s consensus protocol depends on persistently storing metadata to a log (WAL), etcd is sensitive to disk-write latency. Slow disks and disk activity from other processes can cause long fsync latencies.
Those latencies can cause etcd to miss heartbeats, not commit new proposals to the disk on time, and ultimately experience request timeouts and temporary leader loss. High write latencies also lead to an OpenShift API slowness, which affects cluster performance. Because of these reasons, avoid colocating other workloads on the control-plane nodes.
In terms of latency, run etcd on top of a block device that can write at least 50 IOPS of 8000 bytes long sequentially. That is, with a latency of 10ms, keep in mind that uses fdatasync to synchronize each write in the WAL. For heavy loaded clusters, sequential 500 IOPS of 8000 bytes (2 ms) are recommended. To measure those numbers, you can use a benchmarking tool, such as fio.
To achieve such performance, run etcd on machines that are backed by SSD or NVMe disks with low latency and high throughput. Consider single-level cell (SLC) solid-state drives (SSDs), which provide 1 bit per memory cell, are durable and reliable, and are ideal for write-intensive workloads.
The following hard disk features provide optimal etcd performance:
- Low latency to support fast read operation.
- High-bandwidth writes for faster compactions and defragmentation.
- High-bandwidth reads for faster recovery from failures.
- Solid state drives as a minimum selection, however NVMe drives are preferred.
- Server-grade hardware from various manufacturers for increased reliability.
- RAID 0 technology for increased performance.
- Dedicated etcd drives. Do not place log files or other heavy workloads on etcd drives.
Avoid NAS or SAN setups and spinning drives. Always benchmark by using utilities such as fio. Continuously monitor the cluster performance as it increases.
Avoid using the Network File System (NFS) protocol or other network based file systems.
Some key metrics to monitor on a deployed OpenShift Container Platform cluster are p99 of etcd disk write ahead log duration and the number of etcd leader changes. Use Prometheus to track these metrics.
The etcd member database sizes can vary in a cluster during normal operations. This difference does not affect cluster upgrades, even if the leader size is different from the other members.
To validate the hardware for etcd before or after you create the OpenShift Container Platform cluster, you can use fio.
Prerequisites
- Container runtimes such as Podman or Docker are installed on the machine that you’re testing.
-
Data is written to the
/var/lib/etcd
path.
Procedure
Run fio and analyze the results:
If you use Podman, run this command:
$ sudo podman run --volume /var/lib/etcd:/var/lib/etcd:Z quay.io/openshift-scale/etcd-perf
If you use Docker, run this command:
$ sudo docker run --volume /var/lib/etcd:/var/lib/etcd:Z quay.io/openshift-scale/etcd-perf
The output reports whether the disk is fast enough to host etcd by comparing the 99th percentile of the fsync metric captured from the run to see if it is less than 10 ms. A few of the most important etcd metrics that might affected by I/O performance are as follow:
-
etcd_disk_wal_fsync_duration_seconds_bucket
metric reports the etcd’s WAL fsync duration -
etcd_disk_backend_commit_duration_seconds_bucket
metric reports the etcd backend commit latency duration -
etcd_server_leader_changes_seen_total
metric reports the leader changes
Because etcd replicates the requests among all the members, its performance strongly depends on network input/output (I/O) latency. High network latencies result in etcd heartbeats taking longer than the election timeout, which results in leader elections that are disruptive to the cluster. A key metric to monitor on a deployed OpenShift Container Platform cluster is the 99th percentile of etcd network peer latency on each etcd cluster member. Use Prometheus to track the metric.
The histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[2m]))
metric reports the round trip time for etcd to finish replicating the client requests between the members. Ensure that it is less than 50 ms.
1.3.2. Moving etcd to a different disk
You can move etcd from a shared disk to a separate disk to prevent or resolve performance issues.
Prerequisites
-
The
MachineConfigPool
must matchmetadata.labels[machineconfiguration.openshift.io/role]
. This applies to a controller, worker, or a custom pool. -
The node’s auxiliary storage device, such as
/dev/sdb
, must match the sdb. Change this reference in all places in the file.
This procedure does not move parts of the root file system, such as /var/
, to another disk or partition on an installed node.
The Machine Config Operator (MCO) is responsible for mounting a secondary disk for an OpenShift Container Platform 4.12 container storage.
Use the following steps to move etcd to a different device:
Procedure
Create a
machineconfig
YAML file namedetcd-mc.yml
and add the following information:apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig metadata: labels: machineconfiguration.openshift.io/role: master name: 98-var-lib-etcd spec: config: ignition: version: 3.2.0 systemd: units: - contents: | [Unit] Description=Make File System on /dev/sdb DefaultDependencies=no BindsTo=dev-sdb.device After=dev-sdb.device var.mount Before=systemd-fsck@dev-sdb.service [Service] Type=oneshot RemainAfterExit=yes ExecStart=/usr/lib/systemd/systemd-makefs xfs /dev/sdb TimeoutSec=0 [Install] WantedBy=var-lib-containers.mount enabled: true name: systemd-mkfs@dev-sdb.service - contents: | [Unit] Description=Mount /dev/sdb to /var/lib/etcd Before=local-fs.target Requires=systemd-mkfs@dev-sdb.service After=systemd-mkfs@dev-sdb.service var.mount [Mount] What=/dev/sdb Where=/var/lib/etcd Type=xfs Options=defaults,prjquota [Install] WantedBy=local-fs.target enabled: true name: var-lib-etcd.mount - contents: | [Unit] Description=Sync etcd data if new mount is empty DefaultDependencies=no After=var-lib-etcd.mount var.mount Before=crio.service [Service] Type=oneshot RemainAfterExit=yes ExecCondition=/usr/bin/test ! -d /var/lib/etcd/member ExecStart=/usr/sbin/setenforce 0 ExecStart=/bin/rsync -ar /sysroot/ostree/deploy/rhcos/var/lib/etcd/ /var/lib/etcd/ ExecStart=/usr/sbin/setenforce 1 TimeoutSec=0 [Install] WantedBy=multi-user.target graphical.target enabled: true name: sync-var-lib-etcd-to-etcd.service - contents: | [Unit] Description=Restore recursive SELinux security contexts DefaultDependencies=no After=var-lib-etcd.mount Before=crio.service [Service] Type=oneshot RemainAfterExit=yes ExecStart=/sbin/restorecon -R /var/lib/etcd/ TimeoutSec=0 [Install] WantedBy=multi-user.target graphical.target enabled: true name: restorecon-var-lib-etcd.service
Create the machine configuration by entering the following commands:
$ oc login -u ${ADMIN} -p ${ADMINPASSWORD} ${API} ... output omitted ...
$ oc create -f etcd-mc.yml machineconfig.machineconfiguration.openshift.io/98-var-lib-etcd created
$ oc login -u ${ADMIN} -p ${ADMINPASSWORD} ${API} [... output omitted ...]
$ oc create -f etcd-mc.yml machineconfig.machineconfiguration.openshift.io/98-var-lib-etcd created
The nodes are updated and rebooted. After the reboot completes, the following events occur:
- An XFS file system is created on the specified disk.
-
The disk mounts to
/var/lib/etc
. -
The content from
/sysroot/ostree/deploy/rhcos/var/lib/etcd
syncs to/var/lib/etcd
. -
A restore of
SELinux
labels is forced for/var/lib/etcd
. - The old content is not removed.
After the nodes are on a separate disk, update the machine configuration file,
etcd-mc.yml
with the following information:apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig metadata: labels: machineconfiguration.openshift.io/role: master name: 98-var-lib-etcd spec: config: ignition: version: 3.2.0 systemd: units: - contents: | [Unit] Description=Mount /dev/sdb to /var/lib/etcd Before=local-fs.target Requires=systemd-mkfs@dev-sdb.service After=systemd-mkfs@dev-sdb.service var.mount [Mount] What=/dev/sdb Where=/var/lib/etcd Type=xfs Options=defaults,prjquota [Install] WantedBy=local-fs.target enabled: true name: var-lib-etcd.mount
Apply the modified version that removes the logic for creating and syncing the device by entering the following command:
$ oc replace -f etcd-mc.yml
The previous step prevents the nodes from rebooting.
Additional resources
1.3.3. Defragmenting etcd data
For large and dense clusters, etcd can suffer from poor performance if the keyspace grows too large and exceeds the space quota. Periodically maintain and defragment etcd to free up space in the data store. Monitor Prometheus for etcd metrics and defragment it when required; otherwise, etcd can raise a cluster-wide alarm that puts the cluster into a maintenance mode that accepts only key reads and deletes.
Monitor these key metrics:
-
etcd_server_quota_backend_bytes
, which is the current quota limit -
etcd_mvcc_db_total_size_in_use_in_bytes
, which indicates the actual database usage after a history compaction -
etcd_mvcc_db_total_size_in_bytes
, which shows the database size, including free space waiting for defragmentation
Defragment etcd data to reclaim disk space after events that cause disk fragmentation, such as etcd history compaction.
History compaction is performed automatically every five minutes and leaves gaps in the back-end database. This fragmented space is available for use by etcd, but is not available to the host file system. You must defragment etcd to make this space available to the host file system.
Defragmentation occurs automatically, but you can also trigger it manually.
Automatic defragmentation is good for most cases, because the etcd operator uses cluster information to determine the most efficient operation for the user.
1.3.3.1. Automatic defragmentation
The etcd Operator automatically defragments disks. No manual intervention is needed.
Verify that the defragmentation process is successful by viewing one of these logs:
- etcd logs
- cluster-etcd-operator pod
- operator status error log
Automatic defragmentation can cause leader election failure in various OpenShift core components, such as the Kubernetes controller manager, which triggers a restart of the failing component. The restart is harmless and either triggers failover to the next running instance or the component resumes work again after the restart.
Example log output for successful defragmentation
etcd member has been defragmented: <member_name>, memberID: <member_id>
Example log output for unsuccessful defragmentation
failed defrag on member: <member_name>, memberID: <member_id>: <error_message>
1.3.3.2. Manual defragmentation
A Prometheus alert indicates when you need to use manual defragmentation. The alert is displayed in two cases:
- When etcd uses more than 50% of its available space for more than 10 minutes
- When etcd is actively using less than 50% of its total database size for more than 10 minutes
You can also determine whether defragmentation is needed by checking the etcd database size in MB that will be freed by defragmentation with the PromQL expression: (etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes)/1024/1024
Defragmenting etcd is a blocking action. The etcd member will not respond until defragmentation is complete. For this reason, wait at least one minute between defragmentation actions on each of the pods to allow the cluster to recover.
Follow this procedure to defragment etcd data on each etcd member.
Prerequisites
-
You have access to the cluster as a user with the
cluster-admin
role.
Procedure
Determine which etcd member is the leader, because the leader should be defragmented last.
Get the list of etcd pods:
$ oc -n openshift-etcd get pods -l k8s-app=etcd -o wide
Example output
etcd-ip-10-0-159-225.example.redhat.com 3/3 Running 0 175m 10.0.159.225 ip-10-0-159-225.example.redhat.com <none> <none> etcd-ip-10-0-191-37.example.redhat.com 3/3 Running 0 173m 10.0.191.37 ip-10-0-191-37.example.redhat.com <none> <none> etcd-ip-10-0-199-170.example.redhat.com 3/3 Running 0 176m 10.0.199.170 ip-10-0-199-170.example.redhat.com <none> <none>
Choose a pod and run the following command to determine which etcd member is the leader:
$ oc rsh -n openshift-etcd etcd-ip-10-0-159-225.example.redhat.com etcdctl endpoint status --cluster -w table
Example output
Defaulting container name to etcdctl. Use 'oc describe pod/etcd-ip-10-0-159-225.example.redhat.com -n openshift-etcd' to see all of the containers in this pod. +---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ | ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS | +---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ | https://10.0.191.37:2379 | 251cd44483d811c3 | 3.4.9 | 104 MB | false | false | 7 | 91624 | 91624 | | | https://10.0.159.225:2379 | 264c7c58ecbdabee | 3.4.9 | 104 MB | false | false | 7 | 91624 | 91624 | | | https://10.0.199.170:2379 | 9ac311f93915cc79 | 3.4.9 | 104 MB | true | false | 7 | 91624 | 91624 | | +---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Based on the
IS LEADER
column of this output, thehttps://10.0.199.170:2379
endpoint is the leader. Matching this endpoint with the output of the previous step, the pod name of the leader isetcd-ip-10-0-199-170.example.redhat.com
.
Defragment an etcd member.
Connect to the running etcd container, passing in the name of a pod that is not the leader:
$ oc rsh -n openshift-etcd etcd-ip-10-0-159-225.example.redhat.com
Unset the
ETCDCTL_ENDPOINTS
environment variable:sh-4.4# unset ETCDCTL_ENDPOINTS
Defragment the etcd member:
sh-4.4# etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag
Example output
Finished defragmenting etcd member[https://localhost:2379]
If a timeout error occurs, increase the value for
--command-timeout
until the command succeeds.Verify that the database size was reduced:
sh-4.4# etcdctl endpoint status -w table --cluster
Example output
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ | ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS | +---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ | https://10.0.191.37:2379 | 251cd44483d811c3 | 3.4.9 | 104 MB | false | false | 7 | 91624 | 91624 | | | https://10.0.159.225:2379 | 264c7c58ecbdabee | 3.4.9 | 41 MB | false | false | 7 | 91624 | 91624 | | 1 | https://10.0.199.170:2379 | 9ac311f93915cc79 | 3.4.9 | 104 MB | true | false | 7 | 91624 | 91624 | | +---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
This example shows that the database size for this etcd member is now 41 MB as opposed to the starting size of 104 MB.
Repeat these steps to connect to each of the other etcd members and defragment them. Always defragment the leader last.
Wait at least one minute between defragmentation actions to allow the etcd pod to recover. Until the etcd pod recovers, the etcd member will not respond.
If any
NOSPACE
alarms were triggered due to the space quota being exceeded, clear them.Check if there are any
NOSPACE
alarms:sh-4.4# etcdctl alarm list
Example output
memberID:12345678912345678912 alarm:NOSPACE
Clear the alarms:
sh-4.4# etcdctl alarm disarm
Next steps
After defragmentation, if etcd still uses more than 50% of its available space, consider increasing the disk quota for etcd.
Chapter 2. Planning your environment according to object maximums
Consider the following tested object maximums when you plan your OpenShift Container Platform cluster.
These guidelines are based on the largest possible cluster. For smaller clusters, the maximums are lower. There are many factors that influence the stated thresholds, including the etcd version or storage data format.
In most cases, exceeding these numbers results in lower overall performance. It does not necessarily mean that the cluster will fail.
Clusters that experience rapid change, such as those with many starting and stopping pods, can have a lower practical maximum size than documented.
2.1. OpenShift Container Platform tested cluster maximums for major releases
Red Hat does not provide direct guidance on sizing your OpenShift Container Platform cluster. This is because determining whether your cluster is within the supported bounds of OpenShift Container Platform requires careful consideration of all the multidimensional factors that limit the cluster scale.
OpenShift Container Platform supports tested cluster maximums rather than absolute cluster maximums. Not every combination of OpenShift Container Platform version, control plane workload, and network plugin are tested, so the following table does not represent an absolute expectation of scale for all deployments. It might not be possible to scale to a maximum on all dimensions simultaneously. The table contains tested maximums for specific workload and deployment configurations, and serves as a scale guide as to what can be expected with similar deployments.
Maximum type | 4.x tested maximum |
---|---|
Number of nodes | 2,000 [1] |
Number of pods [2] | 150,000 |
Number of pods per node | 500 [3] |
Number of pods per core | There is no default value. |
Number of namespaces [4] | 10,000 |
Number of builds | 10,000 (Default pod RAM 512 Mi) - Source-to-Image (S2I) build strategy |
Number of pods per namespace [5] | 25,000 |
Number of routes and back ends per Ingress Controller | 2,000 per router |
Number of secrets | 80,000 |
Number of config maps | 90,000 |
Number of services [6] | 10,000 |
Number of services per namespace | 5,000 |
Number of back-ends per service | 5,000 |
Number of deployments per namespace [5] | 2,000 |
Number of build configs | 12,000 |
Number of custom resource definitions (CRD) | 512 [7] |
- Pause pods were deployed to stress the control plane components of OpenShift Container Platform at 2000 node scale. The ability to scale to similar numbers will vary depending upon specific deployment and workload parameters.
- The pod count displayed here is the number of test pods. The actual number of pods depends on the application’s memory, CPU, and storage requirements.
-
This was tested on a cluster with 100 worker nodes with 500 pods per worker node. The default
maxPods
is still 250. To get to 500maxPods
, the cluster must be created with amaxPods
set to500
using a custom kubelet config. If you need 500 user pods, you need ahostPrefix
of22
because there are 10-15 system pods already running on the node. The maximum number of pods with attached persistent volume claims (PVC) depends on storage backend from where PVC are allocated. In our tests, only OpenShift Data Foundation v4 (OCS v4) was able to satisfy the number of pods per node discussed in this document. - When there are a large number of active projects, etcd might suffer from poor performance if the keyspace grows excessively large and exceeds the space quota. Periodic maintenance of etcd, including defragmentation, is highly recommended to free etcd storage.
- There are a number of control loops in the system that must iterate over all objects in a given namespace as a reaction to some changes in state. Having a large number of objects of a given type in a single namespace can make those loops expensive and slow down processing given state changes. The limit assumes that the system has enough CPU, memory, and disk to satisfy the application requirements.
- Each service port and each service back-end has a corresponding entry in iptables. The number of back-ends of a given service impact the size of the endpoints objects, which impacts the size of data that is being sent all over the system.
-
OpenShift Container Platform has a limit of 512 total custom resource definitions (CRD), including those installed by OpenShift Container Platform, products integrating with OpenShift Container Platform and user created CRDs. If there are more than 512 CRDs created, then there is a possibility that
oc
commands requests may be throttled.
2.1.1. Example scenario
As an example, 500 worker nodes (m5.2xl) were tested, and are supported, using OpenShift Container Platform 4.12, the OVN-Kubernetes network plugin, and the following workload objects:
- 200 namespaces, in addition to the defaults
- 60 pods per node; 30 server and 30 client pods (30k total)
- 57 image streams/ns (11.4k total)
- 15 services/ns backed by the server pods (3k total)
- 15 routes/ns backed by the previous services (3k total)
- 20 secrets/ns (4k total)
- 10 config maps/ns (2k total)
- 6 network policies/ns, including deny-all, allow-from ingress and intra-namespace rules
- 57 builds/ns
The following factors are known to affect cluster workload scaling, positively or negatively, and should be factored into the scale numbers when planning a deployment. For additional information and guidance, contact your sales representative or Red Hat support.
- Number of pods per node
- Number of containers per pod
- Type of probes used (for example, liveness/readiness, exec/http)
- Number of network policies
- Number of projects, or namespaces
- Number of image streams per project
- Number of builds per project
- Number of services/endpoints and type
- Number of routes
- Number of shards
- Number of secrets
- Number of config maps
Rate of API calls, or the cluster “churn”, which is an estimation of how quickly things change in the cluster configuration.
-
Prometheus query for pod creation requests per second over 5 minute windows:
sum(irate(apiserver_request_count{resource="pods",verb="POST"}[5m]))
-
Prometheus query for all API requests per second over 5 minute windows:
sum(irate(apiserver_request_count{}[5m]))
-
Prometheus query for pod creation requests per second over 5 minute windows:
- Cluster node resource consumption of CPU
- Cluster node resource consumption of memory
2.2. OpenShift Container Platform environment and configuration on which the cluster maximums are tested
2.2.1. AWS cloud platform
Node | Flavor | vCPU | RAM(GiB) | Disk type | Disk size(GiB)/IOS | Count | Region |
---|---|---|---|---|---|---|---|
Control plane/etcd [1] | r5.4xlarge | 16 | 128 | gp3 | 220 | 3 | us-west-2 |
Infra [2] | m5.12xlarge | 48 | 192 | gp3 | 100 | 3 | us-west-2 |
Workload [3] | m5.4xlarge | 16 | 64 | gp3 | 500 [4] | 1 | us-west-2 |
Compute | m5.2xlarge | 8 | 32 | gp3 | 100 | 3/25/250/500 [5] | us-west-2 |
- gp3 disks with a baseline performance of 3000 IOPS and 125 MiB per second are used for control plane/etcd nodes because etcd is latency sensitive. gp3 volumes do not use burst performance.
- Infra nodes are used to host Monitoring, Ingress, and Registry components to ensure they have enough resources to run at large scale.
- Workload node is dedicated to run performance and scalability workload generators.
- Larger disk size is used so that there is enough space to store the large amounts of data that is collected during the performance and scalability test run.
- Cluster is scaled in iterations and performance and scalability tests are executed at the specified node counts.
2.2.2. IBM Power platform
Node | vCPU | RAM(GiB) | Disk type | Disk size(GiB)/IOS | Count |
---|---|---|---|---|---|
Control plane/etcd [1] | 16 | 32 | io1 | 120 / 10 IOPS per GiB | 3 |
Infra [2] | 16 | 64 | gp2 | 120 | 2 |
Workload [3] | 16 | 256 | gp2 | 120 [4] | 1 |
Compute | 16 | 64 | gp2 | 120 | 2 to 100 [5] |
- io1 disks with 120 / 10 IOPS per GiB are used for control plane/etcd nodes as etcd is I/O intensive and latency sensitive.
- Infra nodes are used to host Monitoring, Ingress, and Registry components to ensure they have enough resources to run at large scale.
- Workload node is dedicated to run performance and scalability workload generators.
- Larger disk size is used so that there is enough space to store the large amounts of data that is collected during the performance and scalability test run.
- Cluster is scaled in iterations.
2.2.3. IBM zSystems platform
Node | vCPU [4] | RAM(GiB)[5] | Disk type | Disk size(GiB)/IOS | Count |
---|---|---|---|---|---|
Control plane/etcd [1,2] | 8 | 32 | ds8k | 300 / LCU 1 | 3 |
Compute [1,3] | 8 | 32 | ds8k | 150 / LCU 2 | 4 nodes (scaled to 100/250/500 pods per node) |
- Nodes are distributed between two logical control units (LCUs) to optimize disk I/O load of the control plane/etcd nodes as etcd is I/O intensive and latency sensitive. Etcd I/O demand should not interfere with other workloads.
- Four compute nodes are used for the tests running several iterations with 100/250/500 pods at the same time. First, idling pods were used to evaluate if pods can be instanced. Next, a network and CPU demanding client/server workload were used to evaluate the stability of the system under stress. Client and server pods were pairwise deployed and each pair was spread over two compute nodes.
- No separate workload node was used. The workload simulates a microservice workload between two compute nodes.
- Physical number of processors used is six Integrated Facilities for Linux (IFLs).
- Total physical memory used is 512 GiB.
2.3. How to plan your environment according to tested cluster maximums
Oversubscribing the physical resources on a node affects resource guarantees the Kubernetes scheduler makes during pod placement. Learn what measures you can take to avoid memory swapping.
Some of the tested maximums are stretched only in a single dimension. They will vary when many objects are running on the cluster.
The numbers noted in this documentation are based on Red Hat’s test methodology, setup, configuration, and tunings. These numbers can vary based on your own individual setup and environments.
While planning your environment, determine how many pods are expected to fit per node:
required pods per cluster / pods per node = total number of nodes needed
The default maximum number of pods per node is 250. However, the number of pods that fit on a node is dependent on the application itself. Consider the application’s memory, CPU, and storage requirements, as described in "How to plan your environment according to application requirements".
Example scenario
If you want to scope your cluster for 2200 pods per cluster, you would need at least five nodes, assuming that there are 500 maximum pods per node:
2200 / 500 = 4.4
If you increase the number of nodes to 20, then the pod distribution changes to 110 pods per node:
2200 / 20 = 110
Where:
required pods per cluster / total number of nodes = expected pods per node
OpenShift Container Platform comes with several system pods, such as SDN, DNS, Operators, and others, which run across every worker node by default. Therefore, the result of the above formula can vary.
2.4. How to plan your environment according to application requirements
Consider an example application environment:
Pod type | Pod quantity | Max memory | CPU cores | Persistent storage |
---|---|---|---|---|
apache | 100 | 500 MB | 0.5 | 1 GB |
node.js | 200 | 1 GB | 1 | 1 GB |
postgresql | 100 | 1 GB | 2 | 10 GB |
JBoss EAP | 100 | 1 GB | 1 | 1 GB |
Extrapolated requirements: 550 CPU cores, 450GB RAM, and 1.4TB storage.
Instance size for nodes can be modulated up or down, depending on your preference. Nodes are often resource overcommitted. In this deployment scenario, you can choose to run additional smaller nodes or fewer larger nodes to provide the same amount of resources. Factors such as operational agility and cost-per-instance should be considered.
Node type | Quantity | CPUs | RAM (GB) |
---|---|---|---|
Nodes (option 1) | 100 | 4 | 16 |
Nodes (option 2) | 50 | 8 | 32 |
Nodes (option 3) | 25 | 16 | 64 |
Some applications lend themselves well to overcommitted environments, and some do not. Most Java applications and applications that use huge pages are examples of applications that would not allow for overcommitment. That memory can not be used for other applications. In the example above, the environment would be roughly 30 percent overcommitted, a common ratio.
The application pods can access a service either by using environment variables or DNS. If using environment variables, for each active service the variables are injected by the kubelet when a pod is run on a node. A cluster-aware DNS server watches the Kubernetes API for new services and creates a set of DNS records for each one. If DNS is enabled throughout your cluster, then all pods should automatically be able to resolve services by their DNS name. Service discovery using DNS can be used in case you must go beyond 5000 services. When using environment variables for service discovery, the argument list exceeds the allowed length after 5000 services in a namespace, then the pods and deployments will start failing. Disable the service links in the deployment’s service specification file to overcome this:
--- apiVersion: template.openshift.io/v1 kind: Template metadata: name: deployment-config-template creationTimestamp: annotations: description: This template will create a deploymentConfig with 1 replica, 4 env vars and a service. tags: '' objects: - apiVersion: apps.openshift.io/v1 kind: DeploymentConfig metadata: name: deploymentconfig${IDENTIFIER} spec: template: metadata: labels: name: replicationcontroller${IDENTIFIER} spec: enableServiceLinks: false containers: - name: pause${IDENTIFIER} image: "${IMAGE}" ports: - containerPort: 8080 protocol: TCP env: - name: ENVVAR1_${IDENTIFIER} value: "${ENV_VALUE}" - name: ENVVAR2_${IDENTIFIER} value: "${ENV_VALUE}" - name: ENVVAR3_${IDENTIFIER} value: "${ENV_VALUE}" - name: ENVVAR4_${IDENTIFIER} value: "${ENV_VALUE}" resources: {} imagePullPolicy: IfNotPresent capabilities: {} securityContext: capabilities: {} privileged: false restartPolicy: Always serviceAccount: '' replicas: 1 selector: name: replicationcontroller${IDENTIFIER} triggers: - type: ConfigChange strategy: type: Rolling - apiVersion: v1 kind: Service metadata: name: service${IDENTIFIER} spec: selector: name: replicationcontroller${IDENTIFIER} ports: - name: serviceport${IDENTIFIER} protocol: TCP port: 80 targetPort: 8080 clusterIP: '' type: ClusterIP sessionAffinity: None status: loadBalancer: {} parameters: - name: IDENTIFIER description: Number to append to the name of resources value: '1' required: true - name: IMAGE description: Image to use for deploymentConfig value: gcr.io/google-containers/pause-amd64:3.0 required: false - name: ENV_VALUE description: Value to use for environment variables generate: expression from: "[A-Za-z0-9]{255}" required: false labels: template: deployment-config-template
The number of application pods that can run in a namespace is dependent on the number of services and the length of the service name when the environment variables are used for service discovery. ARG_MAX
on the system defines the maximum argument length for a new process and it is set to 2097152 bytes (2 MiB) by default. The Kubelet injects environment variables in to each pod scheduled to run in the namespace including:
-
<SERVICE_NAME>_SERVICE_HOST=<IP>
-
<SERVICE_NAME>_SERVICE_PORT=<PORT>
-
<SERVICE_NAME>_PORT=tcp://<IP>:<PORT>
-
<SERVICE_NAME>_PORT_<PORT>_TCP=tcp://<IP>:<PORT>
-
<SERVICE_NAME>_PORT_<PORT>_TCP_PROTO=tcp
-
<SERVICE_NAME>_PORT_<PORT>_TCP_PORT=<PORT>
-
<SERVICE_NAME>_PORT_<PORT>_TCP_ADDR=<ADDR>
The pods in the namespace will start to fail if the argument length exceeds the allowed value and the number of characters in a service name impacts it. For example, in a namespace with 5000 services, the limit on the service name is 33 characters, which enables you to run 5000 pods in the namespace.
Chapter 3. Recommended host practices for IBM zSystems & IBM(R) LinuxONE environments
This topic provides recommended host practices for OpenShift Container Platform on IBM zSystems and IBM® LinuxONE.
The s390x architecture is unique in many aspects. Therefore, some recommendations made here might not apply to other platforms.
Unless stated otherwise, these practices apply to both z/VM and Red Hat Enterprise Linux (RHEL) KVM installations on IBM zSystems and IBM® LinuxONE.
3.1. Managing CPU overcommitment
In a highly virtualized IBM zSystems environment, you must carefully plan the infrastructure setup and sizing. One of the most important features of virtualization is the capability to do resource overcommitment, allocating more resources to the virtual machines than actually available at the hypervisor level. This is very workload dependent and there is no golden rule that can be applied to all setups.
Depending on your setup, consider these best practices regarding CPU overcommitment:
- At LPAR level (PR/SM hypervisor), avoid assigning all available physical cores (IFLs) to each LPAR. For example, with four physical IFLs available, you should not define three LPARs with four logical IFLs each.
- Check and understand LPAR shares and weights.
- An excessive number of virtual CPUs can adversely affect performance. Do not define more virtual processors to a guest than logical processors are defined to the LPAR.
- Configure the number of virtual processors per guest for peak workload, not more.
- Start small and monitor the workload. Increase the vCPU number incrementally if necessary.
- Not all workloads are suitable for high overcommitment ratios. If the workload is CPU intensive, you will probably not be able to achieve high ratios without performance problems. Workloads that are more I/O intensive can keep consistent performance even with high overcommitment ratios.
3.2. Disable Transparent Huge Pages
Transparent Huge Pages (THP) attempt to automate most aspects of creating, managing, and using huge pages. Since THP automatically manages the huge pages, this is not always handled optimally for all types of workloads. THP can lead to performance regressions, since many applications handle huge pages on their own. Therefore, consider disabling THP.
3.3. Boost networking performance with Receive Flow Steering
Receive Flow Steering (RFS) extends Receive Packet Steering (RPS) by further reducing network latency. RFS is technically based on RPS, and improves the efficiency of packet processing by increasing the CPU cache hit rate. RFS achieves this, and in addition considers queue length, by determining the most convenient CPU for computation so that cache hits are more likely to occur within the CPU. Thus, the CPU cache is invalidated less and requires fewer cycles to rebuild the cache. This can help reduce packet processing run time.
3.3.1. Use the Machine Config Operator (MCO) to activate RFS
Procedure
Copy the following MCO sample profile into a YAML file. For example,
enable-rfs.yaml
:apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig metadata: labels: machineconfiguration.openshift.io/role: worker name: 50-enable-rfs spec: config: ignition: version: 2.2.0 storage: files: - contents: source: data:text/plain;charset=US-ASCII,%23%20turn%20on%20Receive%20Flow%20Steering%20%28RFS%29%20for%20all%20network%20interfaces%0ASUBSYSTEM%3D%3D%22net%22%2C%20ACTION%3D%3D%22add%22%2C%20RUN%7Bprogram%7D%2B%3D%22/bin/bash%20-c%20%27for%20x%20in%20/sys/%24DEVPATH/queues/rx-%2A%3B%20do%20echo%208192%20%3E%20%24x/rps_flow_cnt%3B%20%20done%27%22%0A filesystem: root mode: 0644 path: /etc/udev/rules.d/70-persistent-net.rules - contents: source: data:text/plain;charset=US-ASCII,%23%20define%20sock%20flow%20enbtried%20for%20%20Receive%20Flow%20Steering%20%28RFS%29%0Anet.core.rps_sock_flow_entries%3D8192%0A filesystem: root mode: 0644 path: /etc/sysctl.d/95-enable-rps.conf
Create the MCO profile:
$ oc create -f enable-rfs.yaml
Verify that an entry named
50-enable-rfs
is listed:$ oc get mc
To deactivate, enter:
$ oc delete mc 50-enable-rfs
3.4. Choose your networking setup
The networking stack is one of the most important components for a Kubernetes-based product like OpenShift Container Platform. For IBM zSystems setups, the networking setup depends on the hypervisor of your choice. Depending on the workload and the application, the best fit usually changes with the use case and the traffic pattern.
Depending on your setup, consider these best practices:
- Consider all options regarding networking devices to optimize your traffic pattern. Explore the advantages of OSA-Express, RoCE Express, HiperSockets, z/VM VSwitch, Linux Bridge (KVM), and others to decide which option leads to the greatest benefit for your setup.
- Always use the latest available NIC version. For example, OSA Express 7S 10 GbE shows great improvement compared to OSA Express 6S 10 GbE with transactional workload types, although both are 10 GbE adapters.
- Each virtual switch adds an additional layer of latency.
- The load balancer plays an important role for network communication outside the cluster. Consider using a production-grade hardware load balancer if this is critical for your application.
- OpenShift Container Platform SDN introduces flows and rules, which impact the networking performance. Make sure to consider pod affinities and placements, to benefit from the locality of services where communication is critical.
- Balance the trade-off between performance and functionality.
3.5. Ensure high disk performance with HyperPAV on z/VM
DASD and ECKD devices are commonly used disk types in IBM zSystems environments. In a typical OpenShift Container Platform setup in z/VM environments, DASD disks are commonly used to support the local storage for the nodes. You can set up HyperPAV alias devices to provide more throughput and overall better I/O performance for the DASD disks that support the z/VM guests.
Using HyperPAV for the local storage devices leads to a significant performance benefit. However, you must be aware that there is a trade-off between throughput and CPU costs.
3.5.1. Use the Machine Config Operator (MCO) to activate HyperPAV aliases in nodes using z/VM full-pack minidisks
For z/VM-based OpenShift Container Platform setups that use full-pack minidisks, you can leverage the advantage of MCO profiles by activating HyperPAV aliases in all of the nodes. You must add YAML configurations for both control plane and compute nodes.
Procedure
Copy the following MCO sample profile into a YAML file for the control plane node. For example,
05-master-kernelarg-hpav.yaml
:$ cat 05-master-kernelarg-hpav.yaml apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig metadata: labels: machineconfiguration.openshift.io/role: master name: 05-master-kernelarg-hpav spec: config: ignition: version: 3.1.0 kernelArguments: - rd.dasd=800-805
Copy the following MCO sample profile into a YAML file for the compute node. For example,
05-worker-kernelarg-hpav.yaml
:$ cat 05-worker-kernelarg-hpav.yaml apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig metadata: labels: machineconfiguration.openshift.io/role: worker name: 05-worker-kernelarg-hpav spec: config: ignition: version: 3.1.0 kernelArguments: - rd.dasd=800-805
NoteYou must modify the
rd.dasd
arguments to fit the device IDs.Create the MCO profiles:
$ oc create -f 05-master-kernelarg-hpav.yaml
$ oc create -f 05-worker-kernelarg-hpav.yaml
To deactivate, enter:
$ oc delete -f 05-master-kernelarg-hpav.yaml
$ oc delete -f 05-worker-kernelarg-hpav.yaml
Additional resources
3.6. RHEL KVM on IBM zSystems host recommendations
Optimizing a KVM virtual server environment strongly depends on the workloads of the virtual servers and on the available resources. The same action that enhances performance in one environment can have adverse effects in another. Finding the best balance for a particular setting can be a challenge and often involves experimentation.
The following section introduces some best practices when using OpenShift Container Platform with RHEL KVM on IBM zSystems and IBM® LinuxONE environments.
3.6.1. Use multiple queues for your VirtIO network interfaces
With multiple virtual CPUs, you can transfer packages in parallel if you provide multiple queues for incoming and outgoing packets. Use the queues
attribute of the driver
element to configure multiple queues. Specify an integer of at least 2 that does not exceed the number of virtual CPUs of the virtual server.
The following example specification configures two input and output queues for a network interface:
<interface type="direct"> <source network="net01"/> <model type="virtio"/> <driver ... queues="2"/> </interface>
Multiple queues are designed to provide enhanced performance for a network interface, but they also use memory and CPU resources. Start with defining two queues for busy interfaces. Next, try two queues for interfaces with less traffic or more than two queues for busy interfaces.
3.6.2. Use I/O threads for your virtual block devices
To make virtual block devices use I/O threads, you must configure one or more I/O threads for the virtual server and each virtual block device to use one of these I/O threads.
The following example specifies <iothreads>3</iothreads>
to configure three I/O threads, with consecutive decimal thread IDs 1, 2, and 3. The iothread="2"
parameter specifies the driver element of the disk device to use the I/O thread with ID 2.
Sample I/O thread specification
... <domain> <iothreads>3</iothreads>1 ... <devices> ... <disk type="block" device="disk">2 <driver ... iothread="2"/> </disk> ... </devices> ... </domain>
Threads can increase the performance of I/O operations for disk devices, but they also use memory and CPU resources. You can configure multiple devices to use the same thread. The best mapping of threads to devices depends on the available resources and the workload.
Start with a small number of I/O threads. Often, a single I/O thread for all disk devices is sufficient. Do not configure more threads than the number of virtual CPUs, and do not configure idle threads.
You can use the virsh iothreadadd
command to add I/O threads with specific thread IDs to a running virtual server.
3.6.3. Avoid virtual SCSI devices
Configure virtual SCSI devices only if you need to address the device through SCSI-specific interfaces. Configure disk space as virtual block devices rather than virtual SCSI devices, regardless of the backing on the host.
However, you might need SCSI-specific interfaces for:
- A LUN for a SCSI-attached tape drive on the host.
- A DVD ISO file on the host file system that is mounted on a virtual DVD drive.
3.6.4. Configure guest caching for disk
Configure your disk devices to do caching by the guest and not by the host.
Ensure that the driver element of the disk device includes the cache="none"
and io="native"
parameters.
<disk type="block" device="disk"> <driver name="qemu" type="raw" cache="none" io="native" iothread="1"/> ... </disk>
3.6.5. Exclude the memory balloon device
Unless you need a dynamic memory size, do not define a memory balloon device and ensure that libvirt does not create one for you. Include the memballoon
parameter as a child of the devices element in your domain configuration XML file.
Check the list of active profiles:
<memballoon model="none"/>
3.6.6. Tune the CPU migration algorithm of the host scheduler
Do not change the scheduler settings unless you are an expert who understands the implications. Do not apply changes to production systems without testing them and confirming that they have the intended effect.
The kernel.sched_migration_cost_ns
parameter specifies a time interval in nanoseconds. After the last execution of a task, the CPU cache is considered to have useful content until this interval expires. Increasing this interval results in fewer task migrations. The default value is 500000 ns.
If the CPU idle time is higher than expected when there are runnable processes, try reducing this interval. If tasks bounce between CPUs or nodes too often, try increasing it.
To dynamically set the interval to 60000 ns, enter the following command:
# sysctl kernel.sched_migration_cost_ns=60000
To persistently change the value to 60000 ns, add the following entry to /etc/sysctl.conf
:
kernel.sched_migration_cost_ns=60000
3.6.7. Disable the cpuset cgroup controller
This setting applies only to KVM hosts with cgroups version 1. To enable CPU hotplug on the host, disable the cgroup controller.
Procedure
-
Open
/etc/libvirt/qemu.conf
with an editor of your choice. -
Go to the
cgroup_controllers
line. - Duplicate the entire line and remove the leading number sign (#) from the copy.
Remove the
cpuset
entry, as follows:cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuacct" ]
For the new setting to take effect, you must restart the libvirtd daemon:
- Stop all virtual machines.
Run the following command:
# systemctl restart libvirtd
- Restart the virtual machines.
This setting persists across host reboots.
3.6.8. Tune the polling period for idle virtual CPUs
When a virtual CPU becomes idle, KVM polls for wakeup conditions for the virtual CPU before allocating the host resource. You can specify the time interval, during which polling takes place in sysfs at /sys/module/kvm/parameters/halt_poll_ns
. During the specified time, polling reduces the wakeup latency for the virtual CPU at the expense of resource usage. Depending on the workload, a longer or shorter time for polling can be beneficial. The time interval is specified in nanoseconds. The default is 50000 ns.
To optimize for low CPU consumption, enter a small value or write 0 to disable polling:
# echo 0 > /sys/module/kvm/parameters/halt_poll_ns
To optimize for low latency, for example for transactional workloads, enter a large value:
# echo 80000 > /sys/module/kvm/parameters/halt_poll_ns
Additional resources
Chapter 4. Using the Node Tuning Operator
Learn about the Node Tuning Operator and how you can use it to manage node-level tuning by orchestrating the tuned daemon.
4.1. About the Node Tuning Operator
The Node Tuning Operator helps you manage node-level tuning by orchestrating the TuneD daemon and achieves low latency performance by using the Performance Profile controller. The majority of high-performance applications require some level of kernel tuning. The Node Tuning Operator provides a unified management interface to users of node-level sysctls and more flexibility to add custom tuning specified by user needs.
The Operator manages the containerized TuneD daemon for OpenShift Container Platform as a Kubernetes daemon set. It ensures the custom tuning specification is passed to all containerized TuneD daemons running in the cluster in the format that the daemons understand. The daemons run on all nodes in the cluster, one per node.
Node-level settings applied by the containerized TuneD daemon are rolled back on an event that triggers a profile change or when the containerized TuneD daemon is terminated gracefully by receiving and handling a termination signal.
The Node Tuning Operator uses the Performance Profile controller to implement automatic tuning to achieve low latency performance for OpenShift Container Platform applications. The cluster administrator configures a performance profile to define node-level settings such as the following:
- Updating the kernel to kernel-rt.
- Choosing CPUs for housekeeping.
- Choosing CPUs for running workloads.
Currently, disabling CPU load balancing is not supported by cgroup v2. As a result, you might not get the desired behavior from performance profiles if you have cgroup v2 enabled. Enabling cgroup v2 is not recommended if you are using performace profiles.
The Node Tuning Operator is part of a standard OpenShift Container Platform installation in version 4.1 and later.
In earlier versions of OpenShift Container Platform, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OpenShift applications. In OpenShift Container Platform 4.11 and later, this functionality is part of the Node Tuning Operator.
4.2. Accessing an example Node Tuning Operator specification
Use this process to access an example Node Tuning Operator specification.
Procedure
Run the following command to access an example Node Tuning Operator specification:
$ oc get Tuned/default -o yaml -n openshift-cluster-node-tuning-operator
The default CR is meant for delivering standard node-level tuning for the OpenShift Container Platform platform and it can only be modified to set the Operator Management state. Any other custom changes to the default CR will be overwritten by the Operator. For custom tuning, create your own Tuned CRs. Newly created CRs will be combined with the default CR and custom tuning applied to OpenShift Container Platform nodes based on node or pod labels and profile priorities.
While in certain situations the support for pod labels can be a convenient way of automatically delivering required tuning, this practice is discouraged and strongly advised against, especially in large-scale clusters. The default Tuned CR ships without pod label matching. If a custom profile is created with pod label matching, then the functionality will be enabled at that time. The pod label functionality will be deprecated in future versions of the Node Tuning Operator.
4.3. Default profiles set on a cluster
The following are the default profiles set on a cluster.
apiVersion: tuned.openshift.io/v1 kind: Tuned metadata: name: default namespace: openshift-cluster-node-tuning-operator spec: profile: - data: | [main] summary=Optimize systems running OpenShift (provider specific parent profile) include=-provider-${f:exec:cat:/var/lib/tuned/provider},openshift name: openshift recommend: - profile: openshift-control-plane priority: 30 match: - label: node-role.kubernetes.io/master - label: node-role.kubernetes.io/infra - profile: openshift-node priority: 40
Starting with OpenShift Container Platform 4.9, all OpenShift TuneD profiles are shipped with the TuneD package. You can use the oc exec
command to view the contents of these profiles:
$ oc exec $tuned_pod -n openshift-cluster-node-tuning-operator -- find /usr/lib/tuned/openshift{,-control-plane,-node} -name tuned.conf -exec grep -H ^ {} \;
4.4. Verifying that the TuneD profiles are applied
Verify the TuneD profiles that are applied to your cluster node.
$ oc get profile -n openshift-cluster-node-tuning-operator
Example output
NAME TUNED APPLIED DEGRADED AGE master-0 openshift-control-plane True False 6h33m master-1 openshift-control-plane True False 6h33m master-2 openshift-control-plane True False 6h33m worker-a openshift-node True False 6h28m worker-b openshift-node True False 6h28m
-
NAME
: Name of the Profile object. There is one Profile object per node and their names match. -
TUNED
: Name of the desired TuneD profile to apply. -
APPLIED
:True
if the TuneD daemon applied the desired profile. (True/False/Unknown
). -
DEGRADED
:True
if any errors were reported during application of the TuneD profile (True/False/Unknown
). -
AGE
: Time elapsed since the creation of Profile object.
4.5. Custom tuning specification
The custom resource (CR) for the Operator has two major sections. The first section, profile:
, is a list of TuneD profiles and their names. The second, recommend:
, defines the profile selection logic.
Multiple custom tuning specifications can co-exist as multiple CRs in the Operator’s namespace. The existence of new CRs or the deletion of old CRs is detected by the Operator. All existing custom tuning specifications are merged and appropriate objects for the containerized TuneD daemons are updated.
Management state
The Operator Management state is set by adjusting the default Tuned CR. By default, the Operator is in the Managed state and the spec.managementState
field is not present in the default Tuned CR. Valid values for the Operator Management state are as follows:
- Managed: the Operator will update its operands as configuration resources are updated
- Unmanaged: the Operator will ignore changes to the configuration resources
- Removed: the Operator will remove its operands and resources the Operator provisioned
Profile data
The profile:
section lists TuneD profiles and their names.
profile: - name: tuned_profile_1 data: | # TuneD profile specification [main] summary=Description of tuned_profile_1 profile [sysctl] net.ipv4.ip_forward=1 # ... other sysctl's or other TuneD daemon plugins supported by the containerized TuneD # ... - name: tuned_profile_n data: | # TuneD profile specification [main] summary=Description of tuned_profile_n profile # tuned_profile_n profile settings
Recommended profiles
The profile:
selection logic is defined by the recommend:
section of the CR. The recommend:
section is a list of items to recommend the profiles based on a selection criteria.
recommend: <recommend-item-1> # ... <recommend-item-n>
The individual items of the list:
- machineConfigLabels: 1 <mcLabels> 2 match: 3 <match> 4 priority: <priority> 5 profile: <tuned_profile_name> 6 operand: 7 debug: <bool> 8 tunedConfig: reapply_sysctl: <bool> 9
- 1
- Optional.
- 2
- A dictionary of key/value
MachineConfig
labels. The keys must be unique. - 3
- If omitted, profile match is assumed unless a profile with a higher priority matches first or
machineConfigLabels
is set. - 4
- An optional list.
- 5
- Profile ordering priority. Lower numbers mean higher priority (
0
is the highest priority). - 6
- A TuneD profile to apply on a match. For example
tuned_profile_1
. - 7
- Optional operand configuration.
- 8
- Turn debugging on or off for the TuneD daemon. Options are
true
for on orfalse
for off. The default isfalse
. - 9
- Turn
reapply_sysctl
functionality on or off for the TuneD daemon. Options aretrue
for on andfalse
for off.
<match>
is an optional list recursively defined as follows:
- label: <label_name> 1 value: <label_value> 2 type: <label_type> 3 <match> 4
If <match>
is not omitted, all nested <match>
sections must also evaluate to true
. Otherwise, false
is assumed and the profile with the respective <match>
section will not be applied or recommended. Therefore, the nesting (child <match>
sections) works as logical AND operator. Conversely, if any item of the <match>
list matches, the entire <match>
list evaluates to true
. Therefore, the list acts as logical OR operator.
If machineConfigLabels
is defined, machine config pool based matching is turned on for the given recommend:
list item. <mcLabels>
specifies the labels for a machine config. The machine config is created automatically to apply host settings, such as kernel boot parameters, for the profile <tuned_profile_name>
. This involves finding all machine config pools with machine config selector matching <mcLabels>
and setting the profile <tuned_profile_name>
on all nodes that are assigned the found machine config pools. To target nodes that have both master and worker roles, you must use the master role.
The list items match
and machineConfigLabels
are connected by the logical OR operator. The match
item is evaluated first in a short-circuit manner. Therefore, if it evaluates to true
, the machineConfigLabels
item is not considered.
When using machine config pool based matching, it is advised to group nodes with the same hardware configuration into the same machine config pool. Not following this practice might result in TuneD operands calculating conflicting kernel parameters for two or more nodes sharing the same machine config pool.
Example: node or pod label based matching
- match: - label: tuned.openshift.io/elasticsearch match: - label: node-role.kubernetes.io/master - label: node-role.kubernetes.io/infra type: pod priority: 10 profile: openshift-control-plane-es - match: - label: node-role.kubernetes.io/master - label: node-role.kubernetes.io/infra priority: 20 profile: openshift-control-plane - priority: 30 profile: openshift-node
The CR above is translated for the containerized TuneD daemon into its recommend.conf
file based on the profile priorities. The profile with the highest priority (10
) is openshift-control-plane-es
and, therefore, it is considered first. The containerized TuneD daemon running on a given node looks to see if there is a pod running on the same node with the tuned.openshift.io/elasticsearch
label set. If not, the entire <match>
section evaluates as false
. If there is such a pod with the label, in order for the <match>
section to evaluate to true
, the node label also needs to be node-role.kubernetes.io/master
or node-role.kubernetes.io/infra
.
If the labels for the profile with priority 10
matched, openshift-control-plane-es
profile is applied and no other profile is considered. If the node/pod label combination did not match, the second highest priority profile (openshift-control-plane
) is considered. This profile is applied if the containerized TuneD pod runs on a node with labels node-role.kubernetes.io/master
or node-role.kubernetes.io/infra
.
Finally, the profile openshift-node
has the lowest priority of 30
. It lacks the <match>
section and, therefore, will always match. It acts as a profile catch-all to set openshift-node
profile, if no other profile with higher priority matches on a given node.

Example: machine config pool based matching
apiVersion: tuned.openshift.io/v1 kind: Tuned metadata: name: openshift-node-custom namespace: openshift-cluster-node-tuning-operator spec: profile: - data: | [main] summary=Custom OpenShift node profile with an additional kernel parameter include=openshift-node [bootloader] cmdline_openshift_node_custom=+skew_tick=1 name: openshift-node-custom recommend: - machineConfigLabels: machineconfiguration.openshift.io/role: "worker-custom" priority: 20 profile: openshift-node-custom
To minimize node reboots, label the target nodes with a label the machine config pool’s node selector will match, then create the Tuned CR above and finally create the custom machine config pool itself.
Cloud provider-specific TuneD profiles
With this functionality, all Cloud provider-specific nodes can conveniently be assigned a TuneD profile specifically tailored to a given Cloud provider on a OpenShift Container Platform cluster. This can be accomplished without adding additional node labels or grouping nodes into machine config pools.
This functionality takes advantage of spec.providerID
node object values in the form of <cloud-provider>://<cloud-provider-specific-id>
and writes the file /var/lib/tuned/provider
with the value <cloud-provider>
in NTO operand containers. The content of this file is then used by TuneD to load provider-<cloud-provider>
profile if such profile exists.
The openshift
profile that both openshift-control-plane
and openshift-node
profiles inherit settings from is now updated to use this functionality through the use of conditional profile loading. Neither NTO nor TuneD currently ship any Cloud provider-specific profiles. However, it is possible to create a custom profile provider-<cloud-provider>
that will be applied to all Cloud provider-specific cluster nodes.
Example GCE Cloud provider profile
apiVersion: tuned.openshift.io/v1 kind: Tuned metadata: name: provider-gce namespace: openshift-cluster-node-tuning-operator spec: profile: - data: | [main] summary=GCE Cloud provider-specific profile # Your tuning for GCE Cloud provider goes here. name: provider-gce
Due to profile inheritance, any setting specified in the provider-<cloud-provider>
profile will be overwritten by the openshift
profile and its child profiles.
4.6. Custom tuning examples
Using TuneD profiles from the default CR
The following CR applies custom node-level tuning for OpenShift Container Platform nodes with label tuned.openshift.io/ingress-node-label
set to any value.
Example: custom tuning using the openshift-control-plane TuneD profile
apiVersion: tuned.openshift.io/v1 kind: Tuned metadata: name: ingress namespace: openshift-cluster-node-tuning-operator spec: profile: - data: | [main] summary=A custom OpenShift ingress profile include=openshift-control-plane [sysctl] net.ipv4.ip_local_port_range="1024 65535" net.ipv4.tcp_tw_reuse=1 name: openshift-ingress recommend: - match: - label: tuned.openshift.io/ingress-node-label priority: 10 profile: openshift-ingress
Custom profile writers are strongly encouraged to include the default TuneD daemon profiles shipped within the default Tuned CR. The example above uses the default openshift-control-plane
profile to accomplish this.
Using built-in TuneD profiles
Given the successful rollout of the NTO-managed daemon set, the TuneD operands all manage the same version of the TuneD daemon. To list the built-in TuneD profiles supported by the daemon, query any TuneD pod in the following way:
$ oc exec $tuned_pod -n openshift-cluster-node-tuning-operator -- find /usr/lib/tuned/ -name tuned.conf -printf '%h\n' | sed 's|^.*/||'
You can use the profile names retrieved by this in your custom tuning specification.
Example: using built-in hpc-compute TuneD profile
apiVersion: tuned.openshift.io/v1 kind: Tuned metadata: name: openshift-node-hpc-compute namespace: openshift-cluster-node-tuning-operator spec: profile: - data: | [main] summary=Custom OpenShift node profile for HPC compute workloads include=openshift-node,hpc-compute name: openshift-node-hpc-compute recommend: - match: - label: tuned.openshift.io/openshift-node-hpc-compute priority: 20 profile: openshift-node-hpc-compute
In addition to the built-in hpc-compute
profile, the example above includes the openshift-node
TuneD daemon profile shipped within the default Tuned CR to use OpenShift-specific tuning for compute nodes.
4.7. Supported TuneD daemon plugins
Excluding the [main]
section, the following TuneD plugins are supported when using custom profiles defined in the profile:
section of the Tuned CR:
- audio
- cpu
- disk
- eeepc_she
- modules
- mounts
- net
- scheduler
- scsi_host
- selinux
- sysctl
- sysfs
- usb
- video
- vm
- bootloader
There is some dynamic tuning functionality provided by some of these plugins that is not supported. The following TuneD plugins are currently not supported:
- script
- systemd
The TuneD bootloader plugin is currently supported on Red Hat Enterprise Linux CoreOS (RHCOS) 8.x worker nodes. For Red Hat Enterprise Linux (RHEL) 7.x worker nodes, the TuneD bootloader plugin is currently not supported.
See Available TuneD Plugins and Getting Started with TuneD for more information.
4.8. Configuring node tuning in a hosted cluster
Hosted control planes is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
To set node-level tuning on the nodes in your hosted cluster, you can use the Node Tuning Operator. In hosted control planes, you can configure node tuning by creating config maps that contain Tuned
objects and referencing those config maps in your node pools.
Procedure
Create a config map that contains a valid tuned manifest, and reference the manifest in a node pool. In the following example, a
Tuned
manifest defines a profile that setsvm.dirty_ratio
to 55 on nodes that contain thetuned-1-node-label
node label with any value. Save the followingConfigMap
manifest in a file namedtuned-1.yaml
:apiVersion: v1 kind: ConfigMap metadata: name: tuned-1 namespace: clusters data: tuning: | apiVersion: tuned.openshift.io/v1 kind: Tuned metadata: name: tuned-1 namespace: openshift-cluster-node-tuning-operator spec: profile: - data: | [main] summary=Custom OpenShift profile include=openshift-node [sysctl] vm.dirty_ratio="55" name: tuned-1-profile recommend: - priority: 20 profile: tuned-1-profile
NoteIf you do not add any labels to an entry in the
spec.recommend
section of the Tuned spec, node-pool-based matching is assumed, so the highest priority profile in thespec.recommend
section is applied to nodes in the pool. Although you can achieve more fine-grained node-label-based matching by setting a label value in the Tuned.spec.recommend.match
section, node labels will not persist during an upgrade unless you set the.spec.management.upgradeType
value of the node pool toInPlace
.Create the
ConfigMap
object in the management cluster:$ oc --kubeconfig="$MGMT_KUBECONFIG" create -f tuned-1.yaml
Reference the
ConfigMap
object in thespec.tuningConfig
field of the node pool, either by editing a node pool or creating one. In this example, assume that you have only oneNodePool
, namednodepool-1
, which contains 2 nodes.apiVersion: hypershift.openshift.io/v1alpha1 kind: NodePool metadata: ... name: nodepool-1 namespace: clusters ... spec: ... tuningConfig: - name: tuned-1 status: ...
NoteYou can reference the same config map in multiple node pools. In hosted control planes, the Node Tuning Operator appends a hash of the node pool name and namespace to the name of the Tuned CRs to distinguish them. Outside of this case, do not create multiple TuneD profiles of the same name in different Tuned CRs for the same hosted cluster.
Verification
Now that you have created the ConfigMap
object that contains a Tuned
manifest and referenced it in a NodePool
, the Node Tuning Operator syncs the Tuned
objects into the hosted cluster. You can verify which Tuned
objects are defined and which TuneD profiles are applied to each node.
List the
Tuned
objects in the hosted cluster:$ oc --kubeconfig="$HC_KUBECONFIG" get Tuneds -n openshift-cluster-node-tuning-operator
Example output
NAME AGE default 7m36s rendered 7m36s tuned-1 65s
List the
Profile
objects in the hosted cluster:$ oc --kubeconfig="$HC_KUBECONFIG" get Profiles -n openshift-cluster-node-tuning-operator
Example output
NAME TUNED APPLIED DEGRADED AGE nodepool-1-worker-1 tuned-1-profile True False 7m43s nodepool-1-worker-2 tuned-1-profile True False 7m14s
NoteIf no custom profiles are created, the
openshift-node
profile is applied by default.To confirm that the tuning was applied correctly, start a debug shell on a node and check the sysctl values:
$ oc --kubeconfig="$HC_KUBECONFIG" debug node/nodepool-1-worker-1 -- chroot /host sysctl vm.dirty_ratio
Example output
vm.dirty_ratio = 55
4.9. Advanced node tuning for hosted clusters by setting kernel boot parameters
Hosted control planes is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
For more advanced tuning in hosted control planes, which requires setting kernel boot parameters, you can also use the Node Tuning Operator. The following example shows how you can create a node pool with huge pages reserved.
Procedure
Create a
ConfigMap
object that contains aTuned
object manifest for creating 10 huge pages that are 2 MB in size. Save thisConfigMap
manifest in a file namedtuned-hugepages.yaml
:apiVersion: v1 kind: ConfigMap metadata: name: tuned-hugepages namespace: clusters data: tuning: | apiVersion: tuned.openshift.io/v1 kind: Tuned metadata: name: hugepages namespace: openshift-cluster-node-tuning-operator spec: profile: - data: | [main] summary=Boot time configuration for hugepages include=openshift-node [bootloader] cmdline_openshift_node_hugepages=hugepagesz=2M hugepages=50 name: openshift-node-hugepages recommend: - priority: 20 profile: openshift-node-hugepages
NoteThe
.spec.recommend.match
field is intentionally left blank. In this case, thisTuned
object is applied to all nodes in the node pool where thisConfigMap
object is referenced. Group nodes with the same hardware configuration into the same node pool. Otherwise, TuneD operands can calculate conflicting kernel parameters for two or more nodes that share the same node pool.Create the
ConfigMap
object in the management cluster:$ oc --kubeconfig="$MGMT_KUBECONFIG" create -f tuned-hugepages.yaml
Create a
NodePool
manifest YAML file, customize the upgrade type of theNodePool
, and reference theConfigMap
object that you created in thespec.tuningConfig
section. Create theNodePool
manifest and save it in a file namedhugepages-nodepool.yaml
by using thehypershift
CLI:NODEPOOL_NAME=hugepages-example INSTANCE_TYPE=m5.2xlarge NODEPOOL_REPLICAS=2 hypershift create nodepool aws \ --cluster-name $CLUSTER_NAME \ --name $NODEPOOL_NAME \ --node-count $NODEPOOL_REPLICAS \ --instance-type $INSTANCE_TYPE \ --render > hugepages-nodepool.yaml
In the
hugepages-nodepool.yaml
file, set.spec.management.upgradeType
toInPlace
, and set.spec.tuningConfig
to reference thetuned-hugepages
ConfigMap
object that you created.apiVersion: hypershift.openshift.io/v1alpha1 kind: NodePool metadata: name: hugepages-nodepool namespace: clusters ... spec: management: ... upgradeType: InPlace ... tuningConfig: - name: tuned-hugepages
NoteTo avoid the unnecessary re-creation of nodes when you apply the new
MachineConfig
objects, set.spec.management.upgradeType
toInPlace
. If you use theReplace
upgrade type, nodes are fully deleted and new nodes can replace them when you apply the new kernel boot parameters that the TuneD operand calculated.Create the
NodePool
in the management cluster:$ oc --kubeconfig="$MGMT_KUBECONFIG" create -f hugepages-nodepool.yaml
Verification
After the nodes are available, the containerized TuneD daemon calculates the required kernel boot parameters based on the applied TuneD profile. After the nodes are ready and reboot once to apply the generated MachineConfig
object, you can verify that the TuneD profile is applied and that the kernel boot parameters are set.
List the
Tuned
objects in the hosted cluster:$ oc --kubeconfig="$HC_KUBECONFIG" get Tuneds -n openshift-cluster-node-tuning-operator
Example output
NAME AGE default 123m hugepages-8dfb1fed 1m23s rendered 123m
List the
Profile
objects in the hosted cluster:$ oc --kubeconfig="$HC_KUBECONFIG" get Profiles -n openshift-cluster-node-tuning-operator
Example output
NAME TUNED APPLIED DEGRADED AGE nodepool-1-worker-1 openshift-node True False 132m nodepool-1-worker-2 openshift-node True False 131m hugepages-nodepool-worker-1 openshift-node-hugepages True False 4m8s hugepages-nodepool-worker-2 openshift-node-hugepages True False 3m57s
Both of the worker nodes in the new
NodePool
have theopenshift-node-hugepages
profile applied.To confirm that the tuning was applied correctly, start a debug shell on a node and check
/proc/cmdline
.$ oc --kubeconfig="$HC_KUBECONFIG" debug node/nodepool-1-worker-1 -- chroot /host cat /proc/cmdline
Example output
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-... hugepagesz=2M hugepages=50
Additional resources
For more information about hosted control planes, see Hosted control planes for Red Hat OpenShift Container Platform (Technology Preview).
Chapter 5. Using CPU Manager and Topology Manager
CPU Manager manages groups of CPUs and constrains workloads to specific CPUs.
CPU Manager is useful for workloads that have some of these attributes:
- Require as much CPU time as possible.
- Are sensitive to processor cache misses.
- Are low-latency network applications.
- Coordinate with other processes and benefit from sharing a single processor cache.
Topology Manager collects hints from the CPU Manager, Device Manager, and other Hint Providers to align pod resources, such as CPU, SR-IOV VFs, and other device resources, for all Quality of Service (QoS) classes on the same non-uniform memory access (NUMA) node.
Topology Manager uses topology information from the collected hints to decide if a pod can be accepted or rejected on a node, based on the configured Topology Manager policy and pod resources requested.
Topology Manager is useful for workloads that use hardware accelerators to support latency-critical execution and high throughput parallel computation.
To use Topology Manager you must configure CPU Manager with the static
policy.
5.1. Setting up CPU Manager
Procedure
Optional: Label a node:
# oc label node perf-node.example.com cpumanager=true
Edit the
MachineConfigPool
of the nodes where CPU Manager should be enabled. In this example, all workers have CPU Manager enabled:# oc edit machineconfigpool worker
Add a label to the worker machine config pool:
metadata: creationTimestamp: 2020-xx-xxx generation: 3 labels: custom-kubelet: cpumanager-enabled
Create a
KubeletConfig
,cpumanager-kubeletconfig.yaml
, custom resource (CR). Refer to the label created in the previous step to have the correct nodes updated with the new kubelet config. See themachineConfigPoolSelector
section:apiVersion: machineconfiguration.openshift.io/v1 kind: KubeletConfig metadata: name: cpumanager-enabled spec: machineConfigPoolSelector: matchLabels: custom-kubelet: cpumanager-enabled kubeletConfig: cpuManagerPolicy: static 1 cpuManagerReconcilePeriod: 5s 2
- 1
- Specify a policy:
-
none
. This policy explicitly enables the existing default CPU affinity scheme, providing no affinity beyond what the scheduler does automatically. This is the default policy. -
static
. This policy allows containers in guaranteed pods with integer CPU requests. It also limits access to exclusive CPUs on the node. Ifstatic
, you must use a lowercases
.
-
- 2
- Optional. Specify the CPU Manager reconcile frequency. The default is
5s
.
Create the dynamic kubelet config:
# oc create -f cpumanager-kubeletconfig.yaml
This adds the CPU Manager feature to the kubelet config and, if needed, the Machine Config Operator (MCO) reboots the node. To enable CPU Manager, a reboot is not needed.
Check for the merged kubelet config:
# oc get machineconfig 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet -o json | grep ownerReference -A7
Example output
"ownerReferences": [ { "apiVersion": "machineconfiguration.openshift.io/v1", "kind": "KubeletConfig", "name": "cpumanager-enabled", "uid": "7ed5616d-6b72-11e9-aae1-021e1ce18878" } ]
Check the worker for the updated
kubelet.conf
:# oc debug node/perf-node.example.com sh-4.2# cat /host/etc/kubernetes/kubelet.conf | grep cpuManager
Example output
cpuManagerPolicy: static 1 cpuManagerReconcilePeriod: 5s 2
Create a pod that requests a core or multiple cores. Both limits and requests must have their CPU value set to a whole integer. That is the number of cores that will be dedicated to this pod:
# cat cpumanager-pod.yaml
Example output
apiVersion: v1 kind: Pod metadata: generateName: cpumanager- spec: containers: - name: cpumanager image: gcr.io/google_containers/pause-amd64:3.0 resources: requests: cpu: 1 memory: "1G" limits: cpu: 1 memory: "1G" nodeSelector: cpumanager: "true"
Create the pod:
# oc create -f cpumanager-pod.yaml
Verify that the pod is scheduled to the node that you labeled:
# oc describe pod cpumanager
Example output
Name: cpumanager-6cqz7 Namespace: default Priority: 0 PriorityClassName: <none> Node: perf-node.example.com/xxx.xx.xx.xxx ... Limits: cpu: 1 memory: 1G Requests: cpu: 1 memory: 1G ... QoS Class: Guaranteed Node-Selectors: cpumanager=true
Verify that the
cgroups
are set up correctly. Get the process ID (PID) of thepause
process:# ├─init.scope │ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 17 └─kubepods.slice ├─kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice │ ├─crio-b5437308f1a574c542bdf08563b865c0345c8f8c0b0a655612c.scope │ └─32706 /pause
Pods of quality of service (QoS) tier
Guaranteed
are placed within thekubepods.slice
. Pods of other QoS tiers end up in childcgroups
ofkubepods
:# cd /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice/crio-b5437308f1ad1a7db0574c542bdf08563b865c0345c86e9585f8c0b0a655612c.scope # for i in `ls cpuset.cpus tasks` ; do echo -n "$i "; cat $i ; done
Example output
cpuset.cpus 1 tasks 32706
Check the allowed CPU list for the task:
# grep ^Cpus_allowed_list /proc/32706/status
Example output
Cpus_allowed_list: 1
Verify that another pod (in this case, the pod in the
burstable
QoS tier) on the system cannot run on the core allocated for theGuaranteed
pod:# cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podc494a073_6b77_11e9_98c0_06bba5c387ea.slice/crio-c56982f57b75a2420947f0afc6cafe7534c5734efc34157525fa9abbf99e3849.scope/cpuset.cpus 0 # oc describe node perf-node.example.com
Example output
... Capacity: attachable-volumes-aws-ebs: 39 cpu: 2 ephemeral-storage: 124768236Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 8162900Ki pods: 250 Allocatable: attachable-volumes-aws-ebs: 39 cpu: 1500m ephemeral-storage: 124768236Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 7548500Ki pods: 250 ------- ---- ------------ ---------- --------------- ------------- --- default cpumanager-6cqz7 1 (66%) 1 (66%) 1G (12%) 1G (12%) 29m Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) Resource Requests Limits -------- -------- ------ cpu 1440m (96%) 1 (66%)
This VM has two CPU cores. The
system-reserved
setting reserves 500 millicores, meaning that half of one core is subtracted from the total capacity of the node to arrive at theNode Allocatable
amount. You can see thatAllocatable CPU
is 1500 millicores. This means you can run one of the CPU Manager pods since each will take one whole core. A whole core is equivalent to 1000 millicores. If you try to schedule a second pod, the system will accept the pod, but it will never be scheduled:NAME READY STATUS RESTARTS AGE cpumanager-6cqz7 1/1 Running 0 33m cpumanager-7qc2t 0/1 Pending 0 11s
5.2. Topology Manager policies
Topology Manager aligns Pod
resources of all Quality of Service (QoS) classes by collecting topology hints from Hint Providers, such as CPU Manager and Device Manager, and using the collected hints to align the Pod
resources.
Topology Manager supports four allocation policies, which you assign in the cpumanager-enabled
custom resource (CR):
none
policy- This is the default policy and does not perform any topology alignment.
best-effort
policy-
For each container in a pod with the
best-effort
topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager stores this and admits the pod to the node. restricted
policy-
For each container in a pod with the
restricted
topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager rejects this pod from the node, resulting in a pod in aTerminated
state with a pod admission failure. single-numa-node
policy-
For each container in a pod with the
single-numa-node
topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager determines if a single NUMA Node affinity is possible. If it is, the pod is admitted to the node. If a single NUMA Node affinity is not possible, the Topology Manager rejects the pod from the node. This results in a pod in a Terminated state with a pod admission failure.
5.3. Setting up Topology Manager
To use Topology Manager, you must configure an allocation policy in the cpumanager-enabled
custom resource (CR). This file might exist if you have set up CPU Manager. If the file does not exist, you can create the file.
Prequisites
-
Configure the CPU Manager policy to be
static
.
Procedure
To activate Topololgy Manager:
Configure the Topology Manager allocation policy in the
cpumanager-enabled
custom resource (CR).$ oc edit KubeletConfig cpumanager-enabled
apiVersion: machineconfiguration.openshift.io/v1 kind: KubeletConfig metadata: name: cpumanager-enabled spec: machineConfigPoolSelector: matchLabels: custom-kubelet: cpumanager-enabled kubeletConfig: cpuManagerPolicy: static 1 cpuManagerReconcilePeriod: 5s topologyManagerPolicy: single-numa-node 2
5.4. Pod interactions with Topology Manager policies
The example Pod
specs below help illustrate pod interactions with Topology Manager.
The following pod runs in the BestEffort
QoS class because no resource requests or limits are specified.
spec: containers: - name: nginx image: nginx
The next pod runs in the Burstable
QoS class because requests are less than limits.
spec: containers: - name: nginx image: nginx resources: limits: memory: "200Mi" requests: memory: "100Mi"
If the selected policy is anything other than none
, Topology Manager would not consider either of these Pod
specifications.
The last example pod below runs in the Guaranteed QoS class because requests are equal to limits.
spec: containers: - name: nginx image: nginx resources: limits: memory: "200Mi" cpu: "2" example.com/device: "1" requests: memory: "200Mi" cpu: "2" example.com/device: "1"
Topology Manager would consider this pod. The Topology Manager would consult the hint providers, which are CPU Manager and Device Manager, to get topology hints for the pod.
Topology Manager will use this information to store the best topology for this container. In the case of this pod, CPU Manager and Device Manager will use this stored information at the resource allocation stage.
Chapter 6. Scheduling NUMA-aware workloads
Learn about NUMA-aware scheduling and how you can use it to deploy high performance workloads in an OpenShift Container Platform cluster.
NUMA-aware scheduling is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
The NUMA Resources Operator allows you to schedule high-performance workloads in the same NUMA zone. It deploys a node resources exporting agent that reports on available cluster node NUMA resources, and a secondary scheduler that manages the workloads.
6.1. About NUMA-aware scheduling
Non-Uniform Memory Access (NUMA) is a compute platform architecture that allows different CPUs to access different regions of memory at different speeds. NUMA resource topology refers to the locations of CPUs, memory, and PCI devices relative to each other in the compute node. Co-located resources are said to be in the same NUMA zone. For high-performance applications, the cluster needs to process pod workloads in a single NUMA zone.
NUMA architecture allows a CPU with multiple memory controllers to use any available memory across CPU complexes, regardless of where the memory is located. This allows for increased flexibility at the expense of performance. A CPU processing a workload using memory that is outside its NUMA zone is slower than a workload processed in a single NUMA zone. Also, for I/O-constrained workloads, the network interface on a distant NUMA zone slows down how quickly information can reach the application. High-performance workloads, such as telecommunications workloads, cannot operate to specification under these conditions. NUMA-aware scheduling aligns the requested cluster compute resources (CPUs, memory, devices) in the same NUMA zone to process latency-sensitive or high-performance workloads efficiently. NUMA-aware scheduling also improves pod density per compute node for greater resource efficiency.
The default OpenShift Container Platform pod scheduler scheduling logic considers the available resources of the entire compute node, not individual NUMA zones. If the most restrictive resource alignment is requested in the kubelet topology manager, error conditions can occur when admitting the pod to a node. Conversely, if the most restrictive resource alignment is not requested, the pod can be admitted to the node without proper resource alignment, leading to worse or unpredictable performance. For example, runaway pod creation with Topology Affinity Error
statuses can occur when the pod scheduler makes suboptimal scheduling decisions for guaranteed pod workloads by not knowing if the pod’s requested resources are available. Scheduling mismatch decisions can cause indefinite pod startup delays. Also, depending on the cluster state and resource allocation, poor pod scheduling decisions can cause extra load on the cluster because of failed startup attempts.
The NUMA Resources Operator deploys a custom NUMA resources secondary scheduler and other resources to mitigate against the shortcomings of the default OpenShift Container Platform pod scheduler. The following diagram provides a high-level overview of NUMA-aware pod scheduling.
Figure 6.1. NUMA-aware scheduling overview

- NodeResourceTopology API
-
The
NodeResourceTopology
API describes the available NUMA zone resources in each compute node. - NUMA-aware scheduler
-
The NUMA-aware secondary scheduler receives information about the available NUMA zones from the
NodeResourceTopology
API and schedules high-performance workloads on a node where it can be optimally processed. - Node topology exporter
-
The node topology exporter exposes the available NUMA zone resources for each compute node to the
NodeResourceTopology
API. The node topology exporter daemon tracks the resource allocation from the kubelet by using thePodResources
API. - PodResources API
-
The
PodResources
API is local to each node and exposes the resource topology and available resources to the kubelet.
Additional resources
- For more information about running secondary pod schedulers in your cluster and how to deploy pods with a secondary pod scheduler, see Scheduling pods using a secondary scheduler.
6.2. Installing the NUMA Resources Operator
NUMA Resources Operator deploys resources that allow you to schedule NUMA-aware workloads and deployments. You can install the NUMA Resources Operator using the OpenShift Container Platform CLI or the web console.
6.2.1. Installing the NUMA Resources Operator using the CLI
As a cluster administrator, you can install the Operator using the CLI.
Prerequisites
-
Install the OpenShift CLI (
oc
). -
Log in as a user with
cluster-admin
privileges.
Procedure
Create a namespace for the NUMA Resources Operator:
Save the following YAML in the
nro-namespace.yaml
file:apiVersion: v1 kind: Namespace metadata: name: openshift-numaresources
Create the
Namespace
CR by running the following command:$ oc create -f nro-namespace.yaml
Create the Operator group for the NUMA Resources Operator:
Save the following YAML in the
nro-operatorgroup.yaml
file:apiVersion: operators.coreos.com/v1 kind: OperatorGroup metadata: name: numaresources-operator namespace: openshift-numaresources spec: targetNamespaces: - openshift-numaresources
Create the
OperatorGroup
CR by running the following command:$ oc create -f nro-operatorgroup.yaml
Create the subscription for the NUMA Resources Operator:
Save the following YAML in the
nro-sub.yaml
file:apiVersion: operators.coreos.com/v1alpha1 kind: Subscription metadata: name: numaresources-operator namespace: openshift-numaresources spec: channel: "4.12" name: numaresources-operator source: redhat-operators sourceNamespace: openshift-marketplace
Create the
Subscription
CR by running the following command:$ oc create -f nro-sub.yaml
Verification
Verify that the installation succeeded by inspecting the CSV resource in the
openshift-numaresources
namespace. Run the following command:$ oc get csv -n openshift-numaresources
Example output
NAME DISPLAY VERSION REPLACES PHASE numaresources-operator.v4.12.2 numaresources-operator 4.12.2 Succeeded
6.2.2. Installing the NUMA Resources Operator using the web console
As a cluster administrator, you can install the NUMA Resources Operator using the web console.
Procedure
Install the NUMA Resources Operator using the OpenShift Container Platform web console:
- In the OpenShift Container Platform web console, click Operators → OperatorHub.
- Choose NUMA Resources Operator from the list of available Operators, and then click Install.
Optional: Verify that the NUMA Resources Operator installed successfully:
- Switch to the Operators → Installed Operators page.
Ensure that NUMA Resources Operator is listed in the default project with a Status of InstallSucceeded.
NoteDuring installation an Operator might display a Failed status. If the installation later succeeds with an InstallSucceeded message, you can ignore the Failed message.
If the Operator does not appear as installed, to troubleshoot further:
- Go to the Operators → Installed Operators page and inspect the Operator Subscriptions and Install Plans tabs for any failure or errors under Status.
-
Go to the Workloads → Pods page and check the logs for pods in the
default
project.
6.3. Creating the NUMAResourcesOperator custom resource
When you have installed the NUMA Resources Operator, then create the NUMAResourcesOperator
custom resource (CR) that instructs the NUMA Resources Operator to install all the cluster infrastructure needed to support the NUMA-aware scheduler, including daemon sets and APIs.
Prerequisites
-
Install the OpenShift CLI (
oc
). -
Log in as a user with
cluster-admin
privileges. - Install the NUMA Resources Operator.
Procedure
Create the
MachineConfigPool
custom resource that enables custom kubelet configurations for worker nodes:Save the following YAML in the
nro-machineconfig.yaml
file:apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfigPool metadata: labels: cnf-worker-tuning: enabled machineconfiguration.openshift.io/mco-built-in: "" pools.operator.machineconfiguration.openshift.io/worker: "" name: worker spec: machineConfigSelector: matchLabels: machineconfiguration.openshift.io/role: worker nodeSelector: matchLabels: node-role.kubernetes.io/worker: ""
Create the
MachineConfigPool
CR by running the following command:$ oc create -f nro-machineconfig.yaml
Create the
NUMAResourcesOperator
custom resource:Save the following YAML in the
nrop.yaml
file:apiVersion: nodetopology.openshift.io/v1alpha1 kind: NUMAResourcesOperator metadata: name: numaresourcesoperator spec: nodeGroups: - machineConfigPoolSelector: matchLabels: pools.operator.machineconfiguration.openshift.io/worker: "" 1
- 1
- Should match the label applied to worker nodes in the related
MachineConfigPool
CR.
Create the
NUMAResourcesOperator
CR by running the following command:$ oc create -f nrop.yaml
Verification
Verify that the NUMA Resources Operator deployed successfully by running the following command:
$ oc get numaresourcesoperators.nodetopology.openshift.io
Example output
NAME AGE numaresourcesoperator 10m
6.4. Deploying the NUMA-aware secondary pod scheduler
After you install the NUMA Resources Operator, do the following to deploy the NUMA-aware secondary pod scheduler:
- Configure the pod admittance policy for the required machine profile
- Create the required machine config pool
- Deploy the NUMA-aware secondary scheduler
Prerequisites
-
Install the OpenShift CLI (
oc
). -
Log in as a user with
cluster-admin
privileges. - Install the NUMA Resources Operator.
Procedure
Create the
KubeletConfig
custom resource that configures the pod admittance policy for the machine profile:Save the following YAML in the
nro-kubeletconfig.yaml
file:apiVersion: machineconfiguration.openshift.io/v1 kind: KubeletConfig metadata: name: cnf-worker-tuning spec: machineConfigPoolSelector: matchLabels: cnf-worker-tuning: enabled kubeletConfig: cpuManagerPolicy: "static" 1 cpuManagerReconcilePeriod: "5s" reservedSystemCPUs: "0,1" memoryManagerPolicy: "Static" 2 evictionHard: memory.available: "100Mi" kubeReserved: memory: "512Mi" reservedMemory: - numaNode: 0 limits: memory: "1124Mi" systemReserved: memory: "512Mi" topologyManagerPolicy: "single-numa-node" 3 topologyManagerScope: "pod"
Create the
KubeletConfig
custom resource (CR) by running the following command:$ oc create -f nro-kubeletconfig.yaml
Create the
NUMAResourcesScheduler
custom resource that deploys the NUMA-aware custom pod scheduler:Save the following YAML in the
nro-scheduler.yaml
file:apiVersion: nodetopology.openshift.io/v1alpha1 kind: NUMAResourcesScheduler metadata: name: numaresourcesscheduler spec: imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.12"
Create the
NUMAResourcesScheduler
CR by running the following command:$ oc create -f nro-scheduler.yaml
Verification
Verify that the required resources deployed successfully by running the following command:
$ oc get all -n openshift-numaresources
Example output
NAME READY STATUS RESTARTS AGE pod/numaresources-controller-manager-7575848485-bns4s 1/1 Running 0 13m pod/numaresourcesoperator-worker-dvj4n 2/2 Running 0 16m pod/numaresourcesoperator-worker-lcg4t 2/2 Running 0 16m pod/secondary-scheduler-56994cf6cf-7qf4q 1/1 Running 0 16m NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE daemonset.apps/numaresourcesoperator-worker 2 2 2 2 2 node-role.kubernetes.io/worker= 16m NAME READY UP-TO-DATE AVAILABLE AGE deployment.apps/numaresources-controller-manager 1/1 1 1 13m deployment.apps/secondary-scheduler 1/1 1 1 16m NAME DESIRED CURRENT READY AGE replicaset.apps/numaresources-controller-manager-7575848485 1 1 1 13m replicaset.apps/secondary-scheduler-56994cf6cf 1 1 1 16m
6.5. Scheduling workloads with the NUMA-aware scheduler
You can schedule workloads with the NUMA-aware scheduler using Deployment
CRs that specify the minimum required resources to process the workload.
The following example deployment uses NUMA-aware scheduling for a sample workload.
Prerequisites
-
Install the OpenShift CLI (
oc
). -
Log in as a user with
cluster-admin
privileges. - Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.
Procedure
Get the name of the NUMA-aware scheduler that is deployed in the cluster by running the following command:
$ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o json | jq '.status.schedulerName'
Example output
topo-aware-scheduler
Create a
Deployment
CR that uses scheduler namedtopo-aware-scheduler
, for example:Save the following YAML in the
nro-deployment.yaml
file:apiVersion: apps/v1 kind: Deployment metadata: name: numa-deployment-1 namespace: openshift-numaresources spec: replicas: 1 selector: matchLabels: app: test template: metadata: labels: app: test spec: schedulerName: topo-aware-scheduler 1 containers: - name: ctnr image: quay.io/openshifttest/hello-openshift:openshift imagePullPolicy: IfNotPresent resources: limits: memory: "100Mi" cpu: "10" requests: memory: "100Mi" cpu: "10" - name: ctnr2 image: gcr.io/google_containers/pause-amd64:3.0 imagePullPolicy: IfNotPresent command: ["/bin/sh", "-c"] args: [ "while true; do sleep 1h; done;" ] resources: limits: memory: "100Mi" cpu: "8" requests: memory: "100Mi" cpu: "8"
- 1
schedulerName
must match the name of the NUMA-aware scheduler that is deployed in your cluster, for exampletopo-aware-scheduler
.
Create the
Deployment
CR by running the following command:$ oc create -f nro-deployment.yaml
Verification
Verify that the deployment was successful:
$ oc get pods -n openshift-numaresources
Example output
NAME READY STATUS RESTARTS AGE numa-deployment-1-56954b7b46-pfgw8 2/2 Running 0 129m numaresources-controller-manager-7575848485-bns4s 1/1 Running 0 15h numaresourcesoperator-worker-dvj4n 2/2 Running 0 18h numaresourcesoperator-worker-lcg4t 2/2 Running 0 16h secondary-scheduler-56994cf6cf-7qf4q 1/1 Running 0 18h
Verify that the
topo-aware-scheduler
is scheduling the deployed pod by running the following command:$ oc describe pod numa-deployment-1-56954b7b46-pfgw8 -n openshift-numaresources
Example output
Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 130m topo-aware-scheduler Successfully assigned openshift-numaresources/numa-deployment-1-56954b7b46-pfgw8 to compute-0.example.com
NoteDeployments that request more resources than is available for scheduling will fail with a
MinimumReplicasUnavailable
error. The deployment succeeds when the required resources become available. Pods remain in thePending
state until the required resources are available.Verify that the expected allocated resources are listed for the node. Run the following command:
$ oc describe noderesourcetopologies.topology.node.k8s.io
Example output
... Zones: Costs: Name: node-0 Value: 10 Name: node-1 Value: 21 Name: node-0 Resources: Allocatable: 39 Available: 21 1 Capacity: 40 Name: cpu Allocatable: 6442450944 Available: 6442450944 Capacity: 6442450944 Name: hugepages-1Gi Allocatable: 134217728 Available: 134217728 Capacity: 134217728 Name: hugepages-2Mi Allocatable: 262415904768 Available: 262206189568 Capacity: 270146007040 Name: memory Type: Node
- 1
- The
Available
capacity is reduced because of the resources that have been allocated to the guaranteed pod.
Resources consumed by guaranteed pods are subtracted from the available node resources listed under
noderesourcetopologies.topology.node.k8s.io
.Resource allocations for pods with a
Best-effort
orBurstable
quality of service (qosClass
) are not reflected in the NUMA node resources undernoderesourcetopologies.topology.node.k8s.io
. If a pod’s consumed resources are not reflected in the node resource calculation, verify that the pod hasqosClass
ofGuaranteed
by running the following command:$ oc get pod <pod_name> -n <pod_namespace> -o jsonpath="{ .status.qosClass }"
Example output
Guaranteed
6.6. Troubleshooting NUMA-aware scheduling
To troubleshoot common problems with NUMA-aware pod scheduling, perform the following steps.
Prerequisites
-
Install the OpenShift Container Platform CLI (
oc
). - Log in as a user with cluster-admin privileges.
- Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.
Procedure
Verify that the
noderesourcetopologies
CRD is deployed in the cluster by running the following command:$ oc get crd | grep noderesourcetopologies
Example output
NAME CREATED AT noderesourcetopologies.topology.node.k8s.io 2022-01-18T08:28:06Z
Check that the NUMA-aware scheduler name matches the name specified in your NUMA-aware workloads by running the following command:
$ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o json | jq '.status.schedulerName'
Example output
topo-aware-scheduler
Verify that NUMA-aware scheduable nodes have the
noderesourcetopologies
CR applied to them. Run the following command:$ oc get noderesourcetopologies.topology.node.k8s.io
Example output
NAME AGE compute-0.example.com 17h compute-1.example.com 17h
NoteThe number of nodes should equal the number of worker nodes that are configured by the machine config pool (
mcp
) worker definition.Verify the NUMA zone granularity for all scheduable nodes by running the following command:
$ oc get noderesourcetopologies.topology.node.k8s.io -o yaml
Example output
apiVersion: v1 items: - apiVersion: topology.node.k8s.io/v1alpha1 kind: NodeResourceTopology metadata: annotations: k8stopoawareschedwg/rte-update: periodic creationTimestamp: "2022-06-16T08:55:38Z" generation: 63760 name: worker-0 resourceVersion: "8450223" uid: 8b77be46-08c0-4074-927b-d49361471590 topologyPolicies: - SingleNUMANodeContainerLevel zones: - costs: - name: node-0 value: 10 - name: node-1 value: 21 name: node-0 resources: - allocatable: "38" available: "38" capacity: "40" name: cpu - allocatable: "134217728" available: "134217728" capacity: "134217728" name: hugepages-2Mi - allocatable: "262352048128" available: "262352048128" capacity: "270107316224" name: memory - allocatable: "6442450944" available: "6442450944" capacity: "6442450944" name: hugepages-1Gi type: Node - costs: - name: node-0 value: 21 - name: node-1 value: 10 name: node-1 resources: - allocatable: "268435456" available: "268435456" capacity: "268435456" name: hugepages-2Mi - allocatable: "269231067136" available: "269231067136" capacity: "270573244416" name: memory - allocatable: "40" available: "40" capacity: "40" name: cpu - allocatable: "1073741824" available: "1073741824" capacity: "1073741824" name: hugepages-1Gi type: Node - apiVersion: topology.node.k8s.io/v1alpha1 kind: NodeResourceTopology metadata: annotations: k8stopoawareschedwg/rte-update: periodic creationTimestamp: "2022-06-16T08:55:37Z" generation: 62061 name: worker-1 resourceVersion: "8450129" uid: e8659390-6f8d-4e67-9a51-1ea34bba1cc3 topologyPolicies: - SingleNUMANodeContainerLevel zones: 1 - costs: - name: node-0 value: 10 - name: node-1 value: 21 name: node-0 resources: 2 - allocatable: "38" available: "38" capacity: "40" name: cpu - allocatable: "6442450944" available: "6442450944" capacity: "6442450944" name: hugepages-1Gi - allocatable: "134217728" available: "134217728" capacity: "134217728" name: hugepages-2Mi - allocatable: "262391033856" available: "262391033856" capacity: "270146301952" name: memory type: Node - costs: - name: node-0 value: 21 - name: node-1 value: 10 name: node-1 resources: - allocatable: "40" available: "40" capacity: "40" name: cpu - allocatable: "1073741824" available: "1073741824" capacity: "1073741824" name: hugepages-1Gi - allocatable: "268435456" available: "268435456" capacity: "268435456" name: hugepages-2Mi - allocatable: "269192085504" available: "269192085504" capacity: "270534262784" name: memory type: Node kind: List metadata: resourceVersion: "" selfLink: ""
6.6.1. Checking the NUMA-aware scheduler logs
Troubleshoot problems with the NUMA-aware scheduler by reviewing the logs. If required, you can increase the scheduler log level by modifying the spec.logLevel
field of the NUMAResourcesScheduler
resource. Acceptable values are Normal
, Debug
, and Trace
, with Trace
being the most verbose option.
To change the log level of the secondary scheduler, delete the running scheduler resource and re-deploy it with the changed log level. The scheduler is unavailable for scheduling new workloads during this downtime.
Prerequisites
-
Install the OpenShift CLI (
oc
). -
Log in as a user with
cluster-admin
privileges.
Procedure
Delete the currently running
NUMAResourcesScheduler
resource:Get the active
NUMAResourcesScheduler
by running the following command:$ oc get NUMAResourcesScheduler
Example output
NAME AGE numaresourcesscheduler 90m
Delete the secondary scheduler resource by running the following command:
$ oc delete NUMAResourcesScheduler numaresourcesscheduler
Example output
numaresourcesscheduler.nodetopology.openshift.io "numaresourcesscheduler" deleted
Save the following YAML in the file
nro-scheduler-debug.yaml
. This example changes the log level toDebug
:apiVersion: nodetopology.openshift.io/v1alpha1 kind: NUMAResourcesScheduler metadata: name: numaresourcesscheduler spec: imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.12" logLevel: Debug
Create the updated
Debug
loggingNUMAResourcesScheduler
resource by running the following command:$ oc create -f nro-scheduler-debug.yaml
Example output
numaresourcesscheduler.nodetopology.openshift.io/numaresourcesscheduler created
Verification steps
Check that the NUMA-aware scheduler was successfully deployed:
Run the following command to check that the CRD is created succesfully:
$ oc get crd | grep numaresourcesschedulers
Example output
NAME CREATED AT numaresourcesschedulers.nodetopology.openshift.io 2022-02-25T11:57:03Z
Check that the new custom scheduler is available by running the following command:
$ oc get numaresourcesschedulers.nodetopology.openshift.io
Example output
NAME AGE numaresourcesscheduler 3h26m
Check that the logs for the scheduler shows the increased log level:
Get the list of pods running in the
openshift-numaresources
namespace by running the following command:$ oc get pods -n openshift-numaresources
Example output
NAME READY STATUS RESTARTS AGE numaresources-controller-manager-d87d79587-76mrm 1/1 Running 0 46h numaresourcesoperator-worker-5wm2k 2/2 Running 0 45h numaresourcesoperator-worker-pb75c 2/2 Running 0 45h secondary-scheduler-7976c4d466-qm4sc 1/1 Running 0 21m
Get the logs for the secondary scheduler pod by running the following command:
$ oc logs secondary-scheduler-7976c4d466-qm4sc -n openshift-numaresources
Example output
... I0223 11:04:55.614788 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 11 items received I0223 11:04:56.609114 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.ReplicationController total 10 items received I0223 11:05:22.626818 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.StorageClass total 7 items received I0223 11:05:31.610356 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PodDisruptionBudget total 7 items received I0223 11:05:31.713032 1 eventhandlers.go:186] "Add event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq" I0223 11:05:53.461016 1 eventhandlers.go:244] "Delete event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
6.6.2. Troubleshooting the resource topology exporter
Troubleshoot noderesourcetopologies
objects where unexpected results are occurring by inspecting the corresponding resource-topology-exporter
logs.
It is recommended that NUMA resource topology exporter instances in the cluster are named for nodes they refer to. For example, a worker node with the name worker
should have a corresponding noderesourcetopologies
object called worker
.
Prerequisites
-
Install the OpenShift CLI (
oc
). -
Log in as a user with
cluster-admin
privileges.
Procedure
Get the daemonsets managed by the NUMA Resources Operator. Each daemonset has a corresponding
nodeGroup
in theNUMAResourcesOperator
CR. Run the following command:$ oc get numaresourcesoperators.nodetopology.openshift.io numaresourcesoperator -o jsonpath="{.status.daemonsets[0]}"
Example output
{"name":"numaresourcesoperator-worker","namespace":"openshift-numaresources"}
Get the label for the daemonset of interest using the value for
name
from the previous step:$ oc get ds -n openshift-numaresources numaresourcesoperator-worker -o jsonpath="{.spec.selector.matchLabels}"
Example output
{"name":"resource-topology"}
Get the pods using the
resource-topology
label by running the following command:$ oc get pods -n openshift-numaresources -l name=resource-topology -o wide
Example output
NAME READY STATUS RESTARTS AGE IP NODE numaresourcesoperator-worker-5wm2k 2/2 Running 0 2d1h 10.135.0.64 compute-0.example.com numaresourcesoperator-worker-pb75c 2/2 Running 0 2d1h 10.132.2.33 compute-1.example.com
Examine the logs of the
resource-topology-exporter
container running on the worker pod that corresponds to the node you are troubleshooting. Run the following command:$ oc logs -n openshift-numaresources -c resource-topology-exporter numaresourcesoperator-worker-pb75c
Example output
I0221 13:38:18.334140 1 main.go:206] using sysinfo: reservedCpus: 0,1 reservedMemory: "0": 1178599424 I0221 13:38:18.334370 1 main.go:67] === System information === I0221 13:38:18.334381 1 sysinfo.go:231] cpus: reserved "0-1" I0221 13:38:18.334493 1 sysinfo.go:237] cpus: online "0-103" I0221 13:38:18.546750 1 main.go:72] cpus: allocatable "2-103" hugepages-1Gi: numa cell 0 -> 6 numa cell 1 -> 1 hugepages-2Mi: numa cell 0 -> 64 numa cell 1 -> 128 memory: numa cell 0 -> 45758Mi numa cell 1 -> 48372Mi
6.6.3. Correcting a missing resource topology exporter config map
If you install the NUMA Resources Operator in a cluster with misconfigured cluster settings, in some circumstances, the Operator is shown as active but the logs of the resource topology exporter (RTE) daemon set pods show that the configuration for the RTE is missing, for example:
Info: couldn't find configuration in "/etc/resource-topology-exporter/config.yaml"
This log message indicates that the kubeletconfig
with the required configuration was not properly applied in the cluster, resulting in a missing RTE configmap
. For example, the following cluster is missing a numaresourcesoperator-worker
configmap
custom resource (CR):
$ oc get configmap
Example output
NAME DATA AGE 0e2a6bd3.openshift-kni.io 0 6d21h kube-root-ca.crt 1 6d21h openshift-service-ca.crt 1 6d21h topo-aware-scheduler-config 1 6d18h
In a correctly configured cluster, oc get configmap
also returns a numaresourcesoperator-worker
configmap
CR.
Prerequisites
-
Install the OpenShift Container Platform CLI (
oc
). - Log in as a user with cluster-admin privileges.
- Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.
Procedure
Compare the values for
spec.machineConfigPoolSelector.matchLabels
inkubeletconfig
andmetadata.labels
in theMachineConfigPool
(mcp
) worker CR using the following commands:Check the
kubeletconfig
labels by running the following command:$ oc get kubeletconfig -o yaml
Example output
machineConfigPoolSelector: matchLabels: cnf-worker-tuning: enabled
Check the
mcp
labels by running the following command:$ oc get mcp worker -o yaml
Example output
labels: machineconfiguration.openshift.io/mco-built-in: "" pools.operator.machineconfiguration.openshift.io/worker: ""
The
cnf-worker-tuning: enabled
label is not present in theMachineConfigPool
object.
Edit the
MachineConfigPool
CR to include the missing label, for example:$ oc edit mcp worker -o yaml
Example output
labels: machineconfiguration.openshift.io/mco-built-in: "" pools.operator.machineconfiguration.openshift.io/worker: "" cnf-worker-tuning: enabled
- Apply the label changes and wait for the cluster to apply the updated configuration. Run the following command:
Verification
Check that the missing
numaresourcesoperator-worker
configmap
CR is applied:$ oc get configmap
Example output
NAME DATA AGE 0e2a6bd3.openshift-kni.io 0 6d21h kube-root-ca.crt 1 6d21h numaresourcesoperator-worker 1 5m openshift-service-ca.crt 1 6d21h topo-aware-scheduler-config 1 6d18h
Chapter 7. Scalability and performance optimization
7.1. Optimizing storage
Optimizing storage helps to minimize storage use across all resources. By optimizing storage, administrators help ensure that existing storage resources are working in an efficient manner.
7.1.1. Available persistent storage options
Understand your persistent storage options so that you can optimize your OpenShift Container Platform environment.
Table 7.1. Available storage options
Storage type | Description | Examples |
---|---|---|
Block |
| AWS EBS and VMware vSphere support dynamic persistent volume (PV) provisioning natively in OpenShift Container Platform. |
File |
| RHEL NFS, NetApp NFS [1], and Vendor NFS |
Object |
| AWS S3 |
- NetApp NFS supports dynamic PV provisioning when using the Trident plugin.
Currently, CNS is not supported in OpenShift Container Platform 4.12.
7.1.2. Recommended configurable storage technology
The following table summarizes the recommended and configurable storage technologies for the given OpenShift Container Platform cluster application.
Table 7.2. Recommended and configurable storage technology
Storage type | Block | File | Object |
---|---|---|---|
1
2 3 Prometheus is the underlying technology used for metrics. 4 This does not apply to physical disk, VM physical disk, VMDK, loopback over NFS, AWS EBS, and Azure Disk.
5 For metrics, using file storage with the 6 For logging, review the recommended storage solution in Configuring persistent storage for the log store section. Using NFS storage as a persistent volume or through NAS, such as Gluster, can corrupt the data. Hence, NFS is not supported for Elasticsearch storage and LokiStack log store in OpenShift Container Platform Logging. You must use one persistent volume type per log store. 7 Object storage is not consumed through OpenShift Container Platform’s PVs or PVCs. Apps must integrate with the object storage REST API. | |||
ROX1 | Yes4 | Yes4 | Yes |
RWX2 | No | Yes | Yes |
Registry | Configurable | Configurable | Recommended |
Scaled registry | Not configurable | Configurable | Recommended |
Metrics3 | Recommended | Configurable5 | Not configurable |
Elasticsearch Logging | Recommended | Configurable6 | Not supported6 |
Loki Logging | Configurable | Not configurable | Recommended |
Apps | Recommended | Recommended | Not configurable7 |
A scaled registry is an OpenShift image registry where two or more pod replicas are running.
7.1.2.1. Specific application storage recommendations
Testing shows issues with using the NFS server on Red Hat Enterprise Linux (RHEL) as storage backend for core services. This includes the OpenShift Container Registry and Quay, Prometheus for monitoring storage, and Elasticsearch for logging storage. Therefore, using RHEL NFS to back PVs used by core services is not recommended.
Other NFS implementations on the marketplace might not have these issues. Contact the individual NFS implementation vendor for more information on any testing that was possibly completed against these OpenShift Container Platform core components.
7.1.2.1.1. Registry
In a non-scaled/high-availability (HA) OpenShift image registry cluster deployment:
- The storage technology does not have to support RWX access mode.
- The storage technology must ensure read-after-write consistency.
- The preferred storage technology is object storage followed by block storage.
- File storage is not recommended for OpenShift image registry cluster deployment with production workloads.
7.1.2.1.2. Scaled registry
In a scaled/HA OpenShift image registry cluster deployment:
- The storage technology must support RWX access mode.
- The storage technology must ensure read-after-write consistency.
- The preferred storage technology is object storage.
- Red Hat OpenShift Data Foundation (ODF), Amazon Simple Storage Service (Amazon S3), Google Cloud Storage (GCS), Microsoft Azure Blob Storage, and OpenStack Swift are supported.
- Object storage should be S3 or Swift compliant.
- For non-cloud platforms, such as vSphere and bare metal installations, the only configurable technology is file storage.
- Block storage is not configurable.
7.1.2.1.3. Metrics
In an OpenShift Container Platform hosted metrics cluster deployment:
- The preferred storage technology is block storage.
- Object storage is not configurable.
It is not recommended to use file storage for a hosted metrics cluster deployment with production workloads.
7.1.2.1.4. Logging
In an OpenShift Container Platform hosted logging cluster deployment:
- The preferred storage technology is block storage.
- Object storage is not configurable.
7.1.2.1.5. Applications
Application use cases vary from application to application, as described in the following examples:
- Storage technologies that support dynamic PV provisioning have low mount time latencies, and are not tied to nodes to support a healthy cluster.
- Application developers are responsible for knowing and understanding the storage requirements for their application, and how it works with the provided storage to ensure that issues do not occur when an application scales or interacts with the storage layer.
7.1.2.2. Other specific application storage recommendations
It is not recommended to use RAID configurations on Write
intensive workloads, such as etcd
. If you are running etcd
with a RAID configuration, you might be at risk of encountering performance issues with your workloads.
- Red Hat OpenStack Platform (RHOSP) Cinder: RHOSP Cinder tends to be adept in ROX access mode use cases.
- Databases: Databases (RDBMSs, NoSQL DBs, etc.) tend to perform best with dedicated block storage.
- The etcd database must have enough storage and adequate performance capacity to enable a large cluster. Information about monitoring and benchmarking tools to establish ample storage and a high-performance environment is described in Recommended etcd practices.
7.1.3. Data storage management
The following table summarizes the main directories that OpenShift Container Platform components write data to.
Table 7.3. Main directories for storing OpenShift Container Platform data
Directory | Notes | Sizing | Expected growth |
---|---|---|---|
/var/log | Log files for all components. | 10 to 30 GB. | Log files can grow quickly; size can be managed by growing disks or by using log rotate. |
/var/lib/etcd | Used for etcd storage when storing the database. | Less than 20 GB. Database can grow up to 8 GB. | Will grow slowly with the environment. Only storing metadata. Additional 20-25 GB for every additional 8 GB of memory. |
/var/lib/containers | This is the mount point for the CRI-O runtime. Storage used for active container runtimes, including pods, and storage of local images. Not used for registry storage. | 50 GB for a node with 16 GB memory. Note that this sizing should not be used to determine minimum cluster requirements. Additional 20-25 GB for every additional 8 GB of memory. | Growth is limited by capacity for running containers. |
/var/lib/kubelet | Ephemeral volume storage for pods. This includes anything external that is mounted into a container at runtime. Includes environment variables, kube secrets, and data volumes not backed by persistent volumes. | Varies | Minimal if pods requiring storage are using persistent volumes. If using ephemeral storage, this can grow quickly. |
7.1.4. Optimizing storage performance for Microsoft Azure
OpenShift Container Platform and Kubernetes are sensitive to disk performance, and faster storage is recommended, particularly for etcd on the control plane nodes.
For production Azure clusters and clusters with intensive workloads, the virtual machine operating system disk for control plane machines should be able to sustain a tested and recommended minimum throughput of 5000 IOPS / 200MBps. This throughput can be provided by having a minimum of 1 TiB Premium SSD (P30). In Azure and Azure Stack Hub, disk performance is directly dependent on SSD disk sizes. To achieve the throughput supported by a Standard_D8s_v3
virtual machine, or other similar machine types, and the target of 5000 IOPS, at least a P30 disk is required.
Host caching must be set to ReadOnly
for low latency and high IOPS and throughput when reading data. Reading data from the cache, which is present either in the VM memory or in the local SSD disk, is much faster than reading from the disk, which is in the blob storage.
7.1.5. Additional resources
7.2. Optimizing routing
The OpenShift Container Platform HAProxy router can be scaled or configured to optimize performance.
7.2.1. Baseline Ingress Controller (router) performance
The OpenShift Container Platform Ingress Controller, or router, is the ingress point for ingress traffic for applications and services that are configured using routes and ingresses.
When evaluating a single HAProxy router performance in terms of HTTP requests handled per second, the performance varies depending on many factors. In particular:
- HTTP keep-alive/close mode
- Route type
- TLS session resumption client support
- Number of concurrent connections per target route
- Number of target routes
- Back end server page size
- Underlying infrastructure (network/SDN solution, CPU, and so on)
While performance in your specific environment will vary, Red Hat lab tests on a public cloud instance of size 4 vCPU/16GB RAM. A single HAProxy router handling 100 routes terminated by backends serving 1kB static pages is able to handle the following number of transactions per second.
In HTTP keep-alive mode scenarios:
Encryption | LoadBalancerService | HostNetwork |
---|---|---|
none | 21515 | 29622 |
edge | 16743 | 22913 |
passthrough | 36786 | 53295 |
re-encrypt | 21583 | 25198 |
In HTTP close (no keep-alive) scenarios:
Encryption | LoadBalancerService | HostNetwork |
---|---|---|
none | 5719 | 8273 |
edge | 2729 | 4069 |
passthrough | 4121 | 5344 |
re-encrypt | 2320 | 2941 |
The default Ingress Controller configuration was used with the spec.tuningOptions.threadCount
field set to 4
. Two different endpoint publishing strategies were tested: Load Balancer Service and Host Network. TLS session resumption was used for encrypted routes. With HTTP keep-alive, a single HAProxy router is capable of saturating a 1 Gbit NIC at page sizes as small as 8 kB.
When running on bare metal with modern processors, you can expect roughly twice the performance of the public cloud instance above. This overhead is introduced by the virtualization layer in place on public clouds and holds mostly true for private cloud-based virtualization as well. The following table is a guide to how many applications to use behind the router:
Number of applications | Application type |
---|---|
5-10 | static file/web server or caching proxy |
100-1000 | applications generating dynamic content |
In general, HAProxy can support routes for up to 1000 applications, depending on the technology in use. Ingress Controller performance might be limited by the capabilities and performance of the applications behind it, such as language or static versus dynamic content.
Ingress, or router, sharding should be used to serve more routes towards applications and help horizontally scale the routing tier.
For more information on Ingress sharding, see Configuring Ingress Controller sharding by using route labels and Configuring Ingress Controller sharding by using namespace labels.
You can modify the Ingress Controller deployment using the information provided in Setting Ingress Controller thread count for threads and Ingress Controller configuration parameters for timeouts, and other tuning configurations in the Ingress Controller specification.
7.2.2. Configuring Ingress Controller liveness, readiness, and startup probes
Cluster administrators can configure the timeout values for the kubelet’s liveness, readiness, and startup probes for router deployments that are managed by the OpenShift Container Platform Ingress Controller (router). The liveness and readiness probes of the router use the default timeout value of 1 second, which is too brief when networking or runtime performance is severely degraded. Probe timeouts can cause unwanted router restarts that interrupt application connections. The ability to set larger timeout values can reduce the risk of unnecessary and unwanted restarts.
You can update the timeoutSeconds
value on the livenessProbe
, readinessProbe
, and startupProbe
parameters of the router container.
Parameter | Description |
---|---|
|
The |
|
The |
|
The |
The timeout configuration option is an advanced tuning technique that can be used to work around issues. However, these issues should eventually be diagnosed and possibly a support case or Jira issue opened for any issues that causes probes to time out.
The following example demonstrates how you can directly patch the default router deployment to set a 5-second timeout for the liveness and readiness probes:
$ oc -n openshift-ingress patch deploy/router-default --type=strategic --patch='{"spec":{"template":{"spec":{"containers":[{"name":"router","livenessProbe":{"timeoutSeconds":5},"readinessProbe":{"timeoutSeconds":5}}]}}}}'
Verification
$ oc -n openshift-ingress describe deploy/router-default | grep -e Liveness: -e Readiness: Liveness: http-get http://:1936/healthz delay=0s timeout=5s period=10s #success=1 #failure=3 Readiness: http-get http://:1936/healthz/ready delay=0s timeout=5s period=10s #success=1 #failure=3
7.2.3. Configuring HAProxy reload interval
When you update a route or an endpoint associated with a route, OpenShift Container Platform router updates the configuration for HAProxy. Then, HAProxy reloads the updated configuration for those changes to take effect. When HAProxy reloads, it generates a new process that handles new connections using the updated configuration.
HAProxy keeps the old process running to handle existing connections until those connections are all closed. When old processes have long-lived connections, these processes can accumulate and consume resources.
The default minimum HAProxy reload interval is five seconds. You can configure an Ingress Controller using its spec.tuningOptions.reloadInterval
field to set a longer minimum reload interval.
Setting a large value for the minimum HAProxy reload interval can cause latency in observing updates to routes and their endpoints. To lessen the risk, avoid setting a value larger than the tolerable latency for updates.
Procedure
Change the minimum HAProxy reload interval of the default Ingress Controller to 15 seconds by running the following command:
$ oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"tuningOptions":{"reloadInterval":"15s"}}}'
7.3. Optimizing networking
The OpenShift SDN uses OpenvSwitch, virtual extensible LAN (VXLAN) tunnels, OpenFlow rules, and iptables. This network can be tuned by using jumbo frames, network interface controllers (NIC) offloads, multi-queue, and ethtool settings.
OVN-Kubernetes uses Geneve (Generic Network Virtualization Encapsulation) instead of VXLAN as the tunnel protocol.
VXLAN provides benefits over VLANs, such as an increase in networks from 4096 to over 16 million, and layer 2 connectivity across physical networks. This allows for all pods behind a service to communicate with each other, even if they are running on different systems.
VXLAN encapsulates all tunneled traffic in user datagram protocol (UDP) packets. However, this leads to increased CPU utilization. Both these outer- and inner-packets are subject to normal checksumming rules to guarantee data is not corrupted during transit. Depending on CPU performance, this additional processing overhead can cause a reduction in throughput and increased latency when compared to traditional, non-overlay networks.
Cloud, VM, and bare metal CPU performance can be capable of handling much more than one Gbps network throughput. When using higher bandwidth links such as 10 or 40 Gbps, reduced performance can occur. This is a known issue in VXLAN-based environments and is not specific to containers or OpenShift Container Platform. Any network that relies on VXLAN tunnels will perform similarly because of the VXLAN implementation.
If you are looking to push beyond one Gbps, you can:
- Evaluate network plugins that implement different routing techniques, such as border gateway protocol (BGP).
- Use VXLAN-offload capable network adapters. VXLAN-offload moves the packet checksum calculation and associated CPU overhead off of the system CPU and onto dedicated hardware on the network adapter. This frees up CPU cycles for use by pods and applications, and allows users to utilize the full bandwidth of their network infrastructure.
VXLAN-offload does not reduce latency. However, CPU utilization is reduced even in latency tests.
7.3.1. Optimizing the MTU for your network
There are two important maximum transmission units (MTUs): the network interface controller (NIC) MTU and the cluster network MTU.
The NIC MTU is only configured at the time of OpenShift Container Platform installation. The MTU must be less than or equal to the maximum supported value of the NIC of your network. If you are optimizing for throughput, choose the largest possible value. If you are optimizing for lowest latency, choose a lower value.
The OpenShift SDN network plugin overlay MTU must be less than the NIC MTU by 50 bytes at a minimum. This accounts for the SDN overlay header. So, on a normal ethernet network, this should be set to 1450
. On a jumbo frame ethernet network, this should be set to 8950
. These values should be set automatically by the Cluster Network Operator based on the NIC’s configured MTU. Therefore, cluster administrators do not typically update these values. Amazon Web Services (AWS) and bare-metal environments support jumbo frame ethernet networks. This setting will help throughput, especially with transmission control protocol (TCP).
For OVN and Geneve, the MTU must be less than the NIC MTU by 100 bytes at a minimum.
This 50 byte overlay header is relevant to the OpenShift SDN network plugin. Other SDN solutions might require the value to be more or less.
7.3.2. Recommended practices for installing large scale clusters
When installing large clusters or scaling the cluster to larger node counts, set the cluster network cidr
accordingly in your install-config.yaml
file before you install the cluster:
networking: clusterNetwork: - cidr: 10.128.0.0/14 hostPrefix: 23 machineNetwork: - cidr: 10.0.0.0/16 networkType: OVNKubernetes serviceNetwork: - 172.30.0.0/16
The default cluster network cidr
10.128.0.0/14
cannot be used if the cluster size is more than 500 nodes. It must be set to 10.128.0.0/12
or 10.128.0.0/10
to get to larger node counts beyond 500 nodes.
7.3.3. Impact of IPsec
Because encrypting and decrypting node hosts uses CPU power, performance is affected both in throughput and CPU usage on the nodes when encryption is enabled, regardless of the IP security system being used.
IPSec encrypts traffic at the IP payload level, before it hits the NIC, protecting fields that would otherwise be used for NIC offloading. This means that some NIC acceleration features might not be usable when IPSec is enabled and will lead to decreased throughput and increased CPU usage.
7.3.4. Additional resources
7.4. Optimizing CPU usage with mount namespace encapsulation
You can optimize CPU usage in OpenShift Container Platform clusters by using mount namespace encapsulation to provide a private namespace for kubelet and CRI-O processes. This reduces the cluster CPU resources used by systemd with no difference in functionality.
Mount namespace encapsulation is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
7.4.1. Encapsulating mount namespaces
Mount namespaces are used to isolate mount points so that processes in different namespaces cannot view each others' files. Encapsulation is the process of moving Kubernetes mount namespaces to an alternative location where they will not be constantly scanned by the host operating system.
The host operating system uses systemd to constantly scan all mount namespaces: both the standard Linux mounts and the numerous mounts that Kubernetes uses to operate. The current implementation of kubelet and CRI-O both use the top-level namespace for all container runtime and kubelet mount points. However, encapsulating these container-specific mount points in a private namespace reduces systemd overhead with no difference in functionality. Using a separate mount namespace for both CRI-O and kubelet can encapsulate container-specific mounts from any systemd or other host operating system interaction.
This ability to potentially achieve major CPU optimization is now available to all OpenShift Container Platform administrators. Encapsulation can also improve security by storing Kubernetes-specific mount points in a location safe from inspection by unprivileged users.
The following diagrams illustrate a Kubernetes installation before and after encapsulation. Both scenarios show example containers which have mount propagation settings of bidirectional, host-to-container, and none.

Here we see systemd, host operating system processes, kubelet, and the container runtime sharing a single mount namespace.
- systemd, host operating system processes, kubelet, and the container runtime each have access to and visibility of all mount points.
-
Container 1, configured with bidirectional mount propagation, can access systemd and host mounts, kubelet and CRI-O mounts. A mount originating in Container 1, such as
/run/a
is visible to systemd, host operating system processes, kubelet, container runtime, and other containers with host-to-container or bidirectional mount propagation configured (as in Container 2). -
Container 2, configured with host-to-container mount propagation, can access systemd and host mounts, kubelet and CRI-O mounts. A mount originating in Container 2, such as
/run/b
, is not visible to any other context. -
Container 3, configured with no mount propagation, has no visibility of external mount points. A mount originating in Container 3, such as
/run/c
, is not visible to any other context.
The following diagram illustrates the system state after encapsulation.

- The main systemd process is no longer devoted to unnecessary scanning of Kubernetes-specific mount points. It only monitors systemd-specific and host mount points.
- The host operating system processes can access only the systemd and host mount points.
- Using a separate mount namespace for both CRI-O and kubelet completely separates all container-specific mounts away from any systemd or other host operating system interaction whatsoever.
-
The behavior of Container 1 is unchanged, except a mount it creates such as
/run/a
is no longer visible to systemd or host operating system processes. It is still visible to kubelet, CRI-O, and other containers with host-to-container or bidirectional mount propagation configured (like Container 2). - The behavior of Container 2 and Container 3 is unchanged.
7.4.2. Configuring mount namespace encapsulation
You can configure mount namespace encapsulation so that a cluster runs with less resource overhead.
Mount namespace encapsulation is a Technology Preview feature and it is disabled by default. To use it, you must enable the feature manually.
Prerequisites
-
You have installed the OpenShift CLI (
oc
). -
You have logged in as a user with
cluster-admin
privileges.
Procedure
Create a file called
mount_namespace_config.yaml
with the following YAML:apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig metadata: labels: machineconfiguration.openshift.io/role: master name: 99-kubens-master spec: config: ignition: version: 3.2.0 systemd: units: - enabled: true name: kubens.service --- apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig metadata: labels: machineconfiguration.openshift.io/role: worker name: 99-kubens-worker spec: config: ignition: version: 3.2.0 systemd: units: - enabled: true name: kubens.service
Apply the mount namespace
MachineConfig
CR by running the following command:$ oc apply -f mount_namespace_config.yaml
Example output
machineconfig.machineconfiguration.openshift.io/99-kubens-master created machineconfig.machineconfiguration.openshift.io/99-kubens-worker created
The
MachineConfig
CR can take up to 30 minutes to finish being applied in the cluster. You can check the status of theMachineConfig
CR by running the following command:$ oc get mcp
Example output
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-03d4bc4befb0f4ed3566a2c8f7636751 False True False 3 0 0 0 45m worker rendered-worker-10577f6ab0117ed1825f8af2ac687ddf False True False 3 1 1
Wait for the
MachineConfig
CR to be applied successfully across all control plane and worker nodes after running the following command:$ oc wait --for=condition=Updated mcp --all --timeout=30m
Example output
machineconfigpool.machineconfiguration.openshift.io/master condition met machineconfigpool.machineconfiguration.openshift.io/worker condition met
Verification
To verify encapsulation for a cluster host, run the following commands:
Open a debug shell to the cluster host:
$ oc debug node/<node_name>
Open a
chroot
session:sh-4.4# chroot /host
Check the systemd mount namespace:
sh-4.4# readlink /proc/1/ns/mnt
Example output
mnt:[4026531953]
Check kubelet mount namespace:
sh-4.4# readlink /proc/$(pgrep kubelet)/ns/mnt
Example output
mnt:[4026531840]
Check the CRI-O mount namespace:
sh-4.4# readlink /proc/$(pgrep crio)/ns/mnt
Example output
mnt:[4026531840]
These commands return the mount namespaces associated with systemd, kubelet, and the container runtime. In OpenShift Container Platform, the container runtime is CRI-O.
Encapsulation is in effect if systemd is in a different mount namespace to kubelet and CRI-O as in the above example. Encapsulation is not in effect if all three processes are in the same mount namespace.
7.4.3. Inspecting encapsulated namespaces
You can inspect Kubernetes-specific mount points in the cluster host operating system for debugging or auditing purposes by using the kubensenter
script that is available in Red Hat Enterprise Linux CoreOS (RHCOS).
SSH shell sessions to the cluster host are in the default namespace. To inspect Kubernetes-specific mount points in an SSH shell prompt, you need to run the kubensenter
script as root. The kubensenter
script is aware of the state of the mount encapsulation, and is safe to run even if encapsulation is not enabled.
oc debug
remote shell sessions start inside the Kubernetes namespace by default. You do not need to run kubensenter
to inspect mount points when you use oc debug
.
If the encapsulation feature is not enabled, the kubensenter findmnt
and findmnt
commands return the same output, regardless of whether they are run in an oc debug
session or in an SSH shell prompt.
Prerequisites
-
You have installed the OpenShift CLI (
oc
). -
You have logged in as a user with
cluster-admin
privileges. - You have configured SSH access to the cluster host.
Procedure
Open a remote SSH shell to the cluster host. For example:
$ ssh core@<node_name>
Run commands using the provided
kubensenter
script as the root user. To run a single command inside the Kubernetes namespace, provide the command and any arguments to thekubensenter
script. For example, to run thefindmnt
command inside the Kubernetes namespace, run the following command:[core@control-plane-1 ~]$ sudo kubensenter findmnt
Example output
kubensenter: Autodetect: kubens.service namespace found at /run/kubens/mnt TARGET SOURCE FSTYPE OPTIONS / /dev/sda4[/ostree/deploy/rhcos/deploy/32074f0e8e5ec453e56f5a8a7bc9347eaa4172349ceab9c22b709d9d71a3f4b0.0] | xfs rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,prjquota shm tmpfs ...
To start a new interactive shell inside the Kubernetes namespace, run the
kubensenter
script without any arguments:[core@control-plane-1 ~]$ sudo kubensenter
Example output
kubensenter: Autodetect: kubens.service namespace found at /run/kubens/mnt
7.4.4. Running additional services in the encapsulated namespace
Any monitoring tool that relies on the ability to run in the host operating system and have visibility of mount points created by kubelet, CRI-O, or containers themselves, must enter the container mount namespace to see these mount points. The kubensenter
script that is provided with OpenShift Container Platform executes another command inside the Kubernetes mount point and can be used to adapt any existing tools.
The kubensenter
script is aware of the state of the mount encapsulation feature status, and is safe to run even if encapsulation is not enabled. In that case the script executes the provided command in the default mount namespace.
For example, if a systemd service needs to run inside the new Kubernetes mount namespace, edit the service file and use the ExecStart=
command line with kubensenter
.
[Unit] Description=Example service [Service] ExecStart=/usr/bin/kubensenter /path/to/original/command arg1 arg2
7.4.5. Additional resources
Chapter 8. Managing bare metal hosts
When you install OpenShift Container Platform on a bare metal cluster, you can provision and manage bare metal nodes using machine
and machineset
custom resources (CRs) for bare metal hosts that exist in the cluster.
8.1. About bare metal hosts and nodes
To provision a Red Hat Enterprise Linux CoreOS (RHCOS) bare metal host as a node in your cluster, first create a MachineSet
custom resource (CR) object that corresponds to the bare metal host hardware. Bare metal host compute machine sets describe infrastructure components specific to your configuration. You apply specific Kubernetes labels to these compute machine sets and then update the infrastructure components to run on only those machines.
Machine
CR’s are created automatically when you scale up the relevant MachineSet
containing a metal3.io/autoscale-to-hosts
annotation. OpenShift Container Platform uses Machine
CR’s to provision the bare metal node that corresponds to the host as specified in the MachineSet
CR.
8.2. Maintaining bare metal hosts
You can maintain the details of the bare metal hosts in your cluster from the OpenShift Container Platform web console. Navigate to Compute → Bare Metal Hosts, and select a task from the Actions drop down menu. Here you can manage items such as BMC details, boot MAC address for the host, enable power management, and so on. You can also review the details of the network interfaces and drives for the host.
You can move a bare metal host into maintenance mode. When you move a host into maintenance mode, the scheduler moves all managed workloads off the corresponding bare metal node. No new workloads are scheduled while in maintenance mode.
You can deprovision a bare metal host in the web console. Deprovisioning a host does the following actions:
-
Annotates the bare metal host CR with
cluster.k8s.io/delete-machine: true
- Scales down the related compute machine set
Powering off the host without first moving the daemon set and unmanaged static pods to another node can cause service disruption and loss of data.
Additional resources
8.2.1. Adding a bare metal host to the cluster using the web console
You can add bare metal hosts to the cluster in the web console.
Prerequisites
- Install an RHCOS cluster on bare metal.
-
Log in as a user with
cluster-admin
privileges.
Procedure
- In the web console, navigate to Compute → Bare Metal Hosts.
- Select Add Host → New with Dialog.
- Specify a unique name for the new bare metal host.
- Set the Boot MAC address.
- Set the Baseboard Management Console (BMC) Address.
- Enter the user credentials for the host’s baseboard management controller (BMC).
- Select to power on the host after creation, and select Create.
- Scale up the number of replicas to match the number of available bare metal hosts. Navigate to Compute → MachineSets, and increase the number of machine replicas in the cluster by selecting Edit Machine count from the Actions drop-down menu.
You can also manage the number of bare metal nodes using the oc scale
command and the appropriate bare metal compute machine set.
8.2.2. Adding a bare metal host to the cluster using YAML in the web console
You can add bare metal hosts to the cluster in the web console using a YAML file that describes the bare metal host.
Prerequisites
- Install a RHCOS compute machine on bare metal infrastructure for use in the cluster.
-
Log in as a user with
cluster-admin
privileges. -
Create a
Secret
CR for the bare metal host.
Procedure
- In the web console, navigate to Compute → Bare Metal Hosts.
- Select Add Host → New from YAML.
Copy and paste the below YAML, modifying the relevant fields with the details of your host:
apiVersion: metal3.io/v1alpha1 kind: BareMetalHost metadata: name: <bare_metal_host_name> spec: online: true bmc: address: <bmc_address> credentialsName: <secret_credentials_name> 1 disableCertificateVerification: True 2 bootMACAddress: <host_boot_mac_address>
- 1
credentialsName
must reference a validSecret
CR. Thebaremetal-operator
cannot manage the bare metal host without a validSecret
referenced in thecredentialsName
. For more information about secrets and how to create them, see Understanding secrets.- 2
- Setting
disableCertificateVerification
totrue
disables TLS host validation between the cluster and the baseboard management controller (BMC).
- Select Create to save the YAML and create the new bare metal host.
Scale up the number of replicas to match the number of available bare metal hosts. Navigate to Compute → MachineSets, and increase the number of machines in the cluster by selecting Edit Machine count from the Actions drop-down menu.
NoteYou can also manage the number of bare metal nodes using the
oc scale
command and the appropriate bare metal compute machine set.
8.2.3. Automatically scaling machines to the number of available bare metal hosts
To automatically create the number of Machine
objects that matches the number of available BareMetalHost
objects, add a metal3.io/autoscale-to-hosts
annotation to the MachineSet
object.
Prerequisites
-
Install RHCOS bare metal compute machines for use in the cluster, and create corresponding
BareMetalHost
objects. -
Install the OpenShift Container Platform CLI (
oc
). -
Log in as a user with
cluster-admin
privileges.
Procedure
Annotate the compute machine set that you want to configure for automatic scaling by adding the
metal3.io/autoscale-to-hosts
annotation. Replace<machineset>
with the name of the compute machine set.$ oc annotate machineset <machineset> -n openshift-machine-api 'metal3.io/autoscale-to-hosts=<any_value>'
Wait for the new scaled machines to start.
When you use a BareMetalHost
object to create a machine in the cluster and labels or selectors are subsequently changed on the BareMetalHost
, the BareMetalHost
object continues be counted against the MachineSet
that the Machine
object was created from.
8.2.4. Removing bare metal hosts from the provisioner node
In certain circumstances, you might want to temporarily remove bare metal hosts from the provisioner node. For example, during provisioning when a bare metal host reboot is triggered by using the OpenShift Container Platform administration console or as a result of a Machine Config Pool update, OpenShift Container Platform logs into the integrated Dell Remote Access Controller (iDrac) and issues a delete of the job queue.
To prevent the management of the number of Machine
objects that matches the number of available BareMetalHost
objects, add a baremetalhost.metal3.io/detached
annotation to the MachineSet
object.
This annotation has an effect for only BareMetalHost
objects that are in either Provisioned
, ExternallyProvisioned
or Ready/Available
state.
Prerequisites
-
Install RHCOS bare metal compute machines for use in the cluster and create corresponding
BareMetalHost
objects. -
Install the OpenShift Container Platform CLI (
oc
). -
Log in as a user with
cluster-admin
privileges.
Procedure
Annotate the compute machine set that you want to remove from the provisioner node by adding the
baremetalhost.metal3.io/detached
annotation.$ oc annotate machineset <machineset> -n openshift-machine-api 'baremetalhost.metal3.io/detached'
Wait for the new machines to start.
NoteWhen you use a
BareMetalHost
object to create a machine in the cluster and labels or selectors are subsequently changed on theBareMetalHost
, theBareMetalHost
object continues be counted against theMachineSet
that theMachine
object was created from.In the provisioning use case, remove the annotation after the reboot is complete by using the following command:
$ oc annotate machineset <machineset> -n openshift-machine-api 'baremetalhost.metal3.io/detached-'
Additional resources
Chapter 9. What huge pages do and how they are consumed by applications
9.1. What huge pages do
Memory is managed in blocks known as pages. On most systems, a page is 4Ki. 1Mi of memory is equal to 256 pages; 1Gi of memory is 256,000 pages, and so on. CPUs have a built-in memory management unit that manages a list of these pages in hardware. The Translation Lookaside Buffer (TLB) is a small hardware cache of virtual-to-physical page mappings. If the virtual address passed in a hardware instruction can be found in the TLB, the mapping can be determined quickly. If not, a TLB miss occurs, and the system falls back to slower, software-based address translation, resulting in performance issues. Since the size of the TLB is fixed, the only way to reduce the chance of a TLB miss is to increase the page size.
A huge page is a memory page that is larger than 4Ki. On x86_64 architectures, there are two common huge page sizes: 2Mi and 1Gi. Sizes vary on other architectures. To use huge pages, code must be written so that applications are aware of them. Transparent Huge Pages (THP) attempt to automate the management of huge pages without application knowledge, but they have limitations. In particular, they are limited to 2Mi page sizes. THP can lead to performance degradation on nodes with high memory utilization or fragmentation due to defragmenting efforts of THP, which can lock memory pages. For this reason, some applications may be designed to (or recommend) usage of pre-allocated huge pages instead of THP.
In OpenShift Container Platform, applications in a pod can allocate and consume pre-allocated huge pages.
9.2. How huge pages are consumed by apps
Nodes must pre-allocate huge pages in order for the node to report its huge page capacity. A node can only pre-allocate huge pages for a single size.
Huge pages can be consumed through container-level resource requirements using the resource name hugepages-<size>
, where size is the most compact binary notation using integer values supported on a particular node. For example, if a node supports 2048KiB page sizes, it exposes a schedulable resource hugepages-2Mi
. Unlike CPU or memory, huge pages do not support over-commitment.
apiVersion: v1
kind: Pod
metadata:
generateName: hugepages-volume-
spec:
containers:
- securityContext:
privileged: true
image: rhel7:latest
command:
- sleep
- inf
name: example
volumeMounts:
- mountPath: /dev/hugepages
name: hugepage
resources:
limits:
hugepages-2Mi: 100Mi 1
memory: "1Gi"
cpu: "1"
volumes:
- name: hugepage
emptyDir:
medium: HugePages
- 1
- Specify the amount of memory for
hugepages
as the exact amount to be allocated. Do not specify this value as the amount of memory forhugepages
multiplied by the size of the page. For example, given a huge page size of 2MB, if you want to use 100MB of huge-page-backed RAM for your application, then you would allocate 50 huge pages. OpenShift Container Platform handles the math for you. As in the above example, you can specify100MB
directly.
Allocating huge pages of a specific size
Some platforms support multiple huge page sizes. To allocate huge pages of a specific size, precede the huge pages boot command parameters with a huge page size selection parameter hugepagesz=<size>
. The <size>
value must be specified in bytes with an optional scale suffix [kKmMgG
]. The default huge page size can be defined with the default_hugepagesz=<size>
boot parameter.
Huge page requirements
- Huge page requests must equal the limits. This is the default if limits are specified, but requests are not.
- Huge pages are isolated at a pod scope. Container isolation is planned in a future iteration.
-
EmptyDir
volumes backed by huge pages must not consume more huge page memory than the pod request. -
Applications that consume huge pages via
shmget()
withSHM_HUGETLB
must run with a supplemental group that matches proc/sys/vm/hugetlb_shm_group.
9.3. Consuming huge pages resources using the Downward API
You can use the Downward API to inject information about the huge pages resources that are consumed by a container.
You can inject the resource allocation as environment variables, a volume plugin, or both. Applications that you develop and run in the container can determine the resources that are available by reading the environment variables or files in the specified volumes.
Procedure
Create a
hugepages-volume-pod.yaml
file that is similar to the following example:apiVersion: v1 kind: Pod metadata: generateName: hugepages-volume- labels: app: hugepages-example spec: containers: - securityContext: capabilities: add: [ "IPC_LOCK" ] image: rhel7:latest command: - sleep - inf name: example volumeMounts: - mountPath: /dev/hugepages name: hugepage - mountPath: /etc/podinfo name: podinfo resources: limits: hugepages-1Gi: 2Gi memory: "1Gi" cpu: "1" requests: hugepages-1Gi: 2Gi env: - name: REQUESTS_HUGEPAGES_1GI <.> valueFrom: resourceFieldRef: containerName: example resource: requests.hugepages-1Gi volumes: - name: hugepage emptyDir: medium: HugePages - name: podinfo downwardAPI: items: - path: "hugepages_1G_request" <.> resourceFieldRef: containerName: example resource: requests.hugepages-1Gi divisor: 1Gi
<.> Specifies to read the resource use from
requests.hugepages-1Gi
and expose the value as theREQUESTS_HUGEPAGES_1GI
environment variable. <.> Specifies to read the resource use fromrequests.hugepages-1Gi
and expose the value as the file/etc/podinfo/hugepages_1G_request
.Create the pod from the
hugepages-volume-pod.yaml
file:$ oc create -f hugepages-volume-pod.yaml
Verification
Check the value of the
REQUESTS_HUGEPAGES_1GI
environment variable:$ oc exec -it $(oc get pods -l app=hugepages-example -o jsonpath='{.items[0].metadata.name}') \ -- env | grep REQUESTS_HUGEPAGES_1GI
Example output
REQUESTS_HUGEPAGES_1GI=2147483648
Check the value of the
/etc/podinfo/hugepages_1G_request
file:$ oc exec -it $(oc get pods -l app=hugepages-example -o jsonpath='{.items[0].metadata.name}') \ -- cat /etc/podinfo/hugepages_1G_request
Example output
2
Additional resources
9.4. Configuring huge pages at boot time
Nodes must pre-allocate huge pages used in an OpenShift Container Platform cluster. There are two ways of reserving huge pages: at boot time and at run time. Reserving at boot time increases the possibility of success because the memory has not yet been significantly fragmented. The Node Tuning Operator currently supports boot time allocation of huge pages on specific nodes.
Procedure
To minimize node reboots, the order of the steps below needs to be followed:
Label all nodes that need the same huge pages setting by a label.
$ oc label node <node_using_hugepages> node-role.kubernetes.io/worker-hp=
Create a file with the following content and name it
hugepages-tuned-boottime.yaml
:apiVersion: tuned.openshift.io/v1 kind: Tuned metadata: name: hugepages 1 namespace: openshift-cluster-node-tuning-operator spec: profile: 2 - data: | [main] summary=Boot time configuration for hugepages include=openshift-node [bootloader] cmdline_openshift_node_hugepages=hugepagesz=2M hugepages=50 3 name: openshift-node-hugepages recommend: - machineConfigLabels: 4 machineconfiguration.openshift.io/role: "worker-hp" priority: 30 profile: openshift-node-hugepages
Create the Tuned
hugepages
object$ oc create -f hugepages-tuned-boottime.yaml
Create a file with the following content and name it
hugepages-mcp.yaml
:apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfigPool metadata: name: worker-hp labels: worker-hp: "" spec: machineConfigSelector: matchExpressions: - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-hp]} nodeSelector: matchLabels: node-role.kubernetes.io/worker-hp: ""
Create the machine config pool:
$ oc create -f hugepages-mcp.yaml
Given enough non-fragmented memory, all the nodes in the worker-hp
machine config pool should now have 50 2Mi huge pages allocated.
$ oc get node <node_using_hugepages> -o jsonpath="{.status.allocatable.hugepages-2Mi}" 100Mi
The TuneD bootloader plugin is currently supported on Red Hat Enterprise Linux CoreOS (RHCOS) 8.x worker nodes. For Red Hat Enterprise Linux (RHEL) 7.x worker nodes, the TuneD bootloader plugin is currently not supported.
9.5. Disabling Transparent Huge Pages
Transparent Huge Pages (THP) attempt to automate most aspects of creating, managing, and using huge pages. Since THP automatically manages the huge pages, this is not always handled optimally for all types of workloads. THP can lead to performance regressions, since many applications handle huge pages on their own. Therefore, consider disabling THP. The following steps describe how to disable THP using the Node Tuning Operator (NTO).
Procedure
Create a file with the following content and name it
thp-disable-tuned.yaml
:apiVersion: tuned.openshift.io/v1 kind: Tuned metadata: name: thp-workers-profile namespace: openshift-cluster-node-tuning-operator spec: profile: - data: | [main] summary=Custom tuned profile for OpenShift to turn off THP on worker nodes include=openshift-node [vm] transparent_hugepages=never name: openshift-thp-never-worker recommend: - match: - label: node-role.kubernetes.io/worker priority: 25 profile: openshift-thp-never-worker
Create the Tuned object:
$ oc create -f thp-disable-tuned.yaml
Check the list of active profiles:
$ oc get profile -n openshift-cluster-node-tuning-operator
Verification
Log in to one of the nodes and do a regular THP check to verify if the nodes applied the profile successfully:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
Example output
always madvise [never]
Chapter 10. Low latency tuning
10.1. Understanding low latency
The emergence of Edge computing in the area of Telco / 5G plays a key role in reducing latency and congestion problems and improving application performance.
Simply put, latency determines how fast data (packets) moves from the sender to receiver and returns to the sender after processing by the receiver. Maintaining a network architecture with the lowest possible delay of latency speeds is key for meeting the network performance requirements of 5G. Compared to 4G technology, with an average latency of 50 ms, 5G is targeted to reach latency numbers of 1 ms or less. This reduction in latency boosts wireless throughput by a factor of 10.
Many of the deployed applications in the Telco space require low latency that can only tolerate zero packet loss. Tuning for zero packet loss helps mitigate the inherent issues that degrade network performance. For more information, see Tuning for Zero Packet Loss in Red Hat OpenStack Platform (RHOSP).
The Edge computing initiative also comes in to play for reducing latency rates. Think of it as being on the edge of the cloud and closer to the user. This greatly reduces the distance between the user and distant data centers, resulting in reduced application response times and performance latency.
Administrators must be able to manage their many Edge sites and local services in a centralized way so that all of the deployments can run at the lowest possible management cost. They also need an easy way to deploy and configure certain nodes of their cluster for real-time low latency and high-performance purposes. Low latency nodes are useful for applications such as Cloud-native Network Functions (CNF) and Data Plane Development Kit (DPDK).
OpenShift Container Platform currently provides mechanisms to tune software on an OpenShift Container Platform cluster for real-time running and low latency (around <20 microseconds reaction time). This includes tuning the kernel and OpenShift Container Platform set values, installing a kernel, and reconfiguring the machine. But this method requires setting up four different Operators and performing many configurations that, when done manually, is complex and could be prone to mistakes.
OpenShift Container Platform uses the Node Tuning Operator to implement automatic tuning to achieve low latency performance for OpenShift Container Platform applications. The cluster administrator uses this performance profile configuration that makes it easier to make these changes in a more reliable way. The administrator can specify whether to update the kernel to kernel-rt, reserve CPUs for cluster and operating system housekeeping duties, including pod infra containers, and isolate CPUs for application containers to run the workloads.
Currently, disabling CPU load balancing is not supported by cgroup v2. As a result, you might not get the desired behavior from performance profiles if you have cgroup v2 enabled. Enabling cgroup v2 is not recommended if you are using performace profiles.
OpenShift Container Platform also supports workload hints for the Node Tuning Operator that can tune the PerformanceProfile
to meet the demands of different industry environments. Workload hints are available for highPowerConsumption
(very low latency at the cost of increased power consumption) and realTime
(priority given to optimum latency). A combination of true/false
settings for these hints can be used to deal with application-specific workload profiles and requirements.
Workload hints simplify the fine-tuning of performance to industry sector settings. Instead of a “one size fits all” approach, workload hints can cater to usage patterns such as placing priority on:
- Low latency
- Real-time capability
- Efficient use of power
In an ideal world, all of those would be prioritized: in real life, some come at the expense of others. The Node Tuning Operator is now aware of the workload expectations and better able to meet the demands of the workload. The cluster admin can now specify into which use case that workload falls. The Node Tuning Operator uses the PerformanceProfile
to fine tune the performance settings for the workload.
The environment in which an application is operating influences its behavior. For a typical data center with no strict latency requirements, only minimal default tuning is needed that enables CPU partitioning for some high performance workload pods. For data centers and workloads where latency is a higher priority, measures are still taken to optimize power consumption. The most complicated cases are clusters close to latency-sensitive equipment such as manufacturing machinery and software-defined radios. This last class of deployment is often referred to as Far edge. For Far edge deployments, ultra-low latency is the ultimate priority, and is achieved at the expense of power management.
In OpenShift Container Platform version 4.10 and previous versions, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance. Now this functionality is part of the Node Tuning Operator.
10.1.1. About hyperthreading for low latency and real-time applications
Hyperthreading is an Intel processor technology that allows a physical CPU processor core to function as two logical cores, executing two independent threads simultaneously. Hyperthreading allows for better system throughput for certain workload types where parallel processing is beneficial. The default OpenShift Container Platform configuration expects hyperthreading to be enabled by default.
For telecommunications applications, it is important to design your application infrastructure to minimize latency as much as possible. Hyperthreading can slow performance times and negatively affect throughput for compute intensive workloads that require low latency. Disabling hyperthreading ensures predictable performance and can decrease processing times for these workloads.
Hyperthreading implementation and configuration differs depending on the hardware you are running OpenShift Container Platform on. Consult the relevant host hardware tuning information for more details of the hyperthreading implementation specific to that hardware. Disabling hyperthreading can increase the cost per core of the cluster.
Additional resources
10.2. Provisioning real-time and low latency workloads
Many industries and organizations need extremely high performance computing and might require low and predictable latency, especially in the financial and telecommunications industries. For these industries, with their unique requirements, OpenShift Container Platform provides the Node Tuning Operator to implement automatic tuning to achieve low latency performance and consistent response time for OpenShift Container Platform applications.
The cluster administrator can use this performance profile configuration to make these changes in a more reliable way. The administrator can specify whether to update the kernel to kernel-rt (real-time), reserve CPUs for cluster and operating system housekeeping duties, including pod infra containers, isolate CPUs for application containers to run the workloads, and disable unused CPUs to reduce power consumption.
The usage of execution probes in conjunction with applications that require guaranteed CPUs can cause latency spikes. It is recommended to use other probes, such as a properly configured set of network probes, as an alternative.
In earlier versions of OpenShift Container Platform, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OpenShift applications. In OpenShift Container Platform 4.11 and later, these functions are part of the Node Tuning Operator.
10.2.1. Known limitations for real-time
In most deployments, kernel-rt is supported only on worker nodes when you use a standard cluster with three control plane nodes and three worker nodes. There are exceptions for compact and single nodes on OpenShift Container Platform deployments. For installations on a single node, kernel-rt is supported on the single control plane node.
To fully utilize the real-time mode, the containers must run with elevated privileges. See Set capabilities for a Container for information on granting privileges.
OpenShift Container Platform restricts the allowed capabilities, so you might need to create a SecurityContext
as well.
This procedure is fully supported with bare metal installations using Red Hat Enterprise Linux CoreOS (RHCOS) systems.
Establishing the right performance expectations refers to the fact that the real-time kernel is not a panacea. Its objective is consistent, low-latency determinism offering predictable response times. There is some additional kernel overhead associated with the real-time kernel. This is due primarily to handling hardware interruptions in separately scheduled threads. The increased overhead in some workloads results in some degradation in overall throughput. The exact amount of degradation is very workload dependent, ranging from 0% to 30%. However, it is the cost of determinism.
10.2.2. Provisioning a worker with real-time capabilities
- Optional: Add a node to the OpenShift Container Platform cluster. See Setting BIOS parameters for system tuning.
-
Add the label
worker-rt
to the worker nodes that require the real-time capability by using theoc
command. Create a new machine config pool for real-time nodes:
apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfigPool metadata: name: worker-rt labels: machineconfiguration.openshift.io/role: worker-rt spec: machineConfigSelector: matchExpressions: - { key: machineconfiguration.openshift.io/role, operator: In, values: [worker, worker-rt], } paused: false nodeSelector: matchLabels: node-role.kubernetes.io/worker-rt: ""
Note that a machine config pool worker-rt is created for group of nodes that have the label
worker-rt
.Add the node to the proper machine config pool by using node role labels.
NoteYou must decide which nodes are configured with real-time workloads. You could configure all of the nodes in the cluster, or a subset of the nodes. The Node Tuning Operator that expects all of the nodes are part of a dedicated machine config pool. If you use all of the nodes, you must point the Node Tuning Operator to the worker node role label. If you use a subset, you must group the nodes into a new machine config pool.
-
Create the
PerformanceProfile
with the proper set of housekeeping cores andrealTimeKernel: enabled: true
. You must set
machineConfigPoolSelector
inPerformanceProfile
:apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: example-performanceprofile spec: ... realTimeKernel: enabled: true nodeSelector: node-role.kubernetes.io/worker-rt: "" machineConfigPoolSelector: machineconfiguration.openshift.io/role: worker-rt
Verify that a matching machine config pool exists with a label:
$ oc describe mcp/worker-rt
Example output
Name: worker-rt Namespace: Labels: machineconfiguration.openshift.io/role=worker-rt
- OpenShift Container Platform will start configuring the nodes, which might involve multiple reboots. Wait for the nodes to settle. This can take a long time depending on the specific hardware you use, but 20 minutes per node is expected.
- Verify everything is working as expected.
10.2.3. Verifying the real-time kernel installation
Use this command to verify that the real-time kernel is installed:
$ oc get node -o wide
Note the worker with the role worker-rt
that contains the string 4.18.0-305.30.1.rt7.102.el8_4.x86_64 cri-o://1.25.0-99.rhaos4.10.gitc3131de.el8
:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME rt-worker-0.example.com Ready worker,worker-rt 5d17h v1.25.0 128.66.135.107 <none> Red Hat Enterprise Linux CoreOS 46.82.202008252340-0 (Ootpa) 4.18.0-305.30.1.rt7.102.el8_4.x86_64 cri-o://1.25.0-99.rhaos4.10.gitc3131de.el8 [...]
10.2.4. Creating a workload that works in real-time
Use the following procedures for preparing a workload that will use real-time capabilities.
Procedure
-
Create a pod with a QoS class of
Guaranteed
. - Optional: Disable CPU load balancing for DPDK.
- Assign a proper node selector.
When writing your applications, follow the general recommendations described in Application tuning and deployment.
10.2.5. Creating a pod with a QoS class of Guaranteed
Keep the following in mind when you create a pod that is given a QoS class of Guaranteed
:
- Every container in the pod must have a memory limit and a memory request, and they must be the same.
- Every container in the pod must have a CPU limit and a CPU request, and they must be the same.
The following example shows the configuration file for a pod that has one container. The container has a memory limit and a memory request, both equal to 200 MiB. The container has a CPU limit and a CPU request, both equal to 1 CPU.
apiVersion: v1 kind: Pod metadata: name: qos-demo namespace: qos-example spec: containers: - name: qos-demo-ctr image: <image-pull-spec> resources: limits: memory: "200Mi" cpu: "1" requests: memory: "200Mi" cpu: "1"
Create the pod:
$ oc apply -f qos-pod.yaml --namespace=qos-example
View detailed information about the pod:
$ oc get pod qos-demo --namespace=qos-example --output=yaml
Example output
spec: containers: ... status: qosClass: Guaranteed
NoteIf a container specifies its own memory limit, but does not specify a memory request, OpenShift Container Platform automatically assigns a memory request that matches the limit. Similarly, if a container specifies its own CPU limit, but does not specify a CPU request, OpenShift Container Platform automatically assigns a CPU request that matches the limit.
10.2.6. Optional: Disabling CPU load balancing for DPDK
Functionality to disable or enable CPU load balancing is implemented on the CRI-O level. The code under the CRI-O disables or enables CPU load balancing only when the following requirements are met.
The pod must use the
performance-<profile-name>
runtime class. You can get the proper name by looking at the status of the performance profile, as shown here:apiVersion: performance.openshift.io/v2 kind: PerformanceProfile ... status: ... runtimeClass: performance-manual
Currently, disabling CPU load balancing is not supported with cgroup v2.
The Node Tuning Operator is responsible for the creation of the high-performance runtime handler config snippet under relevant nodes and for creation of the high-performance runtime class under the cluster. It will have the same content as default runtime handler except it enables the CPU load balancing configuration functionality.
To disable the CPU load balancing for the pod, the Pod
specification must include the following fields:
apiVersion: v1 kind: Pod metadata: ... annotations: ... cpu-load-balancing.crio.io: "disable" ... ... spec: ... runtimeClassName: performance-<profile_name> ...
Only disable CPU load balancing when the CPU manager static policy is enabled and for pods with guaranteed QoS that use whole CPUs. Otherwise, disabling CPU load balancing can affect the performance of other containers in the cluster.
10.2.7. Assigning a proper node selector
The preferred way to assign a pod to nodes is to use the same node selector the performance profile used, as shown here:
apiVersion: v1 kind: Pod metadata: name: example spec: # ... nodeSelector: node-role.kubernetes.io/worker-rt: ""
For more information, see Placing pods on specific nodes using node selectors.
10.2.8. Scheduling a workload onto a worker with real-time capabilities
Use label selectors that match the nodes attached to the machine config pool that was configured for low latency by the Node Tuning Operator. For more information, see Assigning pods to nodes.
10.2.9. Reducing power consumption by taking CPUs offline
You can generally anticipate telecommunication workloads. When not all of the CPU resources are required, the Node Tuning Operator allows you take unused CPUs offline to reduce power consumption by manually updating the performance profile.
To take unused CPUs offline, you must perform the following tasks:
Set the offline CPUs in the performance profile and save the contents of the YAML file:
Example performance profile with offlined CPUs
apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: performance spec: additionalKernelArgs: - nmi_watchdog=0 - audit=0 - mce=off - processor.max_cstate=1 - intel_idle.max_cstate=0 - idle=poll cpu: isolated: "2-23,26-47" reserved: "0,1,24,25" offlined: “48-59” 1 nodeSelector: node-role.kubernetes.io/worker-cnf: "" numa: topologyPolicy: single-numa-node realTimeKernel: enabled: true
- 1
- Optional. You can list CPUs in the
offlined
field to take the specified CPUs offline.
Apply the updated profile by running the following command:
$ oc apply -f my-performance-profile.yaml
10.2.10. Optional: Power saving configurations
You can enable power savings for a node that has low priority workloads that are colocated with high priority workloads without impacting the latency or throughput of the high priority workloads. Power saving is possible without modifications to the workloads themselves.
The feature is supported on Intel Ice Lake and later generations of Intel CPUs. The capabilities of the processor might impact the latency and throughput of the high priority workloads.
When you configure a node with a power saving configuration, you must configure high priority workloads with performance configuration at the pod level, which means that the configuration applies to all the cores used by the pod.
By disabling P-states and C-states at the pod level, you can configure high priority workloads for best performance and lowest latency.
Table 10.1. Configuration for high priority workloads
Annotation | Description |
---|---|
annotations: cpu-c-states.crio.io: "disable" cpu-freq-governor.crio.io: "<governor>" |
Provides the best performance for a pod by disabling C-states and specifying the governor type for CPU scaling. The |
Prerequisites
- You enabled C-states and OS-controlled P-states in the BIOS
Procedure
Generate a
PerformanceProfile
withper-pod-power-management
set totrue
:$ podman run --entrypoint performance-profile-creator -v \ /must-gather:/must-gather:z registry.redhat.io/openshift4/performance-addon-rhel8-operator:v4.12 \ --mcp-name=worker-cnf --reserved-cpu-count=20 --rt-kernel=true \ --split-reserved-cpus-across-numa=false --topology-manager-policy=single-numa-node \ --must-gather-dir-path /must-gather -power-consumption-mode=low-latency \ 1 --per-pod-power-management=true > my-performance-profile.yaml
- 1
- The
power-consumption-mode
must bedefault
orlow-latency
when theper-pod-power-management
is set totrue
.
Example
PerformanceProfile
withperPodPowerManagement
apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: performance spec: [.....] workloadHints: realTime: true highPowerConsumption: false perPodPowerManagement: true
Set the default
cpufreq
governor as an additional kernel argument in thePerformanceProfile
custom resource (CR):apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: performance spec: ... additionalKernelArgs: - cpufreq.default_governor=schedutil 1
- 1
- Using the
schedutil
governor is recommended, however, you can use other governors such as theondemand
orpowersave
governors.
Set the maximum CPU frequency in the
TunedPerformancePatch
CR:spec: profile: - data: | [sysfs] /sys/devices/system/cpu/intel_pstate/max_perf_pct = <x> 1
- 1
- The
max_perf_pct
controls the maximum frequency thecpufreq
driver is allowed to set as a percentage of the maximum supported cpu frequency. This value applies to all CPUs. You can check the maximum supported frequency in/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
. As a starting point, you can use a percentage that caps all CPUs at theAll Cores Turbo
frequency. TheAll Cores Turbo
frequency is the frequency that all cores will run at when the cores are all fully occupied.
Add the desired annotations to your high priority workload pods. The annotations override the
default
settings.Example high priority workload annotation
apiVersion: v1 kind: Pod metadata: ... annotations: ... cpu-c-states.crio.io: "disable" cpu-freq-governor.crio.io: "<governor>" ... ... spec: ... runtimeClassName: performance-<profile_name> ...
- Restart the pods.
Additional resources
- For more information about recommended firmware configuration, see Recommended firmware configuration for vDU cluster hosts.
10.2.11. Managing device interrupt processing for guaranteed pod isolated CPUs
The Node Tuning Operator can manage host CPUs by dividing them into reserved CPUs for cluster and operating system housekeeping duties, including pod infra containers, and isolated CPUs for application containers to run the workloads. This allows you to set CPUs for low latency workloads as isolated.
Device interrupts are load balanced between all isolated and reserved CPUs to avoid CPUs being overloaded, with the exception of CPUs where there is a guaranteed pod running. Guaranteed pod CPUs are prevented from processing device interrupts when the relevant annotations are set for the pod.
In the performance profile, globallyDisableIrqLoadBalancing
is used to manage whether device interrupts are processed or not. For certain workloads, the reserved CPUs are not always sufficient for dealing with device interrupts, and for this reason, device interrupts are not globally disabled on the isolated CPUs. By default, Node Tuning Operator does not disable device interrupts on isolated CPUs.
To achieve low latency for workloads, some (but not all) pods require the CPUs they are running on to not process device interrupts. A pod annotation, irq-load-balancing.crio.io
, is used to define whether device interrupts are processed or not. When configured, CRI-O disables device interrupts only as long as the pod is running.
10.2.11.1. Disabling CPU CFS quota
To reduce CPU throttling for individual guaranteed pods, create a pod specification with the annotation cpu-quota.crio.io: "disable"
. This annotation disables the CPU completely fair scheduler (CFS) quota at the pod run time. The following pod specification contains this annotation:
apiVersion: performance.openshift.io/v2 kind: Pod metadata: annotations: cpu-quota.crio.io: "disable" spec: runtimeClassName: performance-<profile_name> ...
Only disable CPU CFS quota when the CPU manager static policy is enabled and for pods with guaranteed QoS that use whole CPUs. Otherwise, disabling CPU CFS quota can affect the performance of other containers in the cluster.
10.2.11.2. Disabling global device interrupts handling in Node Tuning Operator
To configure Node Tuning Operator to disable global device interrupts for the isolated CPU set, set the globallyDisableIrqLoadBalancing
field in the performance profile to true
. When true
, conflicting pod annotations are ignored. When false
, IRQ loads are balanced across all CPUs.
A performance profile snippet illustrates this setting:
apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: manual spec: globallyDisableIrqLoadBalancing: true ...
10.2.11.3. Disabling interrupt processing for individual pods
To disable interrupt processing for individual pods, ensure that globallyDisableIrqLoadBalancing
is set to false
in the performance profile. Then, in the pod specification, set the irq-load-balancing.crio.io
pod annotation to disable
. The following pod specification contains this annotation:
apiVersion: performance.openshift.io/v2 kind: Pod metadata: annotations: irq-load-balancing.crio.io: "disable" spec: runtimeClassName: performance-<profile_name> ...
10.2.12. Upgrading the performance profile to use device interrupt processing
When you upgrade the Node Tuning Operator performance profile custom resource definition (CRD) from v1 or v1alpha1 to v2, globallyDisableIrqLoadBalancing
is set to true
on existing profiles.
globallyDisableIrqLoadBalancing
toggles whether IRQ load balancing will be disabled for the Isolated CPU set. When the option is set to true
it disables IRQ load balancing for the Isolated CPU set. Setting the option to false
allows the IRQs to be balanced across all CPUs.
10.2.12.1. Supported API Versions
The Node Tuning Operator supports v2
, v1
, and v1alpha1
for the performance profile apiVersion
field. The v1 and v1alpha1 APIs are identical. The v2 API includes an optional boolean field globallyDisableIrqLoadBalancing
with a default value of false
.
10.2.12.1.1. Upgrading Node Tuning Operator API from v1alpha1 to v1
When upgrading Node Tuning Operator API version from v1alpha1 to v1, the v1alpha1 performance profiles are converted on-the-fly using a "None" Conversion strategy and served to the Node Tuning Operator with API version v1.
10.2.12.1.2. Upgrading Node Tuning Operator API from v1alpha1 or v1 to v2
When upgrading from an older Node Tuning Operator API version, the existing v1 and v1alpha1 performance profiles are converted using a conversion webhook that injects the globallyDisableIrqLoadBalancing
field with a value of true
.
10.3. Tuning nodes for low latency with the performance profile
The performance profile lets you control latency tuning aspects of nodes that belong to a certain machine config pool. After you specify your settings, the PerformanceProfile
object is compiled into multiple objects that perform the actual node level tuning:
-
A
MachineConfig
file that manipulates the nodes. -
A
KubeletConfig
file that configures the Topology Manager, the CPU Manager, and the OpenShift Container Platform nodes. - The Tuned profile that configures the Node Tuning Operator.
You can use a performance profile to specify whether to update the kernel to kernel-rt, to allocate huge pages, and to partition the CPUs for performing housekeeping duties or running workloads.
You can manually create the PerformanceProfile
object or use the Performance Profile Creator (PPC) to generate a performance profile. See the additional resources below for more information on the PPC.
Sample performance profile
apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: performance spec: cpu: isolated: "5-15" 1 reserved: "0-4" 2 hugepages: defaultHugepagesSize: "1G" pages: - size: "1G" count: 16 node: 0 realTimeKernel: enabled: true 3 numa: 4 topologyPolicy: "best-effort" nodeSelector: node-role.kubernetes.io/worker-cnf: "" 5
- 1
- Use this field to isolate specific CPUs to use with application containers for workloads.
- 2
- Use this field to reserve specific CPUs to use with infra containers for housekeeping.
- 3
- Use this field to install the real-time kernel on the node. Valid values are
true
orfalse
. Setting thetrue
value installs the real-time kernel. - 4
- Use this field to configure the topology manager policy. Valid values are
none
(default),best-effort
,restricted
, andsingle-numa-node
. For more information, see Topology Manager Policies. - 5
- Use this field to specify a node selector to apply the performance profile to specific nodes.
Additional resources
- For information on using the Performance Profile Creator (PPC) to generate a performance profile, see Creating a performance profile.
10.3.1. Configuring huge pages
Nodes must pre-allocate huge pages used in an OpenShift Container Platform cluster. Use the Node Tuning Operator to allocate huge pages on a specific node.
OpenShift Container Platform provides a method for creating and allocating huge pages. Node Tuning Operator provides an easier method for doing this using the performance profile.
For example, in the hugepages
pages
section of the performance profile, you can specify multiple blocks of size
, count
, and, optionally, node
:
hugepages:
defaultHugepagesSize: "1G"
pages:
- size: "1G"
count: 4
node: 0 1
- 1
node
is the NUMA node in which the huge pages are allocated. If you omitnode
, the pages are evenly spread across all NUMA nodes.
Wait for the relevant machine config pool status that indicates the update is finished.
These are the only configuration steps you need to do to allocate huge pages.
Verification
To verify the configuration, see the
/proc/meminfo
file on the node:$ oc debug node/ip-10-0-141-105.ec2.internal
# grep -i huge /proc/meminfo
Example output
AnonHugePages: ###### ## ShmemHugePages: 0 kB HugePages_Total: 2 HugePages_Free: 2 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: #### ## Hugetlb: #### ##
Use
oc describe
to report the new size:$ oc describe node worker-0.ocp4poc.example.com | grep -i huge
Example output
hugepages-1g=true hugepages-###: ### hugepages-###: ###
10.3.2. Allocating multiple huge page sizes
You can request huge pages with different sizes under the same container. This allows you to define more complicated pods consisting of containers with different huge page size needs.
For example, you can define sizes 1G
and 2M
and the Node Tuning Operator will configure both sizes on the node, as shown here:
spec: hugepages: defaultHugepagesSize: 1G pages: - count: 1024 node: 0 size: 2M - count: 4 node: 1 size: 1G
10.3.3. Configuring a node for IRQ dynamic load balancing
Configure a cluster node for IRQ dynamic load balancing to control which cores can receive device interrupt requests (IRQ).
Prerequisites
- For core isolation, all server hardware components must support IRQ affinity. To check if the hardware components of your server support IRQ affinity, view the server’s hardware specifications or contact your hardware provider.
Procedure
- Log in to the OpenShift Container Platform cluster as a user with cluster-admin privileges.
-
Set the performance profile
apiVersion
to useperformance.openshift.io/v2
. -
Remove the
globallyDisableIrqLoadBalancing
field or set it tofalse
. Set the appropriate isolated and reserved CPUs. The following snippet illustrates a profile that reserves 2 CPUs. IRQ load-balancing is enabled for pods running on the
isolated
CPU set:apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: dynamic-irq-profile spec: cpu: isolated: 2-5 reserved: 0-1 ...
NoteWhen you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs.
Create the pod that uses exclusive CPUs, and set
irq-load-balancing.crio.io
andcpu-quota.crio.io
annotations todisable
. For example:apiVersion: v1 kind: Pod metadata: name: dynamic-irq-pod annotations: irq-load-balancing.crio.io: "disable" cpu-quota.crio.io: "disable" spec: containers: - name: dynamic-irq-pod image: "registry.redhat.io/openshift4/cnf-tests-rhel8:v4.4.12" command: ["sleep", "10h"] resources: requests: cpu: 2 memory: "200M" limits: cpu: 2 memory: "200M" nodeSelector: node-role.kubernetes.io/worker-cnf: "" runtimeClassName: performance-dynamic-irq-profile ...
-
Enter the pod
runtimeClassName
in the form performance-<profile_name>, where <profile_name> is thename
from thePerformanceProfile
YAML, in this example,performance-dynamic-irq-profile
. - Set the node selector to target a cnf-worker.
Ensure the pod is running correctly. Status should be
running
, and the correct cnf-worker node should be set:$ oc get pod -o wide
Expected output
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES dynamic-irq-pod 1/1 Running 0 5h33m <ip-address> <node-name> <none> <none>
Get the CPUs that the pod configured for IRQ dynamic load balancing runs on:
$ oc exec -it dynamic-irq-pod -- /bin/bash -c "grep Cpus_allowed_list /proc/self/status | awk '{print $2}'"
Expected output
Cpus_allowed_list: 2-3
Ensure the node configuration is applied correctly. Log in to the node to verify the configuration.
$ oc debug node/<node-name>
Expected output
Starting pod/<node-name>-debug ... To use host binaries, run `chroot /host` Pod IP: <ip-address> If you don't see a command prompt, try pressing enter. sh-4.4#
Verify that you can use the node file system:
sh-4.4# chroot /host
Expected output
sh-4.4#
Ensure the default system CPU affinity mask does not include the
dynamic-irq-pod
CPUs, for example, CPUs 2 and 3.$ cat /proc/irq/default_smp_affinity
Example output
33
Ensure the system IRQs are not configured to run on the
dynamic-irq-pod
CPUs:find /proc/irq/ -name smp_affinity_list -exec sh -c 'i="$1"; mask=$(cat $i); file=$(echo $i); echo $file: $mask' _ {} \;
Example output
/proc/irq/0/smp_affinity_list: 0-5 /proc/irq/1/smp_affinity_list: 5 /proc/irq/2/smp_affinity_list: 0-5 /proc/irq/3/smp_affinity_list: 0-5 /proc/irq/4/smp_affinity_list: 0 /proc/irq/5/smp_affinity_list: 0-5 /proc/irq/6/smp_affinity_list: 0-5 /proc/irq/7/smp_affinity_list: 0-5 /proc/irq/8/smp_affinity_list: 4 /proc/irq/9/smp_affinity_list: 4 /proc/irq/10/smp_affinity_list: 0-5 /proc/irq/11/smp_affinity_list: 0 /proc/irq/12/smp_affinity_list: 1 /proc/irq/13/smp_affinity_list: 0-5 /proc/irq/14/smp_affinity_list: 1 /proc/irq/15/smp_affinity_list: 0 /proc/irq/24/smp_affinity_list: 1 /proc/irq/25/smp_affinity_list: 1 /proc/irq/26/smp_affinity_list: 1 /proc/irq/27/smp_affinity_list: 5 /proc/irq/28/smp_affinity_list: 1 /proc/irq/29/smp_affinity_list: 0 /proc/irq/30/smp_affinity_list: 0-5
10.3.4. About support of IRQ affinity setting
Some IRQ controllers lack support for IRQ affinity setting and will always expose all online CPUs as the IRQ mask. These IRQ controllers effectively run on CPU 0.
The following are examples of drivers and hardware that Red Hat are aware lack support for IRQ affinity setting. The list is, by no means, exhaustive:
-
Some RAID controller drivers, such as
megaraid_sas
- Many non-volatile memory express (NVMe) drivers
- Some LAN on motherboard (LOM) network controllers
-
The driver uses
managed_irqs
The reason they do not support IRQ affinity setting might be associated with factors such as the type of processor, the IRQ controller, or the circuitry connections in the motherboard.
If the effective affinity of any IRQ is set to an isolated CPU, it might be a sign of some hardware or driver not supporting IRQ affinity setting. To find the effective affinity, log in to the host and run the following command:
$ find /proc/irq/ -name effective_affinity -exec sh -c 'i="$1"; mask=$(cat $i); file=$(echo $i); echo $file: $mask' _ {} \;
Example output
/proc/irq/0/effective_affinity: 1 /proc/irq/1/effective_affinity: 8 /proc/irq/2/effective_affinity: 0 /proc/irq/3/effective_affinity: 1 /proc/irq/4/effective_affinity: 2 /proc/irq/5/effective_affinity: 1 /proc/irq/6/effective_affinity: 1 /proc/irq/7/effective_affinity: 1 /proc/irq/8/effective_affinity: 1 /proc/irq/9/effective_affinity: 2 /proc/irq/10/effective_affinity: 1 /proc/irq/11/effective_affinity: 1 /proc/irq/12/effective_affinity: 4 /proc/irq/13/effective_affinity: 1 /proc/irq/14/effective_affinity: 1 /proc/irq/15/effective_affinity: 1 /proc/irq/24/effective_affinity: 2 /proc/irq/25/effective_affinity: 4 /proc/irq/26/effective_affinity: 2 /proc/irq/27/effective_affinity: 1 /proc/irq/28/effective_affinity: 8 /proc/irq/29/effective_affinity: 4 /proc/irq/30/effective_affinity: 4 /proc/irq/31/effective_affinity: 8 /proc/irq/32/effective_affinity: 8 /proc/irq/33/effective_affinity: 1 /proc/irq/34/effective_affinity: 2
Some drivers use managed_irqs
, whose affinity is managed internally by the kernel and userspace cannot change the affinity. In some cases, these IRQs might be assigned to isolated CPUs. For more information about managed_irqs
, see Affinity of managed interrupts cannot be changed even if they target isolated CPU.
10.3.5. Configuring hyperthreading for a cluster
To configure hyperthreading for an OpenShift Container Platform cluster, set the CPU threads in the performance profile to the same cores that are configured for the reserved or isolated CPU pools.
If you configure a performance profile, and subsequently change the hyperthreading configuration for the host, ensure that you update the CPU isolated
and reserved
fields in the PerformanceProfile
YAML to match the new configuration.
Disabling a previously enabled host hyperthreading configuration can cause the CPU core IDs listed in the PerformanceProfile
YAML to be incorrect. This incorrect configuration can cause the node to become unavailable because the listed CPUs can no longer be found.
Prerequisites
-
Access to the cluster as a user with the
cluster-admin
role. - Install the OpenShift CLI (oc).
Procedure
Ascertain which threads are running on what CPUs for the host you want to configure.
You can view which threads are running on the host CPUs by logging in to the cluster and running the following command:
$ lscpu --all --extended
Example output
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ 0 0 0 0 0:0:0:0 yes 4800.0000 400.0000 1 0 0 1 1:1:1:0 yes 4800.0000 400.0000 2 0 0 2 2:2:2:0 yes 4800.0000 400.0000 3 0 0 3 3:3:3:0 yes 4800.0000 400.0000 4 0 0 0 0:0:0:0 yes 4800.0000 400.0000 5 0 0 1 1:1:1:0 yes 4800.0000 400.0000 6 0 0 2 2:2:2:0 yes 4800.0000 400.0000 7 0 0 3 3:3:3:0 yes 4800.0000 400.0000
In this example, there are eight logical CPU cores running on four physical CPU cores. CPU0 and CPU4 are running on physical Core0, CPU1 and CPU5 are running on physical Core 1, and so on.
Alternatively, to view the threads that are set for a particular physical CPU core (
cpu0
in the example below), open a command prompt and run the following:$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
Example output
0-4
Apply the isolated and reserved CPUs in the
PerformanceProfile
YAML. For example, you can set logical cores CPU0 and CPU4 asisolated
, and logical cores CPU1 to CPU3 and CPU5 to CPU7 asreserved
. When you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs.... cpu: isolated: 0,4 reserved: 1-3,5-7 ...
NoteThe reserved and isolated CPU pools must not overlap and together must span all available cores in the worker node.
Hyperthreading is enabled by default on most Intel processors. If you enable hyperthreading, all threads processed by a particular core must be isolated or processed on the same core.
10.3.5.1. Disabling hyperthreading for low latency applications
When configuring clusters for low latency processing, consider whether you want to disable hyperthreading before you deploy the cluster. To disable hyperthreading, do the following:
- Create a performance profile that is appropriate for your hardware and topology.
Set
nosmt
as an additional kernel argument. The following example performance profile illustrates this setting:apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: example-performanceprofile spec: additionalKernelArgs: - nmi_watchdog=0 - audit=0 - mce=off - processor.max_cstate=1 - idle=poll - intel_idle.max_cstate=0 - nosmt cpu: isolated: 2-3 reserved: 0-1 hugepages: defaultHugepagesSize: 1G pages: - count: 2 node: 0 size: 1G nodeSelector: node-role.kubernetes.io/performance: '' realTimeKernel: enabled: true
NoteWhen you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs.
10.3.6. Understanding workload hints
The following table describes how combinations of power consumption and real-time settings impact on latency.
The following workload hints can be configured manually. You can also work with workload hints using the Performance Profile Creator. For more information about the performance profile, see the "Creating a performance profile" section.
Performance Profile creator setting | Hint | Environment | Description |
---|---|---|---|
Default |
workloadHints: highPowerConsumption: false realTime: false | High throughput cluster without latency requirements | Performance achieved through CPU partitioning only. |
Low-latency |
workloadHints: highPowerConsumption: false realTime: true | Regional datacenters | Both energy savings and low-latency are desirable: compromise between power management, latency and throughput. |
Ultra-low-latency |
workloadHints: highPowerConsumption: true realTime: true | Far edge clusters, latency critical workloads | Optimized for absolute minimal latency and maximum determinism at the cost of increased power consumption. |
Per-pod power management |
workloadHints: realTime: true highPowerConsumption: false perPodPowerManagement: true | Critical and non-critical workloads | Allows for power management per pod. |
Additional resources
- For information on using the Performance Profile Creator (PPC) to generate a performance profile, see Creating a performance profile.
10.3.7. Configuring workload hints manually
Procedure
-
Create a
PerformanceProfile
appropriate for the environment’s hardware and topology as described in the table in "Understanding workload hints". Adjust the profile to match the expected workload. In this example, we tune for the lowest possible latency. Add the
highPowerConsumption
andrealTime
workload hints. Both are set totrue
here.apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: workload-hints spec: ... workloadHints: highPowerConsumption: true 1 realTime: true 2
10.3.8. Restricting CPUs for infra and application containers
Generic housekeeping and workload tasks use CPUs in a way that may impact latency-sensitive processes. By default, the container runtime uses all online CPUs to run all containers together, which can result in context switches and spikes in latency. Partitioning the CPUs prevents noisy processes from interfering with latency-sensitive processes by separating them from each other. The following table describes how processes run on a CPU after you have tuned the node using the Node Tuning Operator:
Table 10.2. Process' CPU assignments
Process type | Details |
---|---|
| Runs on any CPU except where low latency workload is running |
Infrastructure pods | Runs on any CPU except where low latency workload is running |
Interrupts | Redirects to reserved CPUs (optional in OpenShift Container Platform 4.7 and later) |
Kernel processes | Pins to reserved CPUs |
Latency-sensitive workload pods | Pins to a specific set of exclusive CPUs from the isolated pool |
OS processes/systemd services | Pins to reserved CPUs |
The allocatable capacity of cores on a node for pods of all QoS process types, Burstable
, BestEffort
, or Guaranteed
, is equal to the capacity of the isolated pool. The capacity of the reserved pool is removed from the node’s total core capacity for use by the cluster and operating system housekeeping duties.
Example 1
A node features a capacity of 100 cores. Using a performance profile, the cluster administrator allocates 50 cores to the isolated pool and 50 cores to the reserved pool. The cluster administrator assigns 25 cores to QoS Guaranteed
pods and 25 cores for BestEffort
or Burstable
pods. This matches the capacity of the isolated pool.
Example 2
A node features a capacity of 100 cores. Using a performance profile, the cluster administrator allocates 50 cores to the isolated pool and 50 cores to the reserved pool. The cluster administrator assigns 50 cores to QoS Guaranteed
pods and one core for BestEffort
or Burstable
pods. This exceeds the capacity of the isolated pool by one core. Pod scheduling fails because of insufficient CPU capacity.
The exact partitioning pattern to use depends on many factors like hardware, workload characteristics and the expected system load. Some sample use cases are as follows:
- If the latency-sensitive workload uses specific hardware, such as a network interface controller (NIC), ensure that the CPUs in the isolated pool are as close as possible to this hardware. At a minimum, you should place the workload in the same Non-Uniform Memory Access (NUMA) node.
- The reserved pool is used for handling all interrupts. When depending on system networking, allocate a sufficiently-sized reserve pool to handle all the incoming packet interrupts. In 4.12 and later versions, workloads can optionally be labeled as sensitive.
The decision regarding which specific CPUs should be used for reserved and isolated partitions requires detailed analysis and measurements. Factors like NUMA affinity of devices and memory play a role. The selection also depends on the workload architecture and the specific use case.
The reserved and isolated CPU pools must not overlap and together must span all available cores in the worker node.
To ensure that housekeeping tasks and workloads do not interfere with each other, specify two groups of CPUs in the spec
section of the performance profile.
-
isolated
- Specifies the CPUs for the application container workloads. These CPUs have the lowest latency. Processes in this group have no interruptions and can, for example, reach much higher DPDK zero packet loss bandwidth. -
reserved
- Specifies the CPUs for the cluster and operating system housekeeping duties. Threads in thereserved
group are often busy. Do not run latency-sensitive applications in thereserved
group. Latency-sensitive applications run in theisolated
group.
Procedure
- Create a performance profile appropriate for the environment’s hardware and topology.
Add the
reserved
andisolated
parameters with the CPUs you want reserved and isolated for the infra and application containers:apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: infra-cpus spec: cpu: reserved: "0-4,9" 1 isolated: "5-8" 2 nodeSelector: 3 node-role.kubernetes.io/worker: ""
10.4. Reducing NIC queues using the Node Tuning Operator
The Node Tuning Operator allows you to adjust the network interface controller (NIC) queue count for each network device by configuring the performance profile. Device network queues allows the distribution of packets among different physical queues and each queue gets a separate thread for packet processing.
In real-time or low latency systems, all the unnecessary interrupt request lines (IRQs) pinned to the isolated CPUs must be moved to reserved or housekeeping CPUs.
In deployments with applications that require system, OpenShift Container Platform networking or in mixed deployments with Data Plane Development Kit (DPDK) workloads, multiple queues are needed to achieve good throughput and the number of NIC queues should be adjusted or remain unchanged. For example, to achieve low latency the number of NIC queues for DPDK based workloads should be reduced to just the number of reserved or housekeeping CPUs.
Too many queues are created by default for each CPU and these do not fit into the interrupt tables for housekeeping CPUs when tuning for low latency. Reducing the number of queues makes proper tuning possible. Smaller number of queues means a smaller number of interrupts that then fit in the IRQ table.
In earlier versions of OpenShift Container Platform, the Performance Addon Operator provided automatic, low latency performance tuning for applications. In OpenShift Container Platform 4.11 and later, this functionality is part of the Node Tuning Operator.
10.4.1. Adjusting the NIC queues with the performance profile
The performance profile lets you adjust the queue count for each network device.
Supported network devices:
- Non-virtual network devices
- Network devices that support multiple queues (channels)
Unsupported network devices:
- Pure software network interfaces
- Block devices
- Intel DPDK virtual functions
Prerequisites
-
Access to the cluster as a user with the
cluster-admin
role. -
Install the OpenShift CLI (
oc
).
Procedure
-
Log in to the OpenShift Container Platform cluster running the Node Tuning Operator as a user with
cluster-admin
privileges. - Create and apply a performance profile appropriate for your hardware and topology. For guidance on creating a profile, see the "Creating a performance profile" section.
Edit this created performance profile:
$ oc edit -f <your_profile_name>.yaml
Populate the
spec
field with thenet
object. The object list can contain two fields:-
userLevelNetworking
is a required field specified as a boolean flag. IfuserLevelNetworking
istrue
, the queue count is set to the reserved CPU count for all supported devices. The default isfalse
. devices
is an optional field specifying a list of devices that will have the queues set to the reserved CPU count. If the device list is empty, the configuration applies to all network devices. The configuration is as follows:interfaceName
: This field specifies the interface name, and it supports shell-style wildcards, which can be positive or negative.-
Example wildcard syntax is as follows:
<string> .*
-
Negative rules are prefixed with an exclamation mark. To apply the net queue changes to all devices other than the excluded list, use
!<device>
, for example,!eno1
.
-
Example wildcard syntax is as follows:
-
vendorID
: The network device vendor ID represented as a 16-bit hexadecimal number with a0x
prefix. deviceID
: The network device ID (model) represented as a 16-bit hexadecimal number with a0x
prefix.NoteWhen a
deviceID
is specified, thevendorID
must also be defined. A device that matches all of the device identifiers specified in a device entryinterfaceName
,vendorID
, or a pair ofvendorID
plusdeviceID
qualifies as a network device. This network device then has its net queues count set to the reserved CPU count.When two or more devices are specified, the net queues count is set to any net device that matches one of them.
-
Set the queue count to the reserved CPU count for all devices by using this example performance profile:
apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: manual spec: cpu: isolated: 3-51,54-103 reserved: 0-2,52-54 net: userLevelNetworking: true nodeSelector: node-role.kubernetes.io/worker-cnf: ""
Set the queue count to the reserved CPU count for all devices matching any of the defined device identifiers by using this example performance profile:
apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: manual spec: cpu: isolated: 3-51,54-103 reserved: 0-2,52-54 net: userLevelNetworking: true devices: - interfaceName: “eth0” - interfaceName: “eth1” - vendorID: “0x1af4” - deviceID: “0x1000” nodeSelector: node-role.kubernetes.io/worker-cnf: ""
Set the queue count to the reserved CPU count for all devices starting with the interface name
eth
by using this example performance profile:apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: manual spec: cpu: isolated: 3-51,54-103 reserved: 0-2,52-54 net: userLevelNetworking: true devices: - interfaceName: “eth*” nodeSelector: node-role.kubernetes.io/worker-cnf: ""
Set the queue count to the reserved CPU count for all devices with an interface named anything other than
eno1
by using this example performance profile:apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: manual spec: cpu: isolated: 3-51,54-103 reserved: 0-2,52-54 net: userLevelNetworking: true devices: - interfaceName: “!eno1” nodeSelector: node-role.kubernetes.io/worker-cnf: ""
Set the queue count to the reserved CPU count for all devices that have an interface name
eth0
,vendorID
of0x1af4
, anddeviceID
of0x1000
by using this example performance profile:apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: manual spec: cpu: isolated: 3-51,54-103 reserved: 0-2,52-54 net: userLevelNetworking: true devices: - interfaceName: “eth0” - vendorID: “0x1af4” - deviceID: “0x1000” nodeSelector: node-role.kubernetes.io/worker-cnf: ""
Apply the updated performance profile:
$ oc apply -f <your_profile_name>.yaml
Additional resources
10.4.2. Verifying the queue status
In this section, a number of examples illustrate different performance profiles and how to verify the changes are applied.
Example 1
In this example, the net queue count is set to the reserved CPU count (2) for all supported devices.
The relevant section from the performance profile is:
apiVersion: performance.openshift.io/v2 metadata: name: performance spec: kind: PerformanceProfile spec: cpu: reserved: 0-1 #total = 2 isolated: 2-8 net: userLevelNetworking: true # ...
Display the status of the queues associated with a device using the following command:
NoteRun this command on the node where the performance profile was applied.
$ ethtool -l <device>
Verify the queue status before the profile is applied:
$ ethtool -l ens4
Example output
Channel parameters for ens4: Pre-set maximums: RX: 0 TX: 0 Other: 0 Combined: 4 Current hardware settings: RX: 0 TX: 0 Other: 0 Combined: 4
Verify the queue status after the profile is applied:
$ ethtool -l ens4
Example output
Channel parameters for ens4: Pre-set maximums: RX: 0 TX: 0 Other: 0 Combined: 4 Current hardware settings: RX: 0 TX: 0 Other: 0 Combined: 2 1
- 1
- The combined channel shows that the total count of reserved CPUs for all supported devices is 2. This matches what is configured in the performance profile.
Example 2
In this example, the net queue count is set to the reserved CPU count (2) for all supported network devices with a specific vendorID
.
The relevant section from the performance profile is:
apiVersion: performance.openshift.io/v2 metadata: name: performance spec: kind: PerformanceProfile spec: cpu: reserved: 0-1 #total = 2 isolated: 2-8 net: userLevelNetworking: true devices: - vendorID = 0x1af4 # ...
Display the status of the queues associated with a device using the following command:
NoteRun this command on the node where the performance profile was applied.
$ ethtool -l <device>
Verify the queue status after the profile is applied:
$ ethtool -l ens4
Example output
Channel parameters for ens4: Pre-set maximums: RX: 0 TX: 0 Other: 0 Combined: 4 Current hardware settings: RX: 0 TX: 0 Other: 0 Combined: 2 1
- 1
- The total count of reserved CPUs for all supported devices with
vendorID=0x1af4
is 2. For example, if there is another network deviceens2
withvendorID=0x1af4
it will also have total net queues of 2. This matches what is configured in the performance profile.
Example 3
In this example, the net queue count is set to the reserved CPU count (2) for all supported network devices that match any of the defined device identifiers.
The command udevadm info
provides a detailed report on a device. In this example the devices are:
# udevadm info -p /sys/class/net/ens4 ... E: ID_MODEL_ID=0x1000 E: ID_VENDOR_ID=0x1af4 E: INTERFACE=ens4 ...
# udevadm info -p /sys/class/net/eth0 ... E: ID_MODEL_ID=0x1002 E: ID_VENDOR_ID=0x1001 E: INTERFACE=eth0 ...
Set the net queues to 2 for a device with
interfaceName
equal toeth0
and any devices that have avendorID=0x1af4
with the following performance profile:apiVersion: performance.openshift.io/v2 metadata: name: performance spec: kind: PerformanceProfile spec: cpu: reserved: 0-1 #total = 2 isolated: 2-8 net: userLevelNetworking: true devices: - interfaceName = eth0 - vendorID = 0x1af4 ...
Verify the queue status after the profile is applied:
$ ethtool -l ens4
Example output
Channel parameters for ens4: Pre-set maximums: RX: 0 TX: 0 Other: 0 Combined: 4 Current hardware settings: RX: 0 TX: 0 Other: 0 Combined: 2 1
- 1
- The total count of reserved CPUs for all supported devices with
vendorID=0x1af4
is set to 2. For example, if there is another network deviceens2
withvendorID=0x1af4
, it will also have the total net queues set to 2. Similarly, a device withinterfaceName
equal toeth0
will have total net queues set to 2.
10.4.3. Logging associated with adjusting NIC queues
Log messages detailing the assigned devices are recorded in the respective Tuned daemon logs. The following messages might be recorded to the /var/log/tuned/tuned.log
file:
An
INFO
message is recorded detailing the successfully assigned devices:INFO tuned.plugins.base: instance net_test (net): assigning devices ens1, ens2, ens3
A
WARNING
message is recorded if none of the devices can be assigned:WARNING tuned.plugins.base: instance net_test: no matching devices available
10.5. Debugging low latency CNF tuning status
The PerformanceProfile
custom resource (CR) contains status fields for reporting tuning status and debugging latency degradation issues. These fields report on conditions that describe the state of the operator’s reconciliation functionality.
A typical issue can arise when the status of machine config pools that are attached to the performance profile are in a degraded state, causing the PerformanceProfile
status to degrade. In this case, the machine config pool issues a failure message.
The Node Tuning Operator contains the performanceProfile.spec.status.Conditions
status field:
Status: Conditions: Last Heartbeat Time: 2020-06-02T10:01:24Z Last Transition Time: 2020-06-02T10:01:24Z Status: True Type: Available Last Heartbeat Time: 2020-06-02T10:01:24Z Last Transition Time: 2020-06-02T10:01:24Z Status: True Type: Upgradeable Last Heartbeat Time: 2020-06-02T10:01:24Z Last Transition Time: 2020-06-02T10:01:24Z Status: False Type: Progressing Last Heartbeat Time: 2020-06-02T10:01:24Z Last Transition Time: 2020-06-02T10:01:24Z Status: False Type: Degraded
The Status
field contains Conditions
that specify Type
values that indicate the status of the performance profile:
Available
- All machine configs and Tuned profiles have been created successfully and are available for cluster components are responsible to process them (NTO, MCO, Kubelet).
Upgradeable
- Indicates whether the resources maintained by the Operator are in a state that is safe to upgrade.
Progressing
- Indicates that the deployment process from the performance profile has started.
Degraded
Indicates an error if:
- Validation of the performance profile has failed.
- Creation of all relevant components did not complete successfully.
Each of these types contain the following fields:
Status
-
The state for the specific type (
true
orfalse
). Timestamp
- The transaction timestamp.
Reason string
- The machine readable reason.
Message string
- The human readable reason describing the state and error details, if any.
10.5.1. Machine config pools
A performance profile and its created products are applied to a node according to an associated machine config pool (MCP). The MCP holds valuable information about the progress of applying the machine configurations created by performance profiles that encompass kernel args, kube config, huge pages allocation, and deployment of rt-kernel. The Performance Profile controller monitors changes in the MCP and updates the performance profile status accordingly.
The only conditions returned by the MCP to the performance profile status is when the MCP is Degraded
, which leads to performaceProfile.status.condition.Degraded = true
.
Example
The following example is for a performance profile with an associated machine config pool (worker-cnf
) that was created for it:
The associated machine config pool is in a degraded state:
# oc get mcp
Example output
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-2ee57a93fa6c9181b546ca46e1571d2d True False False 3 3 3 0 2d21h worker rendered-worker-d6b2bdc07d9f5a59a6b68950acf25e5f True False False 2 2 2 0 2d21h worker-cnf rendered-worker-cnf-6c838641b8a08fff08dbd8b02fb63f7c False True True 2 1 1 1 2d20h
The
describe
section of the MCP shows the reason:# oc describe mcp worker-cnf
Example output
Message: Node node-worker-cnf is reporting: "prepping update: machineconfig.machineconfiguration.openshift.io \"rendered-worker-cnf-40b9996919c08e335f3ff230ce1d170\" not found" Reason: 1 nodes are reporting degraded status on sync
The degraded state should also appear under the performance profile
status
field marked asdegraded = true
:# oc describe performanceprofiles performance
Example output
Message: Machine config pool worker-cnf Degraded Reason: 1 nodes are reporting degraded status on sync. Machine config pool worker-cnf Degraded Message: Node yquinn-q8s5v-w-b-z5lqn.c.openshift-gce-devel.internal is reporting: "prepping update: machineconfig.machineconfiguration.openshift.io \"rendered-worker-cnf-40b9996919c08e335f3ff230ce1d170\" not found". Reason: MCPDegraded Status: True Type: Degraded
10.6. Collecting low latency tuning debugging data for Red Hat Support
When opening a support case, it is helpful to provide debugging information about your cluster to Red Hat Support.
The must-gather
tool enables you to collect diagnostic information about your OpenShift Container Platform cluster, including node tuning, NUMA topology, and other information needed to debug issues with low latency setup.
For prompt support, supply diagnostic information for both OpenShift Container Platform and low latency tuning.
10.6.1. About the must-gather tool
The oc adm must-gather
CLI command collects the information from your cluster that is most likely needed for debugging issues, such as:
- Resource definitions
- Audit logs
- Service logs
You can specify one or more images when you run the command by including the --image
argument. When you specify an image, the tool collects data related to that feature or product. When you run oc adm must-gather
, a new pod is created on the cluster. The data is collected on that pod and saved in a new directory that starts with must-gather.local
. This directory is created in your current working directory.
10.6.2. About collecting low latency tuning data
Use the oc adm must-gather
CLI command to collect information about your cluster, including features and objects associated with low latency tuning, including:
- The Node Tuning Operator namespaces and child objects.
-
MachineConfigPool
and associatedMachineConfig
objects. - The Node Tuning Operator and associated Tuned objects.
- Linux Kernel command line options.
- CPU and NUMA topology
- Basic PCI device information and NUMA locality.
To collect debugging information with must-gather
, you must specify the Performance Addon Operator must-gather
image:
--image=registry.redhat.io/openshift4/performance-addon-operator-must-gather-rhel8:v4.12.
In earlier versions of OpenShift Container Platform, the Performance Addon Operator provided automatic, low latency performance tuning for applications. In OpenShift Container Platform 4.11 and later, this functionality is part of the Node Tuning Operator. However, you must still use the performance-addon-operator-must-gather
image when running the must-gather
command.
10.6.3. Gathering data about specific features
You can gather debugging information about specific features by using the oc adm must-gather
CLI command with the --image
or --image-stream
argument. The must-gather
tool supports multiple images, so you can gather data about more than one feature by running a single command.
To collect the default must-gather
data in addition to specific feature data, add the --image-stream=openshift/must-gather
argument.
In earlier versions of OpenShift Container Platform, the Performance Addon Operator provided automatic, low latency performance tuning for applications. In OpenShift Container Platform 4.11, these functions are part of the Node Tuning Operator. However, you must still use the performance-addon-operator-must-gather
image when running the must-gather
command.
Prerequisites
-
Access to the cluster as a user with the
cluster-admin
role. - The OpenShift Container Platform CLI (oc) installed.
Procedure
-
Navigate to the directory where you want to store the
must-gather
data. Run the
oc adm must-gather
command with one or more--image
or--image-stream
arguments. For example, the following command gathers both the default cluster data and information specific to the Node Tuning Operator:$ oc adm must-gather \ --image-stream=openshift/must-gather \ 1 --image=registry.redhat.io/openshift4/performance-addon-operator-must-gather-rhel8:v4.12 2
Create a compressed file from the
must-gather
directory that was created in your working directory. For example, on a computer that uses a Linux operating system, run the following command:$ tar cvaf must-gather.tar.gz must-gather.local.5421342344627712289/ 1
- 1
- Replace
must-gather-local.5421342344627712289/
with the actual directory name.
- Attach the compressed file to your support case on the Red Hat Customer Portal.
Additional resources
- For more information about MachineConfig and KubeletConfig, see Managing nodes.
- For more information about the Node Tuning Operator, see Using the Node Tuning Operator.
- For more information about the PerformanceProfile, see Configuring huge pages.
- For more information about consuming huge pages from your containers, see How huge pages are consumed by apps.
Chapter 11. Performing latency tests for platform verification
You can use the Cloud-native Network Functions (CNF) tests image to run latency tests on a CNF-enabled OpenShift Container Platform cluster, where all the components required for running CNF workloads are installed. Run the latency tests to validate node tuning for your workload.
The cnf-tests
container image is available at registry.redhat.io/openshift4/cnf-tests-rhel8:v4.12
.
The cnf-tests
image also includes several tests that are not supported by Red Hat at this time. Only the latency tests are supported by Red Hat.
11.1. Prerequisites for running latency tests
Your cluster must meet the following requirements before you can run the latency tests:
- You have configured a performance profile with the Node Tuning Operator.
- You have applied all the required CNF configurations in the cluster.
-
You have a pre-existing
MachineConfigPool
CR applied in the cluster. The default worker pool isworker-cnf
.
Additional resources
- For more information about creating the cluster performance profile, see Provisioning a worker with real-time capabilities.
11.2. About discovery mode for latency tests
Use discovery mode to validate the functionality of a cluster without altering its configuration. Existing environment configurations are used for the tests. The tests can find the configuration items needed and use those items to execute the tests. If resources needed to run a specific test are not found, the test is skipped, providing an appropriate message to the user. After the tests are finished, no cleanup of the pre-configured configuration items is done, and the test environment can be immediately used for another test run.
When running the latency tests, always run the tests with -e DISCOVERY_MODE=true
and -ginkgo.focus
set to the appropriate latency test. If you do not run the latency tests in discovery mode, your existing live cluster performance profile configuration will be modified by the test run.
Limiting the nodes used during tests
The nodes on which the tests are executed can be limited by specifying a NODES_SELECTOR
environment variable, for example, -e NODES_SELECTOR=node-role.kubernetes.io/worker-cnf
. Any resources created by the test are limited to nodes with matching labels.
If you want to override the default worker pool, pass the -e ROLE_WORKER_CNF=<custom_worker_pool>
variable to the command specifying an appropriate label.
11.3. Measuring latency
The cnf-tests
image uses three tools to measure the latency of the system:
-
hwlatdetect
-
cyclictest
-
oslat
Each tool has a specific use. Use the tools in sequence to achieve reliable test results.
- hwlatdetect
-
Measures the baseline that the bare-metal hardware can achieve. Before proceeding with the next latency test, ensure that the latency reported by
hwlatdetect
meets the required threshold because you cannot fix hardware latency spikes by operating system tuning. - cyclictest
-
Verifies the real-time kernel scheduler latency after
hwlatdetect
passes validation. Thecyclictest
tool schedules a repeated timer and measures the difference between the desired and the actual trigger times. The difference can uncover basic issues with the tuning caused by interrupts or process priorities. The tool must run on a real-time kernel. - oslat
- Behaves similarly to a CPU-intensive DPDK application and measures all the interruptions and disruptions to the busy loop that simulates CPU heavy data processing.
The tests introduce the following environment variables:
Table 11.1. Latency test environment variables
Environment variables | Description |
---|---|
| Specifies the amount of time in seconds after which the test starts running. You can use the variable to allow the CPU manager reconcile loop to update the default CPU pool. The default value is 0. |
| Specifies the number of CPUs that the pod running the latency tests uses. If you do not set the variable, the default configuration includes all isolated CPUs. |
| Specifies the amount of time in seconds that the latency test must run. The default value is 300 seconds. |
|
Specifies the maximum acceptable hardware latency in microseconds for the workload and operating system. If you do not set the value of |
|
Specifies the maximum latency in microseconds that all threads expect before waking up during the |
|
Specifies the maximum acceptable latency in microseconds for the |
| Unified variable that specifies the maximum acceptable latency in microseconds. Applicable for all available latency tools. |
|
Boolean parameter that indicates whether the tests should run. |
Variables that are specific to a latency tool take precedence over unified variables. For example, if OSLAT_MAXIMUM_LATENCY
is set to 30 microseconds and MAXIMUM_LATENCY
is set to 10 microseconds, the oslat
test will run with maximum acceptable latency of 30 microseconds.
11.4. Running the latency tests
Run the cluster latency tests to validate node tuning for your Cloud-native Network Functions (CNF) workload.
Always run the latency tests with DISCOVERY_MODE=true
set. If you don’t, the test suite will make changes to the running cluster configuration.
When executing podman
commands as a non-root or non-privileged user, mounting paths can fail with permission denied
errors. To make the podman
command work, append :Z
to the volumes creation; for example, -v $(pwd)/:/kubeconfig:Z
. This allows podman
to do the proper SELinux relabeling.
Procedure
Open a shell prompt in the directory containing the
kubeconfig
file.You provide the test image with a
kubeconfig
file in current directory and its related$KUBECONFIG
environment variable, mounted through a volume. This allows the running container to use thekubeconfig
file from inside the container.Run the latency tests by entering the following command:
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e FEATURES=performance registry.redhat.io/openshift4/cnf-tests-rhel8:v4.12 \ /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"
-
Optional: Append
-ginkgo.dryRun
to run the latency tests in dry-run mode. This is useful for checking what the tests run. -
Optional: Append
-ginkgo.v
to run the tests with increased verbosity. Optional: To run the latency tests against a specific performance profile, run the following command, substituting appropriate values:
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e LATENCY_TEST_RUN=true -e FEATURES=performance -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \ -e PERF_TEST_PROFILE=<performance_profile> registry.redhat.io/openshift4/cnf-tests-rhel8:v4.12 \ /usr/bin/test-run.sh -ginkgo.focus="[performance]\ Latency\ Test"
where:
- <performance_profile>
- Is the name of the performance profile you want to run the latency tests against.
ImportantFor valid latency test results, run the tests for at least 12 hours.
11.4.1. Running hwlatdetect
The hwlatdetect
tool is available in the rt-kernel
package with a regular subscription of Red Hat Enterprise Linux (RHEL) 8.x.
Always run the latency tests with DISCOVERY_MODE=true
set. If you don’t, the test suite will make changes to the running cluster configuration.
When executing podman
commands as a non-root or non-privileged user, mounting paths can fail with permission denied
errors. To make the podman
command work, append :Z
to the volumes creation; for example, -v $(pwd)/:/kubeconfig:Z
. This allows podman
to do the proper SELinux relabeling.
Prerequisites
- You have installed the real-time kernel in the cluster.
-
You have logged in to
registry.redhat.io
with your Customer Portal credentials.
Procedure
To run the
hwlatdetect
tests, run the following command, substituting variable values as appropriate:$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e FEATURES=performance -e ROLE_WORKER_CNF=worker-cnf \ -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.12 \ /usr/bin/test-run.sh -ginkgo.v -ginkgo.focus="hwlatdetect"
The
hwlatdetect
test runs for 10 minutes (600 seconds). The test runs successfully when the maximum observed latency is lower thanMAXIMUM_LATENCY
(20 μs).If the results exceed the latency threshold, the test fails.
ImportantFor valid results, the test should run for at least 12 hours.
Example failure output
running /usr/bin/cnftests -ginkgo.v -ginkgo.focus=hwlatdetect I0908 15:25:20.023712 27 request.go:601] Waited for 1.046586367s due to client-side throttling, not priority and fairness, request: GET:https://api.hlxcl6.lab.eng.tlv2.redhat.com:6443/apis/imageregistry.operator.openshift.io/v1?timeout=32s Running Suite: CNF Features e2e integration tests ================================================= Random Seed: 1662650718 Will run 1 of 194 specs [...] • Failure [283.574 seconds] [performance] Latency Test /remote-source/app/vendor/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/4_latency/latency.go:62 with the hwlatdetect image /remote-source/app/vendor/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/4_latency/latency.go:228 should succeed [It] /remote-source/app/vendor/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/4_latency/latency.go:236 Log file created at: 2022/09/08 15:25:27 Running on machine: hwlatdetect-b6n4n Binary: Built with gc go1.17.12 for linux/amd64 Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg I0908 15:25:27.160620 1 node.go:39] Environment information: /proc/cmdline: BOOT_IMAGE=(hd1,gpt3)/ostree/rhcos-c6491e1eedf6c1f12ef7b95e14ee720bf48359750ac900b7863c625769ef5fb9/vmlinuz-4.18.0-372.19.1.el8_6.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/c6491e1eedf6c1f12ef7b95e14ee720bf48359750ac900b7863c625769ef5fb9/0 ip=dhcp root=UUID=5f80c283-f6e6-4a27-9b47-a287157483b2 rw rootflags=prjquota boot=UUID=773bf59a-bafd-48fc-9a87-f62252d739d3 skew_tick=1 nohz=on rcu_nocbs=0-3 tuned.non_isolcpus=0000ffff,ffffffff,fffffff0 systemd.cpu_affinity=4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79 intel_iommu=on iommu=pt isolcpus=managed_irq,0-3 nohz_full=0-3 tsc=nowatchdog nosoftlockup nmi_watchdog=0 mce=off skew_tick=1 rcutree.kthread_prio=11 + + I0908 15:25:27.160830 1 node.go:46] Environment information: kernel version 4.18.0-372.19.1.el8_6.x86_64 I0908 15:25:27.160857 1 main.go:50] running the hwlatdetect command with arguments [/usr/bin/hwlatdetect --threshold 1 --hardlimit 1 --duration 100 --window 10000000us --width 950000us] F0908 15:27:10.603523 1 main.go:53] failed to run hwlatdetect command; out: hwlatdetect: test duration 100 seconds detector: tracer parameters: Latency threshold: 1us 1 Sample window: 10000000us Sample width: 950000us Non-sampling period: 9050000us Output File: None Starting test test finished Max Latency: 326us 2 Samples recorded: 5 Samples exceeding threshold: 5 ts: 1662650739.017274507, inner:6, outer:6 ts: 1662650749.257272414, inner:14, outer:326 ts: 1662650779.977272835, inner:314, outer:12 ts: 1662650800.457272384, inner:3, outer:9 ts: 1662650810.697273520, inner:3, outer:2 [...] JUnit report was created: /junit.xml/cnftests-junit.xml Summarizing 1 Failure: [Fail] [performance] Latency Test with the hwlatdetect image [It] should succeed /remote-source/app/vendor/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/4_latency/latency.go:476 Ran 1 of 194 Specs in 365.797 seconds FAIL! -- 0 Passed | 1 Failed | 0 Pending | 193 Skipped --- FAIL: TestTest (366.08s) FAIL
Example hwlatdetect test results
You can capture the following types of results:
- Rough results that are gathered after each run to create a history of impact on any changes made throughout the test.
- The combined set of the rough tests with the best results and configuration settings.
Example of good results
hwlatdetect: test duration 3600 seconds detector: tracer parameters: Latency threshold: 10us Sample window: 1000000us Sample width: 950000us Non-sampling period: 50000us Output File: None Starting test test finished Max Latency: Below threshold Samples recorded: 0
The hwlatdetect
tool only provides output if the sample exceeds the specified threshold.
Example of bad results
hwlatdetect: test duration 3600 seconds detector: tracer parameters:Latency threshold: 10usSample window: 1000000us Sample width: 950000usNon-sampling period: 50000usOutput File: None Starting tests:1610542421.275784439, inner:78, outer:81 ts: 1610542444.330561619, inner:27, outer:28 ts: 1610542445.332549975, inner:39, outer:38 ts: 1610542541.568546097, inner:47, outer:32 ts: 1610542590.681548531, inner:13, outer:17 ts: 1610543033.818801482, inner:29, outer:30 ts: 1610543080.938801990, inner:90, outer:76 ts: 1610543129.065549639, inner:28, outer:39 ts: 1610543474.859552115, inner:28, outer:35 ts: 1610543523.973856571, inner:52, outer:49 ts: 1610543572.089799738, inner:27, outer:30 ts: 1610543573.091550771, inner:34, outer:28 ts: 1610543574.093555202, inner:116, outer:63
The output of hwlatdetect
shows that multiple samples exceed the threshold. However, the same output can indicate different results based on the following factors:
- The duration of the test
- The number of CPU cores
- The host firmware settings
Before proceeding with the next latency test, ensure that the latency reported by hwlatdetect
meets the required threshold. Fixing latencies introduced by hardware might require you to contact the system vendor support.
Not all latency spikes are hardware related. Ensure that you tune the host firmware to meet your workload requirements. For more information, see Setting firmware parameters for system tuning.
11.4.2. Running cyclictest
The cyclictest
tool measures the real-time kernel scheduler latency on the specified CPUs.
Always run the latency tests with DISCOVERY_MODE=true
set. If you don’t, the test suite will make changes to the running cluster configuration.
When executing podman
commands as a non-root or non-privileged user, mounting paths can fail with permission denied
errors. To make the podman
command work, append :Z
to the volumes creation; for example, -v $(pwd)/:/kubeconfig:Z
. This allows podman
to do the proper SELinux relabeling.
Prerequisites
-
You have logged in to
registry.redhat.io
with your Customer Portal credentials. - You have installed the real-time kernel in the cluster.
- You have applied a cluster performance profile by using Node Tuning Operator.
Procedure
To perform the
cyclictest
, run the following command, substituting variable values as appropriate:$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e FEATURES=performance -e ROLE_WORKER_CNF=worker-cnf \ -e LATENCY_TEST_CPUS=10 -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.12 \ /usr/bin/test-run.sh -ginkgo.v -ginkgo.focus="cyclictest"
The command runs the
cyclictest
tool for 10 minutes (600 seconds). The test runs successfully when the maximum observed latency is lower thanMAXIMUM_LATENCY
(in this example, 20 μs). Latency spikes of 20 μs and above are generally not acceptable for telco RAN workloads.If the results exceed the latency threshold, the test fails.
ImportantFor valid results, the test should run for at least 12 hours.
Example failure output
running /usr/bin/cnftests -ginkgo.v -ginkgo.focus=cyclictest I0908 13:01:59.193776 27 request.go:601] Waited for 1.046228824s due to client-side throttling, not priority and fairness, request: GET:https://api.compute-1.example.com:6443/apis/packages.operators.coreos.com/v1?timeout=32s Running Suite: CNF Features e2e integration tests ================================================= Random Seed: 1662642118 Will run 1 of 194 specs [...] Summarizing 1 Failure: [Fail] [performance] Latency Test with the cyclictest image [It] should succeed /remote-source/app/vendor/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/4_latency/latency.go:220 Ran 1 of 194 Specs in 161.151 seconds FAIL! -- 0 Passed | 1 Failed | 0 Pending | 193 Skipped --- FAIL: TestTest (161.48s) FAIL
Example cyclictest results
The same output can indicate different results for different workloads. For example, spikes up to 18μs are acceptable for 4G DU workloads, but not for 5G DU workloads.
Example of good results
running cmd: cyclictest -q -D 10m -p 1 -t 16 -a 2,4,6,8,10,12,14,16,54,56,58,60,62,64,66,68 -h 30 -i 1000 -m # Histogram 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000001 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000002 579506 535967 418614 573648 532870 529897 489306 558076 582350 585188 583793 223781 532480 569130 472250 576043 More histogram entries ... # Total: 000600000 000600000 000600000 000599999 000599999 000599999 000599998 000599998 000599998 000599997 000599997 000599996 000599996 000599995 000599995 000599995 # Min Latencies: 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 # Avg Latencies: 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 # Max Latencies: 00005 00005 00004 00005 00004 00004 00005 00005 00006 00005 00004 00005 00004 00004 00005 00004 # Histogram Overflows: 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 # Histogram Overflow at cycle number: # Thread 0: # Thread 1: # Thread 2: # Thread 3: # Thread 4: # Thread 5: # Thread 6: # Thread 7: # Thread 8: # Thread 9: # Thread 10: # Thread 11: # Thread 12: # Thread 13: # Thread 14: # Thread 15:
Example of bad results
running cmd: cyclictest -q -D 10m -p 1 -t 16 -a 2,4,6,8,10,12,14,16,54,56,58,60,62,64,66,68 -h 30 -i 1000 -m # Histogram 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000001 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000002 564632 579686 354911 563036 492543 521983 515884 378266 592621 463547 482764 591976 590409 588145 589556 353518 More histogram entries ... # Total: 000599999 000599999 000599999 000599997 000599997 000599998 000599998 000599997 000599997 000599996 000599995 000599996 000599995 000599995 000599995 000599993 # Min Latencies: 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 # Avg Latencies: 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 # Max Latencies: 00493 00387 00271 00619 00541 00513 00009 00389 00252 00215 00539 00498 00363 00204 00068 00520 # Histogram Overflows: 00001 00001 00001 00002 00002 00001 00000 00001 00001 00001 00002 00001 00001 00001 00001 00002 # Histogram Overflow at cycle number: # Thread 0: 155922 # Thread 1: 110064 # Thread 2: 110064 # Thread 3: 110063 155921 # Thread 4: 110063 155921 # Thread 5: 155920 # Thread 6: # Thread 7: 110062 # Thread 8: 110062 # Thread 9: 155919 # Thread 10: 110061 155919 # Thread 11: 155918 # Thread 12: 155918 # Thread 13: 110060 # Thread 14: 110060 # Thread 15: 110059 155917
11.4.3. Running oslat
The oslat
test simulates a CPU-intensive DPDK application and measures all the interruptions and disruptions to test how the cluster handles CPU heavy data processing.
Always run the latency tests with DISCOVERY_MODE=true
set. If you don’t, the test suite will make changes to the running cluster configuration.
When executing podman
commands as a non-root or non-privileged user, mounting paths can fail with permission denied
errors. To make the podman
command work, append :Z
to the volumes creation; for example, -v $(pwd)/:/kubeconfig:Z
. This allows podman
to do the proper SELinux relabeling.
Prerequisites
-
You have logged in to
registry.redhat.io
with your Customer Portal credentials. - You have applied a cluster performance profile by using the Node Tuning Operator.
Procedure
To perform the
oslat
test, run the following command, substituting variable values as appropriate:$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e FEATURES=performance -e ROLE_WORKER_CNF=worker-cnf \ -e LATENCY_TEST_CPUS=10 -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.12 \ /usr/bin/test-run.sh -ginkgo.v -ginkgo.focus="oslat"
LATENCY_TEST_CPUS
specifies the list of CPUs to test with theoslat
command.The command runs the
oslat
tool for 10 minutes (600 seconds). The test runs successfully when the maximum observed latency is lower thanMAXIMUM_LATENCY
(20 μs).If the results exceed the latency threshold, the test fails.
ImportantFor valid results, the test should run for at least 12 hours.
Example failure output
running /usr/bin/cnftests -ginkgo.v -ginkgo.focus=oslat I0908 12:51:55.999393 27 request.go:601] Waited for 1.044848101s due to client-side throttling, not priority and fairness, request: GET:https://compute-1.example.com:6443/apis/machineconfiguration.openshift.io/v1?timeout=32s Running Suite: CNF Features e2e integration tests ================================================= Random Seed: 1662641514 Will run 1 of 194 specs [...] • Failure [77.833 seconds] [performance] Latency Test /remote-source/app/vendor/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/4_latency/latency.go:62 with the oslat image /remote-source/app/vendor/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/4_latency/latency.go:128 should succeed [It] /remote-source/app/vendor/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/4_latency/latency.go:153 The current latency 304 is bigger than the expected one 1 : 1 [...] Summarizing 1 Failure: [Fail] [performance] Latency Test with the oslat image [It] should succeed /remote-source/app/vendor/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/4_latency/latency.go:177 Ran 1 of 194 Specs in 161.091 seconds FAIL! -- 0 Passed | 1 Failed | 0 Pending | 193 Skipped --- FAIL: TestTest (161.42s) FAIL
- 1
- In this example, the measured latency is outside the maximum allowed value.
11.5. Generating a latency test failure report
Use the following procedures to generate a JUnit latency test output and test failure report.
Prerequisites
-
You have installed the OpenShift CLI (
oc
). -
You have logged in as a user with
cluster-admin
privileges.
Procedure
Create a test failure report with information about the cluster state and resources for troubleshooting by passing the
--report
parameter with the path to where the report is dumped:$ podman run -v $(pwd)/:/kubeconfig:Z -v $(pwd)/reportdest:<report_folder_path> \ -e KUBECONFIG=/kubeconfig/kubeconfig -e DISCOVERY_MODE=true -e FEATURES=performance \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.12 \ /usr/bin/test-run.sh --report <report_folder_path> \ -ginkgo.focus="\[performance\]\ Latency\ Test"
where:
- <report_folder_path>
- Is the path to the folder where the report is generated.
11.6. Generating a JUnit latency test report
Use the following procedures to generate a JUnit latency test output and test failure report.
Prerequisites
-
You have installed the OpenShift CLI (
oc
). -
You have logged in as a user with
cluster-admin
privileges.
Procedure
Create a JUnit-compliant XML report by passing the
--junit
parameter together with the path to where the report is dumped:$ podman run -v $(pwd)/:/kubeconfig:Z -v $(pwd)/junitdest:<junit_folder_path> \ -e KUBECONFIG=/kubeconfig/kubeconfig -e DISCOVERY_MODE=true -e FEATURES=performance \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.12 \ /usr/bin/test-run.sh --junit <junit_folder_path> \ -ginkgo.focus="\[performance\]\ Latency\ Test"
where:
- <junit_folder_path>
- Is the path to the folder where the junit report is generated
11.7. Running latency tests on a single-node OpenShift cluster
You can run latency tests on single-node OpenShift clusters.
Always run the latency tests with DISCOVERY_MODE=true
set. If you don’t, the test suite will make changes to the running cluster configuration.
When executing podman
commands as a non-root or non-privileged user, mounting paths can fail with permission denied
errors. To make the podman
command work, append :Z
to the volumes creation; for example, -v $(pwd)/:/kubeconfig:Z
. This allows podman
to do the proper SELinux relabeling.
Prerequisites
-
You have installed the OpenShift CLI (
oc
). -
You have logged in as a user with
cluster-admin
privileges.
Procedure
To run the latency tests on a single-node OpenShift cluster, run the following command:
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e DISCOVERY_MODE=true -e FEATURES=performance -e ROLE_WORKER_CNF=master \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.12 \ /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"
NoteROLE_WORKER_CNF=master
is required because master is the only machine pool to which the node belongs. For more information about setting the requiredMachineConfigPool
for the latency tests, see "Prerequisites for running latency tests".After running the test suite, all the dangling resources are cleaned up.
11.8. Running latency tests in a disconnected cluster
The CNF tests image can run tests in a disconnected cluster that is not able to reach external registries. This requires two steps:
-
Mirroring the
cnf-tests
image to the custom disconnected registry. - Instructing the tests to consume the images from the custom disconnected registry.
Mirroring the images to a custom registry accessible from the cluster
A mirror
executable is shipped in the image to provide the input required by oc
to mirror the test image to a local registry.
Run this command from an intermediate machine that has access to the cluster and registry.redhat.io:
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.12 \ /usr/bin/mirror -registry <disconnected_registry> | oc image mirror -f -
where:
- <disconnected_registry>
-
Is the disconnected mirror registry you have configured, for example,
my.local.registry:5000/
.
When you have mirrored the
cnf-tests
image into the disconnected registry, you must override the original registry used to fetch the images when running the tests, for example:$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e DISCOVERY_MODE=true -e FEATURES=performance -e IMAGE_REGISTRY="<disconnected_registry>" \ -e CNF_TESTS_IMAGE="cnf-tests-rhel8:v4.12" \ /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"
Configuring the tests to consume images from a custom registry
You can run the latency tests using a custom test image and image registry using CNF_TESTS_IMAGE
and IMAGE_REGISTRY
variables.
To configure the latency tests to use a custom test image and image registry, run the following command:
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e IMAGE_REGISTRY="<custom_image_registry>" \ -e CNF_TESTS_IMAGE="<custom_cnf-tests_image>" \ -e FEATURES=performance \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.12 /usr/bin/test-run.sh
where:
- <custom_image_registry>
-
is the custom image registry, for example,
custom.registry:5000/
. - <custom_cnf-tests_image>
-
is the custom cnf-tests image, for example,
custom-cnf-tests-image:latest
.
Mirroring images to the cluster OpenShift image registry
OpenShift Container Platform provides a built-in container image registry, which runs as a standard workload on the cluster.
Procedure
Gain external access to the registry by exposing it with a route:
$ oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
Fetch the registry endpoint by running the following command:
$ REGISTRY=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}')
Create a namespace for exposing the images:
$ oc create ns cnftests
Make the image stream available to all the namespaces used for tests. This is required to allow the tests namespaces to fetch the images from the
cnf-tests
image stream. Run the following commands:$ oc policy add-role-to-user system:image-puller system:serviceaccount:cnf-features-testing:default --namespace=cnftests
$ oc policy add-role-to-user system:image-puller system:serviceaccount:performance-addon-operators-testing:default --namespace=cnftests
Retrieve the docker secret name and auth token by running the following commands:
$ SECRET=$(oc -n cnftests get secret | grep builder-docker | awk {'print $1'}
$ TOKEN=$(oc -n cnftests get secret $SECRET -o jsonpath="{.data['\.dockercfg']}" | base64 --decode | jq '.["image-registry.openshift-image-registry.svc:5000"].auth')
Create a
dockerauth.json
file, for example:$ echo "{\"auths\": { \"$REGISTRY\": { \"auth\": $TOKEN } }}" > dockerauth.json
Do the image mirroring:
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ registry.redhat.io/openshift4/cnf-tests-rhel8:4.12 \ /usr/bin/mirror -registry $REGISTRY/cnftests | oc image mirror --insecure=true \ -a=$(pwd)/dockerauth.json -f -
Run the tests:
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e DISCOVERY_MODE=true -e FEATURES=performance -e IMAGE_REGISTRY=image-registry.openshift-image-registry.svc:5000/cnftests \ cnf-tests-local:latest /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"
Mirroring a different set of test images
You can optionally change the default upstream images that are mirrored for the latency tests.
Procedure
The
mirror
command tries to mirror the upstream images by default. This can be overridden by passing a file with the following format to the image:[ { "registry": "public.registry.io:5000", "image": "imageforcnftests:4.12" } ]
Pass the file to the
mirror
command, for example saving it locally asimages.json
. With the following command, the local path is mounted in/kubeconfig
inside the container and that can be passed to the mirror command.$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.12 /usr/bin/mirror \ --registry "my.local.registry:5000/" --images "/kubeconfig/images.json" \ | oc image mirror -f -
11.9. Troubleshooting errors with the cnf-tests container
To run latency tests, the cluster must be accessible from within the cnf-tests
container.
Prerequisites
-
You have installed the OpenShift CLI (
oc
). -
You have logged in as a user with
cluster-admin
privileges.
Procedure
Verify that the cluster is accessible from inside the
cnf-tests
container by running the following command:$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.12 \ oc get nodes
If this command does not work, an error related to spanning across DNS, MTU size, or firewall access might be occurring.
Chapter 12. Improving cluster stability in high latency environments using worker latency profiles
All nodes send heartbeats to the Kubernetes Controller Manager Operator (kube controller) in the OpenShift Container Platform cluster every 10 seconds, by default. If the cluster does not receive heartbeats from a node, OpenShift Container Platform responds using several default mechanisms.
For example, if the Kubernetes Controller Manager Operator loses contact with a node after a configured period:
-
The node controller on the control plane updates the node health to
Unhealthy
and marks the nodeReady
condition asUnknown
. - In response, the scheduler stops scheduling pods to that node.
-
The on-premise node controller adds a
node.kubernetes.io/unreachable
taint with aNoExecute
effect to the node and schedules any pods on the node for eviction after five minutes, by default.
This behavior can cause problems if your network is prone to latency issues, especially if you have nodes at the network edge. In some cases, the Kubernetes Controller Manager Operator might not receive an update from a healthy node due to network latency. The Kubernetes Controller Manager Operator would then evict pods from the node even though the node is healthy. To avoid this problem, you can use worker latency profiles to adjust the frequency that the kubelet and the Kubernetes Controller Manager Operator wait for status updates before taking action. These adjustments help to ensure that your cluster runs properly in the event that network latency between the control plane and the worker nodes is not optimal.
These worker latency profiles are three sets of parameters that are pre-defined with carefully tuned values that let you control the reaction of the cluster to latency issues without needing to determine the best values manually.
You can configure worker latency profiles when installing a cluster or at any time you notice increased latency in your cluster network.
12.1. Understanding worker latency profiles
Worker latency profiles are multiple sets of carefully-tuned values for the node-status-update-frequency
, node-monitor-grace-period
, default-not-ready-toleration-seconds
and default-unreachable-toleration-seconds
parameters. These parameters let you control the reaction of the cluster to latency issues without needing to determine the best values manually.
All worker latency profiles configure the following parameters:
-
node-status-update-frequency
. Specifies the amount of time in seconds that a kubelet updates its status to the Kubernetes Controller Manager Operator. -
node-monitor-grace-period
. Specifies the amount of time in seconds that the Kubernetes Controller Manager Operator waits for an update from a kubelet before marking the node unhealthy and adding thenode.kubernetes.io/not-ready
ornode.kubernetes.io/unreachable
taint to the node. -
default-not-ready-toleration-seconds
. Specifies the amount of time in seconds after marking a node unhealthy that the Kubernetes Controller Manager Operator waits before evicting pods from that node. -
default-unreachable-toleration-seconds
. Specifies the amount of time in seconds after marking a node unreachable that the Kubernetes Controller Manager Operator waits before evicting pods from that node.
Manually modifying the node-monitor-grace-period
parameter is not supported.
The following Operators monitor the changes to the worker latency profiles and respond accordingly:
-
The Machine Config Operator (MCO) updates the
node-status-update-frequency
parameter on the worker nodes. -
The Kubernetes Controller Manager Operator updates the
node-monitor-grace-period
parameter on the control plane nodes. -
The Kubernetes API Server Operator updates the
default-not-ready-toleration-seconds
anddefault-unreachable-toleration-seconds
parameters on the control plance nodes.
While the default configuration works in most cases, OpenShift Container Platform offers two other worker latency profiles for situations where the network is experiencing higher latency than usual. The three worker latency profiles are described in the following sections:
- Default worker latency profile
With the
Default
profile, each kubelet reports its node status to the Kubelet Controller Manager Operator (kube controller) every 10 seconds. The Kubelet Controller Manager Operator checks the kubelet for a status every 5 seconds.The Kubernetes Controller Manager Operator waits 40 seconds for a status update before considering that node unhealthy. It marks the node with the
node.kubernetes.io/not-ready
ornode.kubernetes.io/unreachable
taint and evicts the pods on that node. If a pod on that node has theNoExecute
toleration, the pod gets evicted in 300 seconds. If the pod has thetolerationSeconds
parameter, the eviction waits for the period specified by that parameter.Profile Component Parameter Value Default
kubelet
node-status-update-frequency
10s
Kubelet Controller Manager
node-monitor-grace-period
40s
Kubernetes API Server
default-not-ready-toleration-seconds
300s
Kubernetes API Server
default-unreachable-toleration-seconds
300s
- Medium worker latency profile
Use the
MediumUpdateAverageReaction
profile if the network latency is slightly higher than usual.The
MediumUpdateAverageReaction
profile reduces the frequency of kubelet updates to 20 seconds and changes the period that the Kubernetes Controller Manager Operator waits for those updates to 2 minutes. The pod eviction period for a pod on that node is reduced to 60 seconds. If the pod has thetolerationSeconds
parameter, the eviction waits for the period specified by that parameter.The Kubernetes Controller Manager Operator waits for 2 minutes to consider a node unhealthy. In another minute, the eviction process starts.
Profile Component Parameter Value MediumUpdateAverageReaction
kubelet
node-status-update-frequency
20s
Kubelet Controller Manager
node-monitor-grace-period
2m
Kubernetes API Server
default-not-ready-toleration-seconds
60s
Kubernetes API Server
default-unreachable-toleration-seconds
60s
- Low worker latency profile
Use the
LowUpdateSlowReaction
profile if the network latency is extremely high.The
LowUpdateSlowReaction
profile reduces the frequency of kubelet updates to 1 minute and changes the period that the Kubernetes Controller Manager Operator waits for those updates to 5 minutes. The pod eviction period for a pod on that node is reduced to 60 seconds. If the pod has thetolerationSeconds
parameter, the eviction waits for the period specified by that parameter.The Kubernetes Controller Manager Operator waits for 5 minutes to consider a node unhealthy. In another minute, the eviction process starts.
Profile Component Parameter Value LowUpdateSlowReaction
kubelet
node-status-update-frequency
1m
Kubelet Controller Manager
node-monitor-grace-period
5m
Kubernetes API Server
default-not-ready-toleration-seconds
60s
Kubernetes API Server
default-unreachable-toleration-seconds
60s
12.2. Using worker latency profiles
To implement a worker latency profile to deal with network latency, edit the node.config
object to add the name of the profile. You can change the profile at any time as latency increases or decreases.
You must move one worker latency profile at a time. For example, you cannot move directly from the Default
profile to the LowUpdateSlowReaction
worker latency profile. You must move from the default
worker latency profile to the MediumUpdateAverageReaction
profile first, then to LowUpdateSlowReaction
. Similarly, when returning to the default profile, you must move from the low profile to the medium profile first, then to the default.
You can also configure worker latency profiles upon installing an OpenShift Container Platform cluster.
Procedure
To move from the default worker latency profile:
Move to the medium worker latency profile:
Edit the
node.config
object:$ oc edit nodes.config/cluster
Add
spec.workerLatencyProfile: MediumUpdateAverageReaction
:Example
node.config
objectapiVersion: config.openshift.io/v1 kind: Node metadata: annotations: include.release.openshift.io/ibm-cloud-managed: "true" include.release.openshift.io/self-managed-high-availability: "true" include.release.openshift.io/single-node-developer: "true" release.openshift.io/create-only: "true" creationTimestamp: "2022-07-08T16:02:51Z" generation: 1 name: cluster ownerReferences: - apiVersion: config.openshift.io/v1 kind: ClusterVersion name: version uid: 36282574-bf9f-409e-a6cd-3032939293eb resourceVersion: "1865" uid: 0c0f7a4c-4307-4187-b591-6155695ac85b spec: workerLatencyProfile: MediumUpdateAverageReaction 1 ...
- 1
- Specifies the medium worker latency policy.
Scheduling on each worker node is disabled as the change is being applied.
When all nodes return to the
Ready
condition, you can use the following command to look in the Kubernetes Controller Manager to ensure it was applied:$ oc get KubeControllerManager -o yaml | grep -i workerlatency -A 5 -B 5
Example output
... - lastTransitionTime: "2022-07-11T19:47:10Z" reason: ProfileUpdated status: "False" type: WorkerLatencyProfileProgressing - lastTransitionTime: "2022-07-11T19:47:10Z" 1 message: all static pod revision(s) have updated latency profile reason: ProfileUpdated status: "True" type: WorkerLatencyProfileComplete - lastTransitionTime: "2022-07-11T19:20:11Z" reason: AsExpected status: "False" type: WorkerLatencyProfileDegraded - lastTransitionTime: "2022-07-11T19:20:36Z" status: "False" ...
- 1
- Specifies that the profile is applied and active.
Optional: Move to the low worker latency profile:
Edit the
node.config
object:$ oc edit nodes.config/cluster
Change the
spec.workerLatencyProfile
value toLowUpdateSlowReaction
:Example
node.config
objectapiVersion: config.openshift.io/v1 kind: Node metadata: annotations: include.release.openshift.io/ibm-cloud-managed: "true" include.release.openshift.io/self-managed-high-availability: "true" include.release.openshift.io/single-node-developer: "true" release.openshift.io/create-only: "true" creationTimestamp: "2022-07-08T16:02:51Z" generation: 1 name: cluster ownerReferences: - apiVersion: config.openshift.io/v1 kind: ClusterVersion name: version uid: 36282574-bf9f-409e-a6cd-3032939293eb resourceVersion: "1865" uid: 0c0f7a4c-4307-4187-b591-6155695ac85b spec: workerLatencyProfile: LowUpdateSlowReaction 1 ...
- 1
- Specifies to use the low worker latency policy.
Scheduling on each worker node is disabled as the change is being applied.
To change the low profile to medium or change the medium to low, edit the node.config
object and set the spec.workerLatencyProfile
parameter to the appropriate value.
Chapter 13. Topology Aware Lifecycle Manager for cluster updates
You can use the Topology Aware Lifecycle Manager (TALM) to manage the software lifecycle of multiple clusters. TALM uses Red Hat Advanced Cluster Management (RHACM) policies to perform changes on the target clusters.
13.1. About the Topology Aware Lifecycle Manager configuration
The Topology Aware Lifecycle Manager (TALM) manages the deployment of Red Hat Advanced Cluster Management (RHACM) policies for one or more OpenShift Container Platform clusters. Using TALM in a large network of clusters allows the phased rollout of policies to the clusters in limited batches. This helps to minimize possible service disruptions when updating. With TALM, you can control the following actions:
- The timing of the update
- The number of RHACM-managed clusters
- The subset of managed clusters to apply the policies to
- The update order of the clusters
- The set of policies remediated to the cluster
- The order of policies remediated to the cluster
- The assignment of a canary cluster
For single-node OpenShift, the Topology Aware Lifecycle Manager (TALM) offers the following features:
- Create a backup of a deployment before an upgrade
- Pre-caching images for clusters with limited bandwidth
TALM supports the orchestration of the OpenShift Container Platform y-stream and z-stream updates, and day-two operations on y-streams and z-streams.
13.2. About managed policies used with Topology Aware Lifecycle Manager
The Topology Aware Lifecycle Manager (TALM) uses RHACM policies for cluster updates.
TALM can be used to manage the rollout of any policy CR where the remediationAction
field is set to inform
. Supported use cases include the following:
- Manual user creation of policy CRs
-
Automatically generated policies from the
PolicyGenTemplate
custom resource definition (CRD)
For policies that update an Operator subscription with manual approval, TALM provides additional functionality that approves the installation of the updated Operator.
For more information about managed policies, see Policy Overview in the RHACM documentation.
For more information about the PolicyGenTemplate
CRD, see the "About the PolicyGenTemplate CRD" section in "Configuring managed clusters with policies and PolicyGenTemplate resources".
13.3. Installing the Topology Aware Lifecycle Manager by using the web console
You can use the OpenShift Container Platform web console to install the Topology Aware Lifecycle Manager.
Prerequisites
- Install the latest version of the RHACM Operator.
- Set up a hub cluster with disconnected regitry.
-
Log in as a user with
cluster-admin
privileges.
Procedure
- In the OpenShift Container Platform web console, navigate to Operators → OperatorHub.
- Search for the Topology Aware Lifecycle Manager from the list of available Operators, and then click Install.
- Keep the default selection of Installation mode ["All namespaces on the cluster (default)"] and Installed Namespace ("openshift-operators") to ensure that the Operator is installed properly.
- Click Install.
Verification
To confirm that the installation is successful:
- Navigate to the Operators → Installed Operators page.
-
Check that the Operator is installed in the
All Namespaces
namespace and its status isSucceeded
.
If the Operator is not installed successfully:
-
Navigate to the Operators → Installed Operators page and inspect the
Status
column for any errors or failures. -
Navigate to the Workloads → Pods page and check the logs in any containers in the
cluster-group-upgrades-controller-manager
pod that are reporting issues.
13.4. Installing the Topology Aware Lifecycle Manager by using the CLI
You can use the OpenShift CLI (oc
) to install the Topology Aware Lifecycle Manager (TALM).
Prerequisites
-
Install the OpenShift CLI (
oc
). - Install the latest version of the RHACM Operator.
- Set up a hub cluster with disconnected registry.
-
Log in as a user with
cluster-admin
privileges.
Procedure
Create a
Subscription
CR:Define the
Subscription
CR and save the YAML file, for example,talm-subscription.yaml
:apiVersion: operators.coreos.com/v1alpha1 kind: Subscription metadata: name: openshift-topology-aware-lifecycle-manager-subscription namespace: openshift-operators spec: channel: "stable" name: topology-aware-lifecycle-manager source: redhat-operators sourceNamespace: openshift-marketplace
Create the
Subscription
CR by running the following command:$ oc create -f talm-subscription.yaml
Verification
Verify that the installation succeeded by inspecting the CSV resource:
$ oc get csv -n openshift-operators
Example output
NAME DISPLAY VERSION REPLACES PHASE topology-aware-lifecycle-manager.4.12.x Topology Aware Lifecycle Manager 4.12.x Succeeded
Verify that the TALM is up and running:
$ oc get deploy -n openshift-operators
Example output
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE openshift-operators cluster-group-upgrades-controller-manager 1/1 1 1 14s
13.5. About the ClusterGroupUpgrade CR
The Topology Aware Lifecycle Manager (TALM) builds the remediation plan from the ClusterGroupUpgrade
CR for a group of clusters. You can define the following specifications in a ClusterGroupUpgrade
CR:
- Clusters in the group
-
Blocking
ClusterGroupUpgrade
CRs - Applicable list of managed policies
- Number of concurrent updates
- Applicable canary updates
- Actions to perform before and after the update
- Update timing
You can control the start time of an update using the enable
field in the ClusterGroupUpgrade
CR. For example, if you have a scheduled maintenance window of four hours, you can prepare a ClusterGroupUpgrade
CR with the enable
field set to false
.
You can set the timeout by configuring the spec.remediationStrategy.timeout
setting as follows:
spec remediationStrategy: maxConcurrency: 1 timeout: 240
You can use the batchTimeoutAction
to determine what happens if an update fails for a cluster. You can specify continue
to skip the failing cluster and continue to upgrade other clusters, or abort
to stop policy remediation for all clusters. Once the timeout elapses, TALM removes all enforce
policies to ensure that no further updates are made to clusters.
To apply the changes, you set the enabled
field to true
.
For more information see the "Applying update policies to managed clusters" section.
As TALM works through remediation of the policies to the specified clusters, the ClusterGroupUpgrade
CR can report true or false statuses for a number of conditions.
After TALM completes a cluster update, the cluster does not update again under the control of the same ClusterGroupUpgrade
CR. You must create a new ClusterGroupUpgrade
CR in the following cases:
- When you need to update the cluster again
-
When the cluster changes to non-compliant with the
inform
policy after being updated
13.5.1. Selecting clusters
TALM builds a remediation plan and selects clusters based on the following fields:
-
The
clusterLabelSelector
field specifies the labels of the clusters that you want to update. This consists of a list of the standard label selectors fromk8s.io/apimachinery/pkg/apis/meta/v1
. Each selector in the list uses either label value pairs or label expressions. Matches from each selector are added to the final list of clusters along with the matches from theclusterSelector
field and thecluster
field. -
The
clusters
field specifies a list of clusters to update. -
The
canaries
field specifies the clusters for canary updates. -
The
maxConcurrency
field specifies the number of clusters to update in a batch.
You can use the clusters
, clusterLabelSelector
, and clusterSelector
fields together to create a combined list of clusters.
The remediation plan starts with the clusters listed in the canaries
field. Each canary cluster forms a single-cluster batch.
Sample ClusterGroupUpgrade
CR with the enabled field
set to false
apiVersion: ran.openshift.io/v1alpha1 kind: ClusterGroupUpgrade metadata: creationTimestamp: '2022-11-18T16:27:15Z' finalizers: - ran.openshift.io/cleanup-finalizer generation: 1 name: talm-cgu namespace: talm-namespace resourceVersion: '40451823' uid: cca245a5-4bca-45fa-89c0-aa6af81a596c Spec: actions: afterCompletion: deleteObjects: true beforeEnable: {} backup: false clusters: 1 - spoke1 enable: false 2 managedPolicies: 3 - talm-policy preCaching: false remediationStrategy: 4 canaries: 5 - spoke1 maxConcurrency: 2 6 timeout: 240 clusterLabelSelectors: 7 - matchExpressions: - key: label1 operator: In values: - value1a - value1b batchTimeoutAction: 8 status: 9 computedMaxConcurrency: 2 conditions: - lastTransitionTime: '2022-11-18T16:27:15Z' message: All selected clusters are valid reason: ClusterSelectionCompleted status: 'True' type: ClustersSelected 10 - lastTransitionTime: '2022-11-18T16:27:15Z' message: Completed validation reason: ValidationCompleted status: 'True' type: Validated 11 - lastTransitionTime: '2022-11-18T16:37:16Z' message: Not enabled reason: NotEnabled status: 'False' type: Progressing managedPoliciesForUpgrade: - name: talm-policy namespace: talm-namespace managedPoliciesNs: talm-policy: talm-namespace remediationPlan: - - spoke1 - - spoke2 - spoke3 status:
- 1
- Defines the list of clusters to update.
- 2
- The
enable
field is set tofalse
. - 3
- Lists the user-defined set of policies to remediate.
- 4
- Defines the specifics of the cluster updates.
- 5
- Defines the clusters for canary updates.
- 6
- Defines the maximum number of concurrent updates in a batch. The number of remediation batches is the number of canary clusters, plus the number of clusters, except the canary clusters, divided by the maxConcurrency value. The clusters that are already compliant with all the managed policies are excluded from the remediation plan.
- 7
- Displays the parameters for selecting clusters.
- 8
- Controls what happens if a batch times out. Possible values are
abort
orcontinue
. If unspecified, the default iscontinue
. - 9
- Displays information about the status of the updates.
- 10
- The
ClustersSelected
condition shows that all selected clusters are valid. - 11
- The
Validated
condition shows that all selected clusters have been validated.
Any failures during the update of a canary cluster stops the update process.
When the remediation plan is successfully created, you can you set the enable
field to true
and TALM starts to update the non-compliant clusters with the specified managed policies.
You can only make changes to the spec
fields if the enable
field of the ClusterGroupUpgrade
CR is set to false
.
13.5.2. Validating
TALM checks that all specified managed policies are available and correct, and uses the Validated
condition to report the status and reasons as follows:
true
Validation is completed.
false
Policies are missing or invalid, or an invalid platform image has been specified.
13.5.3. Pre-caching
Clusters might have limited bandwidth to access the container image registry, which can cause a timeout before the updates are completed. On single-node OpenShift clusters, you can use pre-caching to avoid this. The container image pre-caching starts when you create a ClusterGroupUpgrade
CR with the preCaching
field set to true
.
TALM uses the PrecacheSpecValid
condition to report status information as follows:
true
The pre-caching spec is valid and consistent.
false
The pre-caching spec is incomplete.
TALM uses the PrecachingSucceeded
condition to report status information as follows:
true
TALM has concluded the pre-caching process. If pre-caching fails for any cluster, the update fails for that cluster but proceeds for all other clusters. A message informs you if pre-caching has failed for any clusters.
false
Pre-caching is still in progress for one or more clusters or has failed for all clusters.
For more information see the "Using the container image pre-cache feature" section.
13.5.4. Creating a backup
For single-node OpenShift, TALM can create a backup of a deployment before an update. If the update fails, you can recover the previous version and restore a cluster to a working state without requiring a reprovision of applications. To use the backup feature you first create a ClusterGroupUpgrade
CR with the backup
field set to true
. To ensure that the contents of the backup are up to date, the backup is not taken until you set the enable
field in the ClusterGroupUpgrade
CR to true
.
TALM uses the BackupSucceeded
condition to report the status and reasons as follows:
true
Backup is completed for all clusters or the backup run has completed but failed for one or more clusters. If backup fails for any cluster, the update fails for that cluster but proceeds for all other clusters.
false
Backup is still in progress for one or more clusters or has failed for all clusters.
For more information, see the "Creating a backup of cluster resources before upgrade" section.
13.5.5. Updating clusters
TALM enforces the policies following the remediation plan. Enforcing the policies for subsequent batches starts immediately after all the clusters of the current batch are compliant with all the managed policies. If the batch times out, TALM moves on to the next batch. The timeout value of a batch is the spec.timeout
field divided by the number of batches in the remediation plan.
TALM uses the Progressing
condition to report the status and reasons as follows:
true
TALM is remediating non-compliant policies.
false
The update is not in progress. Possible reasons for this are:
- All clusters are compliant with all the managed policies.
- The update has timed out as policy remediation took too long.
- Blocking CRs are missing from the system or have not yet completed.
-
The
ClusterGroupUpgrade
CR is not enabled. - Backup is still in progress.
The managed policies apply in the order that they are listed in the managedPolicies
field in the ClusterGroupUpgrade
CR. One managed policy is applied to the specified clusters at a time. When a cluster complies with the current policy, the next managed policy is applied to it.
Sample ClusterGroupUpgrade
CR in the Progressing
state
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
creationTimestamp: '2022-11-18T16:27:15Z'
finalizers:
- ran.openshift.io/cleanup-finalizer
generation: 1
name: talm-cgu
namespace: talm-namespace
resourceVersion: '40451823'
uid: cca245a5-4bca-45fa-89c0-aa6af81a596c
Spec:
actions:
afterCompletion:
deleteObjects: true
beforeEnable: {}
backup: false
clusters:
- spoke1
enable: true
managedPolicies:
- talm-policy
preCaching: true
remediationStrategy:
canaries:
- spoke1
maxConcurrency: 2
timeout: 240
clusterLabelSelectors:
- matchExpressions:
- key: label1
operator: In
values:
- value1a
- value1b
batchTimeoutAction:
status:
clusters:
- name: spoke1
state: complete
computedMaxConcurrency: 2
conditions:
- lastTransitionTime: '2022-11-18T16:27:15Z'
message: All selected clusters are valid
reason: ClusterSelectionCompleted
status: 'True'
type: ClustersSelected
- lastTransitionTime: '2022-11-18T16:27:15Z'
message: Completed validation
reason: ValidationCompleted
status: 'True'
type: Validated
- lastTransitionTime: '2022-11-18T16:37:16Z'
message: Remediating non-compliant policies
reason: InProgress
status: 'True'
type: Progressing 1
managedPoliciesForUpgrade:
- name: talm-policy
namespace: talm-namespace
managedPoliciesNs:
talm-policy: talm-namespace
remediationPlan:
- - spoke1
- - spoke2
- spoke3
status:
currentBatch: 2
currentBatchRemediationProgress:
spoke2:
state: Completed
spoke3:
policyIndex: 0
state: InProgress
currentBatchStartedAt: '2022-11-18T16:27:16Z'
startedAt: '2022-11-18T16:27:15Z'
- 1
- The
Progressing
fields show that TALM is in the process of remediating policies.
13.5.6. Update status
TALM uses the Succeeded
condition to report the status and reasons as follows:
true
All clusters are compliant with the specified managed policies.
false
Policy remediation failed as there were no clusters available for remediation, or because policy remediation took too long for one of the following reasons:
- The current batch contains canary updates and the cluster in the batch does not comply with all the managed policies within the batch timeout.
-
Clusters did not comply with the managed policies within the
timeout
value specified in theremediationStrategy
field.
Sample ClusterGroupUpgrade
CR in the Succeeded
state
apiVersion: ran.openshift.io/v1alpha1 kind: ClusterGroupUpgrade metadata: name: cgu-upgrade-complete namespace: default spec: clusters: - spoke1 - spoke4 enable: true managedPolicies: - policy1-common-cluster-version-policy - policy2-common-pao-sub-policy remediationStrategy: maxConcurrency: 1 timeout: 240 status: 1 clusters: - name: spoke1 state: complete - name: spoke4 state: complete conditions: - message: All selected clusters are valid reason: ClusterSelectionCompleted status: "True" type: ClustersSelected - message: Completed validation reason: ValidationCompleted status: "True" type: Validated - message: All clusters are compliant with all the managed policies reason: Completed status: "False" type: Progressing 2 - message: All clusters are compliant with all the managed policies reason: Completed status: "True" type: Succeeded 3 managedPoliciesForUpgrade: - name: policy1-common-cluster-version-policy namespace: default - name: policy2-common-pao-sub-policy namespace: default remediationPlan: - - spoke1 - - spoke4 status: completedAt: '2022-11-18T16:27:16Z' startedAt: '2022-11-18T16:27:15Z'
- 2
- In the
Progressing
fields, the status isfalse
as the update has completed; clusters are compliant with all the managed policies. - 3
- The
Succeeded
fields show that the validations completed successfully. - 1
- The
status
field includes a list of clusters and their respective statuses. The status of a cluster can becomplete
ortimedout
.
Sample ClusterGroupUpgrade
CR in the timedout
state
apiVersion: ran.openshift.io/v1alpha1 kind: ClusterGroupUpgrade metadata: creationTimestamp: '2022-11-18T16:27:15Z' finalizers: - ran.openshift.io/cleanup-finalizer generation: 1 name: talm-cgu namespace: talm-namespace resourceVersion: '40451823' uid: cca245a5-4bca-45fa-89c0-aa6af81a596c spec: actions: afterCompletion: deleteObjects: true beforeEnable: {} backup: false clusters: - spoke1 - spoke2 enable: true managedPolicies: - talm-policy preCaching: false remediationStrategy: maxConcurrency: 2 timeout: 240 status: clusters: - name: spoke1 state: complete - currentPolicy: 1 name: talm-policy status: NonCompliant name: spoke2 state: timedout computedMaxConcurrency: 2 conditions: - lastTransitionTime: '2022-11-18T16:27:15Z' message: All selected clusters are valid reason: ClusterSelectionCompleted status: 'True' type: ClustersSelected - lastTransitionTime: '2022-11-18T16:27:15Z' message: Completed validation reason: ValidationCompleted status: 'True' type: Validated - lastTransitionTime: '2022-11-18T16:37:16Z' message: Policy remediation took too long reason: TimedOut status: 'False' type: Progressing - lastTransitionTime: '2022-11-18T16:37:16Z' message: Policy remediation took too long reason: TimedOut status: 'False' type: Succeeded 2 managedPoliciesForUpgrade: - name: talm-policy namespace: talm-namespace managedPoliciesNs: talm-policy: talm-namespace remediationPlan: - - spoke1 - spoke2 status: startedAt: '2022-11-18T16:27:15Z' completedAt: '2022-11-18T20:27:15Z'
13.5.7. Blocking ClusterGroupUpgrade CRs
You can create multiple ClusterGroupUpgrade
CRs and control their order of application.
For example, if you create ClusterGroupUpgrade
CR C that blocks the start of ClusterGroupUpgrade
CR A, then ClusterGroupUpgrade
CR A cannot start until the status of ClusterGroupUpgrade
CR C becomes UpgradeComplete
.
One ClusterGroupUpgrade
CR can have multiple blocking CRs. In this case, all the blocking CRs must complete before the upgrade for the current CR can start.
Prerequisites
- Install the Topology Aware Lifecycle Manager (TALM).
- Provision one or more managed clusters.
-
Log in as a user with
cluster-admin
privileges. - Create RHACM policies in the hub cluster.
Procedure
Save the content of the
ClusterGroupUpgrade
CRs in thecgu-a.yaml
,cgu-b.yaml
, andcgu-c.yaml
files.apiVersion: ran.openshift.io/v1alpha1 kind: ClusterGroupUpgrade metadata: name: cgu-a namespace: default spec: blockingCRs: 1 - name: cgu-c namespace: default clusters: - spoke1 - spoke2 - spoke3 enable: false managedPolicies: - policy1-common-cluster-version-policy - policy2-common-pao-sub-policy - policy3-common-ptp-sub-policy remediationStrategy: canaries: - spoke1 maxConcurrency: 2 timeout: 240 status: conditions: - message: The ClusterGroupUpgrade CR is not enabled reason: UpgradeNotStarted status: "False" type: Ready copiedPolicies: - cgu-a-policy1-common-cluster-version-policy - cgu-a-policy2-common-pao-sub-policy - cgu-a-policy3-common-ptp-sub-policy managedPoliciesForUpgrade: - name: policy1-common-cluster-version-policy namespace: default - name: policy2-common-pao-sub-policy namespace: default - name: policy3-common-ptp-sub-policy namespace: default placementBindings: - cgu-a-policy1-common-cluster-version-policy - cgu-a-policy2-common-pao-sub-policy - cgu-a-policy3-common-ptp-sub-policy placementRules: - cgu-a-policy1-common-cluster-version-policy - cgu-a-policy2-common-pao-sub-policy - cgu-a-policy3-common-ptp-sub-policy remediationPlan: - - spoke1 - - spoke2
- 1
- Defines the blocking CRs. The
cgu-a
update cannot start untilcgu-c
is complete.
apiVersion: ran.openshift.io/v1alpha1 kind: ClusterGroupUpgrade metadata: name: cgu-b namespace: default spec: blockingCRs: 1 - name: cgu-a namespace: default clusters: - spoke4 - spoke5 enable: false managedPolicies: - policy1-common-cluster-version-policy - policy2-common-pao-sub-policy - policy3-common-ptp-sub-policy - policy4-common-sriov-sub-policy remediationStrategy: maxConcurrency: 1 timeout: 240 status: conditions: - message: The ClusterGroupUpgrade CR is not enabled reason: UpgradeNotStarted status: "False" type: Ready copiedPolicies: - cgu-b-policy1-common-cluster-version-policy - cgu-b-policy2-common-pao-sub-policy - cgu-b-policy3-common-ptp-sub-policy - cgu-b-policy4-common-sriov-sub-policy managedPoliciesForUpgrade: - name: policy1-common-cluster-version-policy namespace: default - name: policy2-common-pao-sub-policy namespace: default - name: policy3-common-ptp-sub-policy namespace: default - name: policy4-common-sriov-sub-policy namespace: default placementBindings: - cgu-b-policy1-common-cluster-version-policy - cgu-b-policy2-common-pao-sub-policy - cgu-b-policy3-common-ptp-sub-policy - cgu-b-policy4-common-sriov-sub-policy placementRules: - cgu-b-policy1-common-cluster-version-policy - cgu-b-policy2-common-pao-sub-policy - cgu-b-policy3-common-ptp-sub-policy - cgu-b-policy4-common-sriov-sub-policy remediationPlan: - - spoke4 - - spoke5 status: {}
- 1
- The
cgu-b
update cannot start untilcgu-a
is complete.
apiVersion: ran.openshift.io/v1alpha1 kind: ClusterGroupUpgrade metadata: name: cgu-c namespace: default spec: 1 clusters: - spoke6 enable: false managedPolicies: - policy1-common-cluster-version-policy - policy2-common-pao-sub-policy - policy3-common-ptp-sub-policy - policy4-common-sriov-sub-policy remediationStrategy: maxConcurrency: 1 timeout: 240 status: conditions: - message: The ClusterGroupUpgrade CR is not enabled reason: UpgradeNotStarted status: "False" type: Ready copiedPolicies: - cgu-c-policy1-common-cluster-version-policy - cgu-c-policy4-common-sriov-sub-policy managedPoliciesCompliantBeforeUpgrade: - policy2-common-pao-sub-policy - policy3-common-ptp-sub-policy managedPoliciesForUpgrade: - name: policy1-common-cluster-version-policy namespace: default - name: policy4-common-sriov-sub-policy namespace: default placementBindings: - cgu-c-policy1-common-cluster-version-policy - cgu-c-policy4-common-sriov-sub-policy placementRules: - cgu-c-policy1-common-cluster-version-policy - cgu-c-policy4-common-sriov-sub-policy remediationPlan: - - spoke6 status: {}
- 1
- The
cgu-c
update does not have any blocking CRs. TALM starts thecgu-c
update when theenable
field is set totrue
.
Create the
ClusterGroupUpgrade
CRs by running the following command for each relevant CR:$ oc apply -f <name>.yaml
Start the update process by running the following command for each relevant CR:
$ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/<name> \ --type merge -p '{"spec":{"enable":true}}'
The following examples show
ClusterGroupUpgrade
CRs where theenable
field is set totrue
:Example for
cgu-a
with blocking CRsapiVersion: ran.openshift.io/v1alpha1 kind: ClusterGroupUpgrade metadata: name: cgu-a namespace: default spec: blockingCRs: - name: cgu-c namespace: default clusters: - spoke1 - spoke2 - spoke3 enable: true managedPolicies: - policy1-common-cluster-version-policy - policy2-common-pao-sub-policy - policy3-common-ptp-sub-policy remediationStrategy: canaries: - spoke1 maxConcurrency: 2 timeout: 240 status: conditions: - message: 'The ClusterGroupUpgrade CR is blocked by other CRs that have not yet completed: [cgu-c]' 1 reason: UpgradeCannotStart status: "False" type: Ready copiedPolicies: - cgu-a-policy1-common-cluster-version-policy - cgu-a-policy2-common-pao-sub-policy - cgu-a-policy3-common-ptp-sub-policy managedPoliciesForUpgrade: - name: policy1-common-cluster-version-policy namespace: default - name: policy2-common-pao-sub-policy namespace: default - name: policy3-common-ptp-sub-policy namespace: default placementBindings: - cgu-a-policy1-common-cluster-version-policy - cgu-a-policy2-common-pao-sub-policy - cgu-a-policy3-common-ptp-sub-policy placementRules: - cgu-a-policy1-common-cluster-version-policy - cgu-a-policy2-common-pao-sub-policy - cgu-a-policy3-common-ptp-sub-policy remediationPlan: - - spoke1 - - spoke2 status: {}
- 1
- Shows the list of blocking CRs.
Example for
cgu-b
with blocking CRsapiVersion: ran.openshift.io/v1alpha1 kind: ClusterGroupUpgrade metadata: name: cgu-b namespace: default spec: blockingCRs: - name: cgu-a namespace: default clusters: - spoke4 - spoke5 enable: true managedPolicies: - policy1-common-cluster-version-policy - policy2-common-pao-sub-policy - policy3-common-ptp-sub-policy - policy4-common-sriov-sub-policy remediationStrategy: maxConcurrency: 1 timeout: 240 status: conditions: - message: 'The ClusterGroupUpgrade CR is blocked by other CRs that have not yet completed: [cgu-a]' 1 reason: UpgradeCannotStart status: "False" type: Ready copiedPolicies: - cgu-b-policy1-common-cluster-version-policy - cgu-b-policy2-common-pao-sub-policy - cgu-b-policy3-common-ptp-sub-policy - cgu-b-policy4-common-sriov-sub-policy managedPoliciesForUpgrade: - name: policy1-common-cluster-version-policy namespace: default - name: policy2-common-pao-sub-policy namespace: default - name: policy3-common-ptp-sub-policy namespace: default - name: policy4-common-sriov-sub-policy namespace: default placementBindings: - cgu-b-policy1-common-cluster-version-policy - cgu-b-policy2-common-pao-sub-policy - cgu-b-policy3-common-ptp-sub-policy - cgu-b-policy4-common-sriov-sub-policy placementRules: - cgu-b-policy1-common-cluster-version-policy - cgu-b-policy2-common-pao-sub-policy - cgu-b-policy3-common-ptp-sub-policy - cgu-b-policy4-common-sriov-sub-policy remediationPlan: - - spoke4 - - spoke5 status: {}
- 1
- Shows the list of blocking CRs.
Example for
cgu-c
with blocking CRsapiVersion: ran.openshift.io/v1alpha1 kind: ClusterGroupUpgrade metadata: name: cgu-c namespace: default spec: clusters: - spoke6 enable: true managedPolicies: - policy1-common-cluster-version-policy - policy2-common-pao-sub-policy - policy3-common-ptp-sub-policy - policy4-common-sriov-sub-policy remediationStrategy: maxConcurrency: 1 timeout: 240 status: conditions: - message: The ClusterGroupUpgrade CR has upgrade policies that are still non compliant 1 reason: UpgradeNotCompleted status: "False" type: Ready copiedPolicies: - cgu-c-policy1-common-cluster-version-policy - cgu-c-policy4-common-sriov-sub-policy managedPoliciesCompliantBeforeUpgrade: - policy2-common-pao-sub-policy - policy3-common-ptp-sub-policy managedPoliciesForUpgrade: - name: policy1-common-cluster-version-policy namespace: default - name: policy4-common-sriov-sub-policy namespace: default placementBindings: - cgu-c-policy1-common-cluster-version-policy - cgu-c-policy4-common-sriov-sub-policy placementRules: - cgu-c-policy1-common-cluster-version-policy - cgu-c-policy4-common-sriov-sub-policy remediationPlan: - - spoke6 status: currentBatch: 1 remediationPlanForBatch: spoke6: 0
- 1
- The
cgu-c
update does not have any blocking CRs.
13.6. Update policies on managed clusters
The Topology Aware Lifecycle Manager (TALM) remediates a set of inform
policies for the clusters specified in the ClusterGroupUpgrade
CR. TALM remediates inform
policies by making enforce
copies of the managed RHACM policies. Each copied policy has its own corresponding RHACM placement rule and RHACM placement binding.
One by one, TALM adds each cluster from the current batch to the placement rule that corresponds with the applicable managed policy. If a cluster is already compliant with a policy, TALM skips applying that policy on the compliant cluster. TALM then moves on to applying the next policy to the non-compliant cluster. After TALM completes the updates in a batch, all clusters are removed from the placement rules associated with the copied policies. Then, the update of the next batch starts.
If a spoke cluster does not report any compliant state to RHACM, the managed policies on the hub cluster can be missing status information that TALM needs. TALM handles these cases in the following ways:
-
If a policy’s
status.compliant
field is missing, TALM ignores the policy and adds a log entry. Then, TALM continues looking at the policy’sstatus.status
field. -
If a policy’s
status.status
is missing, TALM produces an error. -
If a cluster’s compliance status is missing in the policy’s
status.status
field, TALM considers that cluster to be non-compliant with that policy.
The ClusterGroupUpgrade
CR’s batchTimeoutAction
determines what happens if an upgrade fails for a cluster. You can specify continue
to skip the failing cluster and continue to upgrade other clusters, or specify abort
to stop the policy remediation for all clusters. Once the timeout elapses, TALM removes all enforce policies to ensure that no further updates are made to clusters.
For more information about RHACM policies, see Policy overview.
Additional resources
For more information about the PolicyGenTemplate
CRD, see About the PolicyGenTemplate CRD.
13.6.1. Applying update policies to managed clusters
You can update your managed clusters by applying your policies.
Prerequisites
- Install the Topology Aware Lifecycle Manager (TALM).
- Provision one or more managed clusters.
-
Log in as a user with
cluster-admin
privileges. - Create RHACM policies in the hub cluster.
Procedure
Save the contents of the
ClusterGroupUpgrade
CR in thecgu-1.yaml
file.apiVersion: ran.openshift.io/v1alpha1 kind: ClusterGroupUpgrade metadata: name: cgu-1 namespace: default spec: managedPolicies: 1 - policy1-common-cluster-version-policy - policy2-common-nto-sub-policy - policy3-common-ptp-sub-policy - policy4-common-sriov-sub-policy enable: false clusters: 2 - spoke1 - spoke2 - spoke5 - spoke6 remediationStrategy: maxConcurrency: 2 3 timeout: 240 4 batchTimeoutAction: 5
- 1
- The name of the policies to apply.
- 2
- The list of clusters to update.
- 3
- The
maxConcurrency
field signifies the number of clusters updated at the same time. - 4
- The update timeout in minutes.
- 5
- Controls what happens if a batch times out. Possible values are
abort
orcontinue
. If unspecified, the default iscontinue
.
Create the
ClusterGroupUpgrade
CR by running the following command:$ oc create -f cgu-1.yaml
Check if the
ClusterGroupUpgrade
CR was created in the hub cluster by running the following command:$ oc get cgu --all-namespaces
Example output
NAMESPACE NAME AGE STATE DETAILS default cgu-1 8m55 NotEnabled Not Enabled
Check the status of the update by running the following command:
$ oc get cgu -n default cgu-1 -ojsonpath='{.status}' | jq
Example output
{ "computedMaxConcurrency": 2, "conditions": [ { "lastTransitionTime": "2022-02-25T15:34:07Z", "message": "Not enabled", 1 "reason": "NotEnabled", "status": "False", "type": "Progressing" } ], "copiedPolicies": [ "cgu-policy1-common-cluster-version-policy", "cgu-policy2-common-nto-sub-policy", "cgu-policy3-common-ptp-sub-policy", "cgu-policy4-common-sriov-sub-policy" ], "managedPoliciesContent": { "policy1-common-cluster-version-policy": "null", "policy2-common-nto-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"node-tuning-operator\",\"namespace\":\"openshift-cluster-node-tuning-operator\"}]", "policy3-common-ptp-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"ptp-operator-subscription\",\"namespace\":\"openshift-ptp\"}]", "policy4-common-sriov-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"sriov-network-operator-subscription\",\"namespace\":\"openshift-sriov-network-operator\"}]" }, "managedPoliciesForUpgrade": [ { "name": "policy1-common-cluster-version-policy", "namespace": "default" }, { "name": "policy2-common-nto-sub-policy", "namespace": "default" }, { "name": "policy3-common-ptp-sub-policy", "namespace": "default" }, { "name": "policy4-common-sriov-sub-policy", "namespace": "default" } ], "managedPoliciesNs": { "policy1-common-cluster-version-policy": "default", "policy2-common-nto-sub-policy": "default", "policy3-common-ptp-sub-policy": "default", "policy4-common-sriov-sub-policy": "default" }, "placementBindings": [ "cgu-policy1-common-cluster-version-policy", "cgu-policy2-common-nto-sub-policy", "cgu-policy3-common-ptp-sub-policy", "cgu-policy4-common-sriov-sub-policy" ], "placementRules": [ "cgu-policy1-common-cluster-version-policy", "cgu-policy2-common-nto-sub-policy", "cgu-policy3-common-ptp-sub-policy", "cgu-policy4-common-sriov-sub-policy" ], "precaching": { "spec": {} }, "remediationPlan": [ [ "spoke1", "spoke2" ], [ "spoke5", "spoke6" ] ], "status": {} }
- 1
- The
spec.enable
field in theClusterGroupUpgrade
CR is set tofalse
.
Check the status of the policies by running the following command:
$ oc get policies -A
Example output
NAMESPACE NAME REMEDIATION ACTION COMPLIANCE STATE AGE default cgu-policy1-common-cluster-version-policy enforce 17m 1 default cgu-policy2-common-nto-sub-policy enforce 17m default cgu-policy3-common-ptp-sub-policy enforce 17m default cgu-policy4-common-sriov-sub-policy enforce 17m default policy1-common-cluster-version-policy inform NonCompliant 15h default policy2-common-nto-sub-policy inform NonCompliant 15h default policy3-common-ptp-sub-policy inform NonCompliant 18m default policy4-common-sriov-sub-policy inform NonCompliant 18m
- 1
- The
spec.remediationAction
field of policies currently applied on the clusters is set toenforce
. The managed policies ininform
mode from theClusterGroupUpgrade
CR remain ininform
mode during the update.
Change the value of the
spec.enable
field totrue
by running the following command:$ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/cgu-1 \ --patch '{"spec":{"enable":true}}' --type=merge
Verification
Check the status of the update again by running the following command:
$ oc get cgu -n default cgu-1 -ojsonpath='{.status}' | jq
Example output
{ "computedMaxConcurrency": 2, "conditions": [ 1 { "lastTransitionTime": "2022-02-25T15:33:07Z", "message": "All selected clusters are valid", "reason": "ClusterSelectionCompleted", "status": "True", "type": "ClustersSelected", "lastTransitionTime": "2022-02-25T15:33:07Z", "message": "Completed validation", "reason": "ValidationCompleted", "status": "True", "type": "Validated", "lastTransitionTime": "2022-02-25T15:34:07Z", "message": "Remediating non-compliant policies", "reason": "InProgress", "status": "True", "type": "Progressing" } ], "copiedPolicies": [ "cgu-policy1-common-cluster-version-policy", "cgu-policy2-common-nto-sub-policy", "cgu-policy3-common-ptp-sub-policy", "cgu-policy4-common-sriov-sub-policy" ], "managedPoliciesContent": { "policy1-common-cluster-version-policy": "null", "policy2-common-nto-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"node-tuning-operator\",\"namespace\":\"openshift-cluster-node-tuning-operator\"}]", "policy3-common-ptp-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"ptp-operator-subscription\",\"namespace\":\"openshift-ptp\"}]", "policy4-common-sriov-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"sriov-network-operator-subscription\",\"namespace\":\"openshift-sriov-network-operator\"}]" }, "managedPoliciesForUpgrade": [ { "name": "policy1-common-cluster-version-policy", "namespace": "default" }, { "name": "policy2-common-nto-sub-policy", "namespace": "default" }, { "name": "policy3-common-ptp-sub-policy", "namespace": "default" }, { "name": "policy4-common-sriov-sub-policy", "namespace": "default" } ], "managedPoliciesNs": { "policy1-common-cluster-version-policy": "default", "policy2-common-nto-sub-policy": "default", "policy3-common-ptp-sub-policy": "default", "policy4-common-sriov-sub-policy": "default" }, "placementBindings": [ "cgu-policy1-common-cluster-version-policy", "cgu-policy2-common-nto-sub-policy", "cgu-policy3-common-ptp-sub-policy", "cgu-policy4-common-sriov-sub-policy" ], "placementRules": [ "cgu-policy1-common-cluster-version-policy", "cgu-policy2-common-nto-sub-policy", "cgu-policy3-common-ptp-sub-policy", "cgu-policy4-common-sriov-sub-policy" ], "precaching": { "spec": {} }, "remediationPlan": [ [ "spoke1", "spoke2" ], [ "spoke5", "spoke6" ] ], "status": { "currentBatch": 1, "currentBatchStartedAt": "2022-02-25T15:54:16Z", "remediationPlanForBatch": { "spoke1": 0, "spoke2": 1 }, "startedAt": "2022-02-25T15:54:16Z" } }
- 1
- Reflects the update progress of the current batch. Run this command again to receive updated information about the progress.
If the policies include Operator subscriptions, you can check the installation progress directly on the single-node cluster.
Export the
KUBECONFIG
file of the single-node cluster you want to check the installation progress for by running the following command:$ export KUBECONFIG=<cluster_kubeconfig_absolute_path>
Check all the subscriptions present on the single-node cluster and look for the one in the policy you are trying to install through the
ClusterGroupUpgrade
CR by running the following command:$ oc get subs -A | grep -i <subscription_name>
Example output for
cluster-logging
policyNAMESPACE NAME PACKAGE SOURCE CHANNEL openshift-logging cluster-logging cluster-logging redhat-operators stable
If one of the managed policies includes a
ClusterVersion
CR, check the status of platform updates in the current batch by running the following command against the spoke cluster:$ oc get clusterversion
Example output
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.9.5 True True 43s Working towards 4.9.7: 71 of 735 done (9% complete)
Check the Operator subscription by running the following command:
$ oc get subs -n <operator-namespace> <operator-subscription> -ojsonpath="{.status}"
Check the install plans present on the single-node cluster that is associated with the desired subscription by running the following command:
$ oc get installplan -n <subscription_namespace>
Example output for
cluster-logging
OperatorNAMESPACE NAME CSV APPROVAL APPROVED openshift-logging install-6khtw cluster-logging.5.3.3-4 Manual true 1
- 1
- The install plans have their
Approval
field set toManual
and theirApproved
field changes fromfalse
totrue
after TALM approves the install plan.
NoteWhen TALM is remediating a policy containing a subscription, it automatically approves any install plans attached to that subscription. Where multiple install plans are needed to get the operator to the latest known version, TALM might approve multiple install plans, upgrading through one or more intermediate versions to get to the final version.
Check if the cluster service version for the Operator of the policy that the
ClusterGroupUpgrade
is installing reached theSucceeded
phase by running the following command:$ oc get csv -n <operator_namespace>
Example output for OpenShift Logging Operator
NAME DISPLAY VERSION REPLACES PHASE cluster-logging.5.4.2 Red Hat OpenShift Logging 5.4.2 Succeeded
13.7. Creating a backup of cluster resources before upgrade
For single-node OpenShift, the Topology Aware Lifecycle Manager (TALM) can create a backup of a deployment before an upgrade. If the upgrade fails, you can recover the previous version and restore a cluster to a working state without requiring a reprovision of applications.
To use the backup feature you first create a ClusterGroupUpgrade
CR with the backup
field set to true
. To ensure that the contents of the backup are up to date, the backup is not taken until you set the enable
field in the ClusterGroupUpgrade
CR to true
.
TALM uses the BackupSucceeded
condition to report the status and reasons as follows:
true
Backup is completed for all clusters or the backup run has completed but failed for one or more clusters. If backup fails for any cluster, the update does not proceed for that cluster.
false
Backup is still in progress for one or more clusters or has failed for all clusters. The backup process running in the spoke clusters can have the following statuses:
PreparingToStart
The first reconciliation pass is in progress. The TALM deletes any spoke backup namespace and hub view resources that have been created in a failed upgrade attempt.
Starting
The backup prerequisites and backup job are being created.
Active
The backup is in progress.
Succeeded
The backup succeeded.
BackupTimeout
Artifact backup is partially done.
UnrecoverableError
The backup has ended with a non-zero exit code.
If the backup of a cluster fails and enters the BackupTimeout
or UnrecoverableError
state, the cluster update does not proceed for that cluster. Updates to other clusters are not affected and continue.
13.7.1. Creating a ClusterGroupUpgrade CR with backup
You can create a backup of a deployment before an upgrade on single-node OpenShift clusters. If the upgrade fails you can use the upgrade-recovery.sh
script generated by Topology Aware Lifecycle Manager (TALM) to return the system to its preupgrade state. The backup consists of the following items:
- Cluster backup
-
A snapshot of
etcd
and static pod manifests. - Content backup
-
Backups of folders, for example,
/etc
,/usr/local
,/var/lib/kubelet
. - Changed files backup
-
Any files managed by
machine-config
that have been changed. - Deployment
-
A pinned
ostree
deployment. - Images (Optional)
- Any container images that are in use.
Prerequisites
- Install the Topology Aware Lifecycle Manager (TALM).
- Provision one or more managed clusters.
-
Log in as a user with
cluster-admin
privileges. - Install Red Hat Advanced Cluster Management (RHACM).
It is highly recommended that you create a recovery partition. The following is an example SiteConfig
custom resource (CR) for a recovery partition of 50 GB:
nodes: - hostName: "snonode.sno-worker-0.e2e.bos.redhat.com" role: "master" rootDeviceHints: hctl: "0:2:0:0" deviceName: /dev/sda ........ ........ #Disk /dev/sda: 893.3 GiB, 959119884288 bytes, 1873281024 sectors diskPartition: - device: /dev/sda partitions: - mount_point: /var/recovery size: 51200 start: 800000
Procedure
Save the contents of the
ClusterGroupUpgrade
CR with thebackup
andenable
fields set totrue
in theclustergroupupgrades-group-du.yaml
file:apiVersion: ran.openshift.io/v1alpha1 kind: ClusterGroupUpgrade metadata: name: du-upgrade-4918 namespace: ztp-group-du-sno spec: preCaching: true backup: true clusters: - cnfdb1 - cnfdb2 enable: true managedPolicies: - du-upgrade-platform-upgrade remediationStrategy: maxConcurrency: 2 timeout: 240
To start the update, apply the
ClusterGroupUpgrade
CR by running the following command:$ oc apply -f clustergroupupgrades-group-du.yaml
Verification
Check the status of the upgrade in the hub cluster by running the following command:
$ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
Example output
{ "backup": { "clusters": [ "cnfdb2", "cnfdb1" ], "status": { "cnfdb1": "Succeeded", "cnfdb2": "Failed" 1 } }, "computedMaxConcurrency": 1, "conditions": [ { "lastTransitionTime": "2022-04-05T10:37:19Z", "message": "Backup failed for 1 cluster", 2 "reason": "PartiallyDone", 3 "status": "True", 4 "type": "Succeeded" } ], "precaching": { "spec": {} }, "status": {}
13.7.2. Recovering a cluster after a failed upgrade
If an upgrade of a cluster fails, you can manually log in to the cluster and use the backup to return the cluster to its preupgrade state. There are two stages:
- Rollback
- If the attempted upgrade included a change to the platform OS deployment, you must roll back to the previous version before running the recovery script.
A rollback is only applicable to upgrades from TALM and single-node OpenShift. This process does not apply to rollbacks from any other upgrade type.
- Recovery
- The recovery shuts down containers and uses files from the backup partition to relaunch containers and restore clusters.
Prerequisites
- Install the Topology Aware Lifecycle Manager (TALM).
- Provision one or more managed clusters.
- Install Red Hat Advanced Cluster Management (RHACM).
-
Log in as a user with
cluster-admin
privileges. - Run an upgrade that is configured for backup.
Procedure
Delete the previously created
ClusterGroupUpgrade
custom resource (CR) by running the following command:$ oc delete cgu/du-upgrade-4918 -n ztp-group-du-sno
- Log in to the cluster that you want to recover.
Check the status of the platform OS deployment by running the following command:
$ ostree admin status
Example outputs
[root@lab-test-spoke2-node-0 core]# ostree admin status * rhcos c038a8f08458bbed83a77ece033ad3c55597e3f64edad66ea12fda18cbdceaf9.0 Version: 49.84.202202230006-0 Pinned: yes 1 origin refspec: c038a8f08458bbed83a77ece033ad3c55597e3f64edad66ea12fda18cbdceaf9
- 1
- The current deployment is pinned. A platform OS deployment rollback is not necessary.
[root@lab-test-spoke2-node-0 core]# ostree admin status * rhcos f750ff26f2d5550930ccbe17af61af47daafc8018cd9944f2a3a6269af26b0fa.0 Version: 410.84.202204050541-0 origin refspec: f750ff26f2d5550930ccbe17af61af47daafc8018cd9944f2a3a6269af26b0fa rhcos ad8f159f9dc4ea7e773fd9604c9a16be0fe9b266ae800ac8470f63abc39b52ca.0 (rollback) 1 Version: 410.84.202203290245-0 Pinned: yes 2 origin refspec: ad8f159f9dc4ea7e773fd9604c9a16be0fe9b266ae800ac8470f63abc39b52ca
To trigger a rollback of the platform OS deployment, run the following command:
$ rpm-ostree rollback -r
The first phase of the recovery shuts down containers and restores files from the backup partition to the targeted directories. To begin the recovery, run the following command:
$ /var/recovery/upgrade-recovery.sh
When prompted, reboot the cluster by running the following command:
$ systemctl reboot
After the reboot, restart the recovery by running the following command:
$ /var/recovery/upgrade-recovery.sh --resume
If the recovery utility fails, you can retry with the --restart
option:
$ /var/recovery/upgrade-recovery.sh --restart
Verification
To check the status of the recovery run the following command:
$ oc get clusterversion,nodes,clusteroperator
Example output
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS clusterversion.config.openshift.io/version 4.9.23 True False 86d Cluster version is 4.9.23 1 NAME STATUS ROLES AGE VERSION node/lab-test-spoke1-node-0 Ready master,worker 86d v1.22.3+b93fd35 2 NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE clusteroperator.config.openshift.io/authentication 4.9.23 True False False 2d7h 3 clusteroperator.config.openshift.io/baremetal 4.9.23 True False False 86d ..............
13.8. Using the container image pre-cache feature
Single-node OpenShift clusters might have limited bandwidth to access the container image registry, which can cause a timeout before the updates are completed.
The time of the update is not set by TALM. You can apply the ClusterGroupUpgrade
CR at the beginning of the update by manual application or by external automation.
The container image pre-caching starts when the preCaching
field is set to true
in the ClusterGroupUpgrade
CR.
TALM uses the PrecacheSpecValid
condition to report status information as follows:
true
The pre-caching spec is valid and consistent.
false
The pre-caching spec is incomplete.
TALM uses the PrecachingSucceeded
condition to report status information as follows:
true
TALM has concluded the pre-caching process. If pre-caching fails for any cluster, the update fails for that cluster but proceeds for all other clusters. A message informs you if pre-caching has failed for any clusters.
false
Pre-caching is still in progress for one or more clusters or has failed for all clusters.
After a successful pre-caching process, you can start remediating policies. The remediation actions start when the enable
field is set to true
. If there is a pre-caching failure on a cluster, the upgrade fails for that cluster. The upgrade process continues for all other clusters that have a successful pre-cache.
The pre-caching process can be in the following statuses:
NotStarted
This is the initial state all clusters are automatically assigned to on the first reconciliation pass of the
ClusterGroupUpgrade
CR. In this state, TALM deletes any pre-caching namespace and hub view resources of spoke clusters that remain from previous incomplete updates. TALM then creates a newManagedClusterView
resource for the spoke pre-caching namespace to verify its deletion in thePrecachePreparing
state.PreparingToStart
Cleaning up any remaining resources from previous incomplete updates is in progress.
Starting
Pre-caching job prerequisites and the job are created.
Active
The job is in "Active" state.
Succeeded
The pre-cache job succeeded.
PrecacheTimeout
The artifact pre-caching is partially done.
UnrecoverableError
The job ends with a non-zero exit code.
13.8.1. Creating a ClusterGroupUpgrade CR with pre-caching
For single-node OpenShift, the pre-cache feature allows the required container images to be present on the spoke cluster before the update starts.
Prerequisites
- Install the Topology Aware Lifecycle Manager (TALM).
- Provision one or more managed clusters.
-
Log in as a user with
cluster-admin
privileges.
Procedure
Save the contents of the
ClusterGroupUpgrade
CR with thepreCaching
field set totrue
in theclustergroupupgrades-group-du.yaml
file:apiVersion: ran.openshift.io/v1alpha1 kind: ClusterGroupUpgrade metadata: name: du-upgrade-4918 namespace: ztp-group-du-sno spec: preCaching: true 1 clusters: - cnfdb1 - cnfdb2 enable: false managedPolicies: - du-upgrade-platform-upgrade remediationStrategy: maxConcurrency: 2 timeout: 240
- 1
- The
preCaching
field is set totrue
, which enables TALM to pull the container images before starting the update.
When you want to start pre-caching, apply the
ClusterGroupUpgrade
CR by running the following command:$ oc apply -f clustergroupupgrades-group-du.yaml
Verification
Check if the
ClusterGroupUpgrade
CR exists in the hub cluster by running the following command:$ oc get cgu -A
Example output
NAMESPACE NAME AGE STATE DETAILS ztp-group-du-sno du-upgrade-4918 10s InProgress Precaching is required and not done 1
- 1
- The CR is created.
Check the status of the pre-caching task by running the following command:
$ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
Example output
{ "conditions": [ { "lastTransitionTime": "2022-01-27T19:07:24Z", "message": "Precaching is required and not done", "reason": "InProgress", "status": "False", "type": "PrecachingSucceeded" }, { "lastTransitionTime": "2022-01-27T19:07:34Z", "message": "Pre-caching spec is valid and consistent", "reason": "PrecacheSpecIsWellFormed", "status": "True", "type": "PrecacheSpecValid" } ], "precaching": { "clusters": [ "cnfdb1" 1 "cnfdb2" ], "spec": { "platformImage": "image.example.io"}, "status": { "cnfdb1": "Active" "cnfdb2": "Succeeded"} } }
- 1
- Displays the list of identified clusters.
Check the status of the pre-caching job by running the following command on the spoke cluster:
$ oc get jobs,pods -n openshift-talo-pre-cache
Example output
NAME COMPLETIONS DURATION AGE job.batch/pre-cache 0/1 3m10s 3m10s NAME READY STATUS RESTARTS AGE pod/pre-cache--1-9bmlr 1/1 Running 0 3m10s
Check the status of the
ClusterGroupUpgrade
CR by running the following command:$ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
Example output
"conditions": [ { "lastTransitionTime": "2022-01-27T19:30:41Z", "message": "The ClusterGroupUpgrade CR has all clusters compliant with all the managed policies", "reason": "UpgradeCompleted", "status": "True", "type": "Ready" }, { "lastTransitionTime": "2022-01-27T19:28:57Z", "message": "Precaching is completed", "reason": "PrecachingCompleted", "status": "True", "type": "PrecachingSucceeded" 1 }
- 1
- The pre-cache tasks are done.
13.9. Troubleshooting the Topology Aware Lifecycle Manager
The Topology Aware Lifecycle Manager (TALM) is an OpenShift Container Platform Operator that remediates RHACM policies. When issues occur, use the oc adm must-gather
command to gather details and logs and to take steps in debugging the issues.
For more information about related topics, see the following documentation:
- Red Hat Advanced Cluster Management for Kubernetes 2.4 Support Matrix
- Red Hat Advanced Cluster Management Troubleshooting
- The "Troubleshooting Operator issues" section
13.9.1. General troubleshooting
You can determine the cause of the problem by reviewing the following questions:
Is the configuration that you are applying supported?
- Are the RHACM and the OpenShift Container Platform versions compatible?
- Are the TALM and RHACM versions compatible?
Which of the following components is causing the problem?
To ensure that the ClusterGroupUpgrade
configuration is functional, you can do the following:
-
Create the
ClusterGroupUpgrade
CR with thespec.enable
field set tofalse
. - Wait for the status to be updated and go through the troubleshooting questions.
-
If everything looks as expected, set the
spec.enable
field totrue
in theClusterGroupUpgrade
CR.
After you set the spec.enable
field to true
in the ClusterUpgradeGroup
CR, the update procedure starts and you cannot edit the CR’s spec
fields anymore.
13.9.2. Cannot modify the ClusterUpgradeGroup CR
- Issue
-
You cannot edit the
ClusterUpgradeGroup
CR after enabling the update. - Resolution
Restart the procedure by performing the following steps:
Remove the old
ClusterGroupUpgrade
CR by running the following command:$ oc delete cgu -n <ClusterGroupUpgradeCR_namespace> <ClusterGroupUpgradeCR_name>
Check and fix the existing issues with the managed clusters and policies.
- Ensure that all the clusters are managed clusters and available.
-
Ensure that all the policies exist and have the
spec.remediationAction
field set toinform
.
Create a new
ClusterGroupUpgrade
CR with the correct configurations.$ oc apply -f <ClusterGroupUpgradeCR_YAML>
13.9.3. Managed policies
Checking managed policies on the system
- Issue
- You want to check if you have the correct managed policies on the system.
- Resolution
Run the following command:
$ oc get cgu lab-upgrade -ojsonpath='{.spec.managedPolicies}'
Example output
["group-du-sno-validator-du-validator-policy", "policy2-common-nto-sub-policy", "policy3-common-ptp-sub-policy"]
Checking remediationAction mode
- Issue
-
You want to check if the
remediationAction
field is set toinform
in thespec
of the managed policies. - Resolution
Run the following command:
$ oc get policies --all-namespaces
Example output
NAMESPACE NAME REMEDIATION ACTION COMPLIANCE STATE AGE default policy1-common-cluster-version-policy inform NonCompliant 5d21h default policy2-common-nto-sub-policy inform Compliant 5d21h default policy3-common-ptp-sub-policy inform NonCompliant 5d21h default policy4-common-sriov-sub-policy inform NonCompliant 5d21h
Checking policy compliance state
- Issue
- You want to check the compliance state of policies.
- Resolution
Run the following command:
$ oc get policies --all-namespaces
Example output
NAMESPACE NAME REMEDIATION ACTION COMPLIANCE STATE AGE default policy1-common-cluster-version-policy inform NonCompliant 5d21h default policy2-common-nto-sub-policy inform Compliant 5d21h default policy3-common-ptp-sub-policy inform NonCompliant 5d21h default policy4-common-sriov-sub-policy inform NonCompliant 5d21h
13.9.4. Clusters
Checking if managed clusters are present
- Issue
-
You want to check if the clusters in the
ClusterGroupUpgrade
CR are managed clusters. - Resolution
Run the following command:
$ oc get managedclusters
Example output
NAME HUB ACCEPTED MANAGED CLUSTER URLS JOINED AVAILABLE AGE local-cluster true https://api.hub.example.com:6443 True Unknown 13d spoke1 true https://api.spoke1.example.com:6443 True True 13d spoke3 true https://api.spoke3.example.com:6443 True True 27h
Alternatively, check the TALM manager logs:
Get the name of the TALM manager by running the following command:
$ oc get pod -n openshift-operators
Example output
NAME READY STATUS RESTARTS AGE cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp 2/2 Running 0 45m
Check the TALM manager logs by running the following command:
$ oc logs -n openshift-operators \ cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp -c manager
Example output
ERROR controller-runtime.manager.controller.clustergroupupgrade Reconciler error {"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "name": "lab-upgrade", "namespace": "default", "error": "Cluster spoke5555 is not a ManagedCluster"} 1 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
- 1
- The error message shows that the cluster is not a managed cluster.
Checking if managed clusters are available
- Issue
-
You want to check if the managed clusters specified in the
ClusterGroupUpgrade
CR are available. - Resolution
Run the following command:
$ oc get managedclusters
Example output
NAME HUB ACCEPTED MANAGED CLUSTER URLS JOINED AVAILABLE AGE local-cluster true https://api.hub.testlab.com:6443 True Unknown 13d spoke1 true https://api.spoke1.testlab.com:6443 True True 13d 1 spoke3 true https://api.spoke3.testlab.com:6443 True True 27h 2
Checking clusterLabelSelector
- Issue
-
You want to check if the
clusterLabelSelector
field specified in theClusterGroupUpgrade
CR matches at least one of the managed clusters. - Resolution
Run the following command:
$ oc get managedcluster --selector=upgrade=true 1
- 1
- The label for the clusters you want to update is
upgrade:true
.
Example output
NAME HUB ACCEPTED MANAGED CLUSTER URLS JOINED AVAILABLE AGE spoke1 true https://api.spoke1.testlab.com:6443 True True 13d spoke3 true https://api.spoke3.testlab.com:6443 True True 27h
Checking if canary clusters are present
- Issue
You want to check if the canary clusters are present in the list of clusters.
Example
ClusterGroupUpgrade
CRspec: remediationStrategy: canaries: - spoke3 maxConcurrency: 2 timeout: 240 clusterLabelSelectors: - matchLabels: upgrade: true
- Resolution
Run the following commands:
$ oc get cgu lab-upgrade -ojsonpath='{.spec.clusters}'
Example output
["spoke1", "spoke3"]
Check if the canary clusters are present in the list of clusters that match
clusterLabelSelector
labels by running the following command:$ oc get managedcluster --selector=upgrade=true
Example output
NAME HUB ACCEPTED MANAGED CLUSTER URLS JOINED AVAILABLE AGE spoke1 true https://api.spoke1.testlab.com:6443 True True 13d spoke3 true https://api.spoke3.testlab.com:6443 True True 27h
A cluster can be present in spec.clusters
and also be matched by the spec.clusterLabelSelector
label.
Checking the pre-caching status on spoke clusters
Check the status of pre-caching by running the following command on the spoke cluster:
$ oc get jobs,pods -n openshift-talo-pre-cache
13.9.5. Remediation Strategy
Checking if remediationStrategy is present in the ClusterGroupUpgrade CR
- Issue
-
You want to check if the
remediationStrategy
is present in theClusterGroupUpgrade
CR. - Resolution
Run the following command:
$ oc get cgu lab-upgrade -ojsonpath='{.spec.remediationStrategy}'
Example output
{"maxConcurrency":2, "timeout":240}
Checking if maxConcurrency is specified in the ClusterGroupUpgrade CR
- Issue
-
You want to check if the
maxConcurrency
is specified in theClusterGroupUpgrade
CR. - Resolution
Run the following command:
$ oc get cgu lab-upgrade -ojsonpath='{.spec.remediationStrategy.maxConcurrency}'
Example output
2
13.9.6. Topology Aware Lifecycle Manager
Checking condition message and status in the ClusterGroupUpgrade CR
- Issue
-
You want to check the value of the
status.conditions
field in theClusterGroupUpgrade
CR. - Resolution
Run the following command:
$ oc get cgu lab-upgrade -ojsonpath='{.status.conditions}'
Example output
{"lastTransitionTime":"2022-02-17T22:25:28Z", "message":"Missing managed policies:[policyList]", "reason":"NotAllManagedPoliciesExist", "status":"False", "type":"Validated"}
Checking corresponding copied policies
- Issue
-
You want to check if every policy from
status.managedPoliciesForUpgrade
has a corresponding policy instatus.copiedPolicies
. - Resolution
Run the following command:
$ oc get cgu lab-upgrade -oyaml
Example output
status: … copiedPolicies: - lab-upgrade-policy3-common-ptp-sub-policy managedPoliciesForUpgrade: - name: policy3-common-ptp-sub-policy namespace: default
Checking if status.remediationPlan was computed
- Issue
-
You want to check if
status.remediationPlan
is computed. - Resolution
Run the following command:
$ oc get cgu lab-upgrade -ojsonpath='{.status.remediationPlan}'
Example output
[["spoke2", "spoke3"]]
Errors in the TALM manager container
- Issue
- You want to check the logs of the manager container of TALM.
- Resolution
Run the following command:
$ oc logs -n openshift-operators \ cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp -c manager
Example output
ERROR controller-runtime.manager.controller.clustergroupupgrade Reconciler error {"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "name": "lab-upgrade", "namespace": "default", "error": "Cluster spoke5555 is not a ManagedCluster"} 1 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
- 1
- Displays the error.
Clusters are not compliant to some policies after a ClusterGroupUpgrade
CR has completed
- Issue
The policy compliance status that TALM uses to decide if remediation is needed has not yet fully updated for all clusters. This may be because:
- The CGU was run too soon after a policy was created or updated.
-
The remediation of a policy affects the compliance of subsequent policies in the
ClusterGroupUpgrade
CR.
- Resolution
-
Create a new and apply
ClusterGroupUpdate
CR with the same specification .
Additional resources
- For information about troubleshooting, see OpenShift Container Platform Troubleshooting Operator Issues.
- For more information about using Topology Aware Lifecycle Manager in the ZTP workflow, see Updating managed policies with Topology Aware Lifecycle Manager.
-
For more information about the
PolicyGenTemplate
CRD, see About the PolicyGenTemplate CRD
Chapter 14. Creating a performance profile
Learn about the Performance Profile Creator (PPC) and how you can use it to create a performance profile.
Currently, disabling CPU load balancing is not supported by cgroup v2. As a result, you might not get the desired behavior from performance profiles if you have cgroup v2 enabled. Enabling cgroup v2 is not recommended if you are using performace profiles.
14.1. About the Performance Profile Creator
The Performance Profile Creator (PPC) is a command-line tool, delivered with the Node Tuning Operator, used to create the performance profile. The tool consumes must-gather
data from the cluster and several user-supplied profile arguments. The PPC generates a performance profile that is appropriate for your hardware and topology.
The tool is run by one of the following methods:
-
Invoking
podman
- Calling a wrapper script
14.1.1. Gathering data about your cluster using the must-gather command
The Performance Profile Creator (PPC) tool requires must-gather
data. As a cluster administrator, run the must-gather
command to capture information about your cluster.
In earlier versions of OpenShift Container Platform, the Performance Addon Operator provided automatic, low latency performance tuning for applications. In OpenShift Container Platform 4.11 and later, this functionality is part of the Node Tuning Operator. However, you must still use the performance-addon-operator-must-gather
image when running the must-gather
command.
Prerequisites
-
Access to the cluster as a user with the
cluster-admin
role. -
Access to the Performance Addon Operator
must gather
image. -
The OpenShift CLI (
oc
) installed.
Procedure
Optional: Verify that a matching machine config pool exists with a label:
$ oc describe mcp/worker-rt
Example output
Name: worker-rt Namespace: Labels: machineconfiguration.openshift.io/role=worker-rt
If a matching label does not exist add a label for a machine config pool (MCP) that matches with the MCP name:
$ oc label mcp <mcp_name> <mcp_name>=""
-
Navigate to the directory where you want to store the
must-gather
data. Run
must-gather
on your cluster:$ oc adm must-gather --image=<PAO_must_gather_image> --dest-dir=<dir>
NoteThe
must-gather
command must be run with theperformance-addon-operator-must-gather
image. The output can optionally be compressed. Compressed output is required if you are running the Performance Profile Creator wrapper script.Example
$ oc adm must-gather --image=registry.redhat.io/openshift4/performance-addon-operator-must-gather-rhel8:v4.12 --dest-dir=<path_to_must-gather>/must-gather
Create a compressed file from the
must-gather
directory:$ tar cvaf must-gather.tar.gz must-gather/
14.1.2. Running the Performance Profile Creator using podman
As a cluster administrator, you can run podman
and the Performance Profile Creator to create a performance profile.
Prerequisites
-
Access to the cluster as a user with the
cluster-admin
role. - A cluster installed on bare-metal hardware.
-
A node with
podman
and OpenShift CLI (oc
) installed. - Access to the Node Tuning Operator image.
Procedure
Check the machine config pool:
$ oc get mcp
Example output
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-acd1358917e9f98cbdb599aea622d78b True False False 3 3 3 0 22h worker-cnf rendered-worker-cnf-1d871ac76e1951d32b2fe92369879826 False True False 2 1 1 0 22h
Use Podman to authenticate to
registry.redhat.io
:$ podman login registry.redhat.io
Username: myrhusername Password: ************
Optional: Display help for the PPC tool:
$ podman run --rm --entrypoint performance-profile-creator registry.redhat.io/openshift4/ose-cluster-node-tuning-operator:v4.12 -h
Example output
A tool that automates creation of Performance Profiles Usage: performance-profile-creator [flags] Flags: --disable-ht Disable Hyperthreading -h, --help help for performance-profile-creator --info string Show cluster information; requires --must-gather-dir-path, ignore the other arguments. [Valid values: log, json] (default "log") --mcp-name string MCP name corresponding to the target machines (required) --must-gather-dir-path string Must gather directory path (default "must-gather") --offlined-cpu-count int Number of offlined CPUs --power-consumption-mode string The power consumption mode. [Valid values: default, low-latency, ultra-low-latency] (default "default") --profile-name string Name of the performance profile to be created (default "performance") --reserved-cpu-count int Number of reserved CPUs (required) --rt-kernel Enable Real Time Kernel (required) --split-reserved-cpus-across-numa Split the Reserved CPUs across NUMA nodes --topology-manager-policy string Kubelet Topology Manager Policy of the performance profile to be created. [Valid values: single-numa-node, best-effort, restricted] (default "restricted") --user-level-networking Run with User level Networking(DPDK) enabled
Run the Performance Profile Creator tool in discovery mode:
NoteDiscovery mode inspects your cluster using the output from
must-gather
. The output produced includes information on:- The NUMA cell partitioning with the allocated CPU ids
- Whether hyperthreading is enabled
Using this information you can set appropriate values for some of the arguments supplied to the Performance Profile Creator tool.
$ podman run --entrypoint performance-profile-creator -v <path_to_must-gather>/must-gather:/must-gather:z registry.redhat.io/openshift4/ose-cluster-node-tuning-operator:v4.12 --info log --must-gather-dir-path /must-gather
NoteThis command uses the performance profile creator as a new entry point to
podman
. It maps themust-gather
data for the host into the container image and invokes the required user-supplied profile arguments to produce themy-performance-profile.yaml
file.The
-v
option can be the path to either:-
The
must-gather
output directory -
An existing directory containing the
must-gather
decompressed tarball
The
info
option requires a value which specifies the output format. Possible values are log and JSON. The JSON format is reserved for debugging.Run
podman
:$ podman run --entrypoint performance-profile-creator -v /must-gather:/must-gather:z registry.redhat.io/openshift4/ose-cluster-node-tuning-operator:v4.12 --mcp-name=worker-cnf --reserved-cpu-count=4 --rt-kernel=true --split-reserved-cpus-across-numa=false --must-gather-dir-path /must-gather --power-consumption-mode=ultra-low-latency --offlined-cpu-count=6 > my-performance-profile.yaml
NoteThe Performance Profile Creator arguments are shown in the Performance Profile Creator arguments table. The following arguments are required:
-
reserved-cpu-count
-
mcp-name
-
rt-kernel
The
mcp-name
argument in this example is set toworker-cnf
based on the output of the commandoc get mcp
. For single-node OpenShift use--mcp-name=master
.-
Review the created YAML file:
$ cat my-performance-profile.yaml
Example output
apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: performance spec: cpu: isolated: 2-39,48-79 offlined: 42-47 reserved: 0-1,40-41 machineConfigPoolSelector: machineconfiguration.openshift.io/role: worker-cnf nodeSelector: node-role.kubernetes.io/worker-cnf: "" numa: topologyPolicy: restricted realTimeKernel: enabled: true workloadHints: highPowerConsumption: true realTime: true
Apply the generated profile:
$ oc apply -f my-performance-profile.yaml
14.1.2.1. How to run podman
to create a performance profile
The following example illustrates how to run podman
to create a performance profile with 20 reserved CPUs that are to be split across the NUMA nodes.
Node hardware configuration:
- 80 CPUs
- Hyperthreading enabled
- Two NUMA nodes
- Even numbered CPUs run on NUMA node 0 and odd numbered CPUs run on NUMA node 1
Run podman
to create the performance profile:
$ podman run --entrypoint performance-profile-creator -v /must-gather:/must-gather:z registry.redhat.io/openshift4/ose-cluster-node-tuning-operator:v4.12 --mcp-name=worker-cnf --reserved-cpu-count=20 --rt-kernel=true --split-reserved-cpus-across-numa=true --must-gather-dir-path /must-gather > my-performance-profile.yaml
The created profile is described in the following YAML:
apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: performance spec: cpu: isolated: 10-39,50-79 reserved: 0-9,40-49 nodeSelector: node-role.kubernetes.io/worker-cnf: "" numa: topologyPolicy: restricted realTimeKernel: enabled: true
In this case, 10 CPUs are reserved on NUMA node 0 and 10 are reserved on NUMA node 1.
14.1.3. Running the Performance Profile Creator wrapper script
The performance profile wrapper script simplifies the running of the Performance Profile Creator (PPC) tool. It hides the complexities associated with running podman
and specifying the mapping directories and it enables the creation of the performance profile.
Prerequisites
- Access to the Node Tuning Operator image.
-
Access to the
must-gather
tarball.
Procedure
Create a file on your local machine named, for example,
run-perf-profile-creator.sh
:$ vi run-perf-profile-creator.sh
Paste the following code into the file:
#!/bin/bash readonly CONTAINER_RUNTIME=${CONTAINER_RUNTIME:-podman} readonly CURRENT_SCRIPT=$(basename "$0") readonly CMD="${CONTAINER_RUNTIME} run --entrypoint performance-profile-creator" readonly IMG_EXISTS_CMD="${CONTAINER_RUNTIME} image exists" readonly IMG_PULL_CMD="${CONTAINER_RUNTIME} image pull" readonly MUST_GATHER_VOL="/must-gather" NTO_IMG="registry.redhat.io/openshift4/ose-cluster-node-tuning-operator:v4.12" MG_TARBALL="" DATA_DIR="" usage() { print "Wrapper usage:" print " ${CURRENT_SCRIPT} [-h] [-p image][-t path] -- [performance-profile-creator flags]" print "" print "Options:" print " -h help for ${CURRENT_SCRIPT}" print " -p Node Tuning Operator image" print " -t path to a must-gather tarball" ${IMG_EXISTS_CMD} "${NTO_IMG}" && ${CMD} "${NTO_IMG}" -h } function cleanup { [ -d "${DATA_DIR}" ] && rm -rf "${DATA_DIR}" } trap cleanup EXIT exit_error() { print "error: $*" usage exit 1 } print() { echo "$*" >&2 } check_requirements() { ${IMG_EXISTS_CMD} "${NTO_IMG}" || ${IMG_PULL_CMD} "${NTO_IMG}" || \ exit_error "Node Tuning Operator image not found" [ -n "${MG_TARBALL}" ] || exit_error "Must-gather tarball file path is mandatory" [ -f "${MG_TARBALL}" ] || exit_error "Must-gather tarball file not found" DATA_DIR=$(mktemp -d -t "${CURRENT_SCRIPT}XXXX") || exit_error "Cannot create the data directory" tar -zxf "${MG_TARBALL}" --directory "${DATA_DIR}" || exit_error "Cannot decompress the must-gather tarball" chmod a+rx "${DATA_DIR}" return 0 } main() { while getopts ':hp:t:' OPT; do case "${OPT}" in h) usage exit 0 ;; p) NTO_IMG="${OPTARG}" ;; t) MG_TARBALL="${OPTARG}" ;; ?) exit_error "invalid argument: ${OPTARG}" ;; esac done shift $((OPTIND - 1)) check_requirements || exit 1 ${CMD} -v "${DATA_DIR}:${MUST_GATHER_VOL}:z" "${NTO_IMG}" "$@" --must-gather-dir-path "${MUST_GATHER_VOL}" echo "" 1>&2 } main "$@"
Add execute permissions for everyone on this script:
$ chmod a+x run-perf-profile-creator.sh
Optional: Display the
run-perf-profile-creator.sh
command usage:$ ./run-perf-profile-creator.sh -h
Expected output
Wrapper usage: run-perf-profile-creator.sh [-h] [-p image][-t path] -- [performance-profile-creator flags] Options: -h help for run-perf-profile-creator.sh -p Node Tuning Operator image 1 -t path to a must-gather tarball 2 A tool that automates creation of Performance Profiles Usage: performance-profile-creator [flags] Flags: --disable-ht Disable Hyperthreading -h, --help help for performance-profile-creator --info string Show cluster information; requires --must-gather-dir-path, ignore the other arguments. [Valid values: log, json] (default "log") --mcp-name string MCP name corresponding to the target machines (required) --must-gather-dir-path string Must gather directory path (default "must-gather") --offlined-cpu-count int Number of offlined CPUs --power-consumption-mode string The power consumption mode. [Valid values: default, low-latency, ultra-low-latency] (default "default") --profile-name string Name of the performance profile to be created (default "performance") --reserved-cpu-count int Number of reserved CPUs (required) --rt-kernel Enable Real Time Kernel (required) --split-reserved-cpus-across-numa Split the Reserved CPUs across NUMA nodes --topology-manager-policy string Kubelet Topology Manager Policy of the performance profile to be created. [Valid values: single-numa-node, best-effort, restricted] (default "restricted") --user-level-networking Run with User level Networking(DPDK) enabled
NoteThere two types of arguments:
-
Wrapper arguments namely
-h
,-p
and-t
- PPC arguments
-
Wrapper arguments namely
Run the performance profile creator tool in discovery mode:
NoteDiscovery mode inspects your cluster using the output from
must-gather
. The output produced includes information on:- The NUMA cell partitioning with the allocated CPU IDs
- Whether hyperthreading is enabled
Using this information you can set appropriate values for some of the arguments supplied to the Performance Profile Creator tool.
$ ./run-perf-profile-creator.sh -t /must-gather/must-gather.tar.gz -- --info=log
NoteThe
info
option requires a value which specifies the output format. Possible values are log and JSON. The JSON format is reserved for debugging.Check the machine config pool:
$ oc get mcp
Example output
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-acd1358917e9f98cbdb599aea622d78b True False False 3 3 3 0 22h worker-cnf rendered-worker-cnf-1d871ac76e1951d32b2fe92369879826 False True False 2 1 1 0 22h
Create a performance profile:
$ ./run-perf-profile-creator.sh -t /must-gather/must-gather.tar.gz -- --mcp-name=worker-cnf --reserved-cpu-count=2 --rt-kernel=true > my-performance-profile.yaml
NoteThe Performance Profile Creator arguments are shown in the Performance Profile Creator arguments table. The following arguments are required:
-
reserved-cpu-count
-
mcp-name
-
rt-kernel
The
mcp-name
argument in this example is set toworker-cnf
based on the output of the commandoc get mcp
. For single-node OpenShift use--mcp-name=master
.-
Review the created YAML file:
$ cat my-performance-profile.yaml
Example output
apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: performance spec: cpu: isolated: 1-39,41-79 reserved: 0,40 nodeSelector: node-role.kubernetes.io/worker-cnf: "" numa: topologyPolicy: restricted realTimeKernel: enabled: false
Apply the generated profile:
NoteInstall the Node Tuning Operator before applying the profile.
$ oc apply -f my-performance-profile.yaml
14.1.4. Performance Profile Creator arguments
Table 14.1. Performance Profile Creator arguments
Argument | Description |
---|---|
| Disable hyperthreading.
Possible values:
Default: Warning
If this argument is set to |
|
This captures cluster information and is used in discovery mode only. Discovery mode also requires the Possible values:
Default: |
|
MCP name for example |
| Must gather directory path. This parameter is required.
When the user runs the tool with the wrapper script |
| Number of offlined CPUs. Note This must be a natural number greater than 0. If not enough logical processors are offlined then error messages are logged. The messages are: Error: failed to compute the reserved and isolated CPUs: please ensure that reserved-cpu-count plus offlined-cpu-count should be in the range [0,1] Error: failed to compute the reserved and isolated CPUs: please specify the offlined CPU count in the range [0,1] |
| The power consumption mode. Possible values:
Default: |
|
Name of the performance profile to create. Default: |
| Number of reserved CPUs. This parameter is required. Note This must be a natural number. A value of 0 is not allowed. |
| Enable real-time kernel. This parameter is required.
Possible values: |
| Split the reserved CPUs across NUMA nodes.
Possible values:
Default: |
| Kubelet Topology Manager policy of the performance profile to be created. Possible values:
Default: |
| Run with user level networking (DPDK) enabled.
Possible values:
Default: |
14.2. Reference performance profiles
14.2.1. A performance profile template for clusters that use OVS-DPDK on OpenStack
To maximize machine performance in a cluster that uses Open vSwitch with the Data Plane Development Kit (OVS-DPDK) on Red Hat OpenStack Platform (RHOSP), you can use a performance profile.
You can use the following performance profile template to create a profile for your deployment.
A performance profile template for clusters that use OVS-DPDK
apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: cnf-performanceprofile spec: additionalKernelArgs: - nmi_watchdog=0 - audit=0 - mce=off - processor.max_cstate=1 - idle=poll - intel_idle.max_cstate=0 - default_hugepagesz=1GB - hugepagesz=1G - intel_iommu=on cpu: isolated: <CPU_ISOLATED> reserved: <CPU_RESERVED> hugepages: defaultHugepagesSize: 1G pages: - count: <HUGEPAGES_COUNT> node: 0 size: 1G nodeSelector: node-role.kubernetes.io/worker: '' realTimeKernel: enabled: false globallyDisableIrqLoadBalancing: true
Insert values that are appropriate for your configuration for the CPU_ISOLATED
, CPU_RESERVED
, and HUGEPAGES_COUNT
keys.
To learn how to create and use performance profiles, see the "Creating a performance profile" page in the "Scalability and performance" section of the OpenShift Container Platform documentation.
14.3. Additional resources
-
For more information about the
must-gather
tool, see Gathering data about your cluster.
Chapter 15. Workload partitioning in single-node OpenShift
In resource-constrained environments, such as single-node OpenShift deployments, use workload partitioning to isolate OpenShift Container Platform services, cluster management workloads, and infrastructure pods to run on a reserved set of CPUs.
The minimum number of reserved CPUs required for the cluster management in single-node OpenShift is four CPU Hyper-Threads (HTs). With workload partitioning, you annotate the set of cluster management pods and a set of typical add-on Operators for inclusion in the cluster management workload partition. These pods operate normally within the minimum size CPU configuration. Additional Operators or workloads outside of the set of minimum cluster management pods require additional CPUs to be added to the workload partition.
Workload partitioning isolates user workloads from platform workloads using standard Kubernetes scheduling capabilities.
The following is an overview of the configurations required for workload partitioning:
-
Workload partitioning that uses
/etc/crio/crio.conf.d/01-workload-partitioning
pins the OpenShift Container Platform infrastructure pods to a definedcpuset
configuration. The performance profile pins cluster services such as systemd and kubelet to the CPUs that are defined in the
spec.cpu.reserved
field.NoteUsing the Node Tuning Operator, you can configure the performance profile to also pin system-level apps for a complete workload partitioning configuration on the node.
-
The CPUs that you specify in the performance profile
spec.cpu.reserved
field and the workload partitioningcpuset
field must match.
Workload partitioning introduces an extended <workload-type>.workload.openshift.io/cores
resource for each defined CPU pool, or workload type. Kubelet advertises the resources and CPU requests by pods allocated to the pool within the corresponding resource. When workload partitioning is enabled, the <workload-type>.workload.openshift.io/cores
resource allows access to the CPU capacity of the host, not just the default CPU pool.
Additional resources
- For the recommended workload partitioning configuration for single-node OpenShift clusters, see Workload partitioning.
Chapter 16. Requesting CRI-O and Kubelet profiling data by using the Node Observability Operator
The Node Observability Operator collects and stores the CRI-O and Kubelet profiling data of worker nodes. You can query the profiling data to analyze the CRI-O and Kubelet performance trends and debug the performance-related issues.
The Node Observability Operator is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
16.1. Workflow of the Node Observability Operator
The following workflow outlines on how to query the profiling data using the Node Observability Operator:
- Install the Node Observability Operator in the OpenShift Container Platform cluster.
- Create a NodeObservability custom resource to enable the CRI-O profiling on the worker nodes of your choice.
- Run the profiling query to generate the profiling data.
16.2. Installing the Node Observability Operator
The Node Observability Operator is not installed in OpenShift Container Platform by default. You can install the Node Observability Operator by using the OpenShift Container Platform CLI or the web console.
16.2.1. Installing the Node Observability Operator using the CLI
You can install the Node Observability Operator by using the OpenShift CLI (oc).
Prerequisites
- You have installed the OpenShift CLI (oc).
-
You have access to the cluster with
cluster-admin
privileges.
Procedure
Confirm that the Node Observability Operator is available by running the following command:
$ oc get packagemanifests -n openshift-marketplace node-observability-operator
Example output
NAME CATALOG AGE node-observability-operator Red Hat Operators 9h
Create the
node-observability-operator
namespace by running the following command:$ oc new-project node-observability-operator
Create an
OperatorGroup
object YAML file:cat <<EOF | oc apply -f - apiVersion: operators.coreos.com/v1 kind: OperatorGroup metadata: name: node-observability-operator namespace: node-observability-operator spec: targetNamespaces: [] EOF
Create a
Subscription
object YAML file to subscribe a namespace to an Operator:cat <<EOF | oc apply -f - apiVersion: operators.coreos.com/v1alpha1 kind: Subscription metadata: name: node-observability-operator namespace: node-observability-operator spec: channel: alpha name: node-observability-operator source: redhat-operators sourceNamespace: openshift-marketplace EOF
Verification
View the install plan name by running the following command:
$ oc -n node-observability-operator get sub node-observability-operator -o yaml | yq '.status.installplan.name'
Example output
install-dt54w
Verify the install plan status by running the following command:
$ oc -n node-observability-operator get ip <install_plan_name> -o yaml | yq '.status.phase'
<install_plan_name>
is the install plan name that you obtained from the output of the previous command.Example output
COMPLETE
Verify that the Node Observability Operator is up and running:
$ oc get deploy -n node-observability-operator
Example output
NAME READY UP-TO-DATE AVAILABLE AGE node-observability-operator-controller-manager 1/1 1 1 40h
16.2.2. Installing the Node Observability Operator using the web console
You can install the Node Observability Operator from the OpenShift Container Platform web console.
Prerequisites
-
You have access to the cluster with
cluster-admin
privileges. - You have access to the OpenShift Container Platform web console.
Procedure
- Log in to the OpenShift Container Platform web console.
- In the Administrator’s navigation panel, expand Operators → OperatorHub.
- In the All items field, enter Node Observability Operator and select the Node Observability Operator tile.
- Click Install.
On the Install Operator page, configure the following settings:
- In the Update channel area, click alpha.
- In the Installation mode area, click A specific namespace on the cluster.
- From the Installed Namespace list, select node-observability-operator from the list.
- In the Update approval area, select Automatic.
- Click Install.
Verification
- In the Administrator’s navigation panel, expand Operators → Installed Operators.
- Verify that the Node Observability Operator is listed in the Operators list.
16.3. Creating the Node Observability custom resource
You must create and run the NodeObservability
custom resource (CR) before you run the profiling query. When you run the NodeObservability
CR, it creates the necessary machine config and machine config pool CRs to enable the CRI-O profiling on the worker nodes matching the nodeSelector
.
If CRI-O profiling is not enabled on the worker nodes, the NodeObservabilityMachineConfig
resource gets created. Worker nodes matching the nodeSelector
specified in NodeObservability
CR restarts. This might take 10 or more minutes to complete.
Kubelet profiling is enabled by default.
The CRI-O unix socket of the node is mounted on the agent pod, which allows the agent to communicate with CRI-O to run the pprof request. Similarly, the kubelet-serving-ca
certificate chain is mounted on the agent pod, which allows secure communication between the agent and node’s kubelet endpoint.
Prerequisites
- You have installed the Node Observability Operator.
- You have installed the OpenShift CLI (oc).
-
You have access to the cluster with
cluster-admin
privileges.
Procedure
Log in to the OpenShift Container Platform CLI by running the following command:
$ oc login -u kubeadmin https://<HOSTNAME>:6443
Switch back to the
node-observability-operator
namespace by running the following command:$ oc project node-observability-operator
Create a CR file named
nodeobservability.yaml
that contains the following text:apiVersion: nodeobservability.olm.openshift.io/v1alpha2 kind: NodeObservability metadata: name: cluster 1 spec: nodeSelector: kubernetes.io/hostname: <node_hostname> 2 type: crio-kubelet
Run the
NodeObservability
CR:oc apply -f nodeobservability.yaml
Example output
nodeobservability.olm.openshift.io/cluster created
Review the status of the
NodeObservability
CR by running the following command:$ oc get nob/cluster -o yaml | yq '.status.conditions'
Example output
conditions: conditions: - lastTransitionTime: "2022-07-05T07:33:54Z" message: 'DaemonSet node-observability-ds ready: true NodeObservabilityMachineConfig ready: true' reason: Ready status: "True" type: Ready
NodeObservability
CR run is completed when the reason isReady
and the status isTrue
.
16.4. Running the profiling query
To run the profiling query, you must create a NodeObservabilityRun
resource. The profiling query is a blocking operation that fetches CRI-O and Kubelet profiling data for a duration of 30 seconds. After the profiling query is complete, you must retrieve the profiling data inside the container file system /run/node-observability
directory. The lifetime of data is bound to the agent pod through the emptyDir
volume, so you can access the profiling data while the agent pod is in the running
status.
You can request only one profiling query at any point of time.
Prerequisites
- You have installed the Node Observability Operator.
-
You have created the
NodeObservability
custom resource (CR). -
You have access to the cluster with
cluster-admin
privileges.
Procedure
Create a
NodeObservabilityRun
resource file namednodeobservabilityrun.yaml
that contains the following text:apiVersion: nodeobservability.olm.openshift.io/v1alpha2 kind: NodeObservabilityRun metadata: name: nodeobservabilityrun spec: nodeObservabilityRef: name: cluster
Trigger the profiling query by running the
NodeObservabilityRun
resource:$ oc apply -f nodeobservabilityrun.yaml
Review the status of the
NodeObservabilityRun
by running the following command:$ oc get nodeobservabilityrun nodeobservabilityrun -o yaml | yq '.status.conditions'
Example output
conditions: - lastTransitionTime: "2022-07-07T14:57:34Z" message: Ready to start profiling reason: Ready status: "True" type: Ready - lastTransitionTime: "2022-07-07T14:58:10Z" message: Profiling query done reason: Finished status: "True" type: Finished
The profiling query is complete once the status is
True
and type isFinished
.Retrieve the profiling data from the container’s
/run/node-observability
path by running the following bash script:for a in $(oc get nodeobservabilityrun nodeobservabilityrun -o yaml | yq .status.agents[].name); do echo "agent ${a}" mkdir -p "/tmp/${a}" for p in $(oc exec "${a}" -c node-observability-agent -- bash -c "ls /run/node-observability/*.pprof"); do f="$(basename ${p})" echo "copying ${f} to /tmp/${a}/${f}" oc exec "${a}" -c node-observability-agent -- cat "${p}" > "/tmp/${a}/${f}" done done
Chapter 17. Clusters at the network far edge
17.1. Challenges of the network far edge
Edge computing presents complex challenges when managing many sites in geographically displaced locations. Use zero touch provisioning (ZTP) and GitOps to provision and manage sites at the far edge of the network.
17.1.1. Overcoming the challenges of the network far edge
Today, service providers want to deploy their infrastructure at the edge of the network. This presents significant challenges:
- How do you handle deployments of many edge sites in parallel?
- What happens when you need to deploy sites in disconnected environments?
- How do you manage the lifecycle of large fleets of clusters?
Zero touch provisioning (ZTP) and GitOps meets these challenges by allowing you to provision remote edge sites at scale with declarative site definitions and configurations for bare-metal equipment. Template or overlay configurations install OpenShift Container Platform features that are required for CNF workloads. The full lifecycle of installation and upgrades is handled through the ZTP pipeline.
ZTP uses GitOps for infrastructure deployments. With GitOps, you use declarative YAML files and other defined patterns stored in Git repositories. Red Hat Advanced Cluster Management (RHACM) uses your Git repositories to drive the deployment of your infrastructure.
GitOps provides traceability, role-based access control (RBAC), and a single source of truth for the desired state of each site. Scalability issues are addressed by Git methodologies and event driven operations through webhooks.
You start the ZTP workflow by creating declarative site definition and configuration custom resources (CRs) that the ZTP pipeline delivers to the edge nodes.
The following diagram shows how ZTP works within the far edge framework.

17.1.2. Using ZTP to provision clusters at the network far edge
Red Hat Advanced Cluster Management (RHACM) manages clusters in a hub-and-spoke architecture, where a single hub cluster manages many spoke clusters. Hub clusters running RHACM provision and deploy the managed clusters by using zero touch provisioning (ZTP) and the assisted service that is deployed when you install RHACM.
The assisted service handles provisioning of OpenShift Container Platform on single node clusters, three-node clusters, or standard clusters running on bare metal.
A high-level overview of using ZTP to provision and maintain bare-metal hosts with OpenShift Container Platform is as follows:
- A hub cluster running RHACM manages an OpenShift image registry that mirrors the OpenShift Container Platform release images. RHACM uses the OpenShift image registry to provision the managed clusters.
- You manage the bare-metal hosts in a YAML format inventory file, versioned in a Git repository.
- You make the hosts ready for provisioning as managed clusters, and use RHACM and the assisted service to install the bare-metal hosts on site.
Installing and deploying the clusters is a two-stage process, involving an initial installation phase, and a subsequent configuration phase. The following diagram illustrates this workflow:

17.1.3. Installing managed clusters with SiteConfig resources and RHACM
GitOps ZTP uses SiteConfig
custom resources (CRs) in a Git repository to manage the processes that install OpenShift Container Platform clusters. The SiteConfig
CR contains cluster-specific parameters required for installation. It has options for applying select configuration CRs during installation including user defined extra manifests.
The ZTP GitOps plugin processes SiteConfig
CRs to generate a collection of CRs on the hub cluster. This triggers the assisted service in Red Hat Advanced Cluster Management (RHACM) to install OpenShift Container Platform on the bare-metal host. You can find installation status and error messages in these CRs on the hub cluster.
You can provision single clusters manually or in batches with ZTP:
- Provisioning a single cluster
-
Create a single
SiteConfig
CR and related installation and configuration CRs for the cluster, and apply them in the hub cluster to begin cluster provisioning. This is a good way to test your CRs before deploying on a larger scale. - Provisioning many clusters
-
Install managed clusters in batches of up to 400 by defining
SiteConfig
and related CRs in a Git repository. ArgoCD uses theSiteConfig
CRs to deploy the sites. The RHACM policy generator creates the manifests and applies them to the hub cluster. This starts the cluster provisioning process.
17.1.4. Configuring managed clusters with policies and PolicyGenTemplate resources
Zero touch provisioning (ZTP) uses Red Hat Advanced Cluster Management (RHACM) to configure clusters by using a policy-based governance approach to applying the configuration.
The policy generator or PolicyGen
is a plugin for the GitOps Operator that enables the creation of RHACM policies from a concise template. The tool can combine multiple CRs into a single policy, and you can generate multiple policies that apply to various subsets of clusters in your fleet.
For scalability and to reduce the complexity of managing configurations across the fleet of clusters, use configuration CRs with as much commonality as possible.
- Where possible, apply configuration CRs using a fleet-wide common policy.
- The next preference is to create logical groupings of clusters to manage as much of the remaining configurations as possible under a group policy.
- When a configuration is unique to an individual site, use RHACM templating on the hub cluster to inject the site-specific data into a common or group policy. Alternatively, apply an individual site policy for the site.
The following diagram shows how the policy generator interacts with GitOps and RHACM in the configuration phase of cluster deployment.

For large fleets of clusters, it is typical for there to be a high-level of consistency in the configuration of those clusters.
The following recommended structuring of policies combines configuration CRs to meet several goals:
- Describe common configurations once and apply to the fleet.
- Minimize the number of maintained and managed policies.
- Support flexibility in common configurations for cluster variants.
Table 17.1. Recommended PolicyGenTemplate policy categories
Policy category | Description |
---|---|
Common |
A policy that exists in the common category is applied to all clusters in the fleet. Use common |
Groups |
A policy that exists in the groups category is applied to a group of clusters in the fleet. Use group |
Sites | A policy that exists in the sites category is applied to a specific cluster site. Any cluster can have its own specific policies maintained. |
Additional resources
-
For more information about extracting the reference
SiteConfig
andPolicyGenTemplate
CRs from theztp-site-generate
container image, see Preparing the ZTP Git repository.
17.2. Preparing the hub cluster for ZTP
To use RHACM in a disconnected environment, create a mirror registry that mirrors the OpenShift Container Platform release images and Operator Lifecycle Manager (OLM) catalog that contains the required Operator images. OLM manages, installs, and upgrades Operators and their dependencies in the cluster. You can also use a disconnected mirror host to serve the RHCOS ISO and RootFS disk images that are used to provision the bare-metal hosts.
17.2.1. Telco RAN 4.12 validated solution software versions
The Red Hat Telco Radio Access Network (RAN) version 4.12 solution has been validated using the following Red Hat software products.
Table 17.2. Telco RAN 4.12 validated solution software
Product | Software version |
---|---|
Hub cluster OpenShift Container Platform version | 4.12 |
GitOps ZTP plugin | 4.10, 4.11, or 4.12 |
Red Hat Advanced Cluster Management (RHACM) | 2.6, 2.7 |
Red Hat OpenShift GitOps | 1.6 |
Topology Aware Lifecycle Manager (TALM) | 4.10, 4.11, or 4.12 |
17.2.2. Installing GitOps ZTP in a disconnected environment
Use Red Hat Advanced Cluster Management (RHACM), Red Hat OpenShift GitOps, and Topology Aware Lifecycle Manager (TALM) on the hub cluster in the disconnected environment to manage the deployment of multiple managed clusters.
Prerequisites
-
You have installed the OpenShift Container Platform CLI (
oc
). -
You have logged in as a user with
cluster-admin
privileges. You have configured a disconnected mirror registry for use in the cluster.
NoteThe disconnected mirror registry that you create must contain a version of TALM backup and pre-cache images that matches the version of TALM running in the hub cluster. The spoke cluster must be able to resolve these images in the disconnected mirror registry.
Procedure
- Install RHACM in the hub cluster. See Installing RHACM in a disconnected environment.
- Install GitOps and TALM in the hub cluster.
Additional resources
17.2.3. Adding RHCOS ISO and RootFS images to the disconnected mirror host
Before you begin installing clusters in the disconnected environment with Red Hat Advanced Cluster Management (RHACM), you must first host Red Hat Enterprise Linux CoreOS (RHCOS) images for it to use. Use a disconnected mirror to host the RHCOS images.
Prerequisites
- Deploy and configure an HTTP server to host the RHCOS image resources on the network. You must be able to access the HTTP server from your computer, and from the machines that you create.
The RHCOS images might not change with every release of OpenShift Container Platform. You must download images with the highest version that is less than or equal to the version that you install. Use the image versions that match your OpenShift Container Platform version if they are available. You require ISO and RootFS images to install RHCOS on the hosts. RHCOS QCOW2 images are not supported for this installation type.
Procedure
- Log in to the mirror host.
Obtain the RHCOS ISO and RootFS images from mirror.openshift.com, for example:
Export the required image names and OpenShift Container Platform version as environment variables:
$ export ISO_IMAGE_NAME=<iso_image_name> 1
$ export ROOTFS_IMAGE_NAME=<rootfs_image_name> 1
$ export OCP_VERSION=<ocp_version> 1
Download the required images:
$ sudo wget https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.12/${OCP_VERSION}/${ISO_IMAGE_NAME} -O /var/www/html/${ISO_IMAGE_NAME}
$ sudo wget https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.12/${OCP_VERSION}/${ROOTFS_IMAGE_NAME} -O /var/www/html/${ROOTFS_IMAGE_NAME}
Verification steps
Verify that the images downloaded successfully and are being served on the disconnected mirror host, for example:
$ wget http://$(hostname)/${ISO_IMAGE_NAME}
Example output
Saving to: rhcos-4.12.1-x86_64-live.x86_64.iso rhcos-4.12.1-x86_64-live.x86_64.iso- 11%[====> ] 10.01M 4.71MB/s
Additional resources
17.2.4. Enabling the assisted service and updating AgentServiceConfig on the hub cluster
Red Hat Advanced Cluster Management (RHACM) uses the assisted service to deploy OpenShift Container Platform clusters. The assisted service is deployed automatically when you enable the MultiClusterHub Operator with Central Infrastructure Management (CIM). When you have enabled CIM on the hub cluster, you then need to update the AgentServiceConfig
custom resource (CR) with references to the ISO and RootFS images that are hosted on the mirror registry HTTP server.
Prerequisites
-
You have installed the OpenShift CLI (
oc
). -
You have logged in to the hub cluster as a user with
cluster-admin
privileges. - You have enabled the assisted service on the hub cluster. For more information, see Enabling the Central Infrastructure Management Service.
Procedure
Update the
AgentServiceConfig
CR by running the following command:$ oc edit AgentServiceConfig
Add the following entry to the
items.spec.osImages
field in the CR:- cpuArchitecture: x86_64 openshiftVersion: "4.12" rootFSUrl: https://<host>/<path>/rhcos-live-rootfs.x86_64.img url: https://<mirror-registry>/<path>/rhcos-live.x86_64.iso
where:
- <host>
- Is the fully qualified domain name (FQDN) for the target mirror registry HTTP server.
- <path>
- Is the path to the image on the target mirror registry.
Save and quit the editor to apply the changes.
17.2.5. Configuring the hub cluster to use a disconnected mirror registry
You can configure the hub cluster to use a disconnected mirror registry for a disconnected environment.
Prerequisites
- You have a disconnected hub cluster installation with Red Hat Advanced Cluster Management (RHACM) 2.5 installed.
-
You have hosted the
rootfs
andiso
images on an HTTP server.
If you enable TLS for the HTTP server, you must confirm the root certificate is signed by an authority trusted by the client and verify the trusted certificate chain between your OpenShift Container Platform hub and managed clusters and the HTTP server. Using a server configured with an untrusted certificate prevents the images from being downloaded to the image creation service. Using untrusted HTTPS servers is not supported.
Procedure
Create a
ConfigMap
containing the mirror registry config:apiVersion: v1 kind: ConfigMap metadata: name: assisted-installer-mirror-config namespace: assisted-installer labels: app: assisted-service data: ca-bundle.crt: <certificate> 1 registries.conf: | 2 unqualified-search-registries = ["registry.access.redhat.com", "docker.io"] [[registry]] location = <mirror_registry_url> 3 insecure = false mirror-by-digest-only = true
- 1
- The mirror registry’s certificate used when creating the mirror registry.
- 2
- The configuration file for the mirror registry. The mirror registry configuration adds mirror information to
/etc/containers/registries.conf
in the Discovery image. The mirror information is stored in theimageContentSources
section of theinstall-config.yaml
file when passed to the installation program. The Assisted Service pod running on the HUB cluster fetches the container images from the configured mirror registry. - 3
- The URL of the mirror registry.
This updates
mirrorRegistryRef
in theAgentServiceConfig
custom resource, as shown below:Example output
apiVersion: agent-install.openshift.io/v1beta1 kind: AgentServiceConfig metadata: name: agent namespace: assisted-installer spec: databaseStorage: volumeName: <db_pv_name> accessModes: - ReadWriteOnce resources: requests: storage: <db_storage_size> filesystemStorage: volumeName: <fs_pv_name> accessModes: - ReadWriteOnce resources: requests: storage: <fs_storage_size> mirrorRegistryRef: name: 'assisted-installer-mirror-config' osImages: - openshiftVersion: <ocp_version> rootfs: <rootfs_url> 1 url: <iso_url> 2
A valid NTP server is required during cluster installation. Ensure that a suitable NTP server is available and can be reached from the installed clusters through the disconnected network.
17.2.6. Configuring the hub cluster to use unauthenticated registries
You can configure the hub cluster to use unauthenticated registries. Unauthenticated registries does not require authentication to access and download images.
Prerequisites
- You have installed and configured a hub cluster and installed Red Hat Advanced Cluster Management (RHACM) on the hub cluster.
- You have installed the OpenShift Container Platform CLI (oc).
-
You have logged in as a user with
cluster-admin
privileges. - You have configured an unauthenticated registry for use with the hub cluster.
Procedure
Update the
AgentServiceConfig
custom resource (CR) by running the following command:$ oc edit AgentServiceConfig agent
Add the
unauthenticatedRegistries
field in the CR:apiVersion: agent-install.openshift.io/v1beta1 kind: AgentServiceConfig metadata: name: agent spec: unauthenticatedRegistries: - example.registry.com - example.registry2.com ...
Unauthenticated registries are listed under
spec.unauthenticatedRegistries
in theAgentServiceConfig
resource. Any registry on this list is not required to have an entry in the pull secret used for the spoke cluster installation.assisted-service
validates the pull secret by making sure it contains the authentication information for every image registry used for installation.
Mirror registries are automatically added to the ignore list and do not need to be added under spec.unauthenticatedRegistries
. Specifying the PUBLIC_CONTAINER_REGISTRIES
environment variable in the ConfigMap
overrides the default values with the specified value. The PUBLIC_CONTAINER_REGISTRIES
defaults are quay.io and registry.svc.ci.openshift.org.
Verification
Verify that you can access the newly added registry from the hub cluster by running the following commands:
Open a debug shell prompt to the hub cluster:
$ oc debug node/<node_name>
Test access to the unauthenticated registry by running the following command:
sh-4.4# podman login -u kubeadmin -p $(oc whoami -t) <unauthenticated_registry>
where:
- <unauthenticated_registry>
-
Is the new registry, for example,
unauthenticated-image-registry.openshift-image-registry.svc:5000
.
Example output
Login Succeeded!
17.2.7. Configuring the hub cluster with ArgoCD
You can configure the hub cluster with a set of ArgoCD applications that generate the required installation and policy custom resources (CRs) for each site with GitOps zero touch provisioning (ZTP).
Red Hat Advanced Cluster Management (RHACM) uses SiteConfig
CRs to generate the Day 1 managed cluster installation CRs for ArgoCD. Each ArgoCD application can manage a maximum of 300 SiteConfig
CRs.
Prerequisites
- You have a OpenShift Container Platform hub cluster with Red Hat Advanced Cluster Management (RHACM) and Red Hat OpenShift GitOps installed.
-
You have extracted the reference deployment from the ZTP GitOps plugin container as described in the "Preparing the GitOps ZTP site configuration repository" section. Extracting the reference deployment creates the
out/argocd/deployment
directory referenced in the following procedure.
Procedure
Prepare the ArgoCD pipeline configuration:
- Create a Git repository with the directory structure similar to the example directory. For more information, see "Preparing the GitOps ZTP site configuration repository".
Configure access to the repository using the ArgoCD UI. Under Settings configure the following:
-
Repositories - Add the connection information. The URL must end in
.git
, for example,https://repo.example.com/repo.git
and credentials. - Certificates - Add the public certificate for the repository, if needed.
-
Repositories - Add the connection information. The URL must end in
Modify the two ArgoCD applications,
out/argocd/deployment/clusters-app.yaml
andout/argocd/deployment/policies-app.yaml
, based on your Git repository:-
Update the URL to point to the Git repository. The URL ends with
.git
, for example,https://repo.example.com/repo.git
. -
The
targetRevision
indicates which Git repository branch to monitor. -
path
specifies the path to theSiteConfig
andPolicyGenTemplate
CRs, respectively.
-
Update the URL to point to the Git repository. The URL ends with
To install the ZTP GitOps plugin you must patch the ArgoCD instance in the hub cluster by using the patch file previously extracted into the
out/argocd/deployment/
directory. Run the following command:$ oc patch argocd openshift-gitops \ -n openshift-gitops --type=merge \ --patch-file out/argocd/deployment/argocd-openshift-gitops-patch.json
Apply the pipeline configuration to your hub cluster by using the following command:
$ oc apply -k out/argocd/deployment
17.2.8. Preparing the GitOps ZTP site configuration repository
Before you can use the ZTP GitOps pipeline, you need to prepare the Git repository to host the site configuration data.
Prerequisites
- You have configured the hub cluster GitOps applications for generating the required installation and policy custom resources (CRs).
- You have deployed the managed clusters using zero touch provisioning (ZTP).
Procedure
-
Create a directory structure with separate paths for the
SiteConfig
andPolicyGenTemplate
CRs. Export the
argocd
directory from theztp-site-generate
container image using the following commands:$ podman pull registry.redhat.io/openshift4/ztp-site-generate-rhel8:v4.12
$ mkdir -p ./out
$ podman run --log-driver=none --rm registry.redhat.io/openshift4/ztp-site-generate-rhel8:v4.12 extract /home/ztp --tar | tar x -C ./out
Check that the
out
directory contains the following subdirectories:-
out/extra-manifest
contains the source CR files thatSiteConfig
uses to generate extra manifestconfigMap
. -
out/source-crs
contains the source CR files thatPolicyGenTemplate
uses to generate the Red Hat Advanced Cluster Management (RHACM) policies. -
out/argocd/deployment
contains patches and YAML files to apply on the hub cluster for use in the next step of this procedure. -
out/argocd/example
contains the examples forSiteConfig
andPolicyGenTemplate
files that represent the recommended configuration.
-
The directory structure under out/argocd/example
serves as a reference for the structure and content of your Git repository. The example includes SiteConfig
and PolicyGenTemplate
reference CRs for single-node, three-node, and standard clusters. Remove references to cluster types that you are not using. The following example describes a set of CRs for a network of single-node clusters:
example ├── policygentemplates │ ├── common-ranGen.yaml │ ├── example-sno-site.yaml │ ├── group-du-sno-ranGen.yaml │ ├── group-du-sno-validator-ranGen.yaml │ ├── kustomization.yaml │ └── ns.yaml └── siteconfig ├── example-sno.yaml ├── KlusterletAddonConfigOverride.yaml └── kustomization.yaml
Keep SiteConfig
and PolicyGenTemplate
CRs in separate directories. Both the SiteConfig
and PolicyGenTemplate
directories must contain a kustomization.yaml
file that explicitly includes the files in that directory.
This directory structure and the kustomization.yaml
files must be committed and pushed to your Git repository. The initial push to Git should include the kustomization.yaml
files. The SiteConfig
(example-sno.yaml
) and PolicyGenTemplate
(common-ranGen.yaml
, group-du-sno*.yaml
, and example-sno-site.yaml
) files can be omitted and pushed at a later time as required when deploying a site.
The KlusterletAddonConfigOverride.yaml
file is only required if one or more SiteConfig
CRs which make reference to it are committed and pushed to Git. See example-sno.yaml
for an example of how this is used.
17.3. Installing managed clusters with RHACM and SiteConfig resources
You can provision OpenShift Container Platform clusters at scale with Red Hat Advanced Cluster Management (RHACM) using the assisted service and the GitOps plugin policy generator with core-reduction technology enabled. The zero touch priovisioning (ZTP) pipeline performs the cluster installations. ZTP can be used in a disconnected environment.
17.3.1. GitOps ZTP and Topology Aware Lifecycle Manager
GitOps zero touch provisioning (ZTP) generates installation and configuration CRs from manifests stored in Git. These artifacts are applied to a centralized hub cluster where Red Hat Advanced Cluster Management (RHACM), the assisted service, and the Topology Aware Lifecycle Manager (TALM) use the CRs to install and configure the managed cluster. The configuration phase of the ZTP pipeline uses the TALM to orchestrate the application of the configuration CRs to the cluster. There are several key integration points between GitOps ZTP and the TALM.
- Inform policies
-
By default, GitOps ZTP creates all policies with a remediation action of
inform
. These policies cause RHACM to report on compliance status of clusters relevant to the policies but does not apply the desired configuration. During the ZTP process, after OpenShift installation, the TALM steps through the createdinform
policies and enforces them on the target managed cluster(s). This applies the configuration to the managed cluster. Outside of the ZTP phase of the cluster lifecycle, this allows you to change policies without the risk of immediately rolling those changes out to affected managed clusters. You can control the timing and the set of remediated clusters by using TALM. - Automatic creation of ClusterGroupUpgrade CRs
To automate the initial configuration of newly deployed clusters, TALM monitors the state of all
ManagedCluster
CRs on the hub cluster. AnyManagedCluster
CR that does not have aztp-done
label applied, including newly createdManagedCluster
CRs, causes the TALM to automatically create aClusterGroupUpgrade
CR with the following characteristics:-
The
ClusterGroupUpgrade
CR is created and enabled in theztp-install
namespace. -
ClusterGroupUpgrade
CR has the same name as theManagedCluster
CR. -
The cluster selector includes only the cluster associated with that
ManagedCluster
CR. -
The set of managed policies includes all policies that RHACM has bound to the cluster at the time the
ClusterGroupUpgrade
is created. - Pre-caching is disabled.
- Timeout set to 4 hours (240 minutes).
The automatic creation of an enabled
ClusterGroupUpgrade
ensures that initial zero-touch deployment of clusters proceeds without the need for user intervention. Additionally, the automatic creation of aClusterGroupUpgrade
CR for anyManagedCluster
without theztp-done
label allows a failed ZTP installation to be restarted by simply deleting theClusterGroupUpgrade
CR for the cluster.-
The
- Waves
Each policy generated from a
PolicyGenTemplate
CR includes aztp-deploy-wave
annotation. This annotation is based on the same annotation from each CR which is included in that policy. The wave annotation is used to order the policies in the auto-generatedClusterGroupUpgrade
CR. The wave annotation is not used other than for the auto-generatedClusterGroupUpgrade
CR.NoteAll CRs in the same policy must have the same setting for the
ztp-deploy-wave
annotation. The default value of this annotation for each CR can be overridden in thePolicyGenTemplate
. The wave annotation in the source CR is used for determining and setting the policy wave annotation. This annotation is removed from each built CR which is included in the generated policy at runtime.The TALM applies the configuration policies in the order specified by the wave annotations. The TALM waits for each policy to be compliant before moving to the next policy. It is important to ensure that the wave annotation for each CR takes into account any prerequisites for those CRs to be applied to the cluster. For example, an Operator must be installed before or concurrently with the configuration for the Operator. Similarly, the
CatalogSource
for an Operator must be installed in a wave before or concurrently with the Operator Subscription. The default wave value for each CR takes these prerequisites into account.Multiple CRs and policies can share the same wave number. Having fewer policies can result in faster deployments and lower CPU usage. It is a best practice to group many CRs into relatively few waves.
To check the default wave value in each source CR, run the following command against the out/source-crs
directory that is extracted from the ztp-site-generate
container image:
$ grep -r "ztp-deploy-wave" out/source-crs
- Phase labels
The
ClusterGroupUpgrade
CR is automatically created and includes directives to annotate theManagedCluster
CR with labels at the start and end of the ZTP process.When ZTP configuration post-installation commences, the
ManagedCluster
has theztp-running
label applied. When all policies are remediated to the cluster and are fully compliant, these directives cause the TALM to remove theztp-running
label and apply theztp-done
label.For deployments that make use of the
informDuValidator
policy, theztp-done
label is applied when the cluster is fully ready for deployment of applications. This includes all reconciliation and resulting effects of the ZTP applied configuration CRs. Theztp-done
label affects automaticClusterGroupUpgrade
CR creation by TALM. Do not manipulate this label after the initial ZTP installation of the cluster.- Linked CRs
-
The automatically created
ClusterGroupUpgrade
CR has the owner reference set as theManagedCluster
from which it was derived. This reference ensures that deleting theManagedCluster
CR causes the instance of theClusterGroupUpgrade
to be deleted along with any supporting resources.
17.3.2. Overview of deploying managed clusters with ZTP
Red Hat Advanced Cluster Management (RHACM) uses zero touch provisioning (ZTP) to deploy single-node OpenShift Container Platform clusters, three-node clusters, and standard clusters. You manage site configuration data as OpenShift Container Platform custom resources (CRs) in a Git repository. ZTP uses a declarative GitOps approach for a develop once, deploy anywhere model to deploy the managed clusters.
The deployment of the clusters includes:
- Installing the host operating system (RHCOS) on a blank server
- Deploying OpenShift Container Platform
- Creating cluster policies and site subscriptions
- Making the necessary network configurations to the server operating system
- Deploying profile Operators and performing any needed software-related configuration, such as performance profile, PTP, and SR-IOV
Overview of the managed site installation process
After you apply the managed site custom resources (CRs) on the hub cluster, the following actions happen automatically:
- A Discovery image ISO file is generated and booted on the target host.
- When the ISO file successfully boots on the target host it reports the host hardware information to RHACM.
- After all hosts are discovered, OpenShift Container Platform is installed.
-
When OpenShift Container Platform finishes installing, the hub installs the
klusterlet
service on the target cluster. - The requested add-on services are installed on the target cluster.
The Discovery image ISO process is complete when the Agent
CR for the managed cluster is created on the hub cluster.
The target bare-metal host must meet the networking, firmware, and hardware requirements listed in Recommended single-node OpenShift cluster configuration for vDU application workloads.
17.3.3. Creating the managed bare-metal host secrets
Add the required Secret
custom resources (CRs) for the managed bare-metal host to the hub cluster. You need a secret for the ZTP pipeline to access the Baseboard Management Controller (BMC) and a secret for the assisted installer service to pull cluster installation images from the registry.
The secrets are referenced from the SiteConfig
CR by name. The namespace must match the SiteConfig
namespace.
Procedure
Create a YAML secret file containing credentials for the host Baseboard Management Controller (BMC) and a pull secret required for installing OpenShift and all add-on cluster Operators:
Save the following YAML as the file
example-sno-secret.yaml
:apiVersion: v1 kind: Secret metadata: name: example-sno-bmc-secret namespace: example-sno 1 data: 2 password: <base64_password> username: <base64_username> type: Opaque --- apiVersion: v1 kind: Secret metadata: name: pull-secret namespace: example-sno 3 data: .dockerconfigjson: <pull_secret> 4 type: kubernetes.io/dockerconfigjson
-
Add the relative path to
example-sno-secret.yaml
to thekustomization.yaml
file that you use to install the cluster.
17.3.4. Configuring Discovery ISO kernel arguments for installations using GitOps ZTP
The GitOps ZTP workflow uses the Discovery ISO as part of the OpenShift Container Platform installation process on managed bare-metal hosts. You can edit the InfraEnv
resource to specify kernel arguments for the Discovery ISO. This is useful for cluster installations with specific environmental requirements. For example, configure the rd.net.timeout.carrier
kernel argument for the Discovery ISO to facilitate static networking for the cluster or to receive a DHCP address before downloading the root file system during installation.
In OpenShift Container Platform 4.12, you can only add kernel arguments. You can not replace or delete kernel arguments.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
Procedure
Create the
InfraEnv
CR and edit thespec.kernelArguments
specification to configure kernel arguments.Save the following YAML in an
InfraEnv-example.yaml
file:NoteThe
InfraEnv
CR in this example uses template syntax such as{{ .Cluster.ClusterName }}
that is populated based on values in theSiteConfig
CR. TheSiteConfig
CR automatically populates values for these templates during deployment. Do not edit the templates manually.apiVersion: agent-install.openshift.io/v1beta1 kind: InfraEnv metadata: annotations: argocd.argoproj.io/sync-wave: "1" name: "{{ .Cluster.ClusterName }}" namespace: "{{ .Cluster.ClusterName }}" spec: clusterRef: name: "{{ .Cluster.ClusterName }}" namespace: "{{ .Cluster.ClusterName }}" kernelArguments: - operation: append 1 value: audit=0 2 - operation: append value: trace=1 sshAuthorizedKey: "{{ .Site.SshPublicKey }}" proxy: "{{ .Cluster.ProxySettings }}" pullSecretRef: name: "{{ .Site.PullSecretRef.Name }}" ignitionConfigOverride: "{{ .Cluster.IgnitionConfigOverride }}" nmStateConfigLabelSelector: matchLabels: nmstate-label: "{{ .Cluster.ClusterName }}" additionalNTPSources: "{{ .Cluster.AdditionalNTPSources }}"