Scalability and performance

OpenShift Container Platform 4.6

Scaling your OpenShift Container Platform cluster and tuning performance in production environments

Red Hat OpenShift Documentation Team

Abstract

This document provides instructions for scaling your cluster and optimizing the performance of your OpenShift Container Platform environment.

Chapter 4. Using the Node Tuning Operator

Learn about the Node Tuning Operator and how you can use it to manage node-level tuning by orchestrating the tuned daemon.

4.1. About the Node Tuning Operator

The Node Tuning Operator helps you manage node-level tuning by orchestrating the Tuned daemon. The majority of high-performance applications require some level of kernel tuning. The Node Tuning Operator provides a unified management interface to users of node-level sysctls and more flexibility to add custom tuning specified by user needs. The Operator manages the containerized Tuned daemon for OpenShift Container Platform as a Kubernetes daemon set. It ensures the custom tuning specification is passed to all containerized Tuned daemons running in the cluster in the format that the daemons understand. The daemons run on all nodes in the cluster, one per node.

Node-level settings applied by the containerized Tuned daemon are rolled back on an event that triggers a profile change or when the containerized Tuned daemon is terminated gracefully by receiving and handling a termination signal.

The Node Tuning Operator is part of a standard OpenShift Container Platform installation in version 4.1 and later.

4.2. Accessing an example Node Tuning Operator specification

Use this process to access an example Node Tuning Operator specification.

Procedure

  1. Run:

    $ oc get Tuned/default -o yaml -n openshift-cluster-node-tuning-operator

The default CR is meant for delivering standard node-level tuning for the OpenShift Container Platform platform and it can only be modified to set the Operator Management state. Any other custom changes to the default CR will be overwritten by the Operator. For custom tuning, create your own Tuned CRs. Newly created CRs will be combined with the default CR and custom tuning applied to OpenShift Container Platform nodes based on node or pod labels and profile priorities.

Warning

While in certain situations the support for pod labels can be a convenient way of automatically delivering required tuning, this practice is discouraged and strongly advised against, especially in large-scale clusters. The default Tuned CR ships without pod label matching. If a custom profile is created with pod label matching, then the functionality will be enabled at that time. The pod label functionality might be deprecated in future versions of the Node Tuning Operator.

4.3. Default profiles set on a cluster

The following are the default profiles set on a cluster.

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: default
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - name: "openshift"
    data: |
      [main]
      summary=Optimize systems running OpenShift (parent profile)
      include=${f:virt_check:virtual-guest:throughput-performance}

      [selinux]
      avc_cache_threshold=8192

      [net]
      nf_conntrack_hashsize=131072

      [sysctl]
      net.ipv4.ip_forward=1
      kernel.pid_max=>4194304
      net.netfilter.nf_conntrack_max=1048576
      net.ipv4.conf.all.arp_announce=2
      net.ipv4.neigh.default.gc_thresh1=8192
      net.ipv4.neigh.default.gc_thresh2=32768
      net.ipv4.neigh.default.gc_thresh3=65536
      net.ipv6.neigh.default.gc_thresh1=8192
      net.ipv6.neigh.default.gc_thresh2=32768
      net.ipv6.neigh.default.gc_thresh3=65536
      vm.max_map_count=262144

      [sysfs]
      /sys/module/nvme_core/parameters/io_timeout=4294967295
      /sys/module/nvme_core/parameters/max_retries=10

  - name: "openshift-control-plane"
    data: |
      [main]
      summary=Optimize systems running OpenShift control plane
      include=openshift

      [sysctl]
      # ktune sysctl settings, maximizing i/o throughput
      #
      # Minimal preemption granularity for CPU-bound tasks:
      # (default: 1 msec#  (1 + ilog(ncpus)), units: nanoseconds)
      kernel.sched_min_granularity_ns=10000000
      # The total time the scheduler will consider a migrated process
      # "cache hot" and thus less likely to be re-migrated
      # (system default is 500000, i.e. 0.5 ms)
      kernel.sched_migration_cost_ns=5000000
      # SCHED_OTHER wake-up granularity.
      #
      # Preemption granularity when tasks wake up.  Lower the value to
      # improve wake-up latency and throughput for latency critical tasks.
      kernel.sched_wakeup_granularity_ns=4000000

  - name: "openshift-node"
    data: |
      [main]
      summary=Optimize systems running OpenShift nodes
      include=openshift

      [sysctl]
      net.ipv4.tcp_fastopen=3
      fs.inotify.max_user_watches=65536
      fs.inotify.max_user_instances=8192

  recommend:
  - profile: "openshift-control-plane"
    priority: 30
    match:
    - label: "node-role.kubernetes.io/master"
    - label: "node-role.kubernetes.io/infra"

  - profile: "openshift-node"
    priority: 40

4.4. Verifying that the Tuned profiles are applied

Use this procedure to check which Tuned profiles are applied on every node.

Procedure

  1. Check which Tuned pods are running on each node:

    $ oc get pods -n openshift-cluster-node-tuning-operator -o wide

    Example output

    NAME                                            READY   STATUS    RESTARTS   AGE    IP             NODE                                         NOMINATED NODE   READINESS GATES
    cluster-node-tuning-operator-599489d4f7-k4hw4   1/1     Running   0          6d2h   10.129.0.76    ip-10-0-145-113.eu-west-3.compute.internal   <none>           <none>
    tuned-2jkzp                                     1/1     Running   1          6d3h   10.0.145.113   ip-10-0-145-113.eu-west-3.compute.internal   <none>           <none>
    tuned-g9mkx                                     1/1     Running   1          6d3h   10.0.147.108   ip-10-0-147-108.eu-west-3.compute.internal   <none>           <none>
    tuned-kbxsh                                     1/1     Running   1          6d3h   10.0.132.143   ip-10-0-132-143.eu-west-3.compute.internal   <none>           <none>
    tuned-kn9x6                                     1/1     Running   1          6d3h   10.0.163.177   ip-10-0-163-177.eu-west-3.compute.internal   <none>           <none>
    tuned-vvxwx                                     1/1     Running   1          6d3h   10.0.131.87    ip-10-0-131-87.eu-west-3.compute.internal    <none>           <none>
    tuned-zqrwq                                     1/1     Running   1          6d3h   10.0.161.51    ip-10-0-161-51.eu-west-3.compute.internal    <none>           <none>

  2. Extract the profile applied from each pod and match them against the previous list:

    $ for p in `oc get pods -n openshift-cluster-node-tuning-operator -l openshift-app=tuned -o=jsonpath='{range .items[*]}{.metadata.name} {end}'`; do printf "\n*** $p ***\n" ; oc logs pod/$p -n openshift-cluster-node-tuning-operator | grep applied; done

    Example output

    *** tuned-2jkzp ***
    2020-07-10 13:53:35,368 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-control-plane' applied
    
    *** tuned-g9mkx ***
    2020-07-10 14:07:17,089 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-node' applied
    2020-07-10 15:56:29,005 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-node-es' applied
    2020-07-10 16:00:19,006 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-node' applied
    2020-07-10 16:00:48,989 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-node-es' applied
    
    *** tuned-kbxsh ***
    2020-07-10 13:53:30,565 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-node' applied
    2020-07-10 15:56:30,199 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-node-es' applied
    
    *** tuned-kn9x6 ***
    2020-07-10 14:10:57,123 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-node' applied
    2020-07-10 15:56:28,757 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-node-es' applied
    
    *** tuned-vvxwx ***
    2020-07-10 14:11:44,932 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-control-plane' applied
    
    *** tuned-zqrwq ***
    2020-07-10 14:07:40,246 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-control-plane' applied

4.5. Custom tuning specification

The custom resource (CR) for the Operator has two major sections. The first section, profile:, is a list of Tuned profiles and their names. The second, recommend:, defines the profile selection logic.

Multiple custom tuning specifications can co-exist as multiple CRs in the Operator’s namespace. The existence of new CRs or the deletion of old CRs is detected by the Operator. All existing custom tuning specifications are merged and appropriate objects for the containerized Tuned daemons are updated.

Management state

The Operator Management state is set by adjusting the default Tuned CR. By default, the Operator is in the Managed state and the spec.managementState field is not present in the default Tuned CR. Valid values for the Operator Management state are as follows:

  • Managed: the Operator will update its operands as configuration resources are updated
  • Unmanaged: the Operator will ignore changes to the configuration resources
  • Removed: the Operator will remove its operands and resources the Operator provisioned

Profile data

The profile: section lists Tuned profiles and their names.

profile:
- name: tuned_profile_1
  data: |
    # Tuned profile specification
    [main]
    summary=Description of tuned_profile_1 profile

    [sysctl]
    net.ipv4.ip_forward=1
    # ... other sysctl's or other Tuned daemon plugins supported by the containerized Tuned

# ...

- name: tuned_profile_n
  data: |
    # Tuned profile specification
    [main]
    summary=Description of tuned_profile_n profile

    # tuned_profile_n profile settings

Recommended profiles

The profile: selection logic is defined by the recommend: section of the CR. The recommend: section is a list of items to recommend the profiles based on a selection criteria.

recommend:
<recommend-item-1>
# ...
<recommend-item-n>

The individual items of the list:

- machineConfigLabels: 1
    <mcLabels> 2
  match: 3
  <match> 4
  priority: <priority> 5
  profile: <tuned_profile_name> 6
1
Optional.
2
A dictionary of key/value MachineConfig labels. The keys must be unique.
3
If omitted, profile match is assumed unless a profile with a higher priority matches first or machineConfigLabels is set.
4
An optional list.
5
Profile ordering priority. Lower numbers mean higher priority (0 is the highest priority).
6
A Tuned profile to apply on a match. For example tuned_profile_1.

<match> is an optional list recursively defined as follows:

- label: <label_name> 1
  value: <label_value> 2
  type: <label_type> 3
  <match> 4
1
Node or pod label name.
2
Optional node or pod label value. If omitted, the presence of <label_name> is enough to match.
3
Optional object type (node or pod). If omitted, node is assumed.
4
An optional <match> list.

If <match> is not omitted, all nested <match> sections must also evaluate to true. Otherwise, false is assumed and the profile with the respective <match> section will not be applied or recommended. Therefore, the nesting (child <match> sections) works as logical AND operator. Conversely, if any item of the <match> list matches, the entire <match> list evaluates to true. Therefore, the list acts as logical OR operator.

If machineConfigLabels is defined, machine config pool based matching is turned on for the given recommend: list item. <mcLabels> specifies the labels for a machine config. The machine config is created automatically to apply host settings, such as kernel boot parameters, for the profile <tuned_profile_name>. This involves finding all machine config pools with machine config selector matching <mcLabels> and setting the profile <tuned_profile_name> on all nodes that match the machine config pools' node selectors.

The list items match and machineConfigLabels are connected by the logical OR operator. The match item is evaluated first in a short-circuit manner. Therefore, if it evaluates to true, the machineConfigLabels item is not considered.

Important

When using machine config pool based matching, it is advised to group nodes with the same hardware configuration into the same machine config pool. Not following this practice might result in Tuned operands calculating conflicting kernel parameters for two or more nodes sharing the same machine config pool.

Example: node or pod label based matching

- match:
  - label: tuned.openshift.io/elasticsearch
    match:
    - label: node-role.kubernetes.io/master
    - label: node-role.kubernetes.io/infra
    type: pod
  priority: 10
  profile: openshift-control-plane-es
- match:
  - label: node-role.kubernetes.io/master
  - label: node-role.kubernetes.io/infra
  priority: 20
  profile: openshift-control-plane
- priority: 30
  profile: openshift-node

The CR above is translated for the containerized Tuned daemon into its recommend.conf file based on the profile priorities. The profile with the highest priority (10) is openshift-control-plane-es and, therefore, it is considered first. The containerized Tuned daemon running on a given node looks to see if there is a pod running on the same node with the tuned.openshift.io/elasticsearch label set. If not, the entire <match> section evaluates as false. If there is such a pod with the label, in order for the <match> section to evaluate to true, the node label also needs to be node-role.kubernetes.io/master or node-role.kubernetes.io/infra.

If the labels for the profile with priority 10 matched, openshift-control-plane-es profile is applied and no other profile is considered. If the node/pod label combination did not match, the second highest priority profile (openshift-control-plane) is considered. This profile is applied if the containerized Tuned pod runs on a node with labels node-role.kubernetes.io/master or node-role.kubernetes.io/infra.

Finally, the profile openshift-node has the lowest priority of 30. It lacks the <match> section and, therefore, will always match. It acts as a profile catch-all to set openshift-node profile, if no other profile with higher priority matches on a given node.

Decision workflow

Example: MachineConfigPool based matching

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-node-custom
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift node profile with an additional kernel parameter
      include=openshift-node
      [bootloader]
      cmdline_openshift_node_custom=+skew_tick=1
    name: openshift-node-custom

  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-custom"
    priority: 20
    profile: openshift-node-custom

To minimize node reboots, label the target nodes with a label the machine config pool’s node selector will match, then create the Tuned CR above and finally create the custom machine config pool itself.

4.6. Custom tuning example

The following CR applies custom node-level tuning for OpenShift Container Platform nodes with label tuned.openshift.io/ingress-node-label set to any value. As an administrator, use the following command to create a custom Tuned CR.

Custom tuning example

$ oc create -f- <<_EOF_
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: ingress
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=A custom OpenShift ingress profile
      include=openshift-control-plane
      [sysctl]
      net.ipv4.ip_local_port_range="1024 65535"
      net.ipv4.tcp_tw_reuse=1
    name: openshift-ingress
  recommend:
  - match:
    - label: tuned.openshift.io/ingress-node-label
    priority: 10
    profile: openshift-ingress
_EOF_

Important

Custom profile writers are strongly encouraged to include the default Tuned daemon profiles shipped within the default Tuned CR. The example above uses the default openshift-control-plane profile to accomplish this.

4.7. Supported Tuned daemon plug-ins

Excluding the [main] section, the following Tuned plug-ins are supported when using custom profiles defined in the profile: section of the Tuned CR:

  • audio
  • cpu
  • disk
  • eeepc_she
  • modules
  • mounts
  • net
  • scheduler
  • scsi_host
  • selinux
  • sysctl
  • sysfs
  • usb
  • video
  • vm

There is some dynamic tuning functionality provided by some of these plug-ins that is not supported. The following Tuned plug-ins are currently not supported:

  • bootloader
  • script
  • systemd

See Available Tuned Plug-ins and Getting Started with Tuned for more information.

Chapter 5. Using Cluster Loader

Cluster Loader is a tool that deploys large numbers of various objects to a cluster, which creates user-defined cluster objects. Build, configure, and run Cluster Loader to measure performance metrics of your OpenShift Container Platform deployment at various cluster states.

5.1. Installing Cluster Loader

Procedure

  1. To pull the container image, run:

    $ sudo podman pull quay.io/openshift/origin-tests:4.6

5.2. Running Cluster Loader

Prerequisites

  • The repository will prompt you to authenticate. The registry credentials allow you to access the image, which is not publicly available. Use your existing authentication credentials from installation.

Procedure

  1. Execute Cluster Loader using the built-in test configuration, which deploys five template builds and waits for them to complete:

    $ sudo podman run -v ${LOCAL_KUBECONFIG}:/root/.kube/config:z -i \
    quay.io/openshift/origin-tests:4.6 /bin/bash -c 'export KUBECONFIG=/root/.kube/config && \
    openshift-tests run-test "[sig-scalability][Feature:Performance] Load cluster \
    should populate the cluster [Slow][Serial] [Suite:openshift]"'

    Alternatively, execute Cluster Loader with a user-defined configuration by setting the environment variable for VIPERCONFIG:

    $ sudo podman run -v ${LOCAL_KUBECONFIG}:/root/.kube/config:z \
    -v ${LOCAL_CONFIG_FILE_PATH}:/root/configs/:z \
    -i quay.io/openshift/origin-tests:4.6 \
    /bin/bash -c 'KUBECONFIG=/root/.kube/config VIPERCONFIG=/root/configs/test.yaml \
    openshift-tests run-test "[sig-scalability][Feature:Performance] Load cluster \
    should populate the cluster [Slow][Serial] [Suite:openshift]"'

    In this example, ${LOCAL_KUBECONFIG} refers to the path to the kubeconfig on your local file system. Also, there is a directory called ${LOCAL_CONFIG_FILE_PATH}, which is mounted into the container that contains a configuration file called test.yaml. Additionally, if the test.yaml references any external template files or podspec files, they should also be mounted into the container.

5.3. Configuring Cluster Loader

The tool creates multiple namespaces (projects), which contain multiple templates or pods.

5.3.1. Example Cluster Loader configuration file

Cluster Loader’s configuration file is a basic YAML file:

provider: local 1
ClusterLoader:
  cleanup: true
  projects:
    - num: 1
      basename: clusterloader-cakephp-mysql
      tuning: default
      ifexists: reuse
      templates:
        - num: 1
          file: cakephp-mysql.json

    - num: 1
      basename: clusterloader-dancer-mysql
      tuning: default
      ifexists: reuse
      templates:
        - num: 1
          file: dancer-mysql.json

    - num: 1
      basename: clusterloader-django-postgresql
      tuning: default
      ifexists: reuse
      templates:
        - num: 1
          file: django-postgresql.json

    - num: 1
      basename: clusterloader-nodejs-mongodb
      tuning: default
      ifexists: reuse
      templates:
        - num: 1
          file: quickstarts/nodejs-mongodb.json

    - num: 1
      basename: clusterloader-rails-postgresql
      tuning: default
      templates:
        - num: 1
          file: rails-postgresql.json

  tuningsets: 2
    - name: default
      pods:
        stepping: 3
          stepsize: 5
          pause: 0 s
        rate_limit: 4
          delay: 0 ms
1
Optional setting for end-to-end tests. Set to local to avoid extra log messages.
2
The tuning sets allow rate limiting and stepping, the ability to create several batches of pods while pausing in between sets. Cluster Loader monitors completion of the previous step before continuing.
3
Stepping will pause for M seconds after each N objects are created.
4
Rate limiting will wait M milliseconds between the creation of objects.

This example assumes that references to any external template files or pod spec files are also mounted into the container.

Important

If you are running Cluster Loader on Microsoft Azure, then you must set the AZURE_AUTH_LOCATION variable to a file that contains the output of terraform.azure.auto.tfvars.json, which is present in the installer directory.

5.3.2. Configuration fields

Table 5.1. Top-level Cluster Loader Fields

FieldDescription

cleanup

Set to true or false. One definition per configuration. If set to true, cleanup deletes all namespaces (projects) created by Cluster Loader at the end of the test.

projects

A sub-object with one or many definition(s). Under projects, each namespace to create is defined and projects has several mandatory subheadings.

tuningsets

A sub-object with one definition per configuration. tuningsets allows the user to define a tuning set to add configurable timing to project or object creation (pods, templates, and so on).

sync

An optional sub-object with one definition per configuration. Adds synchronization possibilities during object creation.

Table 5.2. Fields under projects

FieldDescription

num

An integer. One definition of the count of how many projects to create.

basename

A string. One definition of the base name for the project. The count of identical namespaces will be appended to Basename to prevent collisions.

tuning

A string. One definition of what tuning set you want to apply to the objects, which you deploy inside this namespace.

ifexists

A string containing either reuse or delete. Defines what the tool does if it finds a project or namespace that has the same name of the project or namespace it creates during execution.

configmaps

A list of key-value pairs. The key is the config map name and the value is a path to a file from which you create the config map.

secrets

A list of key-value pairs. The key is the secret name and the value is a path to a file from which you create the secret.

pods

A sub-object with one or many definition(s) of pods to deploy.

templates

A sub-object with one or many definition(s) of templates to deploy.

Table 5.3. Fields under pods and templates

FieldDescription

num

An integer. The number of pods or templates to deploy.

image

A string. The docker image URL to a repository where it can be pulled.

basename

A string. One definition of the base name for the template (or pod) that you want to create.

file

A string. The path to a local file, which is either a pod spec or template to be created.

parameters

Key-value pairs. Under parameters, you can specify a list of values to override in the pod or template.

Table 5.4. Fields under tuningsets

FieldDescription

name

A string. The name of the tuning set which will match the name specified when defining a tuning in a project.

pods

A sub-object identifying the tuningsets that will apply to pods.

templates

A sub-object identifying the tuningsets that will apply to templates.

Table 5.5. Fields under tuningsets pods or tuningsets templates

FieldDescription

stepping

A sub-object. A stepping configuration used if you want to create an object in a step creation pattern.

rate_limit

A sub-object. A rate-limiting tuning set configuration to limit the object creation rate.

Table 5.6. Fields under tuningsets pods or tuningsets templates, stepping

FieldDescription

stepsize

An integer. How many objects to create before pausing object creation.

pause

An integer. How many seconds to pause after creating the number of objects defined in stepsize.

timeout

An integer. How many seconds to wait before failure if the object creation is not successful.

delay

An integer. How many milliseconds (ms) to wait between creation requests.

Table 5.7. Fields under sync

FieldDescription

server

A sub-object with enabled and port fields. The boolean enabled defines whether to start an HTTP server for pod synchronization. The integer port defines the HTTP server port to listen on (9090 by default).

running

A boolean. Wait for pods with labels matching selectors to go into Running state.

succeeded

A boolean. Wait for pods with labels matching selectors to go into Completed state.

selectors

A list of selectors to match pods in Running or Completed states.

timeout

A string. The synchronization timeout period to wait for pods in Running or Completed states. For values that are not 0, use units: [ns|us|ms|s|m|h].

5.4. Known issues

  • Cluster Loader fails when called without configuration. (BZ#1761925)
  • If the IDENTIFIER parameter is not defined in user templates, template creation fails with error: unknown parameter name "IDENTIFIER". If you deploy templates, add this parameter to your template to avoid this error:

    {
      "name": "IDENTIFIER",
      "description": "Number to append to the name of resources",
      "value": "1"
    }

    If you deploy pods, adding the parameter is unnecessary.

Chapter 6. Using CPU Manager

CPU Manager manages groups of CPUs and constrains workloads to specific CPUs.

CPU Manager is useful for workloads that have some of these attributes:

  • Require as much CPU time as possible.
  • Are sensitive to processor cache misses.
  • Are low-latency network applications.
  • Coordinate with other processes and benefit from sharing a single processor cache.

6.1. Setting up CPU Manager

Procedure

  1. Optional: Label a node:

    # oc label node perf-node.example.com cpumanager=true
  2. Edit the MachineConfigPool of the nodes where CPU Manager should be enabled. In this example, all workers have CPU Manager enabled:

    # oc edit machineconfigpool worker
  3. Add a label to the worker machine config pool:

    metadata:
      creationTimestamp: 2020-xx-xxx
      generation: 3
      labels:
        custom-kubelet: cpumanager-enabled
  4. Create a KubeletConfig, cpumanager-kubeletconfig.yaml, custom resource (CR). Refer to the label created in the previous step to have the correct nodes updated with the new kubelet config. See the machineConfigPoolSelector section:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: cpumanager-enabled
    spec:
      machineConfigPoolSelector:
        matchLabels:
          custom-kubelet: cpumanager-enabled
      kubeletConfig:
         cpuManagerPolicy: static 1
         cpuManagerReconcilePeriod: 5s 2
    1
    Specify a policy:
    • none. This policy explicitly enables the existing default CPU affinity scheme, providing no affinity beyond what the scheduler does automatically.
    • static. This policy allows pods with certain resource characteristics to be granted increased CPU affinity and exclusivity on the node.
    2
    Optional. Specify the CPU Manager reconcile frequency. The default is 5s.
  5. Create the dynamic kubelet config:

    # oc create -f cpumanager-kubeletconfig.yaml

    This adds the CPU Manager feature to the kubelet config and, if needed, the Machine Config Operator (MCO) reboots the node. To enable CPU Manager, a reboot is not needed.

  6. Check for the merged kubelet config:

    # oc get machineconfig 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet -o json | grep ownerReference -A7

    Example output

           "ownerReferences": [
                {
                    "apiVersion": "machineconfiguration.openshift.io/v1",
                    "kind": "KubeletConfig",
                    "name": "cpumanager-enabled",
                    "uid": "7ed5616d-6b72-11e9-aae1-021e1ce18878"
                }
            ],

  7. Check the worker for the updated kubelet.conf:

    # oc debug node/perf-node.example.com
    sh-4.2# cat /host/etc/kubernetes/kubelet.conf | grep cpuManager

    Example output

    cpuManagerPolicy: static        1
    cpuManagerReconcilePeriod: 5s   2

    1 2
    These settings were defined when you created the KubeletConfig CR.
  8. Create a pod that requests a core or multiple cores. Both limits and requests must have their CPU value set to a whole integer. That is the number of cores that will be dedicated to this pod:

    # cat cpumanager-pod.yaml

    Example output

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: cpumanager-
    spec:
      containers:
      - name: cpumanager
        image: gcr.io/google_containers/pause-amd64:3.0
        resources:
          requests:
            cpu: 1
            memory: "1G"
          limits:
            cpu: 1
            memory: "1G"
      nodeSelector:
        cpumanager: "true"

  9. Create the pod:

    # oc create -f cpumanager-pod.yaml
  10. Verify that the pod is scheduled to the node that you labeled:

    # oc describe pod cpumanager

    Example output

    Name:               cpumanager-6cqz7
    Namespace:          default
    Priority:           0
    PriorityClassName:  <none>
    Node:  perf-node.example.com/xxx.xx.xx.xxx
    ...
     Limits:
          cpu:     1
          memory:  1G
        Requests:
          cpu:        1
          memory:     1G
    ...
    QoS Class:       Guaranteed
    Node-Selectors:  cpumanager=true

  11. Verify that the cgroups are set up correctly. Get the process ID (PID) of the pause process:

    # ├─init.scope
    │ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 17
    └─kubepods.slice
      ├─kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice
      │ ├─crio-b5437308f1a574c542bdf08563b865c0345c8f8c0b0a655612c.scope
      │ └─32706 /pause

    Pods of quality of service (QoS) tier Guaranteed are placed within the kubepods.slice. Pods of other QoS tiers end up in child cgroups of kubepods:

    # cd /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice/crio-b5437308f1ad1a7db0574c542bdf08563b865c0345c86e9585f8c0b0a655612c.scope
    # for i in `ls cpuset.cpus tasks` ; do echo -n "$i "; cat $i ; done

    Example output

    cpuset.cpus 1
    tasks 32706

  12. Check the allowed CPU list for the task:

    # grep ^Cpus_allowed_list /proc/32706/status

    Example output

     Cpus_allowed_list:    1

  13. Verify that another pod (in this case, the pod in the burstable QoS tier) on the system cannot run on the core allocated for the Guaranteed pod:

    # cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podc494a073_6b77_11e9_98c0_06bba5c387ea.slice/crio-c56982f57b75a2420947f0afc6cafe7534c5734efc34157525fa9abbf99e3849.scope/cpuset.cpus
    0
    # oc describe node perf-node.example.com

    Example output

    ...
    Capacity:
     attachable-volumes-aws-ebs:  39
     cpu:                         2
     ephemeral-storage:           124768236Ki
     hugepages-1Gi:               0
     hugepages-2Mi:               0
     memory:                      8162900Ki
     pods:                        250
    Allocatable:
     attachable-volumes-aws-ebs:  39
     cpu:                         1500m
     ephemeral-storage:           124768236Ki
     hugepages-1Gi:               0
     hugepages-2Mi:               0
     memory:                      7548500Ki
     pods:                        250
    -------                               ----                           ------------  ----------  ---------------  -------------  ---
      default                                 cpumanager-6cqz7               1 (66%)       1 (66%)     1G (12%)         1G (12%)       29m
    
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      Resource                    Requests          Limits
      --------                    --------          ------
      cpu                         1440m (96%)       1 (66%)

    This VM has two CPU cores. You set kube-reserved to 500 millicores, meaning half of one core is subtracted from the total capacity of the node to arrive at the Node Allocatable amount. You can see that Allocatable CPU is 1500 millicores. This means you can run one of the CPU Manager pods since each will take one whole core. A whole core is equivalent to 1000 millicores. If you try to schedule a second pod, the system will accept the pod, but it will never be scheduled:

    NAME                    READY   STATUS    RESTARTS   AGE
    cpumanager-6cqz7        1/1     Running   0          33m
    cpumanager-7qc2t        0/1     Pending   0          11s

Chapter 7. Using Topology Manager

Topology Manager collects hints from the CPU Manager, Device Manager, and other Hint Providers to align pod resources, such as CPU, SR-IOV VFs, and other device resources, for all Quality of Service (QoS) classes on the same non-uniform memory access (NUMA) node.

Topology Manager uses topology information from collected hints to decide if a pod can be accepted or rejected on a node, based on the configured Topology Manager policy and Pod resources requested.

Topology Manager is useful for workloads that use hardware accelerators to support latency-critical execution and high throughput parallel computation.

Note

To use Topology Manager you must use the CPU Manager with the static policy. For more information on CPU Manager, see Using CPU Manager.

7.1. Topology Manager policies

Topology Manager aligns Pod resources of all Quality of Service (QoS) classes by collecting topology hints from Hint Providers, such as CPU Manager and Device Manager, and using the collected hints to align the Pod resources.

Note

To align CPU resources with other requested resources in a Pod spec, the CPU Manager must be enabled with the static CPU Manager policy.

Topology Manager supports four allocation policies, which you assign in the cpumanager-enabled custom resource (CR):

none policy
This is the default policy and does not perform any topology alignment.
best-effort policy
For each container in a pod with the best-effort topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager stores this and admits the pod to the node.
restricted policy
For each container in a pod with the restricted topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager rejects this pod from the node, resulting in a pod in a Terminated state with a pod admission failure.
single-numa-node policy
For each container in a pod with the single-numa-node topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager determines if a single NUMA Node affinity is possible. If it is, the pod is admitted to the node. If a single NUMA Node affinity is not possible, the Topology Manager rejects the pod from the node. This results in a pod in a Terminated state with a pod admission failure.

7.2. Setting up Topology Manager

To use Topology Manager, you must enable the LatencySensitive Feature Gate and configure the Topology Manager policy in the cpumanager-enabled custom resource (CR). This file might exist if you have set up CPU Manager. If the file does not exist, you can create the file.

Prequisites

  • Configure the CPU Manager policy to be static. Refer to Using CPU Manager in the Scalability and Performance section.

Procedure

To activate Topololgy Manager:

  1. Edit the FeatureGate object to add the LatencySensitive feature set:

    $ oc edit featuregate/cluster
    apiVersion: config.openshift.io/v1
    kind: FeatureGate
    metadata:
      annotations:
        release.openshift.io/create-only: "true"
      creationTimestamp: 2020-06-05T14:41:09Z
      generation: 2
      managedFields:
      - apiVersion: config.openshift.io/v1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:annotations:
              .: {}
              f:release.openshift.io/create-only: {}
          f:spec: {}
        manager: cluster-version-operator
        operation: Update
        time: 2020-06-05T14:41:09Z
      - apiVersion: config.openshift.io/v1
        fieldsType: FieldsV1
        fieldsV1:
          f:spec:
            f:featureSet: {}
        manager: oc
        operation: Update
        time: 2020-06-05T15:21:44Z
      name: cluster
      resourceVersion: "28457"
      selfLink: /apis/config.openshift.io/v1/featuregates/cluster
      uid: e802e840-89ee-4137-a7e5-ca15fd2806f8
    spec:
      featureSet: LatencySensitive 1
    ...
    1
    Add the LatencySensitive feature set in a comma-separated list.
  2. Configure the Topology Manager policy in the cpumanager-enabled custom resource (CR).

    $ oc edit KubeletConfig cpumanager-enabled
    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: cpumanager-enabled
    spec:
      machineConfigPoolSelector:
        matchLabels:
          custom-kubelet: cpumanager-enabled
      kubeletConfig:
         cpuManagerPolicy: static 1
         cpuManagerReconcilePeriod: 5s
         topologyManagerPolicy: single-numa-node 2
    1
    This parameter must be static.
    2
    Specify your selected Topology Manager policy. Here, the policy is single-numa-node. Acceptable values are: default, best-effort, restricted, single-numa-node.

Additional resources

For more information on CPU Manager, see Using CPU Manager.

7.3. Pod interactions with Topology Manager policies

The example Pod specs below help illustrate pod interactions with Topology Manager.

The following pod runs in the BestEffort QoS class because no resource requests or limits are specified.

spec:
  containers:
  - name: nginx
    image: nginx

The next pod runs in the Burstable QoS class because requests are less than limits.

spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"

If the selected policy is anything other than none, Topology Manager would not consider either of these Pod specifications.

The last example pod below runs in the Guaranteed QoS class because requests are equal to limits.

spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
      requests:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"

Topology Manager would consider this pod. The Topology Manager consults the CPU Manager static policy, which returns the topology of available CPUs. Topology Manager also consults Device Manager to discover the topology of available devices for example.com/device.

Topology Manager will use this information to store the best Topology for this container. In the case of this pod, CPU Manager and Device Manager will use this stored information at the resource allocation stage.

Chapter 8. Scaling the Cluster Monitoring Operator

OpenShift Container Platform exposes metrics that the Cluster Monitoring Operator collects and stores in the Prometheus-based monitoring stack. As an administrator, you can view system resources, containers and components metrics in one dashboard interface, Grafana.

8.1. Prometheus database storage requirements

Red Hat performed various tests for different scale sizes.

Table 8.1. Prometheus Database storage requirements based on number of nodes/pods in the cluster

Number of NodesNumber of podsPrometheus storage growth per dayPrometheus storage growth per 15 daysRAM Space (per scale size)Network (per tsdb chunk)

50

1800

6.3 GB

94 GB

6 GB

16 MB

100

3600

13 GB

195 GB

10 GB

26 MB

150

5400

19 GB

283 GB

12 GB

36 MB

200

7200

25 GB

375 GB

14 GB

46 MB

Approximately 20 percent of the expected size was added as overhead to ensure that the storage requirements do not exceed the calculated value.

The above calculation is for the default OpenShift Container Platform Cluster Monitoring Operator.

Note

CPU utilization has minor impact. The ratio is approximately 1 core out of 40 per 50 nodes and 1800 pods.

Lab environment

In a previous release, all experiments were performed in an OpenShift Container Platform on RHOSP environment:

  • Infra nodes (VMs) - 40 cores, 157 GB RAM.
  • CNS nodes (VMs) - 16 cores, 62 GB RAM, NVMe drives.

Recommendations for OpenShift Container Platform

  • Use at least three infrastructure (infra) nodes.
  • Use at least three openshift-container-storage nodes with non-volatile memory express (NVMe) drives.

8.2. Configuring cluster monitoring

Procedure

To increase the storage capacity for Prometheus:

  1. Create a YAML configuration file, cluster-monitoring-config.yml. For example:

    apiVersion: v1
    kind: ConfigMap
    data:
      config.yaml: |
        prometheusOperator:
          baseImage: quay.io/coreos/prometheus-operator
          prometheusConfigReloaderBaseImage: quay.io/coreos/prometheus-config-reloader
          configReloaderBaseImage: quay.io/coreos/configmap-reload
          nodeSelector:
            node-role.kubernetes.io/infra: ""
        prometheusK8s:
          retention: {{PROMETHEUS_RETENTION_PERIOD}} 1
          baseImage: openshift/prometheus
          nodeSelector:
            node-role.kubernetes.io/infra: ""
          volumeClaimTemplate:
            spec:
              storageClassName: gp2
              resources:
                requests:
                  storage: {{PROMETHEUS_STORAGE_SIZE}} 2
        alertmanagerMain:
          baseImage: openshift/prometheus-alertmanager
          nodeSelector:
            node-role.kubernetes.io/infra: ""
          volumeClaimTemplate:
            spec:
              storageClassName: gp2
              resources:
                requests:
                  storage: {{ALERTMANAGER_STORAGE_SIZE}} 3
        nodeExporter:
          baseImage: openshift/prometheus-node-exporter
        kubeRbacProxy:
          baseImage: quay.io/coreos/kube-rbac-proxy
        kubeStateMetrics:
          baseImage: quay.io/coreos/kube-state-metrics
          nodeSelector:
            node-role.kubernetes.io/infra: ""
        grafana:
          baseImage: grafana/grafana
          nodeSelector:
            node-role.kubernetes.io/infra: ""
        auth:
          baseImage: openshift/oauth-proxy
        k8sPrometheusAdapter:
          nodeSelector:
            node-role.kubernetes.io/infra: ""
    metadata:
      name: cluster-monitoring-config
    namespace: openshift-monitoring
    1
    A typical value is PROMETHEUS_RETENTION_PERIOD=15d. Units are measured in time using one of these suffixes: s, m, h, d.
    2
    A typical value is PROMETHEUS_STORAGE_SIZE=2000Gi. Storage values can be a plain integer or as a fixed-point integer using one of these suffixes: E, P, T, G, M, K. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki.
    3
    A typical value is ALERTMANAGER_STORAGE_SIZE=20Gi. Storage values can be a plain integer or as a fixed-point integer using one of these suffixes: E, P, T, G, M, K. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki.
  2. Set the values like the retention period and storage sizes.
  3. Apply the changes by running:

    $ oc create -f cluster-monitoring-config.yml

Chapter 9. Planning your environment according to object maximums

Consider the following tested object maximums when you plan your OpenShift Container Platform cluster.

These guidelines are based on the largest possible cluster. For smaller clusters, the maximums are lower. There are many factors that influence the stated thresholds, including the etcd version or storage data format.

In most cases, exceeding these numbers results in lower overall performance. It does not necessarily mean that the cluster will fail.

9.1. OpenShift Container Platform tested cluster maximums for major releases

Tested Cloud Platforms for OpenShift Container Platform 3.x: Red Hat OpenStack Platform (RHOSP), Amazon Web Services and Microsoft Azure. Tested Cloud Platforms for OpenShift Container Platform 4.x: Amazon Web Services, Microsoft Azure and Google Cloud Platform.

Maximum type3.x tested maximum4.x tested maximum

Number of nodes

2,000

2,000

Number of pods [1]

150,000

150,000

Number of pods per node

250

500 [2]

Number of pods per core

There is no default value.

There is no default value.

Number of namespaces [3]

10,000

10,000

Number of builds

10,000 (Default pod RAM 512 Mi) - Pipeline Strategy

10,000 (Default pod RAM 512 Mi) - Source-to-Image (S2I) build strategy

Number of pods per namespace [4]

25,000

25,000

Number of services [5]

10,000

10,000

Number of services per namespace

5,000

5,000

Number of back-ends per service

5,000

5,000

Number of deployments per namespace [4]

2,000

2,000

  1. The pod count displayed here is the number of test pods. The actual number of pods depends on the application’s memory, CPU, and storage requirements.
  2. This was tested on a cluster with 100 worker nodes with 500 pods per worker node. The default maxPods is still 250. To get to 500 maxPods, the cluster must be created with a maxPods set to 500 using a custom kubelet config. If you need 500 user pods, you need a hostPrefix of 22 because there are 10-15 system pods already running on the node. The maximum number of pods with attached persistent volume claims (PVC) depends on storage backend from where PVC are allocated. In our tests, only OpenShift Container Storage v4 (OCS v4) was able to satisfy the number of pods per node discussed in this document.
  3. When there are a large number of active projects, etcd might suffer from poor performance if the keyspace grows excessively large and exceeds the space quota. Periodic maintenance of etcd, including defragmentaion, is highly recommended to free etcd storage.
  4. There are a number of control loops in the system that must iterate over all objects in a given namespace as a reaction to some changes in state. Having a large number of objects of a given type in a single namespace can make those loops expensive and slow down processing given state changes. The limit assumes that the system has enough CPU, memory, and disk to satisfy the application requirements.
  5. Each service port and each service back-end has a corresponding entry in iptables. The number of back-ends of a given service impact the size of the endpoints objects, which impacts the size of data that is being sent all over the system.

9.2. OpenShift Container Platform tested cluster maximums

Maximum type4.1 and 4.2 tested maximum4.3 tested maximum4.4 tested maximum4.5 tested maximum4.6 tested maximum

Number of nodes

2,000

2,000

250

500

2,000

Number of pods [1]

150,000

150,000

62,500

62,500

150,000

Number of pods per node

250

500

500

500

500

Number of pods per core

There is no default value.

There is no default value.

There is no default value.

There is no default value.

There is no default value.

Number of namespaces [2]

10,000

10,000

10,000

10,000

10,000

Number of builds

10,000 (Default pod RAM 512 Mi) - Pipeline Strategy

10,000 (Default pod RAM 512 Mi) - Source-to-Image (S2I) build strategy

10,000 (Default pod RAM 512 Mi) - Source-to-Image (S2I) build strategy

10,000 (Default pod RAM 512 Mi) - Source-to-Image (S2I) build strategy

10,000 (Default pod RAM 512 Mi) - Source-to-Image (S2I) build strategy

Number of pods per namespace [3]

25,000

25,000

25,000

25,000

25,000

Number of services [4]

10,000

10,000

10,000

10,000

10,000

Number of services per namespace

5,000

5,000

5,000

5,000

5,000

Number of back ends per service

5,000

5,000

5,000

5,000

5,000

Number of deployments per namespace [3]

2,000

2,000

2,000

2,000

2,000

  1. The pod count displayed here is the number of test pods. The actual number of pods depends on the application’s memory, CPU, and storage requirements.
  2. When there are a large number of active projects, etcd might suffer from poor performance if the keyspace grows excessively large and exceeds the space quota. Periodic maintenance of etcd, including defragmentaion, is highly recommended to free etcd storage.
  3. There are a number of control loops in the system that must iterate over all objects in a given namespace as a reaction to some changes in state. Having a large number of objects of a given type in a single namespace can make those loops expensive and slow down processing given state changes. The limit assumes that the system has enough CPU, memory, and disk to satisfy the application requirements.
  4. Each service port and each service back end has a corresponding entry in iptables. The number of back ends of a given service impact the size of the endpoints objects, which impacts the size of data that is being sent all over the system.

In OpenShift Container Platform 4.6, half of a CPU core (500 millicore) is reserved by the system compared to OpenShift Container Platform 3.11 and previous versions.

9.3. How to plan your environment according to tested cluster maximums

Important

Oversubscribing the physical resources on a node affects resource guarantees the Kubernetes scheduler makes during pod placement. Learn what measures you can take to avoid memory swapping.

Some of the tested maximums are stretched only in a single dimension. They will vary when many objects are running on the cluster.

The numbers noted in this documentation are based on Red Hat’s test methodology, setup, configuration, and tunings. These numbers can vary based on your own individual setup and environments.

While planning your environment, determine how many pods are expected to fit per node:

required pods per cluster / pods per node = total number of nodes needed

The current maximum number of pods per node is 250. However, the number of pods that fit on a node is dependent on the application itself. Consider the application’s memory, CPU, and storage requirements, as described in How to plan your environment according to application requirements.

Example scenario

If you want to scope your cluster for 2200 pods per cluster, you would need at least five nodes, assuming that there are 500 maximum pods per node:

2200 / 500 = 4.4

If you increase the number of nodes to 20, then the pod distribution changes to 110 pods per node:

2200 / 20 = 110

Where:

required pods per cluster / total number of nodes = expected pods per node

9.4. How to plan your environment according to application requirements

Consider an example application environment:

Pod typePod quantityMax memoryCPU coresPersistent storage

apache

100

500 MB

0.5

1 GB

node.js

200

1 GB

1

1 GB

postgresql

100

1 GB

2

10 GB

JBoss EAP

100

1 GB

1

1 GB

Extrapolated requirements: 550 CPU cores, 450GB RAM, and 1.4TB storage.

Instance size for nodes can be modulated up or down, depending on your preference. Nodes are often resource overcommitted. In this deployment scenario, you can choose to run additional smaller nodes or fewer larger nodes to provide the same amount of resources. Factors such as operational agility and cost-per-instance should be considered.

Node typeQuantityCPUsRAM (GB)

Nodes (option 1)

100

4

16

Nodes (option 2)

50

8

32

Nodes (option 3)

25

16

64

Some applications lend themselves well to overcommitted environments, and some do not. Most Java applications and applications that use huge pages are examples of applications that would not allow for overcommitment. That memory can not be used for other applications. In the example above, the environment would be roughly 30 percent overcommitted, a common ratio.

The application pods can access a service either by using environment variables or DNS. If using environment variables, for each active service the variables are injected by the kubelet when a pod is run on a node. A cluster-aware DNS server watches the Kubernetes API for new services and creates a set of DNS records for each one. If DNS is enabled throughout your cluster, then all pods should automatically be able to resolve services by their DNS name. Service discovery using DNS can be used in case you must go beyond 5000 services. When using environment variables for service discovery, the argument list exceeds the allowed length after 5000 services in a namespace, then the pods and deployments will start failing. Disable the service links in the deployment’s service specification file to overcome this:

 ---
    Kind: Template
    apiVersion: v1
    metadata:
      name: deploymentConfigTemplate
      creationTimestamp:
      annotations:
        description: This template will create a deploymentConfig with 1 replica, 4 env vars and a service.
        tags: ''
    objects:
    - kind: DeploymentConfig
      apiVersion: v1
      metadata:
        name: deploymentconfig${IDENTIFIER}
      spec:
        template:
          metadata:
            labels:
              name: replicationcontroller${IDENTIFIER}
          spec:
            enableServiceLinks: false
            containers:
            - name: pause${IDENTIFIER}
              image: "${IMAGE}"
              ports:
              - containerPort: 8080
                protocol: TCP
              env:
              - name: ENVVAR1_${IDENTIFIER}
                value: "${ENV_VALUE}"
              - name: ENVVAR2_${IDENTIFIER}
                value: "${ENV_VALUE}"
              - name: ENVVAR3_${IDENTIFIER}
                value: "${ENV_VALUE}"
              - name: ENVVAR4_${IDENTIFIER}
                value: "${ENV_VALUE}"
              resources: {}
              imagePullPolicy: IfNotPresent
              capabilities: {}
              securityContext:
                capabilities: {}
                privileged: false
            restartPolicy: Always
            serviceAccount: ''
        replicas: 1
        selector:
          name: replicationcontroller${IDENTIFIER}
        triggers:
        - type: ConfigChange
        strategy:
          type: Rolling
    - kind: Service
      apiVersion: v1
      metadata:
        name: service${IDENTIFIER}
      spec:
        selector:
          name: replicationcontroller${IDENTIFIER}
        ports:
        - name: serviceport${IDENTIFIER}
          protocol: TCP
          port: 80
          targetPort: 8080
        portalIP: ''
        type: ClusterIP
        sessionAffinity: None
      status:
        loadBalancer: {}
    parameters:
    - name: IDENTIFIER
      description: Number to append to the name of resources
      value: '1'
      required: true
    - name: IMAGE
      description: Image to use for deploymentConfig
      value: gcr.io/google-containers/pause-amd64:3.0
      required: false
    - name: ENV_VALUE
      description: Value to use for environment variables
      generate: expression
      from: "[A-Za-z0-9]{255}"
      required: false
    labels:
      template: deploymentConfigTemplate

Chapter 10. Optimizing storage

Optimizing storage helps to minimize storage use across all resources. By optimizing storage, administrators help ensure that existing storage resources are working in an efficient manner.

10.1. Available persistent storage options

Understand your persistent storage options so that you can optimize your OpenShift Container Platform environment.

Table 10.1. Available storage options

Storage typeDescriptionExamples

Block

  • Presented to the operating system (OS) as a block device
  • Suitable for applications that need full control of storage and operate at a low level on files bypassing the file system
  • Also referred to as a Storage Area Network (SAN)
  • Non-shareable, which means that only one client at a time can mount an endpoint of this type

AWS EBS and VMware vSphere support dynamic persistent volume (PV) provisioning natively in OpenShift Container Platform.

File

  • Presented to the OS as a file system export to be mounted
  • Also referred to as Network Attached Storage (NAS)
  • Concurrency, latency, file locking mechanisms, and other capabilities vary widely between protocols, implementations, vendors, and scales.

RHEL NFS, NetApp NFS [1], and Vendor NFS

Object

  • Accessible through a REST API endpoint
  • Configurable for use in the OpenShift Container Platform Registry
  • Applications must build their drivers into the application and/or container.

AWS S3

  1. NetApp NFS supports dynamic PV provisioning when using the Trident plug-in.
Important

Currently, CNS is not supported in OpenShift Container Platform 4.6.

10.3. Data storage management

The following table summarizes the main directories that OpenShift Container Platform components write data to.

Table 10.3. Main directories for storing OpenShift Container Platform data

DirectoryNotesSizingExpected growth

/var/log

Log files for all components.

10 to 30 GB.

Log files can grow quickly; size can be managed by growing disks or by using log rotate.

/var/lib/etcd

Used for etcd storage when storing the database.

Less than 20 GB.

Database can grow up to 8 GB.

Will grow slowly with the environment. Only storing metadata.

Additional 20-25 GB for every additional 8 GB of memory.

/var/lib/containers

This is the mount point for the CRI-O runtime. Storage used for active container runtimes, including pods, and storage of local images. Not used for registry storage.

50 GB for a node with 16 GB memory. Note that this sizing should not be used to determine minimum cluster requirements.

Additional 20-25 GB for every additional 8 GB of memory.

Growth is limited by capacity for running containers.

/var/lib/kubelet

Ephemeral volume storage for pods. This includes anything external that is mounted into a container at runtime. Includes environment variables, kube secrets, and data volumes not backed by persistent volumes.

Varies

Minimal if pods requiring storage are using persistent volumes. If using ephemeral storage, this can grow quickly.

Chapter 11. Optimizing routing

The OpenShift Container Platform HAProxy router scales to optimize performance.

11.1. Baseline Ingress Controller (router) performance

The OpenShift Container Platform Ingress Controller, or router, is the Ingress point for all external traffic destined for OpenShift Container Platform services.

When evaluating a single HAProxy router performance in terms of HTTP requests handled per second, the performance varies depending on many factors. In particular:

  • HTTP keep-alive/close mode
  • Route type
  • TLS session resumption client support
  • Number of concurrent connections per target route
  • Number of target routes
  • Back end server page size
  • Underlying infrastructure (network/SDN solution, CPU, and so on)

While performance in your specific environment will vary, Red Hat lab tests on a public cloud instance of size 4 vCPU/16GB RAM. A single HAProxy router handling 100 routes terminated by backends serving 1kB static pages is able to handle the following number of transactions per second.

In HTTP keep-alive mode scenarios:

EncryptionLoadBalancerServiceHostNetwork

none

21515

29622

edge

16743

22913

passthrough

36786

53295

re-encrypt

21583

25198

In HTTP close (no keep-alive) scenarios:

EncryptionLoadBalancerServiceHostNetwork

none

5719

8273

edge

2729

4069

passthrough

4121

5344

re-encrypt

2320

2941

Default Ingress Controller configuration with ROUTER_THREADS=4 was used and two different endpoint publishing strategies (LoadBalancerService/HostNetwork) were tested. TLS session resumption was used for encrypted routes. With HTTP keep-alive, a single HAProxy router is capable of saturating 1 Gbit NIC at page sizes as small as 8 kB.

When running on bare metal with modern processors, you can expect roughly twice the performance of the public cloud instance above. This overhead is introduced by the virtualization layer in place on public clouds and holds mostly true for private cloud-based virtualization as well. The following table is a guide to how many applications to use behind the router:

Number of applicationsApplication type

5-10

static file/web server or caching proxy

100-1000

applications generating dynamic content

In general, HAProxy can support routes for 5 to 1000 applications, depending on the technology in use. Ingress Controller performance might be limited by the capabilities and performance of the applications behind it, such as language or static versus dynamic content.

Ingress, or router, sharding should be used to serve more routes towards applications and help horizontally scale the routing tier.

For more information on Ingress sharding, see Configuring Ingress Controller sharding by using route labels and Configuring Ingress Controller sharding by using namespace labels.

11.2. Ingress Controller (router) performance optimizations

OpenShift Container Platform no longer supports modifying Ingress Controller deployments by setting environment variables such as ROUTER_THREADS, ROUTER_DEFAULT_TUNNEL_TIMEOUT, ROUTER_DEFAULT_CLIENT_TIMEOUT, ROUTER_DEFAULT_SERVER_TIMEOUT, and RELOAD_INTERVAL.

You can modify the Ingress Controller deployment, but if the Ingress Operator is enabled, the configuration is overwritten.

Chapter 12. What huge pages do and how they are consumed by applications

12.1. What huge pages do

Memory is managed in blocks known as pages. On most systems, a page is 4Ki. 1Mi of memory is equal to 256 pages; 1Gi of memory is 256,000 pages, and so on. CPUs have a built-in memory management unit that manages a list of these pages in hardware. The Translation Lookaside Buffer (TLB) is a small hardware cache of virtual-to-physical page mappings. If the virtual address passed in a hardware instruction can be found in the TLB, the mapping can be determined quickly. If not, a TLB miss occurs, and the system falls back to slower, software-based address translation, resulting in performance issues. Since the size of the TLB is fixed, the only way to reduce the chance of a TLB miss is to increase the page size.

A huge page is a memory page that is larger than 4Ki. On x86_64 architectures, there are two common huge page sizes: 2Mi and 1Gi. Sizes vary on other architectures. In order to use huge pages, code must be written so that applications are aware of them. Transparent Huge Pages (THP) attempt to automate the management of huge pages without application knowledge, but they have limitations. In particular, they are limited to 2Mi page sizes. THP can lead to performance degradation on nodes with high memory utilization or fragmentation due to defragmenting efforts of THP, which can lock memory pages. For this reason, some applications may be designed to (or recommend) usage of pre-allocated huge pages instead of THP.

In OpenShift Container Platform, applications in a pod can allocate and consume pre-allocated huge pages.

12.2. How huge pages are consumed by apps

Nodes must pre-allocate huge pages in order for the node to report its huge page capacity. A node can only pre-allocate huge pages for a single size.

Huge pages can be consumed through container-level resource requirements using the resource name hugepages-<size>, where size is the most compact binary notation using integer values supported on a particular node. For example, if a node supports 2048KiB page sizes, it exposes a schedulable resource hugepages-2Mi. Unlike CPU or memory, huge pages do not support over-commitment.

apiVersion: v1
kind: Pod
metadata:
  generateName: hugepages-volume-
spec:
  containers:
  - securityContext:
      privileged: true
    image: rhel7:latest
    command:
    - sleep
    - inf
    name: example
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
    resources:
      limits:
        hugepages-2Mi: 100Mi 1
        memory: "1Gi"
        cpu: "1"
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
1
Specify the amount of memory for hugepages as the exact amount to be allocated. Do not specify this value as the amount of memory for hugepages multiplied by the size of the page. For example, given a huge page size of 2MB, if you want to use 100MB of huge-page-backed RAM for your application, then you would allocate 50 huge pages. OpenShift Container Platform handles the math for you. As in the above example, you can specify 100MB directly.

Allocating huge pages of a specific size

Some platforms support multiple huge page sizes. To allocate huge pages of a specific size, precede the huge pages boot command parameters with a huge page size selection parameter hugepagesz=<size>. The <size> value must be specified in bytes with an optional scale suffix [kKmMgG]. The default huge page size can be defined with the default_hugepagesz=<size> boot parameter.

Huge page requirements

  • Huge page requests must equal the limits. This is the default if limits are specified, but requests are not.
  • Huge pages are isolated at a pod scope. Container isolation is planned in a future iteration.
  • EmptyDir volumes backed by huge pages must not consume more huge page memory than the pod request.
  • Applications that consume huge pages via shmget() with SHM_HUGETLB must run with a supplemental group that matches proc/sys/vm/hugetlb_shm_group.

12.3. Configuring huge pages

Nodes must pre-allocate huge pages used in an OpenShift Container Platform cluster. There are two ways of reserving huge pages: at boot time and at run time. Reserving at boot time increases the possibility of success because the memory has not yet been significantly fragmented. The Node Tuning Operator currently supports boot time allocation of huge pages on specific nodes.

12.3.1. At boot time

Procedure

To minimize node reboots, the order of the steps below needs to be followed:

  1. Label all nodes that need the same huge pages setting by a label.

    $ oc label node <node_using_hugepages> node-role.kubernetes.io/worker-hp=
  2. Create a file with the following content and name it hugepages-tuned-boottime.yaml:

    apiVersion: tuned.openshift.io/v1
    kind: Tuned
    metadata:
      name: hugepages 1
      namespace: openshift-cluster-node-tuning-operator
    spec:
      profile: 2
      - data: |
          [main]
          summary=Boot time configuration for hugepages
          include=openshift-node
          [bootloader]
          cmdline_openshift_node_hugepages=hugepagesz=2M hugepages=50 3
        name: openshift-node-hugepages
    
      recommend:
      - machineConfigLabels: 4
          machineconfiguration.openshift.io/role: "worker-hp"
        priority: 30
        profile: openshift-node-hugepages
    1
    Set the name of the Tuned resource to hugepages.
    2
    Set the profile section to allocate huge pages.
    3
    Note the order of parameters is important as some platforms support huge pages of various sizes.
    4
    Enable machine config pool based matching.
  3. Create the Tuned hugepages profile

    $ oc create -f hugepages-tuned-boottime.yaml
  4. Create a file with the following content and name it hugepages-mcp.yaml:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
      name: worker-hp
      labels:
        worker-hp: ""
    spec:
      machineConfigSelector:
        matchExpressions:
          - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-hp]}
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/worker-hp: ""
  5. Create the machine config pool:

    $ oc create -f hugepages-mcp.yaml

Given enough non-fragmented memory, all the nodes in the worker-hp machine config pool should now have 50 2Mi huge pages allocated.

$ oc get node <node_using_hugepages> -o jsonpath="{.status.allocatable.hugepages-2Mi}"
100Mi
Warning

This functionality is currently only supported on Red Hat Enterprise Linux CoreOS (RHCOS) 8.x worker nodes. On Red Hat Enterprise Linux (RHEL) 7.x worker nodes the Tuned [bootloader] plug-in is currently not supported.

Chapter 13. Performance Addon Operator for low latency nodes

13.1. Understanding low latency

The emergence of Edge computing in the area of Telco / 5G plays a key role in reducing latency and congestion problems and improving application performance.

Simply put, latency determines how fast data (packets) moves from the sender to receiver and returns to the sender after processing by the receiver. Obviously, maintaining a network architecture with the lowest possible delay of latency speeds is key for meeting the network performance requirements of 5G. Compared to 4G technology, with an average latency of 50ms, 5G is targeted to reach latency numbers of 1ms or less. This reduction in latency boosts wireless throughput by a factor of 10.

Many of the deployed applications in the Telco space require low latency that can only tolerate zero packet loss. Tuning for zero packet loss helps mitigate the inherent issues that degrade network performance. For more information, see Tuning for Zero Packet Loss in Red Hat OpenStack Platform (RHOSP).

The Edge computing initiative also comes in to play for reducing latency rates. Think of it as literally being on the edge of the cloud and closer to the user. This greatly reduces the distance between the user and distant data centers, resulting in reduced application response times and performance latency.

Administrators must be able to manage their many Edge sites and local services in a centralized way so that all of the deployments can run at the lowest possible management cost. They also need an easy way to deploy and configure certain nodes of their cluster for real-time low latency and high-performance purposes. Low latency nodes are useful for applications such as Cloud-native Network Functions (CNF) and Data Plane Development Kit (DPDK).

OpenShift Container Platform currently provides mechanisms to tune software on an OpenShift Container Platform cluster for real-time running and low latency (around <20 microseconds reaction time). This includes tuning the kernel and OpenShift Container Platform set values, installing a kernel, and reconfiguring the machine. But this method requires setting up four different Operators and performing many configurations that, when done manually, is complex and could be prone to mistakes.

OpenShift Container Platform provides a Performance Addon Operator to implement automatic tuning in order to achieve low latency performance for OpenShift applications. The cluster administrator uses this performance profile configuration that makes it easier to make these changes in a more reliable way. The administrator can specify whether to update the kernel to kernel-rt, the CPUs that will be reserved for housekeeping, and the CPUs that will be used for running the workloads.

13.2. Installing the Performance Addon Operator

Performance Addon Operator provides the ability to enable advanced node performance tunings on a set of nodes. As a cluster administrator, you can install Performance Addon Operator using the OpenShift Container Platform CLI or the web console.

13.2.1. Installing the Operator using the CLI

As a cluster administrator, you can install the Operator using the CLI.

Prerequisites

  • A cluster installed on bare-metal hardware.
  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create a namespace for the Performance Addon Operator by completing the following actions:

    1. Create the following Namespace Custom Resource (CR) that defines the openshift-performance-addon-operator namespace, and then save the YAML in the pao-namespace.yaml file:

      apiVersion: v1
      kind: Namespace
      metadata:
        name: openshift-performance-addon-operator
        labels:
          openshift.io/run-level: "1"
    2. Create the namespace by running the following command:

      $ oc create -f pao-namespace.yaml
  2. Install the Performance Addon Operator in the namespace you created in the previous step by creating the following objects:

    1. Create the following OperatorGroup CR and save the YAML in the pao-operatorgroup.yaml file:

      apiVersion: operators.coreos.com/v1
      kind: OperatorGroup
      metadata:
        name: openshift-performance-addon-operator
        namespace: openshift-performance-addon-operator
      spec:
        targetNamespaces:
        - openshift-performance-addon-operator
    2. Create the OperatorGroup CR by running the following command:

      $ oc create -f pao-operatorgroup.yaml
    3. Run the following command to get the channel value required for the next step.

      $ oc get packagemanifest performance-addon-operator -n openshift-marketplace -o jsonpath='{.status.defaultChannel}'
      
      4.6
    4. Create the following Subscription CR and save the YAML in the pao-sub.yaml file:

      Example Subscription

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: openshift-performance-addon-operator-subscription
        namespace: openshift-performance-addon-operator
      spec:
        channel: <channel> 1
        name: performance-addon-operator
        source: redhat-operators 2
        sourceNamespace: openshift-marketplace

      1
      Specify the value from you obtained in the previous step for the .status.defaultChannel parameter.
      2
      You must specify the redhat-operators value.
    5. Create the Subscription object by running the following command:

      $ oc create -f pao-sub.yaml
    6. Change to the openshift-performance-addon-operator project:

      $ oc project openshift-performance-addon-operator

13.2.2. Installing the Performance Addon Operator using the web console

As a cluster administrator, you can install the Performance Addon Operator using the web console.

Note

You must create the Namespace CR and OperatorGroup CR as mentioned in the previous section.

Procedure

  1. Install the Performance Addon Operator using the OpenShift Container Platform web console:

    1. In the OpenShift Container Platform web console, click OperatorsOperatorHub.
    2. Choose Performance Addon Operator from the list of available Operators, and then click Install.
    3. On the Install Operator page, under A specific namespace on the cluster select openshift-performance-addon-operator. Then, click Install.
  2. Optional: Verify that the performance-addon-operator installed successfully:

    1. Switch to the OperatorsInstalled Operators page.
    2. Ensure that Performance Addon Operator is listed in the openshift-performance-addon-operator project with a Status of InstallSucceeded.

      Note

      During installation an Operator might display a Failed status. If the installation later succeeds with an InstallSucceeded message, you can ignore the Failed message.

      If the Operator does not appear as installed, to troubleshoot further:

      • Go to the OperatorsInstalled Operators page and inspect the Operator Subscriptions and Install Plans tabs for any failure or errors under Status.
      • Go to the WorkloadsPods page and check the logs for pods in the performance-addon-operator project.
Important

The feature described in this document is for Developer Preview purposes and is not supported by Red Hat at this time. This feature could cause nodes to reboot and not be available.

13.3. Upgrading Performance Addon Operator

You can manually upgrade to the next minor version of Performance Addon Operator and monitor the status of an update by using the web console.

13.3.1. About upgrading Performance Addon Operator

  • You can upgrade to the next minor version of Performance Addon Operator by using the OpenShift web console to change the channel of your Operator subscription.
  • You can enable automatic z-stream updates during Performance Addon Operator installation.
  • Updates are delivered via the Marketplace Operator, which is deployed during OpenShift Container Platform installation.The Marketplace Operator makes external Operators available to your cluster.
  • The amount of time an update takes to complete depends on your network connection. Most automatic updates complete within fifteen minutes.

13.3.1.1. How Performance Addon Operator upgrades affect your cluster

  • Neither the low latency tuning nor huge pages are affected.
  • Updating the Operator should not cause any unexpected reboots.

13.3.1.2. Upgrading Performance Addon Operator to the next minor version

You can manually upgrade Performance Addon Operator to the next minor version by using the OpenShift Container Platform web console to change the channel of your Operator subscription.

Prerequisites

  • Access to the cluster as a user with the cluster-admin role.

Procedure

  1. Access the OpenShift web console and navigate to Operators → Installed Operators.
  2. Click Performance Addon Operator to open the Operator Details page.
  3. Click the Subscription tab to open the Subscription Overview page.
  4. In the Channel pane, click the pencil icon on the right side of the version number to open the Change Subscription Update Channel window.
  5. Select the next minor version. For example, if you want to upgrade to Performance Addon Operator 4.6, select 4.6.
  6. Click Save.
  7. Check the status of the upgrade by navigating to Operators → Installed Operators. You can also check the status by running the following oc command:

    $ oc get csv -n openshift-performance-addon-operator

13.3.2. Monitoring upgrade status

The best way to monitor Performance Addon Operator upgrade status is to watch the ClusterServiceVersion (CSV) PHASE. You can also monitor the CSV conditions in the web console or by running the oc get csv command.

Note

The PHASE and conditions values are approximations that are based on available information.

Prerequisites

  • Access to the cluster as a user with the cluster-admin role.
  • Install the OpenShift CLI (oc).

Procedure

  1. Run the following command:

    $ oc get csv
  2. Review the output, checking the PHASE field. For example:

    VERSION    REPLACES                                         PHASE
    4.6.0      performance-addon-operator.v4.5.0                Installing
    4.5.0                                                       Replacing
  3. Run get csv again to verify the output:

    # oc get csv

    Example output

    NAME                                DISPLAY                      VERSION   REPLACES                            PHASE
    performance-addon-operator.v4.5.0   Performance Addon Operator   4.6.0     performance-addon-operator.v4.5.0   Succeeded

13.4. Provisioning real-time and low latency workloads

Many industries and organizations need extremely high performance computing and might require low and predictable latency, especially in the financial and telecommunications industries. For these industries, with their unique requirements, OpenShift Container Platform provides a Performance Addon Operator to implement automatic tuning to achieve low latency performance and consistent response time for OpenShift Container Platform applications.

The cluster administrator uses this performance profile configuration that makes it easier to make these changes in a more reliable way. The administrator can specify whether to update the kernel to kernel-rt (real-time), the CPUs that will be reserved for housekeeping, and the CPUs that are used for running the workloads.

13.4.1. Known limitations for real-time

Note

The RT kernel is only supported on worker nodes.

To fully utilize the real-time mode, the containers must run with elevated privileges. See Set capabilities for a Container for information on granting privileges.

OpenShift Container Platform restricts the allowed capabilities, so you might need to create a SecurityContext as well, as explained in Creating Security Context Constraints.

Note

This procedure is fully supported with bare metal installations using Red Hat Enterprise Linux CoreOS (RHCOS) systems.

Establishing the right performance expectations refers to the fact that the real-time kernel is not a panacea. Its objective is consistent, low-latency determinism offering predictable response times. There is some additional kernel overhead associated with the real-time kernel. This is due primarily to handling hardware interruptions in separately scheduled threads. The increased overhead in some workloads results in some degradation in overall throughput. The exact amount of degradation is very workload dependent, ranging from 0% to 30%. However, it is the cost of determinism.

13.4.2. Provisioning a worker with real-time capabilities

  1. Install Performance Addon Operator to the cluster.
  2. (Optional) Add a node to the OpenShift Container Platform cluster. See Setting BIOS parameters.
  3. (Optional) Create a new machine config pool for real-time nodes.
  4. Add the node to the proper machine config pool, using node role labels.

    You must decide which nodes will be configured with real-time workloads. It could be all of the nodes in the cluster or a subset of the nodes. The Performance Addon Operator expects all of the nodes are part of a dedicated machine config pool. If you use all of the nodes, you just point the Performance Addon Operator to the worker node role label. If you use a subset, you must group the nodes into a new machine config pool.

  5. Create the PerformanceProfile with the proper set of housekeeping cores and realTimeKernel: enabled: true.
  6. Specify a node selector in the PerformanceProfile, as shown here:

    apiVersion: performance.openshift.io/v1
    kind: PerformanceProfile
    metadata:
     name: example-performanceprofile
    spec:
     [...]
      realTimeKernel:
        enabled: true
      nodeSelector:
       node-role.kubernetes.io/worker-rt: ""
  7. Verify that a matching machine config pool exists with a label:

    machineconfiguration.openshift.io/role=worker-rt
  8. OpenShift Container Platform will start configuring the nodes, which might involve multiple reboots. Wait for the nodes to settle. This can take a long time depending on the specific hardware you use, but 20 minutes per node is expected.
  9. Verify everything is working as expected.

13.4.3. Verifying the real-time kernel installation

Use this command to verify that the real-time kernel is installed:

$ oc get node -o wide

Note the worker with the role worker-rt that contains the string 4.18.0-211.rt5.23.el8.x86_64:

NAME                               	STATUS   ROLES           	AGE 	VERSION                  	INTERNAL-IP
EXTERNAL-IP   OS-IMAGE                                       	KERNEL-VERSION
CONTAINER-RUNTIME
cnf-worker-0.example.com	          Ready	 worker,worker-rt   5d17h   v1.19.0-rc.2+aaf4ce1-dirty
128.66.135.107   <none>    	        Red Hat Enterprise Linux CoreOS 46.82.202008252340-0 (Ootpa)
4.18.0-211.rt5.23.el8.x86_64   cri-o://1.19.0-90.rhaos4.6.git4a0ac05.el8-rc.1
[...]

13.4.4. Creating a workload that works in real-time

Use the following procedures for preparing a workload that will use real-time capabilities.

Procedure

  1. Create a pod with a QoS class of Guaranteed.
  2. (Optional) Disable CPU load balancing for DPDK.
  3. Assign a proper node selector.

When writing your applications, follow the general recommendations described in Application tuning and deployment.

13.4.5. Creating a pod with a QoS class of Guaranteed

Keep the following in mind when you create a pod that is given a QoS class of Guaranteed:

  • Every container in the pod must have a memory limit and a memory request, and they must be the same.
  • Every container in the pod must have a CPU limit and a CPU request, and they must be the same.

The following example shows the configuration file for a pod that has one container. The container has a memory limit and a memory request, both equal to 200 MiB. The container has a CPU limit and a CPU request, both equal to 1 CPU.

apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
  namespace: qos-example
spec:
  containers:
  - name: qos-demo-ctr
    image: <image-pull-spec>
    resources:
      limits:
        memory: "200Mi"
        cpu: "1"
      requests:
        memory: "200Mi"
        cpu: "1"
  1. Create the pod:

    $ oc  apply -f qos-pod.yaml --namespace=qos-example
  2. View detailed information about the pod:

    $ oc get pod qos-demo --namespace=qos-example --output=yaml

    Example output

    spec:
      containers:
        ...
    status:
      qosClass: Guaranteed

    Note

    If a container specifies its own memory limit, but does not specify a memory request, OpenShift Container Platform automatically assigns a memory request that matches the limit. Similarly, if a container specifies its own CPU limit, but does not specify a CPU request, OpenShift Container Platform automatically assigns a CPU request that matches the limit.

13.4.6. (Optional) Disabling CPU load balancing for DPDK

Functionality to disable or enable CPU load balancing is implemented on the CRI-O level. The code under the CRI-O disables or enables CPU load balancing only when the following requirements are met.

  • The pod must use the performance-<profile-name> runtime class. You can get the proper name by looking at the status of the performance profile, as shown here:

    apiVersion: performance.openshift.io/v1
    kind: PerformanceProfile
    ...
    status:
      ...
      runtimeClass: performance-manual
  • The pod must have the cpu-load-balancing.crio.io: true annotation.

The Performance Addon Operator is responsible for the creation of the high-performance runtime handler config snippet under relevant nodes and for creation of the high-performance runtime class under the cluster. It will have the same content as default runtime handler except it enables the CPU load balancing configuration functionality.

To disable the CPU load balancing for the pod, the Pod specification must include the following fields:

apiVersion: v1
kind: Pod
metadata:
  ...
  annotations:
    ...
    cpu-load-balancing.crio.io: "true"
    ...
  ...
spec:
  ...
  runtimeClassName: performance-<profile_name>
  ...
Note

Only disable CPU load balancing when the CPU manager static policy is enabled and for pods with guaranteed QoS that use whole CPUs. Otherwise, disabling CPU load balancing can affect the performance of other containers in the cluster.

13.4.7. Assigning a proper node selector

The preferred way to assign a pod to nodes is to use the same node selector the performance profile used, as shown here:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  [...]
  nodeSelector:
     node-role.kubernetes.io/worker-rt: ""

For more information, see Placing pods on specific nodes using node selectors.

13.4.8. Scheduling a workload onto a worker with real-time capabilities

Use label selectors that match the nodes attached to the machine config pool that was configured for low latency by the Performance Addon Operator. For more information, see Assigning pods to nodes.

13.5. Configuring huge pages

Nodes must pre-allocate huge pages used in an OpenShift Container Platform cluster. Use the Performance Addon Operator to allocate huge pages on a specific node.

OpenShift Container Platform provides a method for creating and allocating huge pages. Performance Addon Operator provides an easier method for doing thisusing the performance profile.

For example, in the hugepages pages section of the performance profile, you can specify multiple blocks of size, count, and, optionally, node:

hugepages:
   defaultHugepagesSize: "1G"
   pages:
   - size:  "1G"
     count:  4
     node:  0 1
1
node is the NUMA node in which the huge pages are allocated. If you omit node, the pages are evenly spread across all NUMA nodes.
Note

Wait for the relevant machine config pool status that indicates the update is finished.

These are the only configuration steps you need to do to allocate huge pages.

Verification steps

  • To verify the configuration, see the /proc/meminfo file on the node:

    $ oc debug node/ip-10-0-141-105.ec2.internal
    # grep -i huge /proc/meminfo

    Example output

    AnonHugePages:    ###### ##
    ShmemHugePages:        0 kB
    HugePages_Total:       2
    HugePages_Free:        2
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:       #### ##
    Hugetlb:            #### ##

  • Use oc describe to report the new size:

    $ oc describe node worker-0.ocp4poc.example.com | grep -i huge

    Example output

                                       hugepages-1g=true
     hugepages-###:  ###
     hugepages-###:  ###

13.6. Allocating multiple huge page sizes

You can request huge pages with different sizes under the same container. This allows you to define more complicated pods consisting of containers with different huge page size needs.

For example, you can define sizes 1G and 2M and the Performance Addon Operator will configure both sizes on the node, as shown here:

spec:
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 1024
      node: 0
      size: 2M
    - count: 4
      node: 1
      size: 1G
Important

The feature described in this document is for Developer Preview purposes and is not supported by Red Hat at this time. This feature could cause nodes to reboot and not be available.

13.7. Tuning nodes for low latency with the performance profile

The performance profile lets you control latency tuning aspects of nodes that belong to a certain machine config pool. After you specify your settings, the PerformanceProfile object is compiled into multiple objects that perform the actual node level tuning:

  • A MachineConfig file that manipulates the nodes.
  • A KubeletConfig file that configures the Topology Manager, the CPU Manager, and the OpenShift Container Platform nodes.
  • The Tuned profile that configures the Node Tuning Operator.

Procedure

  1. Prepare a cluster.
  2. Create a machine config pool.
  3. Install the Performance Addon Operator.
  4. Create a performance profile that is appropriate for your hardware and topology. In the performance profile, you can specify whether to update the kernel to kernel-rt, allocation of huge pages, the CPUs that will be reserved for operating system housekeeping processes and CPUs that will be used for running the workloads.

    This is a typical performance profile:

    apiVersion: performance.openshift.io/v1
    kind: PerformanceProfile
    metadata:
     name: performance
    spec:
     cpu:
      isolated: "5-15"
      reserved: "0-4"
     hugepages:
      defaultHugepagesSize: "1G"
      pages:
      -size: "1G"
       count: 16
       node: 0
     realTimeKernel:
      enabled: true  1
     numa:  2
      topologyPolicy: "best-effort"
     nodeSelector:
      node-role.kubernetes.io/worker-cnf: ""
1
Valid values are true or false. Setting the true value installs the real-time kernel on the node.
2
Use this field to configure the topology manager policy. Valid values are none (default), best-effort, restricted, and single-numa-node. For more information, see Topology Manager Policies.

13.7.1. Partitioning the CPUs

You can reserve cores, or threads, for operating system housekeeping tasks from a single NUMA node and put your workloads on another NUMA node. The reason for this is that the housekeeping processes might be using the CPUs in a way that would impact latency sensitive processes running on those same CPUs. Keeping your workloads on a separate NUMA node prevents the processes from interfering with each other. Additionally, each NUMA node has its own memory bus that is not shared.

Specify two groups of CPUs in the spec section:

  • isolated - Has the lowest latency. Processes in this group have no interruptions and so can, for example, reach much higher DPDK zero packet loss bandwidth.
  • reserved - The housekeeping CPUs. Threads in the reserved group tend to be very busy, so latency-sensitive applications should be run in the isolated group. See Create a pod that gets assigned a QoS class of Guaranteed.

13.8. Performing end-to-end tests for platform verification

The Cloud-native Network Functions (CNF) tests image is a containerized test suite that validates features required to run CNF payloads. You can use this image to validate a CNF-enabled OpenShift cluster where all the components required for running CNF workloads are installed.

The tests run by the image are split into three different phases:

  • Simple cluster validation
  • Setup
  • End to end tests

The validation phase checks that all the features required to be tested are deployed correctly on the cluster.

Validations include:

  • Targeting a machine config pool that belong to the machines to be tested
  • Enabling SCTP on the nodes
  • Having the Performance Addon Operator installed
  • Having the SR-IOV Operator installed
  • Having the PTP Operator installed
  • Using OVN kubernetes as the SDN

The tests need to perform an environment configuration every time they are executed. This involves items such as creating SRI-OV node policies, performance profiles, or PTP profiles. Allowing the tests to configure an already configured cluster might affect the functionality of the cluster. Also, changes to configuration items such as SR-IOV node policy might result in the environment being temporarily unavailable until the configuration change is processed.

13.8.1. Prerequisites

  • The test entrypoint is /usr/bin/test-run.sh. It runs both a setup test set and the real conformance test suite. The minimum requirement is to provide it with a kubeconfig file and its related $KUBECONFIG environment variable, mounted through a volume.
  • The tests assumes that a given feature is already available on the cluster in the form of an Operator, flags enabled on the cluster, or machine configs.
  • Some tests require a pre-existing machine config pool to append their changes to. This must be created on the cluster before running the tests.

    The default worker pool is worker-cnf and can be created with the following manifest:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
      name: worker-cnf
      labels:
        machineconfiguration.openshift.io/role: worker-cnf
    spec:
      machineConfigSelector:
        matchExpressions:
          - {
              key: machineconfiguration.openshift.io/role,
              operator: In,
              values: [worker-cnf, worker],
            }
      paused: false
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/worker-cnf: ""

    You can use the ROLE_WORKER_CNF variable to override the worker pool name:

    $ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig -e
    ROLE_WORKER_CNF=custom-worker-pool registry.redhat.io/openshift4/cnf-tests-rhel8:v4.6 /usr/bin/test-run.sh
    Note

    Currently, not all tests run selectively on the nodes belonging to the pool.

13.8.2. Running the tests

Assuming the file is in the current folder, the command for running the test suite is:

$ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig registry.redhat.io/openshift4/cnf-tests-rhel8:v4.6 /usr/bin/test-run.sh

This allows your kubeconfig file to be consumed from inside the running container.

13.8.3. Image parameters

Depending on the requirements, the tests can use different images. There are two images used by the tests that can be changed using the following environment variables:

  • CNF_TESTS_IMAGE
  • DPDK_TESTS_IMAGE

For example, to change the CNF_TESTS_IMAGE with a custom registry run the following command:

$ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig -e CNF_TESTS_IMAGE="custom-cnf-tests-image:latests" registry.redhat.io/openshift4/cnf-tests-rhel8:v4.6 /usr/bin/test-run.sh

13.8.3.1. Ginkgo parameters

The test suite is built upon the ginkgo BDD framework. This means that it accepts parameters for filtering or skipping tests.

You can use the -ginkgo.focus parameter to filter a set of tests:

$ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig registry.redhat.io/openshift4/cnf-tests-rhel8:v4.6 /usr/bin/test-run.sh -ginkgo.focus="performance|sctp"
Note

There is a particular test that requires both SR-IOV and SCTP. Given the selective nature of the focus parameter, this test is triggered by only placing the sriov matcher. If the tests are executed against a cluster where SR-IOV is installed but SCTP is not, adding the -ginkgo.skip=SCTP parameter causes the tests to skip SCTP testing.

13.8.3.2. Available features

The set of available features to filter are:

  • performance
  • sriov
  • ptp
  • sctp
  • dpdk

13.8.4. Dry run

Use this command to run in dry-run mode. This is useful for checking what is in the test suite and provides output for all of the tests the image would run.

$ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig registry.redhat.io/openshift4/cnf-tests-rhel8:v4.6 /usr/bin/test-run.sh -ginkgo.dryRun -ginkgo.v

13.8.5. Disconnected mode

The CNF tests image support running tests in a disconnected cluster, meaning a cluster that is not able to reach outer registries. This is done in two steps:

  1. Performing the mirroring.
  2. Instructing the tests to consume the images from a custom registry.

13.8.5.1. Mirroring the images to a custom registry accessible from the cluster

A mirror executable is shipped in the image to provide the input required by oc to mirror the images needed to run the tests to a local registry.

Run this command from an intermediate machine that has access both to the cluster and to registry.redhat.io over the Internet:

$ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig registry.redhat.io/openshift4/cnf-tests-rhel8:v4.6 /usr/bin/mirror -registry my.local.registry:5000/ |  oc image mirror -f -

Then, follow the instructions in the following section about overriding the registry used to fetch the images.

13.8.5.2. Instruct the tests to consume those images from a custom registry

This is done by setting the IMAGE_REGISTRY environment variable:

$ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig -e IMAGE_REGISTRY="my.local.registry:5000/" -e CNF_TESTS_IMAGE="custom-cnf-tests-image:latests" registry.redhat.io/openshift4/cnf-tests-rhel8:v4.6 /usr/bin/test-run.sh

13.8.5.3. Mirroring to the cluster internal registry

OpenShift Container Platform provides a built-in container image registry, which runs as a standard workload on the cluster.

Procedure

  1. Gain external access to the registry by exposing it with a route:

    $ oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
  2. Fetch the registry endpoint:

    REGISTRY=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}')
  3. Create a namespace for exposing the images:

    $ oc create ns cnftests
  4. Make that image stream available to all the namespaces used for tests. This is required to allow the tests namespaces to fetch the images from the cnftests image stream.

    $ oc policy add-role-to-user system:image-puller system:serviceaccount:sctptest:default --namespace=cnftests
    $ oc policy add-role-to-user system:image-puller system:serviceaccount:cnf-features-testing:default --namespace=cnftests
    $ oc policy add-role-to-user system:image-puller system:serviceaccount:performance-addon-operators-testing:default --namespace=cnftests
    $ oc policy add-role-to-user system:image-puller system:serviceaccount:dpdk-testing:default --namespace=cnftests
    $ oc policy add-role-to-user system:image-puller system:serviceaccount:sriov-conformance-testing:default --namespace=cnftests
  5. Retrieve the docker secret name and auth token:

    SECRET=$(oc -n cnftests get secret | grep builder-docker | awk {'print $1'}
    TOKEN=$(oc -n cnftests get secret $SECRET -o jsonpath="{.data['\.dockercfg']}" | base64 -d | jq '.["image-registry.openshift-image-registry.svc:5000"].auth')
  6. Write a dockerauth.json similar to this:

    echo "{\"auths\": { \"$REGISTRY\": { \"auth\": $TOKEN } }}" > dockerauth.json
  7. Do the mirroring:

    $ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig registry.redhat.io/openshift4/cnf-tests-rhel8:v4.6 /usr/bin/mirror -registry $REGISTRY/cnftests |  oc image mirror --insecure=true -a=$(pwd)/dockerauth.json -f -
  8. Run the tests:

    $ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig -e IMAGE_REGISTRY=image-registry.openshift-image-registry.svc:5000/cnftests cnf-tests-local:latest /usr/bin/test-run.sh

13.8.5.4. Mirroring a different set of images

Procedure

  1. The mirror command tries to mirror the u/s images by default. This can be overridden by passing a file with the following format to the image:

    [
        {
            "registry": "public.registry.io:5000",
            "image": "imageforcnftests:4.6"
        },
        {
            "registry": "public.registry.io:5000",
            "image": "imagefordpdk:4.6"
        }
    ]
  2. Pass it to the mirror command, for example saving it locally as images.json. With the following command, the local path is mounted in /kubeconfig inside the container and that can be passed to the mirror command.

    $ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig registry.redhat.io/openshift4/cnf-tests-rhel8:v4.6 /usr/bin/mirror --registry "my.local.registry:5000/" --images "/kubeconfig/images.json" |  oc image mirror -f -

13.8.6. Discovery mode

Discovery mode allows you to validate the functionality of a cluster without altering its configuration. Existing environment configurations are used for the tests. The tests attempt to find the configuration items needed and use those items to execute the tests. If resources needed to run a specific test are not found, the test is skipped, providing an appropriate message to the user. After the tests are finished, no cleanup of the pre-configured configuration items is done, and the test environment can be immediately used for another test run.

Some configuration items are still created by the tests. These are specific items needed for a test to run; for example, a SR-IOV Network. These configuration items are created in custom namespaces and are cleaned up after the tests are executed.

An additional bonus is a reduction in test run times. As the configuration items are already there, no time is needed for environment configuration and stabilization.

To enable discovery mode, the tests must be instructed by setting the DISCOVERY_MODE environment variable as follows:

$ docker run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig -e
DISCOVERY_MODE=true registry.redhat.io/openshift-kni/cnf-tests /usr/bin/test-run.sh

13.8.6.1. Required environment configuration prerequisites

SR-IOV tests

Most SR-IOV tests require the following resources:

  • SriovNetworkNodePolicy.
  • At least one with the resource specified by SriovNetworkNodePolicy being allocatable; a resource count of at least 5 is considered sufficient.

Some tests have additional requirements:

  • An unused device on the node with available policy resource, with link state DOWN and not a bridge slave.
  • A SriovNetworkNodePolicy with a MTU value of 9000.

DPDK tests

The DPDK related tests require:

  • A performance profile.
  • A SR-IOV policy.
  • A node with resources available for the SR-IOV policy and available with the PerformanceProfile node selector.

PTP tests

  • A slave PtpConfig (ptp4lOpts="-s" ,phc2sysOpts="-a -r").
  • A node with a label matching the slave PtpConfig.

SCTP tests

  • SriovNetworkNodePolicy.
  • A node matching both the SriovNetworkNodePolicy and a MachineConfig that enables SCTP.

Performance Operator tests

Various tests have different requirements. Some of them are:

  • A performance profile.
  • A performance profile having profile.Spec.CPU.Isolated = 1.
  • A performance profile having profile.Spec.RealTimeKernel.Enabled == true.
  • A node with no huge pages usage.

13.8.6.2. Limiting the nodes used during tests

The nodes on which the tests are executed can be limited by specifying a NODES_SELECTOR environment variable. Any resources created by the test are then limited to the specified nodes.

$ docker run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig -e
NODES_SELECTOR=node-role.kubernetes.io/worker-cnf registry.redhat.io/openshift-kni/cnf-tests /usr/bin/test-run.sh

13.8.6.3. Using a single performance profile

The resources needed by the DPDK tests are higher than those required by the performance test suite. To make the execution faster, the performance profile used by tests can be overridden using one that also serves the DPDK test suite.

To do this, a profile like the following one can be mounted inside the container, and the performance tests can be instructed to deploy it.

apiVersion: performance.openshift.io/v1
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    isolated: "4-15"
    reserved: "0-3"
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
    - size: "1G"
      count: 16
      node: 0
  realTimeKernel:
    enabled: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""

To override the performance profile used, the manifest must be mounted inside the container and the tests must be instructed by setting the PERFORMANCE_PROFILE_MANIFEST_OVERRIDE parameter as follows:

$ docker run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig -e
PERFORMANCE_PROFILE_MANIFEST_OVERRIDE=/kubeconfig/manifest.yaml registry.redhat.io/openshift-kni/cnf-tests /usr/bin/test-run.sh

13.8.6.4. Disabling the performance profile cleanup

When not running in discovery mode, the suite cleans up all the created artifacts and configurations. This includes the performance profile.

When deleting the performance profile, the machine config pool is modified and nodes are rebooted. After a new iteration, a new profile is created. This causes long test cycles between runs.

To speed up this process, set CLEAN_PERFORMANCE_PROFILE="false" to instruct the tests not to clean the performance profile. In this way, the next iteration will not need to create it and wait for it to be applied.

$ docker run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig -e
CLEAN_PERFORMANCE_PROFILE="false" registry.redhat.io/openshift-kni/cnf-tests /usr/bin/test-run.sh

13.8.7. Troubleshooting

The cluster must be reached from within the container. You can verify this by running:

$ docker run -v $(pwd)/:/kubeconfig -e KUBECONFIG=/kubeconfig/kubeconfig
registry.redhat.io/openshift-kni/cnf-tests oc get nodes

If this does not work, it could be caused by spanning across DNS, MTU size, or firewall issues.

13.8.8. Test reports

CNF end-to-end tests produce two outputs: a JUnit test output and a test failure report.

13.8.8.1. JUnit test output

A JUnit-compliant XML is produced by passing the --junit parameter together with the path where the report is dumped:

$ docker run -v $(pwd)/:/kubeconfig -v $(pwd)/junitdest:/path/to/junit -e KUBECONFIG=/kubeconfig/kubeconfig registry.redhat.io/openshift4/cnf-tests-rhel8:v4.6 /usr/bin/test-run.sh --junit /path/to/junit

13.8.8.2. Test failure report

A report with information about the cluster state and resources for troubleshooting can be produced by passing the --report parameter with the path where the report is dumped:

$ docker run -v $(pwd)/:/kubeconfig -v $(pwd)/reportdest:/path/to/report -e KUBECONFIG=/kubeconfig/kubeconfig registry.redhat.io/openshift4/cnf-tests-rhel8:v4.6 /usr/bin/test-run.sh --report /path/to/report

13.8.8.3. A note on podman

When executing podman as non root and non privileged, mounting paths can fail with "permission denied" errors. To make it work, append :Z to the volumes creation; for example, -v $(pwd)/:/kubeconfig:Z to allow podman to do the proper SELinux relabeling.

13.8.8.4. Running on OpenShift Container Platform 4.4

With the exception of the following, the CNF end-to-end tests are compatible with OpenShift Container Platform 4.4:

[test_id:28466][crit:high][vendor:cnf-qe@redhat.com][level:acceptance] Should contain configuration injected through openshift-node-performance profile
[test_id:28467][crit:high][vendor:cnf-qe@redhat.com][level:acceptance] Should contain configuration injected through the openshift-node-performance profile

You can skip these tests by adding the -ginkgo.skip “28466|28467" parameter.

13.8.8.5. Using a single performance profile

The DPDK tests require more resources than what is required by the performance test suite. To make the execution faster, you can override the performance profile used by the tests using a profile that also serves the DPDK test suite.

To do this, use a profile like the following one that can be mounted inside the container, and the performance tests can be instructed to deploy it.

apiVersion: performance.openshift.io/v1
kind: PerformanceProfile
metadata:
 name: performance
spec:
 cpu:
  isolated: "5-15"
  reserved: "0-4"
 hugepages:
  defaultHugepagesSize: "1G"
  pages:
  -size: "1G"
   count: 16
   node: 0
 realTimeKernel:
  enabled: true
 numa:
  topologyPolicy: "best-effort"
 nodeSelector:
  node-role.kubernetes.io/worker-cnf: ""

To override the performance profile, the manifest must be mounted inside the container and the tests must be instructed by setting the PERFORMANCE_PROFILE_MANIFEST_OVERRIDE:

$ docker run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig -e PERFORMANCE_PROFILE_MANIFEST_OVERRIDE=/kubeconfig/manifest.yaml registry.redhat.io/openshift4/cnf-tests-rhel8:v4.6 /usr/bin/test-run.sh

13.8.9. Impacts on the cluster

Depending on the feature, running the test suite could cause different impacts on the cluster. In general, only the SCTP tests do not change the cluster configuration. All of the other features have various impacts on the configuration.

13.8.9.1. SCTP

SCTP tests just run different pods on different nodes to check connectivity. The impacts on the cluster are related to running simple pods on two nodes.

13.8.9.2. SR-IOV

SR-IOV tests require changes in the SR-IOV network configuration, where the tests create and destroy different types of configuration.

This might have an impact if existing SR-IOV network configurations are already installed on the cluster, because there may be conflicts depending on the priority of such configurations.

At the same time, the result of the tests might be affected by existing configurations.

13.8.9.3. PTP

PTP tests apply a PTP configuration to a set of nodes of the cluster. As with SR-IOV, this might conflict with any existing PTP configuration already in place, with unpredictable results.

13.8.9.4. Performance

Performance tests apply a performance profile to the cluster. The effect of this is changes in the node configuration, reserving CPUs, allocating memory huge pages, and setting the kernel packages to be realtime. If an existing profile named performance is already available on the cluster, the tests do not deploy it.

13.8.9.5. DPDK

DPDK relies on both performance and SR-IOV features, so the test suite configures both a performance profile and SR-IOV networks, so the impacts are the same as those described in SR-IOV testing and performance testing.

13.8.9.6. Cleaning up

After running the test suite, all the dangling resources are cleaned up.

13.9. Debugging low latency CNF tuning status

The PerformanceProfile custom resource (CR) contains status fields for reporting tuning status and debugging latency degradation issues. These fields report on conditions that describe the state of the operator’s reconciliation functionality.

A typical issue can arise when the status of machine config pools that are attached to the performance profile are in a degraded state, causing the PerformanceProfile status to degrade. In this case, the machine config pool issues a failure message.

The Performance Addon Operator contains the performanceProfile.spec.status.Conditions status field:

Status:
  Conditions:
    Last Heartbeat Time:   2020-06-02T10:01:24Z
    Last Transition Time:  2020-06-02T10:01:24Z
    Status:                True
    Type:                  Available
    Last Heartbeat Time:   2020-06-02T10:01:24Z
    Last Transition Time:  2020-06-02T10:01:24Z
    Status:                True
    Type:                  Upgradeable
    Last Heartbeat Time:   2020-06-02T10:01:24Z
    Last Transition Time:  2020-06-02T10:01:24Z
    Status:                False
    Type:                  Progressing
    Last Heartbeat Time:   2020-06-02T10:01:24Z
    Last Transition Time:  2020-06-02T10:01:24Z
    Status:                False
    Type:                  Degraded

The Status field contains Conditions that specify Type values that indicate the status of the performance profile:

Available
All machine configs and Tuned profiles have been created successfully and are available for cluster components are responsible to process them (NTO, MCO, Kubelet).
Upgradeable
Indicates whether the resources maintained by the Operator are in a state that is safe to upgrade.
Progressing
Indicates that the deployment process from the performance profile has started.
Degraded

Indicates an error if:

  • Validation of the performance profile has failed.
  • Creation of all relevant components did not complete successfully.

Each of these types contain the following fields:

Status
The state for the specific type (true or false).
Timestamp
The transaction timestamp.
Reason string
The machine readable reason.
Message string
The human readable reason describing the state and error details, if any.

13.9.1. Machine config pools

A performance profile and its created products are applied to a node according to an associated machine config pool (MCP). The MCP holds valuable information about the progress of applying the machine configurations created by performance addons that encompass kernel args, kube config, huge pages allocation, and deployment of rt-kernel. The performance addons controller monitors changes in the MCP and updates the performance profile status accordingly.

The only conditions returned by the MCP to the performance profile status is when the MCP is Degraded, which leads to performaceProfile.status.condition.Degraded = true.

Example

The following example is for a performance profile with an associated machine config pool (worker-cnf) that was created for it:

  1. The associated machine config pool is in a degraded state:

    # oc get mcp

    Example output

    NAME         CONFIG                                                 UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master       rendered-master-2ee57a93fa6c9181b546ca46e1571d2d       True      False      False      3              3                   3                     0                      2d21h
    worker       rendered-worker-d6b2bdc07d9f5a59a6b68950acf25e5f       True      False      False      2              2                   2                     0                      2d21h
    worker-cnf   rendered-worker-cnf-6c838641b8a08fff08dbd8b02fb63f7c   False     True       True       2              1                   1                     1                      2d20h

  2. The describe section of the MCP shows the reason:

    # oc describe mcp worker-cnf

    Example output

      Message:               Node node-worker-cnf is reporting: "prepping update:
      machineconfig.machineconfiguration.openshift.io \"rendered-worker-cnf-40b9996919c08e335f3ff230ce1d170\" not
      found"
        Reason:                1 nodes are reporting degraded status on sync

  3. The degraded state should also appear under the performance profile status field marked as degraded = true:

    # oc describe performanceprofiles performance

    Example output

    Message: Machine config pool worker-cnf Degraded Reason: 1 nodes are reporting degraded status on sync.
    Machine config pool worker-cnf Degraded Message: Node yquinn-q8s5v-w-b-z5lqn.c.openshift-gce-devel.internal is
    reporting: "prepping update: machineconfig.machineconfiguration.openshift.io
    \"rendered-worker-cnf-40b9996919c08e335f3ff230ce1d170\" not found".    Reason:  MCPDegraded
       Status:  True
       Type:    Degraded

13.10. Collecting low latency tuning debugging data for Red Hat Support

When opening a support case, it is helpful to provide debugging information about your cluster to Red Hat Support.

The must-gather tool enables you to collect diagnostic information about your OpenShift Container Platform cluster, including node tuning, NUMA topology, and other information needed to debug issues with low latency setup.

For prompt support, supply diagnostic information for both OpenShift Container Platform and low latency tuning.

13.10.1. About the must-gather tool

The oc adm must-gather CLI command collects the information from your cluster that is most likely needed for debugging issues, such as:

  • Resource definitions
  • Audit logs
  • Service logs

You can specify one or more images when you run the command by including the --image argument. When you specify an image, the tool collects data related to that feature or product. When you run oc adm must-gather, a new pod is created on the cluster. The data is collected on that pod and saved in a new directory that starts with must-gather.local. This directory is created in your current working directory.

13.10.2. About collecting low latency tuning data

Use the oc adm must-gather CLI command to collect information about your cluster, including features and objects associated with low latency tuning, including:

  • The Performance Addon Operator namespaces and child objects.
  • MachineConfigPool and associated MachineConfig objects.
  • The Node Tuning Operator and associated Tuned objects.
  • Linux Kernel command line options.
  • CPU and NUMA topology
  • Basic PCI device information and NUMA locality.

To collect container-native virtualization data with must-gather, you must specify the container-native virtualization image:

--image=registry.redhat.io/openshift4/performance-addon-operator-must-gather-rhel8.

13.10.3. Gathering data about specific features

You can gather debugging information about specific features by using the oc adm must-gather CLI command with the --image or --image-stream argument. The must-gather tool supports multiple images, so you can gather data about more than one feature by running a single command.

Note

To collect the default must-gather data in addition to specific feature data, add the --image-stream=openshift/must-gather argument.

Prerequisites

  • Access to the cluster as a user with the cluster-admin role.
  • The OpenShift Container Platform CLI (oc) installed.

Procedure

  1. Navigate to the directory where you want to store the must-gather data.
  2. Run the oc adm must-gather command with one or more --image or --image-stream arguments. For example, the following command gathers both the default cluster data and information specific to container-native virtualization:

    $ oc adm must-gather \
     --image-stream=openshift/must-gather \ 1
    
     --image=registry.redhat.io/openshift4/performance-addon-operator-must-gather-rhel8 2
    1
    The default OpenShift Container Platform must-gather image.
    2
    The must-gather image for low latency tuning diagnostics.
  3. Create a compressed file from the must-gather directory that was created in your working directory. For example, on a computer that uses a Linux operating system, run the following command:

     $ tar cvaf must-gather.tar.gz must-gather.local.5421342344627712289/ 1
    1
    Replace must-gather-local.5421342344627712289/ with the actual directory name.
  4. Attach the compressed file to your support case on the Red Hat Customer Portal.

Additional resources

Legal Notice

Copyright © 2020 Red Hat, Inc.
The text of and illustrations in this document are licensed by Red Hat under a Creative Commons Attribution–Share Alike 3.0 Unported license ("CC-BY-SA"). An explanation of CC-BY-SA is available at http://creativecommons.org/licenses/by-sa/3.0/. In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must provide the URL for the original version.
Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.
Red Hat, Red Hat Enterprise Linux, the Shadowman logo, the Red Hat logo, JBoss, OpenShift, Fedora, the Infinity logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
Java® is a registered trademark of Oracle and/or its affiliates.
XFS® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries.
MySQL® is a registered trademark of MySQL AB in the United States, the European Union and other countries.
Node.js® is an official trademark of Joyent. Red Hat is not formally related to or endorsed by the official Joyent Node.js open source or commercial project.
The OpenStack® Word Mark and OpenStack logo are either registered trademarks/service marks or trademarks/service marks of the OpenStack Foundation, in the United States and other countries and are used with the OpenStack Foundation's permission. We are not affiliated with, endorsed or sponsored by the OpenStack Foundation, or the OpenStack community.
All other trademarks are the property of their respective owners.