Performance Addon Operator advanced configuration


Environment

  • OpenShift Container Platform (OCP) 4.6 or later.

Issue

Advanced configuration instructions for field and performance teams that need to apply tuning settings beyond the default latency-sensitive workload tunings that the Performance Addon Operator provides.

Resolution

Advanced Configuration

This is an advanced guide for changing low-level performance configuration on a cluster to hotfix an issue or test the impact of a change.
Instructions for editing (adding/removing/changing) kernel arguments, sysfs, and proc parameters are described in this section.

Default tunings

Default tunings are applied with the openshift-performance base profile. It is the base for creating a Tuned CR that is detected by the Cluster Node Tuning Operator and finally executed by the tuned daemon.
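
For example, the Tuned objects managed by the Cluster Node Tuning Operator can be listed and the generated profile inspected (the exact object name depends on the performance profile):

oc get tuned -n openshift-cluster-node-tuning-operator
oc get tuned <tuned name> -n openshift-cluster-node-tuning-operator -o yaml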

Additional kernel arguments

When a performance profile CR is created, a default set of kernel arguments is generated from the openshift-performance base profile, in addition to the tuned-generated arguments, and can include, for example:

nohz=on rcu_nocbs=<isolated_cores> tuned.non_isolcpus=<not_isolated_cpumask> intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt systemd.cpu_affinity=<not_isolated_cores> isolcpus=<isolated_cores> default_hugepagesz=<DefaultHugepagesSize> hugepagesz=<hugepages_size> hugepages=<hugepages_count>

Note: isolcpus is added only when balanceIsolated is disabled.

Additional kernel arguments can be added in the performance profile CR using the additionalKernelArgs field:

apiVersion: performance.openshift.io/v1
kind: PerformanceProfile
metadata:
  name: example-performanceprofile
spec:
  additionalKernelArgs:
  - "nmi_watchdog=0"
  - "audit=0"
  - "mce=off"
  - "processor.max_cstate=1"
  - "idle=poll"
  - "intel_idle.max_cstate=0"  
...

Note: These arguments will be added on top of the default arguments mentioned above. These additional arguments can be edited directly in the CR.
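
For example, assuming the profile from the example above:

oc edit performanceprofile example-performanceprofile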

Note: This should be used for simple additions; for more complex operations, see the Custom tunings section below.

Additional Kubelet Arguments

To avoid KubeletConfig custom resource limitations, the experimental kubelet config snippet annotation kubeletconfig.experimental was introduced. It provides a way to configure an additional set of kubelet parameters on top of the KubeletConfig custom resource generated by the performance-addon-operator. Note that this is an experimental feature and there is no validation of the kubelet config snippet provided under the annotation, so make sure to set only parameters specified in the KubeletConfig v1beta1 specification. Also, by default, the performance-addon-operator will override:
1. CPU manager policy
2. CPU manager reconcile period
3. Topology manager policy
4. Reserved CPUs
5. Memory manager policy

Avoid specifying these parameters under the annotation; use the relevant API to configure them instead.

To update the KubeletConfig CR, pass the KubeletConfig v1beta1 snippet in JSON format, for example:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: unsafe-sysctl-performanceprofile
  annotations:
    kubeletconfig.experimental: |
      {"allowedUnsafeSysctls":["net.core.somaxconn"]}
spec:
  ...
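
To verify that the snippet was merged into the generated KubeletConfig, inspect the object created by the performance-addon-operator (as an illustration, the generated object is typically named performance-<profile name>; check the actual name on your cluster):

oc get kubeletconfig performance-unsafe-sysctl-performanceprofile -o yaml | grep allowedUnsafeSysctls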

Custom tunings

To perform hotfixes on top of the tuned openshift-performance base profile, a custom tuned profile (a child profile) is used to apply the desired changes.
This profile inherits the base tuned profile and overrides its fields where needed.

For complete details about customizing tuned, see: Customizing Tuned profiles.

Getting the current deployed tuned profile

To apply changes, we first need to get the name of the deployed tuned profile that was generated by the performance addon operator:

# oc describe performanceprofile <profile name> | grep Tuned
Tuned:  <tuned namespace>/<tuned name>

Any tuned profile created for custom tunings will need to inherit from this tuned profile:

include=<tuned name>

The custom Tuned CR should be under the same tuned namespace:

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: ...
  namespace: <tuned namespace>

For example:

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: configuration-hotfixes
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=...
      # override performance addons generated tuned profile
      include=openshift-node-performance-manual
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-cnf"
    priority: 19
    profile: configuration-hotfixes 

Important Note: The Tuned CR created for custom configuration must take precedence over the existing one.
To do so, its priority under the recommend section must be a lower number than that of the existing Tuned CR (lower number = higher priority).
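
The priority of the existing generated Tuned CR can be checked before choosing a value; a minimal sketch, assuming a single entry under recommend:

oc get tuned <tuned name> -n openshift-cluster-node-tuning-operator -o jsonpath='{.spec.recommend[0].priority}'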

Example use cases

Initial performance profile

performance_profile.yaml
apiVersion: performance.openshift.io/v1
kind: PerformanceProfile
metadata:
  name: manual
spec:
  additionalKernelArgs:
    - "nmi_watchdog=0"
    - "audit=0"
    - "mce=off"
    - "processor.max_cstate=1"
    - "idle=poll"
    - "intel_idle.max_cstate=0"
  cpu:
    isolated: "1-3"
    reserved: "0"
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
      - size: "1G"
        count: 1
        node: 0
  realTimeKernel:
    enabled: true
  numa:
    topologyPolicy: "single-numa-node"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
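
The profile is applied like any other resource:

oc create -f performance_profile.yaml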

Example of the kernel arguments generated after initial profile deployment:

sh-4.2# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-35750ad692eb3cc24529d0bc23857ad3cc29340d39912b43e3a40d255f05f740/vmlinuz-4.18.0-147.8.1.rt24.101.el8_1.x86_64 rhcos.root=crypt_rootfs console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.1/rhcos/35750ad692eb3cc24529d0bc23857ad3cc29340d39912b43e3a40d255f05f740/0 ignition.platform.id=gcp skew_tick=1 nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup default_hugepagesz=1G tsc=nowatchdog intel_iommu=on iommu=pt systemd.cpu_affinity=0

Note: check /proc/cmdline on the nodes to get the current kernel arguments list.
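
For example, with a debug pod on one of the tuned nodes:

oc debug node/<node name> -- chroot /host cat /proc/cmdline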

Removing a kernel argument
oc create -f- <<_EOF_
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: configuration-hotfixes
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Configuration changes profile inherited from performance created tuned

      include=openshift-node-performance-manual
      [bootloader]
      cmdline_removeKernelArgs=-idle=poll
    name: openshift-configuration-hotfixes
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-cnf"
    priority: 19
    profile: openshift-configuration-hotfixes
_EOF_


The kernel argument is now removed:


sh-4.2# cat /proc/cmdline | grep "idle=poll"
sh-4.2#

Changing sysctl values

The current values on the node:
sh-4.2# sysctl -n kernel.hung_task_timeout_secs
600
sh-4.2# sysctl -n kernel.nmi_watchdog          
0
sh-4.2# sysctl -n kernel.sched_rt_runtime_us
-1

oc create -f- <<_EOF_
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: configuration-hotfixes
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Configuration changes profile inherited from performance created tuned

      include=openshift-node-performance-manual
      [sysctl]
      kernel.hung_task_timeout_secs = 700   # change the value from 600 to 700
      kernel.nmi_watchdog=                  # set an empty value
      kernel.sched_rt_runtime_us=-          # remove the override (revert to the system default)


    name: openshift-configuration-hotfixes
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-cnf"
    priority: 19
    profile: openshift-configuration-hotfixes
_EOF_

sh-4.2# sysctl -n kernel.hung_task_timeout_secs
700
sh-4.2# sysctl -n kernel.nmi_watchdog
0
sh-4.2# sysctl -n kernel.sched_rt_runtime_us
950000
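
To confirm which tuned profile is currently applied on each node, list the per-node Profile objects maintained by the Cluster Node Tuning Operator:

oc get profiles.tuned.openshift.io -n openshift-cluster-node-tuning-operator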

RPS settings

The default RPS settings for a performance profile set the RPS mask to the reserved CPUs: at the host level for all network devices excluding virtual (veth) and physical (PCI) devices, and at the container level for all virtual (veth) network devices.
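
For example, the effective RPS mask of a device receive queue can be inspected on a node (rx-0 is the first receive queue; substitute the actual device name):

sh-4.2# cat /sys/class/net/<device>/queues/rx-0/rps_cpus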

RPS and workload hints (OpenShift 4.12 and above)

When the realtime workload hint is explicitly disabled, there is no need for any RPS settings to be applied, since they are relevant only to the realtime use case.
The following results in no RPS settings being applied on the cluster at all:

performance_profile.yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: example-performanceprofile
spec:
  workloadHints:
    realTime: false

In special cases where there is a need to explicitly set the realtime workload hint to false but keep the RPS settings,
the override annotation performance.openshift.io/enable-rps can be added to the performance profile to keep the default RPS settings:

performance_profile.yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: example-performanceprofile
  annotations:
     performance.openshift.io/enable-rps: "true"  
spec:
  workloadHints:
    realTime: false

Enable RPS on physical devices annotation (OpenShift 4.11 and above)

If there is a need to set the RPS mask for physical (PCI) devices on the host side as well, the override annotation performance.openshift.io/enable-physical-dev-rps can be added to the performance profile:

performance_profile.yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: example-performanceprofile
  annotations:
     performance.openshift.io/enable-physical-dev-rps: "true"

Note: The performance.openshift.io/enable-physical-dev-rps annotation can be applied only when the realtime workload hint is
NOT explicitly set to false, unless performance.openshift.io/enable-rps is set to true.

Root Cause

The Performance Addon Operator optimizes OpenShift clusters for applications sensitive to CPU and network latency.
The operator consumes a PerformanceProfile CRD that offers high-level options for applying various performance tunings to cluster nodes.
These tunings are based on an underlying tuned profile that holds default configurations best suited for latency-sensitive workloads, but there are cases where extended functionality is needed.

The target audience for this advanced solution is field and performance teams that test alternative settings before implementing them.

Diagnostic Steps

The Resolution section covers both the configuration steps and the diagnostic steps for applying configurations.
For further debugging methods, see the troubleshooting documentation.

