NUMA-aware scheduling with NUMA Resources Operator

The NUMA Resources Operator allows you to schedule high-performance workloads in the same NUMA zone. It deploys a node resources exporting agent that reports on available cluster node NUMA resources, and a secondary scheduler that manages the workloads.

Note
NUMA Resources Operator is a Developer Preview feature in OpenShift Container Platform 4.10 only. It is not available on previous versions of OpenShift Container Platform.

About Developer Preview features
Developer Preview features are not supported with Red Hat production service level agreements (SLAs) and are not functionally complete. Red Hat does not advise using them in a production setting. Developer Preview features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. These releases may not have any documentation, and testing is limited. Red Hat may provide ways to submit feedback on Developer Preview releases without an associated SLA.

About NUMA-aware scheduling

Non-Uniform Memory Access (NUMA) is a compute platform architecture that allows different CPUs to access different regions of memory at different speeds. NUMA resource topology refers to the locations of CPUs, memory, and PCI devices relative to each other in the compute node. Co-located resources are said to be in the same NUMA zone. For high-performance applications, pod workloads should be processed in a single NUMA zone.

NUMA architecture allows a CPU with multiple memory controllers to use any available memory across CPU complexes, regardless of where the memory is located. This allows for increased flexibility at the expense of performance. A CPU processing a workload using memory that is outside its NUMA zone is slower than a workload processed in a single NUMA zone. Also, for I/O-constrained workloads, the network interface on a distant NUMA zone slows down how quickly information can reach the application. High-performance workloads, such as telecommunications workloads, cannot operate to specification under these conditions. To process latency-sensitive or high-performance workloads efficiently, all the requested cluster compute resources (CPUs, memory, devices) should be aligned in the same NUMA zone. NUMA-aware scheduling also improves pod density per compute node for greater resource efficiency.
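
As a point of reference, you can inspect the NUMA layout of a compute node directly from the node itself, for example by running lscpu through oc debug. The node name and the zone layout shown here are illustrative only and depend on your hardware:

$ oc debug node/compute-0.example.com -- chroot /host lscpu | grep -i numa

Example output

NUMA node(s):        2
NUMA node0 CPU(s):   0-25,52-77
NUMA node1 CPU(s):   26-51,78-103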

The default OpenShift Container Platform pod scheduler makes scheduling decisions based on available resources of the entire compute node, not individual NUMA zones. If the most restrictive resource alignment is requested in the kubelet topology manager, error conditions can occur when admitting the pod to a node. Conversely, if the most restrictive resource alignment is not requested, the pod can be admitted to the node without proper resource alignment, leading to worse or unpredictable performance. For example, runaway pod creation with Topology Affinity Error statuses can occur when the pod scheduler makes suboptimal scheduling decisions for guaranteed pod workloads by not knowing if the pod’s requested resources are available. Scheduling mismatch decisions can cause indefinite pod startup delays. Also, depending on the cluster state and resource allocation, poor pod scheduling decisions can cause extra load on the cluster because of failed startup attempts.
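
For example, if you suspect that guaranteed pods are being rejected because of misaligned placement, you can list failed pods across the cluster and then inspect individual pods with oc describe pod to check for a Topology Affinity Error reason. This is a generic check that does not depend on the NUMA Resources Operator:

$ oc get pods -A --field-selector=status.phase=Failed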

The NUMA Resources Operator deploys a custom NUMA resources secondary scheduler and other resources to mitigate the shortcomings of the default OpenShift Container Platform pod scheduler. The following diagram provides a high-level overview of NUMA-aware pod scheduling.

NUMA-aware scheduling overview

NodeResourceTopology API
The NodeResourceTopology API describes the available NUMA zone resources in each compute node.

NUMA-aware scheduler
The NUMA-aware secondary scheduler receives information about the available NUMA zones from the NodeResourceTopology API and schedules high-performance workloads on a node where they can be optimally processed.

Node topology exporter
The node topology exporter exposes the available NUMA zone resources for each compute node to the NodeResourceTopology API. The node topology exporter daemon tracks the resource allocation from the kubelet by using the PodResources API.

PodResources API
The PodResources API is local to each node and exposes the resource topology and available resources to the kubelet.
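
After the NUMA Resources Operator and its exporter are deployed, as described in the following sections, you can list the NodeResourceTopology objects in the same way as any other cluster resource:

$ oc get noderesourcetopologies.topology.node.k8s.io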

Additional resources

  • For more information about running custom pod schedulers in your cluster and how to deploy pods with a custom pod scheduler, see Running a custom scheduler.

Installing the NUMA Resources Operator using the CLI

As a cluster administrator, you can install the Operator using the CLI.

Prerequisites

  • Install the OpenShift CLI (oc).

  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create a namespace for the NUMA Resources Operator:

    1. Save the following YAML in the nro-namespace.yaml file:

      apiVersion: v1
      kind: Namespace
      metadata:
        name: openshift-numaresources
      
    2. Create the Namespace CR:

      $ oc create -f nro-namespace.yaml
      
  2. Create the operator group for the NUMA Resources Operator:

    1. Save the following YAML in the nro-operatorgroup.yaml file:

      apiVersion: operators.coreos.com/v1
      kind: OperatorGroup
      metadata:
        name: openshift-numaresources-operator
        namespace: openshift-numaresources
      spec:
        targetNamespaces:
        - openshift-numaresources
      
    2. Create the OperatorGroup CR:

      $ oc create -f nro-operatorgroup.yaml
      
  3. Create the subscription for the NUMA Resources Operator:

    1. Save the following YAML in the nro-sub.yaml file:

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: openshift-numaresources-operator
        namespace: openshift-numaresources
      spec:
        channel: "4.10"
        name: openshift-numaresources-operator
        source: redhat-operators
        sourceNamespace: openshift-marketplace
      
    2. Create the Subscription CR:

      $ oc create -f nro-sub.yaml
      
  4. Verify that the installation succeeded by inspecting the CSV resource in the openshift-numaresources namespace:

    $ oc get csv -n openshift-numaresources
    

    Example output

    NAME                                      DISPLAY                      VERSION   REPLACES   PHASE
    openshift-numaresources-operator.v4.10.0  NUMA Resources Operator      4.10.0               Succeeded
    
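    If the CSV does not reach the Succeeded phase, you can inspect the Subscription and its associated install plan for errors, for example:

    $ oc describe subscription openshift-numaresources-operator -n openshift-numaresources

    $ oc get installplan -n openshift-numaresources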

Installing the NUMA Resources Operator using the web console

As a cluster administrator, you can install the NUMA Resources Operator using the web console.

Procedure

  1. Install the NUMA Resources Operator using the OpenShift Container Platform web console:

    1. In the OpenShift Container Platform web console, click Operators → OperatorHub.

    2. Choose NUMA Resources Operator from the list of available Operators, and then click Install.

  2. Optional: Verify that the NUMA Resources Operator installed successfully:

    1. Switch to the Operators → Installed Operators page.

    2. Ensure that NUMA Resources Operator is listed in the default project with a Status of InstallSucceeded.

      During installation an Operator might display a Failed status. If the installation later succeeds with an InstallSucceeded message, you can ignore the Failed message.

      If the Operator does not appear as installed, to troubleshoot further:

      • Go to the Operators → Installed Operators page and inspect the Operator Subscriptions and Install Plans tabs for any failures or errors under Status.

      • Go to the Workloads → Pods page and check the logs for pods in the default project.

Deploying the NUMA-aware secondary pod scheduler

After you have installed the NUMA Resources Operator, configure the pod admittance policy for the required machine profile, create the required machine config pool, and deploy the NUMA-aware secondary scheduler. These steps are described in the following procedure.

Prerequisites

  • Install the OpenShift CLI (oc).

  • Log in as a user with cluster-admin privileges.

  • Install the NUMA Resources Operator.

Procedure

  1. Create the KubeletConfig custom resource that configures the pod admittance policy for the machine profile:

    1. Save the following YAML in the nro-kubeletconfig.yaml file:

      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: cnf-worker-tuning
      spec:
        machineConfigPoolSelector:
          matchLabels:
            cnf-worker-tuning: enabled
        kubeletConfig:
          cpuManagerPolicy: "static"
          cpuManagerReconcilePeriod: "5s"
          reservedSystemCPUs: "0,1"
          memoryManagerPolicy: "Static"
          evictionHard:
            memory.available: "100Mi"
          kubeReserved:
            memory: "512Mi"
          reservedMemory:
            - numaNode: 0
              limits:
                memory: "1124Mi"
          systemReserved:
            memory: "512Mi"
          topologyManagerPolicy: "single-numa-node"
          topologyManagerScope: "pod"
      
      • topologyManagerPolicy must be set to single-numa-node.

      • topologyManagerScope must be set to pod.

    2. Create the KubeletConfig CR:

      $ oc create -f nro-kubeletconfig.yaml
      
  2. Create the MachineConfigPool custom resource that enables custom kubelet configurations for worker nodes:

    1. Save the following YAML in the nro-machineconfig.yaml file:

      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        labels:
          cnf-worker-tuning: enabled
          machineconfiguration.openshift.io/mco-built-in: ""
          pools.operator.machineconfiguration.openshift.io/worker: ""
        name: worker
      spec:
        machineConfigSelector:
          matchLabels:
            machineconfiguration.openshift.io/role: worker
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/worker: ""
      
    2. Create the MachineConfigPool CR:

      $ oc create -f nro-machineconfig.yaml
      
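    Because the kubelet configuration from the previous step is applied through this machine config pool, the matching worker nodes are updated in a rolling fashion and might reboot. You can monitor the pool until the UPDATED column reports True, for example:

    $ oc get machineconfigpool worker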
  3. Create the NUMAResourcesScheduler custom resource that deploys the NUMA-aware custom pod scheduler:

    1. Save the following YAML in the nro-scheduler.yaml file:

      apiVersion: nodetopology.openshift.io/v1alpha1
      kind: NUMAResourcesScheduler
      metadata:
        name: numaresourcesscheduler
      spec:
        imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.10.0"
      
    2. Create the NUMAResourcesScheduler CR:

      $ oc create -f nro-scheduler.yaml
      
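    The numaresourcesoperator-worker daemon set pods shown in the verification output below are created by a NUMAResourcesOperator custom resource, which is not shown in this procedure. If that resource does not already exist in your cluster, the following is a minimal sketch only; it assumes the same v1alpha1 API group that the NUMAResourcesScheduler resource uses and a node group that targets the worker machine config pool by the label defined in the previous step:

      apiVersion: nodetopology.openshift.io/v1alpha1
      kind: NUMAResourcesOperator
      metadata:
        name: numaresourcesoperator
      spec:
        nodeGroups:
        - machineConfigPoolSelector:
            matchLabels:
              pools.operator.machineconfiguration.openshift.io/worker: ""

    Create the resource with oc create -f, as with the other custom resources in this procedure.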

Verification

Verify that the required resources deployed successfully:

$ oc get all -n openshift-numaresources

Example output

NAME                                                    READY   STATUS    RESTARTS   AGE
pod/numaresources-controller-manager-7575848485-bns4s   1/1     Running   0          13h
pod/numaresourcesoperator-worker-dvj4n                  2/2     Running   0          16h
pod/numaresourcesoperator-worker-lcg4t                  2/2     Running   0          16h
pod/secondary-scheduler-56994cf6cf-7qf4q                1/1     Running   0          16h

NAME                                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
daemonset.apps/numaresourcesoperator-worker   2         2         2       2            2           node-role.kubernetes.io/worker=   16h

NAME                                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/numaresources-controller-manager   1/1     1            1           13h
deployment.apps/secondary-scheduler                1/1     1            1           16h

NAME                                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/numaresources-controller-manager-7575848485   1         1         1       13h
replicaset.apps/secondary-scheduler-56994cf6cf                1         1         1       16h

Scheduling workloads with the NUMA-aware scheduler

You can schedule workloads with the NUMA-aware scheduler using Deployment CRs that specify the minimum required resources to process the workload.

The following example deployment uses NUMA-aware scheduling for a sample workload.

Prerequisites

  • Install the OpenShift CLI (oc).

  • Log in as a user with cluster-admin privileges.

  • Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.

Procedure

  1. Get the name of the NUMA-aware scheduler that is deployed in the cluster:

    $ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o json | jq '.status.schedulerName'
    

    Example output

    topo-aware-scheduler
    
  2. Create a Deployment CR that uses the scheduler named topo-aware-scheduler, for example:

    1. Save the following YAML in the nro-deployment.yaml file:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: numa-deployment-1
        namespace: openshift-numaresources
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: test
        template:
          metadata:
            labels:
              app: test
          spec:
            schedulerName: topo-aware-scheduler 
            containers:
            - name: ctnr
              image: quay.io/openshifttest/hello-openshift:openshift
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  memory: "100Mi"
                  cpu: "10"
                requests:
                  memory: "100Mi"
                  cpu: "10"
            - name: ctnr2
              image: gcr.io/google_containers/pause-amd64:3.0
              imagePullPolicy: IfNotPresent
              command: ["/bin/sh", "-c"]
              args: [ "while true; do sleep 1h; done;" ]
              resources:
                limits:
                  memory: "100Mi"
                  cpu: "8"
                requests:
                  memory: "100Mi"
                  cpu: "8"
      
      • schedulerName must match the name of the NUMA-aware scheduler that is deployed in your cluster, for example topo-aware-scheduler.
    2. Create the Deployment CR:

      $ oc create -f nro-deployment.yaml
      
  3. Verify that the deployment was successful:

    $ oc get pods -n openshift-numaresources
    

    Example output

    NAME                                                READY   STATUS    RESTARTS   AGE
    numa-deployment-1-56954b7b46-pfgw8                  2/2     Running   0          129m
    numaresources-controller-manager-7575848485-bns4s   1/1     Running   0          15h
    numaresourcesoperator-worker-dvj4n                  2/2     Running   0          18h
    numaresourcesoperator-worker-lcg4t                  2/2     Running   0          16h
    secondary-scheduler-56994cf6cf-7qf4q                1/1     Running   0          18h
    
  4. Verify that the topo-aware-scheduler is scheduling the deployed pod:

    $ oc describe pod numa-deployment-1-56954b7b46-pfgw8 -n openshift-numaresources
    

    Example output

    Events:
      Type    Reason          Age   From                  Message
      ----    ------          ----  ----                  -------
      Normal  Scheduled       130m  topo-aware-scheduler  Successfully assigned openshift-numaresources/numa-deployment-1-56954b7b46-pfgw8 to compute-0.example.com
    

Deployments that request more resources than are available for scheduling fail with a MinimumReplicasUnavailable error. The deployment succeeds when the required resources become available. Pods remain in the Pending state until the required resources are available.
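
If a deployment remains unavailable, you can inspect its conditions and the events of its pending pods to see the reported reason, for example:

$ oc describe deployment numa-deployment-1 -n openshift-numaresources

$ oc describe pod -l app=test -n openshift-numaresources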

Troubleshooting NUMA-aware scheduling

To troubleshoot common problems with NUMA-aware pod scheduling, perform the following steps.

Prerequisites

  • Install the OpenShift Container Platform CLI (oc).

  • Log in as a user with cluster-admin privileges.

  • Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.

Procedure

  1. Verify that the noderesourcetopologies CRD is deployed in the cluster:

    $ oc get crd | grep noderesourcetopologies
    

    Example output

    NAME                                                              CREATED AT
    noderesourcetopologies.topology.node.k8s.io                       2022-01-18T08:28:06Z
    
  2. Check that the NUMA-aware scheduler name matches the name specified in your NUMA-aware workloads:

    $ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o json | jq '.status.schedulerName'
    

    Example output

    topo-aware-scheduler
    
  3. Verify that NUMA-aware schedulable nodes have the noderesourcetopologies CR applied to them:

    $ oc get noderesourcetopologies.topology.node.k8s.io
    

    Example output

    NAME                    AGE
    compute-0.example.com   17h
    compute-1.example.com   17h
    

    The number of nodes returned should equal the number of worker nodes that are configured using the machine config pool (mcp) worker definition.
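
    As a cross-check, you can list the worker nodes directly and compare the count, for example:

    $ oc get nodes -l node-role.kubernetes.io/worker=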

  4. Verify the NUMA zone granularity for all schedulable nodes:

    $ oc get noderesourcetopologies.topology.node.k8s.io -o yaml
    

    Example output

    apiVersion: v1
    items:
    - apiVersion: topology.node.k8s.io/v1alpha1
      kind: NodeResourceTopology
      metadata:
        annotations:
          k8stopoawareschedwg/rte-update: periodic
        creationTimestamp: "2022-01-24T12:14:32Z"
        generation: 11184
        name: compute-0.example.com
        resourceVersion: "17195693"
        uid: ef04fcc8-8022-4e85-8ad0-033640584966
      topologyPolicies:
      - SingleNUMANodeContainerLevel
      zones:
      - costs:
        - name: node-0
          value: 10
        - name: node-1
          value: 21
        name: node-0
        resources:
        - allocatable: "51"
          available: "33"
          capacity: "52"
          name: cpu
        - allocatable: "47984070656"
          available: "47459782656"
          capacity: "49162670080"
          name: memory
        type: Node
      - costs:
        - name: node-0
          value: 21
        - name: node-1
          value: 10
        name: node-1
        resources:
        - allocatable: "51"
          available: "51"
          capacity: "52"
          name: cpu
        - allocatable: "50722099200"
          available: "50722099200"
          capacity: "50722099200"
          name: memory
        type: Node
    - apiVersion: topology.node.k8s.io/v1alpha1
      kind: NodeResourceTopology
      metadata:
        annotations:
          k8stopoawareschedwg/rte-update: periodic
        creationTimestamp: "2022-01-24T12:14:32Z"
        generation: 10652
        name: compute-1.example.com
        resourceVersion: "17196630"
        uid: 82c286cc-2cce-469d-a25c-1f28a5d963e4
      topologyPolicies:
      - SingleNUMANodeContainerLevel
      zones:
      - costs:
        - name: node-0
          value: 10
        - name: node-1
          value: 21
        name: node-0
        resources:
        - allocatable: "51"
          available: "51"
          capacity: "52"
          name: cpu
        - allocatable: "48023056384"
          available: "48023056384"
          capacity: "49201655808"
          name: memory
        type: Node
      - costs:
        - name: node-0
          value: 21
        - name: node-1
          value: 10
        name: node-1
        resources:
        - allocatable: "51"
          available: "51"
          capacity: "52"
          name: cpu
        - allocatable: "50683113472"
          available: "50683113472"
          capacity: "50683113472"
          name: memory
        type: Node
    kind: List
    metadata:
      resourceVersion: ""
      selfLink: ""
    
    • Each stanza under zones describes the resources for a single NUMA zone.

    • resources describes the current state of the NUMA zone resources. Check that resources listed under items.zones.resources.available correspond to the exclusive NUMA zone resources allocated to each guaranteed pod.
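
    For example, to print only the available resources for each NUMA zone of a specific node, you can filter the object with jq, which is used elsewhere in this section. The node name is an example:

    $ oc get noderesourcetopologies.topology.node.k8s.io compute-0.example.com -o json | jq '.zones[] | {zone: .name, available: [.resources[] | {name, available}]}'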

Checking the NUMA-aware scheduler logs

Troubleshoot problems with the NUMA-aware scheduler by looking at the logs. If required, you can increase the scheduler log level by modifying the spec.logLevel field of the NUMAResourcesScheduler resource. Acceptable values are Normal, Debug, and Trace, with Trace being the most verbose option.
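
For example, to check the currently configured value before you change it, query the resource directly. An empty result means that the field is not set:

$ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o jsonpath='{.spec.logLevel}'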

To change the log level of the secondary scheduler, delete the running scheduler resource and re-deploy it with the changed log level. The scheduler is unavailable for scheduling new workloads during this downtime.

Prerequisites

  • Install the OpenShift CLI (oc).

  • Log in as a user with cluster-admin privileges.

Procedure

  1. Delete the currently running NUMAResourcesScheduler resource:

    1. Get the active NUMAResourcesScheduler:

      $ oc get NUMAResourcesScheduler
      

      Example output

      NAME                     AGE
      numaresourcesscheduler   90m
      
    2. Delete the secondary scheduler resource:

      $ oc delete NUMAResourcesScheduler numaresourcesscheduler
      

      Example output

      numaresourcesscheduler.nodetopology.openshift.io "numaresourcesscheduler" deleted
      
  2. Save the following YAML in the file nro-scheduler-debug.yaml. This example changes the log level to Debug:

    apiVersion: nodetopology.openshift.io/v1alpha1
    kind: NUMAResourcesScheduler
    metadata:
      name: numaresourcesscheduler
    spec:
      imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.10.0"
      logLevel: Debug
    
  3. Create the updated Debug logging NUMAResourcesScheduler resource:

    $ oc create -f nro-scheduler-debug.yaml
    

    Example output

    numaresourcesscheduler.nodetopology.openshift.io/numaresourcesscheduler created
    
  4. Check that the NUMA-aware scheduler was successfully deployed:

    1. Check that the CRD is created successfully:

      $ oc get crd | grep numaresourcesschedulers
      

      Example output

      NAME                                                              CREATED AT
      numaresourcesschedulers.nodetopology.openshift.io                 2022-02-25T11:57:03Z
      
    2. Check that the new custom scheduler is available:

      $ oc get numaresourcesschedulers.nodetopology.openshift.io
      

      Example output

      NAME                     AGE
      numaresourcesscheduler   3h26m
      
  5. Check that the scheduler logs show the increased log level:

    1. Get the list of pods running in the openshift-numaresources namespace:

      $ oc get pods -n openshift-numaresources
      

      Example output

      NAME                                               READY   STATUS    RESTARTS   AGE
      numaresources-controller-manager-d87d79587-76mrm   1/1     Running   0          46h
      numaresourcesoperator-worker-5wm2k                 2/2     Running   0          45h
      numaresourcesoperator-worker-pb75c                 2/2     Running   0          45h
      secondary-scheduler-7976c4d466-qm4sc               1/1     Running   0          21m
      
    2. Get the logs for the secondary scheduler pod:

      $ oc logs secondary-scheduler-7976c4d466-qm4sc -n openshift-numaresources
      

      Example output

      ...
      I0223 11:04:55.614788       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 11 items received
      I0223 11:04:56.609114       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.ReplicationController total 10 items received
      I0223 11:05:22.626818       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.StorageClass total 7 items received
      I0223 11:05:31.610356       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PodDisruptionBudget total 7 items received
      I0223 11:05:31.713032       1 eventhandlers.go:186] "Add event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
      I0223 11:05:53.461016       1 eventhandlers.go:244] "Delete event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
      

Troubleshooting the resource topology exporter

Troubleshoot noderesourcetopologies objects in which unexpected results occur by inspecting the corresponding resource-topology-exporter logs.

NUMA resource topology exporter instances in the cluster should be named for the nodes they refer to. For example, a worker node with the name worker should have a corresponding noderesourcetopologies object called worker.

Prerequisites

  • Install the OpenShift CLI (oc).

  • Log in as a user with cluster-admin privileges.

Procedure

  1. Get the daemonsets managed by the NUMA Resources Operator. Each daemonset has a corresponding nodeGroup in the NUMAResourcesOperator CR:

    $ oc get numaresourcesoperators.nodetopology.openshift.io numaresourcesoperator -o jsonpath="{.status.daemonsets[0]}"
    

    Example output

    {"name":"numaresourcesoperator-worker","namespace":"openshift-numaresources"}
    
  2. Get the label for the daemonset of interest using the value for name from the previous step:

    $ oc get ds -n openshift-numaresources numaresourcesoperator-worker -o jsonpath="{.spec.selector.matchLabels}"
    

    Example output

    {"name":"resource-topology"}
    
  3. Get the pods using the resource-topology label:

    $ oc get pods -n openshift-numaresources -l name=resource-topology -o wide
    

    Example output

    NAME                                 READY   STATUS    RESTARTS   AGE    IP            NODE
    numaresourcesoperator-worker-5wm2k   2/2     Running   0          2d1h   10.135.0.64   compute-0.example.com
    numaresourcesoperator-worker-pb75c   2/2     Running   0          2d1h   10.132.2.33   compute-1.example.com
    
  4. Examine the logs of the resource-topology-exporter container running on the worker pod that corresponds to the node you are troubleshooting.

    $ oc logs -n openshift-numaresources -c resource-topology-exporter numaresourcesoperator-worker-pb75c
    

    Example output

    I0221 13:38:18.334140       1 main.go:206] using sysinfo:
    reservedCpus: 0,1
    reservedMemory:
      "0": 1178599424
    I0221 13:38:18.334370       1 main.go:67] === System information ===
    I0221 13:38:18.334381       1 sysinfo.go:231] cpus: reserved "0-1"
    I0221 13:38:18.334493       1 sysinfo.go:237] cpus: online "0-103"
    I0221 13:38:18.546750       1 main.go:72]
    cpus: allocatable "2-103"
    hugepages-1Gi:
      numa cell 1 -> 0
      numa cell 0 -> 0
    hugepages-2Mi:
      numa cell 0 -> 0
      numa cell 1 -> 0
    memory:
      numa cell 0 -> 45758Mi
      numa cell 1 -> 48372Mi
    ...
    

Developer Preview limitations and known issues
Comments