Chapter 12. Scheduling resources

12.1. Using node selectors to move logging resources

A node selector specifies a map of key/value pairs. The pairs are defined by using custom labels on nodes and matching selectors specified in pods.

For the pod to be eligible to run on a node, the pod must have the same key/value node selector as the label on the node.

12.1.1. About node selectors

You can use node selectors on pods and labels on nodes to control where the pod is scheduled. With node selectors, OpenShift Container Platform schedules the pods on nodes that contain matching labels.

You can use a node selector to place specific pods on specific nodes, cluster-wide node selectors to place new pods on specific nodes anywhere in the cluster, and project node selectors to place new pods in a project on specific nodes.

For example, as a cluster administrator, you can create an infrastructure where application developers can deploy pods only onto the nodes closest to their geographical location by including a node selector in every pod they create. In this example, the cluster consists of five data centers spread across two regions. In the U.S., label the nodes as us-east, us-central, or us-west. In the Asia-Pacific region (APAC), label the nodes as apac-east or apac-west. The developers can add a node selector to the pods they create to ensure the pods get scheduled on those nodes.

A pod is not scheduled if the Pod object contains a node selector, but no node has a matching label.

Important

If you are using node selectors and node affinity in the same pod configuration, the following rules control pod placement onto nodes:

  • If you configure both nodeSelector and nodeAffinity, both conditions must be satisfied for the pod to be scheduled onto a candidate node.
  • If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms is satisfied.
  • If you specify multiple matchExpressions associated with nodeSelectorTerms, then the pod can be scheduled onto a node only if all matchExpressions are satisfied.
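For illustration, the following sketch (the pod name s2 is a placeholder) combines a nodeSelector with a nodeAffinity rule. The pod can be scheduled only on a node that has the type: user-node label and that satisfies at least one of the two nodeSelectorTerms:

apiVersion: v1
kind: Pod
metadata:
  name: s2
#...
spec:
  nodeSelector:                # must be satisfied in addition to the affinity rule below
    type: user-node
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:     # terms are ORed; satisfying one term is enough
        - matchExpressions:    # expressions within a term are ANDed
          - key: region
            operator: In
            values:
            - east
        - matchExpressions:
          - key: region
            operator: In
            values:
            - west
#...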
Node selectors on specific pods and nodes

You can control which node a specific pod is scheduled on by using node selectors and labels.

To use node selectors and labels, first label the node to avoid pods being descheduled, then add the node selector to the pod.
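For example, the labels used in the examples that follow could be applied with the oc label command; a sketch with a placeholder node name:

$ oc label nodes <node_name> type=user-node region=east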

Note

You cannot add a node selector directly to an existing scheduled pod. You must add the node selector to the object that controls the pod, such as a Deployment or DeploymentConfig object, as sketched in the example after this note.
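Because the controller's pod template is what gets scheduled, the node selector belongs under spec.template.spec. A minimal sketch of a Deployment (the name example-deployment is a placeholder) that pins its pods to the labeled nodes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
#...
spec:
  template:
    spec:
      nodeSelector:
        region: east
        type: user-node
#...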

For example, the following Node object has the region: east label:

Sample Node object with a label

kind: Node
apiVersion: v1
metadata:
  name: ip-10-0-131-14.ec2.internal
  selfLink: /api/v1/nodes/ip-10-0-131-14.ec2.internal
  uid: 7bc2580a-8b8e-11e9-8e01-021ab4174c74
  resourceVersion: '478704'
  creationTimestamp: '2019-06-10T14:46:08Z'
  labels:
    kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/zone: us-east-1a
    node.openshift.io/os_version: '4.5'
    node-role.kubernetes.io/worker: ''
    failure-domain.beta.kubernetes.io/region: us-east-1
    node.openshift.io/os_id: rhcos
    beta.kubernetes.io/instance-type: m4.large
    kubernetes.io/hostname: ip-10-0-131-14
    beta.kubernetes.io/arch: amd64
    region: east 1
    type: user-node
#...

1
Labels to match the pod node selector.

A pod has the type: user-node, region: east node selector:

Sample Pod object with node selectors

apiVersion: v1
kind: Pod
metadata:
  name: s1
#...
spec:
  nodeSelector: 1
    region: east
    type: user-node
#...

1
Node selectors to match the node label. The node must have a label for each node selector.

When you create the pod using the example pod spec, it can be scheduled on the example node.

Default cluster-wide node selectors

With default cluster-wide node selectors, when you create a pod in that cluster, OpenShift Container Platform adds the default node selectors to the pod and schedules the pod on nodes with matching labels.

For example, the following Scheduler object has the default cluster-wide region=east and type=user-node node selectors:

Example Scheduler Operator Custom Resource

apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
#...
spec:
  defaultNodeSelector: type=user-node,region=east
#...
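The cluster-wide default comes from the Scheduler resource named cluster, so changing it is a matter of editing that resource and setting spec.defaultNodeSelector as shown above, for example:

$ oc edit scheduler cluster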

A node in that cluster has the type=user-node,region=east labels:

Example Node object

apiVersion: v1
kind: Node
metadata:
  name: ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4
#...
  labels:
    region: east
    type: user-node
#...

Example Pod object with a node selector

apiVersion: v1
kind: Pod
metadata:
  name: s1
#...
spec:
  nodeSelector:
    region: east
#...

When you create the pod using the example pod spec in the example cluster, the pod is created with the cluster-wide node selector and is scheduled on the labeled node:

Example pod list with the pod on the labeled node

NAME     READY   STATUS    RESTARTS   AGE   IP           NODE                                       NOMINATED NODE   READINESS GATES
pod-s1   1/1     Running   0          20s   10.131.2.6   ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4   <none>           <none>

Note

If the project where you create the pod has a project node selector, that selector takes preference over a cluster-wide node selector. Your pod is not created or scheduled if the pod does not have the project node selector.

Project node selectors

With project node selectors, when you create a pod in the project, OpenShift Container Platform adds the node selectors to the pod and schedules the pod on a node with matching labels. If there is a cluster-wide default node selector, a project node selector takes preference.

For example, the following project has the region=east node selector:

Example Namespace object

apiVersion: v1
kind: Namespace
metadata:
  name: east-region
  annotations:
    openshift.io/node-selector: "region=east"
#...
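One way to set this selector on an existing project is to add the annotation with the oc annotate command; a sketch (add --overwrite if the annotation is already set):

$ oc annotate namespace east-region openshift.io/node-selector="region=east"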

The following node has the type=user-node,region=east labels:

Example Node object

apiVersion: v1
kind: Node
metadata:
  name: ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4
#...
  labels:
    region: east
    type: user-node
#...

When you create the pod using the example pod spec in this example project, the pod is created with the project node selectors and is scheduled on the labeled node:

Example Pod object

apiVersion: v1
kind: Pod
metadata:
  namespace: east-region
#...
spec:
  nodeSelector:
    region: east
    type: user-node
#...

Example pod list with the pod on the labeled node

NAME     READY   STATUS    RESTARTS   AGE   IP           NODE                                       NOMINATED NODE   READINESS GATES
pod-s1   1/1     Running   0          20s   10.131.2.6   ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4   <none>           <none>

A pod in the project is not created or scheduled if the pod contains node selectors that conflict with the project node selector. For example, if you deploy the following pod into the example project, it is not created:

Example Pod object with an invalid node selector

apiVersion: v1
kind: Pod
metadata:
  name: west-region
#...
spec:
  nodeSelector:
    region: west
#...

12.1.2. Moving logging resources

You can configure the Red Hat OpenShift Logging Operator to deploy the pods for logging components, such as Elasticsearch and Kibana, to different nodes. You cannot move the Red Hat OpenShift Logging Operator pod from its installed location.

For example, you can move the Elasticsearch pods to a separate node because of high CPU, memory, and disk requirements.

Prerequisites

  • You have installed the Red Hat OpenShift Logging Operator and the OpenShift Elasticsearch Operator.

Procedure

  1. Edit the ClusterLogging custom resource (CR) in the openshift-logging project:

    $ oc edit ClusterLogging instance

    Example ClusterLogging CR

    apiVersion: logging.openshift.io/v1
    kind: ClusterLogging
    # ...
    spec:
      logStore:
        elasticsearch:
          nodeCount: 3
          nodeSelector: 1
            node-role.kubernetes.io/infra: ''
          tolerations:
          - effect: NoSchedule
            key: node-role.kubernetes.io/infra
            value: reserved
          - effect: NoExecute
            key: node-role.kubernetes.io/infra
            value: reserved
          redundancyPolicy: SingleRedundancy
          resources:
            limits:
              cpu: 500m
              memory: 16Gi
            requests:
              cpu: 500m
              memory: 16Gi
          storage: {}
        type: elasticsearch
      managementState: Managed
      visualization:
        kibana:
          nodeSelector: 2
            node-role.kubernetes.io/infra: ''
          tolerations:
          - effect: NoSchedule
            key: node-role.kubernetes.io/infra
            value: reserved
          - effect: NoExecute
            key: node-role.kubernetes.io/infra
            value: reserved
          proxy:
            resources: null
          replicas: 1
          resources: null
        type: kibana
    # ...

    1 2
    Add a nodeSelector parameter with the appropriate value to the component you want to move. You can use a nodeSelector in the format shown or use <key>: <value> pairs, based on the value specified for the node. If you added a taint to the infrastructure node, also add a matching toleration (a sketch of matching taint commands follows this step).
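    If the infrastructure nodes carry the reserved taint that the example tolerations match, commands similar to the following sketch would apply it; the node name is a placeholder:

    $ oc adm taint nodes <node_name> node-role.kubernetes.io/infra=reserved:NoSchedule
    $ oc adm taint nodes <node_name> node-role.kubernetes.io/infra=reserved:NoExecute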

Verification

To verify that a component has moved, you can use the oc get pod -o wide command.

For example:

  • You want to move the Kibana pod from the ip-10-0-147-79.us-east-2.compute.internal node:

    $ oc get pod kibana-5b8bdf44f9-ccpq9 -o wide

    Example output

    NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
    kibana-5b8bdf44f9-ccpq9   2/2     Running   0          27s   10.129.2.18   ip-10-0-147-79.us-east-2.compute.internal   <none>           <none>

  • You want to move the Kibana pod to the ip-10-0-139-48.us-east-2.compute.internal node, a dedicated infrastructure node:

    $ oc get nodes

    Example output

    NAME                                         STATUS   ROLES          AGE   VERSION
    ip-10-0-133-216.us-east-2.compute.internal   Ready    master         60m   v1.24.0
    ip-10-0-139-146.us-east-2.compute.internal   Ready    master         60m   v1.24.0
    ip-10-0-139-192.us-east-2.compute.internal   Ready    worker         51m   v1.24.0
    ip-10-0-139-241.us-east-2.compute.internal   Ready    worker         51m   v1.24.0
    ip-10-0-147-79.us-east-2.compute.internal    Ready    worker         51m   v1.24.0
    ip-10-0-152-241.us-east-2.compute.internal   Ready    master         60m   v1.24.0
    ip-10-0-139-48.us-east-2.compute.internal    Ready    infra          51m   v1.24.0

    Note that the node has a node-role.kubernetes.io/infra: '' label:

    $ oc get node ip-10-0-139-48.us-east-2.compute.internal -o yaml

    Example output

    kind: Node
    apiVersion: v1
    metadata:
      name: ip-10-0-139-48.us-east-2.compute.internal
      selfLink: /api/v1/nodes/ip-10-0-139-48.us-east-2.compute.internal
      uid: 62038aa9-661f-41d7-ba93-b5f1b6ef8751
      resourceVersion: '39083'
      creationTimestamp: '2020-04-13T19:07:55Z'
      labels:
        node-role.kubernetes.io/infra: ''
    ...

  • To move the Kibana pod, edit the ClusterLogging CR to add a node selector:

    apiVersion: logging.openshift.io/v1
    kind: ClusterLogging
    # ...
    spec:
    # ...
      visualization:
        kibana:
          nodeSelector: 1
            node-role.kubernetes.io/infra: ''
          proxy:
            resources: null
          replicas: 1
          resources: null
        type: kibana
    1
    Add a node selector to match the label in the node specification.
  • After you save the CR, the current Kibana pod is terminated and a new pod is deployed:

    $ oc get pods

    Example output

    NAME                                            READY   STATUS        RESTARTS   AGE
    cluster-logging-operator-84d98649c4-zb9g7       1/1     Running       0          29m
    elasticsearch-cdm-hwv01pf7-1-56588f554f-kpmlg   2/2     Running       0          28m
    elasticsearch-cdm-hwv01pf7-2-84c877d75d-75wqj   2/2     Running       0          28m
    elasticsearch-cdm-hwv01pf7-3-f5d95b87b-4nx78    2/2     Running       0          28m
    collector-42dzz                                 1/1     Running       0          28m
    collector-d74rq                                 1/1     Running       0          28m
    collector-m5vr9                                 1/1     Running       0          28m
    collector-nkxl7                                 1/1     Running       0          28m
    collector-pdvqb                                 1/1     Running       0          28m
    collector-tflh6                                 1/1     Running       0          28m
    kibana-5b8bdf44f9-ccpq9                         2/2     Terminating   0          4m11s
    kibana-7d85dcffc8-bfpfp                         2/2     Running       0          33s

  • The new pod is on the ip-10-0-139-48.us-east-2.compute.internal node:

    $ oc get pod kibana-7d85dcffc8-bfpfp -o wide

    Example output

    NAME                      READY   STATUS        RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
    kibana-7d85dcffc8-bfpfp   2/2     Running       0          43s   10.131.0.22   ip-10-0-139-48.us-east-2.compute.internal   <none>           <none>

  • After a few moments, the original Kibana pod is removed.

    $ oc get pods

    Example output

    NAME                                            READY   STATUS    RESTARTS   AGE
    cluster-logging-operator-84d98649c4-zb9g7       1/1     Running   0          30m
    elasticsearch-cdm-hwv01pf7-1-56588f554f-kpmlg   2/2     Running   0          29m
    elasticsearch-cdm-hwv01pf7-2-84c877d75d-75wqj   2/2     Running   0          29m
    elasticsearch-cdm-hwv01pf7-3-f5d95b87b-4nx78    2/2     Running   0          29m
    collector-42dzz                                 1/1     Running   0          29m
    collector-d74rq                                 1/1     Running   0          29m
    collector-m5vr9                                 1/1     Running   0          29m
    collector-nkxl7                                 1/1     Running   0          29m
    collector-pdvqb                                 1/1     Running   0          29m
    collector-tflh6                                 1/1     Running   0          29m
    kibana-7d85dcffc8-bfpfp                         2/2     Running   0          62s

12.1.3. Additional resources

12.2. Using taints and tolerations to control logging pod placement

Taints and tolerations allow nodes to control which pods should (or should not) be scheduled on them.

12.2.1. Understanding taints and tolerations

A taint allows a node to refuse to schedule a pod unless that pod has a matching toleration.

You apply taints to a node through the Node specification (NodeSpec) and apply tolerations to a pod through the Pod specification (PodSpec). When you apply a taint to a node, the scheduler cannot place a pod on that node unless the pod can tolerate the taint.

Example taint in a node specification

apiVersion: v1
kind: Node
metadata:
  name: my-node
#...
spec:
  taints:
  - effect: NoExecute
    key: key1
    value: value1
#...
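A taint such as this one is typically applied from the command line; for example, using the same key, value, and effect as above:

$ oc adm taint nodes my-node key1=value1:NoExecute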

Example toleration in a Pod spec

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
#...
spec:
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"
    tolerationSeconds: 3600
#...

Taints and tolerations consist of a key, value, and effect.

Table 12.1. Taint and toleration components

key

The key is any string, up to 253 characters. The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

value

The value is any string, up to 63 characters. The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

effect

The effect is one of the following:

NoSchedule [1]

  • New pods that do not match the taint are not scheduled onto that node.
  • Existing pods on the node remain.

PreferNoSchedule

  • New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to.
  • Existing pods on the node remain.

NoExecute

  • New pods that do not match the taint cannot be scheduled onto that node.
  • Existing pods on the node that do not have a matching toleration are removed.

operator

Equal

The key/value/effect parameters must match. This is the default.

Exists

The key/effect parameters must match. You must leave the value parameter blank, which matches any value.

  1. If you add a NoSchedule taint to a control plane node, the node must have the node-role.kubernetes.io/master=:NoSchedule taint, which is added by default.

    For example:

    apiVersion: v1
    kind: Node
    metadata:
      annotations:
        machine.openshift.io/machine: openshift-machine-api/ci-ln-62s7gtb-f76d1-v8jxv-master-0
        machineconfiguration.openshift.io/currentConfig: rendered-master-cdc1ab7da414629332cc4c3926e6e59c
      name: my-node
    #...
    spec:
      taints:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
    #...

A toleration matches a taint:

  • If the operator parameter is set to Equal:

    • the key parameters are the same;
    • the value parameters are the same;
    • the effect parameters are the same.
  • If the operator parameter is set to Exists:

    • the key parameters are the same;
    • the effect parameters are the same.
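For example, the following toleration uses the Exists operator, so it matches any taint with the key key1 and the NoExecute effect, regardless of the taint's value; a minimal sketch:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
#...
spec:
  tolerations:
  - key: "key1"
    operator: "Exists"
    effect: "NoExecute"
#...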

The following taints are built into OpenShift Container Platform:

  • node.kubernetes.io/not-ready: The node is not ready. This corresponds to the node condition Ready=False.
  • node.kubernetes.io/unreachable: The node is unreachable from the node controller. This corresponds to the node condition Ready=Unknown.
  • node.kubernetes.io/memory-pressure: The node has memory pressure issues. This corresponds to the node condition MemoryPressure=True.
  • node.kubernetes.io/disk-pressure: The node has disk pressure issues. This corresponds to the node condition DiskPressure=True.
  • node.kubernetes.io/network-unavailable: The node network is unavailable.
  • node.kubernetes.io/unschedulable: The node is unschedulable.
  • node.cloudprovider.kubernetes.io/uninitialized: When the node controller is started with an external cloud provider, this taint is set on a node to mark it as unusable. After a controller from the cloud-controller-manager initializes this node, the kubelet removes this taint.
  • node.kubernetes.io/pid-pressure: The node has pid pressure. This corresponds to the node condition PIDPressure=True.

    Important

    OpenShift Container Platform does not set a default pid.available evictionHard.

12.2.2. Using tolerations to control log store pod placement

By default, log store pods have the following toleration configurations:

Elasticsearch log store pods default tolerations

apiVersion: v1
kind: Pod
metadata:
  name: elasticsearch-example
  namespace: openshift-logging
spec:
# ...
  tolerations:
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
# ...

LokiStack log store pods default tolerations

apiVersion: v1
kind: Pod
metadata:
  name: lokistack-example
  namespace: openshift-logging
spec:
# ...
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
# ...

You can configure a toleration for log store pods by adding a taint and then modifying the tolerations syntax in the ClusterLogging custom resource (CR).

Prerequisites

  • You have installed the Red Hat OpenShift Logging Operator.
  • You have installed the OpenShift CLI (oc).
  • You have deployed an internal log store that is either Elasticsearch or LokiStack.

Procedure

  1. Add a taint to a node where you want to schedule the logging pods, by running the following command:

    $ oc adm taint nodes <node_name> <key>=<value>:<effect>

    Example command

    $ oc adm taint nodes node1 lokistack=node:NoExecute

    This example places a taint on node1 that has key lokistack, value node, and taint effect NoExecute. Nodes with the NoExecute effect schedule only pods that match the taint and remove existing pods that do not match.

  2. Edit the logstore section of the ClusterLogging CR to configure a toleration for the log store pods:

    Example ClusterLogging CR

    apiVersion: logging.openshift.io/v1
    kind: ClusterLogging
    metadata:
    # ...
    spec:
    # ...
      logStore:
        type: lokistack
        elasticsearch:
          nodeCount: 1
          tolerations:
          - key: lokistack 1
            operator: Exists 2
            effect: NoExecute 3
            tolerationSeconds: 6000 4
    # ...

    1
    Specify the key that you added to the node.
    2
    Specify the Exists operator to require a taint with the key lokistack to be present on the node.
    3
    Specify the NoExecute effect.
    4
    Optional: Specify the tolerationSeconds parameter to set how long a pod can remain bound to a node before being evicted.

This toleration matches the taint created by the oc adm taint command. A pod with this toleration can be scheduled onto node1.
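If you later need to remove the taint, appending a hyphen to the taint specification deletes it; for example, for the taint used above:

$ oc adm taint nodes node1 lokistack=node:NoExecute-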

12.2.3. Using tolerations to control the log visualizer pod placement

By using a taint with a specific key/value pair that other pods do not tolerate, you can ensure that only the Kibana pod can run on the specified node.

Prerequisites

  • You have installed the Red Hat OpenShift Logging Operator, the OpenShift Elasticsearch Operator, and the OpenShift CLI (oc).

Procedure

  1. Add a taint to a node where you want to schedule the log visualizer pod by running the following command:

    $ oc adm taint nodes <node_name> <key>=<value>:<effect>

    Example command

    $ oc adm taint nodes node1 kibana=node:NoExecute

    This example places a taint on node1 that has key kibana, value node, and taint effect NoExecute. You must use the NoExecute taint effect. NoExecute schedules only pods that match the taint and removes existing pods that do not match.

  2. Edit the visualization section of the ClusterLogging CR to configure a toleration for the Kibana pod:

    apiVersion: logging.openshift.io/v1
    kind: ClusterLogging
    metadata:
    # ...
    spec:
    # ...
      visualization:
        type: kibana
        kibana:
          tolerations:
          - key: kibana  1
            operator: Exists 2
            effect: NoExecute 3
            tolerationSeconds: 6000 4
          resources:
            limits:
              memory: 2Gi
            requests:
              cpu: 100m
              memory: 1Gi
          replicas: 1
    # ...
    1
    Specify the key that you added to the node.
    2
    Specify the Exists operator to require the key, value, and effect parameters to match.
    3
    Specify the NoExecute effect.
    4
    Optionally, specify the tolerationSeconds parameter to set how long a pod can remain bound to a node before being evicted.

This toleration matches the taint created by the oc adm taint command. A pod with this toleration can be scheduled onto node1.
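To confirm the placement, you can check that the taint is present on the node and that the Kibana pod has been rescheduled; for example:

$ oc describe node node1 | grep Taints
$ oc get pods -n openshift-logging -o wide | grep kibana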

12.2.4. Using tolerations to control log collector pod placement

By default, log collector pods have the following tolerations configuration:

apiVersion: v1
kind: Pod
metadata:
  name: collector-example
  namespace: openshift-logging
spec:
# ...
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
# ...

Prerequisites

  • You have installed the Red Hat OpenShift Logging Operator and OpenShift CLI (oc).

Procedure

  1. Add a taint to a node where you want to schedule the logging collector pods by running the following command:

    $ oc adm taint nodes <node_name> <key>=<value>:<effect>

    Example command

    $ oc adm taint nodes node1 collector=node:NoExecute

    This example places a taint on node1 that has key collector, value node, and taint effect NoExecute. You must use the NoExecute taint effect. NoExecute schedules only pods that match the taint and removes existing pods that do not match.

  2. Edit the collection stanza of the ClusterLogging custom resource (CR) to configure a toleration for the logging collector pods:

    apiVersion: logging.openshift.io/v1
    kind: ClusterLogging
    metadata:
    # ...
    spec:
    # ...
      collection:
        type: vector
        tolerations:
        - key: collector 1
          operator: Exists 2
          effect: NoExecute 3
          tolerationSeconds: 6000 4
        resources:
          limits:
            memory: 2Gi
          requests:
            cpu: 100m
            memory: 1Gi
    # ...
    1
    Specify the key that you added to the node.
    2
    Specify the Exists operator to require the key/value/effect parameters to match.
    3
    Specify the NoExecute effect.
    4
    Optionally, specify the tolerationSeconds parameter to set how long a pod can remain bound to a node before being evicted.

This toleration matches the taint created by the oc adm taint command. A pod with this toleration can be scheduled onto node1.
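Because the collector runs on every node that its tolerations and any node selectors allow, you can check which nodes the collector pods land on after the change; for example:

$ oc get pods -n openshift-logging -o wide | grep collector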

12.2.5. Additional resources