Red Hat Training

A Red Hat training course is available for OpenShift Container Platform

Chapter 16. Scheduling

16.1. Overview

16.1.1. Overview

Pod scheduling is an internal process that determines placement of new pods onto nodes within the cluster.

The scheduler code has a clean separation that watches new pods as they get created and identifies the most suitable node to host them. It then creates bindings (pod to node bindings) for the pods using the master API.

16.1.2. Default scheduling

OpenShift Container Platform comes with a default scheduler that serves the needs of most users. The default scheduler uses both inherent and customizable tools to determine the best fit for a pod.

For information on how the default scheduler determines pod placement and available customizable parameters, see Default Scheduling.

16.1.3. Advanced scheduling

In situations where you might want more control over where new pods are placed, the OpenShift Container Platform advanced scheduling features allow you to configure a pod so that the pod is required to (or has a preference to) run on a particular node or alongside a specific pod. Advanced scheduling also allows you to prevent pods from being placed on a node or with another pod.

For information about advanced scheduling, see Advanced Scheduling.

16.1.4. Custom scheduling

OpenShift Container Platform also allows you to use your own or third-party schedulers by editing the pod specification.

For more information, see Custom Schedulers.

16.2. Default Scheduling

16.2.1. Overview

The default OpenShift Container Platform pod scheduler is responsible for determining placement of new pods onto nodes within the cluster. It reads data from the pod and tries to find a node that is a good fit based on configured policies. It is completely independent and exists as a standalone/pluggable solution. It does not modify the pod and just creates a binding for the pod that ties the pod to the particular node.

16.2.2. Generic Scheduler

The existing generic scheduler is the default platform-provided scheduler engine that selects a node to host the pod in a three-step operation:

16.2.3. Filter the Nodes

The available nodes are filtered based on the constraints or requirements specified. This is done by running each node through the list of filter functions called predicates.

16.2.3.1. Prioritize the Filtered List of Nodes

This is achieved by passing each node through a series of priority functions that assign it a score between 0 - 10, with 0 indicating a bad fit and 10 indicating a good fit to host the pod. The scheduler configuration can also take in a simple weight (positive numeric value) for each priority function. The node score provided by each priority function is multiplied by the weight (default weight for most priorities is 1) and then combined by adding the scores for each node provided by all the priorities. This weight attribute can be used by administrators to give higher importance to some priorities.

16.2.3.2. Select the Best Fit Node

The nodes are sorted based on their scores and the node with the highest score is selected to host the pod. If multiple nodes have the same high score, then one of them is selected at random.

16.2.4. Scheduler Policy

The selection of the predicate and priorities defines the policy for the scheduler.

The scheduler configuration file is a JSON file that specifies the predicates and priorities the scheduler will consider.

In the absence of the scheduler policy file, the default configuration file, /etc/origin/master/scheduler.json, gets applied.

Important

The predicates and priorities defined in the scheduler configuration file completely override the default scheduler policy. If any of the default predicates and priorities are required, you must explicitly specify the functions in the scheduler configuration file.

Default scheduler configuration file

{
    "apiVersion": "v1",
    "kind": "Policy",
    "predicates": [
        {
            "name": "NoVolumeZoneConflict"
        },
        {
            "name": "MaxEBSVolumeCount"
        },
        {
            "name": "MaxGCEPDVolumeCount"
        },
        {
            "name": "MaxAzureDiskVolumeCount"
        },
        {
            "name": "MatchInterPodAffinity"
        },
        {
            "name": "NoDiskConflict"
        },
        {
            "name": "GeneralPredicates"
        },
        {
            "name": "PodToleratesNodeTaints"
        },
        {
            "argument": {
                "serviceAffinity": {
                    "labels": [
                        "region"
                    ]
                }
            },
            "name": "Region"

         }
    ],
    "priorities": [
        {
            "name": "SelectorSpreadPriority",
            "weight": 1
        },
        {
            "name": "InterPodAffinityPriority",
            "weight": 1
        },
        {
            "name": "LeastRequestedPriority",
            "weight": 1
        },
        {
            "name": "BalancedResourceAllocation",
            "weight": 1
        },
        {
            "name": "NodePreferAvoidPodsPriority",
            "weight": 10000
        },
        {
            "name": "NodeAffinityPriority",
            "weight": 1
        },
        {
            "name": "TaintTolerationPriority",
            "weight": 1
        },
        {
            "argument": {
                "serviceAntiAffinity": {
                    "label": "zone"
                }
            },
            "name": "Zone",
            "weight": 2
        }
    ]
}

16.2.4.1. Modifying Scheduler Policy

The scheduler policy is defined in a file on the master, named /etc/origin/master/scheduler.json by default, unless overridden by the kubernetesMasterConfig.schedulerConfigFile field in the master configuration file.

Sample modified scheduler configuration file

kind: "Policy"
version: "v1"
"predicates": [
        {
            "name": "PodFitsResources"
        },
        {
            "name": "NoDiskConflict"
        },
        {
            "name": "MatchNodeSelector"
        },
        {
            "name": "HostName"
        },
        {
            "argument": {
                "serviceAffinity": {
                    "labels": [
                        "region"
                    ]
                }
            },
            "name": "Region"
        }
    ],
    "priorities": [
        {
            "name": "LeastRequestedPriority",
            "weight": 1
        },
        {
            "name": "BalancedResourceAllocation",
            "weight": 1
        },
        {
            "name": "ServiceSpreadingPriority",
            "weight": 1
        },
        {
            "argument": {
                "serviceAntiAffinity": {
                    "label": "zone"
                }
            },
            "name": "Zone",
            "weight": 2
        }
    ]

To modify the scheduler policy:

  1. Edit the scheduler configuration file to configure the desired default predicates and priorities. You can create a custom configuration, or use and modify one of the sample policy configurations.
  2. Add any configurable predicates and configurable priorities you require.
  3. Restart the OpenShift Container Platform for the changes to take effect.

    # master-restart api
    # master-restart controllers

16.2.5. Available Predicates

Predicates are rules that filter out unqualified nodes.

There are several predicates provided by default in OpenShift Container Platform. Some of these predicates can be customized by providing certain parameters. Multiple predicates can be combined to provide additional filtering of nodes.

16.2.5.1. Static Predicates

These predicates do not take any configuration parameters or inputs from the user. These are specified in the scheduler configuration using their exact name.

16.2.5.1.1. Default Predicates

The default scheduler policy includes the following predicates:

NoVolumeZoneConflict checks that the volumes a pod requests are available in the zone.

{"name" : "NoVolumeZoneConflict"}

MaxEBSVolumeCount checks the maximum number of volumes that can be attached to an AWS instance.

{"name" : "MaxEBSVolumeCount"}

MaxGCEPDVolumeCount checks the maximum number of Google Compute Engine (GCE) Persistent Disks (PD).

{"name" : "MaxGCEPDVolumeCount"}

MatchInterPodAffinity checks if the pod affinity/antiaffinity rules permit the pod.

{"name" : "MatchInterPodAffinity"}

NoDiskConflict checks if the volume requested by a pod is available.

{"name" : "NoDiskConflict"}

PodToleratesNodeTaints checks if a pod can tolerate the node taints.

{"name" : "PodToleratesNodeTaints"}
16.2.5.1.2. Other Static Predicates

OpenShift Container Platform also supports the following predicates:

CheckVolumeBinding evaluates if a pod can fit based on the volumes, it requests, for both bound and unbound PVCs. * For PVCs that are bound, the predicate checks that the corresponding PV’s node affinity is satisfied by the given node. * For PVCs that are unbound, the predicate searched for available PVs that can satisfy the PVC requirements and that the PV node affinity is satisfied by the given node.

The predicate returns true if all bound PVCs have compatible PVs with the node, and if all unbound PVCs can be matched with an available and node-compatible PV.

{"name" : "CheckVolumeBinding"}

The CheckVolumeBinding predicate must be enabled in non-default schedulers.

CheckNodeCondition checks if a pod can be scheduled on a node reporting out of disk, network unavailable, or not ready conditions.

{"name" : "CheckNodeCondition"}

PodToleratesNodeNoExecuteTaints checks if a pod tolerations can tolerate a node NoExecute taints.

{"name" : "PodToleratesNodeNoExecuteTaints"}

CheckNodeLabelPresence checks if all of the specified labels exist on a node, regardless of their value.

{"name" : "CheckNodeLabelPresence"}

checkServiceAffinity checks that ServiceAffinity labels are homogeneous for pods that are scheduled on a node.

{"name" : "checkServiceAffinity"}

MaxAzureDiskVolumeCount checks the maximum number of Azure Disk Volumes.

{"name" : "MaxAzureDiskVolumeCount"}

16.2.5.2. General Predicates

The following general predicates check whether non-critical predicates and essential predicates pass. Non-critical predicates are the predicates that only non-critical pods need to pass and essential predicates are the predicates that all pods need to pass.

The default scheduler policy includes the general predicates.

Non-critical general predicates

PodFitsResources determines a fit based on resource availability (CPU, memory, GPU, and so forth). The nodes can declare their resource capacities and then pods can specify what resources they require. Fit is based on requested, rather than used resources.

{"name" : "PodFitsResources"}
Essential general predicates

PodFitsHostPorts determines if a node has free ports for the requested pod ports (absence of port conflicts).

{"name" : "PodFitsHostPorts"}

HostName determines fit based on the presence of the Host parameter and a string match with the name of the host.

{"name" : "HostName"}

MatchNodeSelector determines fit based on node selector (nodeSelector) queries defined in the pod.

{"name" : "MatchNodeSelector"}

16.2.5.3. Configurable Predicates

You can configure these predicates in the scheduler configuration, by default /etc/origin/master/scheduler.json, to add labels to affect how the predicate functions.

Since these are configurable, multiple predicates of the same type (but different configuration parameters) can be combined as long as their user-defined names are different.

For information on using these priorities, see Modifying Scheduler Policy.

ServiceAffinity places pods on nodes based on the service running on that pod. Placing pods of the same service on the same or co-located nodes can lead to higher efficiency.

This predicate attempts to place pods with specific labels in its node selector on nodes that have the same label.

If the pod does not specify the labels in its node selector, then the first pod is placed on any node based on availability and all subsequent pods of the service are scheduled on nodes that have the same label values as that node.

"predicates":[
      {
         "name":"<name>", 1
         "argument":{
            "serviceAffinity":{
               "labels":[
                  "<label>" 2
               ]
            }
         }
      }
   ],
1
Specify a name for the predicate.
2
Specify a label to match.

For example:

        "name":"ZoneAffinity",
        "argument":{
            "serviceAffinity":{
                "labels":[
                    "rack"
                ]
            }
        }

For example. if the first pod of a service had a node selector rack was scheduled to a node with label region=rack, all the other subsequent pods belonging to the same service will be scheduled on nodes with the same region=rack label. For more information, see Controlling Pod Placement.

Multiple-level labels are also supported. Users can also specify all pods for a service to be scheduled on nodes within the same region and within the same zone (under the region).

The labelsPresence parameter checks whether a particular node has a specific label. The labels create node groups that the LabelPreference priority uses. Matching by label can be useful, for example, where nodes have their physical location or status defined by labels.

"predicates":[
      {
         "name":"<name>", 1
         "argument":{
            "labelsPresence":{
               "labels":[
                  "<label>" 2
                ],
                "presence": true 3
            }
         }
      }
   ],
1
Specify a name for the predicate.
2
Specify a label to match.
3
Specify whether the labels are required, either true or false.
  • For presence:false, if any of the requested labels are present in the node labels, the pod cannot be scheduled. If the labels are not present, the pod can be scheduled.
  • For presence:true, if all of the requested labels are present in the node labels, the pod can be scheduled. If all of the labels are not present, the pod is not scheduled.

For example:

        "name":"RackPreferred",
        "argument":{
            "labelsPresence":{
                "labels":[
                    "rack",
                    "region"
                ],
                "presence": true
            }
        }

16.2.6. Available Priorities

Priorities are rules that rank remaining nodes according to preferences.

A custom set of priorities can be specified to configure the scheduler. There are several priorities provided by default in OpenShift Container Platform. Other priorities can be customized by providing certain parameters. Multiple priorities can be combined and different weights can be given to each in order to impact the prioritization.

16.2.6.1. Static Priorities

Static priorities do not take any configuration parameters from the user, except weight. A weight is required to be specified and cannot be 0 or negative.

These are specified in the scheduler configuration, by default /etc/origin/master/scheduler.json.

16.2.6.1.1. Default Priorities

The default scheduler policy includes the following priorities. Each of the priority function has a weight of 1 except NodePreferAvoidPodsPriority, which has a weight of 10000.

SelectorSpreadPriority looks for services, replication controllers (RC), replication sets (RS), and stateful sets that match the pod, then finds existing pods that match those selectors. The scheduler favors nodes that have fewer existing matching pods. Then, it schedules the pod on a node with the smallest number of pods that match those selectors as the pod being scheduled.

{"name" : "SelectorSpreadPriority", "weight" : 1}

InterPodAffinityPriority computes a sum by iterating through the elements of weightedPodAffinityTerm and adding weight to the sum if the corresponding PodAffinityTerm is satisfied for that node. The node(s) with the highest sum are the most preferred.

{"name" : "InterPodAffinityPriority", "weight" : 1}

LeastRequestedPriority favors nodes with fewer requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes nodes that have the highest available/remaining capacity.

{"name" : "LeastRequestedPriority", "weight" : 1}

BalancedResourceAllocation favors nodes with balanced resource usage rate. It calculates the difference between the consumed CPU and memory as a fraction of capacity, and prioritizes the nodes based on how close the two metrics are to each other. This should always be used together with LeastRequestedPriority.

{"name" : "BalancedResourceAllocation", "weight" : 1}

NodePreferAvoidPodsPriority ignores pods that are owned by a controller other than a replication controller.

{"name" : "NodePreferAvoidPodsPriority", "weight" : 10000}

NodeAffinityPriority prioritizes nodes according to node affinity scheduling preferences

{"name" : "NodeAffinityPriority", "weight" : 1}

TaintTolerationPriority prioritizes nodes that have a fewer number of intolerable taints on them for a pod. An intolerable taint is one which has key PreferNoSchedule.

{"name" : "TaintTolerationPriority", "weight" : 1}
16.2.6.1.2. Other Static Priorities

OpenShift Container Platform also supports the following priorities:

EqualPriority gives an equal weight of 1 to all nodes, if no priority configurations are provided. We recommend using this priority only for testing environments.

{"name" : "EqualPriority", "weight" : 1}

MostRequestedPriority prioritizes nodes with most requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes based on the maximum of the average of the fraction of requested to capacity.

{"name" : "MostRequestedPriority", "weight" : 1}

ImageLocalityPriority prioritizes nodes that already have requested pod container’s images.

{"name" : "ImageLocalityPriority", "weight" : 1}

ServiceSpreadingPriority spreads pods by minimizing the number of pods belonging to the same service onto the same machine.

{"name" : "ServiceSpreadingPriority", "weight" : 1}

16.2.6.2. Configurable Priorities

You can configure these priorities in the scheduler configuration, by default /etc/origin/master/scheduler.json, to add labels to affect how the priorities.

The type of the priority function is identified by the argument that they take. Since these are configurable, multiple priorities of the same type (but different configuration parameters) can be combined as long as their user-defined names are different.

For information on using these priorities, see Modifying Scheduler Policy.

ServiceAntiAffinity takes a label and ensures a good spread of the pods belonging to the same service across the group of nodes based on the label values. It gives the same score to all nodes that have the same value for the specified label. It gives a higher score to nodes within a group with the least concentration of pods.

"priorities":[
    {
        "name":"<name>", 1
        "weight" : 1 2
        "argument":{
            "serviceAntiAffinity":{
                "label":[
                    "<label>" 3
                ]
            }
        }
    }
]
1
Specify a name for the priority.
2
Specify a weight. Enter a non-zero positive value.
3
Specify a label to match.

For example:

        "name":"RackSpread", 1
        "weight" : 1 2
        "argument":{
            "serviceAntiAffinity":{
                "label": "rack" 3
            }
        }
1
Specify a name for the priority.
2
Specify a weight. Enter a non-zero positive value.
3
Specify a label to match.
Note

In some situations using ServiceAntiAffinity based on custom labels does not spread pod as expected. See this Red Hat Solution.

*The labelPreference parameter gives priority based on the specified label. If the label is present on a node, that node is given priority. If no label is specified, priority is given to nodes that do not have a label.

"priorities":[
    {
        "name":"<name>", 1
        "weight" : 1, 2
        "argument":{
            "labelPreference":{
                "label": "<label>", 3
                "presence": true 4
            }
        }
    }
]
1
Specify a name for the priority.
2
Specify a weight. Enter a non-zero positive value.
3
Specify a label to match.
4
Specify whether the label is required, either true or false.

16.2.7. Use Cases

One of the important use cases for scheduling within OpenShift Container Platform is to support flexible affinity and anti-affinity policies.

16.2.7.1. Infrastructure Topological Levels

Administrators can define multiple topological levels for their infrastructure (nodes) by specifying labels on nodes (e.g., region=r1, zone=z1, rack=s1).

These label names have no particular meaning and administrators are free to name their infrastructure levels anything (eg, city/building/room). Also, administrators can define any number of levels for their infrastructure topology, with three levels usually being adequate (such as: regionszonesracks). Administrators can specify affinity and anti-affinity rules at each of these levels in any combination.

16.2.7.2. Affinity

Administrators should be able to configure the scheduler to specify affinity at any topological level, or even at multiple levels. Affinity at a particular level indicates that all pods that belong to the same service are scheduled onto nodes that belong to the same level. This handles any latency requirements of applications by allowing administrators to ensure that peer pods do not end up being too geographically separated. If no node is available within the same affinity group to host the pod, then the pod is not scheduled.

If you need greater control over where the pods are scheduled, see Using Node Affinity and Using Pod Affinity and Anti-affinity. These advanced scheduling features allow administrators to specify which node a pod can be scheduled on and to force or reject scheduling relative to other pods.

16.2.7.3. Anti Affinity

Administrators should be able to configure the scheduler to specify anti-affinity at any topological level, or even at multiple levels. Anti-affinity (or 'spread') at a particular level indicates that all pods that belong to the same service are spread across nodes that belong to that level. This ensures that the application is well spread for high availability purposes. The scheduler tries to balance the service pods across all applicable nodes as evenly as possible.

If you need greater control over where the pods are scheduled, see Using Node Affinity and Using Pod Affinity and Anti-affinity. These advanced scheduling features allow administrators to specify which node a pod can be scheduled on and to force or reject scheduling relative to other pods.

16.2.8. Sample Policy Configurations

The configuration below specifies the default scheduler configuration, if it were to be specified via the scheduler policy file.

kind: "Policy"
version: "v1"
predicates:
...
  - name: "RegionZoneAffinity" 1
    argument:
      serviceAffinity: 2
        labels: 3
          - "region"
          - "zone"
priorities:
...
  - name: "RackSpread" 4
    weight: 1
    argument:
      serviceAntiAffinity: 5
        label: "rack" 6
1
The name for the predicate.
2
3
The labels for the predicate.
4
The name for the priority.
5
6
The labels for the priority.

In all of the sample configurations below, the list of predicates and priority functions is truncated to include only the ones that pertain to the use case specified. In practice, a complete/meaningful scheduler policy should include most, if not all, of the default predicates and priorities listed above.

The following example defines three topological levels, region (affinity) → zone (affinity) → rack (anti-affinity):

kind: "Policy"
version: "v1"
predicates:
...
  - name: "RegionZoneAffinity"
    argument:
      serviceAffinity:
        labels:
          - "region"
          - "zone"
priorities:
...
  - name: "RackSpread"
    weight: 1
    argument:
      serviceAntiAffinity:
        label: "rack"

The following example defines three topological levels, city (affinity) → building (anti-affinity) → room (anti-affinity):

kind: "Policy"
version: "v1"
predicates:
...
  - name: "CityAffinity"
    argument:
      serviceAffinity:
        labels:
          - "city"
priorities:
...
  - name: "BuildingSpread"
    weight: 1
    argument:
      serviceAntiAffinity:
        label: "building"
  - name: "RoomSpread"
    weight: 1
    argument:
      serviceAntiAffinity:
        label: "room"

The following example defines a policy to only use nodes with the 'region' label defined and prefer nodes with the 'zone' label defined:

kind: "Policy"
version: "v1"
predicates:
...
  - name: "RequireRegion"
    argument:
      labelsPresence:
        labels:
          - "region"
        presence: true
priorities:
...
  - name: "ZonePreferred"
    weight: 1
    argument:
      labelPreference:
        label: "zone"
        presence: true

The following example combines both static and configurable predicates and also priorities:

kind: "Policy"
version: "v1"
predicates:
...
  - name: "RegionAffinity"
    argument:
      serviceAffinity:
        labels:
          - "region"
  - name: "RequireRegion"
    argument:
      labelsPresence:
        labels:
          - "region"
        presence: true
  - name: "BuildingNodesAvoid"
    argument:
      labelsPresence:
        labels:
          - "building"
        presence: false
  - name: "PodFitsPorts"
  - name: "MatchNodeSelector"
priorities:
...
  - name: "ZoneSpread"
    weight: 2
    argument:
      serviceAntiAffinity:
        label: "zone"
  - name: "ZonePreferred"
    weight: 1
    argument:
      labelPreference:
        label: "zone"
        presence: true
  - name: "ServiceSpreadingPriority"
    weight: 1

16.3. Descheduling

16.3.1. Overview

Descheduling involves evicting pods based on specific policies so that the pods can be rescheduled onto more appropriate nodes.

Your cluster can benefit from descheduling and rescheduling already-running pods for various reasons:

  • Nodes are under- or over-utilized.
  • Pod and node affinity requirements, such as taints or labels, have changed and the original scheduling decisions are no longer appropriate for certain nodes.
  • Node failure requires pods to be moved.
  • New nodes are added to clusters.

The descheduler does not schedule replacement of evicted pods. The scheduler automatically performs this task for the evicted pods.

It is important to note that there are a number of core components, such as DNS, that are critical to a fully functional cluster, but, run on a regular cluster node rather than the master. A cluster may stop working properly if the component is evicted. To prevent the descheduler from removing these pods, configure the pod as a critical pod by adding the scheduler.alpha.kubernetes.io/critical-pod annotation to the pod specification.

Note

The descheduler job is considered a critical pod, which prevents the descheduler pod from being evicted by the descheduler.

The descheduler job and descheduler pod are created in the kube-system project, which is created by default.

Important

The descheduler is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs), might not be functionally complete, and Red Hat does not recommend to use them for production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information on Red Hat Technology Preview features support scope, see https://access.redhat.com/support/offerings/techpreview/.

The descheduler does not evict the following types of pods:

  • Critical pods (with the scheduler.alpha.kubernetes.io/critical-pod annotation).
  • Pods (static and mirror pods or pods in standalone mode) not associated with a Replica Set, Replication Controller, Deployment, or Job (because these pods are not recreated).
  • Pods associated with DaemonSets.
  • Pods with local storage.
  • Pods subject to Pod Disruption Budget (PDB) are not evicted if descheduling violates the PDB. The pods can be evicted using an eviction policy.
Note

Best efforts pods are evicted before Burstable and Guaranteed pods.

The following sections describe the process to configure and run the descheduler:

16.3.2. Creating a Cluster Role

To configure the necessary permissions for the descheduler to work in a pod:

  1. Create a cluster role with the following rules:

    kind: ClusterRole
    apiVersion: rbac.authorization.k8s.io/v1beta1
    metadata:
      name: descheduler-cluster-role
    rules:
    - apiGroups: [""]
      resources: ["nodes"]
      verbs: ["get", "watch", "list"] 1
    - apiGroups: [""]
      resources: ["pods"]
      verbs: ["get", "watch", "list", "delete"] 2
    - apiGroups: [""]
      resources: ["pods/eviction"] 3
      verbs: ["create"]
    1
    Configures the role to allow viewing nodes.
    2
    Configures the role to allow viewing and deleting pods.
    3
    Allows a node to evict pods bound to itself.
  2. Create the service account which will be used to run the job:

    # oc create sa <file-name>.yaml -n kube-system

    For example:

    # oc create sa descheduler-sa.yaml -n kube-system
  3. Bind the cluster role to the service account:

    # oc create clusterrolebinding descheduler-cluster-role-binding \
        --clusterrole=<cluster-role-name> \
        --serviceaccount=kube-system:<service-account-name>

    For example:

    # oc create clusterrolebinding descheduler-cluster-role-binding \
        --clusterrole=descheduler-cluster-role \
        --serviceaccount=kube-system:descheduler-sa

16.3.3. Creating Descheduler Policies

You can configure the descheduler to remove pods from nodes that violate rules defined by strategies in a YAML policy file. You then create a configuration map that includes a path to the policy file and a job specification using that configuration map to apply the specific descheduling strategy.

Sample descheduler policy file

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
     enabled: false
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds:
           "cpu" : 20
           "memory": 20
           "pods": 20
         targetThresholds:
           "cpu" : 50
           "memory": 50
           "pods": 50
         numberOfNodes: 3
  "RemovePodsViolatingInterPodAntiAffinity":
     enabled: true
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    params:
      nodeAffinityType:
      - "requiredDuringSchedulingIgnoredDuringExecution"

There are three default strategies that can be used with the descheduler:

You can configure and disable parameters associated with strategies as needed.

16.3.3.1. Removing Duplicate Pods

The RemoveDuplicates strategy ensures that there is only one pod associated with a Replica Set, Replication Controller, Deployment Configuration, or Job running on same node. If there are other pods associated with those objects, the duplicate pods are evicted. Removing duplicate pods results in better spreading of pods in a cluster.

For example, duplicate pods could happen if a node fails and the pods on the node are moved to another node, leading to more than one pod associated with an Replica Set or Replication Controller, running on same node. After the failed node is ready again, this strategy could be used to evict those duplicate pods.

There are no parameters associated with this strategy.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
     enabled: false 1
1
Set this value to enabled: true to use this policy. Set to false to disable this policy.

16.3.3.2. Creating a Low Node Utilization Policy

The LowNodeUtilization strategy finds nodes that are underutilized and evicts pods from other nodes so that the evicted pods can be scheduled on these underutilized nodes.

The underutilization of nodes is determined by a configurable threshold, thresholds, for CPU, memory, or number of pods (based on percentage). If a node usage is below all these thresholds, the node is considered underutilized and the descheduler can evict pods from other nodes. Pods request resource requirements are considered when computing node resource utilization.

A high threshold value, targetThresholds is used to determine properly utilized nodes. Any node that is between the thresholds and targetThresholds is considered properly utilized and is not considered for eviction. The threshold, targetThresholds, can be configured for CPU, memory, and number of pods (based on percentage).

These thresholds could be tuned for your cluster requirements.

The numberOfNodes parameter can be configured to activate the strategy only when number of underutilized nodes is above the configured value. Set this parameter if it is acceptable for a few nodes to go underutilized. By default, numberOfNodes is set to zero.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds: 1
           "cpu" : 20
           "memory": 20
           "pods": 20
         targetThresholds: 2
           "cpu" : 50
           "memory": 50
           "pods": 50
         numberOfNodes: 3 3
1
Set the low-end threshold. If the node is below all three values, the descheduler considers the node underutilized.
2
Set the high-end threshold. If the node is below these values and above the threshold values, the descheduler considers the node properly utilized.
3
Set the number of nodes that can be underutilized before the descheduler will evict pods from underutilized nodes.

16.3.3.3. Remove Pods Violating Inter-Pod Anti-Affinity

The RemovePodsViolatingInterPodAntiAffinity strategy ensures that pods violating inter-pod anti-affinity are removed from nodes.

For example, Node1 has podA, podB, and podC. podB and podC have anti-affinity rules that prohibit them from running on the same node as podA. podA will be evicted from the node so that podB and podC can run on that node. This situation could happen if the anti-affinity rule was applied when podB and podC were running on the node.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingInterPodAntiAffinity": 1
     enabled: true
1
Set this value to enabled: true to use this policy. Set to false to disable this policy.

16.3.3.4. Remove Pods Violating Node Affinity

The RemovePodsViolatingNodeAffinity strategy ensures that all pods violating node affinity are removed from nodes. This situation could occur if a node no longer satisfies a pod’s affinity rule. If another node is available that satisfies the affinity rule, the pod is evicted.

For example, podA is scheduled on nodeA, because the node satisfies the requiredDuringSchedulingIgnoredDuringExecution node affinity rule at the time of scheduling.If nodeA stops satisfying the rule, and if there is another node available that satisfies the node affinity rule, the strategy evicts podA from nodeA and moves it to the other node.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeAffinity": 1
    enabled: true
    params:
      nodeAffinityType:
      - "requiredDuringSchedulingIgnoredDuringExecution" 2
1
Set this value to enabled: true to use this policy. Set to false to disable this policy.
2
Specify the requiredDuringSchedulingIgnoredDuringExecution node affinity type.

16.3.4. Create a Configuration Map for the Descheduler Policy

Create a configuration map for the descheduler policy file in the kube-system project, so that it can be referenced by the descheduler job.

# oc create configmap descheduler-policy-configmap \
     -n kube-system --from-file=<path-to-policy-dir/policy.yaml> 1
1
The path to the policy file you created.

16.3.5. Create the Job Specification

Create a job configuration for the descheduler.

apiVersion: batch/v1
kind: Job
metadata:
  name: descheduler-job
  namespace: kube-system
spec:
  parallelism: 1
  completions: 1
  template:
    metadata:
      name: descheduler-pod 1
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: "true" 2
    spec:
        containers:
        - name: descheduler
          image: registry.access.redhat.com/openshift3/ose-descheduler
          volumeMounts: 3
          - mountPath: /policy-dir
            name: policy-volume
          command:
          - "/bin/sh"
          - "-ec"
          - |
            /bin/descheduler --policy-config-file /policy-dir/policy.yaml 4
        restartPolicy: "Never"
        serviceAccountName: descheduler-sa 5
        volumes:
        - name: policy-volume
          configMap:
            name: descheduler-policy-configmap
1
Specify a name for the job.
2
Configures the pod so that it will not be descheduled.
3
The volume name and mount path in the container where the job should be mounted.
4
Path in the container where the policy file you created will be stored.
5
Specify the name of the service account you created.

The policy file is mounted as a volume from the configuration map.

16.3.6. Run the Descheduler

To run the descheduler as a job in a pod:

# oc create -f <file-name>.yaml

For example:

# oc create -f descheduler-job.yaml

16.4. Custom Scheduling

16.4.1. Overview

You can run multiple, custom schedulers alongside the default scheduler and configure which scheduler to use for each pods.

To schedule a given pod using a specific scheduler, specify the name of the scheduler in that pod specification.

Note

Information on how to create the scheduler binary is outside the scope of this document. For an example, see Configure Multiple Schedulers in the Kubernetes documentation.

16.4.2. Package the Scheduler

The general process for including a custom scheduler in your cluster involves creating an image and including that image in a deployment.

  1. Package your scheduler binary into a container image.
  2. Create a container image containing the scheduler binary.

    For example:

    FROM <source-image>
    ADD <path-to-binary> /usr/local/bin/kube-scheduler
  3. Save the file as Dockerfile, build the image, and push it to a registry.

    For example:

    docker build -t <dest_env_registry_ip>:<port>/<namespace>/<image name>:<tag>
    docker push <dest_env_registry_ip>:<port>/<namespace>/<image name>:<tag>
  4. In OpenShift Container Platform, create a deployment for the custom scheduler.

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: custom-scheduler
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: custom-scheduler
    subjects:
    - kind: ServiceAccount
      name: custom-scheduler
      namespace: kube-system
    roleRef:
      kind: ClusterRole
      name: system:kube-scheduler
      apiGroup: rbac.authorization.k8s.io
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: custom-scheduler
      namespace: kube-system
      labels:
        app: custom-scheduler
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: custom-scheduler
      template:
        metadata:
          labels:
            app: custom-scheduler
        spec:
          serviceAccount: custom-scheduler
          containers:
            - name: custom-scheduler
              image: "<namespace>/<image name>:<tag>" 1
              imagePullPolicy: Always
    1
    Specify the container image you created for the custom scheduler.

16.4.3. Deploying Pods using a Custom Scheduler

After your custom scheduler is deployed in your cluster, you can configure pods to use that scheduler instead of the default scheduler.

  1. Create or edit a pod configuration and specify the name of the scheduler with the schedulerName parameter. The name must be unique.

    Sample pod specification with scheduler

    apiVersion: v1
    kind: Pod
    metadata:
      name: custom-scheduler-example
      labels:
        name: custom-scheduler-example
    spec:
      schedulerName: custom-scheduler 1
      containers:
      - name: pod-with-second-annotation-container
        image: docker.io/ocpqe/hello-pod

    1
    The name of the scheduler to use. When no scheduler name is supplied, the pod is automatically scheduled using the default scheduler.
  2. Run the following command to create the pod:

    $ oc create -f <file-name>.yaml

    For example:

    $ oc create -f custom-scheduler-example.yaml
  3. Run the following command to check that the pod was created:

    $ oc get pod <file-name>

    For example:

    $ oc get pod custom-scheduler-example
    
    NAME                       READY     STATUS    RESTARTS   AGE
    custom-scheduler-example   1/1       Running   0          4m
  4. Run the following command to check that the custom scheduler scheduled the pod:

    $ oc describe pod <pod-name>

    For example:

    $ oc describe pod custom-scheduler-example

    The name of the scheduler is listed, as shown in the following truncated output:

    ...
    
    Events:
      FirstSeen  LastSeen  Count  From                SubObjectPath  Type       Reason Message
      ---------  --------  -----  ----                -------------  --------   ------ -------
      1m         1m        1      custom-scheduler    Normal         Scheduled  Successfully assigned custom-scheduler to <$node1>
    
    ...

16.5. Controlling Pod Placement

16.5.1. Overview

As a cluster administrator, you can set a policy to prevent application developers with certain roles from targeting specific nodes when scheduling pods.

The Pod Node Constraints admission controller ensures that pods are deployed onto only specified node hosts using labels and prevents users without a specific role from using the nodeSelector field to schedule pods.

16.5.2. Constraining Pod Placement Using Node Name

Use the Pod Node Constraints admission controller to ensure a pod is deployed onto only a specified node host by assigning it a label and specifying this in the nodeName setting in a pod configuration.

  1. Ensure you have the desired labels (see Updating Labels on Nodes for details) and node selector set up in your environment.

    For example, make sure that your pod configuration features the nodeName value indicating the desired label:

    apiVersion: v1
    kind: Pod
    spec:
      nodeName: <value>
  2. Modify the master configuration file, /etc/origin/master/master-config.yaml, to add PodNodeConstraints to the admissionConfig section:

    ...
    admissionConfig:
      pluginConfig:
        PodNodeConstraints:
          configuration:
            apiversion: v1
            kind: PodNodeConstraintsConfig
    ...
  3. Restart OpenShift Container Platform for the changes to take effect.

    # master-restart api
    # master-restart controllers

16.5.3. Constraining Pod Placement Using a Node Selector

Using node selectors, you can ensure that pods are only placed onto nodes with specific labels. As a cluster administrator, you can use the Pod Node Constraints admission controller to set a policy that prevents users without the pods/binding permission from using node selectors to schedule pods.

The nodeSelectorLabelBlacklist field of a master configuration file gives you control over the labels that certain roles can specify in a pod configuration’s nodeSelector field. Users, service accounts, and groups that have the pods/binding permission role can specify any node selector. Those without the pods/binding permission are prohibited from setting a nodeSelector for any label that appears in nodeSelectorLabelBlacklist.

For example, an OpenShift Container Platform cluster might consist of five data centers spread across two regions. In the U.S., us-east, us-central, and us-west; and in the Asia-Pacific region (APAC), apac-east and apac-west. Each node in each geographical region is labeled accordingly. For example, region: us-east.

Note

See Updating Labels on Nodes for details on assigning labels.

As a cluster administrator, you can create an infrastructure where application developers should be deploying pods only onto the nodes closest to their geographical location. You can create a node selector, grouping the U.S. data centers into superregion: us and the APAC data centers into superregion: apac.

To maintain an even loading of resources per data center, you can add the desired region to the nodeSelectorLabelBlacklist section of a master configuration. Then, whenever a developer located in the U.S. creates a pod, it is deployed onto a node in one of the regions with the superregion: us label. If the developer tries to target a specific region for their pod (for example, region: us-east), they receive an error. If they try again, without the node selector on their pod, it can still be deployed onto the region they tried to target, because superregion: us is set as the project-level node selector, and nodes labeled region: us-east are also labeled superregion: us.

  1. Ensure you have the desired labels (see Updating Labels on Nodes for details) and node selector set up in your environment.

    For example, make sure that your pod configuration features the nodeSelector value indicating the desired label:

    apiVersion: v1
    kind: Pod
    spec:
      nodeSelector:
        <key>: <value>
    ...
  2. Modify the master configuration file, /etc/origin/master/master-config.yaml, to add nodeSelectorLabelBlacklist to the admissionConfig section with the labels that are assigned to the node hosts you want to deny pod placement:

    ...
    admissionConfig:
      pluginConfig:
        PodNodeConstraints:
          configuration:
            apiversion: v1
            kind: PodNodeConstraintsConfig
            nodeSelectorLabelBlacklist:
              - kubernetes.io/hostname
              - <label>
    ...
  3. Restart OpenShift Container Platform for the changes to take effect.

    # master-restart api
    # master-restart controllers

16.5.4. Control Pod Placement to Projects

The Pod Node Selector admission controller allows you to force pods onto nodes associated with a specific project and prevent pods from being scheduled in those nodes.

The Pod Node Selector admission controller determines where a pod can be placed using labels on projects and node selectors specified in pods. A new pod will be placed on a node associated with a project only if the node selectors in the pod match the labels in the project.

After the pod is created, the node selectors are merged into the pod so that the pod specification includes the labels originally included in the specification and any new labels from the node selectors. The example below illustrates the merging effect.

The Pod Node Selector admission controller also allows you to create a list of labels that are permitted in a specific project. This list acts as a whitelist that lets developers know what labels are acceptable to use in a project and gives administrators greater control over labeling in a cluster.

To activate the Pod Node Selector admission controller:

  1. Configure the Pod Node Selector admission controller and whitelist, using one of the following methods:

    • Add the following to the master configuration file, /etc/origin/master/master-config.yaml:

      admissionConfig:
        pluginConfig:
          PodNodeSelector:
            configuration:
              podNodeSelectorPluginConfig: 1
                clusterDefaultNodeSelector: "k3=v3" 2
                ns1: region=west,env=test,infra=fedora,os=fedora 3
      1
      Adds the Pod Node Selector admission controller plug-in.
      2
      Creates default labels for all nodes.
      3
      Creates a whitelist of permitted labels in the specified project. Here, the project is ns1 and the labels are the key=value pairs that follow.
    • Create a file containing the admission controller information:

      podNodeSelectorPluginConfig:
          clusterDefaultNodeSelector: "k3=v3"
           ns1: region=west,env=test,infra=fedora,os=fedora

      Then, reference the file in the master configuration:

      admissionConfig:
        pluginConfig:
          PodNodeSelector:
            location: <path-to-file>
      Note

      If a project does not have node selectors specified, the pods associated with that project will be merged using the default node selector (clusterDefaultNodeSelector).

  2. Restart OpenShift Container Platform for the changes to take effect.

    # master-restart api
    # master-restart controllers
  3. Create a project object that includes the scheduler.alpha.kubernetes.io/node-selector annotation and labels.

    apiVersion: v1
    kind: Namespace
    metadata
      name: ns1
      annotations:
        scheduler.alpha.kubernetes.io/node-selector: env=test,infra=fedora 1
    spec: {},
    status: {}
    1
    Annotation to create the labels to match the project label selector. Here, the key/value labels are env=test and infra=fedora.
    Note

    When using the Pod Node Selector admission controller, you cannot use oc adm new-project <project-name> for setting project node selector. When you set the project node selector using the oc adm new-project myproject --node-selector='type=user-node,region=<region> command, OpenShift Container Platform sets the openshift.io/node-selector annotation, which is processed by NodeEnv admission plugin.

  4. Create a pod specification that includes the labels in the node selector, for example:

    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        name: hello-pod
      name: hello-pod
    spec:
      containers:
        - image: "docker.io/ocpqe/hello-pod:latest"
          imagePullPolicy: IfNotPresent
          name: hello-pod
          ports:
            - containerPort: 8080
              protocol: TCP
          resources: {}
          securityContext:
            capabilities: {}
            privileged: false
          terminationMessagePath: /dev/termination-log
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      nodeSelector: 1
        env: test
        os: fedora
      serviceAccount: ""
    status: {}
    1
    Node selectors to match project labels.
  5. Create the pod in the project:

    # oc create -f pod.yaml --namespace=ns1
  6. Check that the node selector labels were added to the pod configuration:

    get pod pod1 --namespace=ns1 -o json
    
    nodeSelector": {
     "env": "test",
     "infra": "fedora",
     "os": "fedora"
    }

    The node selectors are merged into the pod and the pod should be scheduled in the appropriate project.

If you create a pod with a label that is not specified in the project specification, the pod is not scheduled on the node.

For example, here the label env: production is not in any project specification:

nodeSelector:
 "env: production"
 "infra": "fedora",
 "os": "fedora"

If there is a node that does not have a node selector annotation, the pod will be scheduled there.

16.6. Pod Priority and Preemption

16.6.1. Applying pod priority and preemption

You can enable pod priority and preemption in your cluster. Pod priority indicates the importance of a pod relative to other pods and queues the pods based on that priority. Pod preemption allows the cluster to evict, or preempt, lower-priority pods so that higher-priority pods can be scheduled if there is no available space on a suitable node Pod priority also affects the scheduling order of pods and out-of-resource eviction ordering on the node.

To use priority and preemption, you create priority classes that define the relative weight of your pods. Then, reference a priority class in the pod specification to apply that weight for scheduling.

Preemption is controlled by the disablePreemption parameter in the scheduler configuration file, which is set to false by default.

16.6.2. About pod priority

When the Pod Priority and Preemption feature is enabled, the scheduler orders pending pods by their priority, and a pending pod is placed ahead of other pending pods with lower priority in the scheduling queue. As a result, the higher priority pod might be scheduled sooner than pods with lower priority if its scheduling requirements are met. If a pod cannot be scheduled, scheduler continues to schedule other lower priority pods.

16.6.2.1. Pod priority classes

You can assign pods a priority class, which is a non-namespaced object that defines a mapping from a name to the integer value of the priority. The higher the value, the higher the priority.

A priority class object can take any 32-bit integer value smaller than or equal to 1000000000 (one billion). Reserve numbers larger than one billion for critical pods that should not be preempted or evicted. By default, OpenShift Container Platform has two reserved priority classes for critical system pods to have guaranteed scheduling.

  • System-node-critical - This priority class has a value of 2000001000 and is used for all pods that should never be evicted from a node. Examples of pods that have this priority class are sdn-ovs, sdn, and so forth.
  • System-cluster-critical - This priority class has a value of 2000000000 (two billion) and is used with pods that are important for the cluster. Pods with this priority class can be evicted from a node in certain circumstances. For example, pods configured with the system-node-critical priority class can take priority. However, this priority class does ensure guaranteed scheduling. Examples of pods that can have this priority class are fluentd, add-on components like descheduler, and so forth.
Note

If you upgrade your existing cluster, the priority of your existing pods is effectively zero. However, existing pods with the scheduler.alpha.kubernetes.io/critical-pod annotation are automatically converted to system-cluster-critical class.

16.6.2.2. Pod priority names

After you have one or more priority classes, you can create pods that specify a priority class name in a pod specification. The priority admission controller uses the priority class name field to populate the integer value of the priority. If the named priority class is not found, the pod is rejected.

The following YAML is an example of a pod configuration that uses the priority class created in the preceding example. The priority admission controller checks the specification and resolves the priority of the pod to 1000000.

16.6.3. About pod preemption

When a developer creates a pod, the pod goes into a queue. When the Pod Priority and Preemption feature is enabled, the scheduler picks a pod from the queue and tries to schedule the pod on a node. If the scheduler cannot find space on an appropriate node that satisfies all the specified requirements of the pod, preemption logic is triggered for the pending pod.

When the scheduler preempts one or more pods on a node, the nominatedNodeName field of higher-priority pod specification is set to the name of the node, along with the nodename field. The scheduler uses the nominatedNodeName field to keep track of the resources reserved for pods and also provides information to the user about preemptions in the clusters.

After the scheduler preempts a lower-priority pod, the scheduler honors the graceful termination period of the pod. If another node becomes available while scheduler is waiting for the lower-priority pod to terminate, the scheduler can schedule the higher-priority pod on that node. As a result, the nominatedNodeName field and nodeName field of the pod specification might be different.

Also, if the scheduler preempts pods on a node and is waiting for termination, and a pod with a higher-priority pod than the pending pod needs to be scheduled, the scheduler can schedule the higher-priority pod instead. In such a case, the scheduler clears the nominatedNodeName of the pending pod, making the pod eligible for another node.

Preemption does not necessarily remove all lower-priority pods from a node. The scheduler can schedule a pending pod by removing a portion of the lower-priority pods.

The scheduler considers a node for pod preemption only if the pending pod can be scheduled on the node.

16.6.3.1. Pod preemption and other scheduler settings

If you enable pod priority and preemption, consider your other scheduler settings:

Pod priority and pod disruption budget
A pod disruption budget specifies the minimum number or percentage of replicas that must be up at a time. If you specify pod disruption budgets, OpenShift Container Platform respects them when preempting pods at a best effort level. The scheduler attempts to preempt pods without violating the pod disruption budget. If no such pods are found, lower-priority pods might be preempted despite their pod disruption budget requirements.
Pod priority and pod affinity
Pod affinity requires a new pod to be scheduled on the same node as other pods with the same label.

If a pending pod has inter-pod affinity with one or more of the lower-priority pods on a node, the scheduler cannot preempt the lower-priority pods without violating the affinity requirements. In this case, the scheduler looks for another node to schedule the pending pod. However, there is no guarantee that the scheduler can find an appropriate node and pending pod might not be scheduled.

To prevent this situation, carefully configure pod affinity with equal-priority pods.

16.6.3.2. Graceful termination of preempted pods

When preempting a pod, the scheduler waits for the pod graceful termination period to expire, allowing the pod to finish working and exit. If the pod does not exit after the period, the scheduler kills the pod. This graceful termination period creates a time gap between the point that the scheduler preempts the pod and the time when the pending pod can be scheduled on the node.

To minimize this gap, configure a small graceful termination period for lower-priority pods.

16.6.4. Pod priority example scenarios

Pod priority and preemption assigns a priority to pods for scheduling. The scheduler will preempt (evict) lower-priority pods in order to schedule higher-priority pods.

Typical preemption scenario

Pod P is a pending pod.

  1. The scheduler locates Node N, where the removal of one or more pods would enable Pod P to be scheduled on that node.
  2. The scheduler deletes the lower-priority pods from the Node N and schedules Pod P on the node.
  3. The nominatedNodeName field of Pod P is set to the name of Node N.
Note

Pod P is not necessarily scheduled to the nominated node.

Preemption and termination periods

The preempted pod has a long termination period.

  1. The scheduler preempts a lower-priority pod on Node N.
  2. The scheduler waits for the pod to gracefully terminate.
  3. For other scheduling reasons, Node M becomes available.
  4. The scheduler can then schedule Pod P on Node M.

16.6.5. Configuring priority and preemption

You apply pod priority and preemption by creating a priority class objects and associating pods to the priority using the priorityClassName in your pod specifications.

Sample priority class object

apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: high-priority 1
value: 1000000 2
globalDefault: false 3
description: "This priority class should be used for XYZ service pods only." 4

1
The name of the priority class object.
2
The priority value of the object.
3
Optional field that indicates whether this priority class should be used for pods without a priority class name specified. This field is false by default. Only one priority class with globalDefault set to true can exist in the cluster. If there is no priority class with globalDefault:true, the priority of pods with no priority class name is zero. Adding a priority class with globalDefault:true affects only pods created after the priority class is added and does not change the priorities of existing pods.
4
Optional arbitrary text string that describes which pods developers should use with this priority class.

Sample pod specification with priority class name

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority 1

1
Specify the priority class to use with this pod.

To configure your cluster to use priority and preemption:

  1. Create one or more priority classes:

    1. Specify a name and value for the priority.
    2. Optionally specify the globalDefault field in the priority class and a description.
  2. Create pods or edit existing pods to include the name of a priority class. You can add the priority name directly to the pod configuration or to a pod template:

16.6.6. Disabling priority and preemption

You can disable the pod priority and preemption feature.

After the feature is disabled, the existing pods keep their priority fields, but preemption is disabled, and priority fields are ignored. If the feature is disabled, you cannot set a priority class name in new pods.

Important

Critical pods rely on scheduler preemption to be scheduled when a cluster is under resource pressure. For this reason, Red Hat recommends not disabling preemption. DaemonSet pods are scheduled by the DaemonSet controller and not affected by disabling preemption.

To disable the preemption for the cluster:

  1. Modify the master-config.yaml to set the disablePreemption parameter in the schedulerArgs section to false.

    disablePreemption=false
  2. Restart the OpenShift Container Platform master service and scheduler to apply the changes.

    # master-restart api
    # master-restart scheduler

16.7. Advanced Scheduling

16.7.1. Overview

Advanced scheduling involves configuring a pod so that the pod is required to run on particular nodes or has a preference to run on particular nodes.

Generally, advanced scheduling is not necessary, as the OpenShift Container Platform automatically places pods in a reasonable manner. For example, the default scheduler attempts to distribute pods across the nodes evenly and considers the available resources in a node. However, you might want more control over where a pod is placed.

If a pod needs to be on a machine with a faster disk speed (or prevented from being placed on that machine) or pods from two different services need to be located so they can communicate, you can use advanced scheduling to make that happen.

To ensure that appropriate new pods are scheduled on a dedicated group of nodes and prevent other new pods from being scheduled on those nodes, you can combine these methods as needed.

16.7.2. Using Advanced Scheduling

There are several ways to invoke advanced scheduling in your cluster:

Pod Affinity and Anti-affinity

Pod affinity allows a pod to specify an affinity (or anti-affinity) towards a group of pods (for an application’s latency requirements, due to security, and so forth) it can be placed with. The node does not have control over the placement.

Pod affinity uses labels on nodes and label selectors on pods to create rules for pod placement. Rules can be mandatory (required) or best-effort (preferred).

See Using Pod Affinity and Anti-affinity.

Node Affinity

Node affinity allows a pod to specify an affinity (or anti-affinity) towards a group of nodes (due to their special hardware, location, requirements for high availability, and so forth) it can be placed on. The node does not have control over the placement.

Node affinity uses labels on nodes and label selectors on pods to create rules for pod placement. Rules can be mandatory (required) or best-effort (preferred).

See Using Node Affinity.

Node Selectors

Node selectors are the simplest form of advanced scheduling. Like node affinity, node selectors also use labels on nodes and label selectors on pods to allow a pod to control the nodes on which it can be placed. However, node selectors do not have required and preferred rules that node affinities have.

See Using Node Selectors.

Taints and Tolerations

Taints/Tolerations allow the node to control which pods should (or should not) be scheduled on them. Taints are labels on a node and tolerations are labels on a pod. The labels on the pod must match (or tolerate) the label (taint) on the node in order to be scheduled.

Taints/tolerations have one advantage over affinities. For example, if you add to a cluster a new group of nodes with different labels, you would need to update affinities on each of the pods you want to access the node and on any other pods you do not want to use the new nodes. With taints/tolerations, you would only need to update those pods that are required to land on those new nodes, because other pods would be repelled.

See Using Taints and Tolerations.

16.8. Advanced Scheduling and Node Affinity

16.8.1. Overview

Node affinity is a set of rules used by the scheduler to determine where a pod can be placed. The rules are defined using custom labels on nodes and label selectors specified in pods. Node affinity allows a pod to specify an affinity (or anti-affinity) towards a group of nodes it can be placed on. The node does not have control over the placement.

For example, you could configure a pod to only run on a node with a specific CPU or in a specific availability zone.

There are two types of node affinity rules: required and preferred.

Required rules must be met before a pod can be scheduled on a node. Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.

Note

If labels on a node change at runtime that results in an node affinity rule on a pod no longer being met, the pod continues to run on the node.

16.8.2. Configuring Node Affinity

You configure node affinity through the pod specification file. You can specify a required rule, a preferred rule, or both. If you specify both, the node must first meet the required rule, then attempts to meet the preferred rule.

The following example is a pod specification with a rule that requires the pod be placed on a node with a label whose key is e2e-az-NorthSouth and whose value is either e2e-az-North or e2e-az-South:

Sample pod configuration file with a node affinity required rule

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity: 1
      requiredDuringSchedulingIgnoredDuringExecution: 2
        nodeSelectorTerms:
        - matchExpressions:
          - key: e2e-az-NorthSouth 3
            operator: In 4
            values:
            - e2e-az-North 5
            - e2e-az-South 6
  containers:
  - name: with-node-affinity
    image: docker.io/ocpqe/hello-pod

1
The stanza to configure node affinity.
2
Defines a required rule.
3 5 6
The key/value pair (label) that must be matched to apply the rule.
4
The operator represents the relationship between the label on the node and the set of values in the matchExpression parameters in the pod specification. This value can be In, NotIn, Exists, or DoesNotExist, Lt, or Gt.

The following example is a node specification with a preferred rule that a node with a label whose key is e2e-az-EastWest and whose value is either e2e-az-East or e2e-az-West is preferred for the pod:

Sample pod configuration file with a node affinity preferred rule

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity: 1
      preferredDuringSchedulingIgnoredDuringExecution: 2
      - weight: 1 3
        preference:
          matchExpressions:
          - key: e2e-az-EastWest 4
            operator: In 5
            values:
            - e2e-az-East 6
            - e2e-az-West 7
  containers:
  - name: with-node-affinity
    image: docker.io/ocpqe/hello-pod

1
The stanza to configure node affinity.
2
Defines a preferred rule.
3
Specifies a weight for a preferred rule. The node with highest weight is preferred.
4 6 7
The key/value pair (label) that must be matched to apply the rule.
5
The operator represents the relationship between the label on the node and the set of values in the matchExpression parameters in the pod specification. This value can be In, NotIn, Exists, or DoesNotExist, Lt, or Gt.

There is no explicit node anti-affinity concept, but using the NotIn or DoesNotExist operator replicates that behavior.

Note

If you are using node affinity and node selectors in the same pod configuration, note the following:

  • If you configure both nodeSelector and nodeAffinity, both conditions must be satisfied for the pod to be scheduled onto a candidate node.
  • If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms is satisfied.
  • If you specify multiple matchExpressions associated with nodeSelectorTerms, then the pod can be scheduled onto a node only if all matchExpressions are satisfied.

16.8.2.1. Configuring a Required Node Affinity Rule

Required rules must be met before a pod can be scheduled on a node.

The following steps demonstrate a simple configuration that creates a node and a pod that the scheduler is required to place on the node.

  1. Add a label to a node by editing the node configuration or by using the oc label node command:

    $ oc label node node1 e2e-az-name=e2e-az1
    Note

    To modify a node in your cluster, update the node configuration maps as needed. Do not manually edit the node-config.yaml file.

  2. In the pod specification, use the nodeAffinity stanza to configure the requiredDuringSchedulingIgnoredDuringExecution parameter:

    1. Specify the key and values that must be met. If you want the new pod to be scheduled on the node you edited, use the same key and value parameters as the label in the node.
    2. Specify an operator. The operator can be In, NotIn, Exists, DoesNotExist, Lt, or Gt. For example, use the operator In to require the label to be in the node:

      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: e2e-az-name
                  operator: In
                  values:
                  - e2e-az1
                  - e2e-az2
  3. Create the pod:

    $ oc create -f e2e-az2.yaml

16.8.2.2. Configuring a Preferred Node Affinity Rule

Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.

The following steps demonstrate a simple configuration that creates a node and a pod that the scheduler tries to place on the node.

  1. Add a label to a node by editing the node configuration or by executing the oc label node command:

    $ oc label node node1 e2e-az-name=e2e-az3
    Note

    To modify a node in your cluster, update the node configuration maps as needed. Do not manually edit the node-config.yaml file.

  2. In the pod specification, use the nodeAffinity stanza to configure the preferredDuringSchedulingIgnoredDuringExecution parameter:

    1. Specify a weight for the node, as a number 1-100. The node with highest weight is preferred.
    2. Specify the key and values that must be met. If you want the new pod to be scheduled on the node you edited, use the same key and value parameters as the label in the node:

            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                - key: e2e-az-name
                  operator: In
                  values:
                  - e2e-az3
  3. Specify an operator. The operator can be In, NotIn, Exists, DoesNotExist, Lt, or Gt. For example, use the operator In to require the label to be in the node.
  4. Create the pod.

    $ oc create -f e2e-az3.yaml

16.8.3. Examples

The following examples demonstrate node affinity.

16.8.3.1. Node Affinity with Matching Labels

The following example demonstrates node affinity for a node and pod with matching labels:

  • The Node1 node has the label zone:us:

    $ oc label node node1 zone=us
  • The pod pod-s1 has the zone and us key/value pair under a required node affinity rule:

    $ cat pod-s1.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s1
    spec:
      containers:
        - image: "docker.io/ocpqe/hello-pod"
          name: hello-pod
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                - key: "zone"
                  operator: In
                  values:
                  - us
  • Create the pod using the standard command:

    $ oc create -f pod-s1.yaml
    pod "pod-s1" created
  • The pod pod-s1 can be scheduled on Node1:

    $ oc get pod -o wide
    NAME     READY     STATUS       RESTARTS   AGE      IP      NODE
    pod-s1   1/1       Running      0          4m       IP1     node1

16.8.3.2. Node Affinity with No Matching Labels

The following example demonstrates node affinity for a node and pod without matching labels:

  • The Node1 node has the label zone:emea:

    $ oc label node node1 zone=emea
  • The pod pod-s1 has the zone and us key/value pair under a required node affinity rule:

    $ cat pod-s1.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s1
    spec:
      containers:
        - image: "docker.io/ocpqe/hello-pod"
          name: hello-pod
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                - key: "zone"
                  operator: In
                  values:
                  - us
  • The pod pod-s1 cannot be scheduled on Node1:

    $ oc describe pod pod-s1
    
    ...
    Events:
     FirstSeen LastSeen Count From              SubObjectPath  Type                Reason
     --------- -------- ----- ----              -------------  --------            ------
     1m        33s      8     default-scheduler Warning        FailedScheduling    No nodes are available that match all of the following predicates:: MatchNodeSelector (1).

16.9. Advanced Scheduling and Pod Affinity and Anti-affinity

16.9.1. Overview

Pod affinity and pod anti-affinity allow you to specify rules about how pods should be placed relative to other pods. The rules are defined using custom labels on nodes and label selectors specified in pods. Pod affinity/anti-affinity allows a pod to specify an affinity (or anti-affinity) towards a group of pods it can be placed with. The node does not have control over the placement.

For example, using affinity rules, you could spread or pack pods within a service or relative to pods in other services. Anti-affinity rules allow you to prevent pods of a particular service from scheduling on the same nodes as pods of another service that are known to interfere with the performance of the pods of the first service. Or, you could spread the pods of a service across nodes or availability zones to reduce correlated failures.

Pod affinity/anti-affinity allows you to constrain which nodes your pod is eligible to be scheduled on based on the labels on other pods. A label is a key/value pair.

  • Pod affinity can tell the scheduler to locate a new pod on the same node as other pods if the label selector on the new pod matches the label on the current pod.
  • Pod anti-affinity can prevent the scheduler from locating a new pod on the same node as pods with the same labels if the label selector on the new pod matches the label on the current pod.

There are two types of pod affinity rules: required and preferred.

Required rules must be met before a pod can be scheduled on a node. Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.

Note

Depending on your pod priority and preemption settings, the scheduler might not be able to find an appropriate node for a pod without violating affinity requirements. If so, a pod might not be scheduled.

To prevent this situation, carefully configure pod affinity with equal-priority pods.

16.9.2. Configuring Pod Affinity and Anti-affinity

You configure pod affinity/anti-affinity through the pod specification files. You can specify a required rule, a preferred rule, or both. If you specify both, the node must first meet the required rule, then attempts to meet the preferred rule.

The following example shows a pod specification configured for pod affinity and anti-affinity.

In this example, the pod affinity rule indicates that the pod can schedule onto a node only if that node has at least one already-running pod with a label that has the key security and value S1. The pod anti-affinity rule says that the pod prefers to not schedule onto a node if that node is already running a pod with label having key security and value S2.

Sample pod config file with pod affinity

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity: 1
      requiredDuringSchedulingIgnoredDuringExecution: 2
      - labelSelector:
          matchExpressions:
          - key: security 3
            operator: In 4
            values:
            - S1 5
        topologyKey: failure-domain.beta.kubernetes.io/zone
  containers:
  - name: with-pod-affinity
    image: docker.io/ocpqe/hello-pod

1
Stanza to configure pod affinity.
2
Defines a required rule.
3 5
The key and value (label) that must be matched to apply the rule.
4
The operator represents the relationship between the label on the existing pod and the set of values in the matchExpression parameters in the specification for the new pod. Can be In, NotIn, Exists, or DoesNotExist.

Sample pod config file with pod anti-affinity

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-antiaffinity
spec:
  affinity:
    podAntiAffinity: 1
      preferredDuringSchedulingIgnoredDuringExecution: 2
      - weight: 100  3
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security 4
              operator: In 5
              values:
              - S2
          topologyKey: kubernetes.io/hostname
  containers:
  - name: with-pod-affinity
    image: docker.io/ocpqe/hello-pod

1
Stanza to configure pod anti-affinity.
2
Defines a preferred rule.
3
Specifies a weight for a preferred rule. The node with the highest weight is preferred.
4
Description of the pod label that determines when the anti-affinity rule applies. Specify a key and value for the label.
5
The operator represents the relationship between the label on the existing pod and the set of values in the matchExpression parameters in the specification for the new pod. Can be In, NotIn, Exists, or DoesNotExist.
Note

If labels on a node change at runtime such that the affinity rules on a pod are no longer met, the pod continues to run on the node.

16.9.2.1. Configuring an Affinity Rule

The following steps demonstrate a simple two-pod configuration that creates pod with a label and a pod that uses affinity to allow scheduling with that pod.

  1. Create a pod with a specific label in the pod specification:

    $ cat team4.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: security-s1
      labels:
        security: S1
    spec:
      containers:
      - name: security-s1
        image: docker.io/ocpqe/hello-pod
  2. When creating other pods, edit the pod specification as follows:

    1. Use the podAffinity stanza to configure the requiredDuringSchedulingIgnoredDuringExecution parameter or preferredDuringSchedulingIgnoredDuringExecution parameter:
    2. Specify the key and value that must be met. If you want the new pod to be scheduled with the other pod, use the same key and value parameters as the label on the first pod.

          podAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: security
                  operator: In
                  values:
                  - S1
              topologyKey: failure-domain.beta.kubernetes.io/zone
    3. Specify an operator. The operator can be In, NotIn, Exists, or DoesNotExist. For example, use the operator In to require the label to be in the node.
    4. Specify a topologyKey, which is a prepopulated Kubernetes label that the system uses to denote such a topology domain.
  3. Create the pod.

    $ oc create -f <pod-spec>.yaml

16.9.2.2. Configuring an Anti-affinity Rule

The following steps demonstrate a simple two-pod configuration that creates pod with a label and a pod that uses an anti-affinity preferred rule to attempt to prevent scheduling with that pod.

  1. Create a pod with a specific label in the pod specification:

    $ cat team4.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: security-s2
      labels:
        security: S2
    spec:
      containers:
      - name: security-s2
        image: docker.io/ocpqe/hello-pod
  2. When creating other pods, edit the pod specification to set the following parameters:
  3. Use the podAffinity stanza to configure the requiredDuringSchedulingIgnoredDuringExecution parameter or preferredDuringSchedulingIgnoredDuringExecution parameter:

    1. Specify a weight for the node, 1-100. The node that with highest weight is preferred.
    2. Specify the key and values that must be met. If you want the new pod to not be scheduled with the other pod, use the same key and value parameters as the label on the first pod.

          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                  - key: security
                    operator: In
                    values:
                    - S2
                topologyKey: kubernetes.io/hostname
    3. For a preferred rule, specify a weight, 1-100.
    4. Specify an operator. The operator can be In, NotIn, Exists, or DoesNotExist. For example, use the operator In to require the label to be in the node.
  4. Specify a topologyKey, which is a prepopulated Kubernetes label that the system uses to denote such a topology domain.
  5. Create the pod.

    $ oc create -f <pod-spec>.yaml

16.9.3. Examples

The following examples demonstrate pod affinity and pod anti-affinity.

16.9.3.1. Pod Affinity

The following example demonstrates pod affinity for pods with matching labels and label selectors.

  • The pod team4 has the label team:4.

    $ cat team4.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: team4
      labels:
         team: "4"
    spec:
      containers:
      - name: ocp
        image: docker.io/ocpqe/hello-pod
  • The pod team4a has the label selector team:4 under podAffinity.

    $ cat pod-team4a.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: team4a
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: team
                operator: In
                values:
                - "4"
            topologyKey: kubernetes.io/hostname
      containers:
      - name: pod-affinity
        image: docker.io/ocpqe/hello-pod
  • The team4a pod is scheduled on the same node as the team4 pod.

16.9.3.2. Pod Anti-affinity

The following example demonstrates pod anti-affinity for pods with matching labels and label selectors.

  • The pod pod-s1 has the label security:s1.

    $ cat pod-s1.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s1
      labels:
        security: s1
    spec:
      containers:
      - name: ocp
        image: docker.io/ocpqe/hello-pod
  • The pod pod-s2 has the label selector security:s1 under podAntiAffinity.

    $ cat pod-s2.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s2
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: security
                operator: In
                values:
                - s1
            topologyKey: kubernetes.io/hostname
      containers:
      - name: pod-antiaffinity
        image: docker.io/ocpqe/hello-pod
  • The pod pod-s2 cannot be scheduled on the same node as pod-s1.

16.9.3.3. Pod Affinity with no Matching Labels

The following example demonstrates pod affinity for pods without matching labels and label selectors.

  • The pod pod-s1 has the label security:s1.

    $ cat pod-s1.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s1
      labels:
        security: s1
    spec:
      containers:
      - name: ocp
        image: docker.io/ocpqe/hello-pod
  • The pod pod-s2 has the label selector security:s2.

    $ cat pod-s2.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s2
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: security
                operator: In
                values:
                - s2
            topologyKey: kubernetes.io/hostname
      containers:
      - name: pod-affinity
        image: docker.io/ocpqe/hello-pod
  • The pod pod-s2 is not scheduled unless there is a node with a pod that has the security:s2 label. If there is no other pod with that label, the new pod remains in a pending state:

    NAME      READY     STATUS    RESTARTS   AGE       IP        NODE
    pod-s2    0/1       Pending   0          32s       <none>

16.10. Advanced Scheduling and Node Selectors

16.10.1. Overview

A node selector specifies a map of key-value pairs. The rules are defined using custom labels on nodes and selectors specified in pods.

For the pod to be eligible to run on a node, the pod must have the indicated key-value pairs as the label on the node.

If you are using node affinity and node selectors in the same pod configuration, see the important considerations below.

16.10.2. Configuring Node Selectors

Using nodeSelector in a pod configuration, you can ensure that pods are only placed onto nodes with specific labels.

  1. Ensure you have the desired labels (see Updating Labels on Nodes for details) and node selector set up in your environment.

    For example, make sure that your pod configuration features the nodeSelector value indicating the desired label:

    apiVersion: v1
    kind: Pod
    spec:
      nodeSelector:
        <key>: <value>
    ...
  2. Modify the master configuration file, /etc/origin/master/master-config.yaml, to add nodeSelectorLabelBlacklist to the admissionConfig section with the labels that are assigned to the node hosts you want to deny pod placement:

    ...
    admissionConfig:
      pluginConfig:
        PodNodeConstraints:
          configuration:
            apiversion: v1
            kind: PodNodeConstraintsConfig
            nodeSelectorLabelBlacklist:
              - kubernetes.io/hostname
              - <label>
    ...
  3. Restart OpenShift Container Platform for the changes to take effect.

    # master-restart api
    # master-restart controllers
Note

If you are using node selectors and node affinity in the same pod configuration, note the following:

  • If you configure both nodeSelector and nodeAffinity, both conditions must be satisfied for the pod to be scheduled onto a candidate node.
  • If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms is satisfied.
  • If you specify multiple matchExpressions associated with nodeSelectorTerms, then the pod can be scheduled onto a node only if all matchExpressions are satisfied.

16.11. Advanced Scheduling and Taints and Tolerations

16.11.1. Overview

Taints and tolerations allow the node to control which pods should (or should not) be scheduled on them.

16.11.2. Taints and Tolerations

A taint allows a node to refuse pod to be scheduled unless that pod has a matching toleration.

You apply taints to a node through the node specification (NodeSpec) and apply tolerations to a pod through the pod specification (PodSpec). A taint on a node instructs the node to repel all pods that do not tolerate the taint.

Taints and tolerations consist of a key, value, and effect. An operator allows you to leave one of these parameters empty.

Table 16.1. Taint and toleration components

ParameterDescription

key

The key is any string, up to 253 characters. The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

value

The value is any string, up to 63 characters. The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

effect

The effect is one of the following:

NoSchedule

  • New pods that do not match the taint are not scheduled onto that node.
  • Existing pods on the node remain.

PreferNoSchedule

  • New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to.
  • Existing pods on the node remain.

NoExecute

  • New pods that do not match the taint cannot be scheduled onto that node.
  • Existing pods on the node that do not have a matching toleration are removed.

operator

Equal

The key/value/effect parameters must match. This is the default.

Exists

The key/effect parameters must match. You must leave a blank value parameter, which matches any.

A toleration matches a taint:

  • If the operator parameter is set to Equal:

    • the key parameters are the same;
    • the value parameters are the same;
    • the effect parameters are the same.
  • If the operator parameter is set to Exists:

    • the key parameters are the same;
    • the effect parameters are the same.

16.11.2.1. Using Multiple Taints

You can put multiple taints on the same node and multiple tolerations on the same pod. OpenShift Container Platform processes multiple taints and tolerations as follows:

  1. Process the taints for which the pod has a matching toleration.
  2. The remaining unmatched taints have the indicated effects on the pod:

    • If there is at least one unmatched taint with effect NoSchedule, OpenShift Container Platform cannot schedule a pod onto that node.
    • If there is no unmatched taint with effect NoSchedule but there is at least one unmatched taint with effect PreferNoSchedule, OpenShift Container Platform tries to not schedule the pod onto the node.
    • If there is at least one unmatched taint with effect NoExecute, OpenShift Container Platform evicts the pod from the node (if it is already running on the node), or the pod is not scheduled onto the node (if it is not yet running on the node).

      • Pods that do not tolerate the taint are evicted immediately.
      • Pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever.
      • Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time.

For example:

  • The node has the following taints:

    $ oc adm taint nodes node1 key1=value1:NoSchedule
    $ oc adm taint nodes node1 key1=value1:NoExecute
    $ oc adm taint nodes node1 key2=value2:NoSchedule
  • The pod has the following tolerations:

    tolerations:
    - key: "key1"
      operator: "Equal"
      value: "value1"
      effect: "NoSchedule"
    - key: "key1"
      operator: "Equal"
      value: "value1"
      effect: "NoExecute"

In this case, the pod cannot be scheduled onto the node, because there is no toleration matching the third taint. The pod continues running if it is already running on the node when the taint is added, because the third taint is the only one of the three that is not tolerated by the pod.

16.11.3. Adding a Taint to an Existing Node

You add a taint to a node using the oc adm taint command with the parameters described in the Taint and toleration components table:

$ oc adm taint nodes <node-name> <key>=<value>:<effect>

For example:

$ oc adm taint nodes node1 key1=value1:NoExecute

The example places a taint on node1 that has key key1, value value1, and taint effect NoExecute.

16.11.4. Adding a Toleration to a Pod

To add a toleration to a pod, edit the pod specification to include a tolerations section:

Sample pod configuration file with Equal operator

tolerations:
- key: "key1" 1
  operator: "Equal" 2
  value: "value1" 3
  effect: "NoExecute" 4
  tolerationSeconds: 3600 5

1 2 3 4
The toleration parameters, as described in the Taint and toleration components table.
5
The tolerationSeconds parameter specifies how long a pod can remain bound to a node before being evicted. See Using Toleration Seconds to Delay Pod Evictions below.

Sample pod configuration file with Exists operator

tolerations:
- key: "key1"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 3600

Both of these tolerations match the taint created by the oc adm taint command above. A pod with either toleration would be able to schedule onto node1.

16.11.4.1. Using Toleration Seconds to Delay Pod Evictions

You can specify how long a pod can remain bound to a node before being evicted by specifying the tolerationSeconds parameter in the pod specification. If a taint with the NoExecute effect is added to a node, any pods that do not tolerate the taint are evicted immediately (pods that do tolerate the taint are not evicted). However, if a pod that to be evicted has the tolerationSeconds parameter, the pod is not evicted until that time period expires.

For example:

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600

Here, if this pod is running but does not have a matching taint, the pod stays bound to the node for 3,600 seconds and then be evicted. If the taint is removed before that time, the pod is not evicted.

16.11.4.1.1. Setting a Default Value for Toleration Seconds

This plug-in sets the default forgiveness toleration for pods, to tolerate the node.kubernetes.io/not-ready:NoExecute and node.kubernetes.io/unreachable:NoExecute taints for five minutes.

If the pod configuration provided by the user already has either toleration, the default is not added.

To enable Default Toleration Seconds:

  1. Modify the master configuration file (/etc/origin/master/master-config.yaml) to Add DefaultTolerationSeconds to the admissionConfig section:

    admissionConfig:
      pluginConfig:
        DefaultTolerationSeconds:
          configuration:
            kind: DefaultAdmissionConfig
            apiVersion: v1
            disable: false
  2. Restart OpenShift for the changes to take effect:

    # master-restart api
    # master-restart controllers
  3. Verify that the default was added:

    1. Create a pod:

      $ oc create -f </path/to/file>

      For example:

      $ oc create -f hello-pod.yaml
      pod "hello-pod" created
    2. Check the pod tolerations:

      $ oc describe pod <pod-name> |grep -i toleration

      For example:

      $ oc describe pod hello-pod |grep -i toleration
      Tolerations:    node.kubernetes.io/not-ready=:Exists:NoExecute for 300s

16.11.5. Pod Eviction for Node Problems

OpenShift Container Platform can be configured to represent node unreachable and node not ready conditions as taints. This allows per-pod specification of how long to remain bound to a node that becomes unreachable or not ready, rather than using the default of five minutes.

When the Taint Based Evictions feature is enabled, the taints are automatically added by the node controller and the normal logic for evicting pods from Ready nodes is disabled.

  • If a node enters a not ready state, the node.kubernetes.io/not-ready:NoExecute taint is added and pods cannot be scheduled on the node. Existing pods remain for the toleration seconds period.
  • If a node enters a not reachable state, the node.kubernetes.io/unreachable:NoExecute taint is added and pods cannot be scheduled on the node. Existing pods remain for the toleration seconds period.

To enable Taint Based Evictions:

  1. Modify the master configuration file (/etc/origin/master/master-config.yaml) to add the following to the kubernetesMasterConfig section:

    kubernetesMasterConfig:
       controllerArguments:
         feature-gates:
         - TaintBasedEvictions=true
  2. Check that the taint is added to a node:

    $ oc describe node $node | grep -i taint
    
    Taints: node.kubernetes.io/not-ready:NoExecute
  3. Restart OpenShift for the changes to take effect:

    # master-restart api
    # master-restart controllers
  4. Add a toleration to pods:

    tolerations:
    - key: "node.kubernetes.io/unreachable"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 6000

    or

    tolerations:
    - key: "node.kubernetes.io/not-ready"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 6000
Note

To maintain the existing rate limiting behavior of pod evictions due to node problems, the system adds the taints in a rate-limited way. This prevents massive pod evictions in scenarios such as the master becoming partitioned from the nodes.

16.11.6. Daemonsets and Tolerations

DaemonSet pods are created with NoExecute tolerations for node.kubernetes.io/unreachable and node.kubernetes.io/not-ready with no tolerationSeconds to ensure that DaemonSet pods are never evicted due to these problems, even when the Default Toleration Seconds feature is disabled.

16.11.7. Examples

Taints and tolerations are a flexible way to steer pods away from nodes or evict pods that should not be running on a node. A few of typical scenrios are:

16.11.7.1. Dedicating a Node for a User

You can specify a set of nodes for exclusive use by a particular set of users.

To specify dedicated nodes:

  1. Add a taint to those nodes:

    For example:

    $ oc adm taint nodes node1 dedicated=groupName:NoSchedule
  2. Add a corresponding toleration to the pods by writing a custom admission controller.

    Only the pods with the tolerations are allowed to use the dedicated nodes.

16.11.7.2. Binding a User to a Node

You can configure a node so that particular users can use only the dedicated nodes.

To configure a node so that users can use only that node:

  1. Add a taint to those nodes:

    For example:

    $ oc adm taint nodes node1 dedicated=groupName:NoSchedule
  2. Add a corresponding toleration to the pods by writing a custom admission controller.

    The admission controller should add a node affinity to require that the pods can only schedule onto nodes labeled with the key:value label (dedicated=groupName).

  3. Add a label similar to the taint (such as the key:value label) to the dedicated nodes.

16.11.7.3. Nodes with Special Hardware

In a cluster where a small subset of nodes have specialized hardware (for example GPUs), you can use taints and tolerations to keep pods that do not need the specialized hardware off of those nodes, leaving the nodes for pods that do need the specialized hardware. You can also require pods that need specialized hardware to use specific nodes.

To ensure pods are blocked from the specialized hardware:

  1. Taint the nodes that have the specialized hardware using one of the following commands:

    $ oc adm taint nodes <node-name> disktype=ssd:NoSchedule
    $ oc adm taint nodes <node-name> disktype=ssd:PreferNoSchedule
  2. Adding a corresponding toleration to pods that use the special hardware using an admission controller.

For example, the admission controller could use some characteristic(s) of the pod to determine that the pod should be allowed to use the special nodes by adding a toleration.

To ensure pods can only use the specialized hardware, you need some additional mechanism. For example, you could label the nodes that have the special hardware and use node affinity on the pods that need the hardware.