Chapter 2. Controlling pod placement onto nodes (scheduling)

2.1. Controlling pod placement using the scheduler

Pod scheduling is an internal process that determines placement of new pods onto nodes within the cluster.

The scheduler code has a clean separation that watches new pods as they get created and identifies the most suitable node to host them. It then creates bindings (pod to node bindings) for the pods using the master API.

Default pod scheduling
OpenShift Container Platform comes with a default scheduler that serves the needs of most users. The default scheduler uses both inherent and customization tools to determine the best fit for a pod.
Advanced pod scheduling

In situations where you might want more control over where new pods are placed, the OpenShift Container Platform advanced scheduling features allow you to configure a pod so that the pod is required or has a preference to run on a particular node or alongside a specific pod by.

2.1.1. Scheduler Use Cases

One of the important use cases for scheduling within OpenShift Container Platform is to support flexible affinity and anti-affinity policies.

2.1.1.1. Infrastructure Topological Levels

Administrators can define multiple topological levels for their infrastructure (nodes) by specifying labels on nodes. For example: region=r1, zone=z1, rack=s1.

These label names have no particular meaning and administrators are free to name their infrastructure levels anything, such as city/building/room. Also, administrators can define any number of levels for their infrastructure topology, with three levels usually being adequate (such as: regionszonesracks). Administrators can specify affinity and anti-affinity rules at each of these levels in any combination.

2.1.1.2. Affinity

Administrators should be able to configure the scheduler to specify affinity at any topological level, or even at multiple levels. Affinity at a particular level indicates that all pods that belong to the same service are scheduled onto nodes that belong to the same level. This handles any latency requirements of applications by allowing administrators to ensure that peer pods do not end up being too geographically separated. If no node is available within the same affinity group to host the pod, then the pod is not scheduled.

If you need greater control over where the pods are scheduled, see Controlling pod placement on nodes using node affinity rules and Placing pods relative to other pods using affinity and anti-affinity rules.

These advanced scheduling features allow administrators to specify which node a pod can be scheduled on and to force or reject scheduling relative to other pods.

2.1.1.3. Anti-Affinity

Administrators should be able to configure the scheduler to specify anti-affinity at any topological level, or even at multiple levels. Anti-affinity (or 'spread') at a particular level indicates that all pods that belong to the same service are spread across nodes that belong to that level. This ensures that the application is well spread for high availability purposes. The scheduler tries to balance the service pods across all applicable nodes as evenly as possible.

If you need greater control over where the pods are scheduled, see Controlling pod placement on nodes using node affinity rules and Placing pods relative to other pods using affinity and anti-affinity rules.

These advanced scheduling features allow administrators to specify which node a pod can be scheduled on and to force or reject scheduling relative to other pods.

2.2. Configuring the default scheduler to control pod placement

The default OpenShift Container Platform pod scheduler is responsible for determining placement of new pods onto nodes within the cluster. It reads data from the pod and tries to find a node that is a good fit based on configured policies. It is completely independent and exists as a standalone/pluggable solution. It does not modify the pod and just creates a binding for the pod that ties the pod to the particular node.

A selection of predicates and priorities defines the policy for the scheduler. See Modifying scheduler policy for a list of predicates and priorities.

Sample default scheduler object

apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  annotations:
    release.openshift.io/create-only: "true"
  creationTimestamp: 2019-05-20T15:39:01Z
  generation: 1
  name: cluster
  resourceVersion: "1491"
  selfLink: /apis/config.openshift.io/v1/schedulers/cluster
  uid: 6435dd99-7b15-11e9-bd48-0aec821b8e34
spec:
  policy: 1
    name: scheduler-policy
  defaultNodeSelector: type=user-node,region=east 2

1
You can specify the name of a custom scheduler policy file.
2
Optional: Specify a default node selector to restrict pod placement to specific nodes. The default node selector is applied to the pods created in all namespaces. Pods can be scheduled on nodes with labels that match the default node selector and any existing pod node selectors. Namespaces having project-wide node selectors are not impacted even if this field is set.

2.2.1. Understanding default scheduling

The existing generic scheduler is the default platform-provided scheduler engine that selects a node to host the pod in a three-step operation:

Filters the Nodes
The available nodes are filtered based on the constraints or requirements specified. This is done by running each node through the list of filter functions called predicates.
Prioritize the Filtered List of Nodes
This is achieved by passing each node through a series of priority_ functions that assign it a score between 0 - 10, with 0 indicating a bad fit and 10 indicating a good fit to host the pod. The scheduler configuration can also take in a simple weight (positive numeric value) for each priority function. The node score provided by each priority function is multiplied by the weight (default weight for most priorities is 1) and then combined by adding the scores for each node provided by all the priorities. This weight attribute can be used by administrators to give higher importance to some priorities.
Select the Best Fit Node
The nodes are sorted based on their scores and the node with the highest score is selected to host the pod. If multiple nodes have the same high score, then one of them is selected at random.

2.2.1.1. Understanding Scheduler Policy

The selection of the predicate and priorities defines the policy for the scheduler.

The scheduler configuration file is a JSON file, which must be named policy.cfg, that specifies the predicates and priorities the scheduler will consider.

In the absence of the scheduler policy file, the default scheduler behavior is used.

Important

The predicates and priorities defined in the scheduler configuration file completely override the default scheduler policy. If any of the default predicates and priorities are required, you must explicitly specify the functions in the policy configuration.

Sample scheduler config map

apiVersion: v1
data:
  policy.cfg: |
    {
        "kind" : "Policy",
        "apiVersion" : "v1",
        "predicates" : [
                {"name" : "MaxGCEPDVolumeCount"},
                {"name" : "GeneralPredicates"}, 1
                {"name" : "MaxAzureDiskVolumeCount"},
                {"name" : "MaxCSIVolumeCountPred"},
                {"name" : "CheckVolumeBinding"},
                {"name" : "MaxEBSVolumeCount"},
                {"name" : "MatchInterPodAffinity"},
                {"name" : "CheckNodeUnschedulable"},
                {"name" : "NoDiskConflict"},
                {"name" : "NoVolumeZoneConflict"},
                {"name" : "PodToleratesNodeTaints"}
                ],
        "priorities" : [
                {"name" : "LeastRequestedPriority", "weight" : 1},
                {"name" : "BalancedResourceAllocation", "weight" : 1},
                {"name" : "ServiceSpreadingPriority", "weight" : 1},
                {"name" : "NodePreferAvoidPodsPriority", "weight" : 1},
                {"name" : "NodeAffinityPriority", "weight" : 1},
                {"name" : "TaintTolerationPriority", "weight" : 1},
                {"name" : "ImageLocalityPriority", "weight" : 1},
                {"name" : "SelectorSpreadPriority", "weight" : 1},
                {"name" : "InterPodAffinityPriority", "weight" : 1},
                {"name" : "EqualPriority", "weight" : 1}
                ]
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2019-09-17T08:42:33Z"
  name: scheduler-policy
  namespace: openshift-config
  resourceVersion: "59500"
  selfLink: /api/v1/namespaces/openshift-config/configmaps/scheduler-policy
  uid: 17ee8865-d927-11e9-b213-02d1e1709840`

1
The GeneralPredicates predicate represents the PodFitsResources, HostName, PodFitsHostPorts, and MatchNodeSelector predicates. Because you are not allowed to configure the same predicate multiple times, the GeneralPredicates predicate cannot be used alongside any of the four represented predicates.

2.2.2. Creating a scheduler policy file

You can change the default scheduling behavior by creating a JSON file with the desired predicates and priorities. You then generate a config map from the JSON file and point the cluster Scheduler object to use the config map.

Procedure

To configure the scheduler policy:

  1. Create a JSON file named policy.cfg with the desired predicates and priorities.

    Sample scheduler JSON file

    {
    "kind" : "Policy",
    "apiVersion" : "v1",
    "predicates" : [ 1
            {"name" : "MaxGCEPDVolumeCount"},
            {"name" : "GeneralPredicates"},
            {"name" : "MaxAzureDiskVolumeCount"},
            {"name" : "MaxCSIVolumeCountPred"},
            {"name" : "CheckVolumeBinding"},
            {"name" : "MaxEBSVolumeCount"},
            {"name" : "MatchInterPodAffinity"},
            {"name" : "CheckNodeUnschedulable"},
            {"name" : "NoDiskConflict"},
            {"name" : "NoVolumeZoneConflict"},
            {"name" : "PodToleratesNodeTaints"}
            ],
    "priorities" : [ 2
            {"name" : "LeastRequestedPriority", "weight" : 1},
            {"name" : "BalancedResourceAllocation", "weight" : 1},
            {"name" : "ServiceSpreadingPriority", "weight" : 1},
            {"name" : "NodePreferAvoidPodsPriority", "weight" : 1},
            {"name" : "NodeAffinityPriority", "weight" : 1},
            {"name" : "TaintTolerationPriority", "weight" : 1},
            {"name" : "ImageLocalityPriority", "weight" : 1},
            {"name" : "SelectorSpreadPriority", "weight" : 1},
            {"name" : "InterPodAffinityPriority", "weight" : 1},
            {"name" : "EqualPriority", "weight" : 1}
            ]
    }

    1
    Add the predicates as needed.
    2
    Add the priorities as needed.
  2. Create a config map based on the scheduler JSON file:

    $ oc create configmap -n openshift-config --from-file=policy.cfg <configmap-name> 1
    1
    Enter a name for the config map.

    For example:

    $ oc create configmap -n openshift-config --from-file=policy.cfg scheduler-policy

    Example output

    configmap/scheduler-policy created

  3. Edit the Scheduler Operator custom resource to add the config map:

    $ oc patch Scheduler cluster --type='merge' -p '{"spec":{"policy":{"name":"<configmap-name>"}}}' --type=merge 1
    1
    Specify the name of the config map.

    For example:

    $ oc patch Scheduler cluster --type='merge' -p '{"spec":{"policy":{"name":"scheduler-policy"}}}' --type=merge

    After making the change to the Scheduler config resource, wait for the openshift-kube-apiserver pods to redeploy. This can take several minutes. Until the pods redeploy, new scheduler does not take effect.

  4. Verify the scheduler policy is configured by viewing the log of a scheduler pod in the openshift-kube-scheduler namespace. The following command checks for the predicates and priorities that are being registered by the scheduler:

    $ oc logs <scheduler-pod> | grep predicates

    For example:

    $ oc logs openshift-kube-scheduler-ip-10-0-141-29.ec2.internal | grep predicates

    Example output

    Creating scheduler with fit predicates 'map[MaxGCEPDVolumeCount:{} MaxAzureDiskVolumeCount:{} CheckNodeUnschedulable:{} NoDiskConflict:{} NoVolumeZoneConflict:{} GeneralPredicates:{} MaxCSIVolumeCountPred:{} CheckVolumeBinding:{} MaxEBSVolumeCount:{} MatchInterPodAffinity:{} PodToleratesNodeTaints:{}]' and priority functions 'map[InterPodAffinityPriority:{} LeastRequestedPriority:{} ServiceSpreadingPriority:{} ImageLocalityPriority:{} SelectorSpreadPriority:{} EqualPriority:{} BalancedResourceAllocation:{} NodePreferAvoidPodsPriority:{} NodeAffinityPriority:{} TaintTolerationPriority:{}]'

2.2.3. Modifying scheduler policies

You change scheduling behavior by creating or editing your scheduler policy config map in the openshift-config project. Add and remove predicates and priorities to the config map to create a scheduler policy.

Procedure

To modify the current custom scheduling, use one of the following methods:

  • Edit the scheduler policy config map:

    $ oc edit configmap <configmap-name>  -n openshift-config

    For example:

    $ oc edit configmap scheduler-policy -n openshift-config

    Example output

    apiVersion: v1
    data:
      policy.cfg: |
        {
            "kind" : "Policy",
            "apiVersion" : "v1",
            "predicates" : [ 1
                    {"name" : "MaxGCEPDVolumeCount"},
                    {"name" : "GeneralPredicates"},
                    {"name" : "MaxAzureDiskVolumeCount"},
                    {"name" : "MaxCSIVolumeCountPred"},
                    {"name" : "CheckVolumeBinding"},
                    {"name" : "MaxEBSVolumeCount"},
                    {"name" : "MatchInterPodAffinity"},
                    {"name" : "CheckNodeUnschedulable"},
                    {"name" : "NoDiskConflict"},
                    {"name" : "NoVolumeZoneConflict"},
                    {"name" : "PodToleratesNodeTaints"}
                    ],
            "priorities" : [ 2
                    {"name" : "LeastRequestedPriority", "weight" : 1},
                    {"name" : "BalancedResourceAllocation", "weight" : 1},
                    {"name" : "ServiceSpreadingPriority", "weight" : 1},
                    {"name" : "NodePreferAvoidPodsPriority", "weight" : 1},
                    {"name" : "NodeAffinityPriority", "weight" : 1},
                    {"name" : "TaintTolerationPriority", "weight" : 1},
                    {"name" : "ImageLocalityPriority", "weight" : 1},
                    {"name" : "SelectorSpreadPriority", "weight" : 1},
                    {"name" : "InterPodAffinityPriority", "weight" : 1},
                    {"name" : "EqualPriority", "weight" : 1}
                    ]
        }
    kind: ConfigMap
    metadata:
      creationTimestamp: "2019-09-17T17:44:19Z"
      name: scheduler-policy
      namespace: openshift-config
      resourceVersion: "15370"
      selfLink: /api/v1/namespaces/openshift-config/configmaps/scheduler-policy

    1
    Add or remove predicates as needed.
    2
    Add, remove, or change the weight of predicates as needed.

    It can take a few minutes for the scheduler to restart the pods with the updated policy.

  • Change the policies and predicates being used:

    1. Remove the scheduler policy config map:

      $ oc delete configmap -n openshift-config <name>

      For example:

      $ oc delete configmap -n openshift-config  scheduler-policy
    2. Edit the policy.cfg file to add and remove policies and predicates as needed.

      For example:

      $ vi policy.cfg

      Example output

      apiVersion: v1
      data:
        policy.cfg: |
          {
          "kind" : "Policy",
          "apiVersion" : "v1",
          "predicates" : [
                  {"name" : "MaxGCEPDVolumeCount"},
                  {"name" : "GeneralPredicates"},
                  {"name" : "MaxAzureDiskVolumeCount"},
                  {"name" : "MaxCSIVolumeCountPred"},
                  {"name" : "CheckVolumeBinding"},
                  {"name" : "MaxEBSVolumeCount"},
                  {"name" : "MatchInterPodAffinity"},
                  {"name" : "CheckNodeUnschedulable"},
                  {"name" : "NoDiskConflict"},
                  {"name" : "NoVolumeZoneConflict"},
                  {"name" : "PodToleratesNodeTaints"}
                  ],
          "priorities" : [
                  {"name" : "LeastRequestedPriority", "weight" : 1},
                  {"name" : "BalancedResourceAllocation", "weight" : 1},
                  {"name" : "ServiceSpreadingPriority", "weight" : 1},
                  {"name" : "NodePreferAvoidPodsPriority", "weight" : 1},
                  {"name" : "NodeAffinityPriority", "weight" : 1},
                  {"name" : "TaintTolerationPriority", "weight" : 1},
                  {"name" : "ImageLocalityPriority", "weight" : 1},
                  {"name" : "SelectorSpreadPriority", "weight" : 1},
                  {"name" : "InterPodAffinityPriority", "weight" : 1},
                  {"name" : "EqualPriority", "weight" : 1}
                  ]
          }

    3. Re-create the scheduler policy config map based on the scheduler JSON file:

      $ oc create configmap -n openshift-config --from-file=policy.cfg <configmap-name> 1
      1
      Enter a name for the config map.

      For example:

      $ oc create configmap -n openshift-config --from-file=policy.cfg scheduler-policy

      Example output

      configmap/scheduler-policy created

2.2.3.1. Understanding the scheduler predicates

Predicates are rules that filter out unqualified nodes.

There are several predicates provided by default in OpenShift Container Platform. Some of these predicates can be customized by providing certain parameters. Multiple predicates can be combined to provide additional filtering of nodes.

2.2.3.1.1. Static Predicates

These predicates do not take any configuration parameters or inputs from the user. These are specified in the scheduler configuration using their exact name.

2.2.3.1.1.1. Default Predicates

The default scheduler policy includes the following predicates:

The NoVolumeZoneConflict predicate checks that the volumes a pod requests are available in the zone.

{"name" : "NoVolumeZoneConflict"}

The MaxEBSVolumeCount predicate checks the maximum number of volumes that can be attached to an AWS instance.

{"name" : "MaxEBSVolumeCount"}

The MaxAzureDiskVolumeCount predicate checks the maximum number of Azure Disk Volumes.

{"name" : "MaxAzureDiskVolumeCount"}

The PodToleratesNodeTaints predicate checks if a pod can tolerate the node taints.

{"name" : "PodToleratesNodeTaints"}

The CheckNodeUnschedulable predicate checks if a pod can be scheduled on a node with Unschedulable spec.

{"name" : "CheckNodeUnschedulable"}

The CheckVolumeBinding predicate evaluates if a pod can fit based on the volumes, it requests, for both bound and unbound PVCs.

  • For PVCs that are bound, the predicate checks that the corresponding PV’s node affinity is satisfied by the given node.
  • For PVCs that are unbound, the predicate searched for available PVs that can satisfy the PVC requirements and that the PV node affinity is satisfied by the given node.

The predicate returns true if all bound PVCs have compatible PVs with the node, and if all unbound PVCs can be matched with an available and node-compatible PV.

{"name" : "CheckVolumeBinding"}

The NoDiskConflict predicate checks if the volume requested by a pod is available.

{"name" : "NoDiskConflict"}

The MaxGCEPDVolumeCount predicate checks the maximum number of Google Compute Engine (GCE) Persistent Disks (PD).

{"name" : "MaxGCEPDVolumeCount"}

MaxCSIVolumeCountPred

The MatchInterPodAffinity predicate checks if the pod affinity/anti-affinity rules permit the pod.

{"name" : "MatchInterPodAffinity"}
2.2.3.1.1.2. Other Static Predicates

OpenShift Container Platform also supports the following predicates:

Note

The CheckNode-* predicates cannot be used if the Taint Nodes By Condition feature is enabled. The Taint Nodes By Condition feature is enabled by default.

The CheckNodeCondition predicate checks if a pod can be scheduled on a node reporting out of disk, network unavailable, or not ready conditions.

{"name" : "CheckNodeCondition"}

The CheckNodeLabelPresence predicate checks if all of the specified labels exist on a node, regardless of their value.

{"name" : "CheckNodeLabelPresence"}

The checkServiceAffinity predicate checks that ServiceAffinity labels are homogeneous for pods that are scheduled on a node.

{"name" : "checkServiceAffinity"}

The PodToleratesNodeNoExecuteTaints predicate checks if a pod tolerations can tolerate a node NoExecute taints.

{"name" : "PodToleratesNodeNoExecuteTaints"}
2.2.3.1.2. General Predicates

The following general predicates check whether non-critical predicates and essential predicates pass. Non-critical predicates are the predicates that only non-critical pods must pass and essential predicates are the predicates that all pods must pass.

The default scheduler policy includes the general predicates.

Non-critical general predicates

The PodFitsResources predicate determines a fit based on resource availability (CPU, memory, GPU, and so forth). The nodes can declare their resource capacities and then pods can specify what resources they require. Fit is based on requested, rather than used resources.

{"name" : "PodFitsResources"}
Essential general predicates

The PodFitsHostPorts predicate determines if a node has free ports for the requested pod ports (absence of port conflicts).

{"name" : "PodFitsHostPorts"}

The HostName predicate determines fit based on the presence of the Host parameter and a string match with the name of the host.

{"name" : "HostName"}

The MatchNodeSelector predicate determines fit based on node selector (nodeSelector) queries defined in the pod.

{"name" : "MatchNodeSelector"}

2.2.3.2. Understanding the scheduler priorities

Priorities are rules that rank nodes according to preferences.

A custom set of priorities can be specified to configure the scheduler. There are several priorities provided by default in OpenShift Container Platform. Other priorities can be customized by providing certain parameters. Multiple priorities can be combined and different weights can be given to each in order to impact the prioritization.

2.2.3.2.1. Static Priorities

Static priorities do not take any configuration parameters from the user, except weight. A weight is required to be specified and cannot be 0 or negative.

These are specified in the scheduler policy config map in the openshift-config project.

2.2.3.2.1.1. Default Priorities

The default scheduler policy includes the following priorities. Each of the priority function has a weight of 1 except NodePreferAvoidPodsPriority, which has a weight of 10000.

The NodeAffinityPriority priority prioritizes nodes according to node affinity scheduling preferences

{"name" : "NodeAffinityPriority", "weight" : 1}

The TaintTolerationPriority priority prioritizes nodes that have a fewer number of intolerable taints on them for a pod. An intolerable taint is one which has key PreferNoSchedule.

{"name" : "TaintTolerationPriority", "weight" : 1}

The ImageLocalityPriority priority prioritizes nodes that already have requested pod container’s images.

{"name" : "ImageLocalityPriority", "weight" : 1}

The SelectorSpreadPriority priority looks for services, replication controllers (RC), replication sets (RS), and stateful sets that match the pod, then finds existing pods that match those selectors. The scheduler favors nodes that have fewer existing matching pods. Then, it schedules the pod on a node with the smallest number of pods that match those selectors as the pod being scheduled.

{"name" : "SelectorSpreadPriority", "weight" : 1}

The InterPodAffinityPriority priority computes a sum by iterating through the elements of weightedPodAffinityTerm and adding weight to the sum if the corresponding PodAffinityTerm is satisfied for that node. The node(s) with the highest sum are the most preferred.

{"name" : "InterPodAffinityPriority", "weight" : 1}

The LeastRequestedPriority priority favors nodes with fewer requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes nodes that have the highest available/remaining capacity.

{"name" : "LeastRequestedPriority", "weight" : 1}

The BalancedResourceAllocation priority favors nodes with balanced resource usage rate. It calculates the difference between the consumed CPU and memory as a fraction of capacity, and prioritizes the nodes based on how close the two metrics are to each other. This should always be used together with LeastRequestedPriority.

{"name" : "BalancedResourceAllocation", "weight" : 1}

The NodePreferAvoidPodsPriority priority ignores pods that are owned by a controller other than a replication controller.

{"name" : "NodePreferAvoidPodsPriority", "weight" : 10000}
2.2.3.2.1.2. Other Static Priorities

OpenShift Container Platform also supports the following priorities:

The EqualPriority priority gives an equal weight of 1 to all nodes, if no priority configurations are provided. We recommend using this priority only for testing environments.

{"name" : "EqualPriority", "weight" : 1}

The MostRequestedPriority priority prioritizes nodes with most requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes based on the maximum of the average of the fraction of requested to capacity.

{"name" : "MostRequestedPriority", "weight" : 1}

The ServiceSpreadingPriority priority spreads pods by minimizing the number of pods belonging to the same service onto the same machine.

{"name" : "ServiceSpreadingPriority", "weight" : 1}
2.2.3.2.2. Configurable Priorities

You can configure these priorities in the scheduler policy config map, in the openshift-config namespace, to add labels to affect how the priorities work.

The type of the priority function is identified by the argument that they take. Since these are configurable, multiple priorities of the same type (but different configuration parameters) can be combined as long as their user-defined names are different.

For information on using these priorities, see Modifying Scheduler Policy.

The ServiceAntiAffinity priority takes a label and ensures a good spread of the pods belonging to the same service across the group of nodes based on the label values. It gives the same score to all nodes that have the same value for the specified label. It gives a higher score to nodes within a group with the least concentration of pods.

{
"kind": "Policy",
"apiVersion": "v1",

"priorities":[
    {
        "name":"<name>", 1
        "weight" : 1 2
        "argument":{
            "serviceAntiAffinity":{
                "label": "<label>" 3
                }
           }
       }
   ]
}
1
Specify a name for the priority.
2
Specify a weight. Enter a non-zero positive value.
3
Specify a label to match.

For example:

{
"kind": "Policy",
"apiVersion": "v1",
"priorities": [
    {
        "name":"RackSpread",
        "weight" : 1,
        "argument": {
            "serviceAntiAffinity": {
                "label": "rack"
                }
           }
       }
   ]
}
Note

In some situations using the ServiceAntiAffinity parameter based on custom labels does not spread pod as expected. See this Red Hat Solution.

The labelPreference parameter gives priority based on the specified label. If the label is present on a node, that node is given priority. If no label is specified, priority is given to nodes that do not have a label. If multiple priorities with the labelPreference parameter are set, all of the priorities must have the same weight.

{
"kind": "Policy",
"apiVersion": "v1",
"priorities":[
    {
        "name":"<name>", 1
        "weight" : 1 2
        "argument":{
            "labelPreference":{
                "label": "<label>", 3
                "presence": true 4
                }
            }
        }
    ]
}
1
Specify a name for the priority.
2
Specify a weight. Enter a non-zero positive value.
3
Specify a label to match.
4
Specify whether the label is required, either true or false.

2.2.4. Sample Policy Configurations

The configuration below specifies the default scheduler configuration, if it were to be specified using the scheduler policy file.

{
"kind": "Policy",
"apiVersion": "v1",
"predicates": [
    {
        "name": "RegionZoneAffinity", 1
        "argument": {
            "serviceAffinity": {  2
              "labels": ["region, zone"]  3
           }
        }
     }
  ],
"priorities": [
    {
        "name":"RackSpread", 4
        "weight" : 1,
        "argument": {
            "serviceAntiAffinity": {  5
                "label": "rack"  6
                }
           }
       }
   ]
}
1
The name for the predicate.
2
The type of predicate.
3
The labels for the predicate.
4
The name for the priority.
5
The type of priority.
6
The labels for the priority.

In all of the sample configurations below, the list of predicates and priority functions is truncated to include only the ones that pertain to the use case specified. In practice, a complete/meaningful scheduler policy should include most, if not all, of the default predicates and priorities listed above.

The following example defines three topological levels, region (affinity) → zone (affinity) → rack (anti-affinity):

{
"kind": "Policy",
"apiVersion": "v1",
"predicates": [
    {
        "name": "RegionZoneAffinity",
        "argument": {
            "serviceAffinity": {
              "labels": ["region, zone"]
           }
        }
     }
  ],
"priorities": [
    {
        "name":"RackSpread",
        "weight" : 1,
        "argument": {
            "serviceAntiAffinity": {
                "label": "rack"
                }
           }
       }
   ]
}

The following example defines three topological levels, city (affinity) → building (anti-affinity) → room (anti-affinity):

{
"kind": "Policy",
"apiVersion": "v1",
"predicates": [
    {
        "name": "CityAffinity",
        "argument": {
            "serviceAffinity": {
              "label": "city"
           }
        }
     }
  ],
"priorities": [
    {
        "name":"BuildingSpread",
        "weight" : 1,
        "argument": {
            "serviceAntiAffinity": {
                "label": "building"
                }
           }
       },
    {
        "name":"RoomSpread",
        "weight" : 1,
        "argument": {
            "serviceAntiAffinity": {
                "label": "room"
                }
           }
       }
   ]
}

The following example defines a policy to only use nodes with the 'region' label defined and prefer nodes with the 'zone' label defined:

{
"kind": "Policy",
"apiVersion": "v1",
"predicates": [
    {
        "name": "RequireRegion",
        "argument": {
            "labelPreference": {
                "labels": ["region"],
                "presence": true
           }
        }
     }
  ],
"priorities": [
    {
        "name":"ZonePreferred",
        "weight" : 1,
        "argument": {
            "labelPreference": {
                "label": "zone",
                "presence": true
                }
           }
       }
   ]
}

The following example combines both static and configurable predicates and also priorities:

{
"kind": "Policy",
"apiVersion": "v1",
"predicates": [
    {
        "name": "RegionAffinity",
        "argument": {
            "serviceAffinity": {
                "labels": ["region"]
           }
        }
     },
    {
        "name": "RequireRegion",
        "argument": {
            "labelsPresence": {
                "labels": ["region"],
                "presence": true
           }
        }
     },
    {
        "name": "BuildingNodesAvoid",
        "argument": {
            "labelsPresence": {
                "label": "building",
                "presence": false
           }
        }
     },
     {"name" : "PodFitsPorts"},
     {"name" : "MatchNodeSelector"}
     ],
"priorities": [
    {
        "name": "ZoneSpread",
        "weight" : 2,
        "argument": {
            "serviceAntiAffinity":{
                "label": "zone"
                }
           }
       },
    {
        "name":"ZonePreferred",
        "weight" : 1,
        "argument": {
            "labelPreference":{
                "label": "zone",
                "presence": true
                }
           }
       },
    {"name" : "ServiceSpreadingPriority", "weight" : 1}
    ]
}

2.3. Placing pods relative to other pods using affinity and anti-affinity rules

Affinity is a property of pods that controls the nodes on which they prefer to be scheduled. Anti-affinity is a property of pods that prevents a pod from being scheduled on a node.

In OpenShift Container Platform pod affinity and pod anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on based on the key/value labels on other pods.

2.3.1. Understanding pod affinity

Pod affinity and pod anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on based on the key/value labels on other pods.

  • Pod affinity can tell the scheduler to locate a new pod on the same node as other pods if the label selector on the new pod matches the label on the current pod.
  • Pod anti-affinity can prevent the scheduler from locating a new pod on the same node as pods with the same labels if the label selector on the new pod matches the label on the current pod.

For example, using affinity rules, you could spread or pack pods within a service or relative to pods in other services. Anti-affinity rules allow you to prevent pods of a particular service from scheduling on the same nodes as pods of another service that are known to interfere with the performance of the pods of the first service. Or, you could spread the pods of a service across nodes or availability zones to reduce correlated failures.

There are two types of pod affinity rules: required and preferred.

Required rules must be met before a pod can be scheduled on a node. Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.

Note

Depending on your pod priority and preemption settings, the scheduler might not be able to find an appropriate node for a pod without violating affinity requirements. If so, a pod might not be scheduled.

To prevent this situation, carefully configure pod affinity with equal-priority pods.

You configure pod affinity/anti-affinity through the Pod spec files. You can specify a required rule, a preferred rule, or both. If you specify both, the node must first meet the required rule, then attempts to meet the preferred rule.

The following example shows a Pod spec configured for pod affinity and anti-affinity.

In this example, the pod affinity rule indicates that the pod can schedule onto a node only if that node has at least one already-running pod with a label that has the key security and value S1. The pod anti-affinity rule says that the pod prefers to not schedule onto a node if that node is already running a pod with label having key security and value S2.

Sample Pod config file with pod affinity

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity: 1
      requiredDuringSchedulingIgnoredDuringExecution: 2
      - labelSelector:
          matchExpressions:
          - key: security 3
            operator: In 4
            values:
            - S1 5
        topologyKey: failure-domain.beta.kubernetes.io/zone
  containers:
  - name: with-pod-affinity
    image: docker.io/ocpqe/hello-pod

1
Stanza to configure pod affinity.
2
Defines a required rule.
3 5
The key and value (label) that must be matched to apply the rule.
4
The operator represents the relationship between the label on the existing pod and the set of values in the matchExpression parameters in the specification for the new pod. Can be In, NotIn, Exists, or DoesNotExist.

Sample Pod config file with pod anti-affinity

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-antiaffinity
spec:
  affinity:
    podAntiAffinity: 1
      preferredDuringSchedulingIgnoredDuringExecution: 2
      - weight: 100  3
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security 4
              operator: In 5
              values:
              - S2
          topologyKey: kubernetes.io/hostname
  containers:
  - name: with-pod-affinity
    image: docker.io/ocpqe/hello-pod

1
Stanza to configure pod anti-affinity.
2
Defines a preferred rule.
3
Specifies a weight for a preferred rule. The node with the highest weight is preferred.
4
Description of the pod label that determines when the anti-affinity rule applies. Specify a key and value for the label.
5
The operator represents the relationship between the label on the existing pod and the set of values in the matchExpression parameters in the specification for the new pod. Can be In, NotIn, Exists, or DoesNotExist.
Note

If labels on a node change at runtime such that the affinity rules on a pod are no longer met, the pod continues to run on the node.

2.3.2. Configuring a pod affinity rule

The following steps demonstrate a simple two-pod configuration that creates pod with a label and a pod that uses affinity to allow scheduling with that pod.

Procedure

  1. Create a pod with a specific label in the Pod spec:

    $ cat team4.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: security-s1
      labels:
        security: S1
    spec:
      containers:
      - name: security-s1
        image: docker.io/ocpqe/hello-pod
  2. When creating other pods, edit the Pod spec as follows:

    1. Use the podAffinity stanza to configure the requiredDuringSchedulingIgnoredDuringExecution parameter or preferredDuringSchedulingIgnoredDuringExecution parameter:
    2. Specify the key and value that must be met. If you want the new pod to be scheduled with the other pod, use the same key and value parameters as the label on the first pod.

          podAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: security
                  operator: In
                  values:
                  - S1
              topologyKey: failure-domain.beta.kubernetes.io/zone
    3. Specify an operator. The operator can be In, NotIn, Exists, or DoesNotExist. For example, use the operator In to require the label to be in the node.
    4. Specify a topologyKey, which is a prepopulated Kubernetes label that the system uses to denote such a topology domain.
  3. Create the pod.

    $ oc create -f <pod-spec>.yaml

2.3.3. Configuring a pod anti-affinity rule

The following steps demonstrate a simple two-pod configuration that creates pod with a label and a pod that uses an anti-affinity preferred rule to attempt to prevent scheduling with that pod.

Procedure

  1. Create a pod with a specific label in the Pod spec:

    $ cat team4.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: security-s2
      labels:
        security: S2
    spec:
      containers:
      - name: security-s2
        image: docker.io/ocpqe/hello-pod
  2. When creating other pods, edit the Pod spec to set the following parameters:
  3. Use the podAntiAffinity stanza to configure the requiredDuringSchedulingIgnoredDuringExecution parameter or preferredDuringSchedulingIgnoredDuringExecution parameter:

    1. Specify a weight for the node, 1-100. The node that with highest weight is preferred.
    2. Specify the key and values that must be met. If you want the new pod to not be scheduled with the other pod, use the same key and value parameters as the label on the first pod.

          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                  - key: security
                    operator: In
                    values:
                    - S2
                topologyKey: kubernetes.io/hostname
    3. For a preferred rule, specify a weight, 1-100.
    4. Specify an operator. The operator can be In, NotIn, Exists, or DoesNotExist. For example, use the operator In to require the label to be in the node.
  4. Specify a topologyKey, which is a prepopulated Kubernetes label that the system uses to denote such a topology domain.
  5. Create the pod.

    $ oc create -f <pod-spec>.yaml

2.3.4. Sample pod affinity and anti-affinity rules

The following examples demonstrate pod affinity and pod anti-affinity.

2.3.4.1. Pod Affinity

The following example demonstrates pod affinity for pods with matching labels and label selectors.

  • The pod team4 has the label team:4.

    $ cat team4.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: team4
      labels:
         team: "4"
    spec:
      containers:
      - name: ocp
        image: docker.io/ocpqe/hello-pod
  • The pod team4a has the label selector team:4 under podAffinity.

    $ cat pod-team4a.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: team4a
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: team
                operator: In
                values:
                - "4"
            topologyKey: kubernetes.io/hostname
      containers:
      - name: pod-affinity
        image: docker.io/ocpqe/hello-pod
  • The team4a pod is scheduled on the same node as the team4 pod.

2.3.4.2. Pod Anti-affinity

The following example demonstrates pod anti-affinity for pods with matching labels and label selectors.

  • The pod pod-s1 has the label security:s1.

    cat pod-s1.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s1
      labels:
        security: s1
    spec:
      containers:
      - name: ocp
        image: docker.io/ocpqe/hello-pod
  • The pod pod-s2 has the label selector security:s1 under podAntiAffinity.

    cat pod-s2.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s2
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: security
                operator: In
                values:
                - s1
            topologyKey: kubernetes.io/hostname
      containers:
      - name: pod-antiaffinity
        image: docker.io/ocpqe/hello-pod
  • The pod pod-s2 cannot be scheduled on the same node as pod-s1.

2.3.4.3. Pod Affinity with no Matching Labels

The following example demonstrates pod affinity for pods without matching labels and label selectors.

  • The pod pod-s1 has the label security:s1.

    $ cat pod-s1.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s1
      labels:
        security: s1
    spec:
      containers:
      - name: ocp
        image: docker.io/ocpqe/hello-pod
  • The pod pod-s2 has the label selector security:s2.

    $ cat pod-s2.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s2
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: security
                operator: In
                values:
                - s2
            topologyKey: kubernetes.io/hostname
      containers:
      - name: pod-affinity
        image: docker.io/ocpqe/hello-pod
  • The pod pod-s2 is not scheduled unless there is a node with a pod that has the security:s2 label. If there is no other pod with that label, the new pod remains in a pending state:

    Example output

    NAME      READY     STATUS    RESTARTS   AGE       IP        NODE
    pod-s2    0/1       Pending   0          32s       <none>

2.4. Controlling pod placement on nodes using node affinity rules

Affinity is a property of pods that controls the nodes on which they prefer to be scheduled.

In OpenShift Container Platform node affinity is a set of rules used by the scheduler to determine where a pod can be placed. The rules are defined using custom labels on the nodes and label selectors specified in pods.

2.4.1. Understanding node affinity

Node affinity allows a pod to specify an affinity towards a group of nodes it can be placed on. The node does not have control over the placement.

For example, you could configure a pod to only run on a node with a specific CPU or in a specific availability zone.

There are two types of node affinity rules: required and preferred.

Required rules must be met before a pod can be scheduled on a node. Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.

Note

If labels on a node change at runtime that results in an node affinity rule on a pod no longer being met, the pod continues to run on the node.

You configure node affinity through the Pod spec file. You can specify a required rule, a preferred rule, or both. If you specify both, the node must first meet the required rule, then attempts to meet the preferred rule.

The following example is a Pod spec with a rule that requires the pod be placed on a node with a label whose key is e2e-az-NorthSouth and whose value is either e2e-az-North or e2e-az-South:

Example pod configuration file with a node affinity required rule

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity: 1
      requiredDuringSchedulingIgnoredDuringExecution: 2
        nodeSelectorTerms:
        - matchExpressions:
          - key: e2e-az-NorthSouth 3
            operator: In 4
            values:
            - e2e-az-North 5
            - e2e-az-South 6
  containers:
  - name: with-node-affinity
    image: docker.io/ocpqe/hello-pod

1
The stanza to configure node affinity.
2
Defines a required rule.
3 5 6
The key/value pair (label) that must be matched to apply the rule.
4
The operator represents the relationship between the label on the node and the set of values in the matchExpression parameters in the Pod spec. This value can be In, NotIn, Exists, or DoesNotExist, Lt, or Gt.

The following example is a node specification with a preferred rule that a node with a label whose key is e2e-az-EastWest and whose value is either e2e-az-East or e2e-az-West is preferred for the pod:

Example pod configuration file with a node affinity preferred rule

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity: 1
      preferredDuringSchedulingIgnoredDuringExecution: 2
      - weight: 1 3
        preference:
          matchExpressions:
          - key: e2e-az-EastWest 4
            operator: In 5
            values:
            - e2e-az-East 6
            - e2e-az-West 7
  containers:
  - name: with-node-affinity
    image: docker.io/ocpqe/hello-pod

1
The stanza to configure node affinity.
2
Defines a preferred rule.
3
Specifies a weight for a preferred rule. The node with highest weight is preferred.
4 6 7
The key/value pair (label) that must be matched to apply the rule.
5
The operator represents the relationship between the label on the node and the set of values in the matchExpression parameters in the Pod spec. This value can be In, NotIn, Exists, or DoesNotExist, Lt, or Gt.

There is no explicit node anti-affinity concept, but using the NotIn or DoesNotExist operator replicates that behavior.

Note

If you are using node affinity and node selectors in the same pod configuration, note the following:

  • If you configure both nodeSelector and nodeAffinity, both conditions must be satisfied for the pod to be scheduled onto a candidate node.
  • If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms is satisfied.
  • If you specify multiple matchExpressions associated with nodeSelectorTerms, then the pod can be scheduled onto a node only if all matchExpressions are satisfied.

2.4.2. Configuring a required node affinity rule

Required rules must be met before a pod can be scheduled on a node.

Procedure

The following steps demonstrate a simple configuration that creates a node and a pod that the scheduler is required to place on the node.

  1. Add a label to a node using the oc label node command:

    $ oc label node node1 e2e-az-name=e2e-az1
  2. In the Pod spec, use the nodeAffinity stanza to configure the requiredDuringSchedulingIgnoredDuringExecution parameter:

    1. Specify the key and values that must be met. If you want the new pod to be scheduled on the node you edited, use the same key and value parameters as the label in the node.
    2. Specify an operator. The operator can be In, NotIn, Exists, DoesNotExist, Lt, or Gt. For example, use the operator In to require the label to be in the node:

      Example output

      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: e2e-az-name
                  operator: In
                  values:
                  - e2e-az1
                  - e2e-az2

  3. Create the pod:

    $ oc create -f e2e-az2.yaml

2.4.3. Configuring a preferred node affinity rule

Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.

Procedure

The following steps demonstrate a simple configuration that creates a node and a pod that the scheduler tries to place on the node.

  1. Add a label to a node using the oc label node command:

    $ oc label node node1 e2e-az-name=e2e-az3
  2. In the Pod spec, use the nodeAffinity stanza to configure the preferredDuringSchedulingIgnoredDuringExecution parameter:

    1. Specify a weight for the node, as a number 1-100. The node with highest weight is preferred.
    2. Specify the key and values that must be met. If you want the new pod to be scheduled on the node you edited, use the same key and value parameters as the label in the node:

      spec:
        affinity:
          nodeAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                - key: e2e-az-name
                  operator: In
                  values:
                  - e2e-az3
    3. Specify an operator. The operator can be In, NotIn, Exists, DoesNotExist, Lt, or Gt. For example, use the Operator In to require the label to be in the node.
  3. Create the pod.

    $ oc create -f e2e-az3.yaml

2.4.4. Sample node affinity rules

The following examples demonstrate node affinity.

2.4.4.1. Node affinity with matching labels

The following example demonstrates node affinity for a node and pod with matching labels:

  • The Node1 node has the label zone:us:

    $ oc label node node1 zone=us
  • The pod-s1 pod has the zone and us key/value pair under a required node affinity rule:

    $ cat pod-s1.yaml

    Example output

    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s1
    spec:
      containers:
        - image: "docker.io/ocpqe/hello-pod"
          name: hello-pod
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                - key: "zone"
                  operator: In
                  values:
                  - us

  • The pod-s1 pod can be scheduled on Node1:

    $ oc get pod -o wide

    Example output

    NAME     READY     STATUS       RESTARTS   AGE      IP      NODE
    pod-s1   1/1       Running      0          4m       IP1     node1

2.4.4.2. Node affinity with no matching labels

The following example demonstrates node affinity for a node and pod without matching labels:

  • The Node1 node has the label zone:emea:

    $ oc label node node1 zone=emea
  • The pod-s1 pod has the zone and us key/value pair under a required node affinity rule:

    $ cat pod-s1.yaml

    Example output

    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s1
    spec:
      containers:
        - image: "docker.io/ocpqe/hello-pod"
          name: hello-pod
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                - key: "zone"
                  operator: In
                  values:
                  - us

  • The pod-s1 pod cannot be scheduled on Node1:

    $ oc describe pod pod-s1

    Example output

    ...
    
    Events:
     FirstSeen LastSeen Count From              SubObjectPath  Type                Reason
     --------- -------- ----- ----              -------------  --------            ------
     1m        33s      8     default-scheduler Warning        FailedScheduling    No nodes are available that match all of the following predicates:: MatchNodeSelector (1).

2.4.5. Additional resources

For information about changing node labels, see Understanding how to update labels on nodes.

2.5. Placing pods onto overcommited nodes

In an overcommited state, the sum of the container compute resource requests and limits exceeds the resources available on the system. Overcommitment might be desirable in development environments where a trade-off of guaranteed performance for capacity is acceptable.

Requests and limits enable administrators to allow and manage the overcommitment of resources on a node. The scheduler uses requests for scheduling your container and providing a minimum service guarantee. Limits constrain the amount of compute resource that may be consumed on your node.

2.5.1. Understanding overcommitment

Requests and limits enable administrators to allow and manage the overcommitment of resources on a node. The scheduler uses requests for scheduling your container and providing a minimum service guarantee. Limits constrain the amount of compute resource that may be consumed on your node.

OpenShift Container Platform administrators can control the level of overcommit and manage container density on nodes by configuring masters to override the ratio between request and limit set on developer containers. In conjunction with a per-project LimitRange object specifying limits and defaults, this adjusts the container limit and request to achieve the desired level of overcommit.

Note

That these overrides have no effect if no limits have been set on containers. Create a LimitRange object with default limits, per individual project, or in the project template, in order to ensure that the overrides apply.

After these overrides, the container limits and requests must still be validated by any LimitRange object in the project. It is possible, for example, for developers to specify a limit close to the minimum limit, and have the request then be overridden below the minimum limit, causing the pod to be forbidden. This unfortunate user experience should be addressed with future work, but for now, configure this capability and LimitRange objects with caution.

2.5.2. Understanding nodes overcommitment

In an overcommitted environment, it is important to properly configure your node to provide best system behavior.

When the node starts, it ensures that the kernel tunable flags for memory management are set properly. The kernel should never fail memory allocations unless it runs out of physical memory.

To ensure this behavior, OpenShift Container Platform configures the kernel to always overcommit memory by setting the vm.overcommit_memory parameter to 1, overriding the default operating system setting.

OpenShift Container Platform also configures the kernel not to panic when it runs out of memory by setting the vm.panic_on_oom parameter to 0. A setting of 0 instructs the kernel to call oom_killer in an Out of Memory (OOM) condition, which kills processes based on priority

You can view the current setting by running the following commands on your nodes:

$ sysctl -a |grep commit

Example output

vm.overcommit_memory = 1

$ sysctl -a |grep panic

Example output

vm.panic_on_oom = 0

Note

The above flags should already be set on nodes, and no further action is required.

You can also perform the following configurations for each node:

  • Disable or enforce CPU limits using CPU CFS quotas
  • Reserve resources for system processes
  • Reserve memory across quality of service tiers

2.6. Controlling pod placement using node taints

Taints and tolerations allow the node to control which pods should (or should not) be scheduled on them.

2.6.1. Understanding taints and tolerations

A taint allows a node to refuse a pod to be scheduled unless that pod has a matching toleration.

You apply taints to a node through the Node specification (NodeSpec) and apply tolerations to a pod through the Pod specification (PodSpec). When you apply a taint a node, the scheduler cannot place a pod on that node unless the pod can tolerate the taint.

Example taint in a node specification

spec:
....
  template:
....
    spec:
      taints:
      - effect: NoExecute
        key: key1
        value: value1
....

Example toleration in a Pod spec

spec:
....
  template:
....
    spec
      tolerations:
      - key: "key1"
        operator: "Equal"
        value: "value1"
        effect: "NoExecute"
        tolerationSeconds: 3600
....

Taints and tolerations consist of a key, value, and effect.

Table 2.1. Taint and toleration components

ParameterDescription

key

The key is any string, up to 253 characters. The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

value

The value is any string, up to 63 characters. The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

effect

The effect is one of the following:

NoSchedule [1]

  • New pods that do not match the taint are not scheduled onto that node.
  • Existing pods on the node remain.

PreferNoSchedule

  • New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to.
  • Existing pods on the node remain.

NoExecute

  • New pods that do not match the taint cannot be scheduled onto that node.
  • Existing pods on the node that do not have a matching toleration are removed.

operator

Equal

The key/value/effect parameters must match. This is the default.

Exists

The key/effect parameters must match. You must leave a blank value parameter, which matches any.

  1. If you add a NoSchedule taint to a master node, the node must have the node-role.kubernetes.io/master=:NoSchedule taint, which is added by default.

    For example:

    apiVersion: v1
    kind: Node
    metadata:
      annotations:
        machine.openshift.io/machine: openshift-machine-api/ci-ln-62s7gtb-f76d1-v8jxv-master-0
        machineconfiguration.openshift.io/currentConfig: rendered-master-cdc1ab7da414629332cc4c3926e6e59c
    ...
    spec:
      taints:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
    ...

A toleration matches a taint:

  • If the operator parameter is set to Equal:

    • the key parameters are the same;
    • the value parameters are the same;
    • the effect parameters are the same.
  • If the operator parameter is set to Exists:

    • the key parameters are the same;
    • the effect parameters are the same.

The following taints are built into OpenShift Container Platform:

  • node.kubernetes.io/not-ready: The node is not ready. This corresponds to the node condition Ready=False.
  • node.kubernetes.io/unreachable: The node is unreachable from the node controller. This corresponds to the node condition Ready=Unknown.
  • node.kubernetes.io/out-of-disk: The node has insufficient free space on the node for adding new pods. This corresponds to the node condition OutOfDisk=True.
  • node.kubernetes.io/memory-pressure: The node has memory pressure issues. This corresponds to the node condition MemoryPressure=True.
  • node.kubernetes.io/disk-pressure: The node has disk pressure issues. This corresponds to the node condition DiskPressure=True.
  • node.kubernetes.io/network-unavailable: The node network is unavailable.
  • node.kubernetes.io/unschedulable: The node is unschedulable.
  • node.cloudprovider.kubernetes.io/uninitialized: When the node controller is started with an external cloud provider, this taint is set on a node to mark it as unusable. After a controller from the cloud-controller-manager initializes this node, the kubelet removes this taint.

2.6.1.1. Understanding how to use toleration seconds to delay pod evictions

You can specify how long a pod can remain bound to a node before being evicted by specifying the tolerationSeconds parameter in the Pod specification or MachineSet object. If a taint with the NoExecute effect is added to a node, a pod that does tolerate the taint, which has the tolerationSeconds parameter, the pod is not evicted until that time period expires.

Example output

spec:
....
  template:
....
    spec
      tolerations:
      - key: "key1"
        operator: "Equal"
        value: "value1"
        effect: "NoExecute"
        tolerationSeconds: 3600

Here, if this pod is running but does not have a matching toleration, the pod stays bound to the node for 3,600 seconds and then be evicted. If the taint is removed before that time, the pod is not evicted.

2.6.1.2. Understanding how to use multiple taints

You can put multiple taints on the same node and multiple tolerations on the same pod. OpenShift Container Platform processes multiple taints and tolerations as follows:

  1. Process the taints for which the pod has a matching toleration.
  2. The remaining unmatched taints have the indicated effects on the pod:

    • If there is at least one unmatched taint with effect NoSchedule, OpenShift Container Platform cannot schedule a pod onto that node.
    • If there is no unmatched taint with effect NoSchedule but there is at least one unmatched taint with effect PreferNoSchedule, OpenShift Container Platform tries to not schedule the pod onto the node.
    • If there is at least one unmatched taint with effect NoExecute, OpenShift Container Platform evicts the pod from the node if it is already running on the node, or the pod is not scheduled onto the node if it is not yet running on the node.

      • Pods that do not tolerate the taint are evicted immediately.
      • Pods that tolerate the taint without specifying tolerationSeconds in their Pod specification remain bound forever.
      • Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time.

For example:

  • Add the following taints to the node:

    $ oc adm taint nodes node1 key1=value1:NoSchedule
    $ oc adm taint nodes node1 key1=value1:NoExecute
    $ oc adm taint nodes node1 key2=value2:NoSchedule
  • The pod has the following tolerations:

    spec:
    ....
      template:
    ....
        spec
          tolerations:
          - key: "key1"
            operator: "Equal"
            value: "value1"
            effect: "NoSchedule"
          - key: "key1"
            operator: "Equal"
            value: "value1"
            effect: "NoExecute"

In this case, the pod cannot be scheduled onto the node, because there is no toleration matching the third taint. The pod continues running if it is already running on the node when the taint is added, because the third taint is the only one of the three that is not tolerated by the pod.

2.6.1.3. Understanding pod scheduling and node conditions (taint node by condition)

The Taint Nodes By Condition feature, which is enabled by default, automatically taints nodes that report conditions such as memory pressure and disk pressure. If a node reports a condition, a taint is added until the condition clears. The taints have the NoSchedule effect, which means no pod can be scheduled on the node unless the pod has a matching toleration.

The scheduler checks for these taints on nodes before scheduling pods. If the taint is present, the pod is scheduled on a different node. Because the scheduler checks for taints and not the actual node conditions, you configure the scheduler to ignore some of these node conditions by adding appropriate pod tolerations.

To ensure backward compatibility, the daemon set controller automatically adds the following tolerations to all daemons:

  • node.kubernetes.io/memory-pressure
  • node.kubernetes.io/disk-pressure
  • node.kubernetes.io/out-of-disk (only for critical pods)
  • node.kubernetes.io/unschedulable (1.10 or later)
  • node.kubernetes.io/network-unavailable (host network only)

You can also add arbitrary tolerations to daemon sets.

2.6.1.4. Understanding evicting pods by condition (taint-based evictions)

The Taint-Based Evictions feature, which is enabled by default, evicts pods from a node that experiences specific conditions, such as not-ready and unreachable. When a node experiences one of these conditions, OpenShift Container Platform automatically adds taints to the node, and starts evicting and rescheduling the pods on different nodes.

Taint Based Evictions have a NoExecute effect, where any pod that does not tolerate the taint is evicted immediately and any pod that does tolerate the taint will never be evicted, unless the pod uses the tolerationSeconds parameter.

The tolerationSeconds parameter allows you to specify how long a pod stays bound to a node that has a node condition. If the condition still exists after the tolerationSeconds period, the taint remains on the node and the pods with a matching toleration are evicted. If the condition clears before the tolerationSeconds period, pods with matching tolerations are not removed.

If you use the tolerationSeconds parameter with no value, pods are never evicted because of the not ready and unreachable node conditions.

Note

OpenShift Container Platform evicts pods in a rate-limited way to prevent massive pod evictions in scenarios such as the master becoming partitioned from the nodes.

OpenShift Container Platform automatically adds a toleration for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300, unless the Pod configuration specifies either toleration.

spec:
....
  template:
....
    spec
      tolerations:
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 300 1
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 300
1
These tolerations ensure that the default pod behavior is to remain bound for five minutes after one of these node conditions problems is detected.

You can configure these tolerations as needed. For example, if you have an application with a lot of local state, you might want to keep the pods bound to node for a longer time in the event of network partition, allowing for the partition to recover and avoiding pod eviction.

Pods spawned by a daemon set are created with NoExecute tolerations for the following taints with no tolerationSeconds:

  • node.kubernetes.io/unreachable
  • node.kubernetes.io/not-ready

As a result, daemon set pods are never evicted because of these node conditions.

2.6.1.5. Tolerating all taints

You can configure a pod to tolerate all taints by adding an operator: "Exists" toleration with no key and value parameters. Pods with this toleration are not removed from a node that has taints.

Pod spec for tolerating all taints

spec:
....
  template:
....
    spec
      tolerations:
      - operator: "Exists"

2.6.2. Adding taints and tolerations

You add tolerations to pods and taints to nodes to allow the node to control which pods should or should not be scheduled on them. For existing pods and nodes, you should add the toleration to the pod first, then add the taint to the node to avoid pods being removed from the node before you can add the toleration.

Procedure

  1. Add a toleration to a pod by editing the Pod spec to include a tolerations stanza:

    Sample pod configuration file with an Equal operator

    spec:
    ....
      template:
    ....
        spec:
          tolerations:
          - key: "key1" 1
            value: "value1"
            operator: "Equal"
            effect: "NoExecute"
            tolerationSeconds: 3600 2

    1
    The toleration parameters, as described in the Taint and toleration components table.
    2
    The tolerationSeconds parameter specifies how long a pod can remain bound to a node before being evicted.

    For example:

    Sample pod configuration file with an Exists operator

    spec:
    ....
      template:
    ....
        spec:
          tolerations:
          - key: "key1"
            operator: "Exists" 1
            effect: "NoExecute"
            tolerationSeconds: 3600

    1
    The Exists operator does not take a value.

    This example places a taint on node1 that has key key1, value value1, and taint effect NoExecute.

  2. Add a taint to a node by using the following command with the parameters described in the Taint and toleration components table:

    $ oc adm taint nodes <node_name> <key>=<value>:<effect>

    For example:

    $ oc adm taint nodes node1 key1=value1:NoExecute

    This command places a taint on node1 that has key key1, value value1, and effect NoExecute.

    Note

    If you add a NoSchedule taint to a master node, the node must have the node-role.kubernetes.io/master=:NoSchedule taint, which is added by default.

    For example:

    apiVersion: v1
    kind: Node
    metadata:
      annotations:
        machine.openshift.io/machine: openshift-machine-api/ci-ln-62s7gtb-f76d1-v8jxv-master-0
        machineconfiguration.openshift.io/currentConfig: rendered-master-cdc1ab7da414629332cc4c3926e6e59c
    ...
    spec:
      taints:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
    ...

    The tolerations on the Pod match the taint on the node. A pod with either toleration can be scheduled onto node1.

2.6.2.1. Adding taints and tolerations using a machine set

You can add taints to nodes using a machine set. All nodes associated with the MachineSet object are updated with the taint. Tolerations respond to taints added by a machine set in the same manner as taints added directly to the nodes.

Procedure

  1. Add a toleration to a pod by editing the Pod spec to include a tolerations stanza:

    Sample pod configuration file with Equal operator

    spec:
    ....
      template:
    ....
        spec:
          tolerations:
          - key: "key1" 1
            value: "value1"
            operator: "Equal"
            effect: "NoExecute"
            tolerationSeconds: 3600 2

    1
    The toleration parameters, as described in the Taint and toleration components table.
    2
    The tolerationSeconds parameter specifies how long a pod is bound to a node before being evicted.

    For example:

    Sample pod configuration file with Exists operator

    spec:
    ....
      template:
    ....
        spec:
          tolerations:
          - key: "key1"
            operator: "Exists"
            effect: "NoExecute"
            tolerationSeconds: 3600

  2. Add the taint to the MachineSet object:

    1. Edit the MachineSet YAML for the nodes you want to taint or you can create a new MachineSet object:

      $ oc edit machineset <machineset>
    2. Add the taint to the spec.template.spec section:

      Example taint in a node specification

      spec:
      ....
        template:
      ....
          spec:
            taints:
            - effect: NoExecute
              key: key1
              value: value1
      ....

      This example places a taint that has the key key1, value value1, and taint effect NoExecute on the nodes.

    3. Scale down the machine set to 0:

      $ oc scale --replicas=0 machineset <machineset> -n openshift-machine-api

      Wait for the machines to be removed.

    4. Scale up the machine set as needed:

      $ oc scale --replicas=2 machineset <machineset> -n openshift-machine-api

      Wait for the machines to start. The taint is added to the nodes associated with the MachineSet object.

2.6.2.2. Binding a user to a node using taints and tolerations

If you want to dedicate a set of nodes for exclusive use by a particular set of users, add a toleration to their pods. Then, add a corresponding taint to those nodes. The pods with the tolerations are allowed to use the tainted nodes, or any other nodes in the cluster.

If you want ensure the pods are scheduled to only those tainted nodes, also add a label to the same set of nodes and add a node affinity to the pods so that the pods can only be scheduled onto nodes with that label.

Procedure

To configure a node so that users can use only that node:

  1. Add a corresponding taint to those nodes:

    For example:

    $ oc adm taint nodes node1 dedicated=groupName:NoSchedule
  2. Add a toleration to the pods by writing a custom admission controller.

2.6.2.3. Controlling nodes with special hardware using taints and tolerations

In a cluster where a small subset of nodes have specialized hardware, you can use taints and tolerations to keep pods that do not need the specialized hardware off of those nodes, leaving the nodes for pods that do need the specialized hardware. You can also require pods that need specialized hardware to use specific nodes.

You can achieve this by adding a toleration to pods that need the special hardware and tainting the nodes that have the specialized hardware.

Procedure

To ensure nodes with specialized hardware are reserved for specific pods:

  1. Add a toleration to pods that need the special hardware.

    For example:

    spec:
    ....
      template:
    ....
        spec:
          tolerations:
          - key: "disktype"
            value: "ssd"
            operator: "Equal"
            effect: "NoSchedule"
            tolerationSeconds: 3600
  2. Taint the nodes that have the specialized hardware using one of the following commands:

    $ oc adm taint nodes <node-name> disktype=ssd:NoSchedule

    Or:

    $ oc adm taint nodes <node-name> disktype=ssd:PreferNoSchedule

2.6.3. Removing taints and tolerations

You can remove taints from nodes and tolerations from pods as needed. You should add the toleration to the pod first, then add the taint to the node to avoid pods being removed from the node before you can add the toleration.

Procedure

To remove taints and tolerations:

  1. To remove a taint from a node:

    $ oc adm taint nodes <node-name> <key>-

    For example:

    $ oc adm taint nodes ip-10-0-132-248.ec2.internal key1-

    Example output

    node/ip-10-0-132-248.ec2.internal untainted

  2. To remove a toleration from a pod, edit the Pod spec to remove the toleration:

    spec:
    ....
      template:
    ....
        spec:
          tolerations:
          - key: "key2"
            operator: "Exists"
            effect: "NoExecute"
            tolerationSeconds: 3600

2.7. Placing pods on specific nodes using node selectors

A node selector specifies a map of key/value pairs that are defined using custom labels on nodes and selectors specified in pods.

For the pod to be eligible to run on a node, the pod must have the same key/value node selector as the label on the node.

2.7.1. About node selectors

You can use node selectors on pods and labels on nodes to control where the pod is scheduled. With node selectors, OpenShift Container Platform schedules the pods on nodes that contain matching labels.

You can use a node selector to place specific pods on specific nodes, cluster-wide node selectors to place new pods on specific nodes anywhere in the cluster, and project node selectors to place new pods in a project on specific nodes.

For example, as a cluster administrator, you can create an infrastructure where application developers can deploy pods only onto the nodes closest to their geographical location by including a node selector in every pod they create. In this example, the cluster consists of five data centers spread across two regions. In the U.S., label the nodes as us-east, us-central, or us-west. In the Asia-Pacific region (APAC), label the nodes as apac-east or apac-west. The developers can add a node selector to the pods they create to ensure the pods get scheduled on those nodes.

A pod is not scheduled if the Pod object contains a node selector, but no node has a matching label.

Important

If you are using node selectors and node affinity in the same pod configuration, the following rules control pod placement onto nodes:

  • If you configure both nodeSelector and nodeAffinity, both conditions must be satisfied for the pod to be scheduled onto a candidate node.
  • If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms is satisfied.
  • If you specify multiple matchExpressions associated with nodeSelectorTerms, then the pod can be scheduled onto a node only if all matchExpressions are satisfied.
Node selectors on specific pods and nodes

You can control which node a specific pod is scheduled on by using node selectors and labels.

To use node selectors and labels, first label the node to avoid pods being descheduled, then add the node selector to the pod.

Note

You cannot add a node selector directly to an existing scheduled pod. You must label the object that controls the pod, such as deployment config.

For example, the following Node object has the region: east label:

Sample Node object with a label

kind: Node
apiVersion: v1
metadata:
  name: ip-10-0-131-14.ec2.internal
  selfLink: /api/v1/nodes/ip-10-0-131-14.ec2.internal
  uid: 7bc2580a-8b8e-11e9-8e01-021ab4174c74
  resourceVersion: '478704'
  creationTimestamp: '2019-06-10T14:46:08Z'
  labels:
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/zone: us-east-1a
    node.openshift.io/os_version: '4.5'
    node-role.kubernetes.io/worker: ''
    failure-domain.beta.kubernetes.io/region: us-east-1
    node.openshift.io/os_id: rhcos
    beta.kubernetes.io/instance-type: m4.large
    kubernetes.io/hostname: ip-10-0-131-14
    beta.kubernetes.io/arch: amd64
    region: east 1

1
Label to match the pod node selector.

A pod has the type: user-node,region: east node selector:

Sample Pod object with node selectors

apiVersion: v1
kind: Pod

....

spec:
  nodeSelector: 1
    region: east
    type: user-node

1
Node selectors to match the node label.

When you create the pod using the example pod spec, it can be scheduled on the example node.

Default cluster-wide node selectors

With default cluster-wide node selectors, when you create a pod in that cluster, OpenShift Container Platform adds the default node selectors to the pod and schedules the pod on nodes with matching labels.

For example, the following Scheduler object has the default cluster-wide region=east and type=user-node node selectors:

Example Scheduler Operator Custom Resource

apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
...

spec:
  defaultNodeSelector: type=user-node,region=east
...

A node in that cluster has the type=user-node,region=east labels:

Example Node object

apiVersion: v1
kind: Node
metadata:
  name: ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4
...
  labels:
    region: east
    type: user-node
...

Example Pod object with a node selector

apiVersion: v1
kind: Pod
...

spec:
  nodeSelector:
    region: east
...

When you create the pod using the example pod spec in the example cluster, the pod is created with the cluster-wide node selector and is scheduled on the labeled node:

Example pod list with the pod on the labeled node

NAME     READY   STATUS    RESTARTS   AGE   IP           NODE                                       NOMINATED NODE   READINESS GATES
pod-s1   1/1     Running   0          20s   10.131.2.6   ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4   <none>           <none>

Note

If the project where you create the pod has a project node selector, that selector takes preference over a cluster-wide node selector. Your pod is not created or scheduled if the pod does not have the project node selector.

Project node selectors

With project node selectors, when you create a pod in this project, OpenShift Container Platform adds the node selectors to the pod and schedules the pods on a node with matching labels. If there is a cluster-wide default node selector, a project node selector takes preference.

For example, the following project has the region=east node selector:

Example Namespace object

apiVersion: v1
kind: Namespace
metadata:
  name: east-region
  annotations:
    openshift.io/node-selector: "region=east"
...

The following node has the type=user-node,region=east labels:

Example Node object

apiVersion: v1
kind: Node
metadata:
  name: ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4
...
  labels:
    region: east
    type: user-node
...

When you create the pod using the example pod spec in this example project, the pod is created with the project node selectors and is scheduled on the labeled node:

Example Pod object

apiVersion: v1
kind: Pod
metadata:
  namespace: east-region
...
spec:
  nodeSelector:
    region: east
    type: user-node
...

Example pod list with the pod on the labeled node

NAME     READY   STATUS    RESTARTS   AGE   IP           NODE                                       NOMINATED NODE   READINESS GATES
pod-s1   1/1     Running   0          20s   10.131.2.6   ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4   <none>           <none>

A pod in the project is not created or scheduled if the pod contains different node selectors. For example, if you deploy the following pod into the example project, it is not be created:

Example Pod object with an invalid node selector

apiVersion: v1
kind: Pod
...

spec:
  nodeSelector:
    region: west

....

2.7.2. Using node selectors to control pod placement

You can use node selectors on pods and labels on nodes to control where the pod is scheduled. With node selectors, OpenShift Container Platform schedules the pods on nodes that contain matching labels.

You add labels to a node, a machine set, or a machine config. Adding the label to the machine set ensures that if the node or machine goes down, new nodes have the label. Labels added to a node or machine config do not persist if the node or machine goes down.

To add node selectors to an existing pod, add a node selector to the controlling object for that pod, such as a ReplicaSet object, DaemonSet object, StatefulSet object, Deployment object, or DeploymentConfig object. Any existing pods under that controlling object are recreated on a node with a matching label. If you are creating a new pod, you can add the node selector directly to the Pod spec.

Note

You cannot add a node selector directly to an existing scheduled pod.

Prerequisites

To add a node selector to existing pods, determine the controlling object for that pod. For example, the router-default-66d5cf9464-m2g75 pod is controlled by the router-default-66d5cf9464 replica set:

$ oc describe pod router-default-66d5cf9464-7pwkc

Name:               router-default-66d5cf9464-7pwkc
Namespace:          openshift-ingress

....

Controlled By:      ReplicaSet/router-default-66d5cf9464

The web console lists the controlling object under ownerReferences in the pod YAML:

  ownerReferences:
    - apiVersion: apps/v1
      kind: ReplicaSet
      name: router-default-66d5cf9464
      uid: d81dd094-da26-11e9-a48a-128e7edf0312
      controller: true
      blockOwnerDeletion: true

Procedure

  1. Add labels to a node by using a machine set or editing the node directly:

    • Use a MachineSet object to add labels to nodes managed by the machine set when a node is created:

      1. Run the following command to add labels to a MachineSet object:

        $ oc patch MachineSet <name> --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"<key>"="<value>","<key>"="<value>"}}]'  -n openshift-machine-api

        For example:

        $ oc patch MachineSet abc612-msrtw-worker-us-east-1c  --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"type":"user-node","region":"east"}}]'  -n openshift-machine-api
      2. Verify that the labels are added to the MachineSet object by using the oc edit command:

        For example:

        $ oc edit MachineSet abc612-msrtw-worker-us-east-1c -n openshift-machine-api

        Example MachineSet object

        apiVersion: machine.openshift.io/v1beta1
        kind: MachineSet
        
        ....
        
        spec:
        ...
          template:
            metadata:
        ...
            spec:
              metadata:
                labels:
                  region: east
                  type: user-node
        ....

    • Add labels directly to a node:

      1. Edit the Node object for the node:

        $ oc label nodes <name> <key>=<value>

        For example, to label a node:

        $ oc label nodes ip-10-0-142-25.ec2.internal type=user-node region=east
      2. Verify that the labels are added to the node:

        $ oc get nodes -l type=user-node,region=east

        Example output

        NAME                          STATUS   ROLES    AGE   VERSION
        ip-10-0-142-25.ec2.internal   Ready    worker   17m   v1.18.3+002a51f

  2. Add the matching node selector a pod:

    • To add a node selector to existing and future pods, add a node selector to the controlling object for the pods:

      Example ReplicaSet object with labels

      kind: ReplicaSet
      
      ....
      
      spec:
      
      ....
      
        template:
          metadata:
            creationTimestamp: null
            labels:
              ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default
              pod-template-hash: 66d5cf9464
          spec:
            nodeSelector:
              beta.kubernetes.io/os: linux
              node-role.kubernetes.io/worker: ''
              type: user-node 1

      1
      Add the node selector.
    • To add a node selector to a specific, new pod, add the selector to the Pod object directly:

      Example Pod object with a node selector

      apiVersion: v1
      kind: Pod
      
      ....
      
      spec:
        nodeSelector:
          region: east
          type: user-node

      Note

      You cannot add a node selector directly to an existing scheduled pod.

2.7.3. Creating default cluster-wide node selectors

You can use default cluster-wide node selectors on pods together with labels on nodes to constrain all pods created in a cluster to specific nodes.

With cluster-wide node selectors, when you create a pod in that cluster, OpenShift Container Platform adds the default node selectors to the pod and schedules the pod on nodes with matching labels.

You configure cluster-wide node selectors by editing the Scheduler Operator custom resource (CR). You add labels to a node, a machine set, or a machine config. Adding the label to the machine set ensures that if the node or machine goes down, new nodes have the label. Labels added to a node or machine config do not persist if the node or machine goes down.

Note

You can add additional key/value pairs to a pod. But you cannot add a different value for a default key.

Procedure

To add a default cluster-wide node selector:

  1. Edit the Scheduler Operator CR to add the default cluster-wide node selectors:

    $ oc edit scheduler cluster

    Example Scheduler Operator CR with a node selector

    apiVersion: config.openshift.io/v1
    kind: Scheduler
    metadata:
      name: cluster
    ...
    
    spec:
      defaultNodeSelector: type=user-node,region=east 1
      mastersSchedulable: false
      policy:
        name: ""

    1
    Add a node selector with the appropriate <key>:<value> pairs.

    After making this change, wait for the pods in the openshift-kube-apiserver project to redeploy. This can take several minutes. The default cluster-wide node selector does not take effect until the pods redeploy.

  2. Add labels to a node by using a machine set or editing the node directly:

    • Use a machine set to add labels to nodes managed by the machine set when a node is created:

      1. Run the following command to add labels to a MachineSet object:

        $ oc patch MachineSet <name> --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"<key>"="<value>","<key>"="<value>"}}]'  -n openshift-machine-api 1
        1
        Add a <key>/<value> pair for each label.

        For example:

        $ oc patch MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"type":"user-node","region":"east"}}]'  -n openshift-machine-api
      2. Verify that the labels are added to the MachineSet object by using the oc edit command:

        For example:

        $ oc edit MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api

        Example output

        apiVersion: machine.openshift.io/v1beta1
        kind: MachineSet
        metadata:
        ...
        spec:
        ...
          template:
            metadata:
        ...
            spec:
              metadata:
                labels:
                  region: east
                  type: user-node

      3. Redeploy the nodes associated with that machine set by scaling down to 0 and scaling up the nodes:

        For example:

        $ oc scale --replicas=0 MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api
        $ oc scale --replicas=1 MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api
      4. When the nodes are ready and available, verify that the label is added to the nodes by using the oc get command:

        $ oc get nodes -l <key>=<value>

        For example:

        $ oc get nodes -l type=user-node

        Example output

        NAME                                       STATUS   ROLES    AGE   VERSION
        ci-ln-l8nry52-f76d1-hl7m7-worker-c-vmqzp   Ready    worker   61s   v1.18.3+002a51f

    • Add labels directly to a node:

      1. Edit the Node object for the node:

        $ oc label nodes <name> <key>=<value>

        For example, to label a node:

        $ oc label nodes ci-ln-l8nry52-f76d1-hl7m7-worker-b-tgq49 type=user-node region=east
      2. Verify that the labels are added to the node using the oc get command:

        $ oc get nodes -l <key>=<value>,<key>=<value>

        For example:

        $ oc get nodes -l type=user-node,region=east

        Example output

        NAME                                       STATUS   ROLES    AGE   VERSION
        ci-ln-l8nry52-f76d1-hl7m7-worker-b-tgq49   Ready    worker   17m   v1.18.3+002a51f

2.7.4. Creating project-wide node selectors

You can use node selectors in a project together with labels on nodes to constrain all pods created in that project to the labeled nodes.

When you create a pod in this project, OpenShift Container Platform adds the node selectors to the pods in the project and schedules the pods on a node with matching labels in the project. If there is a cluster-wide default node selector, a project node selector takes preference.

You add node selectors to a project by editing the Namespace object to add the openshift.io/node-selector parameter. You add labels to a node, a machine set, or a machine config. Adding the label to the machine set ensures that if the node or machine goes down, new nodes have the label. Labels added to a node or machine config do not persist if the node or machine goes down.

A pod is not scheduled if the Pod object contains a node selector, but no project has a matching node selector. When you create a pod from that spec, you receive an error similar to the following message:

Example error message

Error from server (Forbidden): error when creating "pod.yaml": pods "pod-4" is forbidden: pod node label selector conflicts with its project node label selector

Note

You can add additional key/value pairs to a pod. But you cannot add a different value for a project key.

Procedure

To add a default project node selector:

  1. Create a project or edit an existing project to add the openshift.io/node-selector parameter:

    $ oc edit project <name>
    apiVersion: v1
    kind: Project
    metadata:
      annotations:
        openshift.io/node-selector: "type=user-node,region=east" 1
        openshift.io/sa.scc.mcs: s0:c17,c14
        openshift.io/sa.scc.supplemental-groups: 1000300000/10000
        openshift.io/sa.scc.uid-range: 1000300000/10000
      creationTimestamp: 2019-06-10T14:39:45Z
      labels:
        openshift.io/run-level: "0"
      name: demo
      resourceVersion: "401885"
      selfLink: /api/v1/namespaces/openshift-kube-apiserver
      uid: 96ecc54b-8b8d-11e9-9f54-0a9ae641edd0
    spec:
      finalizers:
      - kubernetes
    status:
      phase: Active
    1
    Add the openshift.io/node-selector with the appropriate <key>:<value> pairs.
  2. Add labels to a node by using a machine set or editing the node directly:

    • Use a MachineSet object to add labels to nodes managed by the machine set when a node is created:

      1. Run the following command to add labels to a MachineSet object:

        $ oc patch MachineSet <name> --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"<key>"="<value>","<key>"="<value>"}}]'  -n openshift-machine-api

        For example:

        $ oc patch MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"type":"user-node","region":"east"}}]'  -n openshift-machine-api
      2. Verify that the labels are added to the MachineSet object by using the oc edit command:

        For example:

        $ oc edit MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api

        Example output

        apiVersion: machine.openshift.io/v1beta1
        kind: MachineSet
        metadata:
        ...
        spec:
        ...
          template:
            metadata:
        ...
            spec:
              metadata:
                labels:
                  region: east
                  type: user-node

      3. Redeploy the nodes associated with that machine set:

        For example:

        $ oc scale --replicas=0 MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api
        $ oc scale --replicas=1 MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api
      4. When the nodes are ready and available, verify that the label is added to the nodes by using the oc get command:

        $ oc label MachineSet abc612-msrtw-worker-us-east-1c type=user-node region=east

        For example:

        $ oc get nodes -l type=user-node

        Example output

        NAME                                       STATUS   ROLES    AGE   VERSION
        ci-ln-l8nry52-f76d1-hl7m7-worker-c-vmqzp   Ready    worker   61s   v1.18.3+002a51f

    • Add labels directly to a node:

      1. Edit the Node object to add labels:

        $ oc label <resource> <name> <key>=<value>

        For example, to label a node:

        $ oc label nodes ci-ln-l8nry52-f76d1-hl7m7-worker-c-tgq49 type=user-node region=east
      2. Verify that the labels are added to the Node object using the oc get command:

        $ oc get nodes -l <key>=<value>

        For example:

        $ oc get nodes -l type=user-node,region=east

        Example output

        NAME                                       STATUS   ROLES    AGE   VERSION
        ci-ln-l8nry52-f76d1-hl7m7-worker-b-tgq49   Ready    worker   17m   v1.18.3+002a51f

2.8. Evicting pods using the descheduler

While the scheduler is used to determine the most suitable node to host a new pod, the descheduler can be used to evict a running pod so that the pod can be rescheduled onto a more suitable node.

Important

The descheduler is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see https://access.redhat.com/support/offerings/techpreview/.

2.8.1. About the descheduler

You can use the descheduler to evict pods based on specific strategies so that the pods can be rescheduled onto more appropriate nodes.

You can benefit from descheduling running pods in situations such as the following:

  • Nodes are underutilized or overutilized.
  • Pod and node affinity requirements, such as taints or labels, have changed and the original scheduling decisions are no longer appropriate for certain nodes.
  • Node failure requires pods to be moved.
  • New nodes are added to clusters.
  • Pods have been restarted too many times.
Important

The descheduler does not schedule replacement of evicted pods. The scheduler automatically performs this task for the evicted pods.

When the descheduler decides to evict pods from a node, it employs the following general mechanism:

  • Critical pods with priorityClassName set to system-cluster-critical or system-node-critical are never evicted.
  • Static, mirrored, or stand-alone pods that are not part of a replication controller, replica set, deployment, or job are never evicted because these pods will not be recreated.
  • Pods associated with daemon sets are never evicted.
  • Pods with local storage are never evicted.
  • Best effort pods are evicted before burstable and guaranteed pods.
  • All types of pods with the descheduler.alpha.kubernetes.io/evict annotation are evicted. This annotation is used to override checks that prevent eviction, and the user can select which pod is evicted. Users should know how and if the pod will be recreated.
  • Pods subject to pod disruption budget (PDB) are not evicted if descheduling violates its pod disruption budget (PDB). The pods are evicted by using eviction subresource to handle PDB.

2.8.2. Descheduler strategies

The following descheduler strategies are available:

Low node utilization

The LowNodeUtilization strategy finds nodes that are underutilized and evicts pods, if possible, from other nodes in the hope that recreation of evicted pods will be scheduled on these underutilized nodes.

The underutilization of nodes is determined by several configurable threshold parameters: CPU, memory, and number of pods. If a node’s usage is below the configured thresholds for all parameters (CPU, memory, and number of pods), then the node is considered to be underutilized.

You can also set a target threshold for CPU, memory, and number of pods. If a node’s usage is above the configured target thresholds for any of the parameters, then the node’s pods might be considered for eviction.

Additionally, you can use the NumberOfNodes parameter to set the strategy to activate only when the number of underutilized nodes is above the configured value. This can be helpful in large clusters where a few nodes might be underutilized frequently or for a short period of time.

Duplicate pods

The RemoveDuplicates strategy ensures that there is only one pod associated with a replica set, replication controller, deployment, or job running on same node. If there are more, then those duplicate pods are evicted for better spreading of pods in a cluster.

This situation could occur after a node failure, when a pod is moved to another node, leading to more than one pod associated with a replica set, replication controller, deployment, or job on that node. After the failed node is ready again, this strategy evicts the duplicate pod.

Violation of inter-pod anti-affinity

The RemovePodsViolatingInterPodAntiAffinity strategy ensures that pods violating inter-pod anti-affinity are removed from nodes.

This situation could occur when anti-affinity rules are created for pods that are already running on the same node.

Violation of node affinity

The RemovePodsViolatingNodeAffinity strategy ensures that pods violating node affinity are removed from nodes.

This situation could occur if a node no longer satisfies a pod’s affinity rule. If another node is available that satisfies the affinity rule, then the pod is evicted.

Violation of node taints

The RemovePodsViolatingNodeTaints strategy ensures that pods violating NoSchedule taints on nodes are removed.

This situation could occur if a pod is set to tolerate a taint key=value:NoSchedule and is running on a tainted node. If the node’s taint is updated or removed, the taint is no longer satisfied by the pod’s tolerations and the pod is evicted.

Too many restarts

The RemovePodsHavingTooManyRestarts strategy ensures that pods that have been restarted too many times are removed from nodes.

This situation could occur if a pod is scheduled on a node that is unable to start it. For example, if the node is having network issues and is unable to mount a networked persistent volume, then the pod should be evicted so that it can be scheduled on another node. Another example is if the pod is crashlooping.

This strategy has two configurable parameters: PodRestartThreshold and IncludingInitContainers. If a pod is restarted more than the configured PodRestartThreshold value, then the pod is evicted. You can use the IncludingInitContainers parameter to specify whether restarts for Init Containers should be calculated into the PodRestartThreshold value.

2.8.3. Installing the descheduler

The descheduler is not available by default. To enable the descheduler, you must install the Kube Descheduler Operator from OperatorHub. After the Kube Descheduler Operator is installed, you can then configure the eviction strategies.

Prerequisites

  • Cluster administrator privileges.
  • Access to the OpenShift Container Platform web console.

Procedure

  1. Log in to the OpenShift Container Platform web console.
  2. Create the required namespace for the Kube Descheduler Operator.

    1. Navigate to AdministrationNamespaces and click Create Namespace.
    2. Enter openshift-kube-descheduler-operator in the Name field and click Create.
  3. Install the Kube Descheduler Operator.

    1. Navigate to OperatorsOperatorHub.
    2. Type Kube Descheduler Operator into the filter box.
    3. Select the Kube Descheduler Operator and click Install.
    4. On the Install Operator page, select A specific namespace on the cluster. Select openshift-kube-descheduler-operator from the drop-down menu.
    5. Adjust the values for the Update Channel and Approval Strategy to the desired values.
    6. Click Install.
  4. Create a descheduler instance.

    1. From the OperatorsInstalled Operators page, click the Kube Descheduler Operator.
    2. Select the Kube Descheduler tab and click Create KubeDescheduler.
    3. Edit the settings as necessary and click Create.

You can now configure the strategies for the descheduler. There are no strategies enabled by default.

2.8.4. Configuring descheduler strategies

You can configure which strategies the descheduler uses to evict pods.

Prerequisites

  • Cluster administrator privileges.

Procedure

  1. Edit the KubeDescheduler object:

    $ oc edit kubedeschedulers.operator.openshift.io cluster -n openshift-kube-descheduler-operator
  2. Specify one or more strategies in the spec.strategies section.

    apiVersion: operator.openshift.io/v1beta1
    kind: KubeDescheduler
    metadata:
      name: cluster
      namespace: openshift-kube-descheduler-operator
    spec:
      deschedulingIntervalSeconds: 3600
      strategies:
        - name: "LowNodeUtilization" 1
          params:
           - name: "CPUThreshold"
             value: "10"
           - name: "MemoryThreshold"
             value: "20"
           - name: "PodsThreshold"
             value: "30"
           - name: "MemoryTargetThreshold"
             value: "40"
           - name: "CPUTargetThreshold"
             value: "50"
           - name: "PodsTargetThreshold"
             value: "60"
           - name: "NumberOfNodes"
             value: "3"
        - name: "RemoveDuplicates" 2
        - name: "RemovePodsHavingTooManyRestarts" 3
          params:
           - name: "PodRestartThreshold"
             value: "10"
           - name: "IncludingInitContainers"
             value: "false"
    1
    The LowNodeUtilization strategy provides additional parameters, such as CPUThreshold and MemoryThreshold, that you can optionally configure.
    2
    The RemoveDuplicates, RemovePodsViolatingInterPodAntiAffinity, RemovePodsViolatingNodeAffinity, and RemovePodsViolatingNodeTaints strategies do not have any additional parameters to configure.
    3
    The RemovePodsHavingTooManyRestarts strategy requires the PodRestartThreshold parameter to be set. It also provides the optional IncludingInitContainers parameter.

    You can enable multiple strategies and the order that the strategies are specified in is not important.

  3. Save the file to apply the changes.

2.8.5. Configuring additional descheduler settings

You can configure additional settings for the descheduler, such as how frequently it runs.

Prerequisites

  • Cluster administrator privileges.

Procedure

  1. Edit the KubeDescheduler object:

    $ oc edit kubedeschedulers.operator.openshift.io cluster -n openshift-kube-descheduler-operator
  2. Configure additional settings as necessary:

    apiVersion: operator.openshift.io/v1beta1
    kind: KubeDescheduler
    metadata:
      name: cluster
      namespace: openshift-kube-descheduler-operator
    spec:
      deschedulingIntervalSeconds: 3600 1
      flags:
      - --dry-run 2
      image: quay.io/openshift/origin-descheduler:4.5 3
    ...
    1
    Set number of seconds between descheduler runs. A value of 0 in this field runs the descheduler once and exits.
    2
    Set one or more flags to append to the descheduler pod. This flag must be in the format ready to pass to the binary.
    3
    Set the descheduler container image to deploy.
  3. Save the file to apply the changes.

2.8.6. Uninstalling the descheduler

You can remove the descheduler from your cluster by removing the descheduler instance and uninstalling the Kube Descheduler Operator. This procedure also cleans up the KubeDescheduler CRD and openshift-kube-descheduler-operator namespace.

Prerequisites

  • Cluster administrator privileges.
  • Access to the OpenShift Container Platform web console.

Procedure

  1. Log in to the OpenShift Container Platform web console.
  2. Delete the descheduler instance.

    1. From the OperatorsInstalled Operators page, click Kube Descheduler Operator.
    2. Select the Kube Descheduler tab.
    3. Click the Options menu kebab next to the cluster entry and select Delete KubeDescheduler.
    4. In the confirmation dialog, click Delete.
  3. Uninstall the Kube Descheduler Operator.

    1. Navigate to OperatorsInstalled Operators,
    2. Click the Options menu kebab next to the Kube Descheduler Operator entry and select Uninstall Operator.
    3. In the confirmation dialog, click Uninstall.
  4. Delete the openshift-kube-descheduler-operator namespace.

    1. Navigate to AdministrationNamespaces.
    2. Enter openshift-kube-descheduler-operator into the filter box.
    3. Click the Options menu kebab next to the openshift-kube-descheduler-operator entry and select Delete Namespace.
    4. In the confirmation dialog, enter openshift-kube-descheduler-operator and click Delete.
  5. Delete the KubeDescheduler CRD.

    1. Navigate to AdministrationCustom Resource Definitions.
    2. Enter KubeDescheduler into the filter box.
    3. Click the Options menu kebab next to the KubeDescheduler entry and select Delete CustomResourceDefinition.
    4. In the confirmation dialog, click Delete.