Chapter 23. Allocating Node Resources

23.1. Overview

To provide more reliable scheduling and minimize node resource overcommitment, each node can reserve a portion of its resources for use by all underlying node components (e.g., kubelet, kube-proxy, Docker) and the remaining system components (e.g., sshd, NetworkManager) on the host. Once specified, the scheduler has more information about the resources (e.g., memory, CPU) a node has allocated for pods.

23.2. Configuring Nodes for Allocated Resources

Resources reserved for node components are based on two node settings: kube-reserved and system-reserved.

You can use the kube-reserved setting to reserve a portion of memory dedicated for kubelet and the system-reserved setting for the rest of the system processes.


Use caution when setting these parameters. Enforcing kube-reserved and system-reserved limits can lead to critical system services on that node being starved of CPU or OOM killed. For more information on the effects of these settings, see Computing Allocated Resources.



Resources reserved for node resources, such as kubelet, the Docker daemon, SkyDNS, and kube-proxy. Default is none.


Resources reserved for the remaining system components. Default is none.

You can see which services are controlled by system-reserved with a tool such as libcgroup:

$ yum install libcgroup-tools
$ lscgroup memory:/system.slice

You can set these in the kubeletArguments section of the node configuration map by using a set of <resource_type>=<resource_quantity> pairs (e.g., cpu=200m,memory=512Mi). Add the section if it does not already exist:

Example 23.1. Node Allocatable Resources Settings

    - "cpu=200m,memory=512Mi"
    - "cpu=200m,memory=512Mi"

Do not edit the node-config.yaml file directly.

To determine appropriate values for these settings, determine a node’s resource usage using the node summary API. For more information, see System Resources Reported by Node.

After setting kube-reserved and system-reserved:

  • Monitor the memory usage of a node for high-water marks:

    $ ps aux | grep <service-name>

    For example:

    $ ps aux | grep atomic-openshift-node
    root       11089 11.5  0.3   112712  996  pts/1  R+    16:23  0:00  grep --color=auto atomic-openshift-node

    If this value is close to your kube-reserved mark, you can increase the kube-reserved value.

  • Monitor the memory usage of system services with a tool such as libcgroup:

    $ yum install libcgroup-tools
    $ cgget -g memory  /system.slice | grep memory.usage_in_bytes

    If this value is close to your system-reserved mark, you can increase the system-reserved value.

  • Use the OpenShift Container Platform Cluster Loader to measure performance metrics of your deployment at various cluster states.

23.3. Computing Allocated Resources

An allocated amount of a resource is computed based on the following formula:

[Allocatable] = [Node Capacity] - [kube-reserved] - [system-reserved] - [Hard-Eviction-Thresholds]

The withholding of Hard-Eviction-Thresholds from allocatable is a change in behavior to improve system reliability now that allocatable is enforced for end-user pods at the node level. The experimental-allocatable-ignore-eviction setting is available to preserve legacy behavior, but it will be deprecated in a future release.

If [Allocatable] is negative, it is set to 0.

23.4. Viewing Node Allocatable Resources and Capacity

To see a node’s current capacity and allocatable resources, you can run:

$ oc get node/<node_name> -o yaml
    cpu: "4"
    memory: 8010948Ki
    pods: "110"
    cpu: "4"
    memory: 8010948Ki
    pods: "110"

23.5. System Resources Reported by Node

Each node reports system resources utilized by the container runtime and kubelet. To better aid your ability to configure --system-reserved and --kube-reserved, you can determine a node’s resource usage using the node summary API, which is accessible at <master>/api/v1/nodes/<node>/proxy/stats/summary.

For instance, to access the resources from cluster.node22 node, you can run:

$ curl <certificate details> https://<master>/api/v1/nodes/cluster.node22/proxy/stats/summary
    "node": {
        "nodeName": "cluster.node22",
        "systemContainers": [
                "cpu": {
                    "usageCoreNanoSeconds": 929684480915,
                    "usageNanoCores": 190998084
                "memory": {
                    "rssBytes": 176726016,
                    "usageBytes": 1397895168,
                    "workingSetBytes": 1050509312
                "name": "kubelet"
                "cpu": {
                    "usageCoreNanoSeconds": 128521955903,
                    "usageNanoCores": 5928600
                "memory": {
                    "rssBytes": 35958784,
                    "usageBytes": 129671168,
                    "workingSetBytes": 102416384
                "name": "runtime"

See REST API Overview for more details about certificate details.

23.6. Node enforcement

The node is able to limit the total amount of resources that pods may consume based on the configured allocatable value. This feature significantly improves the reliability of the node by preventing pods from starving system services (for example: container runtime, node agent, etc.) for resources. It is strongly encouraged that administrators reserve resources based on the desired node utilization target in order to improve node reliability.

The node enforces resource constraints using a new cgroup hierarchy that enforces quality of service. All pods are launched in a dedicated cgroup hierarchy separate from system daemons.

To configure node enforcement, use the following parameters in the appropriate node configuration map.

Example 23.2. Node Cgroup Settings

    - "true" 1
    - "systemd" 2
    - "pods" 3
Enable or disable the new cgroup hierarchy managed by the node. Any change of this setting requires a full drain of the node. This flag must be true to allow the node to enforce node allocatable. We do not recommend users change this value.
The cgroup driver used by the node when managing cgroup hierarchies. This value must match the driver associated with the container runtime. Valid values are systemd and cgroupfs. The default is systemd.
A comma-delimited list of scopes for where the node should enforce node resource constraints. Valid values are pods, system-reserved, and kube-reserved. The default is pods. We do not recommend users change this value.

Optionally, the node can be made to enforce kube-reserved and system-reserved by specifying those tokens in the enforce-node-allocatable flag. If specified, the corresponding --kube-reserved-cgroup or --system-reserved-cgroup needs to be provided. In future releases, the node and container runtime will be packaged in a common cgroup separate from system.slice. Until that time, we do not recommend users change the default value of enforce-node-allocatable flag.

Administrators should treat system daemons similar to Guaranteed pods. System daemons can burst within their bounding control groups and this behavior needs to be managed as part of cluster deployments. Enforcing system-reserved limits can lead to critical system services on that node being starved of CPU or OOM killed. The recommendation is to enforce system-reserved only if operators have profiled their nodes exhaustively to determine precise estimates and are confident in their ability to recover if any process in that group is OOM killed.

As a result, we strongly recommended that users only enforce node allocatable for pods by default, and set aside appropriate reservations for system daemons to maintain overall node reliability.

You can verify which cgroup driver is set using the following command:

$ systemctl status atomic-openshift-node -l | grep cgroup-driver=

The output includes a response similar to the following:


For more information on managing and troubleshooting cgroup drivers, see Introduction to Control Groups (Cgroups).

23.7. Eviction Thresholds

If a node is under memory pressure, it can impact the entire node and all pods running on it. If a system daemon is using more than its reserved amount of memory, an OOM event may occur that can impact the entire node and all pods running on it. To avoid (or reduce the probability of) system OOMs the node provides Out Of Resource Handling.

By reserving some memory via the --eviction-hard flag, the node attempts to evict pods whenever memory availability on the node drops below the absolute value or percentage. If system daemons did not exist on a node, pods are limited to the memory capacity - eviction-hard. For this reason, resources set aside as a buffer for eviction before reaching out of memory conditions are not available for pods.

Here is an example to illustrate the impact of node allocatable for memory:

  • Node capacity is 32Gi
  • --kube-reserved is 2Gi
  • --system-reserved is 1Gi
  • --eviction-hard is set to <100Mi.

For this node, the effective node allocatable value is 28.9Gi. If the node and system components use up all their reservation, the memory available for pods is 28.9Gi, and kubelet will evict pods when it exceeds this usage.

If we enforce node allocatable (28.9Gi) via top level cgroups, then pods can never exceed 28.9Gi. Evictions would not be performed unless system daemons are consuming more than 3.1Gi of memory.

If system daemons do not use up all their reservation, with the above example, pods would face memcg OOM kills from their bounding cgroup before node evictions kick in. To better enforce QoS under this situation, the node applies the hard eviction thresholds to the top-level cgroup for all pods to be Node Allocatable + Eviction Hard Thresholds.

If system daemons do not use up all their reservation, the node will evict pods whenever they collectively consume more than 28.9Gi of memory. If eviction does not occur in time, a pod will be OOM killed if pods collectively consume 29Gi of memory.

23.8. Scheduler

The scheduler now uses the value of node.Status.Allocatable instead of node.Status.Capacity to decide if a node will become a candidate for pod scheduling.

By default, the node will report its machine capacity as fully schedulable by the cluster.