OpenShift Node is overloaded and/or using all of its memory when pods are scheduled

Environment

  • Red Hat OpenShift Container Platform (RHOCP) 3
  • Red Hat OpenShift Container Platform (RHOCP) 4
  • Azure Red Hat OpenShift (ARO)
    • v4.x
  • Red Hat OpenShift Service on AWS (ROSA)
    • v4.x

Issue

  • RHOCP node is invoking OOM Killer.
  • RHOCP node is under high CPU load.
  • A pod is scheduled to the node and uses all of the node's resources, crashing the node.
  • FailedScheduling events due to X Insufficient memory or X Insufficient cpu.
  • Pods are stuck in Pending state.
  • Pods are in CrashLoopBackOff state.

Resolution

  • Verify whether the pods running on the cluster are using resource limits. See the following links for more information; an example sketch follows them:

For Red Hat OpenShift Container Platform 3

For Red Hat OpenShift Container Platform 4
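
As a hedged example (the namespace, deployment name, and resource values below are placeholders, not recommendations from this solution), the requests and limits of running pods can be inspected and adjusted with commands along these lines:

# Show the CPU/memory requests and limits of every pod in a namespace;
# "<none>" indicates a missing value.
oc get pods -n <namespace> -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,CPU_LIM:.spec.containers[*].resources.limits.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,MEM_LIM:.spec.containers[*].resources.limits.memory'

# Add requests and limits to an existing deployment (placeholder values).
oc set resources deployment/<deployment-name> -n <namespace> \
  --requests=cpu=100m,memory=256Mi \
  --limits=cpu=500m,memory=512Mi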

Root Cause

One of the most common root causes of this behavior is that pods are scheduled to nodes without any resource limits. The best practice is to set only memory limits and to implement OOM Killer monitoring.
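
As a minimal sketch of such OOM Killer monitoring (assuming an RHOCP 4 cluster; <node-name> is a placeholder), recently OOM-killed containers and kernel OOM messages can be checked as follows:

# List containers whose last termination reason was OOMKilled.
oc get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep -i oomkilled

# Inspect a node's kernel log for OOM Killer activity.
oc debug node/<node-name> -- chroot /host journalctl -k | grep -i 'out of memory'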

Diagnostic Steps

  • Run oc adm top nodes to check the CPU and memory utilization of the cluster nodes.

  • Run oc describe node <node-name> to see which workloads are exceeding their CPU or memory limits; the output lists the pods running on the node and the requests/limits set for each.

  • Run oc describe node <node-name> | grep -A10 Allocated to show the allocated resources of the node.

Example:

oc describe node xxxx-xxxx-xxxx | grep -A10 Allocated

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests        Limits
  --------                       --------        ------
  cpu                            15479m (99%)    26650m (171%)
  memory                         26362Mi (42%)   47652Mi (76%)
  ephemeral-storage              117761Mi (25%)  471050Mi (102%)
  hugepages-1Gi                  0 (0%)          0 (0%)
  hugepages-2Mi                  0 (0%)          0 (0%)
  attachable-volumes-azure-disk  0               0

  • In this example, the node's overall CPU requests are at 99% and its CPU limits at 171%, meaning the node is overcommitted. If the pods running on this node reach their own limits, this can trigger pod eviction events; the commands below help identify which workloads contribute the most.
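
To narrow this down further, the following sketch (flag availability may vary with the oc client version; <node-name> is a placeholder) lists the heaviest consumers and the per-pod requests/limits on a node:

# Current pod resource usage across the cluster, sorted by memory (requires the metrics API).
oc adm top pods --all-namespaces --sort-by=memory

# Requests and limits of the pods scheduled to a specific node.
oc describe node <node-name> | grep -A30 'Non-terminated Pods'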

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
