OpenShift Node is overloaded and/or using all of its memory when pods are scheduled
Environment
- Red Hat OpenShift Container Platform (RHOCP) 3
- Red Hat OpenShift Container Platform (RHOCP) 4
- Azure Red Hat OpenShift (ARO)
- v4.x
- Red Hat OpenShift in AWS (ROSA)
- v4.x
Issue
- RHOCP node is invoking OOM Killer.
- RHOCP node is under high CPU load.
- A pod is scheduled to the node and uses all of the node's resources, crashing the node.
- FailedScheduling events due to X Insufficient memory or X Insufficient cpu (where X is the number of affected nodes).
- Pods are stuck in Pending state.
- Pods are in CrashLoopBackOff state.
Resolution
- Verify that the pods running on the cluster have resource limits set. See the following links for more information:
For Red Hat OpenShift Container Platform 3
- Capacity management monitoring
- Protecting nodes system resources
- Allocating node resources
- Out of resource handling
- Overcommitting Node
For Red Hat OpenShift Container Platform 4
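As a minimal sketch, resource requests and limits are set per container in the pod spec. The names and values below are illustrative only, not taken from this article:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app          # hypothetical pod name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # hypothetical image
    resources:
      requests:              # what the scheduler reserves on the node
        cpu: 250m
        memory: 256Mi
      limits:                # hard caps; exceeding the memory limit triggers OOM kill
        cpu: 500m
        memory: 512Mi
```

With requests set, the scheduler only places the pod on nodes with enough free allocatable capacity, which prevents a single pod from consuming all of a node's resources.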
Root Cause
One of the most common root causes of this behavior is pods being scheduled to nodes without any resource limits. The best practice is to set only memory limits and to implement OOM killer monitoring.
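One way to guard against pods landing on nodes with no limits at all is a LimitRange in the namespace, which injects default requests and limits into containers that do not declare their own. A sketch with hypothetical values:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits       # hypothetical name
spec:
  limits:
  - type: Container
    defaultRequest:          # applied when a container omits requests
      cpu: 250m
      memory: 256Mi
    default:                 # applied when a container omits limits
      cpu: 500m
      memory: 512Mi
```

This does not replace sizing workloads properly, but it ensures every scheduled container counts against the node's allocatable resources.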
Diagnostic Steps
- Run the following command to check the CPU and memory utilization of the cluster nodes:
oc adm top nodes
- Run the following command to see which workloads have exceeded their CPU or memory limits; the output lists the pods running on the node along with their requests and limits:
oc describe node
- Run the following command to show the allocated resources of the node:
oc describe node | grep -A10 Allocated
Example:
oc describe node xxxx-xxxx-xxxx | grep -A10 Allocated
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 15479m (99%) 26650m (171%)
memory 26362Mi (42%) 47652Mi (76%)
ephemeral-storage 117761Mi (25%) 471050Mi (102%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-azure-disk 0 0
- The example shows that this node has an overall CPU request of 99% and CPU limits of 171%, meaning the node is overcommitted: if the pods running on this node hit their own limits, pod eviction events can occur.
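The percentages in the Allocated resources table are each total divided by the node's allocatable amount for that resource. A quick sketch of the CPU limits figure, assuming a node with roughly 15500m allocatable CPU (a hypothetical value; the real allocatable figure appears in the full oc describe node output, not in the excerpt above):

```shell
# Sum of container CPU limits on the node, taken from the example output
limits_cpu_m=26650
# Assumed node allocatable CPU in millicores (hypothetical value)
allocatable_cpu_m=15500
# Integer percentage, matching the style of the oc describe node output
pct=$(( limits_cpu_m * 100 / allocatable_cpu_m ))
echo "CPU limits: ${pct}% of allocatable"   # prints 171% with these numbers
```

Anything over 100% means the node is overcommitted for that resource: the commitments only become a problem when pods actually consume up to their limits at the same time.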
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.