Due to High Memory Consumption, aws-efs-csi-driver-controller is in CrashLoopBackOff


Environment

  • Red Hat OpenShift Service on AWS (ROSA)
    • 4.x
  • Red Hat OpenShift Dedicated (OSD)
    • 4.x
  • Red Hat OpenShift Container Platform (OCP)
    • 4.x

Issue

One or more of the following was observed with the aws-efs-csi-driver:

  • CSI driver pods going into CrashLoopBackOff status after creating PVCs

In some instances, CSI driver pods without memory limits impacted the control plane:

  • Cluster is unstable and control plane components are continuously going into CrashLoopBackOff
  • Controller manager is continuously crashing and restarting

Resolution

  • Use the EFS driver version included in the following releases or later:
    OpenShift Container Platform 4.10.52 RHSA-2023:0698
    OpenShift Container Platform 4.11.27 RHSA-2023:0651
    OpenShift Container Platform 4.12.0 RHSA-2022:7399

  • Do not create and delete PVCs on EFS volumes at high frequency, as this can result in the behavior described above.

  • Ensure the storage class created for this provisioner has a reasonable range set between gidRangeStart and gidRangeEnd. Any range wider than 1,000 would not be used, as 1,000 is the maximum number of PVCs for each EFS volume. A storage class sketch is shown after this list.
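
For reference, the following is a minimal storage class sketch with a bounded GID range. The name, file system ID, and the exact range values are placeholders and should be adapted to your environment:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc                          # example name
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-0123456789abcdef0    # placeholder, replace with your EFS file system ID
  directoryPerms: "700"
  basePath: "/dynamic_provisioning"
  gidRangeStart: "1000"                 # keep the range reasonably small
  gidRangeEnd: "2000"                   # a width of 1,000 matches the per-volume PVC limit above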

Root Cause

  • Creating and deleting EFS volumes at high frequency was observed to occasionally spike memory usage. Without limits set, the aws-efs-csi-driver-controller replicas were observed to sporadically spike to upwards of 20GB of memory usage in their csi-driver container, according to Prometheus container memory usage metrics (an example query is shown after this list).
  • Without those memory limits set, control plane nodes with limited memory were in some instances stressed to the point of exhaustion.

  • In other instances, storage classes using a GID range of several hundred million to 2 billion were observed to exhaust memory while the driver was searching for and allocating a GID for the PVC.
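
The per-container memory usage referenced above can be reviewed in the web console (for example, under Observe > Metrics). The following query is a sketch using the standard cAdvisor metric and label names, which may need adjusting to your monitoring setup:

container_memory_working_set_bytes{namespace="openshift-cluster-csi-drivers", pod=~"aws-efs-csi-driver-controller-.*", container="csi-driver"}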

Diagnostic Steps

  • Get the details of the CSI driver pods in the openshift-cluster-csi-drivers project:

$ oc get pods -o wide
aws-efs-csi-driver-controller-xxxxxxxxxx-xxxxx   2/4     CrashLoopBackOff   206 (2m47s ago)   6h25m   xx.xxx.x.xxx   ip-xx-xxx-x-xxx.us-west-x.compute.internal   <none>           <none>

$ oc adm top nodes ip-xx-xxx-x-xxx.us-west-x.compute.internal
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
ip-xx-xxx-x-xxx.us-west-x.compute.internal   6400m        80%    6359Mi          22% 
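
  • (Optional) Narrow memory usage down to a specific container. The --containers flag of oc adm top pods should show whether the csi-driver container described in the Root Cause is the one consuming memory; for example:

$ oc adm top pods -n openshift-cluster-csi-drivers --containers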
  • Get the kube-controller-manager pod details:
$ oc get pods

kube-controller-manager-ip-xx-xxx-x-xxx.us-west-x.compute.internal         3/4     CrashLoopBackOff   98 (4m32s ago)    5h8m

Ensure the driver pods have resource limits set, for example:

$ oc describe pod aws-efs-csi-driver-controller-6ddd68db54-b7sq6 -n openshift-cluster-csi-drivers
[...]
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  1Gi

$ oc describe pod aws-efs-csi-driver-node-7vg9l -n openshift-cluster-csi-drivers
[...]
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  1Gi
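
As an optional cross-check, a jsonpath query along the following lines (a sketch that may need adjusting) lists the memory limit of every container in the driver pods at once:

$ oc get pods -n openshift-cluster-csi-drivers -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.containers[*]}{.name}={.resources.limits.memory}{" "}{end}{"\n"}{end}'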

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
