Due to High Memory Consumption, aws-efs-csi-driver-controller is in CrashLoopBackOff


Environment

  • Red Hat OpenShift Service on AWS (ROSA)
    • 4.x
  • Red Hat OpenShift Dedicated (OSD)
    • 4.x
  • Red Hat OpenShift Container Platform (OCP)
    • 4.x

Issue

One or more of the following was observed with the aws-efs-csi-driver:

  • CSI driver pods going into CrashLoopBackOff status after creating PVCs

In some instances, CSI driver pods without memory limits impacted the control plane:

  • Cluster is unstable and control plane components are continuously going into CrashLoopBackOff
  • Controller manager is continuously crashing and restarting

Resolution

  • Use the EFS driver version included in the following releases or later:
    OpenShift Container Platform 4.10.52 RHSA-2023:0698
    OpenShift Container Platform 4.11.27 RHSA-2023:0651
    OpenShift Container Platform 4.12.0 RHSA-2022:7399

  • Do not create and delete PVCs on EFS volumes at high frequency, as this can result in the behavior described above.

  • Ensure the storage class created for this provisioner has a reasonable range set between gidRangeStart and gidRangeEnd. Any range wider than 1,000 would not be used, as 1,000 is the maximum number of PVCs for each EFS volume. A storage class sketch is shown after this list.
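
For reference, the following is a minimal storage class sketch with a bounded GID range. The name, file system ID, and the exact range values are placeholders and should be adapted to your environment:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc                          # example name
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-0123456789abcdef0    # placeholder, replace with your EFS file system ID
  directoryPerms: "700"
  basePath: "/dynamic_provisioning"
  gidRangeStart: "1000"                 # keep the range reasonably small
  gidRangeEnd: "2000"                   # a width of 1,000 matches the per-volume PVC limit above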

Root Cause

  • Creating and deleting EFS volumes at high frequency was observed to occasionally spike memory usage. Without limits set, the aws-efs-csi-driver-controller replicas were observed to sporadically spike to upwards of 20GB of memory usage in their csi-driver container, according to Prometheus container memory usage metrics (an example query is shown after this list).
  • Without those memory limits set, control plane nodes with limited memory were in some instances stressed to the point of exhaustion.

  • In other instances, storage classes using a GID range of several hundred million to 2 billion were observed to exhaust memory while the driver was searching for and allocating a GID for the PVC.
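
The per-container memory usage referenced above can be reviewed in the web console (for example, under Observe > Metrics). The following query is a sketch using the standard cAdvisor metric and label names, which may need adjusting to your monitoring setup:

container_memory_working_set_bytes{namespace="openshift-cluster-csi-drivers", pod=~"aws-efs-csi-driver-controller-.*", container="csi-driver"}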

Diagnostic Steps

  • Get the details of the CSI driver pods in the openshift-cluster-csi-drivers project:

$ oc get pods -o wide
aws-efs-csi-driver-controller-xxxxxxxxxx-xxxxx   2/4     CrashLoopBackOff   206 (2m47s ago)   6h25m   xx.xxx.x.xxx   ip-xx-xxx-x-xxx.us-west-x.compute.internal   <none>           <none>

$ oc adm top nodes ip-xx-xxx-x-xxx.us-west-x.compute.internal
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
ip-xx-xxx-x-xxx.us-west-x.compute.internal   6400m        80%    6359Mi          22% 
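
  • (Optional) Narrow memory usage down to a specific container. The --containers flag of oc adm top pods should show whether the csi-driver container described in the Root Cause is the one consuming memory; for example:

$ oc adm top pods -n openshift-cluster-csi-drivers --containers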
  • Get the kube-controller-manager pod details:
$ oc get pods

kube-controller-manager-ip-xx-xxx-x-xxx.us-west-x.compute.internal         3/4     CrashLoopBackOff   98 (4m32s ago)    5h8m

Ensure the driver pods have resource limits set, for example:

$ oc describe pod aws-efs-csi-driver-controller-6ddd68db54-b7sq6 -n openshift-cluster-csi-drivers
[...]
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  1Gi

$ oc describe pod aws-efs-csi-driver-node-7vg9l -n openshift-cluster-csi-drivers
[...]
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  1Gi
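
As an optional cross-check, a jsonpath query along the following lines (a sketch that may need adjusting) lists the memory limit of every container in the driver pods at once:

$ oc get pods -n openshift-cluster-csi-drivers -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.containers[*]}{.name}={.resources.limits.memory}{" "}{end}{"\n"}{end}'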

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
