aws-efs-csi-driver-controller is in CrashLoopBackOff due to high memory consumption
Environment
- Red Hat Openshift Service on AWS (ROSA)
- 4.x
- Red Hat Openshift Dedicated (OSD)
- 4.x
- Red Hat Openshift Container Platform (OCP)
- 4.x
Issue
One or more of the following was observed with the aws-efs-csi driver:
- CSI driver pods going into crashloop status after creating PVC
In some instances, CSI driver pods without limits impacted the control plane:
* Cluster is unstable and control plane components are continuously going in CrashLoopBackOff
* Controller manager continuously crashing and restarting
Resolution
- Use the EFS driver version included in these releases or later:
  - OpenShift Container Platform 4.10.52 (RHSA-2023:0698)
  - OpenShift Container Platform 4.11.27 (RHSA-2023:0651)
  - OpenShift Container Platform 4.12.0 (RHSA-2022:7399)
- Do not create and delete PVCs on EFS volumes at high frequency, as this can trigger the memory spikes described in the Root Cause section.
- Ensure the storage class using this provisioner sets a reasonable range between "gidRangeStart:" and "gidRangeEnd:". (A range larger than 1,000 is never used, as 1,000 is the maximum number of PVCs per EFS volume.)
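As an illustration, a storage class with a narrow GID range might look like the following sketch. The `fileSystemId` value is a placeholder, and the other parameter values are examples to adapt to your environment:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-0123456789abcdef0   # placeholder: replace with your EFS file system ID
  directoryPerms: "700"
  gidRangeStart: "1000"
  # Keep the range narrow: anything beyond 1,000 GIDs per volume is never used,
  # and very large ranges have been observed to exhaust driver memory.
  gidRangeEnd: "2000"
```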
Root Cause
- Creating and deleting EFS volumes at high frequency was observed to occasionally spike memory. Without limits set, the aws-efs-csi-driver-controller replicas were observed to sporadically spike to upwards of 20GB of memory usage in their csi-driver container, according to Prometheus container memory usage metrics.
- Without those memory limits set, in some instances control plane nodes with limited memory were stressed to exhaustion.
- In other instances, it was observed that storage classes using a GID range of several hundred million to 2 billion exhausted memory while the driver was searching for and allocating a GID for the PVC.
Diagnostic Steps
- Get the details of the CSI driver pods:
In the openshift-cluster-csi-drivers project:
$ oc get pods -o wide
aws-efs-csi-driver-controller-xxxxxxxxxx-xxxxx 2/4 CrashLoopBackOff 206 (2m47s ago) 6h25m xx.xxx.x.xxx ip-xx-xxx-x-xxx.us-west-x.compute.internal <none> <none>
$ oc adm top nodes ip-xx-xxx-x-xxx.us-west-x.compute.internal
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-xx-xxx-x-xxx.us-west-x.compute.internal 6400m 80% 6359Mi 22%
- Get the kube-controller-manager pod details:
$ oc get pods
kube-controller-manager-ip-xx-xxx-x-xxx.us-west-x.compute.internal 3/4 CrashLoopBackOff 98 (4m32s ago) 5h8m
Ensure pods have the limits set, for example:
$ oc describe pod aws-efs-csi-driver-controller-6ddd68db54-b7sq6 -n openshift-cluster-csi-drivers
[...]
Restart Count: 0
Limits:
cpu: 100m
memory: 1Gi
$ oc describe pod aws-efs-csi-driver-node-7vg9l -n openshift-cluster-csi-drivers
[...]
Restart Count: 0
Limits:
cpu: 100m
memory: 1Gi
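As a quick check, the memory limit can be pulled out of the `oc describe` output with awk. The sample text below is a hypothetical excerpt matching the output shown above; on a live cluster you would pipe `oc describe pod <pod> -n openshift-cluster-csi-drivers` into the same awk filter instead:

```shell
# Hypothetical excerpt of `oc describe pod` output (shape matches the example above)
sample='Restart Count:  0
Limits:
  cpu:     100m
  memory:  1Gi'

# Print the memory limit of the first container: after the "Limits:" line,
# take the value from the first "memory:" line and stop.
echo "$sample" | awk '/^Limits:/{f=1;next} f&&/memory:/{print $2; exit}'
```

If the command prints nothing, no limits are set on that container.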
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.