How to detect pods which have overstepped their ephemeral-storage limit

Solution Verified - Updated -

Environment

  • Red Hat OpenShift Container Platform 4.

Issue

At the time of writing, no pod status specifically indicates that a pod has exceeded its ephemeral-storage limit. When that happens, the pod status becomes ContainerStatusUnknown.

Resolution

A pod in ContainerStatusUnknown status has exceeded its ephemeral-storage limit if the project contains an event for that pod with the following text (see the message field):

- apiVersion: v1
  count: 1
  eventTime: null
  firstTimestamp: "<date>"
  involvedObject:
    apiVersion: v1
    kind: Pod
    name: <pod_name>
    namespace: sre-ci-test
    resourceVersion: "1359040712"
    uid: 425703de-3fdd-42b7-a34c-858ddb2aed8a
  kind: Event
  lastTimestamp: "2022-07-19T09:10:13Z"
  message: 'Pod ephemeral local storage usage exceeds the total limit of containers 0. '
[...]
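
The check described above can be scripted for a single pod. The following is a minimal sketch, assuming a logged-in oc session and that events can be filtered with the involvedObject.kind and involvedObject.name field selectors. The function name pod_exceeded_ephemeral_storage is hypothetical and is not part of any Red Hat tooling.

```shell
# Hypothetical helper: exits 0 if the given pod has an event whose message
# reports that it exceeded its ephemeral-storage limit, non-zero otherwise.
# Usage: pod_exceeded_ephemeral_storage <project_name> <pod_name>
pod_exceeded_ephemeral_storage() {
  local project="$1" pod="$2"
  # Keep only the events that involve this pod, print one message per line,
  # then search for the eviction text as a fixed string.
  oc -n "$project" get events \
      --field-selector "involvedObject.kind=Pod,involvedObject.name=${pod}" \
      -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' \
    | grep -qF 'Pod ephemeral local storage usage exceeds the total limit of containers'
}
```

For example, `pod_exceeded_ephemeral_storage <project_name> <pod_name> && echo "limit exceeded"` prints the text only when a matching event exists. Events are retained for a limited time, so a pod that was evicted long ago may no longer have a matching event.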

Root Cause

SRVKP-2552 was created to request that pods killed for this reason get a specific status instead of the generic ContainerStatusUnknown.

Diagnostic Steps

The following command can be executed to search for these messages in a project:

oc -n <project_name> get events | grep -F 'Pod ephemeral local storage usage exceeds the total limit of containers'

Output example:

1h47m       Warning   Evicted                 pod/<pod_name>                    Pod ephemeral local storage usage exceeds the total limit of containers 0.
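
When several pods are affected, it can help to reduce output like the line above to a list of unique pod names. The following is a small sketch that filters the text output of the events listing. The function name evicted_for_ephemeral_storage is hypothetical, and the filter assumes the object column has the pod/<pod_name> form shown in the example.

```shell
# Hypothetical filter: reads event listing text on stdin and prints the
# unique pod/<pod_name> objects evicted for exceeding ephemeral storage.
evicted_for_ephemeral_storage() {
  # Keep only the eviction messages, extract the pod/<name> token,
  # and remove duplicates caused by repeated events.
  grep -F 'Pod ephemeral local storage usage exceeds the total limit of containers' \
    | grep -o 'pod/[^[:space:]]*' \
    | sort -u
}

# Usage (assumes a logged-in oc session; not executed here):
#   oc -n <project_name> get events | evicted_for_ephemeral_storage
```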

