OCP4 Liveness probes failing after defender invoked oom-killer

Solution Verified

Environment

Red Hat OpenShift Container Platform (RHOCP):

  • 4.x
  • Twistlock

Issue

  • Pods are failing liveness probes.
  • Pods are sporadically restarting in all namespaces.

Resolution

  1. Increase the memory limits for the Twistlock Defender DaemonSet.
    • Alternatively, the DaemonSet can be removed.
oc edit ds/twistlock-defender-ds

resources:
  limits:
    cpu: xxx
    memory: xxxMi   <--- Increase
  requests:
    cpu: xxx
    memory: xxxMi   <--- Set the same value as limits.memory

Recommended values:

cpu: 1
memory: 4Gi
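As a non-interactive alternative to `oc edit`, the recommended values can be applied with `oc set resources`; this is a sketch, and the `twistlock` namespace is an assumption based on the DaemonSet name above, so adjust it to wherever the Defender actually runs.

```shell
# Hedged sketch: apply the recommended limits and requests in one step.
# The "twistlock" namespace is an assumption; verify with
# `oc get ds -A | grep defender` before running.
oc -n twistlock set resources ds/twistlock-defender-ds \
  --limits=cpu=1,memory=4Gi \
  --requests=cpu=1,memory=4Gi
```

Setting requests equal to limits gives the pod the Guaranteed QoS class, which makes it a less likely OOM-kill target.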

From Twistlock documentation

Issue:
Users encounter out-of-memory (OOM) issues when running an inline container firewall on Defender nodes operating within Kubernetes (k8s), particularly when there are more than 10 protected workloads on the same node.

Reason:
The hierarchical nature of cgroups in Kubernetes means that the memory limit set for the pod takes precedence over the memory limit set specifically for the Defender cgroup. Consequently, this hierarchy can lead to OOM errors, especially when multiple protected workloads are present on the same node.
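The hierarchy can be inspected directly on a node. This is an illustrative sketch only: it assumes a cgroup v2 layout where the kubelet places pods under /sys/fs/cgroup/kubepods.slice, and the exact paths vary with the node's cgroup driver and OpenShift version.

```shell
# Illustrative only: print pod-level memory.max values so the pod limit
# that takes precedence over the Defender's own cgroup limit is visible.
# Paths are assumptions for a cgroup v2 node; adjust for your layout.
oc debug node/<NodeName> -- chroot /host sh -c \
  'for f in /sys/fs/cgroup/kubepods.slice/*/memory.max; do
     printf "%s: %s\n" "$f" "$(cat "$f")"
   done'
```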

Solution:
When deploying inline firewall instances on a Defender node with more than 10 protected workloads, increase the memory allocation for the Defender pod to 4GB.

Root Cause

This solution covers the Twistlock Defender DaemonSet invoking the OOM killer: the DaemonSet can either be removed or have its resources increased to avoid the OOMKill, which is triggered when cgroups kill the container for exceeding its memory limit.
This is not to be confused with the separate solution Readiness probes fail in Openshift pods after Twistlock pod is restarted or deleted, in which the Twistlock container mounts the host's container storage path at /var/lib/containers/storage with read/write permissions and removes container image layers still in use by the CRI-O container engine.

Twistlock is a resource-intensive privileged process that reads images at runtime and indexes containers. When Twistlock cannot complete its tasks on time and exits or crashes due to memory constraints, corresponding failures appear in the kubelet on the same host where the OOM kill was observed, leading to intermittent readiness/liveness probe failures in OTHER PODS on that host, in any namespace. It is suspected that kubelet processing is interrupted or stalled until the Twistlock process is re-instantiated (a process order-of-operation delay), causing pods on that host to miss their check-in windows, but this is speculative. The cause of this correlated behavior is not clearly understood; however, alleviating the OOM-kill behavior within Twistlock resolves the impact on the affected containers in the cluster.

Diagnostic Steps

On a node where probes are failing, run the commands below.
If you see any instances of the Defender invoking the OOM killer, this is a possible cause of the probe failures.

oc debug node/<NodeName>
dmesg | grep oom-killer | grep defender
[xxxx] defender invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), order=0, oom_score_adj=999
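To confirm which process invoked the OOM killer and its oom_score_adj, the matched line can be parsed with sed. This is a small sketch; the sample line below is modeled on the dmesg output shown in this article, not taken from a live system.

```shell
# Hedged sketch: extract the invoking process name and oom_score_adj
# from a dmesg oom-killer line. The sample line is an assumption
# modeled on the output above.
line='[87543.109375] defender invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=999'

# Capture the word between the timestamp and "invoked oom-killer".
proc=$(echo "$line" | sed -n 's/^\[[^]]*\] \([^ ]*\) invoked oom-killer.*/\1/p')

# Capture the numeric oom_score_adj value.
adj=$(echo "$line" | sed -n 's/.*oom_score_adj=\([0-9-]*\).*/\1/p')

echo "process=$proc oom_score_adj=$adj"
```

A high oom_score_adj (such as 999) means the kernel prefers that process as an OOM-kill victim.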


Iterate across all nodes:
for i in $(oc get nodes --no-headers | awk '{print $1}'); do echo $i; oc debug node/$i -- chroot /host sh -c "dmesg | grep oom-killer | grep defender | tail -n 1"; done

[87543.109375] defender invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=99
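If kernel logs have already been captured to files (for example from an sosreport), the events can also be counted offline to identify the worst-affected nodes. A minimal sketch; the capture file name in the usage example is hypothetical.

```shell
# Hedged sketch: count defender OOM events in a saved kernel log.
# "node1-dmesg.txt" in the usage example below is a hypothetical
# capture file, not a path produced by the commands above.
count_defender_ooms() {
  # grep -c exits non-zero when the count is 0; mask that so callers
  # always get a number back.
  grep -c 'defender invoked oom-killer' "$1" || true
}

# Example usage:
# count_defender_ooms node1-dmesg.txt
```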

If no OOM events are found in dmesg but the node-wide probe failure pattern matches (multiple unrelated pods failing probes simultaneously with "Client.Timeout exceeded while awaiting headers"), see KCS-7142369 for the non-OOM Twistlock resource starvation scenario where Defender falls behind on nfqueue packet inspection without triggering an OOM kill.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
