OpenShift 4: Rate limiting and kubelet healthcheck failures during advanced load testing + conntrack tuning on OpenShift 4
Issue
- On OpenShift 4.14 using OVN-Kubernetes, traffic from a load test was observed to be rate limited to around 35k requests per second to a single host accepting UDP traffic over the primary interface of the node.
- Traffic originating from the kubelet on the host to pods started to degrade at this threshold, leading to lost SYN packets and failed health probes on all pods on affected hosts receiving traffic.
- Observe chaintoolong messaging in logs on host:
sudo conntrack -S | grep -v chaintoolong=0
cpu=0 found=9466 invalid=5 insert=0 insert_failed=18776 drop=2 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=18774
- Pods from all projects/namespaces hosted on the node accepting the load test flap their READY status.
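
The conntrack -S output shown above is reported per CPU, so on a multi-core node it can help to total a counter across all lines before comparing nodes or test runs. The following is a minimal sketch, not part of the original diagnosis: sum_conntrack_counter is a hypothetical helper name, and it assumes the key=value output format shown in the sample line above.

```shell
# Hypothetical helper (not part of conntrack-tools): sums one named counter
# across all per-CPU lines of `conntrack -S` output read from stdin.
sum_conntrack_counter() {
  awk -v key="$1" '
    {
      # Each field looks like name=value; add up the values for the requested name.
      for (i = 1; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == key) total += kv[2]
      }
    }
    END { printf "%.0f\n", total }
  '
}

# Usage on a node (requires root and the conntrack tool):
#   sudo conntrack -S | sum_conntrack_counter chaintoolong
#   sudo conntrack -S | sum_conntrack_counter insert_failed
```

A total that keeps growing between two runs taken during the load test suggests the condition is ongoing rather than historical, since the counters are cumulative.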
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4.14.14 and later (the tuning is observed to still be required in 4.16+)
- OpenShift OVN-Kubernetes CNI
- Tigera/Calico CNI