Prometheus pods often failing liveness probes and being restarted


Environment

  • Red Hat OpenShift Container Platform 4.14 and earlier.

Issue

The Prometheus pods named prometheus-k8s-0 and prometheus-k8s-1 are frequently restarted.

$ oc get pod -n openshift-monitoring
NAME                                                    READY   STATUS    RESTARTS   AGE
[...]
prometheus-k8s-0                                        5/6     Running   122        8d
prometheus-k8s-1                                        5/6     Running   125        8d
[...]

The problem happens in clusters with a high number of nodes and namespaces.
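
As a rough indicator of cluster scale (exact thresholds are not documented, so this is only an orientation), the number of nodes and namespaces can be counted with:

$ oc get nodes --no-headers | wc -l
$ oc get namespaces --no-headers | wc -l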

Resolution

The problem is fixed through bug OCPBUGS-18971, which is not publicly accessible.

Versions fixed and their respective errata:
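
To compare the running cluster against the fixed versions once they are known, the currently installed version can be checked with:

$ oc get clusterversion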

Root Cause

In clusters with a high number of nodes and namespaces, if a person or a component executes a query that involves a large number of TCP connections (for example, to check the CPU consumption of all the pods in the cluster), Prometheus becomes overloaded and fails its liveness probes.

A piece of software that typically executes this kind of query is the Vertical Pod Autoscaler.
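
For illustration only (this exact expression is an assumption, not taken from the bug report), a query of this kind could be a cluster-wide CPU aggregation such as:

sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod)

Whether the Vertical Pod Autoscaler Operator is installed can be checked, for example, with:

$ oc get pods -n openshift-vertical-pod-autoscaler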

Diagnostic Steps

  • Execute oc logs prometheus-k8s-0 -n openshift-monitoring --previous and oc logs prometheus-k8s-1 -n openshift-monitoring --previous and confirm that the previous log ends with the reception of a SIGTERM signal:
2024-02-14T08:54:22.561386876Z ts=2024-02-14T08:54:22.561Z caller=main.go:974 level=info msg="Server is ready to receive web requests."
2024-02-14T08:54:22.561467314Z ts=2024-02-14T08:54:22.561Z caller=manager.go:937 level=info component="rule manager" msg="Starting rule manager..."
2024-02-14T09:04:01.906108316Z ts=2024-02-14T09:04:01.905Z caller=compact.go:519 level=info component=tsdb msg="write block" mint=1707890400000 maxt=1707897600000 ulid=01HPKDG9MDDVVJ9RAZJM5CX1QA duration=4m1.892974821s
2024-02-14T09:04:02.559038662Z ts=2024-02-14T09:04:02.558Z caller=db.go:1292 level=info component=tsdb msg="Deleting obsolete block" block=01HPGM7VCC02CK7AC79S2F3K3Z
2024-02-14T09:04:42.166461913Z ts=2024-02-14T09:04:42.166Z caller=main.go:824 level=warn msg="Received SIGTERM, exiting gracefully..."
2024-02-14T09:04:42.166461913Z ts=2024-02-14T09:04:42.166Z caller=main.go:848 level=info msg="Stopping scrape discovery manager..."
2024-02-14T09:04:42.166621368Z ts=2024-02-14T09:04:42.166Z caller=main.go:844 level=info msg="Scrape discovery manager stopped"
2024-02-14T09:04:42.166865441Z ts=2024-02-14T09:04:42.166Z caller=main.go:862 level=info msg="Stopping notify discovery manager..."
2024-02-14T09:04:42.166907486Z ts=2024-02-14T09:04:42.166Z caller=manager.go:951 level=info component="rule manager" msg="Stopping rule manager..."
2024-02-14T09:04:42.167658796Z ts=2024-02-14T09:04:42.167Z caller=main.go:858 level=info msg="Notify discovery manager stopped"
2024-02-14T09:04:47.800703061Z ts=2024-02-14T09:04:47.800Z caller=manager.go:961 level=info component="rule manager" msg="Rule manager stopped"
2024-02-14T09:04:47.800703061Z ts=2024-02-14T09:04:47.800Z caller=main.go:899 level=info msg="Stopping scrape manager..."
2024-02-14T09:04:48.180078006Z ts=2024-02-14T09:04:48.179Z caller=main.go:891 level=info msg="Scrape manager stopped"
2024-02-14T09:04:48.609881757Z ts=2024-02-14T09:04:48.609Z caller=head.go:840 level=info component=tsdb msg="Head GC completed" duration=46.050800069s
2024-02-14T09:04:48.933952312Z ts=2024-02-14T09:04:48.933Z caller=checkpoint.go:97 level=info component=tsdb msg="Creating checkpoint" from_segment=7094 to_segment=7150 mint=1707897600000
2024-02-14T09:08:20.373599528Z ts=2024-02-14T09:08:20.373Z caller=head.go:1009 level=info component=tsdb msg="WAL checkpoint complete" first=7094 last=7150 duration=3m31.439900915s
2024-02-14T09:08:20.555191199Z ts=2024-02-14T09:08:20.555Z caller=notifier.go:599 level=info component=notifier msg="Stopping notification manager..."
2024-02-14T09:08:20.555191199Z ts=2024-02-14T09:08:20.555Z caller=main.go:1120 level=info msg="Notifier manager stopped"
2024-02-14T09:08:20.555574860Z ts=2024-02-14T09:08:20.555Z caller=main.go:1132 level=info msg="See you next time!"
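
If oc asks for a container name, or to filter the output directly, a variant of the previous command is (prometheus is the usual container name in these pods):

$ oc logs prometheus-k8s-0 -c prometheus -n openshift-monitoring --previous | grep -i sigterm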
  • Confirm that there are "Possible SYN flooding" messages in the kernel ring buffer by executing dmesg on the node that was hosting the affected pod before it was restarted:
[13933748.931833] TCP: request_sock_TCP: Possible SYN flooding on port 9091. Sending cookies.  Check SNMP counters.
[13986804.511156] TCP: request_sock_TCP: Possible SYN flooding on port 9090. Sending cookies.  Check SNMP counters.
[14186082.776970] Adding Red Hat flag eBPF/rawtrace.
[14264550.066111] TCP: request_sock_TCP: Possible SYN flooding on port 9090. Sending cookies.  Check SNMP counters.
[14410042.208557] TCP: request_sock_TCP: Possible SYN flooding on port 9090. Sending cookies.  Check SNMP counters.
[14473067.471381] TCP: request_sock_TCP: Possible SYN flooding on port 9090. Sending cookies.  Check SNMP counters.
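
On OpenShift nodes, the kernel ring buffer is typically read through a debug pod; a sketch with a placeholder node name is:

$ oc debug node/<node_name> -- chroot /host dmesg | grep -i "SYN flooding"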
  • Check the system logs of the node that was hosting the affected pod before it was restarted by executing journalctl, and confirm that the liveness probe of the pod failed:
Feb 14 08:36:27 <node_name> hyperkube[3811]: I0214 08:36:27.139202    3811 kubelet.go:2208] "SyncLoop (probe)" probe="liveness" status="unhealthy" pod="openshift-monitoring/prometheus-k8s-0"
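
The kubelet journal can also be retrieved remotely, for example (placeholder node name):

$ oc adm node-logs <node_name> -u kubelet | grep 'probe="liveness"'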

