Fluentd collector pod takes a long time to complete start-up and make the metrics endpoint ready, causing the CollectorNodeDown alert to fire
Issue
- Since OpenShift Container Platform 4 - Cluster Logging 5.8, we are observing lots of CollectorNodeDown alerts because of the below error: "Get \"https://x.x.x.x:24231/metrics\": dial tcp x.x.x.x:24231: connect: connection refused"
- There is an increased number of CollectorNodeDown alerts firing when collector pods are restarting or new Nodes are being added to OpenShift. When checking, we found that the metrics endpoint is taking a long time to become available, causing the target to be reported as DOWN.
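To quantify how long the metrics port stays closed after a collector pod starts, a small probe can help. The following is a minimal sketch, not part of the product: the function name wait_for_port and its timeout and interval parameters are hypothetical, and it only checks that the TCP port accepts connections (the "connection refused" symptom above), not that /metrics returns a valid response.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=1.0):
    """Return the seconds elapsed until host:port accepts a TCP connection.

    Returns None if the port is still not accepting connections when
    `timeout` seconds have passed. Hypothetical helper for illustration.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            # A closed port raises ConnectionRefusedError, a subclass of OSError.
            with socket.create_connection((host, port), timeout=interval):
                return time.monotonic() - start
        except OSError:
            # Port not open yet (or unreachable); wait and try again.
            time.sleep(interval)
    return None
```

For example, calling wait_for_port("x.x.x.x", 24231) with the collector pod IP in place of x.x.x.x, from a host that can reach the pod network, returns the delay in seconds, which can then be compared against how long the alert waits before firing.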
Environment
- Red Hat OpenShift Container Platform 4
- Red Hat OpenShift Logging 5.8