Prometheus could not scrape fluentd for more than 10m alert in Alertmanager in OCP 4
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
- Red Hat OpenShift Service on AWS (ROSA) 4
- Red Hat OpenShift Dedicated (OSD) 4
- Red Hat OpenShift Logging (RHOL)
Issue
- Alertmanager is throwing the alert "FluentdNodeDown: Prometheus could not scrape fluentd for more than 10m"
Resolution
Note: It is important to confirm, by following the Diagnostic Steps, that the issue matches the one described in this document. If it does not match, consider the solutions Prometheus reporting some or all fluentd metrics endpoints as TargetDown and OCP Prometheus could not scrape fluentd for more than 10m instead.
There is a workaround for OCP, but it does not work in OSD and ROSA, as the label cannot be added to that namespace on those platforms. Red Hat is working on a fix for OSD and ROSA.
Workaround for OCP
Add the label openshift.io/cluster-monitoring: "true" to the openshift-logging namespace:
$ oc edit namespace openshift-logging
...
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-logging
  annotations:
    openshift.io/node-selector: ""
  labels:
    openshift.io/cluster-monitoring: "true" <-- needed to add this label
...
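As an alternative to editing the namespace interactively, the same label can be applied with a single command (a minimal sketch; add --overwrite if the label already exists with a different value):
$ oc label namespace openshift-logging openshift.io/cluster-monitoring="true"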
Note: If the issue persists even after making the changes, check whether any nodes have improperly configured taints.
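One way to review the taints on every node at a glance is a custom-columns query (suggested here as a convenience; oc describe node shows the same information per node):
$ oc get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints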
Root Cause
When the Cluster Logging stack is deployed in RHOCP 4.6 and older, and the openshift.io/cluster-monitoring: "true" label is not set on the openshift-logging namespace as indicated in the OCP documentation and the ROSA documentation, Alertmanager will throw the alert: "Prometheus could not scrape fluentd for more than 10m". The platform monitoring stack only scrapes ServiceMonitor targets in namespaces carrying that label, so without it the fluentd metrics endpoints are never discovered.
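A quick confirmation from the Prometheus UI is to query the up metric for the fluentd job (the exact job label is an assumption and may differ by Logging version, e.g. collector in newer releases); an empty result means the target was never discovered:
up{job="fluentd"}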
Diagnostic Steps
- Deploy the Cluster Logging stack in RHOCP 4.6 following the "Installing cluster logging" documentation, without selecting "Enable operator recommended cluster monitoring on this namespace" in the web console, or without adding the label openshift.io/cluster-monitoring: "true" to the openshift-logging namespace from the CLI.
- Check that all the fluentd pods are running:
# For old versions of Logging
$ oc -n openshift-logging get pods -l component=fluentd
# For new versions of Logging
$ oc -n openshift-logging get pods -l component=collector
- Check that the openshift-logging namespace does not have the label openshift.io/cluster-monitoring: "true":
$ oc get namespace openshift-logging -o yaml
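A shorter check that prints only the labels (shown as a convenience):
$ oc get namespace openshift-logging --show-labels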
- After 10 minutes, check the alerts generated by Alertmanager; one of them will be "Prometheus could not scrape fluentd for more than 10m".
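The firing alerts can also be listed from the CLI; this sketch assumes the amtool binary is available inside the alertmanager-main pod, which is the case in recent OCP 4 releases:
$ oc -n openshift-monitoring exec alertmanager-main-0 -c alertmanager -- amtool alert --alertmanager.url=http://localhost:9093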