"Prometheus could not scrape fluentd for more than 10m" alert in Alertmanager in OCP 4

Solution Verified

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • Red Hat OpenShift Service on AWS (ROSA)
    • 4
  • Red Hat OpenShift Dedicated (OSD)
    • 4
  • Red Hat OpenShift Logging (RHOL)

Issue

  • Alertmanager is throwing the alert "FluentdNodeDown: Prometheus could not scrape fluentd for more than 10m"

Resolution

Note: Before applying this workaround, follow the Diagnostic Steps to confirm that the issue matches the one described in this document. If it does not match, consider instead the solutions "Prometheus reporting some or all fluentd metrics endpoints as TargetDown" and "OCP Prometheus could not scrape fluentd for more than 10m".

A workaround exists for OCP, but it does not work in OSD and ROSA because the label cannot be added to the openshift-logging namespace on those platforms. Red Hat is working on a fix for OSD and ROSA.

Workaround for OCP

Add the label openshift.io/cluster-monitoring: "true" to the openshift-logging namespace:

$ oc edit namespace openshift-logging
...
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-logging
  annotations:
    openshift.io/node-selector: ""
  labels:
    openshift.io/cluster-monitoring: "true"  # <-- label that needs to be added
...
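
The same label can also be applied non-interactively; a minimal sketch, assuming the default openshift-logging namespace name:

$ oc label namespace openshift-logging openshift.io/cluster-monitoring="true" --overwrite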

Note: If the issue persists after making the change, check whether there are any improperly configured taints on the nodes.
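
A couple of commands that can help review node taints (a sketch; the <node-name> placeholder is illustrative):

$ oc get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
$ oc describe node <node-name> | grep -i taint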

Root Cause

When the Cluster Logging stack is deployed in RHOCP 4.6 and older, if the openshift.io/cluster-monitoring: "true" label is not set on the openshift-logging namespace as indicated in the OCP and ROSA documentation, Alertmanager will fire the alert "Prometheus could not scrape fluentd for more than 10m".
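
A quick way to check whether the label is already present on the namespace (a sketch, assuming the default namespace name):

$ oc get namespace openshift-logging --show-labels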

Diagnostic Steps

  1. Deploy the Cluster Logging stack in RHOCP 4.6 following the "Installing cluster logging" documentation without selecting "Enable operator recommended cluster monitoring on this namespace" in the web console, or without adding the label openshift.io/cluster-monitoring: "true" to the openshift-logging namespace from the CLI.
  2. Check that all the fluentd pods are running

    # For old versions of Logging
    $ oc -n openshift-logging get pods -l component=fluentd
    
    # For new versions of Logging
    $ oc -n openshift-logging get pods -l component=collector
    
  3. Check that the openshift-logging namespace does not have the label openshift.io/cluster-monitoring: "true"

    $ oc get namespace openshift-logging -o yaml 
    
  4. After 10 minutes, check the alerts generated by Alertmanager; one of them will be "Prometheus could not scrape fluentd for more than 10m" (a possible way to query the active alerts from the CLI is sketched after this list).
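
One possible way to query the active alerts from the CLI is sketched below; it assumes the default alertmanager-main-0 pod name in the openshift-monitoring namespace:

    $ oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- \
        amtool alert query "alertname=FluentdNodeDown" --alertmanager.url=http://localhost:9093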

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
