PrometheusRuleFailures: FluentdQueueLengthIncreasing rule fails to be evaluated
Issue
- Alertmanager is throwing an alert named PrometheusRuleFailures.
- The following log is observed repeatedly in Prometheus:
2022-04-27T13:07:30.544866833Z level=warn ts=2022-04-27T13:07:30.544Z caller=manager.go:603 component="rule manager" group=logging_fluentd.alerts msg="Evaluating rule failed" rule="alert: FluentdQueueLengthIncreasing\nexpr: (0 * (kube_pod_start_time{pod=~\".*fluentd.*\"} < time() - 3600)) + on(pod) label_replace((deriv(fluentd_output_status_buffer_queue_length[10m])\n > 0 and delta(fluentd_output_status_buffer_queue_length[1h]) > 1), \"pod\", \"$1\",\n \"hostname\", \"(.*)\")\nfor: 1h\nlabels:\n service: fluentd\n severity: error\nannotations:\n message: For the last hour, fluentd {{ $labels.instance }} average buffer queue\n length has increased continuously.\n summary: Fluentd unable to keep up with traffic over time.\n" err="found duplicate series for the match group {pod=\"collector-xxxxx\"} on the right hand-side of the operation: [{container=\"collector\", endpoint=\"metrics\", hostname=\"collector-xxxx\", instance=\"<ipaddress>:24231\", job=\"collector\", namespace=\"openshift-logging\", plugin_id=\"fmi_graylog\", pod=\"collector-xxxxx\", service=\"collector\", type=\"elasticsearch\"}, {container=\"collector\", endpoint=\"metrics\", hostname=\"collector-zntwr\", instance=\"<ipaddress>:24231\", job=\"collector\", namespace=\"openshift-logging\", plugin_id=\"default\", pod=\"collector-zntwr\", service=\"collector\", type=\"elasticsearch\"}];many-to-many matching not allowed: matching labels must be unique on one side"
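The error comes from the `on(pod)` vector match in the rule expression: `fluentd_output_status_buffer_queue_length` is exported once per output plugin, so a collector pod shipping to more than one output (here `plugin_id="default"` and `plugin_id="fmi_graylog"`) produces several series carrying the same `pod` label after the `label_replace`, which PromQL rejects as a many-to-many match. As a diagnostic sketch (not the supported fix), the duplicate series can be listed from the Prometheus query UI using the metric and labels visible in the log above:

# Any result here means a collector pod exposes the buffer-queue metric
# more than once (one series per plugin_id), which is exactly what breaks
# the one-to-one on(pod) match in the FluentdQueueLengthIncreasing rule.
count by (hostname) (fluentd_output_status_buffer_queue_length) > 1

For illustration only: aggregating the right-hand side per pod, e.g. with `max by (hostname) (deriv(fluentd_output_status_buffer_queue_length[10m]))` before the `label_replace`, would collapse the per-plugin series to one per pod and make the match one-to-one; this is a general PromQL workaround, not necessarily the rule change shipped by Red Hat OpenShift Logging.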
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4.x
- Red Hat OpenShift Logging 5.x