Cluster operator monitoring unavailable and degraded with prometheus-k8s pods unable to start
Environment
Red Hat OpenShift Container Platform (OCP) 4.x
- observed on OCP 4.10.47
- Advanced Cluster Management (Hub cluster)
Issue
- The monitoring cluster operator is unavailable, progressing, and degraded, and reports the following error:
monitoring Cluster Operator error: Failed to rollout the stack. Error: updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 0 updated replicas
- The prometheus-k8s-0 and prometheus-k8s-1 pods in the openshift-monitoring namespace will not start (stuck in Pending or Init status) with the following error messages:
Warning FailedMount pod/prometheus-k8s-0 MountVolume.SetUp failed for volume "secret-hub-alertmanager-router-ca" : secret "hub-alertmanager-router-ca" not found
Warning FailedMount pod/prometheus-k8s-0 MountVolume.SetUp failed for volume "secret-observability-alertmanager-accessor" : secret "observability-alertmanager-accessor" not found
Warning FailedMount pod/prometheus-k8s-0 (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[secret-observability-alertmanager-accessor secret-hub-alertmanager-router-ca], unattached volumes=[configmap-metrics-client-ca prometheus-k8s-db secret-kube-rbac-proxy secret-prometheus-k8s-tls prometheus-k8s-rulefiles-0 metrics-client-ca secret-prometheus-k8s-htpasswd secret-prometheus-k8s-proxy config config-out tls-assets web-config secret-grpc-tls secret-metrics-client-certs secret-observability-alertmanager-accessor secret-prometheus-k8s-thanos-sidecar-tls secret-hub-alertmanager-router-ca configmap-serving-certs-ca-bundle kube-api-access-8zf49 configmap-kubelet-serving-ca-bundle prometheus-trusted-ca-bundle secret-kube-etcd-client-certs]: timed out waiting for the condition.
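- These mount failures can also be seen by describing the affected pods directly, for example:
$ oc describe pod prometheus-k8s-0 -n openshift-monitoring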
Resolution
- Make a backup of the cluster-monitoring-config configmap
$ oc get cm cluster-monitoring-config -o yaml -n openshift-monitoring > cluster-monitoring-config.yaml
- Then edit the cluster-monitoring-config configmap:
$ oc edit cm cluster-monitoring-config -n openshift-monitoring
and delete the following section (a scripted alternative is sketched after the YAML block):
additionalAlertManagerConfigs:
- apiVersion: v2
  bearerToken:
    key: token
    name: observability-alertmanager-accessor
  pathPrefix: /
  scheme: https
  staticConfigs:
  - alertmanager-open-cluster-management-observability.apps.ocp4.example.com
  tlsConfig:
    ServerName: ""
    ca:
      key: service-ca.crt
      name: hub-alertmanager-router-ca
    insecureSkipVerify: false
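- Alternatively, the edit can be scripted rather than done interactively. This is only a sketch: it assumes the yq utility (v4) is installed and that the additionalAlertManagerConfigs block sits under the prometheusK8s key of config.yaml, so verify the path against the backup taken above before running it.
$ oc get cm cluster-monitoring-config -n openshift-monitoring -o jsonpath='{.data.config\.yaml}' > config.yaml
$ yq -i 'del(.prometheusK8s.additionalAlertManagerConfigs)' config.yaml    # remove the leftover ACM alertmanager block
$ oc create configmap cluster-monitoring-config -n openshift-monitoring --from-file=config.yaml --dry-run=client -o yaml | oc replace -f -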
- Confirm that the edit was made as intended:
$ oc get cm cluster-monitoring-config -o yaml -n openshift-monitoring > cluster-monitoring-config_new.yaml
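- Comparing the backup with the new export should show only the deleted section (plus routine metadata changes such as resourceVersion):
$ diff cluster-monitoring-config.yaml cluster-monitoring-config_new.yaml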
- Force a restart of the two problematic pods.
$ oc delete pods prometheus-k8s-0 prometheus-k8s-1 -n openshift-monitoring
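- The pods are managed by the prometheus-k8s statefulset and will be recreated automatically; this can be watched with, for example:
$ oc get pods -n openshift-monitoring -w | grep prometheus-k8s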
- Confirm the result by checking
$ oc get co monitoring
The desired output should be similar to:
NAME         VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring   4.x.xx    True        False         False      xxxd
$ oc get pods -n openshift-monitoring | grep prometheus-k8s
The desired output should be similar to:
prometheus-k8s-0   6/6   Running   0   3h
prometheus-k8s-1   6/6   Running   0   3h
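- Optionally, wait for the monitoring cluster operator to report Available again (the timeout value here is arbitrary):
$ oc wait co/monitoring --for=condition=Available=True --timeout=10m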
Root Cause
- The Advanced Cluster Management observability operator (open-cluster-management-addon-observability) was previously installed and later removed, but the removal was incomplete and left references to its secrets behind in the cluster-monitoring-config configmap.
- The prometheus-k8s-0 and prometheus-k8s-1 pods attempt to mount these secrets, which no longer exist, so these two pods, which are essential to the monitoring cluster operator, cannot start.
- This leaves the monitoring cluster operator in an unavailable, progressing, and degraded state.
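- The leftover reference can be confirmed by searching the configmap for the two secret names, for example:
$ oc get cm cluster-monitoring-config -n openshift-monitoring -o yaml | grep -E 'hub-alertmanager-router-ca|observability-alertmanager-accessor'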
Diagnostic Steps
- Check the output of
$ oc get co monitoring
NAME         VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring   4.x.x     False       True          True       xxxd
- Search the YAML output of the monitoring cluster operator for Failed conditions as follows:
$ oc get co monitoring -o yaml | grep Failed -A 2
    message: 'Failed to rollout the stack. Error: updating prometheus-k8s: waiting
      for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s:
      expected 2 replicas, got 0 updated replicas'
    reason: UpdatingPrometheusK8SFailed
    status: "True"
    type: Degraded
    reason: UpdatingPrometheusK8SFailed
    status: "False"
    type: Available
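- The full set of operator conditions can also be summarised with a jsonpath query, for example:
$ oc get co monitoring -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'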
- Check the output of
$ oc get events -n openshift-monitoring
LAST SEEN TYPE REASON OBJECT MESSAGE
3d Warning FailedMount pod/prometheus-k8s-0 MountVolume.SetUp failed for volume "secret-hub-alertmanager-router-ca" : secret "hub-alertmanager-router-ca" not found
2d22h Warning FailedMount pod/prometheus-k8s-0 MountVolume.SetUp failed for volume "secret-observability-alertmanager-accessor" : secret "observability-alertmanager-accessor" not found
2d22h Warning FailedMount pod/prometheus-k8s-0 (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[secret-observability-alertmanager-accessor secret-hub-alertmanager-router-ca], unattached volumes=[configmap-metrics-client-ca prometheus-k8s-db secret-kube-rbac-proxy secret-prometheus-k8s-tls prometheus-k8s-rulefiles-0 metrics-client-ca secret-prometheus-k8s-htpasswd secret-prometheus-k8s-proxy config config-out tls-assets web-config secret-grpc-tls secret-metrics-client-certs secret-observability-alertmanager-accessor secret-prometheus-k8s-thanos-sidecar-tls secret-hub-alertmanager-router-ca configmap-serving-certs-ca-bundle kube-api-access-8zf49 configmap-kubelet-serving-ca-bundle prometheus-trusted-ca-bundle secret-kube-etcd-client-certs]: timed out waiting for the condition
2d22h Warning FailedMount pod/prometheus-k8s-1 MountVolume.SetUp failed for volume "secret-hub-alertmanager-router-ca" : secret "hub-alertmanager-router-ca" not found
2d22h Warning FailedMount pod/prometheus-k8s-1 MountVolume.SetUp failed for volume "secret-observability-alertmanager-accessor" : secret "observability-alertmanager-accessor" not found
2d22h Warning FailedMount pod/prometheus-k8s-1 (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[secret-observability-alertmanager-accessor secret-hub-alertmanager-router-ca], unattached volumes=[configmap-metrics-client-ca metrics-client-ca secret-prometheus-k8s-proxy secret-prometheus-k8s-thanos-sidecar-tls secret-observability-alertmanager-accessor secret-grpc-tls secret-metrics-client-certs secret-kube-rbac-proxy prometheus-k8s-rulefiles-0 kube-api-access-lffzr prometheus-k8s-db configmap-serving-certs-ca-bundle configmap-kubelet-serving-ca-bundle secret-prometheus-k8s-tls web-config config secret-prometheus-k8s-htpasswd secret-kube-etcd-client-certs secret-hub-alertmanager-router-ca prometheus-trusted-ca-bundle config-out tls-assets]: timed out waiting for the condition
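- Confirm that the secrets named in the FailedMount events really are absent from the namespace; both lookups should return a NotFound error:
$ oc get secret hub-alertmanager-router-ca observability-alertmanager-accessor -n openshift-monitoring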