Cluster operator monitoring unavailable and degraded with prometheus-k8s pods unable to start

Environment

Red Hat OpenShift Container Platform (OCP) 4.x

  • observed on OCP 4.10.47
  • Advanced Cluster Management (Hub cluster)

Issue

  • The monitoring cluster operator is unavailable, progressing, and degraded, and the prometheus-k8s-0 / prometheus-k8s-1 pods in the openshift-monitoring namespace will not start. The cluster operator reports the following error:
monitoring Cluster Operator error:   Failed to rollout the stack. Error: updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 0 updated replicas
  • The prometheus-k8s-0 and prometheus-k8s-1 pods in the openshift-monitoring namespace will not start (Pending or Init status) and report the following error messages:
Warning   FailedMount   pod/prometheus-k8s-0   MountVolume.SetUp failed for volume "secret-hub-alertmanager-router-ca" : secret "hub-alertmanager-router-ca" not found
Warning   FailedMount   pod/prometheus-k8s-0   MountVolume.SetUp failed for volume "secret-observability-alertmanager-accessor" : secret "observability-alertmanager-accessor" not found
Warning   FailedMount   pod/prometheus-k8s-0   (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[secret-observability-alertmanager-accessor secret-hub-alertmanager-router-ca], unattached volumes=[configmap-metrics-client-ca prometheus-k8s-db secret-kube-rbac-proxy secret-prometheus-k8s-tls prometheus-k8s-rulefiles-0 metrics-client-ca secret-prometheus-k8s-htpasswd secret-prometheus-k8s-proxy config config-out tls-assets web-config secret-grpc-tls secret-metrics-client-certs secret-observability-alertmanager-accessor secret-prometheus-k8s-thanos-sidecar-tls secret-hub-alertmanager-router-ca configmap-serving-certs-ca-bundle kube-api-access-8zf49 configmap-kubelet-serving-ca-bundle prometheus-trusted-ca-bundle secret-kube-etcd-client-certs]: timed out waiting for the condition.
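
Listing the Prometheus pods directly shows the stuck state. The output below is illustrative only; the exact status column (Pending, Init, or ContainerCreating) and the age will vary:

$ oc get pods -n openshift-monitoring | grep prometheus-k8s
prometheus-k8s-0                               0/6     Init:0/1   0          3d
prometheus-k8s-1                               0/6     Init:0/1   0          3d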

Resolution

  1. Make a backup of the cluster-monitoring-config configmap:
   $ oc get cm cluster-monitoring-config -o yaml -n openshift-monitoring > cluster-monitoring-config.yaml
  2. Make a targeted edit of the cluster-monitoring-config configmap:
   $ oc edit cm cluster-monitoring-config -n openshift-monitoring

and delete the following section (a sketch of the resulting configmap is shown after the numbered steps):

      additionalAlertManagerConfigs:
      - apiVersion: v2
        bearerToken:
          key: token
          name: observability-alertmanager-accessor
        pathPrefix: /
        scheme: https
        staticConfigs:
        - alertmanager-open-cluster-management-observability.apps.ocp4.example.com
        tlsConfig:
          ServerName: ""
          ca:
            key: service-ca.crt
            name: hub-alertmanager-router-ca
          insecureSkipVerify: false
  3. Confirm that the edits were applied as intended:
     $ oc get cm cluster-monitoring-config -o yaml -n openshift-monitoring > cluster-monitoring-config_new.yaml
  4. Force a restart of the two affected pods:
     $ oc delete pods prometheus-k8s-0 prometheus-k8s-1 -n openshift-monitoring
  5. Confirm the result:
    $ oc get co monitoring

    The desired output should be:

            NAME         VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
            monitoring   4.x.xx   True        False         False       xxxd

     $ oc get pods -n openshift-monitoring | grep prometheus-k8s

    The desired output should be:

     prometheus-k8s-0                               6/6     Running   0          3h
     prometheus-k8s-1                               6/6     Running   0          3h
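
For reference, a minimal sketch of what the cluster-monitoring-config configmap could look like after the edit in step 2, assuming the addon's additionalAlertManagerConfigs entry was the only customization under prometheusK8s (any other settings already present in your configmap should be left in place):

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: cluster-monitoring-config
        namespace: openshift-monitoring
      data:
        config.yaml: |
          # only the additionalAlertManagerConfigs block from step 2 is removed;
          # keep any other existing prometheusK8s settings unchanged
          prometheusK8s: {}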

Root Cause

  • The Red Hat Advanced Cluster Management observability addon (open-cluster-management-addon-observability) was previously installed and later removed, but the removal was incomplete and left this configuration behind in the cluster-monitoring-config configmap.
  • The prometheus-k8s-0 and prometheus-k8s-1 pods still try to mount the referenced secrets, which no longer exist, so these two pods, on which the monitoring cluster operator depends, cannot start.
  • This leaves the monitoring cluster operator in an unavailable, progressing, and degraded state.
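
To confirm this root cause on the affected cluster, check that the two referenced secrets really are missing from the openshift-monitoring namespace; with this root cause, both lookups should return NotFound:

     $ oc get secret hub-alertmanager-router-ca observability-alertmanager-accessor -n openshift-monitoring
     Error from server (NotFound): secrets "hub-alertmanager-router-ca" not found
     Error from server (NotFound): secrets "observability-alertmanager-accessor" not found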

Diagnostic Steps

  1. Check the output of $ oc get co monitoring
NAME         VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring   4.x.x      False       True          True       xxxd
  2. Search the YAML status of the monitoring cluster operator for Failed conditions:
     $  oc get co monitoring -o yaml | grep Failed -A 2
    message: 'Failed to rollout the stack. Error: updating prometheus-k8s: waiting
      for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s:
      expected 2 replicas, got 0 updated replicas'
    reason: UpdatingPrometheusK8SFailed
    status: "True"
    type: Degraded

    reason: UpdatingPrometheusK8SFailed
    status: "False"
    type: Available
  3. Check the output of $ oc get events -n openshift-monitoring (an alternative per-pod check with oc describe is sketched after this list)
LAST SEEN   TYPE      REASON        OBJECT                 MESSAGE
3d          Warning   FailedMount   pod/prometheus-k8s-0   MountVolume.SetUp failed for volume "secret-hub-alertmanager-router-ca" : secret "hub-alertmanager-router-ca" not found
2d22h       Warning   FailedMount   pod/prometheus-k8s-0   MountVolume.SetUp failed for volume "secret-observability-alertmanager-accessor" : secret "observability-alertmanager-accessor" not found
2d22h       Warning   FailedMount   pod/prometheus-k8s-0   (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[secret-observability-alertmanager-accessor secret-hub-alertmanager-router-ca], unattached volumes=[configmap-metrics-client-ca prometheus-k8s-db secret-kube-rbac-proxy secret-prometheus-k8s-tls prometheus-k8s-rulefiles-0 metrics-client-ca secret-prometheus-k8s-htpasswd secret-prometheus-k8s-proxy config config-out tls-assets web-config secret-grpc-tls secret-metrics-client-certs secret-observability-alertmanager-accessor secret-prometheus-k8s-thanos-sidecar-tls secret-hub-alertmanager-router-ca configmap-serving-certs-ca-bundle kube-api-access-8zf49 configmap-kubelet-serving-ca-bundle prometheus-trusted-ca-bundle secret-kube-etcd-client-certs]: timed out waiting for the condition
2d22h       Warning   FailedMount   pod/prometheus-k8s-1   MountVolume.SetUp failed for volume "secret-hub-alertmanager-router-ca" : secret "hub-alertmanager-router-ca" not found
2d22h       Warning   FailedMount   pod/prometheus-k8s-1   MountVolume.SetUp failed for volume "secret-observability-alertmanager-accessor" : secret "observability-alertmanager-accessor" not found
2d22h       Warning   FailedMount   pod/prometheus-k8s-1   (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[secret-observability-alertmanager-accessor secret-hub-alertmanager-router-ca], unattached volumes=[configmap-metrics-client-ca metrics-client-ca secret-prometheus-k8s-proxy secret-prometheus-k8s-thanos-sidecar-tls secret-observability-alertmanager-accessor secret-grpc-tls secret-metrics-client-certs secret-kube-rbac-proxy prometheus-k8s-rulefiles-0 kube-api-access-lffzr prometheus-k8s-db configmap-serving-certs-ca-bundle configmap-kubelet-serving-ca-bundle secret-prometheus-k8s-tls web-config config secret-prometheus-k8s-htpasswd secret-kube-etcd-client-certs secret-hub-alertmanager-router-ca prometheus-trusted-ca-bundle config-out tls-assets]: timed out waiting for the condition
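
As an alternative to scanning namespace events, the same FailedMount warnings can be read from the affected pod directly; the grep filter below is just one convenient way to narrow the Events section of the output:

     $ oc describe pod prometheus-k8s-0 -n openshift-monitoring | grep FailedMount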

