Cluster operator monitoring unavailable and degraded with prometheus-k8s pods unable to start
Environment
Red Hat OpenShift Container Platform (OCP) 4.x
- observed on OCP 4.10.47
- Advanced Cluster Management (Hub cluster)
Issue
- The monitoring cluster operator is unavailable, progressing, and degraded, and reports the following error:
monitoring Cluster Operator error: Failed to rollout the stack. Error: updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 0 updated replicas
- The prometheus-k8s-0 and prometheus-k8s-1 pods in the openshift-monitoring namespace will not start (stuck in Pending or Init status) with the following error messages:
Warning FailedMount pod/prometheus-k8s-0 MountVolume.SetUp failed for volume "secret-hub-alertmanager-router-ca" : secret "hub-alertmanager-router-ca" not found
Warning FailedMount pod/prometheus-k8s-0 MountVolume.SetUp failed for volume "secret-observability-alertmanager-accessor" : secret "observability-alertmanager-accessor" not found
Warning FailedMount pod/prometheus-k8s-0 (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[secret-observability-alertmanager-accessor secret-hub-alertmanager-router-ca], unattached volumes=[configmap-metrics-client-ca prometheus-k8s-db secret-kube-rbac-proxy secret-prometheus-k8s-tls prometheus-k8s-rulefiles-0 metrics-client-ca secret-prometheus-k8s-htpasswd secret-prometheus-k8s-proxy config config-out tls-assets web-config secret-grpc-tls secret-metrics-client-certs secret-observability-alertmanager-accessor secret-prometheus-k8s-thanos-sidecar-tls secret-hub-alertmanager-router-ca configmap-serving-certs-ca-bundle kube-api-access-8zf49 configmap-kubelet-serving-ca-bundle prometheus-trusted-ca-bundle secret-kube-etcd-client-certs]: timed out waiting for the condition.
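- These mount failures can also be seen by describing the affected pods directly, for example:
$ oc describe pod prometheus-k8s-0 -n openshift-monitoring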
Resolution
- Make a backup of the cluster-monitoring-config configmap
$ oc get cm cluster-monitoring-config -o yaml -n openshift-monitoring > cluster-monitoring-config.yaml
- Then edit the cluster-monitoring-config configmap:
$ oc edit cm cluster-monitoring-config -n openshift-monitoring
and delete the following section (a scripted alternative is sketched after the YAML block):
additionalAlertManagerConfigs:
- apiVersion: v2
  bearerToken:
    key: token
    name: observability-alertmanager-accessor
  pathPrefix: /
  scheme: https
  staticConfigs:
  - alertmanager-open-cluster-management-observability.apps.ocp4.example.com
  tlsConfig:
    ServerName: ""
    ca:
      key: service-ca.crt
      name: hub-alertmanager-router-ca
    insecureSkipVerify: false
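- Alternatively, the edit can be scripted rather than done interactively. This is only a sketch: it assumes the yq utility (v4) is installed and that the additionalAlertManagerConfigs block sits under the prometheusK8s key of config.yaml, so verify the path against the backup taken above before running it.
$ oc get cm cluster-monitoring-config -n openshift-monitoring -o jsonpath='{.data.config\.yaml}' > config.yaml
$ yq -i 'del(.prometheusK8s.additionalAlertManagerConfigs)' config.yaml    # remove the leftover ACM alertmanager block
$ oc create configmap cluster-monitoring-config -n openshift-monitoring --from-file=config.yaml --dry-run=client -o yaml | oc replace -f -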
- Confirm that the edit was made as intended:
$ oc get cm cluster-monitoring-config -o yaml -n openshift-monitoring > cluster-monitoring-config_new.yaml
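- Comparing the backup with the new export should show only the deleted section (plus routine metadata changes such as resourceVersion):
$ diff cluster-monitoring-config.yaml cluster-monitoring-config_new.yaml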
- Force a restart of the two problematic pods.
$ oc delete pods prometheus-k8s-0 prometheus-k8s-1 -n openshift-monitoring
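- The pods are managed by the prometheus-k8s statefulset and will be recreated automatically; this can be watched with, for example:
$ oc get pods -n openshift-monitoring -w | grep prometheus-k8s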
- Confirm the result by checking
$ oc get co monitoring
The desired output should be similar to:
NAME         VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring   4.x.xx    True        False         False      xxxd
$ oc get pods -n openshift-monitoring | grep prometheus-k8s
The desired output should be similar to:
prometheus-k8s-0   6/6   Running   0   3h
prometheus-k8s-1   6/6   Running   0   3h
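- Optionally, wait for the monitoring cluster operator to report Available again (the timeout value here is arbitrary):
$ oc wait co/monitoring --for=condition=Available=True --timeout=10m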
Root Cause
- The Advanced Cluster Management observability operator (open-cluster-management-addon-observability) was previously installed and later removed, but the removal was incomplete and left references to its secrets behind in the cluster-monitoring-config configmap.
- The prometheus-k8s-0 and prometheus-k8s-1 pods attempt to mount these secrets, which no longer exist, so these two pods, which are essential to the monitoring cluster operator, cannot start.
- This leaves the monitoring cluster operator in an unavailable, progressing, and degraded state.
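- The leftover reference can be confirmed by searching the configmap for the two secret names, for example:
$ oc get cm cluster-monitoring-config -n openshift-monitoring -o yaml | grep -E 'hub-alertmanager-router-ca|observability-alertmanager-accessor'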
Diagnostic Steps
- Check the output of
$ oc get co monitoring
NAME         VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring   4.x.x     False       True          True       xxxd
- Search the YAML output of the monitoring cluster operator for Failed conditions as follows:
$ oc get co monitoring -o yaml | grep Failed -A 2
    message: 'Failed to rollout the stack. Error: updating prometheus-k8s: waiting
      for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s:
      expected 2 replicas, got 0 updated replicas'
    reason: UpdatingPrometheusK8SFailed
    status: "True"
    type: Degraded
    reason: UpdatingPrometheusK8SFailed
    status: "False"
    type: Available
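- The full set of operator conditions can also be summarised with a jsonpath query, for example:
$ oc get co monitoring -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'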
- Check the output of
$ oc get events -n openshift-monitoring
LAST SEEN TYPE REASON OBJECT MESSAGE
3d Warning FailedMount pod/prometheus-k8s-0 MountVolume.SetUp failed for volume "secret-hub-alertmanager-router-ca" : secret "hub-alertmanager-router-ca" not found
2d22h Warning FailedMount pod/prometheus-k8s-0 MountVolume.SetUp failed for volume "secret-observability-alertmanager-accessor" : secret "observability-alertmanager-accessor" not found
2d22h Warning FailedMount pod/prometheus-k8s-0 (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[secret-observability-alertmanager-accessor secret-hub-alertmanager-router-ca], unattached volumes=[configmap-metrics-client-ca prometheus-k8s-db secret-kube-rbac-proxy secret-prometheus-k8s-tls prometheus-k8s-rulefiles-0 metrics-client-ca secret-prometheus-k8s-htpasswd secret-prometheus-k8s-proxy config config-out tls-assets web-config secret-grpc-tls secret-metrics-client-certs secret-observability-alertmanager-accessor secret-prometheus-k8s-thanos-sidecar-tls secret-hub-alertmanager-router-ca configmap-serving-certs-ca-bundle kube-api-access-8zf49 configmap-kubelet-serving-ca-bundle prometheus-trusted-ca-bundle secret-kube-etcd-client-certs]: timed out waiting for the condition
2d22h Warning FailedMount pod/prometheus-k8s-1 MountVolume.SetUp failed for volume "secret-hub-alertmanager-router-ca" : secret "hub-alertmanager-router-ca" not found
2d22h Warning FailedMount pod/prometheus-k8s-1 MountVolume.SetUp failed for volume "secret-observability-alertmanager-accessor" : secret "observability-alertmanager-accessor" not found
2d22h Warning FailedMount pod/prometheus-k8s-1 (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[secret-observability-alertmanager-accessor secret-hub-alertmanager-router-ca], unattached volumes=[configmap-metrics-client-ca metrics-client-ca secret-prometheus-k8s-proxy secret-prometheus-k8s-thanos-sidecar-tls secret-observability-alertmanager-accessor secret-grpc-tls secret-metrics-client-certs secret-kube-rbac-proxy prometheus-k8s-rulefiles-0 kube-api-access-lffzr prometheus-k8s-db configmap-serving-certs-ca-bundle configmap-kubelet-serving-ca-bundle secret-prometheus-k8s-tls web-config config secret-prometheus-k8s-htpasswd secret-kube-etcd-client-certs secret-hub-alertmanager-router-ca prometheus-trusted-ca-bundle config-out tls-assets]: timed out waiting for the condition
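- Confirm that the secrets named in the FailedMount events really are absent from the namespace; both lookups should return a NotFound error:
$ oc get secret hub-alertmanager-router-ca observability-alertmanager-accessor -n openshift-monitoring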