The monitoring operator is stuck during an upgrade with the `context deadline exceeded` error in RHOCP 4
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
Issue
- The monitoring cluster operator is degraded with the below error:

      waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: context deadline exceeded

- The `alertmanager-main-x` pods are in `CrashLoopBackOff` state as below:

      NAME                  READY   STATUS             RESTARTS      AGE
      alertmanager-main-0   4/5     CrashLoopBackOff   6 (66s ago)   7m22s
      alertmanager-main-1   4/5     CrashLoopBackOff   6 (62s ago)   7m14s

- The `alertmanager-main-x` pod container `alertmanager` reports the below error log (one way to confirm which container is failing is shown after this list):

      ts=2025-02-17T06:53:26.255Z caller=coordinator.go:118 level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config_out/alertmanager.env.yaml err="bad matcher format: DeploymentConfigHasZeroReplicasDevNew"
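A minimal sketch of how these symptoms can be narrowed down to the failing container; the pod name `alertmanager-main-0` is taken from the example output above and may differ on your cluster:

    # List the Alertmanager pods in the monitoring namespace.
    $ oc -n openshift-monitoring get pods | grep alertmanager-main

    # Inspect the pod to see which container is restarting (here: alertmanager).
    $ oc -n openshift-monitoring describe pod alertmanager-main-0

    # Tail that container's log to find the "bad matcher format" error shown above.
    $ oc -n openshift-monitoring logs alertmanager-main-0 -c alertmanager --tail=20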
Resolution
- Extract the `alertmanager-main` secret from the `openshift-monitoring` namespace into `alertmanager.yaml` with the below command:

      $ oc -n openshift-monitoring get secret alertmanager-main --template='{{ index .data "alertmanager.yaml" }}' | base64 -d > alertmanager.yaml

- Check the `matchers` section in the `alertmanager-main` secret; it is incorrect. The entry `DeploymentConfigHasZeroReplicasDevNew` has no label name, which is what triggers the `bad matcher format` error.

  Before:

        matchers:
        - severity = critical
        - DeploymentConfigHasZeroReplicasDevNew
      - receiver: warning
        matchers:
        - severity = warning

  The correct syntax for matchers follows the `key=value` format.

  After:

        matchers:
        - severity = "critical"
        - alertname = "DeploymentConfigHasZeroReplicasDevNew"
      - receiver: warning
        matchers:
        - severity = "warning"

- Edit the `alertmanager-main` secret in the `openshift-monitoring` namespace:

      $ oc edit secret alertmanager-main -n openshift-monitoring

- Correct the `matchers` section as shown above.
- Save and exit the editor.
- Restart the `Alertmanager` pods to apply the changes (an alternative, non-interactive way to validate and apply the corrected file is sketched after this list):

      $ oc delete pod -l app=alertmanager -n openshift-monitoring
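As an alternative to editing the base64-encoded data with `oc edit`, the corrected `alertmanager.yaml` from the first step can be loaded back into the secret directly. This is a sketch under the assumption that the file has already been fixed locally; the optional `amtool` validation assumes the Alertmanager `amtool` binary is available on the workstation:

    # Optional: validate the corrected configuration before applying it.
    $ amtool check-config alertmanager.yaml

    # Recreate the secret from the corrected file and replace the existing one.
    $ oc -n openshift-monitoring create secret generic alertmanager-main \
        --from-file=alertmanager.yaml=alertmanager.yaml \
        --dry-run=client -o yaml | oc -n openshift-monitoring replace -f -

    # Restart the Alertmanager pods as above so the new configuration is picked up.
    $ oc -n openshift-monitoring delete pod -l app=alertmanager

The `--dry-run=client -o yaml | oc replace` pattern avoids hand-editing base64-encoded secret data.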
Root Cause
- The Monitoring cluster operator was stuck in a degraded state because of invalid matcher syntax in the `alertmanager-main` secret.
- The invalid configuration prevented `Alertmanager` from reloading its configuration, which caused timeouts in the Cluster Monitoring Operator.
- Correcting the matcher format and restarting `Alertmanager` resolved the issue.
Diagnostic Steps
- Check the status of the monitoring cluster operator to verify the below error:

      waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: context deadline exceeded

- Verify the status of the `alertmanager-main-x` pods, which will be in `CrashLoopBackOff`:

      NAME                  READY   STATUS             RESTARTS      AGE
      alertmanager-main-0   4/5     CrashLoopBackOff   6 (66s ago)   7m22s
      alertmanager-main-1   4/5     CrashLoopBackOff   6 (62s ago)   7m14s

- Check the `alertmanager-main-x` pod container `alertmanager` logs and look for the below error:

      ts=2025-02-17T06:53:26.255Z caller=coordinator.go:118 level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config_out/alertmanager.env.yaml err="bad matcher format: DeploymentConfigHasZeroReplicasDevNew"

- Extract the `alertmanager-main` secret and check whether the `matchers` section is in the correct format in the `alertmanager.yaml` file generated by the below command (example commands for these checks are sketched after this list):

      $ oc -n openshift-monitoring get secret alertmanager-main --template='{{ index .data "alertmanager.yaml" }}' | base64 -d > alertmanager.yaml
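A minimal sketch of commands for the operator-status and configuration checks above (pod and log commands are shown after the Issue section); the `jsonpath` expression and `grep` pattern are illustrative:

    # Check the monitoring cluster operator and its Degraded condition message.
    $ oc get clusteroperator monitoring
    $ oc get clusteroperator monitoring \
        -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}{"\n"}'

    # Inspect the matchers in the extracted alertmanager.yaml for malformed entries.
    $ grep -n -A 3 "matchers:" alertmanager.yaml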