The monitoring operator is stuck during an upgrade with the `context deadline exceeded` error in RHOCP 4

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4

Issue

  • The monitoring cluster operator is degraded with the following error:

    waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: context deadline exceeded
    
  • The alertmanager-main-x pods are in CrashLoopBackOff state:

    NAME                                           READY   STATUS             RESTARTS      AGE
    alertmanager-main-0                            4/5     CrashLoopBackOff   6 (66s ago)   7m22s
    alertmanager-main-1                            4/5     CrashLoopBackOff   6 (62s ago)   7m14s
    
  • The alertmanager container in the alertmanager-main-x pods reports the following error in its logs:

    ts=2025-02-17T06:53:26.255Z caller=coordinator.go:118 level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config_out/alertmanager.env.yaml err="bad matcher format: DeploymentConfigHasZeroReplicasDevNew"
    

Resolution

  • Extract the alertmanager.yaml configuration from the alertmanager-main secret in the openshift-monitoring namespace with the following command:

    $ oc -n openshift-monitoring get secret alertmanager-main --template='{{ index .data "alertmanager.yaml" }}' |base64 -d > alertmanager.yaml
    
  • Check the matchers section in the extracted alertmanager.yaml; in this case the syntax is incorrect:

    • Before:
    matchers:
        - severity = critical
        - DeploymentConfigHasZeroReplicasDevNew
    - receiver: warning
      matchers:
        - severity = warning
    

    The correct syntax for matchers follows the label = "value" format:

    • After:
    matchers:
        - severity = "critical"
        - alertname = "DeploymentConfigHasZeroReplicasDevNew"
    - receiver: warning
      matchers:
        - severity = "warning"
    
  • Edit the alertmanager-main Secret in openshift-monitoring namespace:

    $ oc edit secret alertmanager-main -n openshift-monitoring
    
  • Correct the matchers section as shown above.

  • Save and exit the editor.

  • Restart Alertmanager pods to apply changes:

    $ oc delete pod -l app=alertmanager -n openshift-monitoring
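
  • Alternatively, because oc edit secret displays the data base64-encoded, hand-editing the matchers in place is error-prone. As a sketch of the replace-based workflow (verify the exact command against your cluster and documentation), edit the extracted alertmanager.yaml locally and replace the secret:

```shell
# Sketch: re-create the alertmanager-main secret from the corrected
# alertmanager.yaml extracted earlier, then replace it in the cluster.
oc -n openshift-monitoring create secret generic alertmanager-main \
  --from-file=alertmanager.yaml --dry-run=client -o yaml \
  | oc -n openshift-monitoring replace -f -
```

    The Cluster Monitoring Operator then propagates the corrected configuration to the Alertmanager pods.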
    

Root Cause

  • The monitoring cluster operator was stuck in a degraded state because of invalid matcher syntax in the alertmanager-main secret.
  • The invalid configuration prevented Alertmanager from reloading it, which caused the Cluster Monitoring Operator to time out waiting for the Alertmanager object.
  • Correcting the matcher format and restarting the Alertmanager pods resolves the issue.
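  • For illustration only (a simplified stand-in for Alertmanager's real parser, not its actual code): a matcher needs a comparison operator (=, !=, =~, or !~), so a bare alert name fails the format check:

```shell
# Hypothetical helper, for illustration: a matcher string without one of
# the operators =, !=, =~, !~ is rejected as "bad matcher format".
check_matcher() {
  echo "$1" | grep -Eq '=~|!~|!=|=' && echo valid || echo invalid
}

check_matcher 'DeploymentConfigHasZeroReplicasDevNew'                # prints: invalid
check_matcher 'alertname = "DeploymentConfigHasZeroReplicasDevNew"'  # prints: valid
```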

Diagnostic Steps

  • Check the status of the monitoring cluster operator to verify the following error:

    waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: context deadline exceeded
    
  • Verify the status of the alertmanager-main pods, which will be in CrashLoopBackOff state:

    NAME                                           READY   STATUS             RESTARTS      AGE
    alertmanager-main-0                            4/5     CrashLoopBackOff   6 (66s ago)   7m22s
    alertmanager-main-1                            4/5     CrashLoopBackOff   6 (62s ago)   7m14s
    
  • Check the logs of the alertmanager container in the alertmanager-main-x pods and look for the following error:

    ts=2025-02-17T06:53:26.255Z caller=coordinator.go:118 level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config_out/alertmanager.env.yaml err="bad matcher format: DeploymentConfigHasZeroReplicasDevNew"
    
  • Extract the alertmanager-main secret and check whether the matchers section is in the correct format in the alertmanager.yaml file generated by the following command:

    $ oc -n openshift-monitoring get secret alertmanager-main --template='{{ index .data "alertmanager.yaml" }}' |base64 -d > alertmanager.yaml
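
  • The checks above map to standard oc commands; for example (the pod name below is illustrative and may differ in your cluster):

```shell
# Monitoring cluster operator status and conditions
oc get clusteroperator monitoring
oc describe clusteroperator monitoring

# Alertmanager pod status in the openshift-monitoring namespace
oc -n openshift-monitoring get pods | grep alertmanager-main

# Logs of the alertmanager container (pod name is an example)
oc -n openshift-monitoring logs alertmanager-main-0 -c alertmanager
```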
    

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
