Cluster upgrade is not possible because the 'alertmanager-main' and 'prometheus-k8s' pods are scheduled on the same node

Solution Verified - Updated -

Environment

  • Red Hat OpenShift Service on AWS [ROSA]
    • 4.x

Issue

  • When attempting to upgrade the cluster, the Red Hat Hybrid Cloud Console shows events reporting that the upgrade maintenance was scheduled and that the upgrade started at the scheduled time. The upgrade is then delayed and eventually fails with the following message:
Cluster upgrade maintenance to version X.X.X on <date> has been cancelled due to unacknowledged user actions.

Resolution

  • Reschedule the alertmanager-main and prometheus-k8s pods by deleting them, and make sure that they are then scheduled on different nodes.

    1. Delete the pods from the openshift-monitoring namespace. Note: It is not necessary to delete all of them; deleting one pod from each StatefulSet is enough:

      $ oc delete pods \
      alertmanager-main-1 \
      prometheus-k8s-1 -n openshift-monitoring
      
    2. Validate that the pods are scheduled on different nodes:

      $ oc get pods -n openshift-monitoring -o wide | egrep "alertmanager-main|prometheus-k8s"
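
The wide output above can be hard to scan on a narrow terminal. As a minimal sketch, assuming an `oc` session that is already logged in to the cluster, the hypothetical helper below (the name `pods_by_node` is not part of any Red Hat tooling) prints only the pod name and node name, grouped by node, so that co-located pods appear next to each other:

```shell
# Hypothetical helper: list the monitoring pods as "<pod> <node>", sorted by node.
# Assumes `oc` is on PATH and logged in to the cluster.
pods_by_node() {
  oc get pods -n openshift-monitoring --no-headers \
    -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName |
    # Keep only the alertmanager-main-N and prometheus-k8s-N pods and
    # collapse the column padding to a single space.
    awk '$1 ~ /^(alertmanager-main|prometheus-k8s)-[0-9]+$/ { print $1, $2 }' |
    # Sort by node name first, then by pod name.
    sort -k2,2 -k1,1
}
```

Calling `pods_by_node` after the deleted pods are running again should show at least two different node names for each of the two pod groups. If it does not, the deletion step can be repeated.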
      

Root Cause

  • The upgrade might be blocked because all the alertmanager-main and prometheus-k8s pods are scheduled on the same node, so that node cannot be upgraded without causing an outage of the monitoring stack:
$ oc get pods -n openshift-monitoring -o wide | egrep "alertmanager-main|prometheus-k8s"
NAME                                               READY   STATUS      RESTARTS   AGE     IP             NODE                                              NOMINATED NODE   READINESS GATES
alertmanager-main-0                                5/5     Running     0          3d12h   10.x.x.31    ip-10-x-x-85.ap-northeast-1.compute.internal    <none>           <none>
alertmanager-main-1                                5/5     Running     0          3d12h   10.x.x.35    ip-10-x-x-85.ap-northeast-1.compute.internal    <none>           <none>
alertmanager-main-2                                5/5     Running     0          3d12h   10.x.x.34    ip-10-x-x-85.ap-northeast-1.compute.internal    <none>           <none>
prometheus-k8s-0                                   7/7     Running     0          3d12h   10.x.x.32    ip-10-x-x-85.ap-northeast-1.compute.internal    <none>           <none>
prometheus-k8s-1                                   7/7     Running     0          3d12h   10.x.x.33    ip-10-x-x-85.ap-northeast-1.compute.internal    <none>           <none>

Diagnostic Steps

  • Confirm that the Red Hat Hybrid Cloud Console shows the following message:
Cluster upgrade maintenance to version X.X.X on <date> has been cancelled due to unacknowledged user actions.
  • Confirm that the alertmanager-main and prometheus-k8s pods are scheduled on the same node:
$ oc get pods -n openshift-monitoring -o wide | egrep "alertmanager-main|prometheus-k8s"
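
Reading the NODE column by eye works for a handful of pods, but the check can also be summarized. The sketch below is illustrative only: `summarize_spread` is a hypothetical function name, and it assumes that NODE is the seventh whitespace-separated column, as in the output shown in this article (this would not hold if the RESTARTS column contains extra text such as a restart timestamp).

```shell
# Hypothetical helper: read `oc get pods -o wide` output on stdin and print,
# for each of the two pod groups, the pod count and the number of distinct nodes.
# Assumption: NODE is column 7 (NAME READY STATUS RESTARTS AGE IP NODE ...).
summarize_spread() {
  awk '
    $1 ~ /^(alertmanager-main|prometheus-k8s)-[0-9]+$/ {
      workload = $1
      sub(/-[0-9]+$/, "", workload)            # strip the pod ordinal
      pods[workload]++
      if (!seen[workload, $7]++) nodes[workload]++
    }
    END {
      for (w in pods) {
        # Flag a group only when several pods share a single node.
        status = (pods[w] > 1 && nodes[w] == 1) ? "SAME-NODE" : "OK"
        print w, "pods=" pods[w], "nodes=" nodes[w], status
      }
    }
  ' | sort
}
```

It could be used as `oc get pods -n openshift-monitoring -o wide | summarize_spread`. A group reported as SAME-NODE matches the condition described under Root Cause.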

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
