Cluster upgrade is not possible because the 'alertmanager-main' and 'prometheus-k8s' pods are scheduled to the same node

Solution Verified - Updated 2024-06-13T20:38:53+00:00

Environment

Red Hat OpenShift Service on AWS [ROSA]
- 4.x

Issue

When attempting to upgrade the cluster, the Red Hat Hybrid Cloud Console shows events reporting that the upgrade maintenance was scheduled and that the upgrade started at the scheduled time. The upgrade is then delayed and eventually fails with the following message:

Cluster upgrade maintenance to version X.X.X on <date> has been cancelled due to unacknowledged user actions.

Resolution

Reschedule the alertmanager-main and prometheus-k8s pods by deleting them, and make sure the replacement pods are scheduled on different nodes.
1. Delete the pods from the openshift-monitoring namespace. Note: It is not necessary to delete all of them; deleting one pod from each StatefulSet is enough:
```
$ oc delete pods \
alertmanager-main-1 \
prometheus-k8s-1 -n openshift-monitoring
```
2. Validate that the pods are now scheduled on different nodes:
```
$ oc get pods -n openshift-monitoring -o wide | grep -E "alertmanager-main|prometheus-k8s"
```
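The validation in step 2 can also be scripted. The sketch below is only an illustration: `find_colocated` is a hypothetical helper name, not part of `oc`, and it assumes its input is one `NAME NODE` pair per line, as produced by the `custom-columns` invocation shown in the trailing comment. It prints every workload whose replicas all sit on a single node, so empty output suggests the pods are spread out.

```shell
# Hypothetical helper (not part of oc): reads "NAME NODE" pairs on stdin and
# prints each workload that has several replicas but only one distinct node.
find_colocated() {
  awk '
    {
      wl = $1
      sub(/-[0-9]+$/, "", wl)            # strip the ordinal: alertmanager-main-1 becomes alertmanager-main
      count[wl]++                        # replicas seen for this workload
      if (!((wl, $2) in seen)) {         # first time this workload is seen on this node
        seen[wl, $2] = 1
        nodes[wl]++
      }
    }
    END {
      for (wl in count)
        if (count[wl] > 1 && nodes[wl] == 1)
          print wl
    }
  ' | sort
}

# Assumed usage against a live cluster (requires a logged-in oc session):
#   oc get pods -n openshift-monitoring --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName \
#     | grep -E '^(alertmanager-main|prometheus-k8s)-[0-9]+ ' \
#     | find_colocated
```

If the helper still prints a workload name after the pods were deleted, the replacement pod most likely landed on the same node again.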

Root Cause

The upgrade can be blocked when all of the alertmanager-main and prometheus-k8s pods are scheduled to the same node, because the upgrade cannot proceed without causing an outage of the monitoring stack:

$ oc get pods -n openshift-monitoring -o wide | grep -E "alertmanager-main|prometheus-k8s"
NAME                                               READY   STATUS      RESTARTS   AGE     IP             NODE                                              NOMINATED NODE   READINESS GATES
alertmanager-main-0                                5/5     Running     0          3d12h   10.x.x.31    ip-10-x-x-85.ap-northeast-1.compute.internal    <none>           <none>
alertmanager-main-1                                5/5     Running     0          3d12h   10.x.x.35    ip-10-x-x-85.ap-northeast-1.compute.internal    <none>           <none>
alertmanager-main-2                                5/5     Running     0          3d12h   10.x.x.34    ip-10-x-x-85.ap-northeast-1.compute.internal    <none>           <none>
prometheus-k8s-0                                   7/7     Running     0          3d12h   10.x.x.32    ip-10-x-x-85.ap-northeast-1.compute.internal    <none>           <none>
prometheus-k8s-1                                   7/7     Running     0          3d12h   10.x.x.33    ip-10-x-x-85.ap-northeast-1.compute.internal    <none>           <none>

Diagnostic Steps

Look for the events in the Red Hat Hybrid Cloud Console and check whether the upgrade failed with the message below:

Cluster upgrade maintenance to version X.X.X on <date> has been cancelled due to unacknowledged user actions.

Confirm that the alertmanager-main and prometheus-k8s pods are scheduled to the same node:

$ oc get pods -n openshift-monitoring -o wide | grep -E "alertmanager-main|prometheus-k8s"
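
To make the confirmation easier to read on clusters with many nodes, the `-o wide` output can be summarized per node. This is a minimal sketch rather than an official procedure: `pods_per_node` is a hypothetical helper name, and it assumes the `-o wide` layout shown in this article, where the last three columns of each row are NODE, NOMINATED NODE and READINESS GATES.

```shell
# Hypothetical helper: reads `oc get pods -o wide` output on stdin and prints
# "<pod count> <node>" for the alertmanager-main and prometheus-k8s pods.
pods_per_node() {
  awk '
    # Assumption: NODE is the third field from the end of each pod row.
    $1 ~ /^(alertmanager-main|prometheus-k8s)-[0-9]+$/ { n[$(NF - 2)]++ }
    END { for (node in n) print n[node], node }
  ' | sort -k1,1nr -k2
}

# Assumed usage (requires a logged-in oc session):
#   oc get pods -n openshift-monitoring -o wide | pods_per_node
```

A single output line that accounts for every alertmanager-main and prometheus-k8s pod indicates that they all share one node.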

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
