Cluster upgrade is not possible because the 'alertmanager-main' and 'prometheus-k8s' pods are scheduled on the same node
Environment
- Red Hat OpenShift Service on AWS [ROSA]
- 4.x
Issue
- When attempting to upgrade the cluster, the Red Hat Hybrid Cloud Console shows events indicating that the upgrade maintenance was scheduled and then started at the scheduled time. The upgrade is eventually delayed and later fails with the following message:
Cluster upgrade maintenance to version X.X.X on <date> has been cancelled due to unacknowledged user actions.
Resolution
- Reschedule the alertmanager-main and prometheus-k8s pods by deleting them, and make sure they are scheduled on different nodes.
  - Delete the pods from the openshift-monitoring namespace. Note: It is not necessary to delete all of them; deleting one pod from each StatefulSet is enough:
    $ oc delete pods alertmanager-main-1 prometheus-k8s-1 -n openshift-monitoring
  - Validate that the pods are scheduled on different nodes:
    $ oc get pods -n openshift-monitoring -o wide | grep -E "alertmanager-main|prometheus-k8s"
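The validation step can also be scripted. The following is a minimal sketch, assuming a POSIX shell with awk. The name `check_spread` is a hypothetical helper, not an oc subcommand. It reads pod-name and node-name pairs on standard input and returns a non-zero status when all pods of a workload sit on a single node.

```shell
# check_spread: hypothetical helper, not part of oc.
# Reads "POD_NAME NODE_NAME" lines on stdin and reports, for each monitoring
# workload, how many distinct nodes its pods occupy. Exits with status 1 when
# a workload with more than one pod has all of its pods on one node.
check_spread() {
  awk '
    $1 ~ /^(alertmanager-main|prometheus-k8s)-[0-9]+$/ {
      workload = $1
      sub(/-[0-9]+$/, "", workload)      # strip the pod ordinal
      pods[workload]++
      if (!((workload, $2) in seen)) {   # count each node once per workload
        seen[workload, $2] = 1
        nodes[workload]++
      }
    }
    END {
      status = 0
      for (w in pods) {
        if (pods[w] > 1 && nodes[w] < 2) {
          printf "%s: all %d pods on one node\n", w, pods[w]
          status = 1
        } else {
          printf "%s: %d pods across %d nodes\n", w, pods[w], nodes[w]
        }
      }
      exit status
    }
  '
}

# Example invocation against a live cluster (requires a logged-in oc session):
#   oc get pods -n openshift-monitoring --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName | check_spread
```

The sketch uses `-o custom-columns` instead of `-o wide` so that the node name is always the second field, regardless of how the other columns are rendered.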
Root Cause
- The upgrade can be blocked when all the alertmanager-main and prometheus-k8s pods are scheduled on the same node, because the upgrade cannot proceed without causing an outage of the monitoring stack:
$ oc get pods -n openshift-monitoring -o wide | grep -E "alertmanager-main|prometheus-k8s"
NAME                  READY   STATUS    RESTARTS   AGE     IP          NODE                                           NOMINATED NODE   READINESS GATES
alertmanager-main-0   5/5     Running   0          3d12h   10.x.x.31   ip-10-x-x-85.ap-northeast-1.compute.internal   <none>           <none>
alertmanager-main-1   5/5     Running   0          3d12h   10.x.x.35   ip-10-x-x-85.ap-northeast-1.compute.internal   <none>           <none>
alertmanager-main-2   5/5     Running   0          3d12h   10.x.x.34   ip-10-x-x-85.ap-northeast-1.compute.internal   <none>           <none>
prometheus-k8s-0      7/7     Running   0          3d12h   10.x.x.32   ip-10-x-x-85.ap-northeast-1.compute.internal   <none>           <none>
prometheus-k8s-1      7/7     Running   0          3d12h   10.x.x.33   ip-10-x-x-85.ap-northeast-1.compute.internal   <none>           <none>
Diagnostic Steps
- Look for the events in the Red Hat Hybrid Cloud Console and check whether the upgrade failed with the message below:
Cluster upgrade maintenance to version X.X.X on <date> has been cancelled due to unacknowledged user actions.
- Confirm that the alertmanager-main and prometheus-k8s pods are scheduled on the same node:
$ oc get pods -n openshift-monitoring -o wide | grep -E "alertmanager-main|prometheus-k8s"
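To make co-location easier to spot than by reading the NODE column by eye, the pods can be tallied per node. This is a minimal sketch, assuming a POSIX shell with awk and sort. The name `pods_per_node` is a hypothetical helper, not an oc subcommand. It expects pod-name and node-name pairs on standard input.

```shell
# pods_per_node: hypothetical helper, not part of oc.
# Reads "POD_NAME NODE_NAME" lines on stdin and prints "<count> <node>" for
# the alertmanager-main and prometheus-k8s pods, busiest node first.
pods_per_node() {
  awk '
    $1 ~ /^(alertmanager-main|prometheus-k8s)-[0-9]+$/ { count[$2]++ }
    END { for (n in count) print count[n], n }
  ' | sort -k1,1nr -k2,2
}

# Example invocation against a live cluster (requires a logged-in oc session):
#   oc get pods -n openshift-monitoring --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName | pods_per_node
```

A single output line whose count equals the total number of these pods means they all share one node, which matches the condition described under Root Cause.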