Managed OpenShift Upgrade: PreHealthCheck Notifications and Troubleshooting

Environment

  • Red Hat OpenShift Service on AWS (ROSA) Classic
  • Red Hat OpenShift Dedicated (OSD)

Issue

The cluster owner has been notified via email either immediately after scheduling a cluster update (upgrade) or ahead of the scheduled upgrade time.

Resolution

Managed OpenShift services proactively perform various health checks ahead of a cluster update to improve the customer experience and the reliability of upgrades. If any of these health checks identifies an issue that requires attention from either the cluster owner or Red Hat SRE, a proactive notification is sent via email.

The following table lists the different notifications, their descriptions and whether customer action is needed for resolution.

No. | Notification Message | Description | Resolution Owner | Impact on upgrade
1 | Upgrade may face delay due to CriticalAlertsHealthcheckFailed | At least one critical alert is firing, which impacts the upgrade. | Red Hat SRE | Yes
2 | Upgrade may face delay due to ClusterOperatorsHealthcheckFailed | One or more ClusterOperators are in an unavailable state, which impacts the upgrade. | Red Hat SRE | Yes
3 | Upgrade may face delay due to CapacityReservationHealthcheckFailed | Temporary worker nodes will not be created during the upgrade because the default machine pool has been deleted. | Customer | May impact
4 | Upgrade may face delay due to NodeUnschedulableHealthcheckFailed | At least one node is in a SchedulingDisabled state, blocking the node update. | Customer | May impact
5 | Upgrade may face delay due to NodeUnschedulableTaintHealthcheckFailed | At least one node has a node-condition-based taint affecting pod scheduling and thus the node updates. | Customer | May impact
6 | Upgrade may face delay due to PDBHealthcheckFailed | At least one PodDisruptionBudget will block a successful drain of workloads. | Customer | May impact

Additional Instructions for Resolution

1. Troubleshooting CriticalAlertsHealthcheckFailed

About: We have observed that the cluster has at least one critical alert in the Firing state. A critical alert in the Firing state will block the cluster upgrade.

Next Steps: Red Hat SRE is notified for diagnosis and resolution. If the alert is determined to be caused by the customer's configuration of the cluster or the cloud environment, a follow-up notification will be sent requesting action from the cluster owner.

Action(s): None needed unless notified by Red Hat. To learn more about alerting or to identify which critical alerts are firing, refer to the OpenShift documentation on Alerting. Managed OpenShift clusters also implement custom platform alerts, as outlined in the OpenShift documentation.
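
For example, besides the web console (Observe > Alerting), one possible way to list currently firing critical alerts from the CLI is to query Alertmanager with amtool from inside its pod. The pod name alertmanager-main-0 is assumed here and may differ on your cluster:

# List currently firing critical alerts by querying Alertmanager from inside its pod
$ oc -n openshift-monitoring exec alertmanager-main-0 -c alertmanager -- \
    amtool --alertmanager.url=http://localhost:9093 alert query severity=critical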

2. Troubleshooting ClusterOperatorsHealthcheckFailed

About: This health check implies that one or more ClusterOperators are in a Degraded or Unavailable state, impacting the cluster upgrade.

Next Steps: Red Hat SRE is notified for diagnosis and resolution. If the issue is determined to be caused by the customer's configuration of the cluster or the cloud environment, a follow-up notification will be sent requesting action from the cluster owner.

Action(s): None needed unless notified by Red Hat. To learn more about ClusterOperator states, see the OpenShift documentation.

To view the details of the ClusterOperators, run the oc get clusteroperators command after logging in to the cluster.
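
For example, assuming jq is available (as in the other examples in this article), the following sketch shows the overall ClusterOperator status and then narrows the list to Operators that are not Available or are Degraded:

# Show the status of all ClusterOperators
$ oc get clusteroperators

# List only the ClusterOperators that are not Available or are Degraded
$ oc get clusteroperators -o json | jq -r '.items[] | select(any(.status.conditions[]; (.type == "Available" and .status != "True") or (.type == "Degraded" and .status == "True"))) | .metadata.name'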

3. Troubleshooting CapacityReservationHealthcheckFailed

About: During the upgrade, worker nodes are drained and updated (one at a time by default). Worker node capacity is preserved during the upgrade by creating one temporary node per availability zone (AZ) in the default 'worker' machine pool, with labels and taints exactly matching the drained node. However, if the default 'worker' machine pool has been deleted, the temporary nodes will not be created.

This notification is a warning that the upgrade might be impacted if all the nodes in the same MachineConfigPool are heavily loaded during the upgrade and there is no spare capacity to accommodate the drain of a node.

Action(s):

# Check the utilization of the worker nodes
$ oc adm top node -l node-role.kubernetes.io/worker=

Review the capacity of the worker nodes in your machine pools. If utilization is high, consider increasing the cluster capacity so that drained pods can still be scheduled. Note that the default 'worker' machine pool cannot be recreated once it has been deleted.
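
On ROSA, one way to confirm whether the default 'worker' machine pool still exists, and to add capacity to another machine pool, is with the rosa CLI. The cluster and machine pool names below are placeholders:

# List the machine pools on the cluster and check for the default 'worker' machine pool
$ rosa list machinepools --cluster <cluster-name>

# Increase the replica count of an existing machine pool to add capacity
$ rosa edit machinepool --cluster <cluster-name> --replicas <count> <machinepool-id>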

4. Troubleshooting NodeUnschedulableHealthcheckFailed

About: This failure implies that the cluster has one or more nodes in an unschedulable state because they have been manually cordoned.

Action(s):

# Fetch the list of nodes that are in unschedulable state
$ oc get nodes -ojson | jq -r '.items[] | select(.spec.unschedulable == true) | .metadata.name'

# Review and uncordon the nodes
$ oc adm uncordon $NODE

5. Troubleshooting NodeUnschedulableTaintHealthcheckFailed

About: This failure implies that one or more nodes in the cluster have at least one of the following taints, leading to the node(s) being unschedulable:

  • node.kubernetes.io/memory-pressure
  • node.kubernetes.io/disk-pressure
  • node.kubernetes.io/pid-pressure

Action(s):

# List the nodes that have any of the node-condition taints listed above
$ oc get nodes -o json | jq -r '.items[] | select(.spec.taints != null) | select(any(.spec.taints[]; .key | test("node.kubernetes.io/(memory|disk|pid)-pressure"))) | .metadata.name'
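To understand why a particular node carries one of these taints, it can help to review its resource usage and node conditions. Replace $NODE with a node name returned by the command above:

# Show the taints set on the node
$ oc get node $NODE -o jsonpath='{.spec.taints}{"\n"}'

# Review current resource usage and the node Conditions (MemoryPressure, DiskPressure, PIDPressure)
$ oc adm top node $NODE
$ oc describe node $NODE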

6. Troubleshooting PDBHealthcheckFailed

About: This failure indicates that certain PodDisruptionBudgets (PDBs) are preventing a node drain, which is required for seamless upgrades. The failure can occur due to the following PDB conditions:

  • Current healthy pods less than desired healthy pods: if current_healthy < desired_healthy, draining a node would reduce the number of healthy pods below the minimum required, thus blocking the drain.
  • Disruptions allowed is zero: if disruptions_allowed is zero, no further pod evictions are permitted, blocking the node drain.
  • Max unavailable pods constraint: draining the node would violate the maximum number of unavailable pods defined in the PDB.

Action(s):

We recommend reviewing your PodDisruptionBudget configurations by following the documentation guidelines for setting PDBs.
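
As a starting point, the following sketch (assuming jq is available) lists the PDBs that currently allow zero disruptions and would therefore block a node drain:

# List PodDisruptionBudgets across all namespaces that currently allow zero disruptions
$ oc get pdb -A -o json | jq -r '.items[] | select(.status.disruptionsAllowed == 0) | "\(.metadata.namespace)/\(.metadata.name)"'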

Additionally, please refer to the related KCS solution.

Root Cause

As part of Managed OpenShift upgrades on OSD v4 and ROSA, the managed-upgrade-operator (MUO) is responsible for performing the PreHealthCheck (PHC). These checks are performed in the following scenarios:

  • When an upgrade is scheduled more than 2 hours in advance.
  • Just before the control plane upgrade starts.

The following checks are performed as part of the PreHealthCheck:

  • Critical alerts firing check
  • ClusterOperators Degraded/Down check
  • Unschedulable Node check
  • Tainted Node check
  • PodDisruptionBudget check
  • No Worker MachinePool check
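
If you want to inspect what MUO has scheduled and reported, one option is to look at its UpgradeConfig resource. The namespace below is the one commonly used on managed clusters, and reading it may require elevated privileges:

# View the UpgradeConfig managed by the managed-upgrade-operator, including its status and history
$ oc get upgradeconfigs.upgrade.managed.openshift.io -n openshift-managed-upgrade-operator -o yaml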

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
