OpenShift 4 cluster upgrade pre-checks requirements
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
- Red Hat OpenShift Service on AWS (ROSA) 4
- Red Hat OpenShift Dedicated (OSD) 4
- Azure Red Hat OpenShift (ARO) 4
Issue
- What are the initial requirements before upgrading an OpenShift Cluster?
- How to check the health of the cluster objects?
- How to check the resource allocation on the cluster?
- How to check the status and running condition of the pods?
- Other Pre-checks.
Resolution
Before upgrading the cluster, consider the checks mentioned below to ensure that the cluster is healthy and safe to upgrade.
Creating a proactive case
Standard guidelines for upgrade proactive cases
- Date/Time (including timezone) for the Scheduled Maintenance Window.
- Full version number for the versions being upgraded from and to, in 4.y.z format.
- Proper contact information.
- Standard must-gather.
- Special operators:
- If Red Hat OpenShift Data Foundation is installed on the cluster, refer also to implications to consider when upgrading OpenShift Data Foundation and open a separate support case for OpenShift Data Foundation.
- If Red Hat OpenShift Virtualization or MTV is installed on the cluster, refer also to how to open a Proactive case for OpenShift Virtualization/MTV and open a separate support case for OpenShift Virtualization.
Specific data
- For self-managed OpenShift, refer also to how to open a PROACTIVE case for patching or upgrading Red Hat OpenShift Container Platform.
- For OSD/ROSA Classic and ROSA HCP, refer also to how to open a PROACTIVE case for ROSA Classic, ROSA HCP, and OSD Clusters.
- For ARO, refer also to how to open a PROACTIVE case for ARO Clusters, and provide the ARO Cluster ResourceID and Region, which can be fetched using:

Resource ID:
$ az aro show -n <cluster_name> -g <resource_group> --subscription <subscription_name> --query id

Region:
$ az aro show -n <cluster_name> -g <resource_group> --subscription <subscription_name> --query location
Cluster Pre-Checks
Checking Operators
Check that the versions of the operators running on the cluster are compatible with the desired OpenShift version. For Red Hat supported operators in OpenShift, refer to OpenShift Operator Life Cycles and the Red Hat OpenShift Container Platform Operator Update Information Checker, and search for each specific operator. This is especially important when upgrading to a new OpenShift minor version (the y in 4.y.z).
IMPORTANT NOTE: If Red Hat OpenShift Data Foundation (RHODF) is installed on the cluster, in addition to checking that the RHODF version is compatible with the desired RHOCP version, refer also to OpenShift Data Foundations (RHODF) Operator Upgrade Pre-Checks: if the ODF cluster is not healthy, it will not be possible to drain the ODF nodes, causing the OpenShift upgrade to hang. Check also implications to consider when upgrading OpenShift Data Foundation.
Checking the Cluster Update Path
It is important to check the available update path in advance, and also just before starting the upgrade (note that the update path can change if a specific release is identified as being affected by specific issues), using the Red Hat OpenShift Container Platform Update Path:
- If OCP or ARO, please follow the Standard Update Path tooling.
- If OSD or ROSA (Classic or HCP), please follow the ROSA Update Path tooling.
If a cluster upgrade is delayed or scheduled for a later date, please check the Update Paths again before running the upgrade steps.
- It is possible that the supported path has changed, or that upgrades to certain versions are blocked due to Development Engineering intervention to allow for patching of a newly discovered bug/issue.
- If there is a version block, the details for why this block is occurring will be provided in the results after running the above mentioned Update Path tooling.
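In addition to the Update Path tooling above, the recommended updates visible to the cluster itself can be cross-checked from the CLI. This is a supplementary check, not a replacement for the tooling:

$ oc adm upgrade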
Checking removed APIs
When upgrading to a new OpenShift minor version (the y in 4.y.z), check whether any API has been removed and whether any custom application is still using it. If custom applications use an API that is removed in the desired minor version, those applications will need to be updated to avoid issues after the upgrade. Refer to navigating Kubernetes API deprecations and removals for additional information.
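As an illustrative way to identify workloads that are still calling soon-to-be-removed APIs, the APIRequestCount resources can be queried. For example, the following command (described in the OpenShift documentation about reviewing removed APIs) lists the APIs flagged for removal and the release in which they are removed:

$ oc get apirequestcounts -o jsonpath='{range .items[?(@.status.removedInRelease!="")]}{.status.removedInRelease}{"\t"}{.metadata.name}{"\n"}{end}'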
If upgrading a cluster installed in vSphere to OpenShift 4.13 or 4.14
Check the known Issues with OpenShift 4.12 to 4.13 or 4.13 to 4.14 vSphere CSI Storage Migration.
If upgrading to OpenShift 4.14
Checking for duplicated headers in requests
When upgrading to OpenShift 4.14, HAProxy is upgraded from version 2.2 in previous releases to version 2.6 in 4.14. This upgrade introduces new HAProxy behavior when duplicated headers are found, as explained in Pods returns a 502 or 400 error when accessing via the application route after upgrading the RHOCP cluster to version 4.14, which includes information to identify the usage of duplicated headers before upgrading.
If upgrading to OpenShift 4.15
Checking the usage of ServiceAccount token secrets
When upgrading to OpenShift 4.15, the ServiceAccount token secrets automatically created in previous releases are removed if the Internal Image Registry is configured as Removed. Refer to ServiceAccount token secrets missing after upgrading to OpenShift 4.15 for additional information before upgrading.
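As a quick related check (an illustrative command, not part of the referenced article), the management state of the Internal Image Registry can be queried to confirm whether it is configured as Removed:

$ oc get configs.imageregistry.operator.openshift.io cluster -o jsonpath='{.spec.managementState}{"\n"}'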
Checking if IPsec is configured in the cluster
There is a known bug when upgrading clusters with IPsec enabled to OpenShift 4.15. Refer to how to upgrade to or between 4.15 releases and above when IPsec is enabled for additional information about the issue, and check whether IPsec is enabled in the network.operator cluster resource:
$ oc get network.operator cluster -o yaml
[...]
    ipsecConfig:
      mode: Full
[...]
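As an optional narrower query (assuming OVN-Kubernetes is the default network plugin), the IPsec configuration can also be extracted directly:

$ oc get network.operator cluster -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.ipsecConfig}{"\n"}'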
If upgrading to OpenShift 4.17
Do not perform the network plugin migration at the same time that an upgrade
OpenShift SDN is no longer supported in OpenShift 4.17, and a migration to OVN-Kubernetes is required. The network plugin migration should never be done at the same time as the upgrade, as explained in is it supported to upgrade the cluster to 4.17 at the same time than performing the network plugin migration? (note that the same applies to any version previous to 4.17).
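As a quick supplementary check before planning the migration, the network plugin currently in use can be confirmed with:

$ oc get network.config cluster -o jsonpath='{.status.networkType}{"\n"}'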
When using LDAP, ensure it supports TLS 1.3 or ECDHE ciphers
Due to the underlying Go version used, cipher suites without ECDHE support are no longer offered by either clients or servers during pre-TLS 1.3 handshakes. Refer to LDAP authentication fails with TLS handshake failure in OpenShift 4.17 or newer for additional information.
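One possible way to verify the LDAP server capabilities from a host with network access to it is an openssl check (an illustrative example; <ldap_host> is a placeholder for the LDAP server hostname):

$ openssl s_client -connect <ldap_host>:636 -tls1_3 < /dev/null
# If the TLS 1.3 handshake fails, verify that the server offers ECDHE cipher suites for TLS 1.2.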
If upgrading to OpenShift 4.19
Ensure control plane nodes have the label node-role.kubernetes.io/control-plane
The label node-role.kubernetes.io/control-plane could be missing in clusters installed with older versions that did not include it at installation time, which can cause issues during upgrades such as machine-config-nodes-crd-cleanup pod in Pending state during upgrade from OpenShift 4.18 to 4.19. Refer to inconsistency of node-role between newly created vs. long running OpenShift 4 clusters for additional information.
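One way to verify that the label is present is to compare the nodes returned by both label selectors; the two lists below should contain the same nodes (a supplementary check, not from the referenced article):

$ oc get nodes -l node-role.kubernetes.io/master -o name
$ oc get nodes -l node-role.kubernetes.io/control-plane -o name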
Ensure only cgroup v2 is used in the cluster
As cgroup v1 has been removed in OpenShift 4.19 (it was deprecated in OpenShift 4.16), ensure the cluster is using cgroup v2 before upgrading.
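The cgroup mode can be checked, for example, in the nodes.config cluster resource (an empty value means the release default is in use), or directly on a node, where cgroup2fs indicates cgroup v2 (<node_name> is a placeholder):

$ oc get nodes.config cluster -o jsonpath='{.spec.cgroupMode}{"\n"}'
$ oc debug node/<node_name> -- chroot /host stat -fc %T /sys/fs/cgroup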
If upgrading to OpenShift 4.20
Red Hat Marketplace is deprecated
In OpenShift 4.20, the Red Hat Marketplace is deprecated and it will be removed in an upcoming release. Refer to Red Hat Marketplace is deprecated for additional information.
Checking the Cluster Objects
- Check the status of the nodes to ensure that none of the nodes are in a NotReady or SchedulingDisabled state:

$ oc get nodes
NAME                       STATUS   ROLES    AGE     VERSION
master-0.lab.example.com   Ready    master   3d18h   v1.23.12+8a6bfe4
master-1.lab.example.com   Ready    master   3d18h   v1.23.12+8a6bfe4
master-2.lab.example.com   Ready    master   3d18h   v1.23.12+8a6bfe4
worker-0.lab.example.com   Ready    worker   3d17h   v1.23.12+8a6bfe4
worker-1.lab.example.com   Ready    worker   3d17h   v1.23.12+8a6bfe4
worker-2.lab.example.com   Ready    worker   3d17h   v1.23.12+8a6bfe4
- Check the status of the cluster operators to ensure that all the cluster operators are Available and are not in a Degraded state:

$ oc get co
NAME                            VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
[...]
etcd                            4.10.54   True        False         False      3d18h
image-registry                  4.10.54   True        False         False      3d9h
ingress                         4.10.54   True        False         False      3d17h
insights                        4.10.54   True        False         False      3d18h
kube-apiserver                  4.10.54   True        False         False      3d18h
kube-controller-manager         4.10.54   True        False         False      3d18h
kube-scheduler                  4.10.54   True        False         False      3d18h
kube-storage-version-migrator   4.10.54   True        False         False      2d3h
machine-api                     4.10.54   True        False         False      3d18h
machine-approver                4.10.54   True        False         False      3d18h
machine-config                  4.10.54   True        False         False      2d2h
[...]
- Check the health of the PVs and PVCs to ensure that:
  - All the PVs and PVCs are mounted.
  - None of the PVs and PVCs are unmounted.
  - None of the PVs and PVCs are stuck in the Terminating state.
  - No abnormal configurations exist.

$ oc get pv,pvc -A
NAMESPACE              NAME                                          STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
[...]
openshift-compliance   persistentvolumeclaim/ocp4-cis                Active            6d10h
openshift-compliance   persistentvolumeclaim/ocp4-cis-node-master    Active            6d10h
openshift-compliance   persistentvolumeclaim/ocp4-cis-node-worker    Active            6d10h
[...]
- Checks regarding machineConfigPools:
  - Check that every node on the cluster is associated with at least one machineConfigPool: every node must have a label that matches the nodeSelector of one of the existing machineConfigPools. After starting a cluster upgrade, new rendered configs will be created, and the machineConfigPools will apply the new rendered configs to the nodes. If a node is not associated with a machineConfigPool, the MachineConfigController will not update this node. Once the node is associated with a machineConfigPool, it will synchronize its configuration with the corresponding rendered config.
  - Check that all machineConfigPools have the paused: false parameter. If a machineConfigPool is in a paused state, the nodes associated with this machineConfigPool will not be updated. More information in the Red Hat Solution MachineConfigPools are paused, preventing the Machine Config Operator to push out updates in OpenShift 4.
  - Check the health of the machineConfigPools and make sure that MACHINECOUNT is equal to READYMACHINECOUNT and that no machine is stuck in UPDATEDMACHINECOUNT or DEGRADEDMACHINECOUNT:

$ oc get mcp
NAME     CONFIG                                            UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX   True      False      False      3              3                   3                     0                      4d7h
worker   rendered-worker-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX   True      False      False      3              3                   3                     0                      4d7h
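An optional supplementary command to review the paused state of all machineConfigPools at once:

$ oc get mcp -o custom-columns=NAME:.metadata.name,PAUSED:.spec.paused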
Checking the Cluster Node Allocation
Resource allocation can be checked in either of the following two ways:
Using $ oc describe
$ oc describe node worker-0.lab.example.com
[...]
Conditions: <====
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 14 Jun 2023 15:24:09 -0400 Tue, 13 Jun 2023 02:59:26 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 14 Jun 2023 15:24:09 -0400 Tue, 13 Jun 2023 02:59:26 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 14 Jun 2023 15:24:09 -0400 Tue, 13 Jun 2023 02:59:26 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 14 Jun 2023 15:24:09 -0400 Tue, 13 Jun 2023 02:59:36 -0400 KubeletReady kubelet is posting ready status
Capacity: <====
cpu: 4
ephemeral-storage: 41407468Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8146240Ki
pods: 250
Allocatable: <====
cpu: 3500m
ephemeral-storage: 37087380622
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 6995264Ki
pods: 250
System Info: <====
Machine ID: bc21e1755a9142238b04129b97e118c0
System UUID: bc21e175-5a91-4223-8b04-129b97e118c0
Boot ID: b0520f6a-09e7-4bf8-8e0c-9aa749fd14bc
Kernel Version: 4.18.0-372.58.1.el8_6.x86_64
OS Image: Red Hat Enterprise Linux CoreOS 412.86.202305230130-0 (Ootpa)
Operating System: linux
Architecture: amd64
Container Runtime Version: cri-o://1.25.3-4.rhaos4.12.git76ceef4.el8
Kubelet Version: v1.25.8+37a9a08
Kube-Proxy Version: v1.25.8+37a9a08
Non-terminated Pods: (33 in total) <====
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
new-test httpd-675fd5bfdd-9s4pr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 34h
openshift-cluster-node-tuning-operator tuned-wfrxw 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 37h
openshift-cnv cdi-operator-6ffbc46886-rfb97 10m (0%) 0 (0%) 150Mi (2%) 0 (0%) 27h
openshift-cnv cluster-network-addons-operator-675b769f6f-954wl 60m (1%) 0 (0%) 50Mi (0%) 0 (0%) 27h
openshift-cnv hco-operator-7f8c48598d-c6vh5 10m (0%) 0 (0%) 96Mi (1%) 0 (0%) 27h
openshift-cnv hco-webhook-fc6b4c4b5-7zrdb 5m (0%) 0 (0%) 48Mi (0%) 0 (0%) 27h
openshift-cnv hostpath-provisioner-operator-6b6bc8bf8-6mxkp 10m (0%) 0 (0%) 150Mi (2%) 0 (0%) 27h
openshift-cnv hyperconverged-cluster-cli-download-7f5844cb77-ftjbz 10m (0%) 0 (0%) 96Mi (1%) 0 (0%) 27h
Allocated resources: <====
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 959m (27%) 1700m (48%)
memory 2968Mi (43%) 1800Mi (26%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none> <====
[...]
The description of all the nodes at once can also be obtained using:
$ oc describe nodes > nodes_description.yaml
Using YAML Output
$ oc get node worker.lab.example.com -oyaml
To get the resource allocation of all the nodes at once, the following command can be used:
$ for i in $(oc get nodes | awk '{print $1}'); do echo "==== $i ====";oc describe node $i 2> /dev/null | grep -A10 Allocated; echo; done
[...]
==== master-0.lab.example.com ====
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1970m (56%) 400m (11%)
memory 8022Mi (53%) 900Mi (6%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
==== master-1.lab.example.com ====
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1935m (55%) 0 (0%)
memory 8357Mi (56%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
==== master-2.lab.example.com ====
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1579m (45%) 0 (0%)
memory 6282Mi (42%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
[...]
Refer to the documentation below for more details regarding Requests/Limits and Node Overcommitment.
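In addition to the allocated requests and limits shown above, the current utilization of the nodes can be reviewed as an optional supplementary check (this requires the cluster metrics to be available):

$ oc adm top nodes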
Checking the Pods' Health and Status
- Check for pods whose status is not Running, Completed, or Succeeded:

$ oc get pods --all-namespaces | egrep -v 'Running|Completed|Succeeded'
- Check the status of all pods within each namespace:

$ for i in `oc adm top pods -A | awk '{print $1}' | uniq`; do echo $i; oc get pods -owide -n $i; done

# Using grep against the node name will limit the search and give more accurate results:
$ for i in `oc adm top pods -A | awk '{print $1}' | uniq`; do echo $i; oc get pods -owide -n $i | grep <node_name>; echo '---------------------'; done
- Check pod logs using:

$ oc logs pod/<pod_name> -n <namespace_name>

- Use the -c parameter to fetch the logs from a particular container:

$ oc logs pod/<pod_name> -c <container_name> -n <namespace_name>
Other Pre-checks:
- Check the health of the etcd cluster (see the example commands after this list).
- Check the network health using Network Observability.
- Check for pending certificate signing requests:

$ oc get csr
- For Pod Disruption Budgets, check the output of the command below to see whether there are pods that may block the node draining process during upgrades. Usually, the allowed disruptions are set to 1 to allow nodes to be drained properly. The must-gather does not capture this properly, so this output needs to be checked manually:

$ oc get pdb -A
- Check the firing alerts in Alertmanager via Web Console -> Observe -> Alerting and make sure there is no Warning or Critical alert firing, and that you are aware of the existing Info ones.
- Look for Warning events in all namespaces and check if there is anything that might be concerning:

$ oc get events -A --field-selector type=Warning --sort-by=".lastTimestamp"
- Ensure that any third-party software running on the cluster is compatible with the target OpenShift version prior to upgrading. Please note that Red Hat does not verify third-party compatibility with any version of OCP; this is the sole responsibility of the vendor. Please check our Third-Party Support documentation for further information.
  - Third-party applications/operators of note are the following:
    - TwistLock - Compatibility Matrix [External Link]
    - Dynatrace - Compatibility Matrix [External Link]
- Review the release notes for the target OpenShift version to identify any platform changes that could impact your applications. This allows you to assess potential risks and implement necessary adjustments before proceeding with the upgrade.
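As referenced in the etcd item above, the etcd health can be checked, for example, by running etcdctl from one of the etcd pods, following the standard etcd troubleshooting steps (<etcd_pod_name> is a placeholder):

$ oc get pods -n openshift-etcd -l app=etcd
$ oc rsh -n openshift-etcd <etcd_pod_name>
sh-4.4# etcdctl endpoint health --cluster
sh-4.4# etcdctl endpoint status --cluster -w table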
Root Cause
Red Hat OpenShift Container Platform 4 upgrades imply the upgrade of several different components, so it is required to check the overall status of the cluster and the compatibility of any additional operators and third-party components before starting an upgrade.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.