OpenShift 4 cluster upgrade pre-checks requirements
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
- Red Hat OpenShift Service on AWS (ROSA) 4
- Red Hat OpenShift Dedicated (OSD) 4
- Azure Red Hat OpenShift (ARO) 4
Issue
- What are the initial requirements before upgrading an OpenShift Cluster?
- How to check the health of the cluster objects?
- How to check the resource allocation on the cluster?
- How to check the status and running condition of the pods?
- Other Pre-checks.
Resolution
Before upgrading the cluster, the checks below can be used to ensure that the cluster is healthy and safe to upgrade.
Creating a proactive case
Standard Guidelines for ARO
Below are the required prerequisites for creating a standard proactive ARO cluster upgrade ticket:
- Date/Time (including timezone) for the Scheduled Maintenance Window as per How to open a PROACTIVE case for a ROSA, OSD or ARO.
- Proper Contact information.
- Standard must-gather.
- ARO Cluster ResourceID and Region, which can be fetched using:
Resource ID:
$ az aro show -n <cluster_name> -g <resource_group> --subscription <subscription_name> --query id
Region:
$ az aro show -n <cluster_name> -g <resource_group> --subscription <subscription_name> --query location
Standard Guidelines for OSD/ROSA
Below are the required prerequisites for creating a standard proactive OSD/ROSA cluster upgrade ticket:
- Date/Time (including timezone) for the Scheduled Maintenance Window as per How to open a PROACTIVE case for a ROSA, OSD or ARO.
- Proper Contact information.
Standard Guidelines for OCP
Below are the required prerequisites for creating a standard proactive OCP cluster upgrade ticket:
- Date/time (including timezone) for the scheduled maintenance window as per Proactive Case Standard Guidelines.
- Proper Contact information.
- Standard must-gather.
Cluster Pre-Checks
Checking Operators
Check if the versions of the operators running on the cluster are compatible with the desired OpenShift version. For Red Hat supported operators in OpenShift, refer to OpenShift Operator Life Cycles and Red Hat OpenShift Container Platform Operator Update Information Checker, and search for each specific operator. This is especially important when upgrading to a new OpenShift minor version (the y in 4.y.z).
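To review which OLM-managed operators are installed on the cluster and which versions they are running, the ClusterServiceVersion objects can be listed, for example:
$ oc get clusterserviceversions -A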
IMPORTANT NOTE: If Red Hat OpenShift Data Foundation (RHODF) is installed on the cluster, in addition to checking whether the RHODF version is compatible with the desired RHOCP version, please also refer to OpenShift Data Foundations (RHODF) Operator Upgrade Pre-Checks: if the status of the ODF cluster is not healthy, it will not be possible to drain the ODF nodes, causing the OpenShift upgrade to hang.
Checking the Cluster Upgrade Path
Using the Red Hat OpenShift Container Platform Update Graph, it is possible to check the upgrade paths available for the desired version.
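The current cluster version and the recommended update targets for the cluster's current channel can also be checked directly from the cluster, for example:
$ oc get clusterversion
$ oc adm upgrade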
Checking removed APIs
When upgrading to a new OpenShift minor version (the y in 4.y.z), it is necessary to check whether any API is removed in that version and whether any custom application is still using it. If custom applications use an API that is removed in the desired minor version, those applications must be updated to avoid issues after the upgrade. Refer to Navigating Kubernetes API deprecations and removals for additional information.
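The APIRequestCount objects track which API versions have been requested on the cluster and which of them will be removed in a future release. A query similar to the following (a sketch using a jsonpath filter, adapt as needed) can help identify removed APIs that are still in use:
$ oc get apirequestcounts -o jsonpath='{range .items[?(@.status.removedInRelease!="")]}{.status.removedInRelease}{"\t"}{.metadata.name}{"\n"}{end}'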
If upgrading to OpenShift 4.14
Checking for duplicated headers in requests
When upgrading to OpenShift 4.14, HAProxy is upgraded from version 2.2 (used in previous versions) to 2.6. This introduces a new HAProxy behavior when duplicated headers are found, as explained in Pods returns a 502 or 400 error when accessing via the application route after upgrading the RHOCP cluster to version 4.14, which also includes information on how to identify the usage of duplicated headers before upgrading.
If upgrading to OpenShift 4.15
Checking the usage of ServiceAccount token secrets
When upgrading to OpenShift 4.15, the ServiceAccount token secrets automatically created in previous releases are removed if the Internal Image Registry is configured as Removed. Please refer to ServiceAccount token secrets missing after upgrading to OpenShift 4.15 for additional information before upgrading to OpenShift 4.15.
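The configured state of the Internal Image Registry can be verified with, for example:
$ oc get configs.imageregistry.operator.openshift.io cluster -o jsonpath='{.spec.managementState}{"\n"}'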
Checking if IPsec is configured in the cluster
There is a known bug when upgrading clusters with IPsec enabled to OpenShift 4.15. Refer to upgrading Full IPsec cluster from 4.14 to 4.15 has broken network communication in some nodes for additional information about the issue, and check in the network.operator cluster resource whether IPsec is enabled:
$ oc get network.operator cluster -o yaml
[...]
ipsecConfig:
mode: Full
[...]
If upgrading a cluster installed in vSphere
Check the Known Issues with OpenShift 4.12 to 4.13 or 4.13 to 4.14 vSphere CSI Storage Migration.
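If it is unclear whether the cluster runs on vSphere, the platform type can be confirmed from the Infrastructure resource, for example:
$ oc get infrastructure cluster -o jsonpath='{.status.platformStatus.type}{"\n"}'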
Checking the Cluster Objects
- Check the status of the nodes to ensure that none of the nodes are in a NotReady or SchedulingDisabled state:
$ oc get nodes
NAME                       STATUS   ROLES    AGE     VERSION
master-0.lab.example.com   Ready    master   3d18h   v1.23.12+8a6bfe4
master-1.lab.example.com   Ready    master   3d18h   v1.23.12+8a6bfe4
master-2.lab.example.com   Ready    master   3d18h   v1.23.12+8a6bfe4
worker-0.lab.example.com   Ready    worker   3d17h   v1.23.12+8a6bfe4
worker-1.lab.example.com   Ready    worker   3d17h   v1.23.12+8a6bfe4
worker-2.lab.example.com   Ready    worker   3d17h   v1.23.12+8a6bfe4
- Check the status of the cluster operators to ensure that all the cluster operators are Available and are not in a Degraded state:
$ oc get co
NAME                            VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
[...]
etcd                            4.10.54   True        False         False      3d18h
image-registry                  4.10.54   True        False         False      3d9h
ingress                         4.10.54   True        False         False      3d17h
insights                        4.10.54   True        False         False      3d18h
kube-apiserver                  4.10.54   True        False         False      3d18h
kube-controller-manager         4.10.54   True        False         False      3d18h
kube-scheduler                  4.10.54   True        False         False      3d18h
kube-storage-version-migrator   4.10.54   True        False         False      2d3h
machine-api                     4.10.54   True        False         False      3d18h
machine-approver                4.10.54   True        False         False      3d18h
machine-config                  4.10.54   True        False         False      2d2h
[...]
- Check the health of the PVs and PVCs to ensure that:
  - All the PVs and PVCs are mounted.
  - None of the PVs and PVCs are unmounted.
  - None of the PVs and PVCs are stuck in the terminating state.
  - No abnormal configurations exist.
$ oc get pv,pvc -A
NAMESPACE              NAME                                         STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
[...]
openshift-compliance   persistentvolumeclaim/ocp4-cis               Active                                                     6d10h
openshift-compliance   persistentvolumeclaim/ocp4-cis-node-master   Active                                                     6d10h
openshift-compliance   persistentvolumeclaim/ocp4-cis-node-worker   Active                                                     6d10h
[...]
- Checks regarding machineConfigPools:
  - Check that every node on the cluster is associated with at least one machineConfigPool: every node must have a label that matches the nodeSelector of one of the existing machineConfigPools. After a cluster upgrade starts, new rendered configs are created and the machineConfigPools apply them to their nodes. If a node is not associated with any machineConfigPool, the MachineConfigController will not update that node. Once the node is associated with a machineConfigPool, it will synchronize its configuration with the corresponding rendered config.
  - Check that all machineConfigPools have the paused: false parameter. If a machineConfigPool is in a paused state, the nodes associated with it will not be updated. More information is available in the Red Hat solution MachineConfigPools are paused, preventing the Machine Config Operator to push out updates in OpenShift 4.
  - Check the health of the machineConfigPools and make sure that MACHINECOUNT is equal to READYMACHINECOUNT and that no machines are stuck updating or degraded (check UPDATEDMACHINECOUNT and DEGRADEDMACHINECOUNT):
$ oc get mcp
NAME     CONFIG                                            UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX   True      False      False      3              3                   3                     0                      4d7h
worker   rendered-worker-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX   True      False      False      3              3                   3                     0                      4d7h
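As a quick way to review the paused state of all machineConfigPools, a jsonpath query such as the following can be used (pools where paused is unset or false are not paused):
$ oc get mcp -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.paused}{"\n"}{end}'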
Checking the Cluster Node Allocation
Resource allocation can be checked in two ways:
Using $ oc describe
$ oc describe node worker-0.lab.example.com
[...]
Conditions: <====
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 14 Jun 2023 15:24:09 -0400 Tue, 13 Jun 2023 02:59:26 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 14 Jun 2023 15:24:09 -0400 Tue, 13 Jun 2023 02:59:26 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 14 Jun 2023 15:24:09 -0400 Tue, 13 Jun 2023 02:59:26 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 14 Jun 2023 15:24:09 -0400 Tue, 13 Jun 2023 02:59:36 -0400 KubeletReady kubelet is posting ready status
Capacity: <====
cpu: 4
ephemeral-storage: 41407468Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8146240Ki
pods: 250
Allocatable: <====
cpu: 3500m
ephemeral-storage: 37087380622
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 6995264Ki
pods: 250
System Info: <====
Machine ID: bc21e1755a9142238b04129b97e118c0
System UUID: bc21e175-5a91-4223-8b04-129b97e118c0
Boot ID: b0520f6a-09e7-4bf8-8e0c-9aa749fd14bc
Kernel Version: 4.18.0-372.58.1.el8_6.x86_64
OS Image: Red Hat Enterprise Linux CoreOS 412.86.202305230130-0 (Ootpa)
Operating System: linux
Architecture: amd64
Container Runtime Version: cri-o://1.25.3-4.rhaos4.12.git76ceef4.el8
Kubelet Version: v1.25.8+37a9a08
Kube-Proxy Version: v1.25.8+37a9a08
Non-terminated Pods: (33 in total) <====
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
new-test httpd-675fd5bfdd-9s4pr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 34h
openshift-cluster-node-tuning-operator tuned-wfrxw 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 37h
openshift-cnv cdi-operator-6ffbc46886-rfb97 10m (0%) 0 (0%) 150Mi (2%) 0 (0%) 27h
openshift-cnv cluster-network-addons-operator-675b769f6f-954wl 60m (1%) 0 (0%) 50Mi (0%) 0 (0%) 27h
openshift-cnv hco-operator-7f8c48598d-c6vh5 10m (0%) 0 (0%) 96Mi (1%) 0 (0%) 27h
openshift-cnv hco-webhook-fc6b4c4b5-7zrdb 5m (0%) 0 (0%) 48Mi (0%) 0 (0%) 27h
openshift-cnv hostpath-provisioner-operator-6b6bc8bf8-6mxkp 10m (0%) 0 (0%) 150Mi (2%) 0 (0%) 27h
openshift-cnv hyperconverged-cluster-cli-download-7f5844cb77-ftjbz 10m (0%) 0 (0%) 96Mi (1%) 0 (0%) 27h
Allocated resources: <====
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 959m (27%) 1700m (48%)
memory 2968Mi (43%) 1800Mi (26%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none> <====
[...]
The description of all the nodes at once can also be obtained using:
$ oc describe nodes > nodes_description.yaml
Using YAML Output
$ oc get node worker.lab.example.com -oyaml
To get the resource allocation of all the nodes at once, the following command can be used:
$ for i in $(oc get nodes --no-headers | awk '{print $1}'); do echo "==== $i ===="; oc describe node $i 2> /dev/null | grep -A10 Allocated; echo; done
[...]
==== master-0.lab.example.com ====
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1970m (56%) 400m (11%)
memory 8022Mi (53%) 900Mi (6%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
==== master-1.lab.example.com ====
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1935m (55%) 0 (0%)
memory 8357Mi (56%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
==== master-2.lab.example.com ====
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1579m (45%) 0 (0%)
memory 6282Mi (42%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
[...]
Refer to the below documentation for more details regarding Requests/Limits and Node Overcommitment.
Checking the Pods' Health and Status
- Check for pods whose status is not Running, Completed, or Succeeded:
$ oc get pods --all-namespaces | egrep -v 'Running|Completed|Succeeded'
- Check the status of all pods within each namespace:
$ for i in `oc adm top pods -A | awk '{print $1}' | uniq`; do echo $i; oc get pods -owide -n $i; done
### [Using grep against the node name will limit the search to get more accurate results]
$ for i in `oc adm top pods -A | awk '{print $1}' | uniq`; do echo $i; oc get pods -owide -n $i | grep <node_name>; echo '---------------------'; done
- Check Pod logs using:
$ oc logs pod/<pod_name> -n <namespace_name>
- Use the -c parameter to fetch the logs from a particular container:
$ oc logs pod/<pod_name> -c <container_name> -n <namespace_name>
Other Pre-checks:
- Check the health of the etcd cluster.
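For example, the health of the etcd members can be queried from one of the etcd pods; this is a sketch assuming the default etcdctl container and pod labels in the openshift-etcd namespace:
$ oc get pods -n openshift-etcd -l app=etcd
$ oc exec -n openshift-etcd -c etcdctl <etcd_pod_name> -- etcdctl endpoint health --cluster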
- Check the network health using Network Observability.
- Check for pending certificate signing requests:
$ oc get csr
- Regarding Pod Disruption Budgets, the output of the command below is required to check whether there are pods that may block the node draining process during upgrades. Usually, the allowed disruptions are set to 1 to allow nodes to be drained properly. The must-gather does not capture this properly, so this output needs to be checked manually:
$ oc get pdb -A
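To quickly spot PodDisruptionBudgets that currently allow zero disruptions (and could therefore block a node drain), a jsonpath filter such as the following can be used as a starting point:
$ oc get pdb -A -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}'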
- Check the firing alerts in Alertmanager via Web Console -> Observe -> Alerting, and make sure that no Warning or Critical alerts are firing and that you are aware of the existing Info ones.
- Look for Warning events in all namespaces and check if there is anything that might be concerning:
$ oc get events -A --field-selector type=Warning --sort-by=".lastTimestamp"
Root Cause
A Red Hat OpenShift Container Platform 4 upgrade implies the upgrade of several different components, so it is required to check the overall status of the cluster and the compatibility of any additional operators and third-party components before starting the upgrade.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.