How to reboot a single worker node in Azure Red Hat OpenShift

Solution Verified

Environment

  • Azure Red Hat OpenShift (ARO)
    • 4

Issue

  • Steps required to reboot a single worker node.
  • Powering off a node or scaling the worker count down to zero is not supported.

WARNING

Do not reboot master nodes. Master nodes are managed by Red Hat. Raise a support case if maintenance is required on the master nodes.

Resolution

  1. Identify the required worker node
   $ oc get nodes

   NAME                                    STATUS   ROLES    AGE     VERSION
   20230809-xxxx-master-0                  Ready    master   5d17h   v1.24.15+990d55b
   20230809-xxxx-master-1                  Ready    master   5d17h   v1.24.15+990d55b
   20230809-xxxx-master-2                  Ready    master   5d17h   v1.24.15+990d55b
   20230809-xxxx-worker-xxx-cbm52          Ready    worker   5d16h   v1.24.15+990d55b
   20230809-xxxx-worker-xxx-lmtl7          Ready    worker   5d16h   v1.24.15+990d55b
   20230809-xxxx-worker-xxx-wft2m          Ready    worker   5d16h   v1.24.15+990d55b
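The listing above can be narrowed to workers only. The snippet below is a sketch that filters the sample output from this article with awk; on a live cluster the equivalent subset comes from the standard role label, `oc get nodes -l node-role.kubernetes.io/worker`.

```shell
# Filter the sample `oc get nodes` output down to worker nodes.
# Field 3 is the ROLES column; NR > 1 skips the header line.
# On a live cluster: oc get nodes -l node-role.kubernetes.io/worker
workers=$(awk 'NR > 1 && $3 == "worker" {print $1}' <<'EOF'
NAME                                    STATUS   ROLES    AGE     VERSION
20230809-xxxx-master-0                  Ready    master   5d17h   v1.24.15+990d55b
20230809-xxxx-worker-xxx-cbm52          Ready    worker   5d16h   v1.24.15+990d55b
20230809-xxxx-worker-xxx-lmtl7          Ready    worker   5d16h   v1.24.15+990d55b
EOF
)
echo "$workers"
```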

  1. Cordon a worker node
   $ oc adm cordon 20230809-xxxx-worker-xxx-cbm52

   node/20230809-xxxx-worker-xxx-cbm52 cordoned
  1. It is recommended to review the list of pods before draining
   $ oc get pods -A -o wide --field-selector spec.nodeName=<worker_node_name>
  1. Drain the node in preparation for maintenance. If the command fails, retry with additional options:
    --force deletes pods that are not managed by a ReplicationController, ReplicaSet, Job, DaemonSet, or StatefulSet.
    --delete-emptydir-data deletes pods that use emptyDir (local) storage.
    --ignore-daemonsets ignores DaemonSet-managed pods so that eviction of the remaining pods can proceed.
    --disable-eviction bypasses PodDisruptionBudget (PDB) checks and deletes pods directly.
   $ oc adm drain 20230809-xxxx-worker-xxx-cbm52

   node/20230809-xxxx-worker-xxx-cbm52 already cordoned
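When the plain drain fails, the options above are commonly combined. The sketch below is illustrative, not this article's exact command: NODE and the DRY_RUN guard are placeholders, and with the default DRY_RUN=1 the command is only printed so the sketch runs without a cluster.

```shell
# Sketch: drain with the common fallback options described above.
# NODE is a placeholder node name from this article's sample output.
# With DRY_RUN=1 (the default here) the command is printed, not executed.
NODE="${NODE:-20230809-xxxx-worker-xxx-cbm52}"
CMD="oc adm drain $NODE --ignore-daemonsets --delete-emptydir-data --force"
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$CMD"
else
    $CMD
fi
```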
  1. Ensure the drain has rescheduled all pods onto other nodes and that the remaining nodes have adequate resources. Confirm that critical applications are still available. Note that some pods, such as DaemonSet pods, cannot be rescheduled and will be restarted as part of the reboot.
  • Check for undrained pods on the node
   $ oc get pod -o wide -A | grep "<worker_node_name>"
  • Check for recent FailedScheduling events. These may indicate that the cluster is under-resourced and needs additional nodes.
   $ oc get events -A | grep "FailedScheduling"
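The first check above can be scripted. In this sketch the here-doc stands in for live `oc get pod -o wide -A` output, and the pod names are illustrative placeholders; field 8 of that listing is the NODE column.

```shell
# Sketch: list pods still scheduled on the drained node.
# The here-doc replaces live `oc get pod -o wide -A` output;
# $8 is the NODE column, so the header row never matches a real node name.
NODE="20230809-xxxx-worker-xxx-cbm52"
remaining=$(awk -v node="$NODE" '$8 == node {print $1 "/" $2}' <<'EOF'
NAMESPACE       NAME              READY   STATUS    RESTARTS   AGE   IP         NODE                             NOMINATED NODE   READINESS GATES
openshift-dns   dns-default-abcd  2/2     Running   0          5d    10.0.0.1   20230809-xxxx-worker-xxx-cbm52   <none>           <none>
openshift-dns   dns-default-efgh  2/2     Running   0          5d    10.0.0.2   20230809-xxxx-worker-xxx-lmtl7   <none>           <none>
EOF
)
echo "Pods still on $NODE:"
echo "$remaining"
```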
  1. Check the status of the nodes; the cordoned worker is expected to show Ready,SchedulingDisabled
   $ oc get nodes

   NAME                                    STATUS                     ROLES    AGE     VERSION
   20230809-xxxx-master-0                  Ready                      master   5d17h   v1.24.15+990d55b
   20230809-xxxx-master-1                  Ready                      master   5d17h   v1.24.15+990d55b
   20230809-xxxx-master-2                  Ready                      master   5d17h   v1.24.15+990d55b
   20230809-xxxx-worker-xxx-cbm52          Ready,SchedulingDisabled   worker   5d17h   v1.24.15+990d55b <--- Disabled
   20230809-xxxx-worker-xxx-lmtl7          Ready                      worker   5d17h   v1.24.15+990d55b
   20230809-xxxx-worker-xxx-wft2m          Ready                      worker   5d17h   v1.24.15+990d55b
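The cordoned state can also be checked with a small script. This is a sketch: the here-doc is the sample listing from this article, and on a live cluster you would pipe real `oc get nodes` output in instead.

```shell
# Sketch: verify the cordoned node reports Ready,SchedulingDisabled.
# The here-doc is the anonymized sample listing from this article.
NODE="20230809-xxxx-worker-xxx-cbm52"
status=$(awk -v node="$NODE" '$1 == node {print $2}' <<'EOF'
20230809-xxxx-worker-xxx-cbm52          Ready,SchedulingDisabled   worker   5d17h   v1.24.15+990d55b
20230809-xxxx-worker-xxx-lmtl7          Ready                      worker   5d17h   v1.24.15+990d55b
EOF
)
case "$status" in
  *SchedulingDisabled*) echo "$NODE is cordoned" ;;
  *)                    echo "$NODE is NOT cordoned" ;;
esac
```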

  1. The oc debug node/<node_name> command opens a shell prompt on the worker node. It creates a separate debug container, mounts the node's root file system at the /host folder, and allows you to inspect any files from the node.
   $ oc debug node/20230809-xxxx-worker-xxx-cbm52

   Temporary namespace openshift-debug-kck98 is created for debugging node...
   Starting pod/20230809-xxxx-worker-xxx-cbm52-debug ...
   To use host binaries, run `chroot /host`
   Pod IP: x.x.x.x
   If you don't see a command prompt, try pressing enter.
   sh-4.4# 
  1. Start a chroot shell in the /host folder
   sh-4.4# chroot /host
  1. Reboot the worker node
   sh-4.4# reboot

   Removing debug pod ...
   Temporary namespace openshift-debug-xx was removed.
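The debug, chroot, and reboot steps above can also be issued as a single non-interactive command using the `oc debug node/<node> -- <command>` form. In this sketch the oc() stub only echoes the call so it runs without a cluster; remove the stub to execute for real.

```shell
# Sketch: reboot the node non-interactively instead of via the debug shell.
# The oc() stub below only echoes the call so this is safe to run anywhere;
# remove it for real use against a cluster.
oc() { echo "would run: oc $*"; }

NODE="20230809-xxxx-worker-xxx-cbm52"
oc debug node/"$NODE" -- chroot /host systemctl reboot
```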
  1. Watch the progress and confirm the worker node has rebooted
   $ oc describe node 20230809-xxxx-worker-xxx-cbm52 | grep LastTransitionTime -A2

    Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
    ----             ------  -----------------                 ------------------                ------                       -------
    MemoryPressure   False   Tue, 15 Aug 2023 12:40:54 +0800   Tue, 15 Aug 2023 12:18:38 +0800   KubeletHasSufficientMemory

  1. Confirm the worker node is Ready after the reboot
   $ oc wait --for=condition=Ready node/20230809-xxxx-worker-xxx-cbm52

   node/20230809-xxxx-worker-xxx-cbm52 condition met
  1. Restore (uncordon) the worker node from maintenance mode
   $ oc adm uncordon 20230809-xxxx-worker-xxx-cbm52

   node/20230809-xxxx-worker-xxx-cbm52 uncordoned
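Taken together, the whole procedure can be sketched as one script. This is an illustrative outline, not a supported tool: the node name and drain options are placeholders, and the oc() stub echoes each command so the sketch runs without a cluster; remove the stub for real use.

```shell
#!/bin/sh
# Sketch: the full single-node reboot procedure from this article as a script.
set -eu
NODE="${1:-20230809-xxxx-worker-xxx-cbm52}"

# Dry-run stub: echoes every oc call instead of executing it.
# Remove this function to run against a real cluster.
oc() { echo "oc $*"; }

oc adm cordon "$NODE"
oc adm drain "$NODE" --ignore-daemonsets --delete-emptydir-data
oc debug node/"$NODE" -- chroot /host systemctl reboot
oc wait --for=condition=Ready "node/$NODE" --timeout=10m
oc adm uncordon "$NODE"
```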

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.