How to reboot a single worker node in Azure Red Hat OpenShift

Solution Verified

Environment

  • Azure Red Hat OpenShift (ARO)
    • 4

Issue

  • Steps required to reboot a single worker node.
  • Powering off a node or scaling the worker count down to zero is not supported.

WARNING

Do not reboot master nodes. Master nodes are managed by Red Hat. Raise a support case if maintenance is required on the master nodes.

Resolution

  1. Identify the required worker node
   $ oc get nodes

   NAME                                    STATUS   ROLES    AGE     VERSION
   20230809-xxxx-master-0                  Ready    master   5d17h   v1.24.15+990d55b
   20230809-xxxx-master-1                  Ready    master   5d17h   v1.24.15+990d55b
   20230809-xxxx-master-2                  Ready    master   5d17h   v1.24.15+990d55b
   20230809-xxxx-worker-xxx-cbm52          Ready    worker   5d16h   v1.24.15+990d55b
   20230809-xxxx-worker-xxx-lmtl7          Ready    worker   5d16h   v1.24.15+990d55b
   20230809-xxxx-worker-xxx-wft2m          Ready    worker   5d16h   v1.24.15+990d55b
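The listing above can be narrowed to workers only. The snippet below is a sketch that filters the sample output from this article with awk; on a live cluster the equivalent subset comes from the standard role label, `oc get nodes -l node-role.kubernetes.io/worker`.

```shell
# Filter the sample `oc get nodes` output down to worker nodes.
# Field 3 is the ROLES column; NR > 1 skips the header line.
# On a live cluster: oc get nodes -l node-role.kubernetes.io/worker
workers=$(awk 'NR > 1 && $3 == "worker" {print $1}' <<'EOF'
NAME                                    STATUS   ROLES    AGE     VERSION
20230809-xxxx-master-0                  Ready    master   5d17h   v1.24.15+990d55b
20230809-xxxx-worker-xxx-cbm52          Ready    worker   5d16h   v1.24.15+990d55b
20230809-xxxx-worker-xxx-lmtl7          Ready    worker   5d16h   v1.24.15+990d55b
EOF
)
echo "$workers"
```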

  1. Cordon a worker node
   $ oc adm cordon 20230809-xxxx-worker-xxx-cbm52

   node/20230809-xxxx-worker-xxx-cbm52 cordoned
  1. It is recommended to review the list of pods before draining
   $ oc get pods -A -o wide --field-selector spec.nodeName=<worker_node_name>
  1. Drain the node in preparation for maintenance. If the command fails, retry with additional options:
    --force deletes pods that are not managed by a ReplicationController, ReplicaSet, Job, DaemonSet, or StatefulSet.
    --delete-emptydir-data deletes pods that use emptyDir (local) storage.
    --ignore-daemonsets ignores DaemonSet-managed pods so that eviction of the remaining pods can proceed.
    --disable-eviction bypasses PodDisruptionBudget (PDB) checks and deletes pods directly.
   $ oc adm drain 20230809-xxxx-worker-xxx-cbm52

   node/20230809-xxxx-worker-xxx-cbm52 already cordoned
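When the plain drain fails, the options above are commonly combined. The sketch below is illustrative, not this article's exact command: NODE and the DRY_RUN guard are placeholders, and with the default DRY_RUN=1 the command is only printed so the sketch runs without a cluster.

```shell
# Sketch: drain with the common fallback options described above.
# NODE is a placeholder node name from this article's sample output.
# With DRY_RUN=1 (the default here) the command is printed, not executed.
NODE="${NODE:-20230809-xxxx-worker-xxx-cbm52}"
CMD="oc adm drain $NODE --ignore-daemonsets --delete-emptydir-data --force"
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$CMD"
else
    $CMD
fi
```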
  1. Ensure the drain has rescheduled all pods onto other nodes and that the remaining nodes have adequate resources. Confirm that critical applications are still available. Note that some pods, such as DaemonSet pods, cannot be rescheduled and will be restarted as part of the reboot.
  • Check for undrained pods on the node
   $ oc get pod -o wide -A | grep "<worker_node_name>"
  • Check for recent FailedScheduling events. These may indicate that the cluster is under-resourced and needs additional nodes.
   $ oc get events -A | grep "FailedScheduling"
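The first check above can be scripted. In this sketch the here-doc stands in for live `oc get pod -o wide -A` output, and the pod names are illustrative placeholders; field 8 of that listing is the NODE column.

```shell
# Sketch: list pods still scheduled on the drained node.
# The here-doc replaces live `oc get pod -o wide -A` output;
# $8 is the NODE column, so the header row never matches a real node name.
NODE="20230809-xxxx-worker-xxx-cbm52"
remaining=$(awk -v node="$NODE" '$8 == node {print $1 "/" $2}' <<'EOF'
NAMESPACE       NAME              READY   STATUS    RESTARTS   AGE   IP         NODE                             NOMINATED NODE   READINESS GATES
openshift-dns   dns-default-abcd  2/2     Running   0          5d    10.0.0.1   20230809-xxxx-worker-xxx-cbm52   <none>           <none>
openshift-dns   dns-default-efgh  2/2     Running   0          5d    10.0.0.2   20230809-xxxx-worker-xxx-lmtl7   <none>           <none>
EOF
)
echo "Pods still on $NODE:"
echo "$remaining"
```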
  1. Check the status of the nodes; the cordoned worker is expected to show Ready,SchedulingDisabled
   $ oc get nodes

   NAME                                    STATUS                     ROLES    AGE     VERSION
   20230809-xxxx-master-0                  Ready                      master   5d17h   v1.24.15+990d55b
   20230809-xxxx-master-1                  Ready                      master   5d17h   v1.24.15+990d55b
   20230809-xxxx-master-2                  Ready                      master   5d17h   v1.24.15+990d55b
   20230809-xxxx-worker-xxx-cbm52          Ready,SchedulingDisabled   worker   5d17h   v1.24.15+990d55b <--- Disabled
   20230809-xxxx-worker-xxx-lmtl7          Ready                      worker   5d17h   v1.24.15+990d55b
   20230809-xxxx-worker-xxx-wft2m          Ready                      worker   5d17h   v1.24.15+990d55b
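The cordoned state can also be checked with a small script. This is a sketch: the here-doc is the sample listing from this article, and on a live cluster you would pipe real `oc get nodes` output in instead.

```shell
# Sketch: verify the cordoned node reports Ready,SchedulingDisabled.
# The here-doc is the anonymized sample listing from this article.
NODE="20230809-xxxx-worker-xxx-cbm52"
status=$(awk -v node="$NODE" '$1 == node {print $2}' <<'EOF'
20230809-xxxx-worker-xxx-cbm52          Ready,SchedulingDisabled   worker   5d17h   v1.24.15+990d55b
20230809-xxxx-worker-xxx-lmtl7          Ready                      worker   5d17h   v1.24.15+990d55b
EOF
)
case "$status" in
  *SchedulingDisabled*) echo "$NODE is cordoned" ;;
  *)                    echo "$NODE is NOT cordoned" ;;
esac
```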

  1. The oc debug node/<node_name> command opens a shell prompt on the worker node. It creates a separate debug container, mounts the node's root file system at the /host folder, and allows you to inspect any files from the node.
   $ oc debug node/20230809-xxxx-worker-xxx-cbm52

   Temporary namespace openshift-debug-kck98 is created for debugging node...
   Starting pod/20230809-xxxx-worker-xxx-cbm52-debug ...
   To use host binaries, run `chroot /host`
   Pod IP: x.x.x.x
   If you don't see a command prompt, try pressing enter.
   sh-4.4# 
  1. Start a chroot shell in the /host folder
   sh-4.4# chroot /host
  1. Reboot the worker node
   sh-4.4# reboot

   Removing debug pod ...
   Temporary namespace openshift-debug-xx was removed.
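The debug, chroot, and reboot steps above can also be issued as a single non-interactive command using the `oc debug node/<node> -- <command>` form. In this sketch the oc() stub only echoes the call so it runs without a cluster; remove the stub to execute for real.

```shell
# Sketch: reboot the node non-interactively instead of via the debug shell.
# The oc() stub below only echoes the call so this is safe to run anywhere;
# remove it for real use against a cluster.
oc() { echo "would run: oc $*"; }

NODE="20230809-xxxx-worker-xxx-cbm52"
oc debug node/"$NODE" -- chroot /host systemctl reboot
```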
  1. Watch the progress and confirm the worker node has rebooted
   $ oc describe node 20230809-xxxx-worker-xxx-cbm52 | grep LastTransitionTime -A2

    Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
    ----             ------  -----------------                 ------------------                ------                       -------
    MemoryPressure   False   Tue, 15 Aug 2023 12:40:54 +0800   Tue, 15 Aug 2023 12:18:38 +0800   KubeletHasSufficientMemory

  1. Confirm the worker node is Ready after the reboot
   $ oc wait --for=condition=Ready node/20230809-xxxx-worker-xxx-cbm52

   node/20230809-xxxx-worker-xxx-cbm52 condition met
  1. Restore (uncordon) the worker node from maintenance mode
   $ oc adm uncordon 20230809-xxxx-worker-xxx-cbm52

   node/20230809-xxxx-worker-xxx-cbm52 uncordoned
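Taken together, the whole procedure can be sketched as one script. This is an illustrative outline, not a supported tool: the node name and drain options are placeholders, and the oc() stub echoes each command so the sketch runs without a cluster; remove the stub for real use.

```shell
#!/bin/sh
# Sketch: the full single-node reboot procedure from this article as a script.
set -eu
NODE="${1:-20230809-xxxx-worker-xxx-cbm52}"

# Dry-run stub: echoes every oc call instead of executing it.
# Remove this function to run against a real cluster.
oc() { echo "oc $*"; }

oc adm cordon "$NODE"
oc adm drain "$NODE" --ignore-daemonsets --delete-emptydir-data
oc debug node/"$NODE" -- chroot /host systemctl reboot
oc wait --for=condition=Ready "node/$NODE" --timeout=10m
oc adm uncordon "$NODE"
```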

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.