Appendix C. Updating Software Packages on a Running Cluster

A primary responsibility of a High Availability or Resilient Storage cluster is to provide continuous service for applications or resources, so it is especially important that updates be applied in a systematic and consistent fashion to avoid disrupting the availability of those critical services. This appendix outlines the procedure for a rolling cluster update.

Warning

When updating software on Red Hat Enterprise Linux High Availability and Resilient Storage clusters, it is critical to ensure that any node that will undergo updates is not an active member of the cluster before those updates are initiated. Swapping out the software that the cluster stack relies on while it is in use can lead to various problems and unexpected behaviors, including issues that can cause a complete outage of the cluster and the services it is managing.
Performing a rolling update involves the following risks and considerations:
  • When performing a rolling update, the presence of different versions of the High Availability and Resilient Storage packages within the same cluster introduces a risk of unexpected behavior. The only way to completely eliminate this risk is to update the entire cluster at once: stop the cluster software on all nodes, update those nodes by following this procedure, then start the cluster software again.
  • New software versions always come with the potential for unexpected behavior, changes in functionality that may require advance preparation, or, in rare cases, bugs that could impact the operation of the product. Red Hat strongly recommends having a test, development, or staging cluster configured identically to any production clusters, and rolling out any updates to that cluster first for thorough testing prior to the roll-out in production.
  • Performing a rolling update necessarily means reducing the overall capacity and redundancy within the cluster. The size of the cluster dictates whether the absence of a single node poses a significant risk, with larger clusters being able to absorb more node failures before reaching the critical limit, and with smaller clusters being less capable or not capable at all of withstanding the failure of another node while one is missing. It is important that the potential for failure of additional nodes during the update procedure be considered and accounted for. If at all possible, taking a complete outage and updating the cluster entirely may be the preferred option so as to not leave the cluster operating in a state where additional failures could lead to an unexpected outage.
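The capacity consideration above can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the documented procedure. It assumes that every node carries one vote and that the cluster needs a simple majority of votes to keep running; the function name is hypothetical. Two-node clusters usually rely on special quorum settings and are not modelled here.

```shell
#!/bin/sh
# Hypothetical helper: how many further node failures a cluster can absorb
# while one node is out of the cluster for its update.
# Assumption: one vote per node, majority needed = floor(n/2) + 1.
additional_failures_tolerated() {
    total=$1
    majority=$(( total / 2 + 1 ))
    remaining=$(( total - 1 ))          # one node is down for updating
    spare=$(( remaining - majority ))
    if [ "$spare" -lt 0 ]; then
        spare=0                         # no headroom at all
    fi
    echo "$spare"
}
```

Under these assumptions, a three-node cluster has no headroom while one node is being updated, and a five-node cluster can absorb one additional failure, which is the kind of check worth making before choosing between a rolling update and a complete outage.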
Perform the following steps to update the base Red Hat Enterprise Linux packages, High Availability Add-On packages, and Resilient Storage Add-On packages on each node in a rolling fashion.
  1. Choose a single node where the software will be updated. If any preparations need to be made before stopping or moving the resources or software running on that node, carry out those steps now.
  2. Move any managed resources off this node as needed. If there are specific requirements or preferences for where resources should be relocated, consider creating new location constraints to place the resources on the correct node. Resource locations can be chosen strategically so that the rolling update requires the fewest moves overall, rather than moving resources in preparation for every single node update. If it is acceptable for the cluster to manage resource placement automatically, the next step takes care of this.
  3. Place the chosen node in standby mode to ensure it is not considered in service, and to cause any remaining resources to be relocated elsewhere or stopped.
    # pcs cluster standby node1.example.com
  4. Stop the cluster software on the chosen node.
    # pcs cluster stop node1.example.com
  5. Perform any necessary software updates on the chosen node.
  6. If any software was updated that necessitates a reboot, prepare to perform that reboot. It is recommended that cluster software be disabled from starting on boot so that the host can be checked to ensure it is fully functional on its new software versions before bringing it into the cluster. The following command disables the cluster from starting on boot.
    # pcs cluster disable node1.example.com
    Perform the reboot when ready, and when the boot is complete, ensure that the host is fully functional and is using the correct software in any relevant areas (such as having booted into the latest kernel). If anything does not seem correct, then do not proceed until the situation is resolved.
    Once everything is set up correctly, re-enable the cluster software on this chosen node if it was previously enabled.
    # pcs cluster enable node1.example.com
  7. Start cluster services on the updated node so that the node will rejoin the cluster.
    # pcs cluster start node1.example.com
    Check the output of the pcs status command to confirm that everything appears as it should. Once the node is functioning properly, reactivate it for service by taking it out of standby mode.
    # pcs cluster unstandby node1.example.com
  8. If any temporary location constraints were created to move managed resources off the node, adjust or remove the constraints to allow resources to return to their normally preferred locations.
  9. Perform this entire procedure on each cluster node in turn.
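The pcs commands from steps 3, 4, and 7 can be grouped into small shell helpers so that they are run in the same order on every node. This is a sketch only: the function names and the DRY_RUN variable are hypothetical and introduced for illustration, and the script uses only the pcs commands shown in the steps above. By default it prints the commands instead of running them. The software update itself, any reboot, and the checks of the pcs status output remain manual steps.

```shell
#!/bin/sh
# Hypothetical helpers wrapping the pcs commands from the procedure above.
# DRY_RUN=1 (the default in this sketch) only prints each command;
# set DRY_RUN=0 and run as root to execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Steps 3 and 4: put the node in standby, then stop the cluster software.
take_node_out_of_service() {
    run pcs cluster standby "$1" &&
    run pcs cluster stop "$1"
}

# Step 7, first half: start cluster services and show the cluster status
# so that it can be inspected before the node is reactivated.
rejoin_cluster() {
    run pcs cluster start "$1" &&
    run pcs status
}

# Step 7, second half: take the node out of standby mode.
return_node_to_service() {
    run pcs cluster unstandby "$1"
}
```

For example, calling take_node_out_of_service node1.example.com before the update, rejoin_cluster node1.example.com after it, and return_node_to_service node1.example.com once the status output looks correct follows the order of the steps above. Keeping the last two helpers separate leaves room for the manual status check between them.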