Chapter 29. Performing cluster maintenance

In order to perform maintenance on the nodes of your cluster, you may need to stop or move the resources and services running on that cluster. Or you may need to stop the cluster software while leaving the services untouched. Pacemaker provides a variety of methods for performing system maintenance.

  • If you need to stop a node in a cluster while continuing to provide the services running on that cluster on another node, you can put the cluster node in standby mode. A node that is in standby mode is no longer able to host resources. Any resource currently active on the node will be moved to another node, or stopped if no other node is eligible to run the resource. For information on standby mode, see Putting a node into standby mode.
  • If you need to move an individual resource off the node on which it is currently running without stopping that resource, you can use the pcs resource move command to move the resource to a different node. For information on the pcs resource move command, see Manually moving cluster resources.

    When you execute the pcs resource move command, this adds a constraint to the resource to prevent it from running on the node on which it is currently running. When you are ready to move the resource back, you can execute the pcs resource clear or the pcs constraint delete command to remove the constraint. This does not necessarily move the resources back to the original node, however, since where the resources can run at that point depends on how you have configured your resources initially. You can relocate a resource to its preferred node with the pcs resource relocate run command, as described in Moving a resource to its preferred node.

  • If you need to stop a running resource entirely and prevent the cluster from starting it again, you can use the pcs resource disable command. For information on the pcs resource disable command, see Enabling, disabling, and banning cluster resources.
  • If you want to prevent Pacemaker from taking any action for a resource (for example, if you want to disable recovery actions while performing maintenance on the resource, or if you need to reload the /etc/sysconfig/pacemaker settings), use the pcs resource unmanage command, as described in Setting a resource to unmanaged mode. Pacemaker Remote connection resources should never be unmanaged.
  • If you need to put the cluster in a state where no services will be started or stopped, you can set the maintenance-mode cluster property. Putting the cluster into maintenance mode automatically unmanages all resources. For information on putting the cluster in maintenance mode, see Putting a cluster in maintenance mode.
  • If you need to update the packages that make up the RHEL High Availability and Resilient Storage Add-Ons, you can update the packages on one node at a time or on the entire cluster as a whole, as summarized in Updating a Red Hat Enterprise Linux high availability cluster.
  • If you need to perform maintenance on a Pacemaker remote node, you can remove that node from the cluster by disabling the remote node resource, as described in Upgrading remote nodes and guest nodes.

29.1. Putting a node into standby mode

When a cluster node is in standby mode, the node is no longer able to host resources. Any resources currently active on the node will be moved to another node.

The following command puts the specified node into standby mode. If you specify the --all, this command puts all nodes into standby mode.

You can use this command when updating a resource’s packages. You can also use this command when testing a configuration, to simulate recovery without actually shutting down a node.

pcs node standby node | --all

The following command removes the specified node from standby mode. After running this command, the specified node is then able to host resources. If you specify the --all, this command removes all nodes from standby mode.

pcs node unstandby node | --all

Note that when you execute the pcs node standby command, this prevents resources from running on the indicated node. When you execute the pcs node unstandby command, this allows resources to run on the indicated node. This does not necessarily move the resources back to the indicated node; where the resources can run at that point depends on how you have configured your resources initially.

29.2. Manually moving cluster resources

You can override the cluster and force resources to move from their current location. There are two occasions when you would want to do this:

  • When a node is under maintenance, and you need to move all resources running on that node to a different node
  • When individually specified resources needs to be moved

To move all resources running on a node to a different node, you put the node in standby mode.

You can move individually specified resources in either of the following ways.

  • You can use the pcs resource move command to move a resource off a node on which it is currently running.
  • You can use the pcs resource relocate run command to move a resource to its preferred node, as determined by current cluster status, constraints, location of resources and other settings.

29.2.1. Moving a resource from its current node

To move a resource off the node on which it is currently running, use the following command, specifying the resource_id of the resource as defined. Specify the destination_node if you want to indicate on which node to run the resource that you are moving.

pcs resource move resource_id [destination_node] [--master] [lifetime=lifetime]
Note

When you execute the pcs resource move command, this adds a constraint to the resource to prevent it from running on the node on which it is currently running. You can execute the pcs resource clear or the pcs constraint delete command to remove the constraint. This does not necessarily move the resources back to the original node; where the resources can run at that point depends on how you have configured your resources initially.

If you specify the --master parameter of the pcs resource move command, the scope of the constraint is limited to the master role and you must specify master_id rather than resource_id.

You can optionally configure a lifetime parameter for the pcs resource move command to indicate a period of time the constraint should remain. You specify the units of a lifetime parameter according to the format defined in ISO 8601, which requires that you specify the unit as a capital letter such as Y (for years), M (for months), W (for weeks), D (for days), H (for hours), M (for minutes), and S (for seconds).

To distinguish a unit of minutes(M) from a unit of months(M), you must specify PT before indicating the value in minutes. For example, a lifetime parameter of 5M indicates an interval of five months, while a lifetime parameter of PT5M indicates an interval of five minutes.

The lifetime parameter is checked at intervals defined by the cluster-recheck-interval cluster property. By default this value is 15 minutes. If your configuration requires that you check this parameter more frequently, you can reset this value with the following command.

pcs property set cluster-recheck-interval=value

You can optionally configure a --wait[=n] parameter for the pcs resource move command to indicate the number of seconds to wait for the resource to start on the destination node before returning 0 if the resource is started or 1 if the resource has not yet started. If you do not specify n, the default resource timeout will be used.

The following command moves the resource resource1 to node example-node2 and prevents it from moving back to the node on which it was originally running for one hour and thirty minutes.

pcs resource move resource1 example-node2 lifetime=PT1H30M

The following command moves the resource resource1 to node example-node2 and prevents it from moving back to the node on which it was originally running for thirty minutes.

pcs resource move resource1 example-node2 lifetime=PT30M

29.2.2. Moving a resource to its preferred node

After a resource has moved, either due to a failover or to an administrator manually moving the node, it will not necessarily move back to its original node even after the circumstances that caused the failover have been corrected. To relocate resources to their preferred node, use the following command. A preferred node is determined by the current cluster status, constraints, resource location, and other settings and may change over time.

pcs resource relocate run [resource1] [resource2] ...

If you do not specify any resources, all resource are relocated to their preferred nodes.

This command calculates the preferred node for each resource while ignoring resource stickiness. After calculating the preferred node, it creates location constraints which will cause the resources to move to their preferred nodes. Once the resources have been moved, the constraints are deleted automatically. To remove all constraints created by the pcs resource relocate run command, you can enter the pcs resource relocate clear command. To display the current status of resources and their optimal node ignoring resource stickiness, enter the pcs resource relocate show command.

29.3. Disabling, enabling, and banning cluster resources

In addition to the pcs resource move and pcs resource relocate commands, there are a variety of other commands you can use to control the behavior of cluster resources.

Disabling a cluster resource

You can manually stop a running resource and prevent the cluster from starting it again with the following command. Depending on the rest of the configuration (constraints, options, failures, and so on), the resource may remain started. If you specify the --wait option, pcs will wait up to 'n' seconds for the resource to stop and then return 0 if the resource is stopped or 1 if the resource has not stopped. If 'n' is not specified it defaults to 60 minutes.

pcs resource disable resource_id [--wait[=n]]

As of Red Hat Enterprise Linux 8.2, you can specify that a resource be disabled only if disabling the resource would not have an effect on other resources. Ensuring that this would be the case can be impossible to do by hand when complex resource relations are set up.

  • The pcs resource disable --simulate command shows the effects of disabling a resource while not changing the cluster configuration.
  • The pcs resource disable --safe command disables a resource only if no other resources would be affected in any way, such as being migrated from one node to another. The pcs resource safe-disable command is an alias for the pcs resource disable --safe command.
  • The pcs resource disable --safe --no-strict command disables a resource only if no other resources would be stopped or demoted

Enabling a cluster resource

Use the following command to allow the cluster to start a resource. Depending on the rest of the configuration, the resource may remain stopped. If you specify the --wait option, pcs will wait up to 'n' seconds for the resource to start and then return 0 if the resource is started or 1 if the resource has not started. If 'n' is not specified it defaults to 60 minutes.

pcs resource enable resource_id [--wait[=n]]

Preventing a resource from running on a particular node

Use the following command to prevent a resource from running on a specified node, or on the current node if no node is specified.

pcs resource ban resource_id [node] [--master] [lifetime=lifetime] [--wait[=n]]

Note that when you execute the pcs resource ban command, this adds a -INFINITY location constraint to the resource to prevent it from running on the indicated node. You can execute the pcs resource clear or the pcs constraint delete command to remove the constraint. This does not necessarily move the resources back to the indicated node; where the resources can run at that point depends on how you have configured your resources initially.

If you specify the --master parameter of the pcs resource ban command, the scope of the constraint is limited to the master role and you must specify master_id rather than resource_id.

You can optionally configure a lifetime parameter for the pcs resource ban command to indicate a period of time the constraint should remain.

You can optionally configure a --wait[=n] parameter for the pcs resource ban command to indicate the number of seconds to wait for the resource to start on the destination node before returning 0 if the resource is started or 1 if the resource has not yet started. If you do not specify n, the default resource timeout will be used.

Forcing a resource to start on the current node

Use the debug-start parameter of the pcs resource command to force a specified resource to start on the current node, ignoring the cluster recommendations and printing the output from starting the resource. This is mainly used for debugging resources; starting resources on a cluster is (almost) always done by Pacemaker and not directly with a pcs command. If your resource is not starting, it is usually due to either a misconfiguration of the resource (which you debug in the system log), constraints that prevent the resource from starting, or the resource being disabled. You can use this command to test resource configuration, but it should not normally be used to start resources in a cluster.

The format of the debug-start command is as follows.

pcs resource debug-start resource_id

29.4. Setting a resource to unmanaged mode

When a resource is in unmanaged mode, the resource is still in the configuration but Pacemaker does not manage the resource.

The following command sets the indicated resources to unmanaged mode.

pcs resource unmanage resource1  [resource2] ...

The following command sets resources to managed mode, which is the default state.

pcs resource manage resource1  [resource2] ...

You can specify the name of a resource group with the pcs resource manage or pcs resource unmanage command. The command will act on all of the resources in the group, so that you can set all of the resources in a group to managed or unmanaged mode with a single command and then manage the contained resources individually.

29.5. Putting a cluster in maintenance mode

When a cluster is in maintenance mode, the cluster does not start or stop any services until told otherwise. When maintenance mode is completed, the cluster does a sanity check of the current state of any services, and then stops or starts any that need it.

To put a cluster in maintenance mode, use the following command to set the maintenance-mode cluster property to true.

# pcs property set maintenance-mode=true

To remove a cluster from maintenance mode, use the following command to set the maintenance-mode cluster property to false.

# pcs property set maintenance-mode=false

You can remove a cluster property from the configuration with the following command.

pcs property unset property

Alternately, you can remove a cluster property from a configuration by leaving the value field of the pcs property set command blank. This restores that property to its default value. For example, if you have previously set the symmetric-cluster property to false, the following command removes the value you have set from the configuration and restores the value of symmetric-cluster to true, which is its default value.

# pcs property set symmetric-cluster=

29.6. Updating a RHEL high availability cluster

Updating packages that make up the RHEL High Availability and Resilient Storage Add-Ons, either individually or as a whole, can be done in one of two general ways:

  • Rolling Updates: Remove one node at a time from service, update its software, then integrate it back into the cluster. This allows the cluster to continue providing service and managing resources while each node is updated.
  • Entire Cluster Update: Stop the entire cluster, apply updates to all nodes, then start the cluster back up.
Warning

It is critical that when performing software update procedures for Red Hat Enterprise LInux High Availability and Resilient Storage clusters, you ensure that any node that will undergo updates is not an active member of the cluster before those updates are initiated.

For a full description of each of these methods and the procedures to follow for the updates, see Recommended Practices for Applying Software Updates to a RHEL High Availability or Resilient Storage Cluster.

29.7. Upgrading remote nodes and guest nodes

If the pacemaker_remote service is stopped on an active remote node or guest node, the cluster will gracefully migrate resources off the node before stopping the node. This allows you to perform software upgrades and other routine maintenance procedures without removing the node from the cluster. Once pacemaker_remote is shut down, however, the cluster will immediately try to reconnect. If pacemaker_remote is not restarted within the resource’s monitor timeout, the cluster will consider the monitor operation as failed.

If you wish to avoid monitor failures when the pacemaker_remote service is stopped on an active Pacemaker Remote node, you can use the following procedure to take the node out of the cluster before performing any system administration that might stop pacemaker_remote

  1. Stop the node’s connection resource with the pcs resource disable resourcename, which will move all services off the node. For guest nodes, this will also stop the VM, so the VM must be started outside the cluster (for example, using virsh) to perform any maintenance.
  2. Perform the required maintenance.
  3. When ready to return the node to the cluster, re-enable the resource with the pcs resource enable.

29.8. Migrating VMs in a RHEL cluster

Information on support policies for RHEL high availability clusters with virtualized cluster members can be found in Support Policies for RHEL High Availability Clusters - General Conditions with Virtualized Cluster Members. As noted, Red Hat does not support live migration of active cluster nodes across hypervisors or hosts. If you need to perform a live migration, you will first need to stop the cluster services on the VM to remove the node from the cluster, and then start the cluster back up after performing the migration. The following steps outline the procedure for removing a VM from a cluster, migrating the VM, and restoring the VM to the cluster.

This procedure applies to VMs that are used as full cluster nodes, not to VMs managed as cluster resources (including VMs used as guest nodes) which can be live-migrated without special precautions. For general information on the fuller procedure required for updating packages that make up the RHEL High Availability and Resilient Storage Add-Ons, either individually or as a whole, see Recommended Practices for Applying Software Updates to a RHEL High Availability or Resilient Storage Cluster.

Note

Before performing this procedure, consider the effect on cluster quorum of removing a cluster node. For example, if you have a three-node cluster and you remove one node, your cluster can withstand only one more node failure. If one node of a three-node cluster is already down, removing a second node will lose quorum.

  1. If any preparations need to be made before stopping or moving the resources or software running on the VM to migrate, perform those steps.
  2. Run the following command on the VM to stop the cluster software on the VM.

    # pcs cluster stop
  3. Perform the live migration of the VM.
  4. Start cluster services on the VM.

    # pcs cluster start