Chapter 31. Performing cluster maintenance

To perform maintenance on the nodes of your cluster, you may need to stop or move the resources and services running on that cluster, or you may need to stop the cluster software while leaving the services untouched. Pacemaker provides a variety of methods for performing system maintenance.

  • If you need to stop a node in a cluster while continuing to provide the services running on that cluster on another node, you can put the cluster node in standby mode. A node that is in standby mode is no longer able to host resources. Any resource currently active on the node will be moved to another node, or stopped if no other node is eligible to run the resource. For information about standby mode, see Putting a node into standby mode.
  • If you need to move an individual resource off the node on which it is currently running without stopping that resource, you can use the pcs resource move command to move the resource to a different node.

    When you execute the pcs resource move command, this adds a constraint to the resource to prevent it from running on the node on which it is currently running. When you are ready to move the resource back, you can execute the pcs resource clear or the pcs constraint delete command to remove the constraint. This does not necessarily move the resources back to the original node, however, since where the resources can run at that point depends on how you have configured your resources initially. You can relocate a resource to its preferred node with the pcs resource relocate run command.

  • If you need to stop a running resource entirely and prevent the cluster from starting it again, you can use the pcs resource disable command. For information on the pcs resource disable command, see Disabling, enabling, and banning cluster resources.
  • If you want to prevent Pacemaker from taking any action for a resource (for example, if you want to disable recovery actions while performing maintenance on the resource, or if you need to reload the /etc/sysconfig/pacemaker settings), use the pcs resource unmanage command, as described in Setting a resource to unmanaged mode. Pacemaker Remote connection resources should never be unmanaged.
  • If you need to put the cluster in a state where no services will be started or stopped, you can set the maintenance-mode cluster property. Putting the cluster into maintenance mode automatically unmanages all resources. For information about putting the cluster in maintenance mode, see Putting a cluster in maintenance mode.
  • If you need to update the packages that make up the RHEL High Availability and Resilient Storage Add-Ons, you can update the packages on one node at a time or on the entire cluster as a whole, as summarized in Updating a RHEL high availability cluster.
  • If you need to perform maintenance on a Pacemaker remote node, you can remove that node from the cluster by disabling the remote node resource, as described in Upgrading remote nodes and guest nodes.
  • If you need to migrate a VM in a RHEL cluster, you will first need to stop the cluster services on the VM to remove the node from the cluster and then start the cluster back up after performing the migration, as described in Migrating VMs in a RHEL cluster.

31.1. Putting a node into standby mode

When a cluster node is in standby mode, the node is no longer able to host resources. Any resources currently active on the node will be moved to another node.

The following command puts the specified node into standby mode. If you specify the --all option, this command puts all nodes into standby mode.

You can use this command when updating a resource’s packages. You can also use this command when testing a configuration, to simulate recovery without actually shutting down a node.

pcs node standby node | --all

The following command removes the specified node from standby mode. After running this command, the specified node is then able to host resources. If you specify the --all option, this command removes all nodes from standby mode.

pcs node unstandby node | --all

Note that when you execute the pcs node standby command, this prevents resources from running on the indicated node. When you execute the pcs node unstandby command, this allows resources to run on the indicated node. This does not necessarily move the resources back to the indicated node; where the resources can run at that point depends on how you have configured your resources initially.
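
For example, the following commands take the node z1.example.com (a hypothetical node name) out of service for maintenance and later return it to service:

pcs node standby z1.example.com
pcs node unstandby z1.example.com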

31.2. Manually moving cluster resources

You can override the cluster and force resources to move from their current location. There are two occasions when you would want to do this:

  • When a node is under maintenance, and you need to move all resources running on that node to a different node
  • When individually specified resources need to be moved

To move all resources running on a node to a different node, you put the node in standby mode.

You can move individually specified resources in either of the following ways.

  • You can use the pcs resource move command to move a resource off a node on which it is currently running.
  • You can use the pcs resource relocate run command to move a resource to its preferred node, as determined by current cluster status, constraints, location of resources and other settings.

31.2.1. Moving a resource from its current node

To move a resource off the node on which it is currently running, use the following command, specifying the resource_id of the resource you want to move. Specify the destination_node if you want to indicate the node on which the resource should run.

pcs resource move resource_id [destination_node] [--master] [lifetime=lifetime]
Note

When you run the pcs resource move command, this adds a constraint to the resource to prevent it from running on the node on which it is currently running. As of RHEL 8.6, you can specify the --autodelete option for this command, which will cause the location constraint that this command creates to be removed automatically once the resource has been moved. For earlier releases, you can run the pcs resource clear or the pcs constraint delete command to remove the constraint manually. Removing the constraint does not necessarily move the resources back to the original node; where the resources can run at that point depends on how you have configured your resources initially.

If you specify the --master parameter of the pcs resource move command, the constraint applies only to promoted instances of the resource.

You can optionally configure a lifetime parameter for the pcs resource move command to indicate a period of time the constraint should remain. You specify the units of a lifetime parameter according to the format defined in ISO 8601, which requires that you specify the unit as a capital letter such as Y (for years), M (for months), W (for weeks), D (for days), H (for hours), M (for minutes), and S (for seconds).

To distinguish a unit of minutes (M) from a unit of months (M), you must specify PT before indicating the value in minutes. For example, a lifetime parameter of 5M indicates an interval of five months, while a lifetime parameter of PT5M indicates an interval of five minutes.

The following command moves the resource resource1 to node example-node2 and prevents it from moving back to the node on which it was originally running for one hour and thirty minutes.

pcs resource move resource1 example-node2 lifetime=PT1H30M

The following command moves the resource resource1 to node example-node2 and prevents it from moving back to the node on which it was originally running for thirty minutes.

pcs resource move resource1 example-node2 lifetime=PT30M
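
When the maintenance that prompted the move is complete, the following command removes the constraint that the move created, so that resource1 is again free to run wherever the configuration allows. (On RHEL 8.6 and later, specifying the --autodelete option on the move command makes this step unnecessary.)

pcs resource clear resource1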

31.2.2. Moving a resource to its preferred node

After a resource has moved, whether because of a failover or because an administrator manually moved it, it will not necessarily move back to its original node even after the circumstances that caused the failover have been corrected. To relocate resources to their preferred nodes, use the following command. A preferred node is determined by the current cluster status, constraints, resource location, and other settings, and may change over time.

pcs resource relocate run [resource1] [resource2] ...

If you do not specify any resources, all resources are relocated to their preferred nodes.

This command calculates the preferred node for each resource while ignoring resource stickiness. After calculating the preferred node, it creates location constraints which will cause the resources to move to their preferred nodes. Once the resources have been moved, the constraints are deleted automatically. To remove all constraints created by the pcs resource relocate run command, you can enter the pcs resource relocate clear command. To display the current status of resources and their optimal node ignoring resource stickiness, enter the pcs resource relocate show command.
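
For example, to display the calculated preferred placement for your resources and then relocate only a hypothetical resource named resource1:

pcs resource relocate show
pcs resource relocate run resource1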

31.3. Disabling, enabling, and banning cluster resources

In addition to the pcs resource move and pcs resource relocate commands, there are a variety of other commands you can use to control the behavior of cluster resources.

Disabling a cluster resource

You can manually stop a running resource and prevent the cluster from starting it again with the following command. Depending on the rest of the configuration (constraints, options, failures, and so on), the resource may remain started. If you specify the --wait option, pcs will wait up to 'n' seconds for the resource to stop and then return 0 if the resource is stopped or 1 if the resource has not stopped. If 'n' is not specified it defaults to 60 minutes.

pcs resource disable resource_id [--wait[=n]]
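
For example, the following command stops a hypothetical resource named resource1, waiting up to 30 seconds for it to stop before returning:

pcs resource disable resource1 --wait=30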

As of RHEL 8.2, you can specify that a resource be disabled only if disabling the resource would not have an effect on other resources. Verifying this by hand can be impossible when complex resource relations are set up.

  • The pcs resource disable --simulate command shows the effects of disabling a resource while not changing the cluster configuration.
  • The pcs resource disable --safe command disables a resource only if no other resources would be affected in any way, such as being migrated from one node to another. The pcs resource safe-disable command is an alias for the pcs resource disable --safe command.
  • The pcs resource disable --safe --no-strict command disables a resource only if no other resources would be stopped or demoted.

As of RHEL 8.5, you can specify the --brief option for the pcs resource disable --safe command to print errors only. Also as of RHEL 8.5, the error report that the pcs resource disable --safe command generates if the safe disable operation fails contains the affected resource IDs. If you need to know only the resource IDs of resources that would be affected by disabling a resource, use the --brief option, which does not provide the full simulation result.
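
For example, the following commands first simulate disabling a hypothetical resource named resource1 without changing the configuration, and then (on RHEL 8.5 and later) disable it only if no other resources would be affected, printing errors only:

pcs resource disable --simulate resource1
pcs resource disable --safe --brief resource1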

Enabling a cluster resource

Use the following command to allow the cluster to start a resource. Depending on the rest of the configuration, the resource may remain stopped. If you specify the --wait option, pcs will wait up to 'n' seconds for the resource to start and then return 0 if the resource is started or 1 if the resource has not started. If 'n' is not specified it defaults to 60 minutes.

pcs resource enable resource_id [--wait[=n]]

Preventing a resource from running on a particular node

Use the following command to prevent a resource from running on a specified node, or on the current node if no node is specified.

pcs resource ban resource_id [node] [--master] [lifetime=lifetime] [--wait[=n]]

Note that when you execute the pcs resource ban command, this adds a -INFINITY location constraint to the resource to prevent it from running on the indicated node. You can execute the pcs resource clear or the pcs constraint delete command to remove the constraint. This does not necessarily move the resources back to the indicated node; where the resources can run at that point depends on how you have configured your resources initially.

If you specify the --master parameter of the pcs resource ban command, the scope of the constraint is limited to the master role and you must specify master_id rather than resource_id.

You can optionally configure a lifetime parameter for the pcs resource ban command to indicate a period of time the constraint should remain.

You can optionally configure a --wait[=n] parameter for the pcs resource ban command to indicate the number of seconds to wait for the resource to start on the destination node before returning 0 if the resource is started or 1 if the resource has not yet started. If you do not specify n, the default resource timeout will be used.
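
For example, the following commands ban a hypothetical resource named resource1 from running on the node example-node1 for one hour, and then remove the constraint before the hour elapses:

pcs resource ban resource1 example-node1 lifetime=PT1H
pcs resource clear resource1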

Forcing a resource to start on the current node

Use the debug-start parameter of the pcs resource command to force a specified resource to start on the current node, ignoring the cluster recommendations and printing the output from starting the resource. This is mainly used for debugging resources; starting resources on a cluster is (almost) always done by Pacemaker and not directly with a pcs command. If your resource is not starting, it is usually due to either a misconfiguration of the resource (which you debug in the system log), constraints that prevent the resource from starting, or the resource being disabled. You can use this command to test resource configuration, but it should not normally be used to start resources in a cluster.

The format of the debug-start command is as follows.

pcs resource debug-start resource_id

31.4. Setting a resource to unmanaged mode

When a resource is in unmanaged mode, the resource is still in the configuration but Pacemaker does not manage the resource.

The following command sets the indicated resources to unmanaged mode.

pcs resource unmanage resource1  [resource2] ...

The following command sets resources to managed mode, which is the default state.

pcs resource manage resource1  [resource2] ...

You can specify the name of a resource group with the pcs resource manage or pcs resource unmanage command. The command will act on all of the resources in the group, so that you can set all of the resources in a group to managed or unmanaged mode with a single command and then manage the contained resources individually.
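
For example, the following commands set every resource in a hypothetical resource group named apachegroup to unmanaged mode and later return the whole group to managed mode:

pcs resource unmanage apachegroup
pcs resource manage apachegroup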

31.5. Putting a cluster in maintenance mode

When a cluster is in maintenance mode, the cluster does not start or stop any services until told otherwise. When you take the cluster out of maintenance mode, the cluster does a sanity check of the current state of any services, and then stops or starts any that need it.

To put a cluster in maintenance mode, use the following command to set the maintenance-mode cluster property to true.

# pcs property set maintenance-mode=true

To remove a cluster from maintenance mode, use the following command to set the maintenance-mode cluster property to false.

# pcs property set maintenance-mode=false
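
To confirm the current setting, you can display the value of the maintenance-mode property, as in the following example.

# pcs property show maintenance-mode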

You can remove a cluster property from the configuration with the following command.

pcs property unset property

Alternately, you can remove a cluster property from a configuration by leaving the value field of the pcs property set command blank. This restores that property to its default value. For example, if you have previously set the symmetric-cluster property to false, the following command removes the value you have set from the configuration and restores the value of symmetric-cluster to true, which is its default value.

# pcs property set symmetric-cluster=

31.6. Updating a RHEL high availability cluster

Updating packages that make up the RHEL High Availability and Resilient Storage Add-Ons, either individually or as a whole, can be done in one of two general ways:

  • Rolling Updates: Remove one node at a time from service, update its software, then integrate it back into the cluster. This allows the cluster to continue providing service and managing resources while each node is updated; a minimal sketch of this approach appears at the end of this section.
  • Entire Cluster Update: Stop the entire cluster, apply updates to all nodes, then start the cluster back up.
Warning

It is critical that when performing software update procedures for Red Hat Enterprise Linux High Availability and Resilient Storage clusters, you ensure that any node that will undergo updates is not an active member of the cluster before those updates are initiated.

For a full description of each of these methods and the procedures to follow for the updates, see Recommended Practices for Applying Software Updates to a RHEL High Availability or Resilient Storage Cluster.
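
The following is a minimal sketch of a rolling update on a single node, run on the node being updated; the linked article describes the complete supported procedure, including the checks to perform before returning the node to service.

# pcs cluster stop
# yum update
# pcs cluster start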

31.7. Upgrading remote nodes and guest nodes

If the pacemaker_remote service is stopped on an active remote node or guest node, the cluster will gracefully migrate resources off the node before stopping the node. This allows you to perform software upgrades and other routine maintenance procedures without removing the node from the cluster. Once pacemaker_remote is shut down, however, the cluster will immediately try to reconnect. If pacemaker_remote is not restarted within the resource’s monitor timeout, the cluster will consider the monitor operation as failed.

If you wish to avoid monitor failures when the pacemaker_remote service is stopped on an active Pacemaker Remote node, you can use the following procedure to take the node out of the cluster before performing any system administration that might stop pacemaker_remote.

Procedure

  1. Stop the node’s connection resource with the pcs resource disable resourcename command, which will move all services off the node. The connection resource would be the ocf:pacemaker:remote resource for a remote node or, commonly, the ocf:heartbeat:VirtualDomain resource for a guest node. For guest nodes, this command will also stop the VM, so the VM must be started outside the cluster (for example, using virsh) to perform any maintenance.

    pcs resource disable resourcename
  2. Perform the required maintenance.
  3. When ready to return the node to the cluster, re-enable the resource with the pcs resource enable command.

    pcs resource enable resourcename

31.8. Migrating VMs in a RHEL cluster

Red Hat does not support live migration of active cluster nodes across hypervisors or hosts, as noted in Support Policies for RHEL High Availability Clusters - General Conditions with Virtualized Cluster Members. If you need to perform a live migration, you will first need to stop the cluster services on the VM to remove the node from the cluster, and then start the cluster back up after performing the migration.

The following steps outline the procedure for removing a VM from a cluster, migrating the VM, and restoring the VM to the cluster.

This procedure applies to VMs that are used as full cluster nodes, not to VMs managed as cluster resources (including VMs used as guest nodes), which can be live-migrated without special precautions. For general information about the full procedure required for updating the packages that make up the RHEL High Availability and Resilient Storage Add-Ons, either individually or as a whole, see Recommended Practices for Applying Software Updates to a RHEL High Availability or Resilient Storage Cluster.

Note

Before performing this procedure, consider the effect on cluster quorum of removing a cluster node. For example, if you have a three-node cluster and you remove one node, your cluster cannot withstand any node failure: with one of the three nodes already out of the cluster, losing a second node costs the cluster quorum.

Procedure

  1. If any preparations need to be made before stopping or moving the resources or software running on the VM you are migrating, perform those steps.
  2. Run the following command on the VM to stop the cluster software.

    # pcs cluster stop
  3. Perform the live migration of the VM.
  4. Start cluster services on the VM.

    # pcs cluster start

31.9. Identifying clusters by UUID

As of Red Hat Enterprise Linux 8.7, when you create a cluster it has an associated UUID. Since a cluster name is not a unique cluster identifier, a third-party tool such as a configuration management database that manages multiple clusters with the same name can uniquely identify a cluster by means of its UUID. You can display the current cluster UUID with the pcs cluster config [show] command, which includes the cluster UUID in its output.

To add a UUID to an existing cluster, run the following command.

# pcs cluster config uuid generate

To regenerate a UUID for a cluster with an existing UUID, run the following command.

# pcs cluster config uuid generate --force
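
To check whether a cluster already has a UUID, you can filter the cluster configuration output as in the following example; the UUID value shown here is only a placeholder.

# pcs cluster config show | grep -i uuid
Cluster UUID: 6bb393a321e14b40bd9d83b70a7b1e4f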