8.2. Moving Resources Due to Failure

When you create a resource, you can configure it to move to a new node after a defined number of failures by setting the migration-threshold option for that resource. Once the threshold has been reached, the node on which the resource failed is no longer allowed to run that resource until one of the following occurs:
  • The administrator manually resets the resource's failcount using the pcs resource failcount command.
  • The resource's failure-timeout value is reached.
The value of migration-threshold is set to INFINITY by default. INFINITY is defined internally as a very large but finite number. A value of 0 disables the migration-threshold feature.

Note

Setting a migration-threshold for a resource is not the same as configuring a resource for migration, in which the resource moves to another location without loss of state.
The following example adds a migration threshold of 10 to the resource named dummy_resource, which indicates that the resource will move to a new node after 10 failures.
# pcs resource meta dummy_resource migration-threshold=10
You can add a migration threshold to the defaults for the whole cluster with the following command.
# pcs resource defaults migration-threshold=10
To determine the resource's current failure status and limits, use the pcs resource failcount command.
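The check described above can be sketched as a pair of small shell helpers. This assumes the show and reset subcommands of pcs resource failcount as provided by the pcs releases shipped with Red Hat Enterprise Linux 6 and 7; the function names show_failcounts and reset_failcount are hypothetical wrappers for illustration, not part of pcs. Confirm the exact syntax on your system with pcs resource failcount --help.

```shell
# Hypothetical helper: print the failcount of each named resource.
show_failcounts() {
    for r in "$@"; do
        pcs resource failcount show "$r"
    done
}

# Hypothetical helper: clear the failcount of one resource so that
# nodes that reached the migration threshold may run it again.
reset_failcount() {
    pcs resource failcount reset "$1"
}
```

For example, show_failcounts dummy_resource reports the current count, and reset_failcount dummy_resource clears it once the underlying problem has been fixed.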
There are two exceptions to the migration threshold concept; they occur when a resource either fails to start or fails to stop. If the cluster property start-failure-is-fatal is set to true (which is the default), start failures cause the failcount to be set to INFINITY and thus always cause the resource to move immediately. For information on the start-failure-is-fatal option, see Table 12.1, “Cluster Properties”.
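The threshold rule and the start-failure exception can be modeled with a few lines of shell. This is only an illustrative sketch, not Pacemaker's implementation: it assumes that INFINITY is the numeric value 1000000 (the value Pacemaker documents for its scores), it ignores failure-timeout, and the function names to_score and node_banned are hypothetical.

```shell
# Assumption: Pacemaker's INFINITY is the finite number 1000000.
INFINITY=1000000

# to_score VALUE: map the literal string INFINITY to its numeric value.
to_score() {
    if [ "$1" = "INFINITY" ]; then
        echo "$INFINITY"
    else
        echo "$1"
    fi
}

# node_banned FAILCOUNT THRESHOLD
# Exits 0 when the node may no longer run the resource, 1 otherwise.
node_banned() {
    failcount=$(to_score "$1")
    threshold=$(to_score "$2")
    # A threshold of 0 disables the feature; otherwise the node is
    # banned once the failcount reaches the threshold.
    [ "$threshold" -ne 0 ] && [ "$failcount" -ge "$threshold" ]
}

# Usage: the second case mirrors a start failure with default settings.
node_banned 10 10 && echo "threshold reached: resource moves"
node_banned INFINITY INFINITY && echo "start failure: resource moves"
node_banned 5 0 || echo "threshold disabled: resource stays"
```

Under these assumptions, a start failure with the default threshold bans the node immediately because both values equal INFINITY, which matches the behavior described above.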
Stop failures are handled differently and are more serious. If a resource fails to stop and STONITH is enabled, the cluster fences the node so that it can start the resource elsewhere. If STONITH is not enabled, the cluster has no way to continue: it will not try to start the resource elsewhere, but it will try to stop the resource again after the failure timeout.