A resource in a negatively colocated group can remain stopped if it fails to start or hits its migration threshold in a Pacemaker cluster

Solution In Progress

Issue

Assume the following resource configuration with two resource groups.

[root@fastvm-rhel-8-0-23 pacemaker]# pcs config | egrep '(Group|Resource|Meta Attrs):'
 Group: dummya
  Resource: dummya_1 (class=ocf provider=heartbeat type=Dummy)
  Resource: dummya_2 (class=ocf provider=heartbeat type=Dummy)
 Group: dummyb
  Resource: dummyb_1 (class=ocf provider=heartbeat type=Dummy)
  Resource: dummyb_2 (class=ocf provider=heartbeat type=Dummy)
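
For reference, a pair of groups like this can be created with pcs commands along the following lines (a sketch only; the resource names match the output above):

# pcs resource create dummya_1 ocf:heartbeat:Dummy --group dummya
# pcs resource create dummya_2 ocf:heartbeat:Dummy --group dummya
# pcs resource create dummyb_1 ocf:heartbeat:Dummy --group dummyb
# pcs resource create dummyb_2 ocf:heartbeat:Dummy --group dummyb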

Assume that one group (dummyb) is colocated with another (dummya) with a negative, non-INFINITY colocation score.

[root@fastvm-rhel-8-0-23 pacemaker]# pcs constraint colocation
Colocation Constraints:
  dummyb with dummya (score:-5000)
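
Such a constraint can be created with a command along these lines (the -5000 score is simply the value from the output above; any negative, non-INFINITY score behaves the same way):

# pcs constraint colocation add dummyb with dummya -5000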

If resource dummyb_2 fails to start on node2, the start-failure-is-fatal=true cluster property (the default) prevents it from being started on that node again. However, the dummyb group does not fail over to the other node and allow resource dummyb_2 to start there as expected; instead, dummyb_2 remains stopped.

Messages like the following appear in the logs for the stopped resource:

Sep  5 20:20:12 fastvm-rhel-8-0-24 pacemaker-schedulerd[342774]: warning: Forcing dummyb_2 away from node2 after 1000000 failures (max=1000000)
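
When this happens, the failure count that is forcing the resource away can be inspected and, once the underlying problem is fixed, cleared with standard pcs commands, for example:

# pcs resource failcount show dummyb_2
# pcs resource cleanup dummyb_2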

Note: This can also happen if the resource fails its monitor operation enough times to reach its migration-threshold. The migration-threshold meta attribute defaults to INFINITY (defined as 1000000), but it can be configured explicitly to a lower value.
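
If a lower threshold is preferred, it can be set explicitly on the resource; for example, the following sketch sets an illustrative value of 3 on dummyb_2:

# pcs resource meta dummyb_2 migration-threshold=3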

Environment

  • Red Hat Enterprise Linux 7 (with the High Availability Add-on)
  • Red Hat Enterprise Linux 8 (with the High Availability Add-on)
