A resource in a negatively colocated group can remain stopped if it fails to start or hits its migration threshold in a Pacemaker cluster
Issue
Assume the following resource configuration with two resource groups.
[root@fastvm-rhel-8-0-23 pacemaker]# pcs config | egrep '(Group|Resource|Meta Attrs):'
Group: dummya
Resource: dummya_1 (class=ocf provider=heartbeat type=Dummy)
Resource: dummya_2 (class=ocf provider=heartbeat type=Dummy)
Group: dummyb
Resource: dummyb_1 (class=ocf provider=heartbeat type=Dummy)
Resource: dummyb_2 (class=ocf provider=heartbeat type=Dummy)
Assume that one group (dummyb) is colocated with another (dummya) with a negative, non-INFINITY colocation score.
[root@fastvm-rhel-8-0-23 pacemaker]# pcs constraint colocation
Colocation Constraints:
dummyb with dummya (score:-5000)
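For reference, a configuration like the one above could be built with commands along the following lines. This is only a sketch, not necessarily how the original cluster was set up: it assumes the ocf:heartbeat:Dummy agent is installed, that the commands run as root on a cluster node, and that the installed pcs version accepts the score=<value> form for colocation constraints.

```shell
# Sketch only: requires a running Pacemaker cluster and root privileges.
# Resource and group names are taken from the configuration shown above.
pcs resource create dummya_1 ocf:heartbeat:Dummy
pcs resource create dummya_2 ocf:heartbeat:Dummy
pcs resource create dummyb_1 ocf:heartbeat:Dummy
pcs resource create dummyb_2 ocf:heartbeat:Dummy

# Put the resources into the two groups.
pcs resource group add dummya dummya_1 dummya_2
pcs resource group add dummyb dummyb_1 dummyb_2

# Colocate dummyb with dummya using a negative, non-INFINITY score,
# so the scheduler prefers to keep the groups on different nodes.
pcs constraint colocation add dummyb with dummya score=-5000
```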
If resource dummyb_2 fails to start, the start-failure-is-fatal=true cluster property (the default) sets its fail count on that node to INFINITY, which prevents it from running on that node again. The expected behavior is for the dummyb group to fail over so that dummyb_2 can start on the other node. Instead, the group does not fail over, and dummyb_2 remains stopped.
The logs contain messages like the following for the stopped resource:
Sep 5 20:20:12 fastvm-rhel-8-0-24 pacemaker-schedulerd[342774]: warning: Forcing dummyb_2 away from node2 after 1000000 failures (max=1000000)
Note: This can also happen if the resource fails its monitor operation enough times to reach its migration-threshold. The migration-threshold meta attribute defaults to INFINITY (defined as 1000000), but it can be configured explicitly to a lower value.
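To check whether a cluster is hitting this condition, it can help to pull the resource, node, and failure counts out of the scheduler warning shown above. The following is a minimal sketch: the function name forced_away_summary is hypothetical, and the pattern assumes the message format matches the sample log line exactly.

```shell
#!/bin/sh
# Hypothetical helper: read log lines on stdin and print a one-line summary
# for each pacemaker-schedulerd "Forcing <rsc> away from <node>" warning.
# Lines that do not match the pattern are silently skipped.
forced_away_summary() {
    sed -n 's/.*Forcing \([^ ]*\) away from \([^ ]*\) after \([0-9]*\) failures (max=\([0-9]*\)).*/resource=\1 node=\2 failures=\3 max=\4/p'
}

# Demonstration using the sample message from this article.
printf '%s\n' 'Sep 5 20:20:12 fastvm-rhel-8-0-24 pacemaker-schedulerd[342774]: warning: Forcing dummyb_2 away from node2 after 1000000 failures (max=1000000)' \
    | forced_away_summary
```

On a cluster node, the same function could be fed from the system log, for example with grep pacemaker-schedulerd /var/log/messages | forced_away_summary. The log path is an assumption and depends on how logging is configured. A failures value equal to max indicates that the resource has reached its migration threshold on that node.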
Environment
- Red Hat Enterprise Linux 7, 8, and 9 (with the High Availability Add-On)