pengine reports resource errors or failures every 15 minutes in a RHEL 6 or 7 High Availability cluster with pacemaker
Issue
- The cluster log repeatedly shows "Processing failed op monitor for my_script_res on node1.example.com: not running (7)". The messages began after we manually stopped the application process with its start/stop script or ran the "pcs resource disable" command.
- My cluster logs that it is forcing a resource away from a node every 15 minutes:
  Jul 28 16:00:59 node1 pengine[5878]: warning: common_apply_stickiness: Forcing myResource-custom-start away from node2-priv after 1000000 failures (max=1000000)
- pengine reports "failed op monitor" warnings for the same resource every 15 minutes, which causes our monitoring software to fire alerts.
- Is it normal for pengine to repeatedly report errors saying a resource is not running after I've disabled it with pcs?
- How do I reset a resource's fail count without affecting the status of any other resources in a group?
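Because these messages trip monitoring alerts, it can help to recognize them programmatically. The following is a minimal sketch, assuming the two message formats look exactly like the samples quoted in this section. The function name and regular expressions are illustrative only and are not part of pacemaker or pcs.

```python
import re

# Patterns modelled on the two pengine messages quoted in the Issue section.
# Assumption: other pacemaker versions may word these messages differently.
FAILED_OP = re.compile(
    r"Processing failed op (?P<op>\S+) for (?P<resource>\S+) on (?P<node>\S+): "
    r"(?P<reason>.+) \((?P<rc>\d+)\)"
)
FORCED_AWAY = re.compile(
    r"Forcing (?P<resource>\S+) away from (?P<node>\S+) after "
    r"(?P<failures>\d+) failures \(max=(?P<max>\d+)\)"
)


def parse_pengine_line(line):
    """Return a dict describing a recognized pengine message, or None."""
    for kind, pattern in (("failed_op", FAILED_OP), ("forced_away", FORCED_AWAY)):
        match = pattern.search(line)
        if match:
            # All captured fields are returned as strings.
            return {"kind": kind, **match.groupdict()}
    return None


sample = (
    "Jul 28 16:00:59 node1 pengine[5878]: warning: common_apply_stickiness: "
    "Forcing myResource-custom-start away from node2-priv after 1000000 "
    "failures (max=1000000)"
)
print(parse_pengine_line(sample))
```

A monitoring filter could use the returned resource and node fields to group the repeated 15-minute warnings into one alert per resource instead of firing on every occurrence.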
Environment
- Red Hat Enterprise Linux (RHEL) 6 or 7 with the High Availability Add On
- pacemaker