pengine reports resource errors or failures every 15 minutes in a RHEL 6 or 7 High Availability cluster with pacemaker
Issue
- The cluster log repeatedly shows "Processing failed op monitor for my_script_res on node1.example.com: not running (7)". This started after we manually stopped the application process with its start/stop script, or ran the "pcs resource disable" command.
- My cluster logs that it is forcing a resource away from a node every 15 minutes:
  Jul 28 16:00:59 node1 pengine[5878]: warning: common_apply_stickiness: Forcing myResource-custom-start away from node2-priv after 1000000 failures (max=1000000)
- pengine reports "failed op monitor" warnings for the same resource every 15 minutes. This is a problem for our monitoring software because each warning fires an alert.
- Is it normal for pengine to repeatedly report errors saying a resource is not running after I've disabled it with pcs?
- How do I reset a resource's fail count without affecting the status of any other resources in its group?
Environment
- Red Hat Enterprise Linux (RHEL) 6 or 7 with the High Availability Add-On
- pacemaker
