pengine reports resource errors or failures every 15 minutes in a RHEL 6 or 7 High Availability cluster with pacemaker

Solution Unverified - Updated -

Issue

  • The cluster log repeatedly shows "Processing failed op monitor for my_script_res on node1.example.com: not running (7)" after we manually stopped the application process with its start/stop script or ran the "pcs resource disable" command.

  • My cluster logs that it is forcing a resource away from a node every 15 minutes:

Jul 28 16:00:59 node1 pengine[5878]:  warning: common_apply_stickiness: Forcing myResource-custom-start away from node2-priv after 1000000 failures (max=1000000)

  • pengine reports "failed op monitor" warnings for the same resource every 15 minutes, which triggers alerts in our monitoring software.

  • Is it normal for pengine to repeatedly report errors saying a resource is not running after I've disabled it with pcs?

  • How do I reset a resource's fail count without affecting the status of any other resources in a group?
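
The last question can be illustrated with a minimal sketch. It assumes the pcs "resource failcount show" and "resource failcount reset" subcommands shipped with pcs on RHEL 6 and 7, and it reuses the resource and node names from the log excerpt above as placeholders. The run and reset_failcount helpers are hypothetical names introduced here for illustration. This is not the verified resolution for this article, and whether other members of a resource group are left untouched may depend on the pacemaker version.

```shell
#!/bin/sh
# Sketch only: inspect and reset the fail count of ONE resource on ONE node.
# By default the commands are printed, not executed; set APPLY=1 to run them
# on a cluster node. "run" and "reset_failcount" are hypothetical helpers.

run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"            # execute the command on the cluster node
    else
        echo "$*"       # dry run: only print what would be executed
    fi
}

reset_failcount() {
    resource=$1
    node=$2
    # Show the current fail count of the resource on that node.
    run pcs resource failcount show "$resource" "$node"
    # Reset the fail count for that resource on that node only.
    run pcs resource failcount reset "$resource" "$node"
}

# Placeholder names taken from the log excerpt; replace with your own.
reset_failcount my_script_res node1.example.com
```

Running the script as-is only prints the two pcs commands, so they can be reviewed before being executed with APPLY=1.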

Environment

  • Red Hat Enterprise Linux (RHEL) 6 or 7 with the High Availability Add-On
  • pacemaker
