One or more db2 resources in a RHEL 6 High Availability cluster with pacemaker periodically fail a monitor operation with "generic error" and no output logged

Solution Unverified - Updated -

Environment

  • Red Hat Enterprise Linux (RHEL) 6 with the High Availability Add On
  • IBM DB2 managed by ocf:heartbeat:db2 resource in the cluster

Issue

  • In our two-node pacemaker (pcs) cluster with two DB2 resource groups, the DB2 resource monitor fails randomly with an "unknown error" after 1 or 2 days.
  • pcs status shows that several of my db2 resources failed a monitor operation at some point:
Failed actions:
    myDB1_DB_monitor_60000 on node2.example.com 'unknown error' (1): call=534, status=complete, last-rc-change='Mon Jul 27 07:17:01 2015', queued=0ms, exec=0ms
    myDB2_DB_monitor_60000 on node1.example.com 'unknown error' (1): call=188, status=complete, last-rc-change='Mon Jul 27 04:39:05 2015', queued=0ms, exec=0ms
  • db2 resources fail a monitor check with "generic error" without printing any other errors or messages in the logs:
db2(mydb_DB)[20403]:    2015/07/27_07:15:58 DEBUG: Monitor: DB2 database mydb(0)/DWG has HADR status Standard/Standalone
db2(mydb_DB)[20403]:    2015/07/27_07:15:59 DEBUG: DB2 database mydb(0)/DWG appears to be working
[...]
Jul 27 07:17:00 [4237] ls001e02-00-db2.rmasede.grma.net       lrmd:    debug: log_finished:     finished - rsc:mydb_FS action:monitor call_id:507 pid:21751 exit-code:0 exec-time:0ms queue-time:0ms
Jul 27 07:17:01 [4237] ls001e02-00-db2.rmasede.grma.net       lrmd:    debug: operation_finished:   mydb_DB_monitor_60000:21564 - exited with rc=1
Jul 27 07:17:01 [4237] ls001e02-00-db2.rmasede.grma.net       lrmd:    debug: operation_finished:   mydb_DB_monitor_60000:21564:stderr [ -- empty -- ]
Jul 27 07:17:01 [4237] ls001e02-00-db2.rmasede.grma.net       lrmd:    debug: operation_finished:   mydb_DB_monitor_60000:21564:stdout [ -- empty -- ]
Jul 27 07:17:01 [4237] ls001e02-00-db2.rmasede.grma.net       lrmd:    debug: log_finished:     finished - rsc:mydb_DB action:monitor call_id:534 pid:21564 exit-code:1 exec-time:0ms queue-time:0ms
Jul 27 07:17:01 [4240] ls001e02-00-db2.rmasede.grma.net       crmd:    debug: create_operation_update:  do_update_resource: Updating resource mydb_DB after monitor op complete (interval=60000)
Jul 27 07:17:01 [4240] ls001e02-00-db2.rmasede.grma.net       crmd:   notice: process_lrm_event:    Operation mydb_DB_monitor_60000: unknown error (node=node1.example.com, call=534, rc=1, cib-update=3873, confirmed=false)

Resolution

Identify what is causing the DB2 database to fail, crash, or otherwise report an error that the db2 monitor operation is detecting.

Root Cause

The db2 resource is running a monitor operation at a regular interval and is detecting an error eventually. See the Diagnostic Steps below for more information on detecting how the resource may have failed.

Diagnostic Steps

  • Check /var/log/messages for messages from the db2 resource at the time that the monitor operation failed. It may be necessary to enable debug logging in /etc/cluster/cluster.conf and then check /var/log/cluster/corosync.log instead. The following log snippet shows the pacemaker daemons detecting and reporting the monitor failure, so a snippet like this is a good starting point:

    Jul 27 07:17:00 [4237] ls001e02-00-db2.rmasede.grma.net       lrmd:    debug: log_finished:     finished - rsc:mydb_FS action:monitor call_id:507 pid:21751 exit-code:0 exec-time:0ms queue-time:0ms
    Jul 27 07:17:01 [4237] ls001e02-00-db2.rmasede.grma.net       lrmd:    debug: operation_finished:   mydb_DB_monitor_60000:21564 - exited with rc=1
    Jul 27 07:17:01 [4237] ls001e02-00-db2.rmasede.grma.net       lrmd:    debug: operation_finished:   mydb_DB_monitor_60000:21564:stderr [ -- empty -- ]
    Jul 27 07:17:01 [4237] ls001e02-00-db2.rmasede.grma.net       lrmd:    debug: operation_finished:   mydb_DB_monitor_60000:21564:stdout [ -- empty -- ]
    Jul 27 07:17:01 [4237] ls001e02-00-db2.rmasede.grma.net       lrmd:    debug: log_finished:     finished - rsc:mydb_DB action:monitor call_id:534 pid:21564 exit-code:1 exec-time:0ms queue-time:0ms
    Jul 27 07:17:01 [4240] ls001e02-00-db2.rmasede.grma.net       crmd:    debug: create_operation_update:  do_update_resource: Updating resource mydb_DB after monitor op complete (interval=60000)
    Jul 27 07:17:01 [4240] ls001e02-00-db2.rmasede.grma.net       crmd:   notice: process_lrm_event:    Operation mydb_DB_monitor_60000: unknown error (node=node1.example.com, call=534, rc=1, cib-update=3873, confirmed=false)
    
  • Now trace back in the log to just before those daemons report the error and look for any messages reported by db2(<resource name>). In this example, the last such messages were logged about a minute earlier, during the previous monitor operation, and show that everything was seen as OK at that point:
    db2(mydb_DB)[20403]:    2015/07/27_07:15:58 DEBUG: Monitor: DB2 database mydb(0)/DWG has HADR status Standard/Standalone
    db2(mydb_DB)[20403]:    2015/07/27_07:15:59 DEBUG: DB2 database mydb(0)/DWG appears to be working
    
  • From the above logs, we know that the resource was running OK a minute earlier, and that on this pass nothing was logged on stderr, on stdout, or through the cluster facilities. Given the implementation of the resource agent's monitor operation, only two explanations are possible when that is the case:

      • The command db2pd -hadr -db $db run as the db2 user (filling in the proper $db value) returns non-zero, or
      • The command db2nps $db2node | cut -c9- | grep ' db2[^ ]' | wc -l run as the db2 user (filling in the proper $db2node value) returns 1

      • The above commands can be run at regular intervals so that, the next time the failure occurs, it is possible to determine whether either condition was true.
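One way to run the two commands above at regular intervals is a small wrapper script that records both results with a timestamp. The following is only a sketch under stated assumptions, not part of the resource agent: the instance user db2inst1, the default values for the database name and node number, the log path, and the expectation that db2pd and db2nps are on the instance user's PATH are all hypothetical and need to be adapted. It also assumes that "returns 1" for the second command refers to the count printed by wc -l.

```shell
#!/bin/bash
# Sketch: run the two checks named above and log one line per pass.
# All defaults below are assumptions for illustration; adjust for your site.
DB2_USER="${DB2_USER:-db2inst1}"   # hypothetical DB2 instance owner
DB_NAME="${DB_NAME:-mydb}"         # database name ($db in the article)
DB2_NODE="${DB2_NODE:-0}"          # node number ($db2node in the article)
LOG_FILE="${LOG_FILE:-/var/tmp/db2-monitor-checks.log}"

# Run a command line as the DB2 instance user.
run_as_db2() {
    su - "$DB2_USER" -c "$1"
}

# Check 1: print the exit status of db2pd -hadr -db <db>.
hadr_status() {
    run_as_db2 "db2pd -hadr -db $DB_NAME" >/dev/null 2>&1
    echo "$?"
}

# Check 2: print the number of db2 processes found by the article's pipeline.
db2_process_count() {
    run_as_db2 "db2nps $DB2_NODE" 2>/dev/null \
        | cut -c9- | grep ' db2[^ ]' | wc -l | tr -d '[:space:]'
}

# Mark the pass as SUSPECT if either condition from the article holds.
classify() {
    local hadr_rc="$1" pscount="$2" verdict="OK"
    [ "$hadr_rc" -ne 0 ] && verdict="SUSPECT"
    [ "$pscount" -eq 1 ] && verdict="SUSPECT"
    echo "$verdict db2pd_rc=$hadr_rc db2_process_count=$pscount"
}

# Perform one pass and append a timestamped line to the log file.
run_once() {
    local rc count
    rc=$(hadr_status)
    count=$(db2_process_count)
    echo "$(date '+%Y/%m/%d_%H:%M:%S') $(classify "$rc" "$count")" >> "$LOG_FILE"
}

if [ "${1:-}" = "--once" ]; then
    run_once
fi
```

Invoked as root with --once from cron at an interval no longer than the monitor interval, this leaves a record that can be compared with the timestamp of the next failed monitor operation. Because the checks run independently of the resource agent, a transient condition may still be missed between passes.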

  • Review the DB2 logs to determine if a failure is occurring.
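The log-tracing steps in the list above can be sketched with two small grep helpers. This is an illustrative sketch only: the function names failed_ops and agent_messages are hypothetical, and the patterns are based solely on the log lines quoted in this article, so other pacemaker versions may format these messages differently.

```shell
#!/bin/bash
# Sketch: helpers for tracing a failed monitor operation in the cluster log.
# Function names are hypothetical; patterns follow the log excerpts in this article.

# List lrmd lines for operations that exited with a non-zero return code.
failed_ops() {
    local log="$1"
    grep -E 'operation_finished:.* exited with rc=[1-9]' "$log"
}

# List messages logged by the db2 resource agent for one resource,
# matching the "db2(<resource name>)[pid]" prefix literally.
agent_messages() {
    local log="$1" resource="$2"
    grep -F "db2(${resource})[" "$log"
}
```

For example, failed_ops /var/log/cluster/corosync.log shows when a failure was reported, and agent_messages /var/log/cluster/corosync.log mydb_DB | tail -n 5 shows the last few agent messages for that resource, which can then be compared against the failure timestamp.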

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
