One or more db2 resources in a RHEL 6 High Availability cluster with pacemaker are failing a monitor operation periodically with "generic error" and no output logged
Environment
- Red Hat Enterprise Linux (RHEL) 6 with the High Availability Add-On
- IBM DB2 managed by an ocf:heartbeat:db2 resource in the cluster
Issue
- In our pacemaker (pcs) cluster made up of two nodes with two DB2 resource groups, the DB2 resource monitor fails randomly after one or two days with an "unknown error". pcs status shows that several of my db2 resources failed a monitor operation at some point:
Failed actions:
myDB1_DB_monitor_60000 on node2.example.com 'unknown error' (1): call=534, status=complete, last-rc-change='Mon Jul 27 07:17:01 2015', queued=0ms, exec=0ms
myDB2_DB_monitor_60000 on node1.example.com 'unknown error' (1): call=188, status=complete, last-rc-change='Mon Jul 27 04:39:05 2015', queued=0ms, exec=0ms
The db2 resources fail a monitor check with "generic error" without printing any other errors or messages in the logs:
db2(mydb_DB)[20403]: 2015/07/27_07:15:58 DEBUG: Monitor: DB2 database mydb(0)/DWG has HADR status Standard/Standalone
db2(mydb_DB)[20403]: 2015/07/27_07:15:59 DEBUG: DB2 database mydb(0)/DWG appears to be working
[...]
Jul 27 07:17:00 [4237] ls001e02-00-db2.rmasede.grma.net lrmd: debug: log_finished: finished - rsc:mydb_FS action:monitor call_id:507 pid:21751 exit-code:0 exec-time:0ms queue-time:0ms
Jul 27 07:17:01 [4237] ls001e02-00-db2.rmasede.grma.net lrmd: debug: operation_finished: mydb_DB_monitor_60000:21564 - exited with rc=1
Jul 27 07:17:01 [4237] ls001e02-00-db2.rmasede.grma.net lrmd: debug: operation_finished: mydb_DB_monitor_60000:21564:stderr [ -- empty -- ]
Jul 27 07:17:01 [4237] ls001e02-00-db2.rmasede.grma.net lrmd: debug: operation_finished: mydb_DB_monitor_60000:21564:stdout [ -- empty -- ]
Jul 27 07:17:01 [4237] ls001e02-00-db2.rmasede.grma.net lrmd: debug: log_finished: finished - rsc:mydb_DB action:monitor call_id:534 pid:21564 exit-code:1 exec-time:0ms queue-time:0ms
Jul 27 07:17:01 [4240] ls001e02-00-db2.rmasede.grma.net crmd: debug: create_operation_update: do_update_resource: Updating resource mydb_DB after monitor op complete (interval=60000)
Jul 27 07:17:01 [4240] ls001e02-00-db2.rmasede.grma.net crmd: notice: process_lrm_event: Operation mydb_DB_monitor_60000: unknown error (node=node1.example.com, call=534, rc=1, cib-update=3873, confirmed=false)
Resolution
Identify what is causing the DB2 database to fail, crash, or otherwise report an error that the db2 monitor operation is detecting.
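As a starting point for that investigation, the DB2 diagnostic log can be reviewed around the timestamps shown in the failed actions. The following is only a sketch: it assumes the instance owner is db2inst1, that its login profile sets up the DB2 environment, and that diagnostic data is in the default sqllib/db2dump location; substitute the real values for your environment.
# Switch to the DB2 instance owner (db2inst1 is only an example).
su - db2inst1
# Confirm where DB2 writes its diagnostic data (DIAGPATH).
db2 get dbm cfg | grep -i diag
# Review db2diag.log around the time of the failed monitor operation,
# e.g. Mon Jul 27 07:17:01 2015 from the failed actions above.
less ~/sqllib/db2dump/db2diag.log
# Or, if the db2diag tool is available, filter recent Severe/Error records.
db2diag -H 1d -level Severe,Error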
Root Cause
The db2 resource runs a monitor operation at a regular interval, and that operation eventually detects an error. See the Diagnostic Steps below for more information on determining how the resource may have failed.
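To see how the failing monitor operation is configured (its interval and timeout) and how often it has failed on a node, the cluster can be queried directly. This is only a sketch; mydb_DB stands in for the actual resource name, and the exact output depends on the pcs version.
# Show the resource's configured operations, including the monitor interval and timeout.
pcs resource show mydb_DB
# Show the accumulated failure count for the resource.
pcs resource failcount show mydb_DB
# Full cluster status, including the "Failed actions" section shown above.
pcs status --full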
Diagnostic Steps
- Check /var/log/messages for messages from the db2 resource at the time that the monitor operation failed; it may be necessary to enable debug logging in /etc/cluster/cluster.conf and then check /var/log/cluster/corosync.log instead (a sketch of enabling debug logging is shown at the end of this section). The following log snippet shows the pacemaker daemons detecting and reporting the monitor failure, so looking for a snippet like this is a good starting point:
Jul 27 07:17:00 [4237] ls001e02-00-db2.rmasede.grma.net lrmd: debug: log_finished: finished - rsc:mydb_FS action:monitor call_id:507 pid:21751 exit-code:0 exec-time:0ms queue-time:0ms
Jul 27 07:17:01 [4237] ls001e02-00-db2.rmasede.grma.net lrmd: debug: operation_finished: mydb_DB_monitor_60000:21564 - exited with rc=1
Jul 27 07:17:01 [4237] ls001e02-00-db2.rmasede.grma.net lrmd: debug: operation_finished: mydb_DB_monitor_60000:21564:stderr [ -- empty -- ]
Jul 27 07:17:01 [4237] ls001e02-00-db2.rmasede.grma.net lrmd: debug: operation_finished: mydb_DB_monitor_60000:21564:stdout [ -- empty -- ]
Jul 27 07:17:01 [4237] ls001e02-00-db2.rmasede.grma.net lrmd: debug: log_finished: finished - rsc:mydb_DB action:monitor call_id:534 pid:21564 exit-code:1 exec-time:0ms queue-time:0ms
Jul 27 07:17:01 [4240] ls001e02-00-db2.rmasede.grma.net crmd: debug: create_operation_update: do_update_resource: Updating resource mydb_DB after monitor op complete (interval=60000)
Jul 27 07:17:01 [4240] ls001e02-00-db2.rmasede.grma.net crmd: notice: process_lrm_event: Operation mydb_DB_monitor_60000: unknown error (node=node1.example.com, call=534, rc=1, cib-update=3873, confirmed=false)
- Now trace back in the log to just before those daemons report that error and see if there are any messages reported by db2(<resource name>). In this example, the last messages of this nature were seen a minute earlier, during the previous monitor operation, and show that everything appeared to be working at that point:
db2(mydb_DB)[20403]: 2015/07/27_07:15:58 DEBUG: Monitor: DB2 database mydb(0)/DWG has HADR status Standard/Standalone
db2(mydb_DB)[20403]: 2015/07/27_07:15:59 DEBUG: DB2 database mydb(0)/DWG appears to be working
- With the above logs, we know that the database was running OK a minute earlier, and then on this pass nothing was logged on stderr or stdout or through the cluster logging facilities. The implementation of the resource agent's monitor operation leaves only two possible explanations when that is the case:
- The command db2pd -hadr -db $db run as the db2 user (filling in the proper $db value) returns non-zero, or
- The command db2nps $db2node | cut -c9- | grep ' db2[^ ]' | wc -l run as the db2 user (filling in the proper $db2node value) returns 1
- The above commands can be executed regularly to determine, when this happens again, whether either one of them was true; a sketch of such a periodic check is shown at the end of this section.
- Review the DB2 logs to determine if a failure is occurring.
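For the first diagnostic step above, debug logging can be enabled in the cman configuration and the cluster log searched around the time of the next failure. The following is only a sketch: mydb_DB stands in for the actual resource name, and the exact way the updated configuration is validated and propagated may differ in your environment.
# In /etc/cluster/cluster.conf, inside the <cluster> element, add (and bump config_version):
#   <logging debug="on"/>
# Validate the edited configuration and push it to the cluster.
ccs_config_validate
cman_tool version -r
# After the next failure, search the cluster log around the failure time.
grep -E 'db2\(mydb_DB\)|mydb_DB_monitor|unknown error' /var/log/cluster/corosync.log | less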
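To catch the condition the next time it occurs, the two checks used by the resource agent's monitor can also be run periodically out of band (for example from cron) and their results logged. The following is only a sketch: db2inst1, mydb, the node number 0, and the log path are assumptions to be replaced with the real values, and it relies on the instance owner's login profile putting the DB2 commands on the PATH.
#!/bin/sh
# check_db2_monitor.sh - run (as root) the same checks the db2 resource
# agent's monitor performs, and append the results to a log file.
db=mydb                 # database name (assumption - replace)
db2node=0               # DB2 node number (assumption - replace)
instowner=db2inst1      # instance owner (assumption - replace)
log=/var/log/db2-monitor-check.log
{
  date
  if su - "$instowner" -c "db2pd -hadr -db $db" >/dev/null 2>&1; then
    echo "db2pd -hadr -db $db: OK (exit 0)"
  else
    echo "db2pd -hadr -db $db: FAILED (non-zero exit)"
  fi
  procs=$(su - "$instowner" -c "db2nps $db2node" | cut -c9- | grep ' db2[^ ]' | wc -l)
  echo "db2nps $db2node matching process count: $procs"
} >> "$log"
When the monitor next fails, the entries in this log around that timestamp should show which of the two conditions the resource agent hit.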
