A node is fenced after its rabbitmq-server resource times out on stop when another node leaves a RHEL 7 Highly Available RHEL-OSP 6 cluster

Solution Verified - Updated -

Issue

  • When a node leaves the cluster, another node reboots because its rabbitmq-server resource times out on a stop operation. Why is rabbitmq taking so long to stop?
  • Why do my rabbitmq resource's operations take a long time to complete?
  • When we reboot a node manually, another node's rabbitmq monitor operation times out, causing all nodes to attempt recovery. One node is then scheduled for STONITH because its stop operation also times out.

Apr 23 01:19:01 node1 corosync[9514]: [TOTEM ] A processor failed, forming new configuration.
Apr 23 01:19:03 node1 corosync[9514]: [TOTEM ] A new membership (172.16.20.230:132) was formed. Members left: 2
Apr 23 01:19:03 node1 corosync[9514]: [QUORUM] Members[2]: 1 3
Apr 23 01:19:03 node1 corosync[9514]: [MAIN  ] Completed service synchronization, ready to provide service.
Apr 23 01:19:44 node1 lrmd[16316]: warning: child_timeout_callback: rabbitmq-server_monitor_10000 process (PID 9427) timed out
Apr 23 01:19:44 node1 lrmd[16316]: warning: operation_finished: rabbitmq-server_monitor_10000:9427 - timed out after 40000ms
Apr 23 01:19:44 node1 pengine[16318]: notice: LogActions: Recover rabbitmq-server:0 (Started pcmk-node3)
Apr 23 01:19:44 node1 pengine[16318]: notice: LogActions: Recover rabbitmq-server:1 (Started pcmk-node1)
Apr 23 01:20:10 node1 crmd[9538]: notice: te_rsc_command: Initiating action 14: stop rabbitmq-server_stop_0 on pcmk-node2
Apr 23 01:21:41 node1 pengine[16318]: notice: LogActions: Stop    rabbitmq-server:0 (pcmk-node3)
Apr 23 01:21:38 node2 lrmd[3207]: warning: child_timeout_callback: rabbitmq-server_stop_0 process (PID 41953) timed out
Apr 23 01:21:38 node2 lrmd[3207]: warning: operation_finished: rabbitmq-server_stop_0:41953 - timed out after 90000ms
Apr 23 01:21:38 node2 crmd[41660]: error: process_lrm_event: Operation rabbitmq-server_stop_0: Timed Out (node=pcmk-node3, call=841, timeout=90000ms)
Apr 23 01:21:38 node2 crmd[41660]: notice: process_lrm_event: pcmk-node3-rabbitmq-server_stop_0:841 [ Stopping and halting node 'rabbit@lb-backend-node3' ...\n ]
Apr 23 01:21:41 node1 pengine[16318]: warning: unpack_rsc_op_failure: Processing failed op stop for rabbitmq-server:0 on pcmk-node3: unknown error (1)
Apr 23 01:21:41 node1 pengine[16318]: warning: unpack_rsc_op_failure: Processing failed op stop for rabbitmq-server:0 on pcmk-node3: unknown error (1)
Apr 23 01:21:41 node1 pengine[16318]: warning: stage6: Scheduling Node pcmk-node2 for STONITH
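
When reviewing a longer log, it can help to pull out only the operations that timed out, along with the timeout each one hit. The following is a minimal sketch, not a supported tool: it assumes the lrmd "operation_finished" line format seen in the excerpt above, and the names TIMEOUT_RE, find_timeouts, and SAMPLE are hypothetical, introduced here for illustration only.

```python
import re

# Assumption: timeout lines follow the lrmd "operation_finished" format
# shown in the log excerpt above. Other Pacemaker versions may log differently.
TIMEOUT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) lrmd\[\d+\]:\s+warning: "
    r"operation_finished: (?P<op>\S+):(?P<pid>\d+) - timed out after (?P<ms>\d+)ms"
)


def find_timeouts(lines):
    """Return (timestamp, host, operation, timeout_seconds) per timed-out operation."""
    results = []
    for line in lines:
        match = TIMEOUT_RE.match(line.strip())
        if match:
            results.append(
                (match["stamp"], match["host"], match["op"], int(match["ms"]) / 1000)
            )
    return results


# Two lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "Apr 23 01:19:44 node1 lrmd[16316]: warning: operation_finished: "
    "rabbitmq-server_monitor_10000:9427 - timed out after 40000ms",
    "Apr 23 01:21:38 node2 lrmd[3207]: warning: operation_finished: "
    "rabbitmq-server_stop_0:41953 - timed out after 90000ms",
]

for stamp, host, op, secs in find_timeouts(SAMPLE):
    print(f"{stamp} {host}: {op} timed out after {secs:.0f}s")
```

In the excerpt, this would surface the monitor operation hitting its 40000ms timeout first, followed by the stop operation hitting its 90000ms timeout, which is the failure that leads to the node being scheduled for STONITH.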

Environment

  • Red Hat Enterprise Linux OpenStack Platform (RHEL-OSP) 6
  • Red Hat Enterprise Linux (RHEL) 6 or 7 with the High Availability Add-On
  • RabbitMQ messaging
