Node reboots after reporting "crmd[xxxx]: error: process_lrm_event: Operation yyyyyy_stop_0: Timed out" in a RHEL 6 or 7 High Availability cluster
Issue
- I rebooted one node manually, and when that node came back, another node was rebooted
- I see a node getting rebooted frequently after reporting that a stop operation timed out
- A node gets fenced after a stop operation fails due to a timeout:
Apr 23 01:21:38 node1 crmd[41660]: error: process_lrm_event: Operation rabbitmq-server_stop_0: Timed Out (node=node1.example.com, call=841, timeout=90000ms)
- I see a node getting rebooted after another node goes missing from the cluster, and it looks like it was scheduled for STONITH because its rabbitmq stop operation failed or timed out
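To check whether a node's logs contain the same symptom, the messages can be scanned for stop-operation timeouts. The following is a minimal sketch in Python, assuming the log lines follow the format of the sample crmd message above. The pattern and the function name find_stop_timeouts are illustrative and not part of pacemaker. Other pacemaker versions may word the message differently, so the pattern may need adjusting.

```python
import re

# Pattern modeled on the sample crmd message quoted in this article.
# Matching is case-insensitive because the message appears as both
# "Timed out" and "Timed Out".
STOP_TIMEOUT = re.compile(
    r"crmd\[\d+\]:\s+error:\s+process_lrm_event:\s+"
    r"Operation (?P<resource>\S+)_stop_0: Timed out "
    r"\(node=(?P<node>[^,]+), call=(?P<call>\d+), timeout=(?P<timeout_ms>\d+)ms\)",
    re.IGNORECASE,
)


def find_stop_timeouts(lines):
    """Return one dict per log line that reports a stop operation timing out."""
    hits = []
    for line in lines:
        match = STOP_TIMEOUT.search(line)
        if match:
            hits.append({
                "resource": match.group("resource"),
                "node": match.group("node"),
                "call": int(match.group("call")),
                "timeout_ms": int(match.group("timeout_ms")),
            })
    return hits


# The sample line from the Issue section of this article.
sample = [
    "Apr 23 01:21:38 node1 crmd[41660]: error: process_lrm_event: "
    "Operation rabbitmq-server_stop_0: Timed Out "
    "(node=node1.example.com, call=841, timeout=90000ms)",
]

for hit in find_stop_timeouts(sample):
    print(hit["resource"], hit["node"], hit["timeout_ms"])
```

To scan a real log, pass an open file object (for example, one opened on the system log file) in place of the sample list.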
Environment
- Red Hat Enterprise Linux (RHEL) 6 or 7 with the High Availability Add-On
- pacemaker