RHEL 7 High Availability cluster nodes frequently getting fenced and lrmd reporting "error: crm_abort: lrmd_ipc_dispatch: Triggered assert at main.c:123 : flags & crm_ipc_client_response" and segfaulting

Solution Verified

Issue

  • My 3 nodes are all rebooting in a loop and lrmd seems to be constantly segfaulting.
  • The nodes in my cluster won't stop fencing each other, and I see lrmd reporting "Triggered assert at main.c:123 : flags & crm_ipc_client_response" and segfaulting (a quick log check for this signature is sketched after the log excerpts below):
Jul  6 08:57:02 node1 lrmd[4164]: error: crm_abort: lrmd_ipc_dispatch: Triggered assert at main.c:123 : flags & crm_ipc_client_response
Jul  6 08:57:02 node1 lrmd[4164]: error: lrmd_ipc_dispatch: Invalid client request: 0x1219ce0
  • I see constantly repeating errors from lrmd about notifications failing, and crmd crashes after "crit: lrm_connection_destroy: LRM Connection failed":
Jul  6 08:57:12 node1 crmd[33886]: crit: lrm_connection_destroy: LRM Connection failed
Jul  6 08:57:12 node1 crmd[33886]: warning: do_update_resource: Resource pcmk-node1 no longer exists in the lrmd
Jul  6 08:57:12 node1 lrmd[4164]: warning: qb_ipcs_event_sendv: new_event_notification (4164-33886-8): Bad file descriptor (9)
Jul  6 08:57:12 node1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 node1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 node1 crmd[33886]: notice: process_lrm_event: Operation pcmk-node1_stop_0: ok (node=pcmk-node1, call=2, rc=0, cib-update=0, confirmed=true)
Jul  6 08:57:12 node1 attrd[4166]: notice: attrd_peer_remove: Removing all pcmk-node1 attributes for pcmk-node1
Jul  6 08:57:12 node1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 node1 crmd[33886]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Jul  6 08:57:12 node1 crmd[33886]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Jul  6 08:57:12 node1 crmd[33886]: warning: do_recover: Fast-tracking shutdown in response to errors
Jul  6 08:57:12 node1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 node1 crmd[33886]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Jul  6 08:57:12 node1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 node1 attrd[4166]: notice: attrd_peer_remove: Removing all pcmk-slnec1ctl2 attributes for pcmk-slnec1ctl2
Jul  6 08:57:12 node1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 node1 crmd[33886]: notice: do_lrm_control: Disconnected from the LRM
Jul  6 08:57:12 node1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 node1 crmd[33886]: notice: terminate_cs_connection: Disconnecting from Corosync
Jul  6 08:57:12 node1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 node1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 node1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 node1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 node1 crmd[33886]: error: crmd_fast_exit: Could not recover from internal error
Jul  6 08:57:12 node1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 node1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 node1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 node1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 node1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 node1 pacemakerd[4050]: error: pcmk_child_exit: The crmd process (33886) exited: Generic Pacemaker error (201)
Jul  6 08:57:12 node1 pacemakerd[4050]: notice: pcmk_process_exit: Respawning failed child process: crmd
Jul  6 08:57:12 node1 crmd[36596]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
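
A quick way to confirm that a node is caught in this loop is to search its system log for the lrmd assertion and the repeated crmd exit/respawn messages shown above, for example (assuming the cluster logs to /var/log/messages, as in the excerpts):

# Look for the lrmd assertion and the crmd exit/respawn cycle on this node
grep -E 'Triggered assert at main.c|pcmk_child_exit: The crmd process|Respawning failed child process: crmd' /var/log/messages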

Environment

  • Red Hat Enterprise Linux (RHEL) 7 with the High Availability Add-On
  • One or more stonith devices in the CIB have a name (ID) matching the name of one of the cluster nodes (see the illustrative configuration after this list).
    • The cluster node name comes from corosync as specified in /etc/corosync/corosync.conf; if the nodes are listed by IP address in that file, the node's hostname (the output of uname -n) is used as the name instead.
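
As a minimal illustration of the triggering condition, the sketch below shows a cluster node named node1 in /etc/corosync/corosync.conf next to a stonith device created with the same ID. The fence agent (fence_ipmilan), its options, and the addresses are placeholders chosen for this example rather than values from the report above; the only point is that the stonith device ID collides with the node name. The two commands at the end compare node names with stonith device IDs to spot such a collision.

# /etc/corosync/corosync.conf (excerpt) - the cluster node is named "node1"
nodelist {
    node {
        ring0_addr: node1
        nodeid: 1
    }
}

# A stonith device created with an ID identical to the node name.
# fence_ipmilan and its options are placeholders for illustration only.
pcs stonith create node1 fence_ipmilan pcmk_host_list="node1" \
    ipaddr="192.0.2.10" login="admin" passwd="example"

# Compare the cluster node names with the configured stonith device IDs;
# any stonith ID that matches a node name reproduces this condition.
crm_node -l
pcs stonith show

Giving each stonith device an ID that differs from every cluster node name (for example, fence-node1) avoids the condition described in this section.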
