Corosync crashes after "[TOTEM] FAILED TO RECEIVE" in RHEL 6 cluster

Solution Unverified - Updated -

Issue

  • Node is removed from the cluster after a crash in corosync following a "FAILED TO RECEIVE" condition
  • corosync aborts with SIGABRT and dumps core after logging "[TOTEM] FAILED TO RECEIVE"
  Oct  2 09:13:56 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98
  Oct  2 09:13:56 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98
  [...]
  Oct  2 09:31:31 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98 99
  Oct  2 09:31:32 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98 99
  [...]
  Oct  2 09:49:35 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98 99 9b 9c
  Oct  2 09:49:35 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98 99 9b 9c
  [...]
  Oct  2 10:24:57 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98 99 9b 9c
  Oct  2 10:24:59 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98 99 9b 9c
  Oct  2 10:24:59 node1 corosync[5670]:   [TOTEM ] FAILED TO RECEIVE
  Oct  2 10:25:01 node1 abrtd: Directory 'ccpp-2012-10-02-10:25:01-5670' creation detected
  Oct  2 10:25:01 node1 abrt[14835]: Saved core dump of pid 5670 (/usr/sbin/corosync) to /var/spool/abrt/ccpp-2012-10-02-10:25:01-5670 (65900544 bytes)
  Oct  2 10:25:01 node1 dlm_controld[5743]: cluster is down, exiting
  Oct  2 10:25:01 node1 gfs_controld[5792]: cluster is down, exiting
  • The core dumped by corosync after FAILED TO RECEIVE shows a failed assertion ("token_memb_entries >= 1") in memb_consensus_agreed
#0  0x0000003416e32885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64    return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x0000003416e32885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003416e34065 in abort () at abort.c:92
#2  0x0000003416e2b9fe in __assert_fail_base (fmt=<value optimized out>, assertion=0x3462e23ef5 "token_memb_entries >= 1", file=0x3462e23e8d "totemsrp.c", 
    line=<value optimized out>, function=<value optimized out>) at assert.c:96
#3  0x0000003416e2bac0 in __assert_fail (assertion=0x3462e23ef5 "token_memb_entries >= 1", file=0x3462e23e8d "totemsrp.c", line=1211, function=0x3462e25150 "memb_consensus_agreed")
    at assert.c:105
#4  0x0000003462e12e86 in memb_consensus_agreed (instance=0x7f1852e24010) at totemsrp.c:1211
#5  0x0000003462e17513 in memb_join_process (instance=0x7f1852e24010, memb_join=0xf344fc) at totemsrp.c:4007
#6  0x0000003462e17839 in message_handler_memb_join (instance=0x7f1852e24010, msg=<value optimized out>, msg_len=<value optimized out>, 
    endian_conversion_needed=<value optimized out>) at totemsrp.c:4250
#7  0x0000003462e10d18 in rrp_deliver_fn (context=0xef19c0, msg=0xf344fc, msg_len=245) at totemrrp.c:1747
#8  0x0000003462e0b9a8 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0xf33e30) at totemudp.c:1252
#9  0x0000003462e07132 in poll_run (handle=2111858625151500288) at coropoll.c:513
#10 0x0000000000406eb9 in main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at main.c:1852
  • After the cman service was started on each node, node1 and node3 were rebooted and corosync on node2 generated a core file.
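
The log pattern above (a Retransmit List that keeps growing and is then followed by FAILED TO RECEIVE) can be picked out of a syslog file with a short script. The following is a minimal sketch, not a supported tool: the function name summarize_totem and the regular expressions are illustrative assumptions based only on the log lines quoted in this article, and other corosync versions may format these messages differently.

```python
import re

# Patterns derived from the log lines quoted in this article (assumption:
# the message text is the same on the system being checked).
RETRANSMIT = re.compile(r"\[TOTEM\s*\] Retransmit List: (?P<seqs>[0-9a-f ]+)$")
FAILED = re.compile(r"\[TOTEM\s*\] FAILED TO RECEIVE")


def summarize_totem(lines):
    """Return (distinct retransmit lists in order seen, FAILED TO RECEIVE seen?)."""
    lists = []
    failed = False
    for line in lines:
        line = line.rstrip()
        match = RETRANSMIT.search(line)
        if match:
            seqs = tuple(match.group("seqs").split())
            # Record a list only when it differs from the previous one,
            # so repeated identical lines collapse into a single entry.
            if not lists or lists[-1] != seqs:
                lists.append(seqs)
        elif FAILED.search(line):
            failed = True
    return lists, failed


# Sample input copied from the log excerpt in this article.
sample = [
    "Oct  2 09:13:56 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98",
    "Oct  2 09:31:31 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98 99",
    "Oct  2 09:49:35 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98 99 9b 9c",
    "Oct  2 10:24:59 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98 99 9b 9c",
    "Oct  2 10:24:59 node1 corosync[5670]:   [TOTEM ] FAILED TO RECEIVE",
]

lists, failed = summarize_totem(sample)
for seqs in lists:
    print("retransmit list:", " ".join(seqs))
print("FAILED TO RECEIVE seen:", failed)
```

To check a real log, the same function could be given an open file object (for example the system log on the affected node) instead of the sample list.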

Environment

  • Red Hat Enterprise Linux (RHEL) 6 with the High Availability Add-On
  • Multicast communication between nodes (i.e., not UDPU or broadcast)
  • corosync releases prior to 1.4.1-17.el6
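
To check whether an installed corosync falls under the "prior to 1.4.1-17.el6" condition above, the version-release string (as printed by, for example, rpm -q --qf '%{VERSION}-%{RELEASE}\n' corosync) can be compared against that boundary. The sketch below is a simplified comparison, not RPM's full version-ordering algorithm: it assumes a string of the form X.Y.Z-N.<dist> and compares only the numeric parts. The function names and the example strings are hypothetical.

```python
import re

# Boundary taken from the Environment list: 1.4.1-17.el6
BOUNDARY = (1, 4, 1, 17)


def parse_version_release(version_release):
    """Parse 'X.Y.Z-N.<dist>' into (X, Y, Z, N); dist suffix is ignored."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)(?:\..*)?", version_release)
    if match is None:
        raise ValueError("unexpected version-release format: %r" % version_release)
    return tuple(int(group) for group in match.groups())


def is_prior_to_boundary(version_release):
    """True if the given corosync version-release sorts before 1.4.1-17.el6."""
    return parse_version_release(version_release) < BOUNDARY


# Illustrative inputs only; substitute the string reported by the system.
for candidate in ("1.4.1-7.el6", "1.4.1-17.el6"):
    print(candidate, "prior to 1.4.1-17.el6:", is_prior_to_boundary(candidate))
```

Because the dist suffix is ignored, two builds that differ only after the first release number compare as equal here; the package manager's own comparison should be treated as authoritative.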
