Corosync crashes after "[TOTEM] FAILED TO RECEIVE" in RHEL 6 cluster

Solution Unverified

Issue

  • A node is removed from the cluster after corosync crashes following a "FAILED TO RECEIVE" condition
  • corosync aborts with SIGABRT and dumps core after logging "[TOTEM ] FAILED TO RECEIVE":
  Oct  2 09:13:56 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98
  Oct  2 09:13:56 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98
  [...]
  Oct  2 09:31:31 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98 99
  Oct  2 09:31:32 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98 99
  [...]
  Oct  2 09:49:35 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98 99 9b 9c
  Oct  2 09:49:35 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98 99 9b 9c
  [...]
  Oct  2 10:24:57 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98 99 9b 9c
  Oct  2 10:24:59 node1 corosync[5670]:   [TOTEM ] Retransmit List: 98 99 9b 9c
  Oct  2 10:24:59 node1 corosync[5670]:   [TOTEM ] FAILED TO RECEIVE
  Oct  2 10:25:01 node1 abrtd: Directory 'ccpp-2012-10-02-10:25:01-5670' creation detected
  Oct  2 10:25:01 node1 abrt[14835]: Saved core dump of pid 5670 (/usr/sbin/corosync) to /var/spool/abrt/ccpp-2012-10-02-10:25:01-5670 (65900544 bytes)
  Oct  2 10:25:01 node1 dlm_controld[5743]: cluster is down, exiting
  Oct  2 10:25:01 node1 gfs_controld[5792]: cluster is down, exiting
  • The core dumped by corosync after FAILED TO RECEIVE shows a failed assertion ("token_memb_entries >= 1") in memb_consensus_agreed:
#0  0x0000003416e32885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64    return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x0000003416e32885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003416e34065 in abort () at abort.c:92
#2  0x0000003416e2b9fe in __assert_fail_base (fmt=<value optimized out>, assertion=0x3462e23ef5 "token_memb_entries >= 1", file=0x3462e23e8d "totemsrp.c", 
    line=<value optimized out>, function=<value optimized out>) at assert.c:96
#3  0x0000003416e2bac0 in __assert_fail (assertion=0x3462e23ef5 "token_memb_entries >= 1", file=0x3462e23e8d "totemsrp.c", line=1211, function=0x3462e25150 "memb_consensus_agreed")
    at assert.c:105
#4  0x0000003462e12e86 in memb_consensus_agreed (instance=0x7f1852e24010) at totemsrp.c:1211
#5  0x0000003462e17513 in memb_join_process (instance=0x7f1852e24010, memb_join=0xf344fc) at totemsrp.c:4007
#6  0x0000003462e17839 in message_handler_memb_join (instance=0x7f1852e24010, msg=<value optimized out>, msg_len=<value optimized out>, 
    endian_conversion_needed=<value optimized out>) at totemsrp.c:4250
#7  0x0000003462e10d18 in rrp_deliver_fn (context=0xef19c0, msg=0xf344fc, msg_len=245) at totemrrp.c:1747
#8  0x0000003462e0b9a8 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0xf33e30) at totemudp.c:1252
#9  0x0000003462e07132 in poll_run (handle=2111858625151500288) at coropoll.c:513
#10 0x0000000000406eb9 in main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at main.c:1852
  • After the cman service was started on each node, node1 and node3 were rebooted, and corosync on node2 generated the core file.
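
To check whether a node's logs show the same pattern as the excerpt above (a repeating Retransmit List followed by FAILED TO RECEIVE), a small script can summarize the TOTEM messages. The following is a minimal sketch in Python, not a Red Hat tool: the function name summarize_totem and the returned dictionary keys are hypothetical, and the regular expression assumes the syslog line format shown in the excerpt.

```python
import re

# Assumption: lines look like the syslog excerpt in this article, e.g.
#   "... corosync[5670]:   [TOTEM ] Retransmit List: 98 99"
TOTEM_RE = re.compile(r"corosync\[(?P<pid>\d+)\]:\s+\[TOTEM\s*\]\s+(?P<msg>.*)$")


def summarize_totem(lines):
    """Summarize corosync TOTEM messages from an iterable of log lines.

    Returns a dict (hypothetical layout) with the number of
    "Retransmit List" lines, the distinct sequence numbers seen in
    them, and whether "FAILED TO RECEIVE" was logged.
    """
    retransmit_lines = 0
    sequences = set()
    failed_to_receive = False

    for line in lines:
        match = TOTEM_RE.search(line)
        if match is None:
            continue  # not a corosync TOTEM line
        msg = match.group("msg").strip()
        if msg.startswith("Retransmit List:"):
            retransmit_lines += 1
            # Everything after the colon is a space-separated list of
            # sequence numbers, as printed by corosync.
            sequences.update(msg.split(":", 1)[1].split())
        elif msg == "FAILED TO RECEIVE":
            failed_to_receive = True

    return {
        "retransmit_lines": retransmit_lines,
        "sequences": sorted(sequences),
        "failed_to_receive": failed_to_receive,
    }
```

The function accepts any iterable of strings, so an open file object can be passed directly, for example the system log if corosync messages are written there. A result with many retransmit lines for the same few sequence numbers and failed_to_receive set to True matches the symptom described in this article; it does not by itself confirm the assertion failure, which still requires inspecting the core.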

Environment

  • Red Hat Enterprise Linux (RHEL) 6 with the High Availability Add-On
  • Multicast communication between nodes (i.e., not UDPU or broadcast)
  • corosync releases prior to 1.4.1-17.el6
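
Whether an installed corosync package falls in the range listed above can be checked by comparing its version-release string against 1.4.1-17.el6. The sketch below is a simplified illustration in Python and is not a full RPM version comparison: it only compares the three numeric version fields and the leading numeric release field, ignoring epoch, dist tag, and any z-stream suffix. The names BOUNDARY, parse_version_release, and is_affected are hypothetical.

```python
import re

# Boundary taken from the Environment list: releases prior to 1.4.1-17.el6.
BOUNDARY = (1, 4, 1, 17)


def parse_version_release(text):
    """Extract (major, minor, patch, release) from a string such as
    '1.4.1-15.el6' or 'corosync-1.4.1-15.el6.x86_64'.

    Simplification (assumption): only the leading numeric part of the
    release is used; anything after it (dist tag, suffixes) is ignored.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)-(\d+)", text)
    if match is None:
        raise ValueError("unrecognized version-release string: %r" % text)
    return tuple(int(part) for part in match.groups())


def is_affected(text):
    """Return True if the package is older than the 1.4.1-17.el6 boundary."""
    return parse_version_release(text) < BOUNDARY
```

For example, passing the output of a package query such as "corosync-1.4.1-15.el6.x86_64" to is_affected returns True, while "1.4.1-17.el6" returns False. For an authoritative comparison, use the RPM tooling itself rather than this approximation.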
