Corosync crashes after "[TOTEM] FAILED TO RECEIVE" in RHEL 6 cluster
Issue
- Node is removed from the cluster after a crash in corosync following a "FAILED TO RECEIVE" condition
- corosync crashes due to a SIGABRT and dumps a core after logging "[TOTEM ] FAILED TO RECEIVE":
Oct 2 09:13:56 node1 corosync[5670]: [TOTEM ] Retransmit List: 98
Oct 2 09:13:56 node1 corosync[5670]: [TOTEM ] Retransmit List: 98
[...]
Oct 2 09:31:31 node1 corosync[5670]: [TOTEM ] Retransmit List: 98 99
Oct 2 09:31:32 node1 corosync[5670]: [TOTEM ] Retransmit List: 98 99
[...]
Oct 2 09:49:35 node1 corosync[5670]: [TOTEM ] Retransmit List: 98 99 9b 9c
Oct 2 09:49:35 node1 corosync[5670]: [TOTEM ] Retransmit List: 98 99 9b 9c
[...]
Oct 2 10:24:57 node1 corosync[5670]: [TOTEM ] Retransmit List: 98 99 9b 9c
Oct 2 10:24:59 node1 corosync[5670]: [TOTEM ] Retransmit List: 98 99 9b 9c
Oct 2 10:24:59 node1 corosync[5670]: [TOTEM ] FAILED TO RECEIVE
Oct 2 10:25:01 node1 abrtd: Directory 'ccpp-2012-10-02-10:25:01-5670' creation detected
Oct 2 10:25:01 node1 abrt[14835]: Saved core dump of pid 5670 (/usr/sbin/corosync) to /var/spool/abrt/ccpp-2012-10-02-10:25:01-5670 (65900544 bytes)
Oct 2 10:25:01 node1 dlm_controld[5743]: cluster is down, exiting
Oct 2 10:25:01 node1 gfs_controld[5792]: cluster is down, exiting
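The backtrace below can be reproduced by opening the abrt-saved core in gdb. A minimal sketch of the commands, assuming the dump directory from the messages above (the coredump filename inside the abrt directory is an assumption; abrt naming varies by version):

# debuginfo-install corosync
# gdb /usr/sbin/corosync /var/spool/abrt/ccpp-2012-10-02-10:25:01-5670/coredump
(gdb) bt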
- The core dumped by corosync after FAILED TO RECEIVE shows a failed assertion in memb_consensus_agreed (a simplified sketch of the failing check follows the backtrace):
#0 0x0000003416e32885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0 0x0000003416e32885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003416e34065 in abort () at abort.c:92
#2 0x0000003416e2b9fe in __assert_fail_base (fmt=<value optimized out>, assertion=0x3462e23ef5 "token_memb_entries >= 1", file=0x3462e23e8d "totemsrp.c",
line=<value optimized out>, function=<value optimized out>) at assert.c:96
#3 0x0000003416e2bac0 in __assert_fail (assertion=0x3462e23ef5 "token_memb_entries >= 1", file=0x3462e23e8d "totemsrp.c", line=1211, function=0x3462e25150 "memb_consensus_agreed")
at assert.c:105
#4 0x0000003462e12e86 in memb_consensus_agreed (instance=0x7f1852e24010) at totemsrp.c:1211
#5 0x0000003462e17513 in memb_join_process (instance=0x7f1852e24010, memb_join=0xf344fc) at totemsrp.c:4007
#6 0x0000003462e17839 in message_handler_memb_join (instance=0x7f1852e24010, msg=<value optimized out>, msg_len=<value optimized out>,
endian_conversion_needed=<value optimized out>) at totemsrp.c:4250
#7 0x0000003462e10d18 in rrp_deliver_fn (context=0xef19c0, msg=0xf344fc, msg_len=245) at totemrrp.c:1747
#8 0x0000003462e0b9a8 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0xf33e30) at totemudp.c:1252
#9 0x0000003462e07132 in poll_run (handle=2111858625151500288) at coropoll.c:513
#10 0x0000000000406eb9 in main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at main.c:1852
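For context on the abort: the assertion at totemsrp.c:1211 fires when the token membership computed during the join/consensus phase ends up empty. The following is a simplified, self-contained sketch of that check, with types and helper names modeled loosely on the frames above; it is illustrative only, not the upstream totemsrp.c source:

/* Illustrative sketch only -- not the actual totemsrp.c code. Shows how
 * the invariant asserted at totemsrp.c:1211 can be violated. */
#include <assert.h>

#define PROCESSOR_COUNT_MAX 384

struct srp_addr {
        unsigned int nodeid;
};

/* out = full minus failed: the processors still eligible for the token */
static void memb_set_subtract(struct srp_addr *out, int *out_entries,
        const struct srp_addr *full, int full_entries,
        const struct srp_addr *failed, int failed_entries)
{
        int i, j, found;

        *out_entries = 0;
        for (i = 0; i < full_entries; i++) {
                found = 0;
                for (j = 0; j < failed_entries; j++) {
                        if (full[i].nodeid == failed[j].nodeid)
                                found = 1;
                }
                if (!found)
                        out[(*out_entries)++] = full[i];
        }
}

static void memb_consensus_agreed_check(
        const struct srp_addr *proc_list, int proc_entries,
        const struct srp_addr *failed_list, int failed_entries)
{
        struct srp_addr token_memb[PROCESSOR_COUNT_MAX];
        int token_memb_entries = 0;

        memb_set_subtract(token_memb, &token_memb_entries,
                proc_list, proc_entries,
                failed_list, failed_entries);

        /* If the failed list grows to cover the entire processor list
         * (including the local node), nothing survives the subtraction
         * and this assertion aborts the process. */
        assert(token_memb_entries >= 1);
}

int main(void)
{
        struct srp_addr procs[]  = { {1}, {2}, {3} };
        struct srp_addr failed[] = { {1}, {2}, {3} }; /* every node marked failed */

        memb_consensus_agreed_check(procs, 3, failed, 3); /* aborts here */
        return 0;
}

In the sketch, marking every processor as failed leaves zero token members, so assert(token_memb_entries >= 1) terminates the process with SIGABRT, matching the abrtd messages and backtrace shown earlier.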
- After the cman service was started on each node, node1 and node3 were rebooted and corosync generated the core file on node2.
Environment
- Red Hat Enterprise Linux (RHEL) 6 with the High Availability Add-On
- Multicast communication between nodes (i.e., not UDPU or broadcast)
- corosync releases prior to 1.4.1-17.el6 (see the check below)
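Whether a node matches this environment can be checked from the installed package and the cluster transport; a quick sketch (in a RHEL 6 cman cluster, the absence of a transport attribute in cluster.conf means the default multicast transport is in use):

# rpm -q corosync
# grep transport /etc/cluster/cluster.conf

A corosync package older than 1.4.1-17.el6 together with the default multicast transport matches the affected configuration.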