Sometimes a member is suspected by FD because a heartbeat acknowledgement has not been received for some time (defined by
timeout and max_tries). This may occur for several reasons. As an example, say you have a cluster consisting of nodes A, B, C and D. In this cluster, A pings B, B pings C, C pings D, and D pings A.
C may be suspected in any of the following situations.
- If B and C are running at 100% CPU for longer than the time defined by timeout and max_tries, then even if C sends a heartbeat acknowledgement to B, B may not be able to process it.
- If B or C is garbage collecting, it may not respond to a heartbeat message or process its acknowledgement.
- If the network loses packets, heartbeat messages or acknowledgements may be lost. This can occur when a network has high traffic. Packets are usually dropped in the following order: broadcasts, IP multicasts, then TCP packets.
- If B or C is busy processing a callback, heartbeats may go unanswered. Say C receives a remote method call and takes longer than the period defined by timeout and max_tries to process it. During this time, C does not process any other message, including heartbeats. Therefore B will not receive a heartbeat acknowledgement, and will suspect C.
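
The scenarios above share one mechanism: the target stays silent for longer than the window allowed by timeout and max_tries. The following is a minimal sketch of that idea, not FD's actual implementation. It assumes a simplified model in which the pinger waits timeout milliseconds per heartbeat and suspects its neighbor after max_tries consecutive missed acknowledgements. The function names and the numeric values are hypothetical and chosen for illustration only.

```python
def ring_neighbors(members):
    """Map each member to the member it pings (the next one in the ring)."""
    return {m: members[(i + 1) % len(members)] for i, m in enumerate(members)}


def is_suspected(stall_ms, timeout_ms, max_tries):
    """Return True if a target that stays silent for stall_ms gets suspected.

    Assumed model (not taken from the FD source): each heartbeat waits
    timeout_ms for an acknowledgement, and the target is suspected after
    max_tries consecutive misses.
    """
    missed = 0
    elapsed = 0
    # Count every full timeout interval that fits inside the stall.
    while elapsed + timeout_ms <= stall_ms:
        elapsed += timeout_ms
        missed += 1
        if missed >= max_tries:
            return True
    return False


members = ["A", "B", "C", "D"]
pings = ring_neighbors(members)
print("B pings", pings["B"])

# Hypothetical settings: timeout = 3000 ms, max_tries = 3.
# A 5 second garbage collection pause on C causes one missed acknowledgement.
print(is_suspected(stall_ms=5000, timeout_ms=3000, max_tries=3))
# A 10 second callback on C causes three missed acknowledgements.
print(is_suspected(stall_ms=10000, timeout_ms=3000, max_tries=3))
```

Under this model, a pause is tolerated as long as it ends before max_tries heartbeats in a row have gone unacknowledged, which is why short garbage collection pauses or brief CPU spikes do not necessarily lead to a suspicion.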