23.6.3.2. Causes of missing heartbeats in FD

Sometimes a member is suspected by FD because a heartbeat acknowledgement has not been received for some time (defined by timeout and max_tries). This may occur for several reasons. As an example, say you have a cluster consisting of nodes A, B, C and D. In this cluster, A pings B, B pings C, C pings D, and D pings A.
C may be suspected in any of the following situations.
  • If B or C is running at 100% CPU for longer than the time defined by timeout and max_tries, the heartbeat exchange can stall. Even if C sends a heartbeat acknowledgement to B, B may not be able to process it in time.
  • If B or C is garbage collecting, it may not respond to a heartbeat message or process the acknowledgement.
  • If the network loses packets, heartbeat messages or acknowledgements may be lost. This can occur when a network has high traffic. Packets are usually dropped in the following order: broadcasts, IP multicasts, then TCP packets.
  • If B or C is processing a callback, heartbeats may go unanswered. Say C receives a remote method call and takes longer than the period defined by timeout and max_tries to process it. During this time, C does not process any other message, including heartbeats. B therefore receives no heartbeat acknowledgement and suspects C.
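
The situations above share one pattern: a stall on B or C outlasts the period FD is willing to wait. The following Java sketch models this under stated assumptions and is not JGroups code. The class FdSuspicionSketch and its methods are hypothetical names. It assumes that the waiting period is approximately timeout multiplied by max_tries, and the values 10000 and 5 are illustrative only.

```java
import java.util.List;

public class FdSuspicionSketch {

    // Assumption: a neighbor is suspected once maxTries consecutive heartbeats,
    // each waited on for timeoutMillis, have gone unacknowledged.
    static long suspicionWindowMillis(long timeoutMillis, int maxTries) {
        return timeoutMillis * maxTries;
    }

    // Ring order from the example: each member pings the next one,
    // and the last member pings the first.
    static String pingTarget(List<String> members, String member) {
        int index = members.indexOf(member);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown member: " + member);
        }
        return members.get((index + 1) % members.size());
    }

    // True if a stall (saturated CPU, GC pause, long callback) lasts longer
    // than the assumed suspicion window.
    static boolean wouldBeSuspected(long stallMillis, long timeoutMillis, int maxTries) {
        return stallMillis > suspicionWindowMillis(timeoutMillis, maxTries);
    }

    public static void main(String[] args) {
        List<String> members = List.of("A", "B", "C", "D");
        long timeoutMillis = 10_000; // illustrative value only
        int maxTries = 5;            // illustrative value only

        System.out.println("B pings " + pingTarget(members, "B"));
        System.out.println("Suspicion window (ms): "
                + suspicionWindowMillis(timeoutMillis, maxTries));
        System.out.println("60 s callback on C leads to suspicion: "
                + wouldBeSuspected(60_000, timeoutMillis, maxTries));
    }
}
```

With these illustrative values, a callback that blocks C for 60 seconds exceeds the assumed window, while a 2 second garbage collection pause does not. Comparing the longest expected pause or callback with this window is one way to judge whether timeout and max_tries are set high enough.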