Chapter 28. JGroups Services

JGroups provides the underlying group communication support for JBoss Enterprise Application Platform clusters. The interaction of clustered services with JGroups was covered in Section 21.1, “Group Communication with JGroups”. This chapter focuses on the details of this interaction, with particular attention to configuration details and troubleshooting tips.
This chapter is not intended as complete JGroups documentation. For more information about JGroups, consult the JGroups project documentation.
The first section of this chapter covers the many JGroups configuration options in detail. JBoss Enterprise Application Platform ships with a set of default JGroups configurations. Most applications will work with the default configurations out of the box. You will only need to edit these configurations when you deploy an application with special network or performance requirements.

28.1. Configuring a JGroups Channel's Protocol Stack

The JGroups framework provides services to enable peer-to-peer communications between nodes in a cluster. Communication occurs over a communication channel. The channel is built up from a stack of network communication protocols, each of which is responsible for adding a particular capability to the overall behavior of the channel. Key capabilities provided by various protocols include transport, cluster discovery, message ordering, lossless message delivery, detection of failed peers, and cluster membership management services.
Figure 28.1, “Protocol stack in JGroups” shows a conceptual cluster with each member's channel composed of a stack of JGroups protocols.
Protocol stack in JGroups

Figure 28.1. Protocol stack in JGroups

This section of the chapter covers some of the most commonly used protocols, according to the type of behavior they add to the channel. We discuss a few key configuration attributes exposed by each protocol, but since these attributes should be altered only by experts, this chapter focuses on familiarizing users with the purpose of various protocols.
The JGroups configurations used in JBoss Enterprise Application Platform appear as nested elements in the <JBOSS_HOME>/server/<PROFILE>/deploy/cluster/jgroups-channelfactory.sar/META-INF/jgroups-channelfactory-stacks.xml file. This file is parsed by the ChannelFactory service, which uses the contents to provide correctly configured channels to the clustered services that require them. See Section 21.1.1, “The Channel Factory Service” for more on the ChannelFactory service.
The following is an example protocol stack configuration from jgroups-channelfactory-stacks.xml:
<stack name="udp-async"
           description="Same as the default 'udp' stack above, except message bundling
                        is enabled in the transport protocol (enable_bundling=true). 
                        Useful for services that make high-volume asynchronous 
                        RPCs (e.g. high volume JBoss Cache instances configured 
                        for REPL_ASYNC) where message bundling may improve performance.">
        <config>
          <UDP
             singleton_name="udp-async"
             mcast_port="${jboss.jgroups.udp_async.mcast_port:45689}"
             mcast_addr="${jboss.partition.udpGroup:228.11.11.11}"
             tos="8"
             ucast_recv_buf_size="20000000"
             ucast_send_buf_size="640000"
             mcast_recv_buf_size="25000000"
             mcast_send_buf_size="640000"
             loopback="true"
             discard_incompatible_packets="true"
             enable_bundling="true"
             max_bundle_size="64000"
             max_bundle_timeout="30"
             ip_ttl="${jgroups.udp.ip_ttl:2}"
             thread_naming_pattern="cl"
             timer.num_threads="12"
             enable_diagnostics="${jboss.jgroups.enable_diagnostics:true}"
             diagnostics_addr="${jboss.jgroups.diagnostics_addr:224.0.0.75}"
             diagnostics_port="${jboss.jgroups.diagnostics_port:7500}"

             thread_pool.enabled="true"
             thread_pool.min_threads="8"
             thread_pool.max_threads="200"
             thread_pool.keep_alive_time="5000"
             thread_pool.queue_enabled="true"
             thread_pool.queue_max_size="1000"
             thread_pool.rejection_policy="discard"
      
             oob_thread_pool.enabled="true"
             oob_thread_pool.min_threads="8"
             oob_thread_pool.max_threads="200"
             oob_thread_pool.keep_alive_time="1000"
             oob_thread_pool.queue_enabled="false"
             oob_thread_pool.rejection_policy="discard"/>
          <PING timeout="2000" num_initial_members="3"/>
          <MERGE2 max_interval="100000" min_interval="20000"/>
          <FD_SOCK/>
          <FD timeout="6000" max_tries="5" shun="true"/>
          <VERIFY_SUSPECT timeout="1500"/>
          <BARRIER/>
          <pbcast.NAKACK use_mcast_xmit="true" gc_lag="0"
                   retransmit_timeout="300,600,1200,2400,4800"
                   discard_delivered_msgs="true"/>
          <UNICAST timeout="300,600,1200,2400,3600"/>
          <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="400000"/>          
          <VIEW_SYNC avg_send_interval="10000"/>
          <pbcast.GMS print_local_addr="true" join_timeout="3000"
                   shun="true"
                   view_bundling="true"
                   view_ack_collection_timeout="5000"
                   resume_task_timeout="7500"/>
          <FC max_credits="2000000" min_threshold="0.10" 
              ignore_synchronous_response="true"/>
          <FRAG2 frag_size="60000"/>
          <!-- pbcast.STREAMING_STATE_TRANSFER/ -->
          <pbcast.STATE_TRANSFER/>
          <pbcast.FLUSH timeout="0" start_flush_timeout="10000"/>
        </config>
    </stack>
The <config> element contains all the configuration data for JGroups. This information is used to configure a JGroups channel, which is conceptually similar to a socket, and manages communication between peers in a cluster. Each element within the <config> element defines a particular JGroups protocol. Each protocol performs one function. The combination of these functions defines the characteristics of the channel as a whole. The next few sections describe common protocols and explain the options available to each.

28.1.1. Common Configuration Properties

The following property is exposed by all of the JGroups protocols discussed below:
  • stats - indicates whether the protocol should gather runtime statistics on its operations. These statistics can be exposed via tools like the JMX Console or the JGroups Probe utility. What, if any, statistics are gathered depends on the protocol. Default is true.

Note

All protocols in the versions of JGroups used in JBoss Enterprise Application Platform 4 and earlier exposed the down_thread and up_thread attributes. The JGroups version included in JBoss Enterprise Application Platform 5 and later no longer uses those attributes, and a WARN message will be written to the server log if they are configured for any protocol.

28.1.2. Transport Protocols

The transport protocols send and receive messages to and from the network. They also manage the thread pools used to deliver incoming messages to addresses higher in the protocol stack. JGroups supports UDP, TCP and TUNNEL as transport protocols.

Note

The UDP, TCP, and TUNNEL protocols are mutually exclusive. You can only have one transport protocol in each JGroups <config> element.

28.1.2.1. UDP configuration

UDP is the preferred transport protocol for JGroups. UDP uses multicast (or, in an unusual configuration, multiple unicasts) to send and receive messages. If you choose UDP as the transport protocol for your cluster service, you need to configure it in the UDP sub-element in the JGroups config element. Here is an example.
          <UDP
             singleton_name="udp-async"
             mcast_port="${jboss.jgroups.udp_async.mcast_port:45689}"
             mcast_addr="${jboss.partition.udpGroup:228.11.11.11}"
             tos="8"
             ucast_recv_buf_size="20000000"
             ucast_send_buf_size="640000"
             mcast_recv_buf_size="25000000"
             mcast_send_buf_size="640000"
             loopback="true"
             discard_incompatible_packets="true"
             enable_bundling="true"
             max_bundle_size="64000"
             max_bundle_timeout="30"
             ip_ttl="${jgroups.udp.ip_ttl:2}"
             thread_naming_pattern="cl"
             timer.num_threads="12"
             enable_diagnostics="${jboss.jgroups.enable_diagnostics:true}"
             diagnostics_addr="${jboss.jgroups.diagnostics_addr:224.0.0.75}"
             diagnostics_port="${jboss.jgroups.diagnostics_port:7500}"

             thread_pool.enabled="true"
             thread_pool.min_threads="8"
             thread_pool.max_threads="200"
             thread_pool.keep_alive_time="5000"
             thread_pool.queue_enabled="true"
             thread_pool.queue_max_size="1000"
             thread_pool.rejection_policy="discard"
      
             oob_thread_pool.enabled="true"
             oob_thread_pool.min_threads="8"
             oob_thread_pool.max_threads="200"
             oob_thread_pool.keep_alive_time="1000"
             oob_thread_pool.queue_enabled="false"
             oob_thread_pool.rejection_policy="discard"/>
JGroups transport configurations have a number of attributes available. First we look at the attributes available to the UDP protocol, followed by the attributes that are also used by the TCP and TUNNEL transport protocols.
The attributes particular to the UDP protocol are:
  • ip_mcast specifies whether or not to use IP multicasting. The default is true. If set to false, multiple unicast packets will be sent instead of one multicast packet. In either case, packets sent via the UDP protocol are UDP datagrams.
  • mcast_addr specifies the multicast address (class D) for communicating with the group (i.e., the cluster). The standard protocol stack configurations in JBoss Enterprise Application Platform use the value of system property jboss.partition.udpGroup, if set, as the value for this attribute. Using the -u command line switch when starting JBoss Enterprise Application Platform sets that value. See Section 28.6.2, “Isolating JGroups Channels” for information about using this configuration attribute to ensure that JGroups channels are properly isolated from one another. If this attribute is omitted, the default value is 228.11.11.11.
  • mcast_port specifies the port to use for multicast communication with the group. See Section 28.6.2, “Isolating JGroups Channels” for how to use this configuration attribute to ensure JGroups channels are properly isolated from one another. If this attribute is omitted, the default is 45688.
  • mcast_send_buf_size, mcast_recv_buf_size, ucast_send_buf_size and ucast_recv_buf_size define the socket send and receive buffer sizes that JGroups will request from the operating system. A large buffer size helps to ensure that packets are not dropped due to buffer overflow. However, socket buffer sizes are limited at the operating system level, so obtaining the desired buffer may require configuration at the operating system level. See Section 28.6.2.3, “Improving UDP Performance by Configuring OS UDP Buffer Limits” for further details.
  • bind_port specifies the port to which the unicast receive socket should be bound. The default is 0; i.e. use an ephemeral port.
  • port_range specifies the number of ports to try if the port identified by bind_port is not available. The default is 1, which specifies that only bind_port will be tried.
  • ip_ttl specifies time-to-live (TTL) for IP Multicast packets. TTL is the commonly used term in multicast networking, but is actually something of a misnomer, since the value here refers to how many network hops a packet will be allowed to travel before networking equipment will drop it.
  • tos specifies the traffic class for sending unicast and multicast datagrams.
The attributes that are common to all transport protocols, and thus have the same meanings when used with TCP or TUNNEL, are:
  • singleton_name provides a unique name for this transport protocol configuration. Used by the application server's ChannelFactory to support sharing of a transport protocol instance by different channels that use the same transport protocol configuration. See Section 21.1.2, “The JGroups Shared Transport”.
  • bind_addr specifies the interface on which to receive and send messages. By default, JGroups uses the value of system property jgroups.bind_addr. This can also be set with the -b command line switch. See Section 28.6, “Other Configuration Issues” for more on binding JGroups sockets.
  • receive_on_all_interfaces specifies whether this node should listen on all interfaces for multicasts. The default is false. It overrides the bind_addr property for receiving multicasts. However, bind_addr (if set) is still used to send multicasts.
  • send_on_all_interfaces specifies whether this node sends UDP packets via all available network interface controllers, if the machine has more than one. This means that the same multicast message is sent N times, so use with care.
  • receive_interfaces specifies a list of interfaces on which to receive multicasts. The multicast receive socket will listen on all of these interfaces. This is a comma-separated list of IP addresses or interface names, for example, 192.168.5.1,eth1,127.0.0.1.
  • send_interfaces specifies a list of interfaces via which to send multicasts. The multicast sender socket will send on all of these interfaces. This is a comma-separated list of IP addresses or interface names, for example, 192.168.5.1,eth1,127.0.0.1. This means that the same multicast message is sent N times, so use with care.
  • enable_bundling specifies whether to enable message bundling. If true, the transport protocol queues outgoing messages until max_bundle_size bytes have accumulated or max_bundle_timeout milliseconds have elapsed, whichever occurs first. Then the transport protocol bundles the queued messages into one large message and sends it. The messages are un-bundled at the receiver. The default is false.
    Message bundling can have significant performance benefits for channels that are used for high volume sending of messages where the sender does not block waiting for a response from recipients (for example, a JBoss Cache instance configured for REPL_ASYNC.) It can add considerable latency to applications where senders need to block waiting for responses, so it is not recommended for certain situations, such as where a JBoss Cache instance is configured for REPL_SYNC.
  • loopback specifies whether the thread sending a message to the group should itself carry the message back up the stack for delivery. (Messages sent to the group are always delivered to the sending node as well.) If false, the sending thread does not carry the message; the transport protocol waits to read the message off the network and uses one of the message delivery pool threads for delivery. The default is false, but true is recommended to ensure that the channel receives its own messages, in case the network interface goes down.
  • discard_incompatible_packets specifies whether to discard packets sent by peers that use a different version of JGroups. Each message in the cluster is tagged with a JGroups version. If discard_incompatible_packets is set to true, messages received from different versions of JGroups will be silently discarded. Otherwise, a warning will be logged. In no case will the message be delivered. The default value is false.
  • enable_diagnostics specifies that the transport should open a multicast socket on address diagnostics_addr and port diagnostics_port to listen for diagnostic requests sent by the JGroups Probe utility.
  • The various thread_pool attributes configure the behavior of the pool of threads JGroups uses to carry ordinary incoming messages up the stack. The attributes provide the constructor arguments for an instance of java.util.concurrent.ThreadPoolExecutor. In the example above, the pool has a minimum (core) size of 8 threads and a maximum size of 200. If more than 8 pool threads have been created, a thread returning from carrying a message will wait for up to 5000 milliseconds to be assigned a new message to carry, after which it will terminate. If no threads are available to carry a message, the (separate) thread reading messages off the socket places messages in a queue that holds up to 1000 messages. If the queue is full, the thread reading messages off the socket discards the message.
  • The various oob_thread_pool attributes are similar to the thread_pool attributes in that they configure a java.util.concurrent.ThreadPoolExecutor used to carry incoming messages up the protocol stack. In this case, the pool is used to carry a special type of message known as an Out-Of-Band (OOB) message. OOB messages are exempt from the ordered-delivery requirements of protocols like NAKACK and UNICAST and thus can be delivered up the stack even if NAKACK or UNICAST are queuing up messages from a particular sender. OOB messages are often used internally by JGroups protocols and can be used by applications as well. For example, when JBoss Cache is in REPL_SYNC mode, it uses OOB messages for the second phase of its two-phase-commit protocol.

28.1.2.2. TCP configuration

A JGroups-based cluster can also work over TCP connections. Compared with UDP, TCP generates more network traffic as the cluster size increases, because TCP is fundamentally a unicast protocol; to send multicast messages, JGroups must send multiple TCP unicasts. To use TCP as a transport protocol, define a TCP element in the JGroups <config> element. Here is an example of the TCP element.
<TCP singleton_name="tcp" 
        start_port="7800" end_port="7800"/>
The following attributes are specific to the TCP element:
  • start_port and end_port define the range of TCP ports to which the server should bind. The server socket is bound to the first available port beginning with start_port. If no available port is found (for example, because the ports are in use by other sockets) before end_port, the server throws an exception. If no end_port is provided, or end_port is lower than start_port, no upper limit is applied to the port range. If start_port is equal to end_port, JGroups is forced to use the specified port, since binding fails if the specified port is not available. The default value is 7800. If set to 0, the operating system will select a port. (This will only work for the MPING or TCPGOSSIP discovery protocols. TCPPING requires that nodes and their required ports are listed.)
  • bind_port in TCP acts as an alias for start_port. If bind_port is configured, its value is used internally as start_port.
  • recv_buf_size and send_buf_size define the receive and send buffer sizes. A large receive buffer makes packets less likely to be dropped due to buffer overflow.
  • conn_expire_time specifies the time (in milliseconds) after which a connection can be closed by the reaper if no traffic has been received.
  • reaper_interval specifies interval (in milliseconds) to run the reaper. If both values are 0, no reaping will be done. If either value is > 0, reaping will be enabled. By default, reaper_interval is 0, which means no reaper.
  • sock_conn_timeout specifies the maximum time, in milliseconds, allowed for socket creation. During initial discovery, if a peer hangs, the node does not wait indefinitely but moves on after the timeout to ping other members. This reduces the chance of finding no members at all. The default is 2000.
  • use_send_queues specifies whether to use separate send queues for each connection. This prevents blocking on write if the peer hangs. The default is true.
  • external_addr specifies an external IP address to broadcast to other group members (if different from the local address). This is useful with Network Address Translation (NAT), for example for a node on a private network behind a firewall that can only be reached via an externally visible address, which is different from the local address it is bound to. The node can be configured to broadcast its external address while still binding to the local one. This avoids having to use the TUNNEL protocol (and hence the requirement for a central gossip router), because nodes outside the firewall can still route to the node inside the firewall via its external address. Without external_addr, the node behind the firewall broadcasts its private address, which the other nodes cannot route to.
  • skip_suspected_members specifies whether unicast messages should not be sent to suspected members. The default is true.
  • tcp_nodelay specifies TCP_NODELAY. By default, TCP nagles messages; that is, conceptually, smaller messages are bundled into larger ones. To invoke synchronous cluster method calls, disable nagling in addition to disabling message bundling (by setting enable_bundling to false). Nagling is disabled by setting tcp_nodelay to true. The default is false.
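Drawing the last two points together, the following fragment sketches a TCP transport tuned for synchronous cluster calls, with both nagling and message bundling disabled. The singleton_name value is an arbitrary example, not a stack name shipped with the platform.

```xml
<TCP singleton_name="tcp-sync"
     start_port="7800" end_port="7800"
     tcp_nodelay="true"          <!-- disable nagling for low-latency synchronous calls -->
     enable_bundling="false"     <!-- do not queue outgoing messages for bundling -->
     use_send_queues="true"
     sock_conn_timeout="2000"/>
```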

Note

All of the attributes common to all protocols discussed in the UDP protocol section also apply to TCP.

28.1.2.3. TUNNEL configuration

The TUNNEL protocol uses an external router process to send messages. The external router is a Java process that runs the org.jgroups.stack.GossipRouter main class. Each node has to register with the router. All messages are sent to the router and forwarded on to their destinations. The TUNNEL approach can be used to set up communication with nodes behind firewalls. A node can establish a TCP connection to the GossipRouter through the firewall (you can use port 80). This connection is also used by the router to send messages to nodes behind the firewall, as most firewalls do not permit outside hosts to initiate a TCP connection to a host inside the firewall. The TUNNEL configuration is defined in the TUNNEL element within the JGroups <config> element, like so:
<TUNNEL  singleton_name="tunnel"
            router_port="12001"
            router_host="192.168.5.1"/>
The available attributes in the TUNNEL element are listed below.
  • router_host specifies the host on which the GossipRouter is running.
  • router_port specifies the port on which the GossipRouter is listening.
  • reconnect_interval specifies the interval of time (in milliseconds) for which TUNNEL will attempt to connect to the GossipRouter if the connection is not established. The default value is 5000.

Note

All of the attributes common to all protocols discussed in the UDP protocol section also apply to TUNNEL.

28.1.3. Discovery Protocols

When a channel on a node first connects, it must determine which other nodes are running compatible channels, and which of these nodes is currently acting as the coordinator (the node responsible for letting new nodes join the group). Discovery protocols are used to find active nodes in the cluster and to determine which is the coordinator. This information is then provided to the group membership protocol (GMS), which communicates with the coordinator's GMS to add the newly-connecting node to the group. (For more information about group membership protocols, see Section 28.1.6, “Group Membership (GMS)”.)
Discovery protocols also assist merge protocols (see Section 28.5, “Merging (MERGE2)”) to detect cluster-split situations.
The discovery protocols sit on top of the transport protocol, so you can choose to use different discovery protocols depending on your transport protocol. These are also configured as sub-elements in the JGroups <config> element.

28.1.3.1. PING

PING is a discovery protocol that works by either multicasting PING requests to an IP multicast address or connecting to a gossip router. As such, PING normally sits on top of the UDP or TUNNEL transport protocols. Each node responds with a packet {C, A}, where C is the coordinator's address and A is the responder's own address. After timeout milliseconds have elapsed or num_initial_members replies have been received, the joiner determines the coordinator from the responses and sends a JOIN request to it (handled by GMS). If no node responds, the joiner assumes it is the first member of the group.
Here is an example PING configuration for IP multicast.
<PING timeout="2000"
    num_initial_members="3"/>
Here is another example PING configuration for contacting a Gossip Router.
<PING gossip_host="localhost"
      gossip_port="1234"
       timeout="2000" 
      num_initial_members="3"/>
The available attributes in the PING element are listed below.
  • timeout specifies the maximum number of milliseconds to wait for num_initial_members responses. The default is 3000.
  • num_initial_members specifies the minimum number of responses to wait for unless timeout has expired. The default is 2.
  • gossip_host specifies the host on which the GossipRouter is running.
  • gossip_port specifies the port on which the GossipRouter is listening.
  • gossip_refresh specifies the interval (in milliseconds) for the lease from the GossipRouter. The default is 20000.
  • initial_hosts is a comma-separated list of addresses or ports (for example, host1[12345],host2[23456]) which are pinged for discovery. Default is null, meaning multicast discovery should be used. If initial_hosts is specified, you must list all possible cluster members, not just a few well-known hosts, or MERGE2 cluster split discovery will not work reliably.
If both gossip_host and gossip_port are defined, the cluster uses the GossipRouter for initial discovery. If initial_hosts is specified, the cluster pings that static list of addresses for discovery. Otherwise, the cluster uses IP multicasting for discovery.
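For example, a PING configuration that pings a static list of members instead of multicasting might look like the following; the host names and ports are illustrative only.

```xml
<PING timeout="2000"
      num_initial_members="3"
      initial_hosts="hosta[7800],hostb[7800]"/>  <!-- static list disables multicast discovery -->
```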

Note

The discovery phase completes when timeout milliseconds have elapsed or num_initial_members responses have been received.

28.1.3.2. TCPGOSSIP

The TCPGOSSIP protocol only works with a GossipRouter. It works essentially the same way as the PING protocol configuration with valid gossip_host and gossip_port attributes. It works on top of both UDP and TCP transport protocols. Here is an example.
<TCPGOSSIP timeout="2000"
       num_initial_members="3"
       initial_hosts="192.168.5.1[12000],192.168.0.2[12000]"/>
The available attributes in the TCPGOSSIP element are listed below.
  • timeout specifies the maximum number of milliseconds to wait for num_initial_members responses. The default is 3000.
  • num_initial_members specifies the minimum number of responses to wait for unless timeout has expired. The default is 2.
  • initial_hosts is a comma-separated list of addresses and ports (for example, host1[12345],host2[23456]) of the GossipRouters with which to register.

28.1.3.3. TCPPING

The TCPPING protocol takes a set of known members and pings them for discovery. The mechanism works on top of TCP.
Here is an example of the TCPPING configuration element in the JGroups config element.
<TCPPING timeout="2000"
     num_initial_members="3"
     initial_hosts="hosta[2300],hostb[3400],hostc[4500]"
     max_dynamic_hosts="3"
     port_range="3"/>
The available attributes in the TCPPING element are listed below.
  • timeout specifies the maximum number of milliseconds to wait for num_initial_members responses. The default is 3000.
  • num_initial_members specifies the minimum number of responses to wait for unless timeout has expired. The default is 2.
  • initial_hosts is a comma-separated list of addresses (for example, host1[12345],host2[23456]) for pinging.
  • max_dynamic_hosts specifies the maximum number of hosts that can be dynamically added to the cluster (defaults to 0).
    If dynamic addition of hosts is not allowed, make sure you list all cluster members in the initial_hosts attribute on every member before adding a new node to the cluster, so that the nodes are recognized on server start.
  • port_range specifies the number of consecutive ports to be probed when getting the initial membership, starting with the port specified in the initial_hosts parameter. Given the current values of port_range and initial_hosts above, the TCPPING layer will try to connect to hosta[2300], hosta[2301], hosta[2302], hostb[3400], hostb[3401], hostb[3402], hostc[4500], hostc[4501], and hostc[4502]. This configuration option allows for multiple possible ports on the same host to be pinged without having to spell out all possible combinations. If in your TCP protocol configuration your end_port is greater than your start_port, we recommend using a TCPPING port_range equal to the difference, to ensure a node is pinged no matter which port it is bound to within the allowed range.
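To illustrate the recommendation above, the following sketch pairs a TCP transport that may bind anywhere in a three-port window with a TCPPING configuration whose port_range is wide enough to probe the whole window. The host names are illustrative; following the probing semantics described above, each listed host is probed on ports 7800 through 7802.

```xml
<TCP singleton_name="tcp"
     start_port="7800" end_port="7802"/>  <!-- a node may bind anywhere in 7800-7802 -->
<TCPPING timeout="2000"
         num_initial_members="3"
         initial_hosts="hosta[7800],hostb[7800]"
         port_range="3"/>                 <!-- probe 7800, 7801 and 7802 on each host -->
```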

28.1.3.4. MPING

MPING uses IP multicast to discover the initial membership. Unlike the other discovery protocols, which delegate the sending and receiving of discovery messages on the network to the transport protocol, MPING opens its own sockets to send and receive multicast discovery messages. As a result it can be used with all transports, but it is most often used with TCP. TCP usually requires TCPPING, which must explicitly list all possible group members. MPING does not have this requirement, and is typically used where TCP is required for regular message transport, and UDP multicasting is allowed for discovery.
<MPING timeout="2000"
    num_initial_members="3"
    bind_to_all_interfaces="true"
    mcast_addr="228.8.8.8"
    mcast_port="7500"
    ip_ttl="8"/>
The available attributes in the MPING element are listed below.
  • timeout specifies the maximum number of milliseconds to wait for responses. The default is 3000.
  • num_initial_members specifies the minimum number of responses to wait for unless timeout has expired. The default is 2.
  • bind_addr specifies the interface on which to send and receive multicast packets. By default JGroups uses the value of the system property jgroups.bind_addr, which can be set with the -b command line switch. See Section 28.6, “Other Configuration Issues” for more on binding JGroups sockets.
  • bind_to_all_interfaces overrides bind_addr and uses all interfaces on multihomed nodes.
  • mcast_addr, mcast_port, ip_ttl attributes are the same as related attributes in the UDP protocol configuration.
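Putting this together, a TCP-based stack that uses UDP multicast only for the discovery phase might begin as follows; the multicast address and ports are illustrative.

```xml
<TCP singleton_name="tcp"
     start_port="7800" end_port="7800"/>  <!-- regular messages travel over TCP -->
<MPING timeout="2000"
       num_initial_members="3"
       mcast_addr="228.8.8.8"             <!-- discovery alone uses UDP multicast -->
       mcast_port="7500"
       ip_ttl="2"/>
```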

28.1.4. Failure Detection Protocols

The failure detection protocols are used to detect failed nodes. Once a failed node is detected, a suspect verification phase can occur. If the node is still considered dead after this phase is complete, the cluster updates its membership view so that further messages are not sent to the failed node. The service using JGroups is informed that the node is no longer part of the cluster. Failure detection protocols are configured as sub-elements in the JGroups <config> element.

28.1.4.1. FD

FD is a failure detection protocol based on 'heartbeat' messages. This protocol requires that each node periodically ping its neighbor. If the neighbor fails to respond, the calling node sends a SUSPECT message to the cluster. The current group coordinator can optionally verify that the suspected node is dead (VERIFY_SUSPECT). If the node is still considered dead after this verification step, the coordinator updates the cluster's membership view. The following is an example of FD configuration:
<FD timeout="6000"
    max_tries="5"
    shun="true"/>
The available attributes in the FD element are listed below.
  • timeout specifies the maximum number of milliseconds to wait for the responses to the are-you-alive messages. The default is 3000.
  • max_tries specifies the number of missed are-you-alive messages from a node before the node is suspected. The default is 2.
  • shun specifies whether a failed node will be forbidden from sending messages to the group without formally rejoining. A shunned node would need to rejoin the cluster via the discovery process. JGroups allows applications to configure a channel such that, when a channel is shunned, the process of rejoining the cluster and transferring state takes place automatically. This is the default behavior of JBoss Enterprise Application Platform.

Note

Regular traffic from a node is proof of life, so heartbeat messages are only sent when no regular traffic is detected on the node for a long period of time.

28.1.4.2. FD_SOCK

FD_SOCK is a failure detection protocol based on a ring of TCP sockets created between group members. Each member in a group connects to its neighbor, with the final member connecting to the first, forming a ring. Node B becomes suspected when its neighbor, Node A, detects an abnormally closed TCP socket, presumably due to a crash in Node B. (When nodes intend to leave the group, they inform their neighbors so that they do not become suspected.)
The simplest FD_SOCK configuration does not take any attribute. You can declare an empty FD_SOCK element in the JGroups <config> element.
<FD_SOCK/>
The attributes available to the FD_SOCK element are listed below.
  • bind_addr specifies the interface to which the server socket should be bound. By default, JGroups uses the value of the system property jgroups.bind_addr. This system property can be set with the -b command line switch. For more information about binding JGroups sockets, see Section 28.6, “Other Configuration Issues”.
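For example, to bind the FD_SOCK server socket to a particular interface rather than relying on the jgroups.bind_addr system property, bind_addr can be set directly (the address below is purely illustrative):
<FD_SOCK bind_addr="192.168.0.2"/>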

28.1.4.3. VERIFY_SUSPECT

This protocol verifies whether a suspected member is really dead by pinging that member once again. This verification is performed by the coordinator of the cluster. The suspected member is dropped from the cluster group if confirmed to be dead. The aim of this protocol is to minimize false suspicions. Here is an example:
<VERIFY_SUSPECT timeout="1500"/>
The available attributes in the VERIFY_SUSPECT element are listed below.
  • timeout specifies how long to wait for a response from the suspected member before considering it dead.

28.1.4.4. FD versus FD_SOCK

FD and FD_SOCK, each taken individually, do not provide a solid failure detection layer. Let us look at the differences between these failure detection protocols to understand how they complement each other:
  • FD:
    • An overloaded machine might be slow in sending are-you-alive responses.
    • A member will be suspected when suspended in a debugger/profiler.
    • Low timeouts lead to higher probability of false suspicions and higher network traffic.
    • High timeouts will not detect and remove crashed members for some time.
  • FD_SOCK:
    • A member suspended in a debugger or profiler is not suspected, because its TCP connection remains open.
    • For the same reason, high load on a member does not cause false suspicions.
    • Members will only be suspected when their TCP connection breaks, so hung members will not be detected.
    • Also, a crashed switch will not be detected until the connection runs into the TCP timeout (between 2-20 minutes, depending on TCP/IP stack implementation).
A failure detection layer is intended to report real failures promptly, while avoiding false suspicions. There are two solutions:
  1. By default, JGroups configures the FD_SOCK socket with KEEP_ALIVE, which means that TCP sends a heartbeat on a socket on which no traffic has been received for two hours. If a host crashed (or an intermediate switch or router crashed) without closing the TCP connection properly, we would detect this after two hours (plus a few minutes). This is of course better than never closing the connection (if KEEP_ALIVE is off), but may not be of much help. So, the first solution is to lower the KEEP_ALIVE timeout. In most operating systems this can only be done for the entire kernel, so lowering it to, say, fifteen minutes affects all TCP sockets on the host.
  2. The second solution is to combine FD_SOCK and FD; the timeout in FD can be set such that it is much lower than the TCP timeout, and this can be configured individually per process. FD_SOCK will already generate a suspect message if the socket was closed abnormally. However, in the case of a crashed switch or host, FD will make sure the socket is eventually closed and the suspect message generated. Example:
<FD_SOCK/>
<FD timeout="6000" max_tries="5" shun="true"/>
<VERIFY_SUSPECT timeout="1500"/>
In this example, a member becomes suspected when its neighbor's socket is closed abnormally, for instance after a process crash, since the operating system closes all of that process's sockets. However, if a host or switch crashes, the sockets will not be closed; FD will then suspect the neighbor after roughly thirty seconds (a max_tries of 5 multiplied by the 6000 millisecond timeout). Note that if this example system were stopped at a breakpoint in the debugger, the node being debugged would be suspected once that interval elapsed.
A combination of FD and FD_SOCK provides a solid failure detection layer, which is why this technique is used across the JGroups configurations included with JBoss Enterprise Application Platform.

28.1.5. Reliable Delivery Protocols

Reliable delivery protocols within the JGroups stack ensure that messages are actually delivered, and delivered in the correct order (First In, First Out, or FIFO) to the destination node. The basis for reliable message delivery is positive and negative delivery acknowledgments (ACK and NAK). In ACK mode, the sender resends the message until acknowledgment is received from the receiver. In NAK mode, the receiver requests re-transmission when it discovers a gap.
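Both modes typically appear together in a UDP-based stack: pbcast.NAKACK provides NAK-based delivery for multicast messages, while UNICAST provides ACK-based delivery for unicast messages. The following abbreviated stack is an illustrative sketch of where the two protocols sit relative to the transport, discovery, and failure detection layers; the attribute values are examples only and are discussed in the subsections below.
<config>
    <UDP mcast_addr="228.1.2.3" mcast_port="45566"/>
    <PING timeout="2000" num_initial_members="3"/>
    <FD_SOCK/>
    <FD timeout="6000" max_tries="5" shun="true"/>
    <VERIFY_SUSPECT timeout="1500"/>
    <pbcast.NAKACK retransmit_timeout="300,600,1200,2400,4800"/>
    <UNICAST timeout="300,600,1200,2400,3600"/>
    <pbcast.GMS join_timeout="3000" shun="true"/>
</config>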

28.1.5.1. UNICAST

The UNICAST protocol is used for unicast messages. It uses positive acknowledgements (ACK). It is configured as a sub-element under the JGroups config element. If the JGroups stack is configured with the TCP transport protocol, UNICAST is not necessary because TCP itself guarantees FIFO delivery of unicast messages. Here is an example configuration for the UNICAST protocol:
<UNICAST timeout="300,600,1200,2400,3600"/>
There is only one configurable attribute in the UNICAST element.
  • timeout specifies the re-transmission timeout (in milliseconds). For instance, if the timeout is 100,200,400,800, the sender resends the message if it has not received an ACK after 100 milliseconds the first time, and the second time it waits for 200 milliseconds before re-sending, and so on. A low value for the first timeout allows for prompt re-transmission of dropped messages, but means that messages may be transmitted more than once if they have not actually been lost (that is, the message has been sent, but the ACK has not been received before the timeout). High values (1000,2000,3000) can improve performance if the network is tuned such that UDP datagram loss is infrequent. High values on networks with frequent losses will be harmful to performance, since later messages will not be delivered until lost messages have been re-transmitted.
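For instance, on a network tuned so that UDP datagram loss is rare, a configuration with higher timeouts (the values below are illustrative) avoids redundant re-transmissions at the cost of slower recovery when a message really is lost:
<UNICAST timeout="1000,2000,3000"/>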

28.1.5.2. NAKACK

The NAKACK protocol is used for multicast messages. It uses negative acknowledgements (NAK). Under this protocol, each message is tagged with a sequence number. The receiver keeps track of the received sequence numbers and delivers the messages in order. When a gap in the series of received sequence numbers is detected, the receiver schedules a task to periodically ask the sender to re-transmit the missing message. The task is canceled if the missing message is received. NAKACK protocol is configured as the pbcast.NAKACK sub-element under the JGroups <config> element. Here is an example configuration:
<pbcast.NAKACK max_xmit_size="60000" use_mcast_xmit="false" 
   retransmit_timeout="300,600,1200,2400,4800" gc_lag="0"
   discard_delivered_msgs="true"/>
The configurable attributes in the pbcast.NAKACK element are as follows.
  • retransmit_timeout specifies the series of timeouts (in milliseconds) after which re-transmission is requested if a missing message has not yet been received.
  • use_mcast_xmit determines whether the sender should send the re-transmission to the entire cluster rather than just to the node requesting it. This is useful when the sender's network layer tends to drop packets, avoiding the need to individually re-transmit to each node.
  • max_xmit_size specifies the maximum size (in bytes) for a bundled re-transmission, if multiple messages are reported missing.
  • discard_delivered_msgs specifies whether to discard delivered messages on the receiver nodes. By default, nodes save delivered messages so any node can re-transmit a lost message in case the original sender has crashed or left the group. However, if we only ask the sender to resend its messages, we can enable this option and discard delivered messages.
  • gc_lag specifies the number of messages to keep in memory for re-transmission, even after the periodic cleanup protocol (see Section 28.4, “Distributed Garbage Collection (STABLE)”) indicates all peers have received the message. The default value is 20.
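As an illustrative variant, if many receivers tend to miss the same packets, re-transmitting via multicast spares the sender from re-sending to each node individually; and since the sender handles its own re-transmissions, receivers can discard delivered messages:
<pbcast.NAKACK max_xmit_size="60000" use_mcast_xmit="true"
   retransmit_timeout="300,600,1200,2400,4800" gc_lag="0"
   discard_delivered_msgs="true"/>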

28.1.6. Group Membership (GMS)

The group membership service (GMS) protocol in the JGroups stack maintains a list of active nodes. It handles the requests to join and leave the cluster. It also handles the SUSPECT messages sent by failure detection protocols. All nodes in the cluster, as well as any interested services like JBoss Cache or HAPartition, are notified if the group membership changes. The group membership service is configured in the pbcast.GMS sub-element under the JGroups config element. Here is an example configuration.
<pbcast.GMS print_local_addr="true"
    join_timeout="3000"
    join_retry_timeout="2000"
    shun="true"
    view_bundling="true"/>
The configurable attributes in the pbcast.GMS element are as follows.
  • join_timeout specifies the maximum number of milliseconds to wait for a new node's JOIN request to succeed before retrying.
  • join_retry_timeout specifies the number of milliseconds to wait after a failed JOIN before trying again.
  • print_local_addr specifies whether to dump the node's own address to the standard output when started.
  • shun specifies whether a node should shun (that is, disconnect) itself if it receives a cluster view in which it is not a member node.
  • disable_initial_coord specifies whether to prevent this node from becoming the cluster coordinator during the initial connection of the channel. This flag does not prevent a node becoming the coordinator after the initial channel connection, if the current coordinator leaves the group.
  • view_bundling specifies whether multiple JOIN or LEAVE requests arriving at the same time are bundled and handled together at the same time, resulting in only one new view that incorporates all changes. This is more efficient than handling each request separately.
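For example, a node that should only ever join an existing cluster, never start one, could add disable_initial_coord to the configuration shown above (the values here are illustrative):
<pbcast.GMS print_local_addr="true"
    join_timeout="3000"
    join_retry_timeout="2000"
    shun="true"
    disable_initial_coord="true"
    view_bundling="true"/>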

28.1.7. Flow Control (FC)

The flow control (FC) protocol adapts the rate at which data is sent to the rate at which it can be received. If a sender node is too fast, it might overwhelm the receiver node, resulting in out-of-memory conditions or dropped packets that have to be re-transmitted. In JGroups, flow control is implemented via a credit-based system. The sender and receiver nodes start with the same number of credits (bytes). The sender deducts the number of bytes in each message it sends from its credits, and the receiver accumulates credits for the bytes in the messages it receives. When the sender's credits drop below a threshold, the receiver sends more credits to the sender. If the sender's credits are used up, the sender blocks until it receives credits from the receiver. The flow control protocol is configured in the FC sub-element under the JGroups config element. Here is an example configuration.
<FC max_credits="2000000"
    min_threshold="0.10" 
    ignore_synchronous_response="true"/>
The configurable attributes in the FC element are as follows.
  • max_credits specifies the maximum number of credits (in bytes). This value should be smaller than the JVM heap size.
  • min_credits specifies the minimum number of bytes that must be received before the receiver will send more credits to the sender.
  • min_threshold specifies the percentage of the max_credits that should be used to calculate min_credits. Setting this overrides the min_credits attribute.
  • ignore_synchronous_response specifies whether threads that have carried messages up to the application should be allowed to carry outgoing messages back down through FC without blocking for credits. Synchronous response refers to the fact that such outgoing messages are generally responses to incoming RPC-type messages. Preventing these JGroups threads from blocking in FC helps avoid certain deadlock scenarios, so we recommend setting this attribute to true.
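The min_threshold attribute in the earlier example is simply a proportional way of setting min_credits: with max_credits of 2000000 bytes and a threshold of 0.10, it is equivalent to setting min_credits to 200000 bytes directly, as in this illustrative configuration:
<FC max_credits="2000000"
    min_credits="200000"
    ignore_synchronous_response="true"/>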

Note

FC is required for group communication where group messages must be sent at the highest speed that the slowest receiver can handle. For example, say we have a cluster composed of nodes A, B, C and D. D is slow (perhaps overloaded), while the rest are fast. When A sends a group message, it does so via TCP connections: A-A (theoretically), A-B, A-C and A-D.
Say A sends 100 million messages to the cluster. TCP's flow control applies to A-B, A-C and A-D individually, but not to A-BCD as a group. Therefore, A, B and C will receive the 100 million messages, but D will receive only 1 million. (This is also why NAKACK is required, even though TCP handles its own re-transmission.)
JGroups must buffer all messages in memory in case an original sender S dies and a node requests re-transmission of a message sent by S. Since all members buffer all messages that they receive, stable messages (messages seen by every node) must sometimes be purged. (The purging process is managed by the STABLE protocol. For more information, see Section 28.4, “Distributed Garbage Collection (STABLE)”.)
In the above case, the slow node D prevents the group from purging messages beyond the first million, so every member buffers 99 million messages! In most cases this leads to OutOfMemoryError exceptions. Note that, although the sliding window protocol in TCP will cause writes to block if the window is full, we assume in the above case that writes to B and C are still much faster than writes to D.
So, in summary, even with TCP we need FC to ensure that messages are sent at a rate the slowest receiver (D) can handle.

Note

This depends on how the application uses the JGroups channel. Referring to the example above, if there was something about the application that would naturally cause A to slow down its rate of sending because D was not keeping up, then FC would not be needed.
A good example of such an application is one that uses JGroups to make synchronous group RPC calls. By synchronous, we mean the thread that makes the call blocks waiting for responses from all the members of the group. In that kind of application, the threads on A that are making calls would block waiting for responses from D, thus naturally slowing the overall rate of calls.
A JBoss Cache cluster configured for REPL_SYNC is a good example of an application that makes synchronous group RPC calls. If a channel is only used for a cache configured for REPL_SYNC, we recommend you remove FC from its protocol stack.
And, of course, if your cluster only consists of two nodes, including FC in a TCP-based protocol stack is unnecessary. There is no group beyond the single peer-to-peer relationship, and TCP's internal flow control will handle that just fine.
Another case where FC may not be needed is for a channel used by a JBoss Cache configured for buddy replication and a single buddy. Such a channel will in many respects act like a two node cluster, where messages are only exchanged with one other node, the buddy. (There may be other messages related to data gravitation that go to all members, but in a properly engineered buddy replication use case these should be infrequent. But if you remove FC be sure to load test your application.)