Chapter 23. JGroups Services

JGroups provides the underlying group communication support for JBoss Enterprise Web Platform clusters. The interaction of clustered services with JGroups was covered in Section 17.1, “Group Communication with JGroups”. This chapter focuses on the details of this interaction, with particular attention to configuration details and troubleshooting tips.
This chapter is not intended as complete JGroups documentation. If you want to know more about JGroups, consult the JGroups project documentation.
The first section of this chapter covers the many JGroups configuration options in detail. JBoss Enterprise Web Platform ships with a set of default JGroups configurations. Most applications will work with the default configurations out of the box. You will only need to edit these configurations when you deploy an application with special network or performance requirements.

23.1. Configuring a JGroups Channel's Protocol Stack

The JGroups framework provides services to enable peer-to-peer communication between nodes in a cluster. Communication occurs over a communication channel. The channel is built up from a stack of network communication protocols, each of which is responsible for adding a particular capability to the overall behavior of the channel. Key capabilities provided by various protocols include transport, cluster discovery, message ordering, lossless message delivery, detection of failed peers, and cluster membership management services.
Figure 23.1, “Protocol stack in JGroups” shows a conceptual cluster with each member's channel composed of a stack of JGroups protocols.
Figure 23.1. Protocol stack in JGroups


This section covers some of the most commonly used protocols, according to the type of behavior they add to the channel. We discuss a few key configuration attributes exposed by each protocol, but since these attributes should be altered only by experts, this chapter focuses on familiarizing users with the purpose of the various protocols.
The JGroups configurations used in JBoss Enterprise Web Platform appear as nested elements in the $JBOSS_HOME/server/production/cluster/jgroups-channelfactory.sar/META-INF/jgroups-channelfactory-stacks.xml file. This file is parsed by the ChannelFactory service, which uses the contents to provide correctly configured channels to the clustered services that require them. See Section 17.1.1, “The Channel Factory Service” for more on the ChannelFactory service.
The following is an example protocol stack configuration from jgroups-channelfactory-stacks.xml:
<stack name="udp-async"
           description="Same as the default 'udp' stack above, except message bundling
                        is enabled in the transport protocol (enable_bundling=true). 
                        Useful for services that make high-volume asynchronous 
                        RPCs (e.g. high volume JBoss Cache instances configured 
                        for REPL_ASYNC) where message bundling may improve performance.">
        <config>
          <UDP
             singleton_name="udp-async"
             mcast_port="${jboss.jgroups.udp_async.mcast_port:45689}"
             mcast_addr="${jboss.partition.udpGroup:228.11.11.11}"
             tos="8"
             ucast_recv_buf_size="20000000"
             ucast_send_buf_size="640000"
             mcast_recv_buf_size="25000000"
             mcast_send_buf_size="640000"
             loopback="true"
             discard_incompatible_packets="true"
             enable_bundling="true"
             max_bundle_size="64000"
             max_bundle_timeout="30"
             ip_ttl="${jgroups.udp.ip_ttl:2}"
             thread_naming_pattern="cl"
             timer.num_threads="12"
             enable_diagnostics="${jboss.jgroups.enable_diagnostics:true}"
             diagnostics_addr="${jboss.jgroups.diagnostics_addr:224.0.0.75}"
             diagnostics_port="${jboss.jgroups.diagnostics_port:7500}"

             thread_pool.enabled="true"
             thread_pool.min_threads="8"
             thread_pool.max_threads="200"
             thread_pool.keep_alive_time="5000"
             thread_pool.queue_enabled="true"
             thread_pool.queue_max_size="1000"
             thread_pool.rejection_policy="discard"
      
             oob_thread_pool.enabled="true"
             oob_thread_pool.min_threads="8"
             oob_thread_pool.max_threads="200"
             oob_thread_pool.keep_alive_time="1000"
             oob_thread_pool.queue_enabled="false"
             oob_thread_pool.rejection_policy="discard"/>
          <PING timeout="2000" num_initial_members="3"/>
          <MERGE2 max_interval="100000" min_interval="20000"/>
          <FD_SOCK/>
          <FD timeout="6000" max_tries="5" shun="true"/>
          <VERIFY_SUSPECT timeout="1500"/>
          <BARRIER/>
          <pbcast.NAKACK use_mcast_xmit="true" gc_lag="0"
                   retransmit_timeout="300,600,1200,2400,4800"
                   discard_delivered_msgs="true"/>
          <UNICAST timeout="300,600,1200,2400,3600"/>
          <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="400000"/>          
          <VIEW_SYNC avg_send_interval="10000"/>
          <pbcast.GMS print_local_addr="true" join_timeout="3000"
                   shun="true"
                   view_bundling="true"
                   view_ack_collection_timeout="5000"
                   resume_task_timeout="7500"/>
          <FC max_credits="2000000" min_threshold="0.10" 
              ignore_synchronous_response="true"/>
          <FRAG2 frag_size="60000"/>
          <!-- pbcast.STREAMING_STATE_TRANSFER/ -->
          <pbcast.STATE_TRANSFER/>
          <pbcast.FLUSH timeout="0" start_flush_timeout="10000"/>
        </config>
    </stack>
The <config> element contains all the configuration data for JGroups. This information is used to configure a JGroups channel, which is conceptually similar to a socket, and manages communication between peers in a cluster. Each element within the <config> element defines a particular JGroups protocol. Each protocol performs one function. The combination of these functions defines the characteristics of the channel as a whole. The next few sections describe common protocols and explain the options available to each.

23.1.1. Common Configuration Properties

The following property is exposed by all of the JGroups protocols discussed below:
  • stats — whether the protocol should gather runtime statistics on its operations, which can be exposed via tools like the AS's JMX console or the JGroups Probe utility. What statistics are gathered, if any, depends on the protocol. The default value is true.
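For example, statistics gathering could be disabled for a single protocol by setting the attribute directly on its element, as in this sketch (shown on FD purely for illustration):
<FD timeout="6000" max_tries="5" shun="true" stats="false"/>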

Note

Past versions of JGroups used down_thread and up_thread attributes. These attributes are no longer used. A WARN message will be written to the server log if they are configured for any protocol.

23.1.2. Transport Protocols

The transport protocols send messages to and receive messages from the network. They also manage the thread pools used to deliver incoming messages up the protocol stack. JGroups supports UDP, TCP, and TUNNEL as transport protocols.

Note

The UDP, TCP, and TUNNEL protocols are mutually exclusive. You can only have one transport protocol in each JGroups Config element.

23.1.2.1. UDP configuration

UDP is the preferred transport protocol for JGroups. UDP uses multicast (or, in an unusual configuration, multiple unicasts) to send and receive messages. If you choose UDP as the transport protocol for your cluster service, you need to configure it in the UDP sub-element in the JGroups config element. Here is an example.
          <UDP
             singleton_name="udp-async"
             mcast_port="${jboss.jgroups.udp_async.mcast_port:45689}"
             mcast_addr="${jboss.partition.udpGroup:228.11.11.11}"
             tos="8"
             ucast_recv_buf_size="20000000"
             ucast_send_buf_size="640000"
             mcast_recv_buf_size="25000000"
             mcast_send_buf_size="640000"
             loopback="true"
             discard_incompatible_packets="true"
             enable_bundling="true"
             max_bundle_size="64000"
             max_bundle_timeout="30"
             ip_ttl="${jgroups.udp.ip_ttl:2}"
             thread_naming_pattern="cl"
             timer.num_threads="12"
             enable_diagnostics="${jboss.jgroups.enable_diagnostics:true}"
             diagnostics_addr="${jboss.jgroups.diagnostics_addr:224.0.0.75}"
             diagnostics_port="${jboss.jgroups.diagnostics_port:7500}"

             thread_pool.enabled="true"
             thread_pool.min_threads="8"
             thread_pool.max_threads="200"
             thread_pool.keep_alive_time="5000"
             thread_pool.queue_enabled="true"
             thread_pool.queue_max_size="1000"
             thread_pool.rejection_policy="discard"
      
             oob_thread_pool.enabled="true"
             oob_thread_pool.min_threads="8"
             oob_thread_pool.max_threads="200"
             oob_thread_pool.keep_alive_time="1000"
             oob_thread_pool.queue_enabled="false"
             oob_thread_pool.rejection_policy="discard"/>
JGroups transport configurations have a number of attributes available. First we look at the attributes available to the UDP protocol, followed by the attributes that are also used by the TCP and TUNNEL transport protocols.
The attributes particular to the UDP protocol are:
ip_multicast
Specifies whether to use IP multicasting. The default value is true. If set to false, multiple unicast packets are sent instead of one multicast packet. In either case, packets are sent as UDP datagrams.
mcast_addr
Specifies the multicast address (class D) for communicating with the group. The value of system property jboss.partition.udpGroup is used as the value for this attribute, if set. (Set this property at startup with the -u command line switch.) If omitted, the default value is 228.11.11.11.
mcast_port
Specifies the port to use for multicast communication with the group. If omitted, the default value is 45688. See Section 23.6.2, “Isolating JGroups Channels” to ensure JGroups channels are properly isolated from each other.
mcast_send_buf_size, mcast_recv_buf_size, ucast_send_buf_size, ucast_recv_buf_size
Define the socket send and receive buffer sizes that JGroups requests from the operating system. A large buffer helps prevent packets being dropped due to buffer overflow. However, socket buffer sizes are limited at the operating system level, and may require operating system-level configuration. See Section 23.6.2.3, “Improving UDP Performance by Configuring OS UDP Buffer Limits” for details.
bind_port
Specifies the port to which the unicast receive socket is bound. The default value is 0, meaning an ephemeral port is used.
port_range
Specifies the range of ports to try if the bind_port is not available. The default is 1, which specifies that only bind_port will be tried.
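For example, to bind the unicast receive socket to a fixed port, falling back to the next few ports if it is taken, the transport element could be sketched like this (the values are illustrative and other attributes are omitted):
<UDP singleton_name="udp"
     bind_port="45700"
     port_range="5"/>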
ip_ttl
Specifies the time-to-live (TTL) for IP multicast packets. The value here refers to the number of network hops a packet is allowed to make before it is dropped.
tos
Specifies the traffic class for sending unicast and multicast datagrams.
The following attributes are common to all transport protocols:
singleton_name
The unique name of this transport protocol configuration. The ChannelFactory uses this to share transport protocol instances between different channels with the same transport protocol configuration. See Section 17.1.2, “The JGroups Shared Transport” for details.
bind_addr
Specifies the interface that sends and receives messages. By default, JGroups uses the value of system property jgroups.bind_addr. This can be set with the -b command line switch. See Section 23.6, “Other Configuration Issues” for more about binding JGroups sockets.
receive_on_all_interfaces
Specifies that this node should listen on all interfaces for multicasts. The default value is false. Specifying this overrides the bind_addr property for receiving multicasts. (It does not override bind_addr for sending multicasts.)
send_on_all_interfaces
Specifies that the node send UDP packets via all available network interface controllers (NICs). The same multicast message will be sent multiple times so use this attribute with care.
receive_interfaces
A comma-separated list of interfaces on which to receive multicasts, for example, 192.168.5.1,eth1,127.0.0.1. The multicast receive socket will listen on all listed interfaces.
send_interfaces
A comma-separated list of interfaces on which to send multicasts, for example, 192.168.5.1,eth1,127.0.0.1. The multicast sender socket will send on all listed interfaces. The same multicast message will be sent multiple times, so use with care.
enable_bundling
Specifies whether to enable message bundling. If true, the transport protocol queues outgoing messages until max_bundle_size bytes have accumulated or max_bundle_timeout milliseconds have elapsed. The transport protocol then bundles queued messages into one large message and sends it. Messages are unbundled at the receiver. The default value is false.
Message bundling can improve performance where senders do not block waiting for a response from recipients, for example, a JBoss Cache instance configured for REPL_ASYNC. It adds latency to applications where senders must block waiting for responses, so it is not recommended in some circumstances, for example, a JBoss Cache instance configured for REPL_SYNC.
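As a sketch, a transport element for a service making synchronous calls would typically leave bundling disabled (the singleton_name is illustrative and other transport attributes are omitted):
<UDP singleton_name="udp-sync"
     enable_bundling="false"/>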
loopback
Specifies whether the thread should carry a message back up the stack for delivery. (Messages sent to the group are always sent to the sending node as well.) If false, a message delivery pool thread is used instead of the sending thread. false is the default, but true is recommended to ensure that a channel receives its own messages should the network interface fail.
discard_incompatible_packets
Specifies whether to discard packets sent by peers that use a different version of JGroups. If true, messages tagged with a different JGroups version are silently discarded. If false, a warning is logged. In neither case will the message be delivered. The default is false.
enable_diagnostics
Specifies that the transport should open a multicast socket on diagnostics_addr and diagnostics_port to listen for diagnostic requests sent by the JGroups Probe utility.
thread_pool
The various thread_pool attributes configure the behavior of the pool of threads JGroups uses to carry incoming messages up the stack. They provide the constructor arguments for an instance of java.util.concurrent.ThreadPoolExecutor.
thread_pool.enabled="true"
thread_pool.min_threads="8"
thread_pool.max_threads="200"
thread_pool.keep_alive_time="5000"
thread_pool.queue_enabled="true"
thread_pool.queue_max_size="1000"
thread_pool.rejection_policy="discard"
Here, the pool has a minimum (core) size of 8 threads and a maximum size of 200. If more than 8 pool threads have been created, a thread returning from carrying a message will wait up to 5000 milliseconds to be assigned a new message, after which it will terminate. If no thread is available to carry a new message, the (separate) thread reading messages from the socket places messages in a queue (thread_pool.queue_enabled), which holds up to 1000 messages. If the queue is full, the thread reading from the socket discards new messages, as specified by the rejection policy.
oob_thread_pool
The various oob_thread_pool attributes are similar to the thread_pool attributes in that they configure an instance of java.util.concurrent.ThreadPoolExecutor used to carry messages up the protocol stack. In this case, the pool carries a special type of message known as an Out-of-Band (OOB) message.
OOB messages are exempt from the ordered delivery requirements of protocols such as NAKACK and UNICAST, and can be delivered up the stack even if messages are queued ahead of them. OOB messages are often used internally by JGroups protocols. They can also be used by applications, for example, when JBoss Cache is in REPL_SYNC mode, it uses OOB messages for the second phase of its two-phase commit protocol.

23.1.2.2. TCP configuration

Alternatively, a JGroups-based cluster can also work over TCP connections. Compared with UDP, TCP generates more network traffic when the cluster size increases. TCP is fundamentally a unicast protocol. To send multicast messages, JGroups uses multiple TCP unicasts. To use TCP as a transport protocol, you should define a TCP element in the JGroups Config element, like so:
<TCP singleton_name="tcp" 
        start_port="7800" end_port="7800"/>
The following attributes are specific to the TCP element:
start_port, end_port
Define the range of TCP ports to which the server should bind. The server socket is bound to the first available port, beginning with start_port. If no available port is found before the server reaches end_port, the server throws an exception. If no end_port is provided, or end_port is lower than start_port, no upper limit is applied to the port range. If start_port is equal to end_port, JGroups is forced to use the port specified, and will fail if it is unavailable. The default value is 7800. If set to 0, the operating system will select a port. (This works only for MPING or TCPGOSSIP. TCPPING requires that nodes and their required ports are listed.)
bind_port
Acts as an alias for start_port; if bind_port is configured, JGroups internally sets start_port to its value.
recv_buf_size, send_buf_size
Define the receive and send buffer sizes. A large buffer size makes packets less likely to be dropped due to buffer overflow.
conn_expire_time
Specifies the time in milliseconds after which a connection can be closed by the reaper if no traffic has been received.
reaper_interval
Specifies the interval in milliseconds at which to run the reaper. If both values are 0, no reaping will be done. If either value is greater than zero, reaping will be enabled. The reaper is disabled by default.
sock_conn_timeout
Specifies the maximum time in milliseconds for socket creation. When a peer hangs during initial discovery, instead of waiting forever, other members will be pinged after this timeout period. This reduces the chances of not finding any members at all. The default value is 2000.
use_send_queues
Specifies whether to use separate send queues for each connection. This prevents blocking on write if the peer hangs. The default value is true.
external_addr
Specifies an external IP address to broadcast to other group members (if not the local address). This is useful for Network Address Translation (NAT). Say a node on a private network exists behind a firewall, but can only be routed to via an externally visible address, not the local address to which it is bound. The node can be configured to broadcast its external address while remaining bound to the local one. This lets you avoid using the TUNNEL protocol and a central gossip router. Without setting the external_addr, the node behind the firewall broadcasts its private address to the other nodes, which will not be able to route to it.
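As a sketch, a node bound to a private address but reachable through NAT might advertise its public address like this (both addresses are hypothetical):
<TCP singleton_name="tcp"
     bind_addr="10.0.1.5"
     start_port="7800" end_port="7800"
     external_addr="203.0.113.10"/>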
skip_suspected_members
Specifies whether unicast messages should not be sent to suspected members. The default value is true.
tcp_nodelay
Specifies TCP_NODELAY. By default, TCP nagles messages (bundles smaller messages together into a larger message). To invoke synchronous cluster method calls, we must disable nagling in addition to disabling message bundling. To do this, set tcp_nodelay to true and enable_bundling to false. The default value for tcp_nodelay is false.
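Combining the two settings, a TCP transport tuned for synchronous cluster method calls might look like this (a sketch based on the attributes described above; the singleton_name is illustrative):
<TCP singleton_name="tcp-sync"
     start_port="7800" end_port="7800"
     tcp_nodelay="true"
     enable_bundling="false"/>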

Note

All of the attributes common to all protocols discussed in the UDP protocol section also apply to TCP.

23.1.2.3. TUNNEL configuration

The TUNNEL protocol uses an external router known as the GossipRouter to send messages. Each node must register with this router. All messages are sent to the router and forwarded to their destinations. The TUNNEL approach can be used to set up communication with nodes behind firewalls. A node can establish a TCP connection to the GossipRouter through the firewall via port 80. This connection is also used by the router to send messages to nodes behind the firewall, since most firewalls do not permit outside hosts to initiate a TCP connection to a host inside the firewall. The TUNNEL configuration is defined in the TUNNEL sub-element in the JGroups Config element, like so:
<TUNNEL  singleton_name="tunnel"
            router_port="12001"
            router_host="192.168.5.1"/>
The available attributes in the TUNNEL element are listed below.
router_host
Specifies the host on which the GossipRouter runs.
router_port
Specifies the port on which the GossipRouter listens.
reconnect_interval
Specifies the interval, in milliseconds, at which TUNNEL attempts to connect to the GossipRouter while the connection is not established.

Note

All of the attributes common to all protocols discussed in the UDP protocol section also apply to TUNNEL.

23.1.3. Discovery Protocols

When a channel on a node first connects, it must determine which other nodes are running compatible channels, and which of these nodes is currently acting as the coordinator (the node responsible for letting new nodes join the group). Discovery protocols are used to find active nodes in the cluster and to determine which is the coordinator. This information is then provided to the group membership protocol (GMS), which communicates with the coordinator's GMS to add the newly-connecting node to the group. (For more information about group membership protocols, see Section 23.1.6, “Group Membership (GMS)”.)
Discovery protocols also assist merge protocols (see Section 23.5, “Merging (MERGE2)”) to detect cluster-split situations.
The discovery protocols sit on top of the transport protocol, so you can choose to use different discovery protocols depending on your transport protocol. These are also configured as sub-elements in the JGroups <Config> element.

23.1.3.1. PING

PING is a discovery protocol that works by either multicasting PING requests to an IP multicast address or connecting to a GossipRouter. As such, PING normally sits on top of the UDP or TUNNEL transport protocols. Each node responds with a packet {C, A}, where C is the coordinator's address and A is the node's own address. After timeout milliseconds or num_initial_members replies, the joiner determines the coordinator from the responses and sends a JOIN request to it (handled by the GMS protocol). If no node responds, it assumes it is the first member of the group.
Here is an example PING configuration for IP multicast.
<PING timeout="2000"
    num_initial_members="3"/>
Here is another example PING configuration for contacting a GossipRouter.
<PING gossip_host="localhost"
      gossip_port="1234"
      timeout="2000"
      num_initial_members="3"/>
The available attributes in the PING element are listed below.
timeout
Specifies the maximum number of milliseconds to wait for any responses. The default value is 3000.
num_initial_members
Specifies the maximum number of responses to wait for unless the timeout has expired. The default value is 2.
gossip_host
Specifies the host on which the GossipRouter is running.
gossip_port
Specifies the port on which the GossipRouter is listening.
gossip_refresh
Specifies the interval, in milliseconds, for the lease from the GossipRouter. The default value is 20000.
initial_hosts
A comma-separated list of addresses to ping for discovery, for example, host1[12345],host2[23456].
If both gossip_host and gossip_port are defined, the cluster uses the GossipRouter for the initial discovery. If initial_hosts is specified, the cluster pings that static list of addresses for discovery. Otherwise, the cluster uses IP multicasting for discovery.

Note

The discovery phase returns when the timeout period has elapsed or num_initial_members responses have been received.

23.1.3.2. TCPGOSSIP

The TCPGOSSIP protocol only works with a GossipRouter. It works similarly to the PING protocol configuration with valid gossip_host and gossip_port attributes. It works on top of both UDP and TCP transport protocols, like so:
<TCPGOSSIP timeout="2000"
       num_initial_members="3"
       initial_hosts="192.168.5.1[12000],192.168.0.2[12000]"/>
The available attributes in the TCPGOSSIP element are listed below.
timeout
Specifies the maximum number of milliseconds to wait for any responses. The default value is 3000.
num_initial_members
Specifies the maximum number of responses to wait for unless timeout has expired. The default value is 2.
initial_hosts
A comma-separated list of addresses for GossipRouters to register with, for example, host1[12345],host2[23456].

23.1.3.3. TCPPING

The TCPPING protocol takes a set of known members and pings them for discovery. This is a static configuration. It works on top of TCP. Here is an example of the TCPPING configuration sub-element in the JGroups Config element.
<TCPPING timeout="2000"
     num_initial_members="3"
     initial_hosts="hosta[2300],hostb[3400],hostc[4500]"
     port_range="3"/>
The available attributes in the TCPPING element are listed below.
timeout
Specifies the maximum number of milliseconds to wait for any responses. The default value is 3000.
num_initial_members
Specifies the maximum number of responses to wait for unless the timeout has expired. The default value is 2.
initial_hosts
A comma-separated list of addresses to ping for discovery, for example, host1[12345],host2[23456].
port_range
Specifies the number of consecutive ports to be probed when getting the initial membership, starting from the port specified in the initial_hosts parameter. Given the values of port_range and initial_hosts given in the example code, the TCPPING layer will try to connect to hosta:2300, hosta:2301, hosta:2302, hostb:3400, hostb:3401, hostb:3402, hostc:4500, hostc:4501 and hostc:4502. The configuration options allow for multiple nodes on the same host to be pinged.

23.1.3.4. MPING

MPING uses IP multicast to discover the initial membership. It can be used with all transports, but usually this is used in combination with TCP. TCP usually requires TCPPING, which has to list all group members explicitly, but MPING does not have this requirement. The typical use case for this is when we want TCP as transport, but multicasting for discovery so we don't have to define a static list of initial hosts in TCPPING or require an external GossipRouter.
<MPING timeout="2000"
    num_initial_members="3"
    bind_to_all_interfaces="true"
    mcast_addr="228.8.8.8"
    mcast_port="7500"
    ip_ttl="8"/>
The available attributes in the MPING element are listed below.
timeout
Specifies the maximum number of milliseconds to wait for any responses. The default value is 3000.
num_initial_members
Specifies the maximum number of responses to wait for unless timeout has expired. The default value is 2.
bind_addr
Specifies the interface on which to send and receive multicast packets.
bind_to_all_interfaces
Overrides the bind_addr value and uses all interfaces on multihomed nodes.
mcast_addr
Specifies the multicast address for joining a cluster. If omitted, the default is 228.8.8.8.
mcast_port
Specifies the multicast port number. If omitted, the default is 45566.
ip_ttl
Specifies the time to live (TTL) for IP multicast packets. TTL is the common term in multicast networking, but the value actually refers to how many network hops a packet will be allowed to travel before networking equipment drops it.

23.1.4. Failure Detection Protocols

The failure detection protocols are used to detect failed nodes. Once a failed node is detected, a suspect verification phase can occur. If the node is still considered dead after this phase, the cluster updates its view so that the load balancer and client interceptors know to avoid the dead node. The failure detection protocols are configured as sub-elements in the JGroups MBean Config element.

23.1.4.1. FD

FD is a failure detection protocol based on heartbeat messages. This protocol requires each node to periodically send messages to its neighbour to check that the neighbour is alive. If the neighbour fails to respond, the calling node sends a SUSPECT message to the cluster. The current group coordinator can optionally double check whether the suspected node is indeed dead. If the node is still considered dead after this check, the group coordinator updates the cluster's view. Here is an example FD configuration:
<FD timeout="6000"
    max_tries="5"
    shun="true"/>
The available attributes in the FD element are listed below.
timeout
Specifies the maximum number of milliseconds to wait for a response to the heartbeat messages. The default value is 3000.
max_tries
Specifies the number of heartbeat messages that a node can fail to reply to before the node is suspected. The default value is 2.
shun
Specifies whether a failed node will be shunned. Once shunned, the node will be expelled from the cluster even if it is later revived. The shunned node would have to rejoin the cluster through the discovery process. You can configure JGroups so that shunning leads to automatic rejoins and state transfer (the default behavior).

Note

Normal node traffic is considered proof of life, so heartbeat messages are sent only when there is no normal traffic to the node for some time.

23.1.4.2. FD_SOCK

FD_SOCK is a failure detection protocol based on a ring of TCP sockets created between group members. Each member in a group connects to its neighbor (last member connects to first) thus forming a ring. Member B is suspected when its neighbor A detects an abnormally closed TCP socket (presumably due to a node B crash). However, if a member B is about to leave gracefully, it lets its neighbor A know, so that it does not become suspected. The simplest FD_SOCK configuration does not take any attribute. You can just declare an empty FD_SOCK element in JGroups's Config element.
<FD_SOCK/>
The available attributes in the FD_SOCK element are listed below.
bind_addr
Specifies the interface to which the server socket should bind. If -Djgroups.bind_address system property is defined, this XML value will be ignored. This behavior can be reversed by setting the -Djgroups.ignore.bind_addr=true system property.
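For example, the server socket could be pinned to a specific interface like this (the address is hypothetical):
<FD_SOCK bind_addr="192.168.5.2"/>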

23.1.4.3. VERIFY_SUSPECT

This protocol verifies that a suspected member is dead by pinging it a second time. This verification is performed by the coordinator of the cluster. The suspected member is dropped from the cluster group if confirmed dead. The aim of this protocol is to minimize false suspicions. See the following code for an example:
<VERIFY_SUSPECT timeout="1500"/>
The available attributes in the VERIFY_SUSPECT element are listed below.
timeout
Specifies how long to wait for a response from the suspected member before considering it dead.

23.1.4.4. FD versus FD_SOCK

FD and FD_SOCK do not individually provide a solid failure detection layer. Their differences are outlined below to show how they complement each other.

FD

  • An overloaded machine might be slow in sending heartbeat responses.
  • A member will become suspected when suspended in a debugger or profiler.
  • Low timeouts lead to a higher probability of false suspicions and higher network traffic.
  • High timeouts will not detect and remove crashed members for a long period of time.

FD_SOCK

  • Suspension in a debugger does not mean a member will become suspected because the TCP connection remains open.
  • High load is not a problem for the same reason.
  • Members will be suspected only when the TCP connection breaks, so hung members will not be detected.
  • A crashed switch will not be detected until the connection encounters the TCP timeout (between two and twenty minutes, depending on TCP/IP stack implementation).
A failure detection layer aims to report real failures and avoid reporting false suspicions. Two methods of achieving this are outlined in the following paragraphs.
By default, JGroups configures the FD_SOCK socket with KEEP_ALIVE, which means that TCP sends a heartbeat on a socket that has received no traffic in two hours. If a host, or an intermediate switch or router, crashed without closing the TCP connection properly, the failure would be detected shortly after two hours. This is better than never closing the connection (as when KEEP_ALIVE is off), but a two-hour detection delay is rarely acceptable. The first solution, therefore, is to lower the timeout value for KEEP_ALIVE. This is a kernel-wide value on most operating systems and therefore affects all TCP sockets on the host.
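On Linux, for example, the kernel-wide keepalive idle time can be lowered through sysctl. The parameter name below is Linux-specific and the value is illustrative, not a recommendation:

```
# /etc/sysctl.conf (Linux): lower the keepalive idle time from the
# default 7200 seconds (two hours); affects ALL TCP sockets on the host
net.ipv4.tcp_keepalive_time = 300
```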
Alternatively, you can combine FD_SOCK and FD. The timeout in FD can be set such that it is much lower than the TCP timeout. This can be configured on a per-process basis. FD_SOCK generates a SUSPECT message if the socket closes abnormally, but in the case of a crashed switch or host, FD ensures that the socket is eventually closed, and a suspect message generated.
The following code shows how the two could be combined:
<FD_SOCK/>
<FD timeout="6000" max_tries="5" shun="true"/>
<VERIFY_SUSPECT timeout="1500"/>
This code suspects a member when the socket to its neighbour is closed abnormally (for example, in a process crash, since the operating system closes all of the process's sockets). However, if a host or switch crashes, the sockets are not closed. As a secondary line of defense, FD suspects the neighbour after 30 seconds (timeout × max_tries = 6000 ms × 5). Note that if you use this example code and your system is stopped at a breakpoint in a debugger, the node you are debugging will be suspected after those 30 seconds.
Combining FD and FD_SOCK provides a solid failure detection layer. This technique is used across the JGroups configurations included in JBoss Enterprise Web Platform.

23.1.5. Reliable Delivery Protocols

Reliable delivery protocols within the JGroups stack ensure that data packets are delivered in the correct order (FIFO) to the destination node. The basis for reliable message delivery is positive and negative delivery acknowledgments: respectively, ACK and NAK. In ACK mode, the sender resends the message until acknowledgement is received. In NAK mode, the receiver requests retransmission when it discovers a gap.

23.1.5.1. UNICAST

The UNICAST protocol is used for unicast messages and uses positive acknowledgments (ACK). It is configured as a sub-element under the JGroups Config element. UNICAST is not required for JGroups stacks configured with the TCP transport protocol, since TCP itself guarantees FIFO delivery of unicast messages. The following is an example UNICAST configuration:
<UNICAST timeout="300,600,1200,2400,3600"/>
There is only one configurable attribute in the UNICAST element.
timeout
Specifies the retransmission timeouts in milliseconds. For example, if the timeout is "100,200,400,800", the sender resends the message if it has not received an ACK after 100 milliseconds the first time, after a further 200 milliseconds the second time, and so on. A low value for the first timeout allows for prompt retransmission of dropped messages, but means that a message may be sent more than once when only the acknowledgement, not the message itself, was lost. High values can improve performance if the network is tuned such that datagram loss is infrequent.

23.1.5.2. NAKACK

The NAKACK protocol is used for multicast messages. It uses NAK. Under this protocol, each message is tagged with a sequence number. The receiver tracks the sequence numbers to deliver the messages in order. When a gap in the sequence is detected, the receiver asks the sender to retransmit the missing message. The NAKACK protocol is configured as the pbcast.NAKACK sub-element under the JGroups Config element, like so:
<pbcast.NAKACK max_xmit_size="60000" use_mcast_xmit="false" 
   retransmit_timeout="300,600,1200,2400,4800" gc_lag="0"
   discard_delivered_msgs="true"/>
The configurable attributes in the pbcast.NAKACK element are as follows.
retransmit_timeout
Specifies the retransmission timeouts in milliseconds. This is the same as the timeout attribute in the UNICAST protocol.
use_mcast_xmit
Determines whether the sender should retransmit to the entire cluster rather than just to the node requesting the retransmission. This is useful when the packet was dropped by the sender itself: every node is then missing the message, and a single multicast retransmission avoids retransmitting to each node individually.
max_xmit_size
Specifies the maximum size, in bytes, of a bundled retransmission, if multiple packets are reported missing.
discard_delivered_msgs
Specifies whether to discard delivered messages on receiver nodes. By default, all delivered messages are saved. If the original sender can resend a message on request, this option can be enabled so that delivered messages are discarded, saving memory.
gc_lag
Specifies the number of messages to keep in memory for retransmission, even after the periodic cleanup protocol (see Section 23.4, “Distributed Garbage Collection (STABLE)”). The default value is 20.

23.1.6. Group Membership (GMS)

The group membership service in the JGroups stack maintains a list of active nodes. It handles requests to join and leave the cluster. It also handles the SUSPECT messages sent by failure detection protocols. All nodes in the cluster, as well as the load balancer and client side interceptors, are notified if the group membership changes. The group membership service is configured in the pbcast.GMS sub-element under the JGroups Config element, like so:
<pbcast.GMS print_local_addr="true"
    join_timeout="3000"
    join_retry_timeout="2000"
    shun="true"
    view_bundling="true"/>
The configurable attributes in the pbcast.GMS element are as follows.
join_timeout
Specifies the maximum number of milliseconds to wait for a new node's JOIN request to succeed. The request is retried afterward.
join_retry_timeout
Specifies the maximum number of milliseconds to wait after a failed JOIN request before resubmitting the request.
print_local_addr
Specifies whether to dump the node's own address to the output when started.
shun
Specifies whether a node should shun (exclude) itself if it receives a cluster view of which it is not a member.
disable_initial_coord
Specifies whether to prevent this node from becoming the cluster coordinator.
view_bundling
Specifies whether multiple JOIN or LEAVE requests arriving at the same time are bundled together and handled at the same time. This is more efficient than handling each request separately, as it sends only one new view.

23.1.7. Flow Control (FC)

The flow control service adapts the rate at which data is sent to the rate at which it can be received by other nodes. If a sender node is too fast, it might overwhelm a receiver node and cause dropped packets that have to be retransmitted. In JGroups, flow control is implemented with a credit-based system: the sender and receiver nodes start with the same number of credits (bytes). The sender decrements its credits by the number of bytes in each message it sends, while the receiver accumulates credits for the bytes in the messages it receives. When the sender's credits drop below a threshold, the receiver sends some credits back to the sender. If the sender's credits are used up, the sender blocks until it receives credits from the receiver. The flow control service is configured in the FC sub-element under the JGroups Config element. Here is an example configuration.
<FC max_credits="2000000"
    min_threshold="0.10" 
    ignore_synchronous_response="true"/>
The configurable attributes in the FC element are as follows.
max_credits
Specifies the maximum number of credits in bytes. This value should be smaller than the JVM heap size.
min_credits
Specifies the threshold credit on the sender, below which the receiver should send more credits.
min_threshold
Specifies the threshold as a percentage of max_credits. This attribute overrides min_credits.
ignore_synchronous_response
Specifies whether threads that have carried messages to the application are allowed to carry outgoing messages back down through flow control without blocking for credits. Synchronous response refers to these messages usually being responses to incoming RPC-type messages. We recommend setting this to true to help prevent certain deadlock scenarios.

Why is FC needed on top of TCP? TCP has its own flow control!

The FC element is required for group communication where group messages must be sent at the highest speed that the slowest receiver can handle.
Say we have a cluster, {A,B,C,D}. Node D is slow, and the other nodes are fast. When A sends a group message, it establishes the following TCP connections: A-A, A-B, A-C, and A-D.
A sends 100 million messages to the cluster. TCP's flow control applies to the connections between A-B, A-C and A-D individually, but not to A-{B,C,D}, where {B,C,D} is the group. It is therefore possible that nodes A, B and C receive the 100 million messages, but that node D will only receive one million messages. This is also the reason we need NAKACK, even though TCP does its own retransmission.
JGroups has to buffer all messages in memory in case the original sender dies and a node asks for retransmission of a message. Because all members buffer all messages they receive, they must occasionally purge stable messages (messages seen by all nodes). This is done with the STABLE protocol, which can be configured to run the stability protocol based on either time (for example, every fifty seconds) or size (every 400 kilobytes of data received).
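A pbcast.STABLE configuration reflecting the behavior described above might look like the following. The attribute values are illustrative, chosen to match the time and size figures mentioned in the text rather than any shipped defaults:

```xml
<!-- Illustrative values: run the stability protocol roughly every
     50 seconds on average, or after 400 kilobytes of received data -->
<pbcast.STABLE stability_delay="1000"
    desired_avg_gossip="50000"
    max_bytes="400000"/>
```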
In the example case, the slow node D will prevent the group from purging messages other than the one million seen by D. In most cases this leads to out-of-memory exceptions, so messages must be sent at a rate that the slowest receiver can handle.

So do I always need FC?

This depends on the application's use of the JGroups channel. If node A from the previous example was able to slow its send rate because D was not keeping up, FC would not be required.
Applications that make synchronous group RPC calls are unlikely to require FC. In synchronous applications, the thread that makes the call blocks waiting for responses from all group members. This means that the threads on node A that make the calls would block waiting for responses from node D, naturally slowing the overall rate of calls.
A JBoss Cache cluster configured for REPL_SYNC is one example of an application that makes synchronous group RPC calls. If a channel is used only for a cache configured for REPL_SYNC, we recommend removing FC from its protocol stack.
If your cluster consists of two nodes, including FC in a TCP-based protocol stack is unnecessary, since TCP's internal flow control can handle one peer-to-peer relationship.
FC may also be omitted where a channel is used by a JBoss Cache configured for buddy replication with a single buddy. Such a channel acts much like a two-node cluster, where messages are only exchanged with one other node. Other messages related to data gravitation will be sent to all members, but these should be infrequent.
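As a sketch, a TCP-based stack for a two-node cluster might omit FC entirely. The fragment below is illustrative only: the host names, ports, and timeout values are assumptions for demonstration, not the configurations shipped with JBoss Enterprise Web Platform.

```xml
<Config>
    <!-- TCP transport: its own flow control suffices for one peer-to-peer relationship -->
    <TCP start_port="7800"/>
    <!-- Illustrative hosts and ports; list both cluster members here -->
    <TCPPING initial_hosts="nodeA[7800],nodeB[7800]"
        port_range="1" num_initial_members="2"/>
    <FD_SOCK/>
    <FD timeout="6000" max_tries="5" shun="true"/>
    <VERIFY_SUSPECT timeout="1500"/>
    <!-- No FC element, and no UNICAST: TCP provides both for unicast traffic -->
    <pbcast.NAKACK retransmit_timeout="300,600,1200,2400,4800"
        discard_delivered_msgs="true"/>
    <pbcast.STABLE desired_avg_gossip="50000" max_bytes="400000"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
        shun="true" view_bundling="true"/>
</Config>
```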

If you remove FC

Be sure to load test your application if you remove the FC element.