Chapter 43. Handling Network Partitions (Split Brain)

43.1. Network Partition Recovery

A network partition occurs when a cluster unintentionally breaks into two or more partitions. As a result, the nodes in each partition are unable to locate or communicate with the nodes in the other partitions.

In the event of a network partition in a distributed system like Red Hat JBoss Data Grid, the CAP (Brewer’s) theorem comes into play. The CAP theorem states that in the event of a Network Partition [P], a distributed system can provide either Consistency [C] or Availability [A] for the data, but not both.

By default, Partition Handling is disabled in JBoss Data Grid. During a network partition, each partition remains Available [A], at the cost of Consistency [C].

However, when Partition Handling is enabled, JBoss Data Grid prioritizes Consistency [C] of data over Availability [A].

Red Hat JBoss Data Grid offers the primary partitions strategy to repair a split network. When the network partition occurs and the cache is split into two or more partitions, at most one partition becomes the primary partition (and stays available) and the others are designated as secondary partitions (and enter Degraded Mode). When the partitions merge back into a single cache, the primary partition is then used as a reference for all secondary partitions. All members of the secondary partitions must remove their current state information and replace it with fresh state information from a member of the primary partition. If there was no primary partition during the split, the state on every node is assumed to be correct.

In JBoss Data Grid, a cache consists of data stored on a number of nodes. To prevent data loss if a node fails, JBoss Data Grid replicates a data item over multiple nodes. In distribution mode, this redundancy is configured using the owners configuration attribute, which specifies the number of replicas for each cache entry in the cache. As a result, as long as the number of nodes that have failed is less than the value of owners, JBoss Data Grid retains a copy of the lost data and can recover.

Note

In JBoss Data Grid’s replication mode, however, owners is always equal to the number of nodes in the cache, because each node contains a copy of every data item in the cache in this mode.
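
For illustration, the following Library Mode sketch configures this redundancy programmatically with the ConfigurationBuilder API; the cache name, owners value, and sample key are examples only:

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class OwnersExample {
    public static void main(String[] args) {
        // Clustered cache manager using the default JGroups transport.
        DefaultCacheManager manager = new DefaultCacheManager(
                new GlobalConfigurationBuilder().transport().defaultTransport().build());

        // Distribution mode: each entry is stored on two nodes (owners = 2).
        Configuration config = new ConfigurationBuilder()
                .clustering().cacheMode(CacheMode.DIST_SYNC)
                .hash().numOwners(2)
                .build();

        manager.defineConfiguration("distributed_cache", config);
        manager.getCache("distributed_cache").put("k1", "v1");
    }
}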

In certain cases, a number of nodes greater than the value of owners can disappear from the cache. Two common reasons for this are:

  • Split-Brain: Usually, as the result of a router crash, the cache is divided into two or more partitions. Each of the partitions operates independently of the other and each may contain different versions of the same data.
  • Successive Crashed Nodes: A number of nodes greater than the value of owners crashes in succession for any reason. JBoss Data Grid is unable to properly balance the state between crashes, and the result is partial data loss.

43.2. Detecting and Recovering from a Split-Brain Problem

When a Split-Brain occurs in the data grid, each network partition installs its own JGroups view with the nodes from the other partitions removed. The partitions remain unaware of each other, so there is no way to determine how many partitions the network has split into. Red Hat JBoss Data Grid assumes that the cache has unexpectedly split if one or more nodes disappear from the JGroups view without sending an explicit leave message, while in reality the cause can range from physical (crashed switches, cable failure, and so on) to virtual (stop-the-world garbage collection).

This state is dangerous because each of the newly split partitions operates independently and can store conflicting updates for the same data entries.

When Partition Handling mode is enabled (see Configure Partition Handling for instructions) and JBoss Data Grid suspects that one or more nodes are no longer accessible, each partition does not start a rebalance immediately; instead, it first checks whether it should enter Degraded Mode. To enter Degraded Mode, one of the following conditions must be true:

  • At least one segment has lost all its owners, which means that a number of nodes equal to or greater than the value of owners have left the JGroups view.
  • The partition does not contain a majority (more than half) of the nodes from the latest stable topology. The stable topology is updated each time a rebalance operation successfully concludes and the coordinator determines that additional rebalancing is not required.

If neither condition is met, the partition continues normal operations and JBoss Data Grid attempts to rebalance its nodes. Based on these conditions, at most one partition can remain in Available mode. The other partitions enter Degraded Mode.

When a partition enters into Degraded Mode, it only allows read/write access to those entries for which all owners (copies) of the entry exist on nodes within the same partition. Read and write requests for an entry for which one or more of its owners (copies) exist on nodes that have disappeared from the partition are rejected with an AvailabilityException.
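
The following sketch (the wrapping method, key, and messages are illustrative) shows what calling code sees in Degraded Mode: operations on entries whose owners are all present in the local partition succeed, while operations on entries with missing owners throw AvailabilityException:

import org.infinispan.Cache;
import org.infinispan.partitionhandling.AvailabilityException;

public class DegradedModeExample {
    // Attempts a write and reports whether the entry is reachable in the current partition.
    public static void tryWrite(Cache<String, String> cache, String key, String value) {
        try {
            cache.put(key, value);
            System.out.println("Write succeeded: all owners of " + key + " are in this partition");
        } catch (AvailabilityException e) {
            // One or more owners of this entry are outside the local partition.
            System.out.println("Entry " + key + " is unavailable in Degraded Mode: " + e.getMessage());
        }
    }
}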

Note

A possible limitation is that if two partitions start as isolated partitions and do not merge, they can read and write inconsistent data. JBoss Data Grid does not identify such partitions as split partitions.

Warning

Data consistency can be at risk from the time (t1) when the cache physically split to the time (t2) when JBoss Data Grid detects the connectivity change and changes the state of the partitions:

  • Transactional writes that were in progress at t1 when the split physically occurred may be rolled back on some of the owners. This can result in inconsistency between the copies (after the partitions rejoin) of an entry that is affected by such a write. However, transactional writes that started after t1 will fail as expected.
  • If the write is non-transactional, then during this time window, a value written only in a minor partition (due to physical split and because the partition has not yet been Degraded) can be lost when partitions rejoin, if this minor partition receives state from a primary (Available) partition upon rejoin. If the partition does not receive state upon rejoin (i.e. all partitions are degraded), then the value is not lost, but an inconsistency can remain.
  • There is also a possibility of a stale read in a minor partition during this transition period, as an entry is still Available until the minor partition enters Degraded state.

When partitions merge after a network partition has occurred:

  • If one of the partitions was Available during the network partition, then the joining partition(s) are wiped out and state transfer occurs from the Available (primary) partition to the joining nodes.
  • If all joining partitions were Degraded during the Split Brain, then no state transfer occurs during the merge. The combined cache is then Available only if the merging partitions contain a simple majority of the members in the latest stable topology (one with the highest topology ID) and has at least an owner for each segment (i.e. keys are not lost).

Warning

Between the time (t1) when partitions begin merging and the time (t2) when the merge is complete, nodes reconnect through a series of merge events. During this time window, it is possible that a node is reported as having temporarily left the cluster. For a transactional cache, if such a node is executing a transaction that spans other nodes during this window between t1 and t2, the transaction may not execute on the remote node but still succeed on the originating node. The result is a potential stale value for affected entries on a node that did not commit this transaction.

After t2, once the merge has completed on all nodes, this situation will not occur for subsequent transactions. However, an inconsistency introduced on entries that were affected by a transaction in progress during the time window between t1 and t2 is not resolved until these entries are subsequently updated or deleted. Until then, a read on such impacted entries can potentially return the stale value.

43.3. Split Brain Timing: Detecting a Split

When using the FD_ALL protocol, a given node becomes suspected after the following number of milliseconds has passed:

FD_ALL.timeout + FD_ALL.interval + VERIFY_SUSPECT.timeout + GMS.view_ack_collection_timeout
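
For example, with illustrative values of FD_ALL.timeout=10000, FD_ALL.interval=2000, VERIFY_SUSPECT.timeout=2000, and GMS.view_ack_collection_timeout=2000 (consult your JGroups stack for the actual settings), a crashed node is removed from the view after approximately:

10000 + 2000 + 2000 + 2000 = 16000 milliseconds (16 seconds)
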
Important

The amount of time taken in the formulas above is how long it takes JBoss Data Grid to install a cluster view without the leavers; however, as JBoss Data Grid runs inside a JVM excessive Garbage Collection (GC) times can increase this time beyond the failure detection outlined above. JBoss Data Grid has no control over these GC times, and excessive GC on the coordinator can delay this detection by an amount equal to the GC time.

43.4. Split Brain Timing: Recovering From a Split

After a split occurs, JBoss Data Grid merges the partitions back together. The maximum time to detect a merge after the network partition is healed is:

3.1 * MERGE3.max_interval

In some cases, multiple merges occur after a split before the cluster contains all available partitions. When multiple merges occur, allow time for all of them to complete; because as many as three merges may occur sequentially, the total delay should be no more than the following:

10 * MERGE3.max_interval
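
For example, assuming an illustrative MERGE3.max_interval of 30000 milliseconds, a single merge is detected within 3.1 * 30000 = 93000 milliseconds (93 seconds), and the worst case of multiple sequential merges completes within 10 * 30000 = 300000 milliseconds (5 minutes).
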
Important

The amount of time taken in the formulas above is how long it takes JBoss Data Grid to install a cluster view without the leavers; however, as JBoss Data Grid runs inside a JVM excessive Garbage Collection (GC) times can increase this time beyond the failure detection outlined above. JBoss Data Grid has no control over these GC times, and excessive GC on the coordinator can delay this detection by an amount equal to the GC time.

In addition, when merging cluster views JBoss Data Grid tries to confirm all members are present; however, there is no upper bound on waiting for these responses, and merging the cluster views may be delayed due to networking issues.

43.5. Detecting and Recovering from Successive Crashed Nodes

Red Hat JBoss Data Grid cannot distinguish whether a node left the cluster because of a process or machine crash, or because of a network failure.

If a single node exits the cluster, and if the value of owners is greater than 1, the cluster remains available and JBoss Data Grid attempts to create new replicas of the lost data. However, if additional nodes crash during this rebalancing process, it is possible that, for some entries, all copies have left the cluster and therefore cannot be recovered.

The recommended way to protect the data grid against successive crashed nodes is to enable partition handling (see Configure Partition Handling for instructions) and to set an appropriately high value for owners to ensure that even if a large number of nodes leave the cluster in rapid succession, JBoss Data Grid is able to rebalance the nodes to recover the lost data.

Alternatively, if you can tolerate some data loss, you can force JBoss Data Grid into AVAILABLE mode from DEGRADED mode using the Cache JMX MBean. See Cache JMX MBean.

Likewise, the AdvancedCache interface lets you read and change the cache availability. See The AdvancedCache Interface in the Developer Guide.
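
As a minimal sketch of this approach (the wrapping method is illustrative), the availability of a cache can be inspected and, if some data loss is acceptable, forced back to Available:

import org.infinispan.AdvancedCache;
import org.infinispan.partitionhandling.AvailabilityMode;

public class ForceAvailableExample {
    // Checks the current availability and, if the cache is Degraded, forces it back to Available.
    // Forcing availability accepts the potential loss of entries whose owners are still missing.
    public static void forceAvailable(AdvancedCache<?, ?> cache) {
        if (cache.getAvailability() == AvailabilityMode.DEGRADED_MODE) {
            cache.setAvailability(AvailabilityMode.AVAILABLE);
        }
    }
}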

43.6. Network Partition Recovery Examples

43.6.1. Network Partition Recovery Examples

The following examples illustrate how network partitions occur in Red Hat JBoss Data Grid clusters and how they are dealt with and eventually merged. The following example scenarios are described in detail:

  1. A distributed four node cluster with owners set to 3 at Distributed 4-Node Cache Example With 3 Owners
  2. A distributed four node cluster with owners set to 2 at Distributed 4-Node Cache Example With 2 Owners
  3. A distributed five node cluster with owners set to 3 at Distributed 5-Node Cache Example With 3 Owners
  4. A replicated four node cluster with owners set to 4 at Replicated 4-Node Cache Example With 4 Owners
  5. A replicated five node cluster with owners set to 5 at Replicated 5-Node Cache Example With 5 Owners
  6. A replicated eight node cluster with owners set to 8 at Replicated 8-Node Cache Example With 8 Owners

43.6.2. Distributed 4-Node Cache Example With 3 Owners

The first example scenario includes a four-node distributed cache that contains four data entries (k1, k2, k3, and k4). For this cache, owners equals 3, which means that each data entry must have three copies on various nodes in the cache.

Figure 43.1. Cache Before and After a Network Partition

As seen in the diagram, after the network partition occurs, Node 1 and Node 2 form Partition 1 while Node 3 and Node 4 form Partition 2. After the split, the two partitions enter Degraded Mode (represented by grayed-out nodes in the diagram) because neither has at least 3 (the value of owners) nodes left from the last stable view. As a result, none of the four entries (k1, k2, k3, and k4) are available for reads or writes. No new entries can be written in either degraded partition, as neither partition can store 3 copies of an entry.

Figure 43.2. Cache After Partitions Are Merged

JBoss Data Grid subsequently merges the two split partitions. No state transfer is required and the new merged cache is subsequently in Available Mode with four nodes and four entries (k1, k2, k3, and k4).

43.6.3. Distributed 4-Node Cache Example With 2 Owners

The second example scenario includes a distributed cache with four nodes. In this scenario, owners equals 2, so the four data entries (k1, k2, k3 and k4) have two copies each in the cache.

Figure 43.3. Cache Before and After a Network Partition

After the network partition occurs, Partitions 1 and 2 enter Degraded Mode (depicted in the diagram as grayed-out nodes). Within each partition, an entry is available for read or write operations only if both of its copies are in the same partition. In Partition 1, the data entry k1 is available for reads and writes because owners equals 2 and both copies of the entry remain in Partition 1. In Partition 2, k4 is available for reads and writes for the same reason. The entries k2 and k3 become unavailable in both partitions, as neither partition contains all copies of these entries. A new entry k5 could be written to a partition only if that partition owned both copies of it.

Figure 43.4. Cache After Partitions Are Merged

JBoss Data Grid subsequently merges the two split partitions into a single cache. No state transfer is required and the cache returns to Available Mode with four nodes and four data entries (k1, k2, k3 and k4).

43.6.4. Distributed 5-Node Cache Example With 3 Owners

The third example scenario includes a distributed cache with five nodes and with owners equal to 3.

Figure 43.5. Cache Before and After a Network Partition

After the network partition occurs, the cache splits into two partitions. Partition 1 includes Node 1, Node 2, and Node 3 and Partition 2 includes Node 4 and Node 5. Partition 2 is Degraded because it does not include the majority of nodes from the total number of nodes in the cache. Partition 1 remains Available because it has the majority of nodes and lost less than owners nodes.

No new entries can be added to Partition 2 because this partition is Degraded and it cannot own all copies of the data.

Figure 43.6. Partition 1 Rebalances and Another Entry is Added

After the partition split, Partition 1 retains the majority of nodes and therefore can rebalance itself by creating copies to replace the missing entries. As displayed in the diagram above, rebalancing ensures that there are three copies of each entry (owners equals 3) in the cache. As a result, each of the three nodes contains a copy of every entry in the cache. Next, we add a new entry, k6, to the cache. Since the owners value is still 3, and there are three nodes in Partition 1, each node includes a copy of k6.

Figure 43.7. Cache After Partitions Are Merged

Eventually, Partition 1 and Partition 2 are merged back into a single cache. Since only three copies are required for each data entry (owners=3), JBoss Data Grid rebalances the nodes so that the data entries are distributed across the five nodes in the cache. The new combined cache becomes fully available.

43.6.5. Replicated 4-Node Cache Example With 4 Owners

The fourth example scenario includes a replicated cache with four nodes and with owners equal to 4.

Figure 43.8. Cache Before and After a Network Partition

After a network partition occurs, Partition 1 contains Node 1 and Node 2 while Node 3 and Node 4 are in Partition 2. Both partitions enter Degraded Mode because neither has a simple majority of nodes. All four keys (k1, k2, k3, and k4) are unavailable for reads and writes because neither of the two partitions owns all copies of any of the four keys.

Figure 43.9. Cache After Partitions Are Merged

JBoss Data Grid subsequently merges the two split partitions into a single cache. No state transfer is required and the cache returns to its original state in Available Mode with four nodes and four data entries (k1, k2, k3, and k4).

43.6.6. Replicated 5-Node Cache Example With 5 Owners

The fifth example scenario includes a replicated cache with five nodes and with owners equal to 5.

Figure 43.10. Cache Before and After a Network Partition

After a network partition occurs, the cache splits into two partitions. Partition 1 contains Node 1 and Node 2 and Partition 2 includes Node 3, Node 4, and Node 5. Partition 1 enters Degraded Mode (indicated by the grayed-out nodes) because it does not contain the majority of nodes. Partition 2, however, remains available.

Figure 43.11. Both Partitions Are Merged Into One Cache

When JBoss Data Grid merges partitions in this example, Partition 2, which was fully available, is considered the primary partition. State is transferred from Partition 2 to Partition 1. The combined cache becomes fully available.

43.6.7. Replicated 8-Node Cache Example With 8 Owners

The sixth scenario is for a replicated cache with eight nodes and owners equal to 8.

Figure 43.12. Cache Before and After a Network Partition

A network partition splits the cluster into Partition 1 with 3 nodes and Partition 2 with 5 nodes. Partition 1 enters Degraded state, but Partition 2 remains Available.

Figure 43.13. Partition 2 Further Splits into Partitions 2A and 2B

Now another network partition affects Partition 2, which subsequently splits further into Partition 2A and 2B. Partition 2A contains Node 4 and Node 5 while Partition 2B contains Node 6, Node 7, and Node 8. Partition 2A enters Degraded Mode because it does not contain the majority of nodes. However, Partition 2B remains Available.

Potential Resolution Scenarios

There are four potential resolutions for the caches from this scenario:

  • Case 1: Partitions 2A and 2B Merge
  • Case 2: Partition 1 and 2A Merge
  • Case 3: Partition 1 and 2B Merge
  • Case 4: Partition 1, Partition 2A, and Partition 2B Merge Together

Figure 43.14. Case 1: Partitions 2A and 2B Merge

The first potential resolution to the partitioned network involves Partition 2B’s state information being copied into Partition 2A. The result is Partition 2, which contains Node 4, Node 5, Node 6, Node 7, and Node 8. The newly merged partition becomes Available.

Figure 43.15. Case 2: Partition 1 and 2A Merge

The second potential resolution to the partitioned network involves Partition 1 and Partition 2A merging. The combined partition contains Node 1, Node 2, Node 3, Node 4, and Node 5. As neither partition has the latest stable topology, the resulting merged partition remains in Degraded mode.

Figure 43.16. Case 3: Partition 1 and 2B Merge

The third potential resolution to the partitioned network involves Partition 1 and Partition 2B merging. Partition 1 receives state information from Partition 2B, and the combined partition becomes Available.

Figure 43.17. Case 4: Partition 1, Partition 2A, and Partition 2B Merge Together

The fourth and final potential resolution to the partitioned network involves Partition 1, Partition 2A, and Partition 2B merging to form Partition 1. The state is transferred from Partition 2B to both partitions 1 and 2A. The resulting cache contains eight nodes (Node 1, Node 2, Node 3, Node 4, Node 5, Node 6, Node 7, and Node 8) and is Available.

43.7. Configure Partition Handling

In Red Hat JBoss Data Grid, partition handling is disabled by default.

Declarative Configuration (Library Mode)

Enable partition handling declaratively as follows:

<distributed-cache name="distributed_cache"
        l1-lifespan="20000">
    <partition-handling enabled="true"/>
</distributed-cache>
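
In Library Mode the same setting can also be applied programmatically. The following sketch assumes a version based on Infinispan 8 (as in the server schema shown below); the cache mode is an example:

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class PartitionHandlingConfigExample {
    // Builds a distributed cache configuration with partition handling enabled,
    // so that minority partitions enter Degraded Mode instead of staying Available.
    public static Configuration distributedWithPartitionHandling() {
        return new ConfigurationBuilder()
                .clustering().cacheMode(CacheMode.DIST_SYNC)
                .partitionHandling().enabled(true)
                .build();
    }
}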

Declarative Configuration (Remote Client-server Mode)

Enable partition handling declaratively in remote client-server mode by using the following configuration:

<subsystem xmlns="urn:infinispan:server:core:8.4" default-cache-container="clustered">
    <cache-container name="clustered" default-cache="default" statistics="true">
        <distributed-cache name="default" >
            <partition-handling enabled="true" />
            <locking isolation="READ_COMMITTED" acquire-timeout="30000"
                     concurrency-level="1000" striping="false"/>
        </distributed-cache>
    </cache-container>
</subsystem>