Chapter 34. Handling Network Partitions (Split Brain)

Network Partitions occur when a cluster breaks into two or more partitions. As a result, the nodes in each partition are unable to locate or communicate with nodes in the other partitions. This results in an unintentionally partitioned network.
In the event of a network partition in a distributed system like Red Hat JBoss Data Grid, the CAP (Brewer’s) theorem comes into play. The CAP theorem states that in the event of a Network Partition (P), a distributed system can provide either Consistency (C) or Availability (A) for the data, but not both.
By default, Partition Handling is disabled in JBoss Data Grid. During a network partition, the partitions continue to remain Available (A), at the cost of Consistency (C).
However, when Partition Handling is enabled, JBoss Data Grid prioritizes consistency (C) of data over Availability (A).
Red Hat JBoss Data Grid offers the primary partitions strategy to repair a split network. When the network partition occurs and the cache is split into two or more partitions, at most one partition becomes the primary partition (and stays available) and the others are designated as secondary partitions (and enter Degraded Mode). When the partitions merge back into a single cache, the primary partition is then used as a reference for all secondary partitions. All members of the secondary partitions must remove their current state information and replace it with fresh state information from a member of the primary partition. If there was no primary partition during the split, the state on every node is assumed to be correct.
In JBoss Data Grid, a cache consists of data stored on a number of nodes. To prevent data loss if a node fails, JBoss Data Grid replicates a data item over multiple nodes. In distribution mode, this redundancy is configured using the numOwners configuration attribute, which specifies the number of replicas for each cache entry in the cache. As a result, as long as the number of nodes that have failed are less than the value of numOwners, JBoss Data Grid retains a copy of the lost data and can recover.

Note

In JBoss Data Grid's replication mode, however, numOwners is always equal to the number of nodes in the cache, because each node contains a copy of every data item in the cache in this mode.
In certain cases, a number of nodes greater than the value of numOwners can disappear from the cache. Two common reasons for this are:
  • Split-Brain: Usually, as the result of a router crash, the cache is divided into two or more partitions. Each of the partitions operates independently of the other and each may contain different versions of the same data.
  • Sucessive Crashed Nodes: A number of nodes greater than the value of numOwners crashes in succession for any reason. JBoss Data Grid is unable to properly balance the state between crashes, and the result is partial data loss.

34.1. Detecting and Recovering from a Split-Brain Problem

When a Split-Brain occurs in the data grid, each network partition installs its own JGroups view with nodes from other partitions removed. The partitions remain unaware of each other, therefore there is no way to determine how many partitions the network has split into. Red Hat JBoss Data Grid assumes that the cache has unexpectedly split if one or more nodes disappear from the JGroups cache without sending an explicit leaving message, while in reality the cause can be physical (crashed switches, cable failure, etc.) to virtual (stop-the-world garbage collection).
This state is dangerous because each of the newly split partitions operates independently and can store conflicting updates for the same data entries.
When Partition Handling mode is enabled (see Section 34.6, “Configure Partition Handling” for instructions) and JBoss Data Grid suspects that one or more nodes are no longer accessible, each partition does not start a rebalance immediately, but first it checks whether it should enter degraded mode instead. To enter Degraded Mode, one of the following conditions must be true:
  • At least one segment has lost all its owners, which means that a number of nodes equal to or greater than the value of numOwners have left the JGroups view.
  • The partition does not contain a majority of nodes (greater than half) of the nodes from the latest stable topology. The stable topology is updated each time a rebalance operation successfully concludes and the coordinator determines that additional rebalancing is not required.
If neither of the conditions are met, the partition continues normal operations and JBoss Data Grid attempts to rebalance its nodes. Based on these conditions, at most one partition can remain in Available mode. Other partitions will enter Degraded Mode.
When a partition enters into Degraded Mode, it only allows read/write access to those entries for which all owners (copies) of the entry exist on nodes within the same partition. Read and write requests for an entry for which one or more of its owners (copies) exist on nodes that have disappeared from the partition are rejected with an AvailabilityException.

Note

A possible limitation is that if two partitions start as isolated partitions and do not merge, they can read and write inconsistent data. JBoss Data Grid does not identify such partitions as split partitions.

Warning

Data consistency can be at risk from the time (t1) when the cache physically split to the time (t2) when JBoss Data Grid detects the connectivity change and changes the state of the partitions:
  • Transactional writes that were in progress at t1 when the split physically occurred may be rolled back on some of the owners. This can result in inconsistency between the copies (after the partitions rejoin) of an entry that is affected by such a write. However, transactional writes that started after t1 will fail as expected.
  • If the write is non-transactional, then during this time window, a value written only in a minor partition (due to physical split and because the partition has not yet been Degraded) can be lost when partitions rejoin, if this minor partition receives state from a primary (Available) partition upon rejoin. If the partition does not receive state upon rejoin (i.e. all partitions are degraded), then the value is not lost, but an inconsistency can remain.
  • There is also a possibility of a stale read in a minor partition during this transition period, as an entry is still Available until the minor partition enters Degraded state.
When partitions merge after a network partition has occurred:
  • If one of the partitions was Available during the network partition, then the joining partition(s) are wiped out and state transfer occurs from the Available (primary) partition to the joining nodes.
  • If all joining partitions were Degraded during the Split Brain, then no state transfer occurs during the merge. The combined cache is then Available only if the merging partitions contain a simple majority of the members in the latest stable topology (one with the highest topology ID) and has at least an owner for each segment (i.e. keys are not lost).

Warning

Between the time (t1) when partitions begin merging to the time (t2) when the merge is complete, nodes reconnect through a series of merge events. During this time window, it is possible that a node can be reported as having temporarily left the cluster. For a Transactional cache, if during this window between t1 and t2, such a node is executing a transaction that spans other nodes, then this transaction may not execute on the remote node, but still succeed on the originating node. The result is a potential stale value for affected entries on a node that did not commit this transaction.
After t2, once the merge has completed on all nodes, this situation will not occur for subsequent transactions. However, an inconsistency introduced on entries that were affected by a transaction in progress during the time window between t1 and t2 is not resolved until these entries are subsequently updated or deleted. Until then, a read on such impacted entries can potentially return the stale value.