32.5. Network Partition Recovery Examples

The following examples illustrate how network partitions occur in Red Hat JBoss Data Grid clusters, how they are handled, and how the split partitions are eventually merged. The following example scenarios are described in detail:
  1. A distributed four-node cluster with numOwners set to 3 in Section 32.5.1, “Distributed 4-Node Cache Example With 3 NumOwners”
  2. A distributed four-node cluster with numOwners set to 2 in Section 32.5.2, “Distributed 4-Node Cache Example With 2 NumOwners”
  3. A distributed five-node cluster with numOwners set to 3 in Section 32.5.3, “Distributed 5-Node Cache Example With 3 NumOwners”
  4. A replicated four-node cluster with numOwners set to 4 in Section 32.5.4, “Replicated 4-Node Cache Example With 4 NumOwners”
  5. A replicated five-node cluster with numOwners set to 5 in Section 32.5.5, “Replicated 5-Node Cache Example With 5 NumOwners”
  6. A replicated eight-node cluster with numOwners set to 8 in Section 32.5.6, “Replicated 8-Node Cache Example With 8 NumOwners”

32.5.1. Distributed 4-Node Cache Example With 3 NumOwners

The first example scenario includes a four-node distributed cache that contains four data entries (k1, k2, k3, and k4). For this cache, numOwners equals 3, which means that each data entry must have three copies on various nodes in the cache.
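The numOwners value and the partition handling behavior shown in these examples are defined in the cache configuration. The sketch below uses Infinispan's programmatic configuration API with an embedded cache manager; the cache name is illustrative, and the exact partition handling option can differ between JBoss Data Grid versions, so treat it as an outline rather than a definitive configuration.

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfiguration;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;

public class DistributedCacheWithPartitionHandling {
    public static void main(String[] args) {
        // Clustered cache manager using the default transport settings.
        GlobalConfiguration global = GlobalConfigurationBuilder.defaultClusteredBuilder().build();
        EmbeddedCacheManager manager = new DefaultCacheManager(global);

        // Distributed cache that keeps three copies of every entry (numOwners = 3).
        // Enabling partition handling makes minority partitions enter Degraded Mode
        // instead of risking inconsistent reads and writes.
        Configuration config = new ConfigurationBuilder()
            .clustering()
                .cacheMode(CacheMode.DIST_SYNC)
                .hash().numOwners(3)
                .partitionHandling().enabled(true)
            .build();

        manager.defineConfiguration("exampleDistCache", config);
        manager.getCache("exampleDistCache").put("k1", "v1");
    }
}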
Cache before and after a network partition occurs

Figure 32.1. Cache Before and After a Network Partition

As seen in the diagram, after the network partition occurs, Node 1 and Node 2 form Partition 1 while Node 3 and Node 4 form Partition 2. After the split, both partitions enter Degraded Mode (represented by the grayed-out nodes in the diagram) because neither has at least 3 (the value of numOwners) nodes left from the last stable view. As a result, none of the four entries (k1, k2, k3, and k4) are available for reads or writes. No new entries can be written in either degraded partition, because neither partition can store 3 copies of an entry.
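In Degraded Mode, an operation on a key whose copies are not all held by the local partition fails instead of returning potentially stale data. The sketch below shows how an application might observe this, assuming the embedded Java API; the class and method names are illustrative.

import org.infinispan.Cache;
import org.infinispan.partitionhandling.AvailabilityException;

public class DegradedReadExample {

    // Attempts to read a key from a cache whose partition may be in Degraded Mode.
    public static String readOrNull(Cache<String, String> cache, String key) {
        try {
            return cache.get(key);
        } catch (AvailabilityException e) {
            // The local partition does not own all copies of this key, so the
            // operation is rejected until the partitions merge again.
            return null;
        }
    }
}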
Cache after the partitions are merged

Figure 32.2. Cache After Partitions Are Merged

JBoss Data Grid subsequently merges the two split partitions. No state transfer is required and the merged cache returns to Available Mode with four nodes and four entries (k1, k2, k3, and k4).

32.5.2. Distributed 4-Node Cache Example With 2 NumOwners

The second example scenario includes a distributed cache with four nodes. In this scenario, numOwners equals 2, so the four data entries (k1, k2, k3 and k4) have two copies each in the cache.
Cache before and after a network partition occurs

Figure 32.3. Cache Before and After a Network Partition

After the network partition occurs, Partitions 1 and 2 enter Degraded Mode (depicted in the diagram as grayed-out nodes). Within each partition, an entry is available for read or write operations only if both of its copies are in the same partition. In Partition 1, the data entry k1 is available for reads and writes because numOwners equals 2 and both copies of the entry remain in Partition 1. In Partition 2, k4 is available for reads and writes for the same reason. The entries k2 and k3 become unavailable in both partitions, as neither partition contains all copies of these entries. A new entry, k5, can be written to a partition only if both of its copies would be owned by that partition; otherwise the write fails.
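Whether the local partition is Available or Degraded can also be checked programmatically. The following is a minimal sketch, assuming the embedded Java API; in this scenario both partitions report DEGRADED_MODE even though some keys (k1 in Partition 1, k4 in Partition 2) remain readable and writable.

import org.infinispan.Cache;
import org.infinispan.partitionhandling.AvailabilityMode;

public class AvailabilityCheckExample {

    // Reports whether the local partition of the given cache is degraded.
    public static boolean isDegraded(Cache<String, String> cache) {
        AvailabilityMode mode = cache.getAdvancedCache().getAvailability();
        // In Degraded Mode only keys whose copies are all in the local partition
        // can be read or written; other keys fail with an AvailabilityException.
        return mode == AvailabilityMode.DEGRADED_MODE;
    }
}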
Cache after partitions are merged

Figure 32.4. Cache After Partitions Are Merged

JBoss Data Grid subsequently merges the two split partitions into a single cache. No state transfer is required and the cache returns to Available Mode with four nodes and four data entries (k1, k2, k3 and k4).

32.5.3. Distributed 5-Node Cache Example With 3 NumOwners

The third example scenario includes a distributed cache with five nodes and with numOwners equal to 3.
Cache before and after a network partition occurs

Figure 32.5. Cache Before and After a Network Partition

After the network partition occurs, the cache splits into two partitions. Partition 1 includes Node 1, Node 2, and Node 3, and Partition 2 includes Node 4 and Node 5. Partition 2 is Degraded because it does not contain a majority of the nodes in the cluster. Partition 1 remains Available because it has the majority of nodes and lost fewer than numOwners nodes.
No new entries can be added to Partition 2 because the partition is Degraded and cannot own all copies of an entry.
Partition 1 rebalances and then another entry is added to the partition

Figure 32.6. Partition 1 Rebalances and Another Entry is Added

After the partition split, Partition 1 retains the majority of nodes and can therefore rebalance itself by creating new copies to replace the missing ones. As displayed in the diagram above, rebalancing ensures that there are three copies of each entry (numOwners equals 3) in the cache. As a result, each of the three nodes contains a copy of every entry in the cache. Next, a new entry, k6, is added to the cache. Since the numOwners value is still 3 and there are three nodes in Partition 1, each node holds a copy of k6.
Cache after partitions are merged

Figure 32.7. Cache After Partitions Are Merged

Eventually, Partitions 1 and 2 are merged back into a single cache. Since only three copies are required for each data entry (numOwners=3), JBoss Data Grid rebalances the nodes so that the data entries are distributed across the five nodes in the cache. The new combined cache becomes fully available.
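An application can be notified when split partitions merge back into a single cluster by registering a listener on the cache manager. The following is a minimal sketch, assuming the embedded Java API; the class name is illustrative.

import org.infinispan.manager.EmbeddedCacheManager;
import org.infinispan.notifications.Listener;
import org.infinispan.notifications.cachemanagerlistener.annotation.Merged;
import org.infinispan.notifications.cachemanagerlistener.event.MergeEvent;

@Listener
public class MergeLogger {

    // Invoked on each node when previously split partitions merge back into
    // a single cluster view, as in the example above.
    @Merged
    public void onMerge(MergeEvent event) {
        System.out.println("Merged view: " + event.getNewMembers()
                + ", merged subgroups: " + event.getSubgroupsMerged());
    }

    public static void register(EmbeddedCacheManager manager) {
        manager.addListener(new MergeLogger());
    }
}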

32.5.4. Replicated 4-Node Cache Example With 4 NumOwners

The fourth example scenario includes a replicated cache with four nodes and with numOwners equal to 4.
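In a replicated cache every node holds a copy of every entry, so numOwners effectively equals the cluster size and does not need to be set explicitly. A minimal configuration sketch follows, assuming the embedded Java API; as in the earlier example, the exact partition handling option can vary between versions.

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class ReplicatedCacheWithPartitionHandling {

    // Replicated cache: every node owns every entry, so after a split a
    // partition stays Available only if it keeps a majority of the last
    // stable view.
    public static Configuration build() {
        return new ConfigurationBuilder()
            .clustering()
                .cacheMode(CacheMode.REPL_SYNC)
                .partitionHandling().enabled(true)
            .build();
    }
}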
Cache Before and After a Network Partition

Figure 32.8. Cache Before and After a Network Partition

After a network partition occurs, Partition 1 contains Node 1 and Node 2 while Node 3 and Node 4 are in Partition 2. Both partitions enter Degraded Mode because neither has a simple majority of nodes. All four keys (k1, k2, k3, and k4) are unavailable for reads and writes because neither of the two partitions owns all copies of any of the four keys.
Cache After Partitions Are Merged

Figure 32.9. Cache After Partitions Are Merged

JBoss Data Grid subsequently merges the two split partitions into a single cache. No state transfer is required and the cache returns to its original state in Available Mode with four nodes and four data entries (k1, k2, k3, and k4).

32.5.5. Replicated 5-Node Cache Example With 5 NumOwners

The fifth example scenario includes a replicated cache with five nodes and with numOwners equal to 5.
Cache before and after a network partition occurs

Figure 32.10. Cache Before and After a Network Partition

After a network partition occurs, the cache splits into two partitions. Partition 1 contains Node 1 and Node 2, while Partition 2 includes Node 3, Node 4, and Node 5. Partition 1 enters Degraded Mode (indicated by the grayed-out nodes) because it does not contain a majority of nodes. Partition 2, however, remains Available.
Both Partitions Are Merged Into One Cache

Figure 32.11. Both Partitions Are Merged Into One Cache

When JBoss Data Grid merges partitions in this example, Partition 2, which remained fully available, is considered the primary partition. State is transferred from Partition 2 to Partition 1, and the combined cache becomes fully available.

32.5.6. Replicated 8-Node Cache Example With 8 NumOwners

The sixth example scenario includes a replicated cache with eight nodes and with numOwners equal to 8.
Cache before and after a network partition occurs

Figure 32.12. Cache Before and After a Network Partition

A network partition splits the cluster into Partition 1, which contains three nodes (Node 1, Node 2, and Node 3), and Partition 2, which contains five nodes (Node 4 through Node 8). Partition 1 enters Degraded Mode, but Partition 2 remains Available.
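A node can also be notified when its partition changes between Available and Degraded, for example to pause writes while the cluster is split. The following is a minimal sketch, assuming the embedded Java API; the class name is illustrative.

import org.infinispan.Cache;
import org.infinispan.notifications.Listener;
import org.infinispan.notifications.cachelistener.annotation.PartitionStatusChanged;
import org.infinispan.notifications.cachelistener.event.PartitionStatusChangedEvent;

@Listener
public class PartitionStatusLogger {

    // Invoked when the local partition changes availability, for example when
    // Partition 1 enters Degraded Mode after the split described above.
    @PartitionStatusChanged
    public void onStatusChange(PartitionStatusChangedEvent<?, ?> event) {
        System.out.println("Cache " + event.getCache().getName()
                + " availability is now " + event.getAvailabilityMode());
    }

    public static void register(Cache<?, ?> cache) {
        cache.addListener(new PartitionStatusLogger());
    }
}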
Partition 2 Further Splits into Partitions 2A and 2B

Figure 32.13. Partition 2 Further Splits into Partitions 2A and 2B

Now another network partition affects Partition 2, which splits further into Partitions 2A and 2B. Partition 2A contains Node 4 and Node 5 while Partition 2B contains Node 6, Node 7, and Node 8. Partition 2A enters Degraded Mode because it does not contain a majority of the nodes from Partition 2's last stable view. However, Partition 2B remains Available.
Potential Resolution Scenarios

There are four potential ways in which the partitions in this scenario can be resolved:

  • Case 1: Partitions 2A and 2B Merge
  • Case 2: Partition 1 and 2A Merge
  • Case 3: Partition 1 and 2B Merge
  • Case 4: Partition 1, Partition 2A, and Partition 2B Merge Together
Case 1: Partitions 2A and 2B Merge

Figure 32.14. Case 1: Partitions 2A and 2B Merge

The first potential resolution to the partitioned network involves Partition 2B's state information being copied into Partition 2A. The result is Partition 2, which contains Node 4, Node 5, Node 6, Node 7, and Node 8. The newly merged partition becomes Available.
Case 2: Partition 1 and 2A Merge

Figure 32.15. Case 2: Partition 1 and 2A Merge

The second potential resolution to the partitioned network involves Partition 1 and Partition 2A merging. The combined partition contains Node 1, Node 2, Node 3, Node 4, and Node 5. As neither partition has the latest stable topology, the resulting merged partition remains in Degraded Mode.
Case 3: Partition 1 and 2B Merge

Figure 32.16. Case 3: Partition 1 and 2B Merge

The third potential resolution to the partitioned network involves Partition 1 and Partition 2B merging. Partition 1 receives state information from Partition 2B, and the combined partition becomes Available.
Case 4: Partition 1, Partition 2A, and Partition 2B Merge Together

Figure 32.17. Case 4: Partition 1, Partition 2A, and Partition 2B Merge Together

The fourth and final potential resolution to the partitioned network involves Partition 1, Partition 2A, and Partition 2B merging back into a single partition. State is transferred from Partition 2B to both Partition 1 and Partition 2A. The resulting cache contains eight nodes (Node 1, Node 2, Node 3, Node 4, Node 5, Node 6, Node 7, and Node 8) and is Available.