8.10. Managing Split-brain

Split-brain is a state in which data or availability inconsistencies arise from maintaining two separate data sets that overlap in scope, either because of the way servers are arranged in a network design or because of a failure condition in which servers stop communicating and synchronizing their data with each other.
In Red Hat Storage, split-brain is a term applicable to Red Hat Storage volumes in a replicate configuration. A file is said to be in split-brain when the copies of the same file on the different bricks that constitute the replica pair have mismatching data or metadata contents, such that they conflict with each other and automatic healing is not possible. In this scenario, you can decide which is the correct file (source) and which is the one that requires healing (sink) by inspecting the mismatching files from the backend bricks.
The AFR translator in glusterFS uses extended attributes to keep track of the operations on a file. These attributes determine which brick is the source and which brick is the sink for a file that requires healing. If the files are clean, the extended attributes are all zeroes, indicating that no heal is necessary. When a heal is required, they are marked in such a way that there is a distinguishable source and sink, and the heal can happen automatically. However, when a split-brain occurs, these extended attributes are marked in such a way that both bricks mark themselves as sources, making automatic healing impossible.
When a split-brain occurs, applications cannot perform certain operations like read and write on the file. Accessing the files results in the application receiving an Input/Output Error.
The three types of split-brains that occur in Red Hat Storage are:
  • Data split-brain: The contents of the file under split-brain differ between the bricks of the replica pair, and automatic healing is not possible.
  • Metadata split-brain: The metadata of the file (for example, a user-defined extended attribute) differs between the bricks, and automatic healing is not possible.
  • Entry split-brain: This happens when a file has a different gfid on each brick of the replica pair.
The only way to resolve split-brains is to manually inspect the file contents from the backend, decide which is the true copy (source), and modify the appropriate extended attributes so that healing can happen automatically.

8.10.1. Preventing Split-brain

To prevent split-brain in the trusted storage pool, you must configure server-side and client-side quorum.

8.10.1.1. Configuring Server-Side Quorum

The quorum configuration in a trusted storage pool determines the number of server failures that the trusted storage pool can sustain. If an additional failure occurs, the trusted storage pool will become unavailable. If too many server failures occur, or if there is a problem with communication between the trusted storage pool nodes, it is essential that the trusted storage pool be taken offline to prevent data loss.
After configuring the quorum ratio at the trusted storage pool level, you must enable the quorum on a particular volume by setting the cluster.server-quorum-type volume option to server. For more information on this volume option, see Section 8.1, “Configuring Volume Options”.
Configuring the quorum is necessary to guard against the effects of network partitions in the trusted storage pool. A network partition is a scenario in which a small set of nodes can communicate with each other across a functioning part of a network, but cannot communicate with a different set of nodes in another part of the network. This can cause undesirable situations, such as split-brain in a distributed system. To prevent a split-brain situation, all the nodes in at least one of the partitions must stop running to avoid inconsistencies.
This quorum is on the server side, that is, in the glusterd service. Whenever the glusterd service on a machine observes that the quorum is not met, it brings down the bricks to prevent data split-brain. When the network connections are brought back up and the quorum is restored, the bricks in the volume are brought back up. When the quorum is not met for a volume, commands that update the volume configuration, add peers, or detach peers are not allowed. Note that the glusterd service not running and the network connection between two machines being down are treated equally.
You can configure the quorum percentage ratio for a trusted storage pool. If the percentage ratio of the quorum is not met due to network outages, the bricks of the volume participating in the quorum in those nodes are taken offline. By default, the quorum is met if the percentage of active nodes is more than 50% of the total storage nodes. However, if the quorum ratio is manually configured, then the quorum is met only if the percentage of active storage nodes of the total storage nodes is greater than or equal to the set value.
To configure the quorum ratio, use the following command:
# gluster volume set all cluster.server-quorum-ratio PERCENTAGE
For example, to set the quorum to 51% of the trusted storage pool:
# gluster volume set all cluster.server-quorum-ratio 51%
In this example, the quorum ratio setting of 51% means that more than half of the nodes in the trusted storage pool must be online and have network connectivity between them at any given time. If a network disconnect leaves a set of nodes without quorum, the bricks running on those nodes are stopped to prevent further writes.
To make a particular volume participate in the server-side quorum, you must enable the quorum on that volume by running the following command:
# gluster volume set VOLNAME cluster.server-quorum-type server

Important

For a two-node trusted storage pool, it is important to set the quorum ratio to be greater than 50% so that two nodes separated from each other do not both believe they have a quorum.
For a replicated volume with two nodes and one brick on each machine, if the server-side quorum is enabled and one of the nodes goes offline, the other node will also be taken offline because of the quorum configuration. As a result, the high availability provided by the replication is ineffective. To prevent this situation, a dummy node that does not contain any bricks can be added to the trusted storage pool. This ensures that even if one of the nodes which contains data goes offline, the other node will remain online. Note that if the dummy node and one of the data nodes go offline, the brick on the other node will also be taken offline, resulting in data unavailability.
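The quorum rule and the two-node case described above can be sketched as a small shell function. This only illustrates the arithmetic and is not how glusterd implements it; the function name server_quorum_met and its arguments are hypothetical, and the ratio is passed as a whole number without the % sign.

```shell
# Hypothetical helper (not part of the gluster CLI).
# Usage: server_quorum_met ACTIVE_NODES TOTAL_NODES [RATIO_PERCENT]
# Exit status 0 means the quorum is met, 1 means it is not.
server_quorum_met() (
    active=$1
    total=$2
    ratio=${3:-}
    if [ -z "$ratio" ]; then
        # Default rule: more than 50% of the storage nodes must be active.
        [ $((active * 2)) -gt "$total" ]
    else
        # Configured rule: the active percentage must be >= the set value.
        [ $((active * 100)) -ge $((ratio * total)) ]
    fi
)

# Two-node pool with a 51% ratio: a single surviving node has no quorum.
if server_quorum_met 1 2 51; then echo "met"; else echo "not met"; fi
# Same pool plus a brick-less dummy node: two of three nodes have quorum.
if server_quorum_met 2 3 51; then echo "met"; else echo "not met"; fi
```

The first call reports that the quorum is not met and the second reports that it is, which matches the reason for adding a dummy node to a two-node pool.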

8.10.1.2. Configuring Client-Side Quorum

Replication in Red Hat Storage Server allows modifications as long as at least one of the bricks in a replica group is online. In a network-partition scenario, different clients may connect to different bricks in the replicated environment and modify the same file on different bricks. When a client witnesses brick disconnections, a file can be modified on different bricks at different times while the other brick in the replica is offline. For example, in a 1 X 2 replicate volume, while the same file is being modified, client C1 may be able to connect only to brick B1 and client C2 only to brick B2. These situations lead to split-brain: the file becomes unusable and manual intervention is required to fix the issue.
Client-side quorum is implemented to minimize split-brains. The client-side quorum configuration determines the number of bricks that must be up for data modification to be allowed. If client-side quorum is not met, files in that replica group become read-only. This client-side quorum configuration applies to all the replica groups in the volume. If client-side quorum is not met for m of n replica groups, only those m replica groups become read-only and the rest of the replica groups continue to allow data modifications.

Example 8.8. Client-Side Quorum

In the above scenario, when the client-side quorum is not met for replica group A, only replica group A becomes read-only. Replica groups B and C continue to allow data modifications.

Important

  1. If cluster.quorum-type is fixed, writes continue as long as the number of bricks up and running in the replica pair is at least the count specified in the cluster.quorum-count option. It does not matter whether it is the first, second, or third brick; all the bricks are equivalent here.
  2. If cluster.quorum-type is auto, then at least ceil (n/2) number of bricks need to be up to allow writes, where n is the replica count. For example,
    for replica 2, ceil(2/2)= 1 brick
    for replica 3, ceil(3/2)= 2 bricks
    for replica 4, ceil(4/2)= 2 bricks
    for replica 5, ceil(5/2)= 3 bricks
    for replica 6, ceil(6/2)= 3 bricks
    and so on
    
    In addition, for auto, if the number of bricks that are up is exactly ceil (n/2), and n is an even number, then the first brick of the replica must also be up to allow writes. For replica 6, if more than 3 bricks are up, then it can be any of the bricks. But if exactly 3 bricks are up, then the first brick has to be up and running.
  3. In a three-way replication setup, it is recommended to set cluster.quorum-type to auto to avoid split brains. If the quorum is not met, the replica pair becomes read-only.
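The auto rule in item 2 can be sketched as a shell function. This is an illustration of the rule as stated above, not the AFR implementation; the function name client_quorum_auto_met and its arguments are hypothetical.

```shell
# Hypothetical helper (not a gluster command).
# Usage: client_quorum_auto_met REPLICA_COUNT BRICKS_UP FIRST_BRICK_UP
# FIRST_BRICK_UP is "yes" or "no". Exit status 0 means writes are allowed.
client_quorum_auto_met() (
    n=$1
    up=$2
    first_up=$3
    need=$(( (n + 1) / 2 ))    # ceil(n/2) in integer arithmetic
    if [ "$up" -lt "$need" ]; then
        exit 1
    fi
    if [ $((n % 2)) -eq 0 ] && [ "$up" -eq "$need" ]; then
        # Even replica count with exactly n/2 bricks up:
        # the first brick must be one of them.
        [ "$first_up" = "yes" ]
        exit $?
    fi
    exit 0
)

# Replica 2 with only the second brick up: writes are not allowed.
if client_quorum_auto_met 2 1 no; then echo "writable"; else echo "read-only"; fi
```

For replica 3 with any two bricks up, the same function reports that writes are allowed, because the first-brick condition applies only to even replica counts.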
Configure the client-side quorum using cluster.quorum-type and cluster.quorum-count options. For more information on these options, see Section 8.1, “Configuring Volume Options”.

Important

When you integrate Red Hat Storage with Red Hat Enterprise Virtualization or Red Hat OpenStack, the client-side quorum is enabled when you run the gluster volume set VOLNAME group virt command. If, in a two-replica setup, the first brick in the replica pair is offline, virtual machines will be paused because quorum is not met and writes are disallowed.
Consistency is achieved at the cost of fault tolerance. If fault-tolerance is preferred over consistency, disable client-side quorum with the following command:
# gluster volume reset VOLNAME quorum-type
Example - Setting up server-side and client-side quorum to avoid split-brain scenario

This example shows how to set server-side and client-side quorum on a Distribute Replicate volume to avoid a split-brain scenario. The example uses a 2 X 2 (4 bricks) Distribute Replicate setup.

# gluster volume info testvol
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 0df52d58-bded-4e5d-ac37-4c82f7c89cfh
Status: Created
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: server1:/bricks/brick1
Brick2: server2:/bricks/brick2
Brick3: server3:/bricks/brick3
Brick4: server4:/bricks/brick4
Setting Server-side Quorum
Enable the quorum on a particular volume to participate in the server-side quorum by running the following command:
# gluster volume set VOLNAME cluster.server-quorum-type server
Set the quorum to 51% of the trusted storage pool:
# gluster volume set all cluster.server-quorum-ratio 51%
In this example, the quorum ratio setting of 51% means that more than half of the nodes in the trusted storage pool must be online and have network connectivity between them at any given time. If a network disconnect leaves a set of nodes without quorum, the bricks running on those nodes are stopped to prevent further writes.
Setting Client-side Quorum
Set the quorum-type option to auto to allow writes to the file only if the percentage of active replicate bricks is more than 50% of the total number of bricks that constitute that replica.
# gluster volume set VOLNAME quorum-type auto
In this example, as there are only two bricks in the replica pair, the first brick must be up and running to allow writes.

Important

At least n/2 bricks need to be up for the quorum to be met. If the number of bricks (n) in a replica set is an even number, the n/2 count must include the primary brick, which must be up and running. If n is an odd number, the n/2 count (rounded up) can consist of any bricks; that is, the primary brick need not be up and running to allow writes.

8.10.2. Recovering from File Split-brain

Steps to recover from a file split-brain

  1. Run the following command to obtain the path of the file that is in split-brain:
    # gluster volume heal VOLNAME info split-brain
    From the command output, identify the files for which file operations performed from the client keep failing with Input/Output error.
  2. Close the applications that opened the split-brain file from the mount point. If you are using a virtual machine, you must power off the machine.
  3. Obtain and verify the AFR changelog extended attributes of the file using the getfattr command. Then identify the type of split-brain to determine which of the bricks contains the 'good copy' of the file.
    # getfattr -d -m . -e hex <file-path-on-brick>
    For example,
    # getfattr -d -e hex -m. brick-a/file.txt  
    #file: brick-a/file.txt
    security.selinux=0x726f6f743a6f626a6563745f723a66696c655f743a733000
    trusted.afr.vol-client-2=0x000000000000000000000000
    trusted.afr.vol-client-3=0x000000000200000000000000
    trusted.gfid=0x307a5c9efddd4e7c96e94fd4bcdcbd1b
    The extended attributes named trusted.afr.VOLNAME-client-<subvolume-index> are used by AFR to maintain the changelog of the file. Their values are calculated by the glusterFS client (FUSE or NFS-server) processes. When the glusterFS client modifies a file or directory, the client contacts each brick and updates the changelog extended attribute according to the response of the brick.
    subvolume-index is the brick number minus 1, where the brick number is the position of the brick in the gluster volume info VOLNAME output.
    For example,
    # gluster volume info vol
    Volume Name: vol 
    Type: Distributed-Replicate  
    Volume ID: 4f2d7849-fbd6-40a2-b346-d13420978a01  
    Status: Created  
    Number of Bricks: 4 x 2 = 8 
    Transport-type: tcp  
    Bricks:  
    brick-a: server1:/gfs/brick-a  
    brick-b: server1:/gfs/brick-b  
    brick-c: server1:/gfs/brick-c  
    brick-d: server1:/gfs/brick-d  
    brick-e: server1:/gfs/brick-e  
    brick-f: server1:/gfs/brick-f  
    brick-g: server1:/gfs/brick-g  
    brick-h: server1:/gfs/brick-h
    In the example above:
    Brick             |    Replica set        |    Brick subvolume index
    ----------------------------------------------------------------------------
    -/gfs/brick-a     |       0               |       0
    -/gfs/brick-b     |       0               |       1
    -/gfs/brick-c     |       1               |       2
    -/gfs/brick-d     |       1               |       3
    -/gfs/brick-e     |       2               |       4
    -/gfs/brick-f     |       2               |       5
    -/gfs/brick-g     |       3               |       6
    -/gfs/brick-h     |       3               |       7
    Each file in a brick maintains the changelog of itself and that of the files present in all the other bricks in its replica set, as seen by that brick.
    In the example volume given above, all files in brick-a will have 2 entries, one for itself and the other for the file present in its replica pair. The following is the changelog for brick-a:
    • trusted.afr.vol-client-0=0x000000000000000000000000 - is the changelog for itself (brick-a)
    • trusted.afr.vol-client-1=0x000000000000000000000000 - changelog for brick-b as seen by brick-a
    Likewise, all files in brick-b will have the following:
    • trusted.afr.vol-client-0=0x000000000000000000000000 - changelog for brick-a as seen by brick-b
    • trusted.afr.vol-client-1=0x000000000000000000000000 - changelog for itself (brick-b)
    The same can be extended for other replica pairs.
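    The naming scheme above can be sketched as a small shell function. It assumes, as in the example table, that bricks are numbered from 1 in the order listed by gluster volume info and that consecutive bricks form a replica set. The function name afr_xattr_names is hypothetical.

    ```shell
    # Hypothetical helper (not a gluster command).
    # Usage: afr_xattr_names VOLNAME BRICK_NUMBER REPLICA_COUNT
    # Prints the changelog attribute names expected on files of that brick.
    afr_xattr_names() (
        vol=$1
        brick=$2
        replica=$3
        index=$((brick - 1))            # subvolume index = brick number - 1
        set_no=$((index / replica))     # replica set the brick belongs to
        first=$((set_no * replica))     # index of the first brick in that set
        i=$first
        while [ "$i" -lt $((first + replica)) ]; do
            echo "trusted.afr.$vol-client-$i"
            i=$((i + 1))
        done
    )

    # brick-d is the fourth brick of the 4 x 2 example volume "vol".
    afr_xattr_names vol 4 2
    ```

    For brick-d this prints trusted.afr.vol-client-2 and trusted.afr.vol-client-3, the two subvolume indexes of replica set 1 in the table.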
    Interpreting changelog (approximate pending operation count) value

    Each extended attribute has a value which is 24 hexadecimal digits. The first 8 digits represent the changelog of data, the second 8 digits the changelog of metadata, and the last 8 digits the changelog of directory entries.

    This can be represented pictorially as follows:
    0x 000003d7 00000001 00000000
          |        |        |
          |        |         \_ changelog of directory entries
          |         \_ changelog of metadata
           \_ changelog of data
    For directories, metadata and entry changelogs are valid. For regular files, data and metadata changelogs are valid. For special files like device files and so on, the metadata changelog is valid. When a file split-brain happens, it can be a data split-brain, a metadata split-brain, or both.
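    The layout described above can be decoded with a short shell function. This is a minimal sketch that only splits the 24 hexadecimal digits into their three 8-digit parts and prints them as decimal counts. The function name decode_changelog is hypothetical.

    ```shell
    # Hypothetical helper (not a gluster command).
    # Usage: decode_changelog 0x<24 hexadecimal digits>
    decode_changelog() (
        hex=${1#0x}                                 # strip the 0x prefix
        if [ "${#hex}" -ne 24 ]; then
            echo "expected 24 hexadecimal digits, got ${#hex}" >&2
            exit 1
        fi
        data=$(echo "$hex" | cut -c1-8)             # first 8 digits
        meta=$(echo "$hex" | cut -c9-16)            # second 8 digits
        entry=$(echo "$hex" | cut -c17-24)          # last 8 digits
        echo "data=$((0x$data)) metadata=$((0x$meta)) entry=$((0x$entry))"
    )

    decode_changelog 0x000003d70000000100000000
    # prints: data=983 metadata=1 entry=0
    ```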
    The following is an example of both data and metadata split-brain on the same file:
    # getfattr -d -m . -e hex /gfs/brick-?/a 
    getfattr: Removing leading '/' from absolute path names
    #file: gfs/brick-a/a
    trusted.afr.vol-client-0=0x000000000000000000000000  
    trusted.afr.vol-client-1=0x000003d70000000100000000  
    trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57  
    #file: gfs/brick-b/a
    trusted.afr.vol-client-0=0x000003b00000000100000000  
    trusted.afr.vol-client-1=0x000000000000000000000000 
    trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57
    Scrutinize the changelogs

    The changelog extended attributes on file /gfs/brick-a/a are as follows:
    • The first 8 digits of trusted.afr.vol-client-0 are all zeros (0x00000000................),
      The first 8 digits of trusted.afr.vol-client-1 are not all zeros (0x000003d7................).
      So the changelog on /gfs/brick-a/a implies that some data operations succeeded on itself but failed on /gfs/brick-b/a.
    • The second 8 digits of trusted.afr.vol-client-0 are all zeros (0x........00000000........), and the second 8 digits of trusted.afr.vol-client-1 are not all zeros (0x........00000001........).
      So the changelog on /gfs/brick-a/a implies that some metadata operations succeeded on itself but failed on /gfs/brick-b/a.
    The changelog extended attributes on file /gfs/brick-b/a are as follows:
    • The first 8 digits of trusted.afr.vol-client-0 are not all zeros (0x000003b0................).
      The first 8 digits of trusted.afr.vol-client-1 are all zeros (0x00000000................).
      So the changelog on /gfs/brick-b/a implies that some data operations succeeded on itself but failed on /gfs/brick-a/a.
    • The second 8 digits of trusted.afr.vol-client-0 are not all zeros (0x........00000001........)
      The second 8 digits of trusted.afr.vol-client-1 are all zeros (0x........00000000........).
      So the changelog on /gfs/brick-b/a implies that some metadata operations succeeded on itself but failed on /gfs/brick-a/a.
    Here, both copies have data and metadata changes that are not on the other file. Hence, it is both a data and a metadata split-brain.
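    The comparison walked through above can be sketched as a shell function. It takes the changelog that the first brick holds for the second brick and the changelog that the second brick holds for the first, and reports the parts for which both are non-zero. It covers only the data and metadata parts discussed here, and the function name classify_split_brain is hypothetical.

    ```shell
    # Hypothetical helper (not a gluster command).
    # Usage: classify_split_brain A_CHANGELOG_FOR_B B_CHANGELOG_FOR_A
    classify_split_brain() (
        a_for_b=${1#0x}
        b_for_a=${2#0x}
        result=""
        for spec in data:1-8 metadata:9-16; do
            name=${spec%%:*}
            range=${spec#*:}
            fa=$(echo "$a_for_b" | cut -c"$range")
            fb=$(echo "$b_for_a" | cut -c"$range")
            # Both bricks recording pending operations against each
            # other for the same part means neither can be the source.
            if [ $((0x$fa)) -ne 0 ] && [ $((0x$fb)) -ne 0 ]; then
                result="$result $name"
            fi
        done
        result=${result# }
        echo "${result:-none}"
    )

    # trusted.afr.vol-client-1 on brick-a, trusted.afr.vol-client-0 on brick-b
    classify_split_brain 0x000003d70000000100000000 0x000003b00000000100000000
    # prints: data metadata
    ```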
    Deciding on the correct copy

    You must inspect stat and getfattr output of the files to decide which metadata to retain and contents of the file to decide which data to retain. To continue with the example above, here, we are retaining the data of /gfs/brick-a/a and metadata of /gfs/brick-b/a.

    Resetting the relevant changelogs to resolve the split-brain

    Resolving data split-brain

    You must change the changelog extended attributes on the files as if some data operations succeeded on /gfs/brick-a/a but failed on /gfs/brick-b/a. But /gfs/brick-b/a should not have any changelog showing data operations succeeded on /gfs/brick-b/a but failed on /gfs/brick-a/a. You must reset the data part of the changelog on trusted.afr.vol-client-0 of /gfs/brick-b/a.

    Resolving metadata split-brain

    You must change the changelog extended attributes on the files as if some metadata operations succeeded on /gfs/brick-b/a but failed on /gfs/brick-a/a. But /gfs/brick-a/a should not have any changelog which says some metadata operations succeeded on /gfs/brick-a/a but failed on /gfs/brick-b/a. You must reset the metadata part of the changelog on trusted.afr.vol-client-1 of /gfs/brick-a/a.
    Run the following commands to reset the extended attributes.
    1. On /gfs/brick-b/a, to change trusted.afr.vol-client-0 from 0x000003b00000000100000000 to 0x000000000000000100000000, execute the following command:
      # setfattr -n trusted.afr.vol-client-0 -v 0x000000000000000100000000 /gfs/brick-b/a
    2. On /gfs/brick-a/a, to change trusted.afr.vol-client-1 from 0x000003d70000000100000000 to 0x000003d70000000000000000, execute the following command:
      # setfattr -n trusted.afr.vol-client-1 -v 0x000003d70000000000000000 /gfs/brick-a/a
    After you reset the extended attributes, the changelogs would look similar to the following:
    # getfattr -d -m . -e hex /gfs/brick-?/a  
    getfattr: Removing leading '/' from absolute path names  
    #file: gfs/brick-a/a  
    trusted.afr.vol-client-0=0x000000000000000000000000  
    trusted.afr.vol-client-1=0x000003d70000000000000000  
    trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57  
    
    #file: gfs/brick-b/a  
    trusted.afr.vol-client-0=0x000000000000000100000000  
    trusted.afr.vol-client-1=0x000000000000000000000000  
    trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57
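    The new attribute values used above can also be derived mechanically by zeroing one 8-digit part of the existing value. The following is a minimal sketch of that step only. The function name reset_changelog_part is hypothetical, and the result still has to be applied with setfattr as shown earlier.

    ```shell
    # Hypothetical helper (not a gluster command).
    # Usage: reset_changelog_part 0x<24 hexadecimal digits> data|metadata|entry
    reset_changelog_part() (
        hex=${1#0x}
        part=$2
        data=$(echo "$hex" | cut -c1-8)
        meta=$(echo "$hex" | cut -c9-16)
        entry=$(echo "$hex" | cut -c17-24)
        case $part in
            data)     data=00000000 ;;
            metadata) meta=00000000 ;;
            entry)    entry=00000000 ;;
            *) echo "unknown part: $part" >&2; exit 1 ;;
        esac
        echo "0x$data$meta$entry"
    )

    # Value for trusted.afr.vol-client-0 on /gfs/brick-b/a (data part reset)
    reset_changelog_part 0x000003b00000000100000000 data
    # Value for trusted.afr.vol-client-1 on /gfs/brick-a/a (metadata part reset)
    reset_changelog_part 0x000003d70000000100000000 metadata
    ```

    The two calls print 0x000000000000000100000000 and 0x000003d70000000000000000, the values passed to setfattr in the commands above.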
    
    Resolving Directory entry split-brain

    AFR can conservatively merge different entries in the directories when there is a split-brain on a directory. If the directory storage has entries 1, 2 on one brick and entries 3, 4 on the other brick, AFR merges all of the entries so that the directory has entries 1, 2, 3, 4. However, this may cause deleted files to reappear if the split-brain happened because files were deleted from the directory. Split-brain resolution needs human intervention when at least one entry in that directory has the same file name but a different gfid.

    For example:
    On brick-a the directory has 2 entries: file1 with gfid_x, and file2. On brick-b the directory has 2 entries: file1 with gfid_y, and file3. Here the gfids of file1 on the bricks are different. This kind of directory split-brain needs human intervention to resolve. You must remove either file1 on brick-a or file1 on brick-b to resolve the split-brain.
    In addition, the corresponding gfid-link file must be removed. The gfid-link files are present in the .glusterfs directory in the top-level directory of the brick. If the gfid of the file is 0x307a5c9efddd4e7c96e94fd4bcdcbd1b (the trusted.gfid extended attribute received from the getfattr command earlier), the gfid-link file can be found at /gfs/brick-a/.glusterfs/30/7a/307a5c9efddd4e7c96e94fd4bcdcbd1b.

    Warning

    Before deleting the gfid-link, you must ensure that there are no hard links to the file present on that brick. If hard-links exist, you must delete them.
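    One way to look for such hard links is sketched below. It assumes a find implementation that supports the -samefile test (GNU and BSD find do; it is not part of POSIX), and the function name list_hard_links is hypothetical. The .glusterfs directory is skipped so that the gfid-link itself is not reported.

    ```shell
    # Hypothetical helper (not a gluster command).
    # Usage: list_hard_links BRICK_ROOT FILE_PATH_ON_BRICK
    # FILE_PATH_ON_BRICK must start with BRICK_ROOT exactly as given.
    list_hard_links() (
        brick=$1
        file=$2
        # Skip the .glusterfs tree, then print every other path that
        # shares an inode with FILE, excluding FILE itself.
        find "$brick" -path "$brick/.glusterfs" -prune -o \
            -samefile "$file" ! -path "$file" -print
    )

    # Example call (paths taken from the example above):
    #   list_hard_links /gfs/brick-a /gfs/brick-a/file1
    ```

    An empty result suggests that no other hard links to the file exist on that brick outside the .glusterfs directory.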
  4. Trigger self-heal by running the following command:
    # ls -l <file-path-on-gluster-mount>
    or
    # gluster volume heal VOLNAME

8.10.3. Triggering Self-Healing on Replicated Volumes

For replicated volumes, when a brick goes offline and comes back online, self-healing is required to resync all the replicas. There is a self-heal daemon which runs in the background, and automatically initiates self-healing every 10 minutes on any files which require healing.
There are various commands that can be used to check the healing status of volumes and files, or to manually initiate healing:
  • To view the list of files that need healing:
    # gluster volume heal VOLNAME info
    For example, to view the list of files on test-volume that need healing:
    # gluster volume heal test-volume info
    Brick server1:/gfs/test-volume_0
    Number of entries: 0
     
    Brick server2:/gfs/test-volume_1
    /95.txt
    /32.txt
    /66.txt
    /35.txt
    /18.txt
    /26.txt - Possibly undergoing heal
    /47.txt 
    /55.txt
    /85.txt - Possibly undergoing heal
    ...
    Number of entries: 101
  • To trigger self-healing only on the files which require healing:
    # gluster volume heal VOLNAME
    For example, to trigger self-healing on files which require healing on test-volume:
    # gluster volume heal test-volume
    Heal operation on volume test-volume has been successful
  • To trigger self-healing on all the files on a volume:
    # gluster volume heal VOLNAME full
    For example, to trigger self-heal on all the files on test-volume:
    # gluster volume heal test-volume full
    Heal operation on volume test-volume has been successful
  • To view the list of files on a volume that are in a split-brain state:
    # gluster volume heal VOLNAME info split-brain
    For example, to view the list of files on test-volume that are in a split-brain state:
    # gluster volume heal test-volume info split-brain
    Brick server1:/gfs/test-volume_2 
    Number of entries: 12
    at                   path on brick
    ----------------------------------
    2012-06-13 04:02:05  /dir/file.83
    2012-06-13 04:02:05  /dir/file.28
    2012-06-13 04:02:05  /dir/file.69
    Brick server2:/gfs/test-volume_2
    Number of entries: 12
    at                   path on brick
    ----------------------------------
    2012-06-13 04:02:05  /dir/file.83
    2012-06-13 04:02:05  /dir/file.28
    2012-06-13 04:02:05  /dir/file.69
    ...
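When the output of the heal info commands above is long, the per-brick counts can be totalled with a small filter. This is a sketch that relies only on the "Number of entries:" lines shown in the examples above. The function name count_heal_entries is hypothetical, and because the same file can be listed under more than one brick, the total is a count of listed entries rather than of distinct files.

```shell
# Hypothetical helper (not a gluster command): sums the per-brick
# "Number of entries" lines read from standard input.
count_heal_entries() {
    awk '/Number of entries:/ { total += $NF } END { print total + 0 }'
}

# Intended use (requires a running trusted storage pool):
#   gluster volume heal test-volume info | count_heal_entries
```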