8.10. Managing Split-brain
- Data split-brain: The contents of the file under split-brain differ between the bricks of a replica pair, and automatic healing is not possible.
- Metadata split-brain: The metadata of the file (for example, a user-defined extended attribute) differs between the bricks, and automatic healing is not possible.
- Entry split-brain: This happens when a file has a different gfid on each brick of the replica pair.
8.10.1. Preventing Split-brain
8.10.1.1. Configuring Server-Side Quorum
Enable server-side quorum by setting the cluster.server-quorum-type volume option to server. For more information on this volume option, see Section 8.1, “Configuring Volume Options”.
Server-side quorum is enforced by the glusterd service. Whenever the glusterd service on a machine observes that the quorum is not met, it brings down the bricks to prevent data split-brain. When the network connections are brought back up and the quorum is restored, the bricks in the volume are brought back up. While the quorum is not met for a volume, commands that update the volume configuration, add peers, or detach peers are not allowed. Note that a glusterd service that is not running and a network connection that is down between two machines are treated the same way.
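To configure the quorum ratio for the trusted storage pool, run the following command:
# gluster volume set all cluster.server-quorum-ratio PERCENTAGE
For example, to set the quorum ratio to 51% of the trusted storage pool:
# gluster volume set all cluster.server-quorum-ratio 51%
To enable server-side quorum on a particular volume, run the following command:
# gluster volume set VOLNAME cluster.server-quorum-type server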
8.10.1.2. Configuring Client-Side Quorum
With client-side quorum, if the quorum is not met for m of n replica groups, only those m replica groups become read-only and the rest of the replica groups continue to allow data modifications.
Example 8.8. Client-Side Quorum
When the client-side quorum is not met for replica group A, only replica group A becomes read-only. Replica groups B and C continue to allow data modifications.
If cluster.quorum-type is set to fixed, writes are allowed as long as the number of bricks that are up and running in the replica set is at least the count specified in the cluster.quorum-count option. It does not matter whether the first, second, or third brick is up; all the bricks are equivalent here.
If cluster.quorum-type is set to auto, then at least ceil(n/2) bricks need to be up to allow writes, where n is the replica count. For example:
for replica 2, ceil(2/2) = 1 brick
for replica 3, ceil(3/2) = 2 bricks
for replica 4, ceil(4/2) = 2 bricks
for replica 5, ceil(5/2) = 3 bricks
for replica 6, ceil(6/2) = 3 bricks
and so on
In addition, for auto, if the number of bricks that are up is exactly ceil(n/2), and n is an even number, then the first brick of the replica set must also be up to allow writes. For replica 6, if more than 3 bricks are up, they can be any of the bricks. But if exactly 3 bricks are up, the first brick must be one of them.
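The fixed and auto rules above can be sketched as a small Python function. This is only an illustration of the rules as stated in this section, not Gluster's implementation; the function name writes_allowed and the list-of-booleans representation of a replica set are assumptions made for the example.

```python
import math


def writes_allowed(bricks_up, quorum_type, quorum_count=None):
    """Decide whether a replica set accepts writes under client-side quorum.

    bricks_up is a list of booleans, one per brick in the replica set,
    where index 0 stands for the first brick (hypothetical representation).
    """
    n = len(bricks_up)                       # replica count
    up = sum(1 for state in bricks_up if state)

    if quorum_type == "fixed":
        if quorum_count is None:
            raise ValueError("fixed quorum needs a quorum_count")
        # Any bricks count; only their number matters.
        return up >= quorum_count

    if quorum_type == "auto":
        needed = math.ceil(n / 2)
        if up < needed:
            return False
        # Tie-break for even replica counts: with exactly ceil(n/2)
        # bricks up, the first brick must be one of them.
        if n % 2 == 0 and up == needed:
            return bool(bricks_up[0])
        return True

    raise ValueError("unknown quorum type: %r" % quorum_type)
```

For a replica 2 set, writes_allowed([True, False], "auto") allows writes while writes_allowed([False, True], "auto") does not, which matches the first-brick requirement described above.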
- In a three-way replication setup, it is recommended to set cluster.quorum-type to auto to avoid split-brain. If the quorum is not met, the replica set becomes read-only.
Configure client-side quorum using the cluster.quorum-type and cluster.quorum-count options. For more information on these options, see Section 8.1, “Configuring Volume Options”.
Client-side quorum is enabled when you run the gluster volume set VOLNAME group virt command. In a two-replica setup, if the first brick in the replica pair is offline, virtual machines are paused because quorum is not met and writes are disallowed.
To disable client-side quorum, reset the option with the following command:
# gluster volume reset VOLNAME quorum-type
This example shows how to set server-side and client-side quorum on a Distribute Replicate volume to avoid a split-brain scenario. The example uses a 2 x 2 (4 bricks) Distribute Replicate setup.
# gluster volume info testvol
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 0df52d58-bded-4e5d-ac37-4c82f7c89cfh
Status: Created
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: server1:/bricks/brick1
Brick2: server2:/bricks/brick2
Brick3: server3:/bricks/brick3
Brick4: server4:/bricks/brick4
Enable server-side quorum on the volume:
# gluster volume set VOLNAME cluster.server-quorum-type server
Set the quorum ratio to 51% of the trusted storage pool:
# gluster volume set all cluster.server-quorum-ratio 51%
For client-side quorum, set the quorum-type option to auto to allow writes to the file only if the percentage of active replicate bricks is more than 50% of the total number of bricks that constitute that replica:
# gluster volume set VOLNAME quorum-type auto
If the number of bricks (n) in a replica set is an even number, the n/2 count must include the primary brick, and that brick must be up and running. If n is an odd number, the n/2 count can consist of any bricks that are up and running; that is, the primary brick need not be up and running to allow writes.
8.10.2. Recovering from File Split-brain
Steps to recover from a file split-brain
- Run the following command to obtain the path of the file that is in split-brain:
# gluster volume heal VOLNAME info split-brain
From the command output, identify the files for which file operations performed from the client keep failing with an Input/Output error.
- Close the applications that opened the split-brain file from the mount point. If you are using a virtual machine, you must power off the machine.
- Obtain and verify the AFR changelog extended attributes of the file using the getfattr command. Then identify the type of split-brain to determine which of the bricks contains the 'good copy' of the file.
# getfattr -d -m . -e hex <file-path-on-brick>
For example,
# getfattr -d -e hex -m. brick-a/file.txt
# file: brick-a/file.txt
security.selinux=0x726f6f743a6f626a6563745f723a66696c655f743a733000
trusted.afr.vol-client-2=0x000000000000000000000000
trusted.afr.vol-client-3=0x000000000200000000000000
trusted.gfid=0x307a5c9efddd4e7c96e94fd4bcdcbd1b
The extended attributes named trusted.afr.VOLNAME-client-<subvolume-index> are used by AFR to maintain the changelog of the file. The values of trusted.afr.VOLNAME-client-<subvolume-index> are calculated by the glusterFS client (FUSE or NFS-server) processes. When the glusterFS client modifies a file or directory, the client contacts each brick and updates the changelog extended attribute according to the response of the brick.
The subvolume-index is the brick number - 1 in the gluster volume info VOLNAME output.
For example,
# gluster volume info vol
Volume Name: vol
Type: Distributed-Replicate
Volume ID: 4f2d7849-fbd6-40a2-b346-d13420978a01
Status: Created
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
brick-a: server1:/gfs/brick-a
brick-b: server1:/gfs/brick-b
brick-c: server1:/gfs/brick-c
brick-d: server1:/gfs/brick-d
brick-e: server1:/gfs/brick-e
brick-f: server1:/gfs/brick-f
brick-g: server1:/gfs/brick-g
brick-h: server1:/gfs/brick-h
In the example above:
Brick         | Replica set | Brick subvolume index
---------------------------------------------------
/gfs/brick-a  | 0           | 0
/gfs/brick-b  | 0           | 1
/gfs/brick-c  | 1           | 2
/gfs/brick-d  | 1           | 3
/gfs/brick-e  | 2           | 4
/gfs/brick-f  | 2           | 5
/gfs/brick-g  | 3           | 6
/gfs/brick-h  | 3           | 7
Each file in a brick maintains the changelog of itself and that of the files present in all the other bricks in its replica set, as seen by that brick.
In the example volume given above, all files in brick-a will have 2 entries, one for itself and the other for the file present in its replica pair. The following are the changelog entries on brick-a:
- trusted.afr.vol-client-0=0x000000000000000000000000 - is the changelog for itself (brick-a)
- trusted.afr.vol-client-1=0x000000000000000000000000 - changelog for brick-b as seen by brick-a
Likewise, all files in brick-b will have the following:
- trusted.afr.vol-client-0=0x000000000000000000000000 - changelog for brick-a as seen by brick-b
- trusted.afr.vol-client-1=0x000000000000000000000000 - changelog for itself (brick-b)
The same can be extended for other replica pairs.
Interpreting changelog (approximate pending operation count) value
Each extended attribute has a value which is 24 hexadecimal digits. The first 8 digits represent the changelog of data. The second 8 digits represent the changelog of metadata. The last 8 digits represent the changelog of directory entries.
This can be represented pictorially as follows:
0x 000003d7 00000001 00000000
       |        |        |
       |        |        \_ changelog of directory entries
       |        \_ changelog of metadata
       \_ changelog of data
For directories, the metadata and entry changelogs are valid. For regular files, the data and metadata changelogs are valid. For special files such as device files, the metadata changelog is valid. When a file split-brain happens, it can be a data split-brain, a metadata split-brain, or both.
The following is an example of both data and metadata split-brain on the same file:
# getfattr -d -m . -e hex /gfs/brick-?/a
getfattr: Removing leading '/' from absolute path names
# file: gfs/brick-a/a
trusted.afr.vol-client-0=0x000000000000000000000000
trusted.afr.vol-client-1=0x000003d70000000100000000
trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57
# file: gfs/brick-b/a
trusted.afr.vol-client-0=0x000003b00000000100000000
trusted.afr.vol-client-1=0x000000000000000000000000
trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57
Scrutinize the changelogs
The changelog extended attributes on file /gfs/brick-a/a are as follows:
- The first 8 digits of trusted.afr.vol-client-0 are all zeros (0x00000000................), and the first 8 digits of trusted.afr.vol-client-1 are not all zeros (0x000003d7................). So the changelog on /gfs/brick-a/a implies that some data operations succeeded on itself but failed on /gfs/brick-b/a.
- The second 8 digits of trusted.afr.vol-client-0 are all zeros (0x........00000000........), and the second 8 digits of trusted.afr.vol-client-1 are not all zeros (0x........00000001........). So the changelog on /gfs/brick-a/a implies that some metadata operations succeeded on itself but failed on /gfs/brick-b/a.
The changelog extended attributes on file /gfs/brick-b/a are as follows:
- The first 8 digits of trusted.afr.vol-client-0 are not all zeros (0x000003b0................), and the first 8 digits of trusted.afr.vol-client-1 are all zeros (0x00000000................). So the changelog on /gfs/brick-b/a implies that some data operations succeeded on itself but failed on /gfs/brick-a/a.
- The second 8 digits of trusted.afr.vol-client-0 are not all zeros (0x........00000001........), and the second 8 digits of trusted.afr.vol-client-1 are all zeros (0x........00000000........). So the changelog on /gfs/brick-b/a implies that some metadata operations succeeded on itself but failed on /gfs/brick-a/a.
Here, both copies have data and metadata changes that are not on the other copy. Hence, it is both a data and a metadata split-brain.
Deciding on the correct copy
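When comparing the copies, it can help to decode the changelog values mechanically. The following Python sketch splits a 24-digit value into its data, metadata, and entry counters as described under "Interpreting changelog" above, lists which parts are pending, and shows how a value with one part zeroed is formed. The function names are illustrative and not part of any Gluster tool; the sketch only manipulates strings and never touches a brick.

```python
PARTS = ("data", "metadata", "entry")  # order of the three 8-digit fields


def parse_changelog(value):
    """Split a trusted.afr.* hex value into its three pending counters."""
    digits = value[2:] if value.lower().startswith("0x") else value
    if len(digits) != 24:
        raise ValueError("expected 24 hexadecimal digits, got %d" % len(digits))
    return {
        part: int(digits[i * 8:(i + 1) * 8], 16)
        for i, part in enumerate(PARTS)
    }


def pending_parts(value):
    """Return the parts (data, metadata, entry) with a non-zero counter."""
    counts = parse_changelog(value)
    return [part for part in PARTS if counts[part] != 0]


def reset_part(value, part):
    """Return the same changelog value with one part set to all zeros."""
    if part not in PARTS:
        raise ValueError("unknown changelog part: %r" % part)
    counts = parse_changelog(value)
    counts[part] = 0
    return "0x" + "".join("%08x" % counts[p] for p in PARTS)


# Values taken from the getfattr example in this section.
on_brick_a = "0x000003d70000000100000000"  # trusted.afr.vol-client-1 on brick-a
on_brick_b = "0x000003b00000000100000000"  # trusted.afr.vol-client-0 on brick-b

print(pending_parts(on_brick_a))           # ['data', 'metadata']
print(reset_part(on_brick_b, "data"))      # 0x000000000000000100000000
print(reset_part(on_brick_a, "metadata"))  # 0x000003d70000000000000000
```

The two reset values printed at the end are the ones passed to setfattr later in this procedure, for the case where the data of /gfs/brick-a/a and the metadata of /gfs/brick-b/a are retained.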
You must inspect the getfattr output of the files to decide which metadata to retain, and the contents of the files to decide which data to retain. Continuing with the example above, the data of /gfs/brick-a/a and the metadata of /gfs/brick-b/a are retained.
Resetting the relevant changelogs to resolve the split-brain
Resolving data split-brain
You must change the changelog extended attributes on the files as if some data operations succeeded on /gfs/brick-a/a but failed on /gfs/brick-b/a. But /gfs/brick-b/a should not have any changelog showing that data operations succeeded on /gfs/brick-b/a but failed on /gfs/brick-a/a. You must reset the data part of the changelog on trusted.afr.vol-client-0 of /gfs/brick-b/a.
Resolving metadata split-brain
You must change the changelog extended attributes on the files as if some metadata operations succeeded on /gfs/brick-b/a but failed on /gfs/brick-a/a. But /gfs/brick-a/a should not have any changelog which says that some metadata operations succeeded on /gfs/brick-a/a but failed on /gfs/brick-b/a. You must reset the metadata part of the changelog on trusted.afr.vol-client-1 of /gfs/brick-a/a.
Run the following commands to reset the extended attributes.
- On /gfs/brick-b/a, to change trusted.afr.vol-client-0 from 0x000003b00000000100000000 to 0x000000000000000100000000, execute the following command:
# setfattr -n trusted.afr.vol-client-0 -v 0x000000000000000100000000 /gfs/brick-b/a
- On /gfs/brick-a/a, to change trusted.afr.vol-client-1 from 0x000003d70000000100000000 to 0x000003d70000000000000000, execute the following command:
# setfattr -n trusted.afr.vol-client-1 -v 0x000003d70000000000000000 /gfs/brick-a/a
After you reset the extended attributes, the changelogs would look similar to the following:
# getfattr -d -m . -e hex /gfs/brick-?/a
getfattr: Removing leading '/' from absolute path names
# file: gfs/brick-a/a
trusted.afr.vol-client-0=0x000000000000000000000000
trusted.afr.vol-client-1=0x000003d70000000000000000
trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57
# file: gfs/brick-b/a
trusted.afr.vol-client-0=0x000000000000000100000000
trusted.afr.vol-client-1=0x000000000000000000000000
trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57
Resolving Directory entry split-brain
AFR can conservatively merge different entries in the directories when there is a split-brain on a directory. If a directory has entries 1, 2 on one brick and entries 3, 4 on the other brick, then AFR merges all of the entries so that the directory has entries 1, 2, 3, 4. However, this may cause deleted files to reappear if the split-brain happened because files were deleted from the directory. Split-brain resolution needs human intervention when at least one entry in that directory has the same file name but a different gfid.
For example:
On brick-a the directory has 2 entries, file1 and file2. On brick-b the directory has 2 entries, file1 and file3. Here the gfids of file1 on the two bricks are different. This kind of directory split-brain needs human intervention to resolve. You must remove file1 on either brick-a or brick-b to resolve the split-brain.
In addition, the corresponding gfid-link file must be removed. The gfid-link files are present in the .glusterfs directory in the top-level directory of the brick. If the gfid of the file is 0x307a5c9efddd4e7c96e94fd4bcdcbd1b (the trusted.gfid extended attribute received from the getfattr command earlier), the gfid-link file can be found under .glusterfs, in subdirectories named after the first two and the next two hexadecimal digits of the gfid (30/7a in this example).
Warning
Before deleting the gfid-link, you must ensure that there are no hard links to the file present on that brick. If hard links exist, you must delete them.
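To locate the gfid-link for a given trusted.gfid value, the path can be derived from the gfid itself. The Python sketch below assumes the commonly seen on-disk layout in which the link is named with the gfid in dashed UUID form and sits two directory levels below .glusterfs; that layout and the function name gfid_link_path are assumptions for illustration, so confirm the path on the brick before removing anything.

```python
import posixpath
import uuid


def gfid_link_path(brick_root, gfid_hex):
    """Build the expected gfid-link path for a trusted.gfid hex value.

    Assumed layout: <brick>/.glusterfs/<hex 0-1>/<hex 2-3>/<dashed uuid>
    """
    digits = gfid_hex[2:] if gfid_hex.lower().startswith("0x") else gfid_hex
    # uuid.UUID validates that the value is 32 hexadecimal digits and
    # renders it in the dashed 8-4-4-4-12 form.
    gfid = str(uuid.UUID(hex=digits))
    return posixpath.join(brick_root, ".glusterfs", gfid[0:2], gfid[2:4], gfid)


# Brick path and gfid taken from the examples in this section.
print(gfid_link_path("/gfs/brick-a", "0x307a5c9efddd4e7c96e94fd4bcdcbd1b"))
# /gfs/brick-a/.glusterfs/30/7a/307a5c9e-fddd-4e7c-96e9-4fd4bcdcbd1b
```

The function only builds a string. Checking for additional hard links and deleting the link remain manual steps, as described in the warning above.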
- Trigger self-heal by running one of the following commands:
# ls -l <file-path-on-gluster-mount>
or
# gluster volume heal VOLNAME
8.10.3. Triggering Self-Healing on Replicated Volumes
- To view the list of files that need healing:
# gluster volume heal VOLNAME info
For example, to view the list of files on test-volume that need healing:
# gluster volume heal test-volume info
Brick server1:/gfs/test-volume_0
Number of entries: 0
Brick server2:/gfs/test-volume_1
/95.txt
/32.txt
/66.txt
/35.txt
/18.txt
/26.txt - Possibly undergoing heal
/47.txt
/55.txt
/85.txt - Possibly undergoing heal
...
Number of entries: 101
- To trigger self-healing only on the files which require healing:
# gluster volume heal VOLNAME
For example, to trigger self-healing on files which require healing on test-volume:
# gluster volume heal test-volume
Heal operation on volume test-volume has been successful
- To trigger self-healing on all the files on a volume:
# gluster volume heal VOLNAME full
For example, to trigger self-heal on all the files on test-volume:
# gluster volume heal test-volume full
Heal operation on volume test-volume has been successful
- To view the list of files on a volume that are in a split-brain state:
# gluster volume heal VOLNAME info split-brain
For example, to view the list of files on test-volume that are in a split-brain state:
# gluster volume heal test-volume info split-brain
Brick server1:/gfs/test-volume_2
Number of entries: 12
at                    path on brick
----------------------------------
2012-06-13 04:02:05  /dir/file.83
2012-06-13 04:02:05  /dir/file.28
2012-06-13 04:02:05  /dir/file.69
Brick server2:/gfs/test-volume_2
Number of entries: 12
at                    path on brick
----------------------------------
2012-06-13 04:02:05  /dir/file.83
2012-06-13 04:02:05  /dir/file.28
2012-06-13 04:02:05  /dir/file.69
...
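When many files need healing, the heal info listing shown earlier in this section can be summarised with a small script. The Python sketch below groups the listed paths by brick; it assumes the plain-text layout of the gluster volume heal VOLNAME info example above, which may differ between releases, and the function name parse_heal_info is hypothetical.

```python
def parse_heal_info(output):
    """Group paths from 'gluster volume heal VOLNAME info' text by brick."""
    entries = {}
    brick = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            # Start of a new per-brick section.
            brick = line[len("Brick "):]
            entries[brick] = []
        elif brick is not None and line.startswith("/"):
            # Drop a trailing status note such as
            # " - Possibly undergoing heal" (a path that itself
            # contains " - " would be cut short by this).
            entries[brick].append(line.partition(" - ")[0])
    return entries


# Shortened copy of the example output from this section.
SAMPLE = """\
Brick server1:/gfs/test-volume_0
Number of entries: 0
Brick server2:/gfs/test-volume_1
/95.txt
/32.txt
/26.txt - Possibly undergoing heal
Number of entries: 3
"""

for brick, paths in parse_heal_info(SAMPLE).items():
    print(brick, len(paths))
```

Lines that are neither a brick header nor a path, such as the "Number of entries" summary, are ignored, so the counts come from the listed paths only.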