11.13. Managing Split-brain

Split-brain is a state when a data or availability inconsistencies originating from the maintenance of two separate data sets with overlap in scope, either because of servers in a network design, or a failure condition based on servers not communicating and synchronizing their data to each other.
In Red Hat Gluster Storage, split-brain is a term applicable to Red Hat Gluster Storage volumes in a replicate configuration. A file is said to be in split-brain when the copies of the same file in different bricks that constitute the replica-pair have mismatching data and/or meta-data contents such that they are conflicting each other and automatic healing is not possible. In this scenario, you can decide which is the correct file (source) and which is the one that require healing (sink) by inspecting at the mismatching files from the backend bricks.
The AFR translator in glusterFS makes use of extended attributes to keep track of the operations on a file. These attributes determine which brick is the source and which brick is the sink for a file that require healing. If the files are clean, the extended attributes are all zeroes indicating that no heal is necessary. When a heal is required, they are marked in such a way that there is a distinguishable source and sink and the heal can happen automatically. But, when a split-brain occurs, these extended attributes are marked in such a way that both bricks mark themselves as sources, making automatic healing impossible.
When a split-brain occurs, applications cannot perform certain operations like read and write on the file. Accessing the files results in the application receiving an Input/Output Error.
The three types of split-brains that occur in Red Hat Gluster Storage are:
  • Data split-brain: Contents of the file under split-brain are different in different replica pairs and automatic healing is not possible.
  • Metadata split-brain : The metadata of the files (example, user defined extended attribute) are different and automatic healing is not possible.
  • Entry split-brain: This happens when a file have different gfids on each of the replica pair.
The only way to resolve split-brains is by manually inspecting the file contents from the backend and deciding which is the true copy (source ) and modifying the appropriate extended attributes such that healing can happen automatically.

11.13.1. Preventing Split-brain

To prevent split-brain in the trusted storage pool, you must configure server-side and client-side quorum.

11.13.1.1. Configuring Server-Side Quorum

The quorum configuration in a trusted storage pool determines the number of server failures that the trusted storage pool can sustain. If an additional failure occurs, the trusted storage pool will become unavailable. If too many server failures occur, or if there is a problem with communication between the trusted storage pool nodes, it is essential that the trusted storage pool be taken offline to prevent data loss.
After configuring the quorum ratio at the trusted storage pool level, you must enable the quorum on a particular volume by setting cluster.server-quorum-type volume option as server. For more information on this volume option, see Section 11.1, “Configuring Volume Options”.
Configuration of the quorum is necessary to prevent network partitions in the trusted storage pool. Network Partition is a scenario where, a small set of nodes might be able to communicate together across a functioning part of a network, but not be able to communicate with a different set of nodes in another part of the network. This can cause undesirable situations, such as split-brain in a distributed system. To prevent a split-brain situation, all the nodes in at least one of the partitions must stop running to avoid inconsistencies.
This quorum is on the server-side, that is, the glusterd service. Whenever the glusterd service on a machine observes that the quorum is not met, it brings down the bricks to prevent data split-brain. When the network connections are brought back up and the quorum is restored, the bricks in the volume are brought back up. When the quorum is not met for a volume, any commands that update the volume configuration or peer addition or detach are not allowed. It is to be noted that both, the glusterd service not running and the network connection between two machines being down are treated equally.
You can configure the quorum percentage ratio for a trusted storage pool. If the percentage ratio of the quorum is not met due to network outages, the bricks of the volume participating in the quorum in those nodes are taken offline. By default, the quorum is met if the percentage of active nodes is more than 50% of the total storage nodes. However, if the quorum ratio is manually configured, then the quorum is met only if the percentage of active storage nodes of the total storage nodes is greater than or equal to the set value.
To configure the quorum ratio, use the following command:
# gluster volume set all cluster.server-quorum-ratio PERCENTAGE
For example, to set the quorum to 51% of the trusted storage pool:
# gluster volume set all cluster.server-quorum-ratio 51%
In this example, the quorum ratio setting of 51% means that more than half of the nodes in the trusted storage pool must be online and have network connectivity between them at any given time. If a network disconnect happens to the storage pool, then the bricks running on those nodes are stopped to prevent further writes.
You must ensure to enable the quorum on a particular volume to participate in the server-side quorum by running the following command:
# gluster volume set VOLNAME cluster.server-quorum-type server

Important

For a two-node trusted storage pool, it is important to set the quorum ratio to be greater than 50% so that two nodes separated from each other do not both believe they have a quorum.
For a replicated volume with two nodes and one brick on each machine, if the server-side quorum is enabled and one of the nodes goes offline, the other node will also be taken offline because of the quorum configuration. As a result, the high availability provided by the replication is ineffective. To prevent this situation, a dummy node can be added to the trusted storage pool which does not contain any bricks. This ensures that even if one of the nodes which contains data goes offline, the other node will remain online. Note that if the dummy node and one of the data nodes goes offline, the brick on other node will be also be taken offline, and will result in data unavailability.

11.13.1.2. Configuring Client-Side Quorum

By default, when replication is configured, clients can modify files as long as at least one brick in the replica group is available. If network partitioning occurs, different clients are only able to connect to different bricks in a replica set, potentially allowing different clients to modify a single file simultaneously.
For example, imagine a three-way replicated volume is accessed by two clients, C1 and C2, who both want to modify the same file. If network partitioning occurs such that client C1 can only access brick B1, and client C2 can only access brick B2, then both clients are able to modify the file independently, creating split-brain conditions on the volume. The file becomes unusable, and manual intervention is required to correct the issue.
Client-side quorum allows administrators to set a minimum number of bricks that a client must be able to access in order to allow data in the volume to be modified. If client-side quorum is not met, files in the replica set are treated as read-only. This is useful when three-way replication is configured.
Client-side quorum is configured on a per-volume basis, and applies to all replica sets in a volume. If client-side quorum is not met for X of Y volume sets, only X volume sets are treated as read-only; the remaining volume sets continue to allow data modification.

Client-Side Quorum Options

cluster.quorum-count
The minimum number of bricks that must be available in order for writes to be allowed. This is set on a per-volume basis. Valid values are between 1 and the number of bricks in a replica set. This option is used by the cluster.quorum-type option to determine write behavior.
cluster.quorum-type
Determines when the client is allowed to write to a volume. Valid values are fixed and auto.
If cluster.quorum-type is fixed, writes are allowed as long as the number of bricks available in the replica set is greater than or equal to the value of the cluster.quorum-count option.
If cluster.quorum-type is auto, writes are allowed when at least 50% of the bricks in a replica set are be available. In a replica set with an even number of bricks, if exactly 50% of the bricks are available, the first brick in the replica set must be available in order for writes to continue.
In a three-way replication setup, it is recommended to set cluster.quorum-type to auto to avoid split-brains. If the quorum is not met, the replica pair becomes read-only.

Example 11.8. Client-Side Quorum

In the above scenario, when the client-side quorum is not met for replica group A, only replica group A becomes read-only. Replica groups B and C continue to allow data modifications.
Configure the client-side quorum using cluster.quorum-type and cluster.quorum-count options.

Important

When you integrate Red Hat Gluster Storage with Red Hat Enterprise Virtualization or Red Hat OpenStack, the client-side quorum is enabled when you run gluster volume set VOLNAME group virt command. If on a two replica set up, if the first brick in the replica pair is offline, virtual machines will be paused because quorum is not met and writes are disallowed.
Consistency is achieved at the cost of fault tolerance. If fault-tolerance is preferred over consistency, disable client-side quorum with the following command:
# gluster volume reset VOLNAME quorum-type
Example - Setting up server-side and client-side quorum to avoid split-brain scenario

This example provides information on how to set server-side and client-side quorum on a Distribute Replicate volume to avoid split-brain scenario. The configuration of this example has 2 X 2 ( 4 bricks) Distribute Replicate setup.

# gluster volume info testvol
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 0df52d58-bded-4e5d-ac37-4c82f7c89cfh
Status: Created
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: server1:/rhgs/brick1
Brick2: server2:/rhgs/brick2
Brick3: server3:/rhgs/brick3
Brick4: server4:/rhgs/brick4
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
Setting Server-side Quorum
Enable the quorum on a particular volume to participate in the server-side quorum by running the following command:
# gluster volume set VOLNAME cluster.server-quorum-type server
Set the quorum to 51% of the trusted storage pool:
# gluster volume set all cluster.server-quorum-ratio 51%
In this example, the quorum ratio setting of 51% means that more than half of the nodes in the trusted storage pool must be online and have network connectivity between them at any given time. If a network disconnect happens to the storage pool, then the bricks running on those nodes are stopped to prevent further writes.
Setting Client-side Quorum
Set the quorum-typeoption to auto to allow writes to the file only if the percentage of active replicate bricks is more than 50% of the total number of bricks that constitute that replica.
# gluster volume set VOLNAME quorum-type auto
In this example, as there are only two bricks in the replica pair, the first brick must be up and running to allow writes.

Important

Atleast n/2 bricks need to be up for the quorum to be met. If the number of bricks (n) in a replica set is an even number, it is mandatory that the n/2 count must consist of the primary brick and it must be up and running. If n is an odd number, the n/2 count can have any brick up and running, that is, the primary brick need not be up and running to allow writes.

11.13.2. Recovering from File Split-brain

You can recover from the data and meta-data split-brain using one of the following methods:
For information on resolving gfid/entry split-brain, see Chapter 26, Manually Recovering File Split-brain .

11.13.2.1.  Recovering File Split-brain from the Mount Point

Steps to recover from a split-brain from the mount point

  1. You can use a set of getfattr and setfattr commands to detect the data and meta-data split-brain status of a file and resolve split-brain from the mount point.

    Important

    This process for split-brain resolution from mount will not work on NFS mounts as it does not provide extended attributes support.
    In this example, the test-volume volume has bricks brick0, brick1, brick2 and brick3.
    # gluster volume info test-volume
    Volume Name: test-volume
    Type: Distributed-Replicate
    Status: Started
    Snapshot Count: 0
    Number of Bricks: 2 x 2 = 4
    Transport-type: tcp
    Bricks:
    Brick1: test-host:/rhgs/brick0
    Brick2: test-host:/rhgs/brick1
    Brick3: test-host:/rhgs/brick2
    Brick4: test-host:/rhgs/brick3
    Options Reconfigured:
    transport.address-family: inet
    nfs.disable: on
    Directory structure of the bricks is as follows:
    # tree -R /test/b?
    /rhgs/brick0
    ├── dir
    │   └── a
    └── file100
    
    /rhgs/brick1
    ├── dir
    │   └── a
    └── file100
    
    /rhgs/brick2
    ├── dir
    ├── file1
    ├── file2
    └── file99
    
    /rhgs/brick3
    ├── dir
    ├── file1
    ├── file2
    └── file99
    In the following output, some of the files in the volume are in split-brain.
    # gluster volume heal test-volume info split-brain
    Brick test-host:/rhgs/brick0/
    /file100
    /dir
    Number of entries in split-brain: 2
    
    Brick test-host:/rhgs/brick1/
    /file100
    /dir
    Number of entries in split-brain: 2
    
    Brick test-host:/rhgs/brick2/
    /file99
    <gfid:5399a8d1-aee9-4653-bb7f-606df02b3696>
    Number of entries in split-brain: 2
    
    Brick test-host:/rhgs/brick3/
    <gfid:05c4b283-af58-48ed-999e-4d706c7b97d5>
    <gfid:5399a8d1-aee9-4653-bb7f-606df02b3696>
    Number of entries in split-brain: 2
    To know data or meta-data split-brain status of a file:
    # getfattr -n replica.split-brain-status <path-to-file>
    The above command executed from mount provides information if a file is in data or meta-data split-brain. This command is not applicable to gfid/entry split-brain.
    For example,
    • file100 is in meta-data split-brain. Executing the above mentioned command for file100 gives :
      # getfattr -n replica.split-brain-status file100
      # file: file100
      replica.split-brain-status="data-split-brain:no    metadata-split-brain:yes    Choices:test-client-0,test-client-1"
    • file1 is in data split-brain.
      # getfattr -n replica.split-brain-status file1
      # file: file1
      replica.split-brain-status="data-split-brain:yes    metadata-split-brain:no    Choices:test-client-2,test-client-3"
    • file99 is in both data and meta-data split-brain.
      # getfattr -n replica.split-brain-status file99
      # file: file99
      replica.split-brain-status="data-split-brain:yes    metadata-split-brain:yes    Choices:test-client-2,test-client-3"
    • dir is in gfid/entry split-brain but as mentioned earlier, the above command is does not display if the file is in gfid/entry split-brain. Hence, the command displays The file is not under data or metadata split-brain. For information on resolving gfid/entry split-brain, see Chapter 26, Manually Recovering File Split-brain .
      # getfattr -n replica.split-brain-status dir
      # file: dir
      replica.split-brain-status="The file is not under data or metadata split-brain"
    • file2 is not in any kind of split-brain.
      # getfattr -n replica.split-brain-status file2
      # file: file2
      replica.split-brain-status="The file is not under data or metadata split-brain"
  2. Analyze the files in data and meta-data split-brain and resolve the issue

    When you perform operations like cat, getfattr, and more from the mount on files in split-brain, it throws an input/output error. For further analyzing such files, you can use setfattr command.

    # setfattr -n replica.split-brain-choice -v "choiceX" <path-to-file>
    Using this command, a particular brick can be chosen to access the file in split-brain.
    For example,
    file1 is in data-split-brain and when you try to read from the file, it throws input/output error.
    # cat file1
    cat: file1: Input/output error
    Split-brain choices provided for file1 were test-client-2 and test-client-3.
    Setting test-client-2 as split-brain choice for file1 serves reads from b2 for the file.
    # setfattr -n replica.split-brain-choice -v test-client-2 file1
    Now, you can perform operations on the file. For example, read operations on the file:
    # cat file1
    xyz
    Similarly, to inspect the file from other choice, replica.split-brain-choice is to be set to test-client-3.
    Trying to inspect the file from a wrong choice errors out. You can undo the split-brain-choice that has been set, the above mentioned setfattr command can be used with none as the value for extended attribute.
    For example,
    # setfattr -n replica.split-brain-choice -v none file1
    Now performing cat operation on the file will again result in input/output error, as before.
    # cat file
    cat: file1: Input/output error
    After you decide which brick to use as a source for resolving the split-brain, it must be set for the healing to be done.
    # setfattr -n replica.split-brain-heal-finalize -v <heal-choice> <path-to-file>
    Example
    # setfattr -n replica.split-brain-heal-finalize -v test-client-2 file1
    The above process can be used to resolve data and/or meta-data split-brain on all the files.
    Setting the split-brain-choice on the file
    After setting the split-brain-choice on the file, the file can be analyzed only for five minutes. If the duration of analyzing the file needs to be increased, use the following command and set the required time in timeout-in-minute argument.
    # setfattr -n replica.split-brain-choice-timeout -v <timeout-in-minutes> <mount_point/file>
    This is a global timeout and is applicable to all files as long as the mount exists. The timeout need not be set each time a file needs to be inspected but for a new mount it will have to be set again for the first time. This option becomes invalid if the operations like add-brick or remove-brick are performed.

    Note

    If fopen-keep-cache FUSE mount option is disabled, then inode must be invalidated each time before selecting a new replica.split-brain-choice to inspect a file using the following command:
    # setfattr -n inode-invalidate -v 0 <path-to-file>

11.13.2.2. Recovering File Split-brain from the gluster CLI

You can resolve the split-brin from the gluster CLI by the following ways:
  • Use bigger-file as source
  • Use the file with latest mtime as source
  • Use one replica as source for a particular file
  • Use one replica as source for all files

Note

The entry/gfid split-brain resolution is not supported using CLI. For information on resolving gfid/entry split-brain, see Chapter 26, Manually Recovering File Split-brain .
Selecting the bigger-file as source

This method is useful for per file healing and where you can decided that the file with bigger size is to be considered as source.

  1. Run the following command to obtain the list of files that are in split-brain:
    # gluster volume heal VOLNAME info split-brain
    Brick <hostname:brickpath-b1>
    <gfid:aaca219f-0e25-4576-8689-3bfd93ca70c2>
    <gfid:39f301ae-4038-48c2-a889-7dac143e82dd>
    <gfid:c3c94de2-232d-4083-b534-5da17fc476ac>
    Number of entries in split-brain: 3
    
    Brick <hostname:brickpath-b2>
    /dir/file1
    /dir
    /file4
    Number of entries in split-brain: 3
    From the command output, identify the files that are in split-brain.
    You can find the differences in the file size and md5 checksums by performing a stat and md5 checksums on the file from the bricks. The following is the stat and md5 checksum output of a file:
    On brick b1:
    # stat b1/dir/file1
      File: ‘b1/dir/file1’
      Size: 17              Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919362      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 13:55:40.149897333 +0530
    Modify: 2015-03-06 13:55:37.206880347 +0530
    Change: 2015-03-06 13:55:37.206880347 +0530
     Birth: -
    
    # md5sum b1/dir/file1
    040751929ceabf77c3c0b3b662f341a8  b1/dir/file1
    
    On brick b2:
    # stat b2/dir/file1
      File: ‘b2/dir/file1’
      Size: 13              Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919365      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 13:54:22.974451898 +0530
    Modify: 2015-03-06 13:52:22.910758923 +0530
    Change: 2015-03-06 13:52:22.910758923 +0530
     Birth: -
    
    # md5sum b2/dir/file1
    cb11635a45d45668a403145059c2a0d5  b2/dir/file1
    You can notice the differences in the file size and md5 checksums.
  2. Execute the following command along with the full file name as seen from the root of the volume (or) the gfid-string representation of the file, which is displayed in the heal info command's output.
    # gluster volume heal <VOLNAME> split-brain bigger-file <FILE>
    For example,
    # gluster volume heal test-volume split-brain bigger-file /dir/file1
    Healed /dir/file1.
After the healing is complete, the md5sum and file size on both bricks must be same. The following is a sample output of the stat and md5 checksums command after completion of healing the file.
On brick b1:
# stat b1/dir/file1
  File: ‘b1/dir/file1’
  Size: 17              Blocks: 16         IO Block: 4096   regular file
Device: fd03h/64771d    Inode: 919362      Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-03-06 14:17:27.752429505 +0530
Modify: 2015-03-06 13:55:37.206880347 +0530
Change: 2015-03-06 14:17:12.880343950 +0530
 Birth: -

# md5sum b1/dir/file1
040751929ceabf77c3c0b3b662f341a8  b1/dir/file1

On brick b2:
# stat b2/dir/file1
  File: ‘b2/dir/file1’
  Size: 17              Blocks: 16         IO Block: 4096   regular file
Device: fd03h/64771d    Inode: 919365      Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-03-06 14:17:23.249403600 +0530
Modify: 2015-03-06 13:55:37.206880000 +0530
Change: 2015-03-06 14:17:12.881343955 +0530
 Birth: -

# md5sum b2/dir/file1
040751929ceabf77c3c0b3b662f341a8  b2/dir/file1
Selecting the file with latest mtime as source

This method is useful for per file healing and if you want the file with latest mtime has to be considered as source.

  1. Run the following command to obtain the list of files that are in split-brain:
    # gluster volume heal VOLNAME info split-brain
    Brick <hostname:brickpath-b1>
    <gfid:aaca219f-0e25-4576-8689-3bfd93ca70c2>
    <gfid:39f301ae-4038-48c2-a889-7dac143e82dd>
    <gfid:c3c94de2-232d-4083-b534-5da17fc476ac>
    Number of entries in split-brain: 3
    
    Brick <hostname:brickpath-b2>
    /dir/file1
    /dir
    /file4
    Number of entries in split-brain: 3
    From the command output, identify the files that are in split-brain.
    You can find the differences in the file size and md5 checksums by performing a stat and md5 checksums on the file from the bricks. The following is the stat and md5 checksum output of a file:
    On brick b1:
    
     stat b1/file4
      File: ‘b1/file4’
        Size: 4               Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919356      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 13:53:19.417085062 +0530
    Modify: 2015-03-06 13:53:19.426085114 +0530
    Change: 2015-03-06 13:53:19.426085114 +0530
     Birth: -
    
    
    # md5sum b1/file4
    b6273b589df2dfdbd8fe35b1011e3183  b1/file4
    
    On brick b2:
    
    # stat b2/file4
      File: ‘b2/file4’
      Size: 4               Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919358      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 13:52:35.761833096 +0530
    Modify: 2015-03-06 13:52:35.769833142 +0530
    Change: 2015-03-06 13:52:35.769833142 +0530
     Birth: -
    
    
    # md5sum b2/file4
    0bee89b07a248e27c83fc3d5951213c1  b2/file4
    You can notice the differences in the md5 checksums, and the modify time.
  2. Execute the following command
    # gluster volume heal <VOLNAME> split-brain latest-mtime <FILE>
    In this command, FILE can be either the full file name as seen from the root of the volume or the gfid-string representation of the file.
    For example,
    #gluster volume heal test-volume split-brain latest-mtime /file4
    Healed /file4
    
    After the healing is complete, the md5 checksum, file size, and modify time on both bricks must be same. The following is a sample output of the stat and md5 checksums command after completion of healing the file. You can notice that the file has been healed using the brick having the latest mtime (brick b1, in this example) as the source.
    On brick b1:
    # stat b1/file4
      File: ‘b1/file4’
      Size: 4               Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919356      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 14:23:38.944609863 +0530
    Modify: 2015-03-06 13:53:19.426085114 +0530
    Change: 2015-03-06 14:27:15.058927962 +0530
     Birth: -
    
    # md5sum b1/file4
    b6273b589df2dfdbd8fe35b1011e3183  b1/file4
    
    On brick b2:
    # stat b2/file4
     File: ‘b2/file4’
       Size: 4               Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919358      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 14:23:38.944609000 +0530
    Modify: 2015-03-06 13:53:19.426085000 +0530
    Change: 2015-03-06 14:27:15.059927968 +0530
     Birth:
    
    # md5sum b2/file4
    b6273b589df2dfdbd8fe35b1011e3183  b2/file4
Selecting one replica as source for a particular file

This method is useful if you know which file is to be considered as source.

  1. Run the following command to obtain the list of files that are in split-brain:
    # gluster volume heal VOLNAME info split-brain
    Brick <hostname:brickpath-b1>
    <gfid:aaca219f-0e25-4576-8689-3bfd93ca70c2>
    <gfid:39f301ae-4038-48c2-a889-7dac143e82dd>
    <gfid:c3c94de2-232d-4083-b534-5da17fc476ac>
    Number of entries in split-brain: 3
    
    Brick <hostname:brickpath-b2>
    /dir/file1
    /dir
    /file4
    Number of entries in split-brain: 3
    From the command output, identify the files that are in split-brain.
    You can find the differences in the file size and md5 checksums by performing a stat and md5 checksums on the file from the bricks. The following is the stat and md5 checksum output of a file:
    On brick b1:
    
     stat b1/file4
      File: ‘b1/file4’
      Size: 4               Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919356      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 13:53:19.417085062 +0530
    Modify: 2015-03-06 13:53:19.426085114 +0530
    Change: 2015-03-06 13:53:19.426085114 +0530
     Birth: -
    
    # md5sum b1/file4
    b6273b589df2dfdbd8fe35b1011e3183  b1/file4
    
    On brick b2:
    
    # stat b2/file4
      File: ‘b2/file4’
      Size: 4               Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919358      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 13:52:35.761833096 +0530
    Modify: 2015-03-06 13:52:35.769833142 +0530
    Change: 2015-03-06 13:52:35.769833142 +0530
     Birth: -
    
    # md5sum b2/file4
    0bee89b07a248e27c83fc3d5951213c1  b2/file4
    You can notice the differences in the file size and md5 checksums.
  2. Execute the following command
    # gluster volume heal <VOLNAME> split-brain source-brick <HOSTNAME:BRICKNAME> <FILE>
    In this command, FILE present in <HOSTNAME:BRICKNAME> is taken as source for healing.
    For example,
    # gluster volume heal test-volume split-brain source-brick test-host:b1 /file4
    Healed /file4
    After the healing is complete, the md5 checksum and file size on both bricks must be same. The following is a sample output of the stat and md5 checksums command after completion of healing the file.
    On brick b1:
    # stat b1/file4
      File: ‘b1/file4’
      Size: 4               Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919356      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 14:23:38.944609863 +0530
    Modify: 2015-03-06 13:53:19.426085114 +0530
    Change: 2015-03-06 14:27:15.058927962 +0530
     Birth: -
    
    # md5sum b1/file4
    b6273b589df2dfdbd8fe35b1011e3183  b1/file4
    
    On brick b2:
    # stat b2/file4
     File: ‘b2/file4’
      Size: 4               Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919358      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 14:23:38.944609000 +0530
    Modify: 2015-03-06 13:53:19.426085000 +0530
    Change: 2015-03-06 14:27:15.059927968 +0530
     Birth: -
    
    # md5sum b2/file4
    b6273b589df2dfdbd8fe35b1011e3183  b2/file4
Selecting one replica as source for all files

This method is useful if you know want to use a particular brick as a source for the split-brain files in that replica pair.

  1. Run the following command to obtain the list of files that are in split-brain:
    # gluster volume heal VOLNAME info split-brain
    From the command output, identify the files that are in split-brain.
  2. Execute the following command
    # gluster volume heal <VOLNAME> split-brain source-brick <HOSTNAME:BRICKNAME>
    In this command, for all the files that are in split-brain in this replica, <HOSTNAME:BRICKNAME> is taken as source for healing.
    For example,
    # gluster volume heal test-volume split-brain source-brick test-host:b1

11.13.3. Triggering Self-Healing on Replicated Volumes

For replicated volumes, when a brick goes offline and comes back online, self-healing is required to re-sync all the replicas. There is a self-heal daemon which runs in the background, and automatically initiates self-healing every 10 minutes on any files which require healing.
Multithreaded Self-heal

Self-heal daemon has the capability to handle multiple heals in parallel and is supported on Replicate and Distribute-replicate volumes. However, increasing the number of heals has impact on I/O performance so the following options have been provided. The cluster.shd-max-threads volume option controls the number of entries that can be self healed in parallel on each replica by self-heal daemon using. Using cluster.shd-wait-qlength volume option, you can configure the number of entries that must be kept in the queue for self-heal daemon threads to take up as soon as any of the threads are free to heal.

For more information on cluster.shd-max-threads and cluster.shd-wait-qlength volume set options, see Section 11.1, “Configuring Volume Options”.
There are various commands that can be used to check the healing status of volumes and files, or to manually initiate healing:
  • To view the list of files that need healing:
    # gluster volume heal VOLNAME info
    For example, to view the list of files on test-volume that need healing:
    # gluster volume heal test-volume info
    Brick server1:/gfs/test-volume_0
    Number of entries: 0
    
    Brick server2:/gfs/test-volume_1
    /95.txt
    /32.txt
    /66.txt
    /35.txt
    /18.txt
    /26.txt - Possibly undergoing heal
    /47.txt
    /55.txt
    /85.txt - Possibly undergoing heal
    ...
    Number of entries: 101
  • To trigger self-healing only on the files which require healing:
    # gluster volume heal VOLNAME
    For example, to trigger self-healing on files which require healing on test-volume:
    # gluster volume heal test-volume
    Heal operation on volume test-volume has been successful
  • To trigger self-healing on all the files on a volume:
    # gluster volume heal VOLNAME full
    For example, to trigger self-heal on all the files on test-volume:
    # gluster volume heal test-volume full
    Heal operation on volume test-volume has been successful
  • To view the list of files on a volume that are in a split-brain state:
    # gluster volume heal VOLNAME info split-brain
    For example, to view the list of files on test-volume that are in a split-brain state:
    # gluster volume heal test-volume info split-brain
    Brick server1:/gfs/test-volume_2
    Number of entries: 12
    at                   path on brick
    ----------------------------------
    2012-06-13 04:02:05  /dir/file.83
    2012-06-13 04:02:05  /dir/file.28
    2012-06-13 04:02:05  /dir/file.69
    Brick server2:/gfs/test-volume_2
    Number of entries: 12
    at                   path on brick
    ----------------------------------
    2012-06-13 04:02:05  /dir/file.83
    2012-06-13 04:02:05  /dir/file.28
    2012-06-13 04:02:05  /dir/file.69
    ...