5.7. Creating Arbitrated Replicated Volumes

An arbitrated replicated volume contains two full copies of the files in the volume. Arbitrated volumes have an extra arbiter brick for every two data bricks in the volume. Arbiter bricks do not store file data; they only store file names, structure, and metadata. Arbiter bricks use client quorum to compare metadata on the arbiter with the metadata of the other nodes to ensure consistency in the volume and prevent split-brain conditions.

Advantages of arbitrated replicated volumes

Better consistency
When an arbiter is configured, arbitration logic uses client-side quorum in auto mode to prevent file operations that would lead to split-brain conditions.
Less disk space required
Because an arbiter brick only stores file names and metadata, an arbiter brick can be much smaller than the other bricks in the volume.
Fewer nodes required
The node that contains the arbiter brick of one volume can be configured with the data brick of another volume. This "chaining" configuration allows you to use fewer nodes to fulfill your overall storage requirements.
Easy migration from deprecated two-way replicated volumes
Red Hat Gluster Storage can convert a two-way replicated volume without arbiter bricks into an arbitrated replicated volume. See Section 5.7.5, “Converting to an arbitrated volume” for details.

Limitations of arbitrated replicated volumes

  • Arbitrated replicated volumes provide better data consistency than a two-way replicated volume that does not have arbiter bricks. However, because arbitrated replicated volumes store only metadata, they provide the same level of availability as a two-way replicated volume that does not have arbiter bricks. To achieve high-availability, you need to use a three-way replicated volume instead of an arbitrated replicated volume.
  • Tiering is not compatible with arbitrated replicated volumes.
  • Arbitrated volumes can only be configured in sets of three bricks at a time. Red Hat Gluster Storage can convert an existing two-way replicated volume without arbiter bricks into an arbitrated replicated volume by adding an arbiter brick to that volume. See Section 5.7.5, “Converting to an arbitrated volume” for details.

5.7.1. Arbitrated volume requirements

This section outlines the requirements of a supported arbitrated volume deployment.

5.7.1.1. System requirements for nodes hosting arbiter bricks

The minimum system requirements for a node that contains an arbiter brick differ depending on the configuration choices made by the administrator. See Section 5.7.4, “Creating multiple arbitrated replicated volumes across fewer total nodes” for details about the differences between the dedicated arbiter and chained arbiter configurations.

Table 5.1. Requirements for arbitrated configurations on physical machines

Configuration typeMin CPUMin RAMNICArbiter Brick SizeMax Latency
Dedicated arbiter64-bit quad-core processor with 2 sockets8 GB[a]Match to other nodes in the storage pool1 TB to 4 TB[b]5 ms[c]
Chained arbiterMatch to other nodes in the storage pool1 TB to 4 TB[d]5 ms[e]
[a] More RAM may be necessary depending on the combined capacity of the number of arbiter bricks on the node.
[b] Arbiter and data bricks can be configured on the same device provided that the data and arbiter bricks belong to different replica sets. See Section 5.7.1.2, “Arbiter capacity requirements” for further details on sizing arbiter volumes.
[c] This is the maximum round trip latency requirement between all nodes irrespective of Aribiter node. See KCS#413623 to know how to determine latency between nodes.
[d] Multiple bricks can be created on a single RAIDed physical device. Please refer the following product documentation: Section 19.2, “Brick Configuration”
[e] This is the maximum round trip latency requirement between all nodes irrespective of Aribiter node. See KCS#413623 to know how to determine latency between nodes.
The requirements for arbitrated configurations on virtual machines are:
  • minimum 4 vCPUs
  • minimum 16 GB RAM
  • 1 TB to 4 TB of virtual disk space
  • maximum 5 ms latency

5.7.1.2. Arbiter capacity requirements

Because an arbiter brick only stores file names and metadata, an arbiter brick can be much smaller than the other bricks in the volume or replica set. The required size for an arbiter brick depends on the number of files being stored on the volume.
The recommended minimum arbiter brick size can be calculated with the following formula:
minimum arbiter brick size = 4 KB * ( size in KB of largest data brick in volume or replica set / average file size in KB)
For example, if you have two 1 TB data bricks, and the average size of the files is 2 GB, then the recommended minimum size for your arbiter brick 2 MB, as shown in the following example:
minimum arbiter brick size  = 4 KB * ( 1 TB / 2 GB )
                            = 4 KB * ( 1000000000 KB / 2000000 KB )
                            = 4 KB * 500 KB
                            = 2000 KB
                            = 2 MB
If sharding is enabled, and your shard-block-size is smaller than the average file size in KB, then you need to use the following formula instead, because each shard also has a metadata file:
minimum arbiter brick size = 4 KB * ( size in KB of largest data brick in volume or replica set / shard block size in KB )
Alternatively, if you know how many files you will store in a volume, the recommended minimum arbiter brick size is the maximum number of files multiplied by 4 KB. For example, if you expect to have 200,000 files on your volume, your arbiter brick should be at least 800,000 KB, or 0.8 GB, in size.
Red Hat also recommends overprovisioning where possible so that there is no short-term need to increase the size of the arbiter brick. Also, refer to Brick Configuration for more information on usage of maxpct.

5.7.2. Arbitration logic

In an arbitrated volume, whether a file operation is permitted depends on the current state of the bricks in the volume. The following table describes arbitration behavior in all possible volume states.

Table 5.2. Allowed operations for current volume state

Volume stateArbitration behavior
All bricks availableAll file operations permitted.
Arbiter and 1 data brick available
If the arbiter does not agree with the available data node, write operations fail with ENOTCONN (since the brick that is correct is not available). Other file operations are permitted.
If the arbiter's metadata agrees with the available data node, all file operations are permitted.
Arbiter down, data bricks availableAll file operations are permitted. The arbiter's records are healed when it becomes available.
Only one brick available
All file operations fail with ENOTCONN.

5.7.3. Creating an arbitrated replicated volume

The command for creating an arbitrated replicated volume has the following syntax:
# gluster volume create VOLNAME replica 3 arbiter 1 HOST1:DATA_BRICK1 HOST2:DATA_BRICK2 HOST3:ARBITER_BRICK3
This creates a volume with one arbiter for every three replicate bricks. The arbiter is the last brick in every set of three bricks.

Note

The syntax of this command is misleading. There are a total of 3 bricks in this set. This command creates a volume with two bricks that replicate all data and one arbiter brick that replicates only metadata.
In the following example, the bricks on server3 and server6 are the arbiter bricks. Note that because multiple sets of three bricks are provided, this creates a distributed replicated volume with arbiter bricks.
# gluster volume create testvol replica 3 arbiter 1 \
server1:/bricks/brick server2:/bricks/brick server3:/bricks/arbiter_brick \
server4:/bricks/brick server5:/bricks/brick server6:/bricks/arbiter_brick
# gluster volume info testvol
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: ed9fa4d5-37f1-49bb-83c3-925e90fab1bc
Status: Created
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: server1:/bricks/brick
Brick2: server2:/bricks/brick
Brick3: server3:/bricks/arbiter_brick (arbiter)
Brick1: server4:/bricks/brick
Brick2: server5:/bricks/brick
Brick3: server6:/bricks/arbiter_brick (arbiter)
Options Reconfigured:
cluster.granular-entry-heal: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

5.7.4. Creating multiple arbitrated replicated volumes across fewer total nodes

If you are configuring more than one arbitrated-replicated volume, or a single volume with multiple replica sets, you can use fewer nodes in total by using either of the following techniques:
  • Chain multiple arbitrated replicated volumes together, by placing the arbiter brick for one volume on the same node as a data brick for another volume. Chaining is useful for write-heavy workloads when file size is closer to metadata file size (that is, from 32–128 KiB). This avoids all metadata I/O going through a single disk.
    In arbitrated distributed-replicated volumes, you can also place an arbiter brick on the same node as another replica sub-volume's data brick, since these do not share the same data.
  • Place the arbiter bricks from multiple volumes on a single dedicated node. A dedicated arbiter node is suited to write-heavy workloads with larger files, and read-heavy workloads.

Example 5.6. Example of a dedicated configuration

The following commands create two arbitrated replicated volumes, firstvol and secondvol. Server3 contains the arbiter bricks of both volumes.
# gluster volume create firstvol replica 3 arbiter 1 server1:/bricks/brick server2:/bricks/brick server3:/bricks/arbiter_brick
# gluster volume create secondvol replica 3 arbiter 1 server4:/bricks/data_brick server5:/bricks/brick server3:/bricks/brick
Dedicated Arbiter Node Configuration
Two gluster volumes configured across five servers to create two three-way arbitrated replicated volumes, with the arbiter bricks on a dedicated arbiter node.

Example 5.7. Example of a chained configuration

The following command configures an arbitrated replicated volume with six sub-volumes chained across six servers in a 6 x (2 + 1) configuration.
# gluster volume create arbrepvol replica 3 arbiter 1 server1:/bricks/brick1 server2:/bricks/brick1 server3:/bricks/arbiter_brick1 server2:/bricks/brick2 server3:/bricks/brick2 server4:/bricks/arbiter_brick2 server3:/bricks/brick3 server4:/bricks/brick3 server5:/bricks/arbiter_brick3 server4:/bricks/brick4 server5:/bricks/brick4 server6:/bricks/arbiter_brick4 server5:/bricks/brick5 server6:/bricks/brick5 server1:/bricks/arbiter_brick5 server6:/bricks/brick6 server1:/bricks/brick6 server2:/bricks/arbiter_brick6
6 x (2 + 1) Arbitrated Distributed-Replicated Configuration
Six replicated gluster sub-volumes chained across six servers to create a 6 * (2 + 1) arbitrated distributed-replicated configuration.

5.7.5. Converting to an arbitrated volume

You can convert a replicated volume into an arbitrated volume by adding new arbiter bricks for each replicated sub-volume, or replacing replica bricks with arbiter bricks.

Procedure 5.1. Converting a replica 2 volume to an arbitrated volume

Warning

Do not perform this process if geo-replication is configured. There is a race condition tracked by Bug 1683893 that means data can be lost when converting a volume if geo-replication is enabled.
  1. Verify that healing is not in progress

    # gluster volume heal VOLNAME info
    Wait until pending heal entries is 0 before proceeding.
  2. Disable and stop self-healing

    Run the following commands to disable data, metadata, and entry self-heal, and the self-heal daemon.
    # gluster volume set VOLNAME cluster.data-self-heal off
    # gluster volume set VOLNAME cluster.metadata-self-heal off
    # gluster volume set VOLNAME cluster.entry-self-heal off
    # gluster volume set VOLNAME self-heal-daemon off
  3. Add arbiter bricks to the volume

    Convert the volume by adding an arbiter brick for each replicated sub-volume.
    # gluster volume add-brick VOLNAME replica 3 arbiter 1 HOST:arbiter-brick-path
    For example, if you have an existing two-way replicated volume called testvol, and a new brick for the arbiter to use, you can add a brick as an arbiter with the following command:
    # gluster volume add-brick testvol replica 3 arbiter 1 server:/bricks/arbiter_brick
    If you have an existing two-way distributed-replicated volume, you need a new brick for each sub-volume in order to convert it to an arbitrated distributed-replicated volume, for example:
    # gluster volume add-brick testvol replica 3 arbiter 1 server1:/bricks/arbiter_brick1 server2:/bricks/arbiter_brick2
  4. Wait for client volfiles to update

    This takes about 5 minutes.
  5. Verify that bricks added successfully

    # gluster volume info VOLNAME
    # gluster volume status VOLNAME
  6. Re-enable self-healing

    Run the following commands to re-enable self-healing on the servers.
    # gluster volume set VOLNAME cluster.data-self-heal on
    # gluster volume set VOLNAME cluster.metadata-self-heal on
    # gluster volume set VOLNAME cluster.entry-self-heal on
    # gluster volume set VOLNAME self-heal-daemon on
  7. Verify all entries are healed

    # gluster volume heal VOLNAME info
    Wait until pending heal entries is 0 to ensure that all heals completed successfully.

Procedure 5.2. Converting a replica 3 volume to an arbitrated volume

Warning

Do not perform this process if geo-replication is configured. There is a race condition tracked by Bug 1683893 that means data can be lost when converting a volume if geo-replication is enabled.
  1. Verify that healing is not in progress

    # gluster volume heal VOLNAME info
    Wait until pending heal entries is 0 before proceeding.
  2. Reduce the replica count of the volume to 2

    Remove one brick from every sub-volume in the volume so that the replica count is reduced to 2. For example, in a replica 3 volume that distributes data across 2 sub-volumes, run the following command:
    # gluster volume remove-brick VOLNAME replica 2 HOST:subvol1-brick-path HOST:subvol2-brick-path force

    Note

    In a distributed replicated volume, data is distributed across sub-volumes, and replicated across bricks in a sub-volume. This means that to reduce the replica count of a volume, you need to remove a brick from every sub-volume.
    Bricks are grouped by sub-volume in the gluster volume info output. If the replica count is 3, the first 3 bricks form the first sub-volume, the next 3 bricks form the second sub-volume, and so on.
    # gluster volume info VOLNAME
    [...]
    Number of Bricks: 2 x 3 = 6
    Transport-type: tcp
    Bricks:
    Brick1: node1:/test1/brick
    Brick2: node2:/test2/brick
    Brick3: node3:/test3/brick
    Brick4: node1:/test4/brick
    Brick5: node2:/test5/brick
    Brick6: node3:/test6/brick
    [...]
    In this volume, data is distributed across two sub-volumes, which each consist of three bricks. The first sub-volume consists of bricks 1, 2, and 3. The second sub-volume consists of bricks 4, 5, and 6. Removing any one brick from each subvolume using the following command reduces the replica count to 2 as required.
    # gluster volume remove-brick VOLNAME replica 2 HOST:subvol1-brick-path HOST:subvol2-brick-path force
  3. Disable and stop self-healing

    Run the following commands to disable data, metadata, and entry self-heal, and the self-heal daemon.
    # gluster volume set VOLNAME cluster.data-self-heal off
    # gluster volume set VOLNAME cluster.metadata-self-heal off
    # gluster volume set VOLNAME cluster.entry-self-heal off
    # gluster volume set VOLNAME self-heal-daemon off
  4. Add arbiter bricks to the volume

    Convert the volume by adding an arbiter brick for each replicated sub-volume.
    # gluster volume add-brick VOLNAME replica 3 arbiter 1 HOST:arbiter-brick-path
    For example, if you have an existing replicated volume:
    # gluster volume add-brick testvol replica 3 arbiter 1 server:/bricks/brick
    If you have an existing distributed-replicated volume:
    # gluster volume add-brick testvol replica 3 arbiter 1 server1:/bricks/arbiter_brick1 server2:/bricks/arbiter_brick2
  5. Wait for client volfiles to update

    This takes about 5 minutes. Verify that this is complete by running the following command on each client.
    # grep -ir connected mount-path/.meta/graphs/active/volname-client-*/private
    The number of times connected=1 appears in the output is the number of bricks connected to the client.
  6. Verify that bricks added successfully

    # gluster volume info VOLNAME
    # gluster volume status VOLNAME
  7. Re-enable self-healing

    Run the following commands to re-enable self-healing on the servers.
    # gluster volume set VOLNAME cluster.data-self-heal on
    # gluster volume set VOLNAME cluster.metadata-self-heal on
    # gluster volume set VOLNAME cluster.entry-self-heal on
    # gluster volume set VOLNAME self-heal-daemon on
  8. Verify all entries are healed

    # gluster volume heal VOLNAME info
    Wait until pending heal entries is 0 to ensure that all heals completed successfully.

5.7.6. Converting an arbitrated volume to a three-way replicated volume

You can convert an arbitrated volume into a three-way replicated volume or a three-way distributed replicated volume by replacing the arbiter bricks with full bricks for each replicated sub-volume.

Warning

Do not perform this process if geo-replication is configured. There is a race condition tracked by Bug 1683893 that means data can be lost when converting a volume if geo-replication is enabled.

Procedure 5.3. Converting an arbitrated volume to a replica 3 volume

  1. Verify that healing is not in progress

    # gluster volume heal VOLNAME info
    Wait until pending heal entries is 0 before proceeding.
  2. Remove arbiter bricks from the volume

    Check which bricks are listed as (arbiter), and then remove those bricks from the volume.
    # gluster volume info VOLNAME
    # gluster volume remove-brick VOLNAME replica 2 HOST:arbiter-brick-path force
  3. Disable and stop self-healing

    Run the following commands to disable data, metadata, and entry self-heal, and the self-heal daemon.
    # gluster volume set VOLNAME cluster.data-self-heal off
    # gluster volume set VOLNAME cluster.metadata-self-heal off
    # gluster volume set VOLNAME cluster.entry-self-heal off
    # gluster volume set VOLNAME self-heal-daemon off
  4. Add full bricks to the volume

    Convert the volume by adding a brick for each replicated sub-volume.
    # gluster volume add-brick VOLNAME replica 3 HOST:brick-path
    For example, if you have an existing arbitrated replicated volume:
    # gluster volume add-brick testvol replica 3 server:/bricks/brick
    If you have an existing arbitrated distributed-replicated volume:
    # gluster volume add-brick testvol replica 3 server1:/bricks/brick1 server2:/bricks/brick2
  5. Wait for client volfiles to update

    This takes about 5 minutes.
  6. Verify that bricks added successfully

    # gluster volume info VOLNAME
    # gluster volume status VOLNAME
  7. Re-enable self-healing

    Run the following commands to re-enable self-healing on the servers.
    # gluster volume set VOLNAME cluster.data-self-heal on
    # gluster volume set VOLNAME cluster.metadata-self-heal on
    # gluster volume set VOLNAME cluster.entry-self-heal on
    # gluster volume set VOLNAME self-heal-daemon on
  8. Verify all entries are healed

    # gluster volume heal VOLNAME info
    Wait until pending heal entries is 0 to ensure that all heals completed successfully.

5.7.7. Tuning recommendations for arbitrated volumes

Red Hat recommends the following when arbitrated volumes are in use:
  • For dedicated arbiter nodes, use JBOD for arbiter bricks, and RAID6 for data bricks.
  • For chained arbiter volumes, use the same RAID6 drive for both data and arbiter bricks.
See Chapter 19, Tuning for Performance for more information on enhancing performance that is not specific to the use of arbiter volumes.