5.8. Creating Arbitrated Replicated Volumes

An arbitrated replicated volume is similar to a two-way replicated volume, in that it contains two full copies of the files in the volume. Arbitrated volumes have an extra arbiter brick for every two data bricks in the volume. Arbiter bricks do not store file data; they only store file names, structure, and metadata. Arbiter bricks use client quorum to compare metadata on the arbiter with the metadata of the other nodes to ensure consistency in the volume and prevent split-brain conditions.

Advantages of arbitrated replicated volumes

Better consistency
When an arbiter is configured, arbitration logic uses client-side quorum in auto mode to prevent file operations that would lead to split-brain conditions.
Less disk space required
Because an arbiter brick only stores file names and metadata, an arbiter brick can be much smaller than the other bricks in the volume.
Fewer nodes required
The node that contains the arbiter brick of one volume can be configured with the data brick of another volume. This "chaining" configuration allows you to use fewer nodes to fulfill your overall storage requirements.
Easy migration from deprecated two-way replicated volumes
Red Hat Gluster Storage can convert a two-way replicated volume without arbiter bricks into an arbitrated replicated volume. See Section 5.8.5, “Converting to an arbitrated volume” for details.

Limitations of arbitrated replicated volumes

  • Arbitrated replicated volumes provide better data consistency than a two-way replicated volume that does not have arbiter bricks. However, because arbiter bricks store only metadata, an arbitrated replicated volume provides the same level of availability as a two-way replicated volume that does not have arbiter bricks. To achieve high availability, you need to use a three-way replicated volume instead of an arbitrated replicated volume.
  • Tiering is not compatible with arbitrated replicated volumes.
  • Arbitrated volumes can only be configured in sets of three bricks at a time. Red Hat Gluster Storage can convert an existing two-way replicated volume without arbiter bricks into an arbitrated replicated volume by adding an arbiter brick to that volume. See Section 5.8.5, “Converting to an arbitrated volume” for details.

5.8.1. Arbitrated volume requirements

This section outlines the requirements of a supported arbitrated volume deployment.

5.8.1.1. System requirements for nodes hosting arbiter bricks

The minimum system requirements for a node that contains an arbiter brick differ depending on the configuration choices made by the administrator. See Section 5.8.4, “Creating multiple arbitrated replicated volumes across fewer total nodes” for details about the differences between the dedicated arbiter and chained arbiter configurations.

Table 5.1. Requirements for arbitrated configurations on physical machines

Configuration type | Min CPU | Min RAM | NIC | Arbiter Brick Size | Max Latency
Dedicated arbiter | 64-bit quad-core processor with 2 sockets | 8 GB [a] | Match to other nodes in the storage pool | 1 TB to 4 TB [b] | 5 ms
Chained arbiter | Match to other nodes in the storage pool | Match to other nodes in the storage pool | Match to other nodes in the storage pool | 1 TB to 4 TB [c] | 5 ms
[a] More RAM may be necessary depending on the combined capacity of the number of arbiter bricks on the node.
[b] Arbiter and data bricks can be configured on the same device provided that the data and arbiter bricks belong to different replica sets. See Section 5.8.1.2, “Arbiter capacity requirements” for further details on sizing arbiter bricks.
[c] Multiple bricks can be created on a single RAIDed physical device. See Section 21.2, “Brick Configuration” for details.
The requirements for arbitrated configurations on virtual machines are:
  • minimum 4 vCPUs
  • minimum 16 GB RAM
  • 1 TB to 4 TB of virtual disk space
  • maximum 5 ms latency

5.8.1.2. Arbiter capacity requirements

Because an arbiter brick only stores file names and metadata, an arbiter brick can be much smaller than the other bricks in the volume or replica set. The required size for an arbiter brick depends on the number of files being stored on the volume.
The recommended minimum arbiter brick size can be calculated with the following formula:
minimum arbiter brick size = 4 KB * ( size in KB of largest data brick in volume or replica set / average file size in KB)
For example, if you have two 1 TB data bricks, and the average size of the files is 2 GB, then the recommended minimum size for your arbiter brick is 2 MB, as shown in the following example:
minimum arbiter brick size  = 4 KB * ( 1 TB / 2 GB )
                            = 4 KB * ( 1000000000 KB / 2000000 KB )
                            = 4 KB * 500
                            = 2000 KB
                            = 2 MB
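The same calculation can be sketched as a small shell function. This is a minimal sketch rather than a Red Hat tool: the name min_arbiter_kb is hypothetical, sizes are passed in KB using the decimal units of the example, and rounding the file count up is an added assumption so that the estimate does not undershoot.

```shell
#!/bin/sh
# Recommended minimum arbiter brick size, in KB, from the sizing formula:
#   4 KB * (largest data brick size in KB / average file size in KB)
# min_arbiter_kb is a hypothetical helper name, not a gluster command.
min_arbiter_kb() {
    brick_kb=$1      # size of the largest data brick, in KB
    avg_file_kb=$2   # average file size, in KB
    # Assumption: round the estimated file count up (ceiling division).
    files=$(( (brick_kb + avg_file_kb - 1) / avg_file_kb ))
    echo $(( 4 * files ))
}

# Two 1 TB data bricks with a 2 GB average file size, as in the example.
min_arbiter_kb 1000000000 2000000   # prints 2000
```

The result is in KB, so 2000 corresponds to the 2 MB figure in the worked example.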
If sharding is enabled, and your shard-block-size is smaller than the average file size in KB, then you need to use the following formula instead, because each shard also has a metadata file:
minimum arbiter brick size = 4 KB * ( size in KB of largest data brick in volume or replica set / shard block size in KB )
Alternatively, if you know how many files you will store in a volume, the recommended minimum arbiter brick size is the maximum number of files multiplied by 4 KB. For example, if you expect to have 200,000 files on your volume, your arbiter brick should be at least 800,000 KB, or 0.8 GB, in size.
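The sharded and known-file-count variants can be sketched as shell functions. This is an illustrative sketch under stated assumptions: the helper names arbiter_kb_sharded and arbiter_kb_for_files are hypothetical, all sizes are in KB, the file-count estimate is rounded up, and the 64000 KB shard size in the usage line is only an example value.

```shell
#!/bin/sh
# arbiter_kb_sharded and arbiter_kb_for_files are hypothetical helper names.

# With sharding enabled, divide by the shard block size whenever it is
# smaller than the average file size; otherwise use the average file size.
arbiter_kb_sharded() {
    brick_kb=$1; avg_file_kb=$2; shard_kb=$3
    divisor=$avg_file_kb
    if [ "$shard_kb" -lt "$avg_file_kb" ]; then
        divisor=$shard_kb
    fi
    # Assumption: round the estimated number of metadata files up.
    echo $(( 4 * ( (brick_kb + divisor - 1) / divisor ) ))
}

# When the maximum number of files is known: 4 KB per file.
arbiter_kb_for_files() {
    echo $(( 4 * $1 ))
}

arbiter_kb_for_files 200000                   # prints 800000
arbiter_kb_sharded 1000000000 2000000 64000   # prints 62500
```

With sharding, the smaller divisor increases the estimated number of metadata entries, which is why the sharded result is larger than the unsharded one for the same brick.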
Red Hat also recommends overprovisioning where possible so that there is no short-term need to increase the size of the arbiter brick.

5.8.2. Arbitration logic

In an arbitrated volume, whether a file operation is permitted depends on the current state of the bricks in the volume. The following table describes arbitration behavior in all possible volume states.

Table 5.2. Allowed operations for current volume state

Volume state | Arbitration behavior
All bricks available | All file operations permitted.
Arbiter and 1 data brick available | If the arbiter does not agree with the available data node, write operations fail with ENOTCONN (since the brick that is correct is not available). Other file operations are permitted. If the arbiter's metadata agrees with the available data node, all file operations are permitted.
Arbiter down, data bricks available | All file operations are permitted. The arbiter's records are healed when it becomes available.
Only one brick available | If the available brick is a data brick, client quorum is not met, and the volume enters an EROFS state. If the available brick is the arbiter, all file operations fail with ENOTCONN.
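The table can be restated as a lookup function, which may help when reasoning about a particular failure scenario. This is only a sketch of the table for a single replica set of two data bricks and one arbiter: arbitration_outcome is a hypothetical helper, and it does not reflect how gluster implements quorum internally.

```shell
#!/bin/sh
# arbitration_outcome is a hypothetical helper that restates Table 5.2.
# Arguments:
#   $1  number of data bricks available (0, 1, or 2)
#   $2  arbiter available (yes/no)
#   $3  arbiter metadata agrees with the available data brick (yes/no);
#       only consulted when exactly one data brick and the arbiter are up
arbitration_outcome() {
    data_up=$1; arbiter_up=$2; agrees=$3
    case "$data_up:$arbiter_up" in
        2:yes) echo "all operations permitted" ;;
        2:no)  echo "all operations permitted; arbiter healed when it returns" ;;
        1:yes)
            if [ "$agrees" = yes ]; then
                echo "all operations permitted"
            else
                echo "writes fail with ENOTCONN; other operations permitted"
            fi ;;
        1:no)  echo "client quorum not met; volume enters EROFS state" ;;
        0:yes) echo "all operations fail with ENOTCONN" ;;
        # Anything else (for example, no bricks at all) is outside the table.
        *)     echo "state not covered by the table" ;;
    esac
}

arbitration_outcome 1 yes no
```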

5.8.3. Creating an arbitrated replicated volume

The command for creating an arbitrated replicated volume has the following syntax:
# gluster volume create VOLNAME replica 3 arbiter 1 HOST1:DATA_BRICK1 HOST2:DATA_BRICK2 HOST3:ARBITER_BRICK3
This creates a volume with one arbiter for every three replicate bricks. The arbiter is the last brick in every set of three bricks.

Note

The syntax of this command is misleading. There are a total of 3 bricks in this set. This command creates a volume with two bricks that replicate all data and one arbiter brick that replicates only metadata.
In the following example, the bricks on server3 and server6 are the arbiter bricks. Note that because multiple sets of three bricks are provided, this creates a distributed replicated volume with arbiter bricks.
# gluster volume create testvol replica 3 arbiter 1 \
server1:/bricks/brick server2:/bricks/brick server3:/bricks/arbiter_brick \
server4:/bricks/brick server5:/bricks/brick server6:/bricks/arbiter_brick
# gluster volume info testvol
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: ed9fa4d5-37f1-49bb-83c3-925e90fab1bc
Status: Created
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: server1:/bricks/brick
Brick2: server2:/bricks/brick
Brick3: server3:/bricks/arbiter_brick (arbiter)
Brick4: server4:/bricks/brick
Brick5: server5:/bricks/brick
Brick6: server6:/bricks/arbiter_brick (arbiter)
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
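Because the arbiter role is determined purely by position in the brick list, it can be worth checking the order before running the create command. The following sketch labels each brick by replica set and role; brick_roles is a hypothetical helper, not part of the gluster CLI, and it assumes the bricks are given in the same order as on the replica 3 arbiter 1 command line.

```shell
#!/bin/sh
# brick_roles is a hypothetical helper: given bricks in the order passed to
# "gluster volume create ... replica 3 arbiter 1", print the replica set
# number and role of each brick. Every third brick is the arbiter.
brick_roles() {
    i=0
    for brick in "$@"; do
        i=$(( i + 1 ))
        if [ $(( i % 3 )) -eq 0 ]; then
            role=arbiter
        else
            role=data
        fi
        # (i + 2) / 3 maps positions 1-3 to set 1, 4-6 to set 2, and so on.
        echo "set $(( (i + 2) / 3 )) $role $brick"
    done
}

brick_roles \
    server1:/bricks/brick server2:/bricks/brick server3:/bricks/arbiter_brick \
    server4:/bricks/brick server5:/bricks/brick server6:/bricks/arbiter_brick
```

For the example volume, the third and sixth lines of output are the ones marked arbiter, matching the bricks on server3 and server6.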

5.8.4. Creating multiple arbitrated replicated volumes across fewer total nodes

If you are configuring more than one arbitrated-replicated volume, or a single volume with multiple replica sets, you can use fewer nodes in total by using either of the following techniques:
  • Chain multiple arbitrated replicated volumes together, by placing the arbiter brick for one volume on the same node as a data brick for another volume. Chaining is useful for write-heavy workloads when file size is closer to metadata file size (that is, from 32–128 KiB). This avoids all metadata I/O going through a single disk.
    In arbitrated distributed-replicated volumes, you can also place an arbiter brick on the same node as another replica sub-volume's data brick, since these do not share the same data.
  • Place the arbiter bricks from multiple volumes on a single dedicated node. A dedicated arbiter node is suited to write-heavy workloads with larger files, and read-heavy workloads.

Example 5.9. Example of a dedicated configuration

The following commands create two arbitrated replicated volumes, firstvol and secondvol. Server3 contains the arbiter bricks of both volumes.
# gluster volume create firstvol replica 3 arbiter 1 server1:/bricks/brick server2:/bricks/brick server3:/bricks/arbiter_brick
# gluster volume create secondvol replica 3 arbiter 1 server4:/bricks/data_brick server5:/bricks/brick server3:/bricks/brick
Dedicated Arbiter Node Configuration
Two gluster volumes configured across five servers to create two three-way arbitrated replicated volumes, with the arbiter bricks on a dedicated arbiter node.

Example 5.10. Example of a chained configuration

The following command configures an arbitrated replicated volume with six sub-volumes chained across six servers in a 6 x (2 + 1) configuration.
# gluster volume create arbrepvol replica 3 arbiter 1 \
server1:/bricks/brick1 server2:/bricks/brick1 server3:/bricks/arbiter_brick1 \
server2:/bricks/brick2 server3:/bricks/brick2 server4:/bricks/arbiter_brick2 \
server3:/bricks/brick3 server4:/bricks/brick3 server5:/bricks/arbiter_brick3 \
server4:/bricks/brick4 server5:/bricks/brick4 server6:/bricks/arbiter_brick4 \
server5:/bricks/brick5 server6:/bricks/brick5 server1:/bricks/arbiter_brick5 \
server6:/bricks/brick6 server1:/bricks/brick6 server2:/bricks/arbiter_brick6
6 x (2 + 1) Arbitrated Distributed-Replicated Configuration
Six replicated gluster sub-volumes chained across six servers to create a 6 x (2 + 1) arbitrated distributed-replicated configuration.
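The chained layout follows a regular pattern: replica set i places its data bricks on servers i and i+1 and its arbiter on server i+2, wrapping around at the last server. A minimal sketch that generates the brick list for that pattern is shown below. The helper name chained_bricks is hypothetical, and the server1..serverN host names and /bricks paths simply follow the example; at least three servers are assumed so that each set spans three distinct nodes.

```shell
#!/bin/sh
# chained_bricks is a hypothetical helper that prints one replica set per
# line for a chained layout across N servers named server1..serverN.
# Assumes N >= 3 so that each set lands on three different servers.
chained_bricks() {
    n=$1
    i=1
    while [ "$i" -le "$n" ]; do
        second=$(( i % n + 1 ))        # next server, wrapping around
        third=$(( (i + 1) % n + 1 ))   # the server after that holds the arbiter
        printf 'server%d:/bricks/brick%d server%d:/bricks/brick%d server%d:/bricks/arbiter_brick%d\n' \
            "$i" "$i" "$second" "$i" "$third" "$i"
        i=$(( i + 1 ))
    done
}

chained_bricks 6
```

With an argument of 6, the output reproduces the brick order used in the example command, so it could be reviewed and then passed to gluster volume create as the brick list.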

5.8.5. Converting to an arbitrated volume

Red Hat Gluster Storage lets you convert some existing volumes into arbitrated volumes by adding arbiter bricks.
  • A two-way replicated volume without arbiter bricks can be converted into an arbitrated replicated volume.
  • A two-way distributed-replicated volume without arbiter bricks can be converted into an arbitrated distributed-replicated volume.
You can convert your existing volumes into arbitrated volumes by using the add-brick command.
Red Hat recommends turning off client-side self-heal and the Self-heal-daemon on the volume before executing the add-brick command, and turning them back on once the conversion is done.
The following commands turn off client-side self-heal on the volume.
To turn off data self-heal, use the following command:
# gluster volume set VOLNAME cluster.data-self-heal off
To turn off metadata self-heal use the following command:
# gluster volume set VOLNAME cluster.metadata-self-heal off
To turn off entry self-heal, use the following command:
# gluster volume set VOLNAME cluster.entry-self-heal off
To stop the Self-heal-daemon, use the following command:
# gluster volume set VOLNAME self-heal-daemon off
Execute the following command to convert the existing volumes:
# gluster volume add-brick VOLNAME replica 3 arbiter 1 HOST:arbiter-brick-path
For example, if you have an existing two-way replicated volume called testvol, and a new brick for the arbiter to use, you can add a brick as an arbiter with the following command:
# gluster volume add-brick testvol replica 3 arbiter 1 server:/bricks/arbiter_brick
If you have an existing two-way distributed-replicated volume, you need a new brick for each sub-volume in order to convert it to an arbitrated distributed-replicated volume, for example:
# gluster volume add-brick testvol replica 3 arbiter 1 server1:/bricks/arbiter_brick1 server2:/bricks/arbiter_brick2

Note

Wait for five minutes after the bricks are added and then turn on the self-heal and the Self-heal-daemon.
You can turn on the self-heal and the Self-heal-daemon using the following commands:
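# gluster volume set VOLNAME cluster.data-self-heal on
# gluster volume set VOLNAME cluster.metadata-self-heal on
# gluster volume set VOLNAME cluster.entry-self-heal on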
# gluster volume set VOLNAME self-heal-daemon on
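The whole conversion sequence can be collected into one place for review. The sketch below only prints the commands described in this section and does not execute anything; arbiter_conversion_plan is a hypothetical helper name, and the sleep 300 line stands in for the five-minute wait.

```shell
#!/bin/sh
# arbiter_conversion_plan is a hypothetical helper that PRINTS the command
# sequence for converting a volume; it does not run gluster itself.
# Usage: arbiter_conversion_plan VOLNAME HOST:arbiter-brick-path...
arbiter_conversion_plan() {
    vol=$1; shift
    opts="cluster.data-self-heal cluster.metadata-self-heal cluster.entry-self-heal self-heal-daemon"
    # 1. Turn off client-side self-heal and the self-heal daemon.
    for opt in $opts; do
        echo "gluster volume set $vol $opt off"
    done
    # 2. Add one arbiter brick per replica sub-volume.
    echo "gluster volume add-brick $vol replica 3 arbiter 1 $*"
    # 3. Wait five minutes, then turn everything back on.
    echo "sleep 300"
    for opt in $opts; do
        echo "gluster volume set $vol $opt on"
    done
}

arbiter_conversion_plan testvol server:/bricks/arbiter_brick
```

Printing rather than executing keeps the sketch safe to try on any machine; the output can be inspected and then run as root on a storage node.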

5.8.6. Tuning recommendations for arbitrated volumes

Red Hat recommends the following when arbitrated volumes are in use:
  • For dedicated arbiter nodes, use JBOD for arbiter bricks, and RAID6 for data bricks.
  • For chained arbiter volumes, use the same RAID6 drive for both data and arbiter bricks.
See Chapter 21, Tuning for Performance for more information on enhancing performance that is not specific to the use of arbiter volumes.