Chapter 6. Using Galera

In a high-availability deployment, Red Hat OpenStack Platform uses the MariaDB Galera Cluster to manage database replication. As described in Section 4.3, “OpenStack Services Configured in Pacemaker”, Pacemaker runs the Galera service using a bundle set resource that manages the master/slave status.

You can use the pcs status command to check whether the galera-bundle service is running, and on which Controller nodes it runs:

Podman container set: galera-bundle [192.168.24.1:8787/rhosp15/openstack-mariadb:pcmklatest]
  galera-bundle-0      (ocf::heartbeat:galera):        Master overcloud-controller-0
  galera-bundle-1      (ocf::heartbeat:galera):        Master overcloud-controller-1
  galera-bundle-2      (ocf::heartbeat:galera):        Master overcloud-controller-2
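For example, you can run pcs status on any Controller node and filter the output to show only the Galera bundle. This is only a convenience sketch; the -A3 value matches the three-node bundle shown above:

$ sudo pcs status | grep -A3 "Podman container set: galera-bundle"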

6.1. Hostname Resolution

When troubleshooting the MariaDB Galera Cluster, first verify hostname resolution. By default, the director binds the Galera resource to a hostname rather than to an IP address [1]. Therefore, any problem that prevents hostname resolution, such as misconfigured or failed DNS, might prevent Pacemaker from properly managing the Galera resource.
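As a quick check, you can confirm that each Controller hostname resolves from every node. The following sketch assumes the Controller hostnames shown in the pcs status output above; the exact names that the Galera resource binds to depend on your deployment:

$ for host in overcloud-controller-0 overcloud-controller-1 overcloud-controller-2; do
      getent hosts "$host" || echo "Cannot resolve $host"
  done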

After you eliminate any hostname resolution problems, you can check the integrity of the cluster itself. To do so, check the write-set replication status on the database of each controller node.

To access MySQL, use the password that was set up by the director during the overcloud deployment. To retrieve the MySQL root password, use the hiera command:

$ sudo hiera -c /etc/puppet/hiera.yaml "mysql::server::root_password"
*<MYSQL-HIERA-PASSWORD>*
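To avoid pasting the password into every subsequent command, you can store it in a shell variable. This is only a convenience sketch; the PASSWORD variable name is arbitrary, and the SELECT 1 query simply confirms that the credentials work:

$ PASSWORD=$(sudo hiera -c /etc/puppet/hiera.yaml "mysql::server::root_password")
$ sudo mysql -B --password="$PASSWORD" -e "SELECT 1;"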

Write-set replication information is stored in each node’s MariaDB database. Each relevant variable uses the prefix wsrep_. Therefore, you can query this information directly through the database client:

$ sudo mysql -B --password="<MYSQL-HIERA-PASSWORD>" -e "SHOW GLOBAL STATUS LIKE 'wsrep_%';"
    +------------------------+-------+
    | Variable_name          | Value |
    +------------------------+-------+
    | wsrep_protocol_version | 5     |
    | wsrep_last_committed   | 202   |
    | ...                    | ...   |
    | wsrep_thread_count     | 2     |
    +------------------------+-------+

To verify the health and integrity of the MariaDB Galera Cluster, first check whether the cluster reports the correct number of nodes. Then, check that each node (see the sketch after this list for one way to query every Controller node at once):

  • is part of the correct cluster
  • can write to the cluster
  • can receive queries and writes from the cluster
  • is connected to others within the cluster
  • is replicating write-sets to tables in the local database
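The following loop is one way to gather the write-set replication status from every Controller node in a single pass. It is a minimal sketch that assumes SSH access as the heat-admin user with passwordless sudo, as is typical on director-deployed nodes; adjust the user and hostnames for your environment:

$ for node in overcloud-controller-0 overcloud-controller-1 overcloud-controller-2; do
      echo "=== $node ==="
      ssh heat-admin@$node "sudo mysql -B --password='<MYSQL-HIERA-PASSWORD>' -e \"SHOW GLOBAL STATUS LIKE 'wsrep_cluster_%';\""
  done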

6.2. Database Cluster Integrity

When investigating problems with the MariaDB Galera Cluster, you can check the integrity of the cluster itself. Verifying cluster integrity involves checking specific wsrep_ database variables on each Controller node. To check a database variable, run:

$ sudo mysql -B --password="<MYSQL-HIERA-PASSWORD>" -e "SHOW GLOBAL STATUS LIKE 'VARIABLE';"

Replace VARIABLE with the wsrep_ database variable you want to check. For example, to view the node’s cluster state UUID:

$ sudo mysql -B --password="<MYSQL-HIERA-PASSWORD>" -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_state_uuid';"
    +--------------------------+--------------------------------------+
    | Variable_name            | Value                                |
    +--------------------------+--------------------------------------+
    | wsrep_cluster_state_uuid | e2c9a15e-5485-11e0-0800-6bbb637e7211 |
    +--------------------------+--------------------------------------+

The following table lists the different wsrep_ database variables that relate to cluster integrity.

Table 6.1. Database variables to check for cluster integrity

wsrep_cluster_state_uuid (Cluster state UUID)

The ID of the cluster to which the node belongs. All nodes must have an identical ID. A node with a different ID is not connected to the cluster.

wsrep_cluster_size (Number of nodes in the cluster)

You can check this value on any single node. If the value is less than the actual number of nodes, some nodes have failed or lost connectivity.

wsrep_cluster_conf_id (Total number of cluster configuration changes)

Indicates whether the cluster has been split into several components, also known as partitions. Partitioning is usually caused by a network failure. All nodes must have an identical value.

If some nodes report a different wsrep_cluster_conf_id, check their wsrep_cluster_status value to see whether they can still write to the cluster (Primary).

wsrep_cluster_status (Primary component status)

Indicates whether the node can still write to the cluster. If so, wsrep_cluster_status should be Primary. Any other value indicates that the node is part of a non-operational partition.
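To check all four cluster integrity variables with a single command instead of one query per variable, you can use a WHERE clause. This is a sketch of the same checks combined:

$ sudo mysql -B --password="<MYSQL-HIERA-PASSWORD>" -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_cluster_state_uuid','wsrep_cluster_size','wsrep_cluster_conf_id','wsrep_cluster_status');"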

6.3. Database Cluster Node

If you can isolate a Galera cluster problem to a specific node, other wsrep_ database variables can provide clues about the specific problem. You can check these variables in the same way as the cluster check described in Section 6.2, “Database Cluster Integrity”:

$ sudo mysql -B --password="<MYSQL-HIERA-PASSWORD>" -e "SHOW GLOBAL STATUS LIKE 'VARIABLE';"

Replace VARIABLE with any of the following values:

Table 6.2. Database variables to check for node integrity

wsrep_ready (Node ability to accept queries)

States whether the node can accept write-sets from the cluster. If so, wsrep_ready should be ON.

wsrep_connected (Node network connectivity)

States whether the node has network connectivity to other nodes. If so, wsrep_connected should be ON.

wsrep_local_state_comment (Node state)

Summarizes the node state. If the node can still write to the cluster (that is, if wsrep_cluster_status is Primary, see Section 6.2, “Database Cluster Integrity”), typical values for wsrep_local_state_comment are Joining, Waiting on SST, Joined, Synced, or Donor.

If the node is part of a non-operational component, wsrep_local_state_comment is set to Initialized.

Note

A wsrep_connected value of ON might also mean that the node is connected to only some of the nodes in the cluster. For example, in the case of a cluster partition, the node might be part of a component that cannot write to the cluster. See Section 6.2, “Database Cluster Integrity” for details.

If wsrep_connected is OFF, then the node is not connected to any cluster components.
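As with the cluster integrity check, you can query the node variables together. The following sketch combines the three node checks into one command:

$ sudo mysql -B --password="<MYSQL-HIERA-PASSWORD>" -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_ready','wsrep_connected','wsrep_local_state_comment');"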

6.4. Database Replication Performance

If the cluster and its individual nodes are all healthy and stable, you can check replication throughput to benchmark performance. To do so, check the wsrep_ database variables in the same way as in Section 6.3, “Database Cluster Node” and Section 6.2, “Database Cluster Integrity”:

$ sudo mysql -B --password="<MYSQL-HIERA-PASSWORD>" -e "SHOW STATUS LIKE 'VARIABLE';"

Replace VARIABLE with any of the following values:

Table 6.3. Database variables to check for cluster performance (replication throughput)

wsrep_local_recv_queue_avg

Average size of the local received queue since the last time the variable was queried.

wsrep_local_send_queue_avg

Average send queue length since the last time the variable was queried.

wsrep_local_recv_queue_min and wsrep_local_recv_queue_max

Minimum and maximum sizes of the local received queue since either variable was last queried.

wsrep_flow_control_paused

Fraction of time that the node was paused due to Flow Control since the last time the variable was queried.

wsrep_cert_deps_distance

Average distance between the lowest and highest sequence number (seqno) values that can be applied in parallel (that is, the potential degree of parallelization).

Each time any of these variables is queried, a FLUSH STATUS command resets its value. Benchmarking cluster replication involves querying these values multiple times to see variances. These variances can help you judge how much Flow Control is affecting the cluster’s performance.

Flow Control is a mechanism used by the cluster to manage replication. When the local received write-set queue exceeds a certain threshold, Flow Control pauses replication in order for the node to catch up. See Flow Control from the Galera Cluster site for more information.
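To sample these variables over a known interval, you can reset the counters, wait, and then read them back. This is a minimal sketch; the 60-second interval is arbitrary, and the variables shown are only a subset of those in Table 6.3:

$ sudo mysql --password="<MYSQL-HIERA-PASSWORD>" -e "FLUSH STATUS;"
$ sleep 60
$ sudo mysql -B --password="<MYSQL-HIERA-PASSWORD>" -e "SHOW STATUS WHERE Variable_name IN ('wsrep_local_recv_queue_avg','wsrep_local_send_queue_avg','wsrep_flow_control_paused');"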

The following values and benchmarks indicate where optimization might be needed:

wsrep_local_recv_queue_avg > 0.0
The node cannot apply write-sets as quickly as it receives them, which then triggers replication throttling. Check wsrep_local_recv_queue_min and wsrep_local_recv_queue_max for a detailed look at this benchmark.
wsrep_local_send_queue_avg > 0.0
As the value of wsrep_local_send_queue_avg rises, so does the likelihood of replication throttling and network throughput issues. This is especially true as wsrep_local_recv_queue_avg rises.
wsrep_flow_control_paused > 0.0

Flow Control paused the node. To determine how long the node was paused, multiply the wsrep_flow_control_paused value by the number of seconds between queries. For example, if wsrep_flow_control_paused = 0.50 a minute after you last checked it, then node replication was paused for 30 seconds. If wsrep_flow_control_paused = 1.0, then the node was paused for the entire time since the last query.

Ideally, wsrep_flow_control_paused should be as close to 0.0 as possible.

wsrep_cert_deps_distance
In case of throttling and pausing, you can check the wsrep_cert_deps_distance variable to see how many write-sets (on average) can be applied in parallel. Then, check wsrep_slave_threads to see how many write-sets can actually be applied simultaneously.
wsrep_slave_threads

Configuring a higher wsrep_slave_threads value can help mitigate throttling and pausing. For example, if the wsrep_cert_deps_distance value is 20, then doubling wsrep_slave_threads from 2 to 4 can also double the number of write-sets that the node can apply. However, set wsrep_slave_threads no higher than the number of CPU cores in the node.

If a problematic node already has an optimal wsrep_slave_threads setting, then consider excluding the node from the cluster as you investigate possible connectivity issues.
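To compare the current setting with the number of CPU cores on the node, you can query the variable and check the core count. This sketch only inspects the values; in a director-deployed overcloud, persistent changes to wsrep_slave_threads belong in the Galera configuration rather than in a live database session:

$ sudo mysql -B --password="<MYSQL-HIERA-PASSWORD>" -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_slave_threads';"
$ nproc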



[1] This method was implemented to allow Galera to start successfully in overclouds that use IPv6 (specifically, to address BZ#1298671).