Chapter 17. Configuring a multi-site, fault-tolerant messaging system

Large-scale enterprise messaging systems commonly have discrete broker clusters located in geographically distributed data centers. In the event of a data center outage, system administrators might need to preserve existing messaging data and ensure that client applications can continue to produce and consume messages. You can use specific broker topologies and the Red Hat Ceph Storage software-defined storage platform to ensure continuity of your messaging system during a data center outage. This type of solution is called a multi-site, fault-tolerant architecture.

The following sections explain how to protect your messaging system from data center outages. These sections provide information about:

Note

Multi-site fault tolerance is not a replacement for high-availability (HA) broker redundancy within data centers. Broker redundancy based on live-backup groups provides automatic protection against single broker failures within single clusters. By contrast, multi-site fault tolerance protects against large-scale data center outages.

Note

To use Red Hat Ceph Storage to ensure continuity of your messaging system, you must configure your brokers to use the shared store high-availability (HA) policy. You cannot configure your brokers to use the replication HA policy. For more information about these policies, see Implementing High Availability.

17.1. How Red Hat Ceph Storage clusters work

Red Hat Ceph Storage is a clustered object storage system. Red Hat Ceph Storage uses data sharding of objects and policy-based replication to guarantee data integrity and system availability.

Red Hat Ceph Storage uses an algorithm called CRUSH (Controlled Replication Under Scalable Hashing) to determine how to store and retrieve data by automatically computing data storage locations. You configure Ceph items called CRUSH maps, which detail cluster topography and specify how data is replicated across storage clusters.

CRUSH maps contain lists of Object Storage Devices (OSDs), a list of ‘buckets’ for aggregating the devices into a failure domain hierarchy, and rules that tell CRUSH how it should replicate data in a Ceph cluster’s pools.

By reflecting the underlying physical organization of the installation, CRUSH maps can model — and thereby address — potential sources of correlated device failures, such as physical proximity, shared power sources, and shared networks. By encoding this information into the cluster map, CRUSH can separate object replicas across different failure domains (for example, data centers) while still maintaining a pseudo-random distribution of data across the storage cluster. This helps to prevent data loss and enables the cluster to operate in a degraded state.

Red Hat Ceph Storage clusters require a number of nodes (physical or virtual) to operate. Clusters must include the following types of nodes:

Monitor nodes

Each Monitor (MON) node runs the monitor daemon (ceph-mon), which maintains a master copy of the cluster map. The cluster map includes the cluster topology. A client connecting to the Ceph cluster retrieves the current copy of the cluster map from the Monitor, which enables the client to read from and write data to the cluster.

Important

A Red Hat Ceph Storage cluster can run with one Monitor node; however, to ensure high availability in a production cluster, Red Hat supports only deployments with at least three Monitor nodes. A minimum of three Monitor nodes means that in the event of the failure or unavailability of one Monitor, a quorum exists for the remaining Monitor nodes in the cluster to elect a new leader.

Manager nodes

Each Manager (MGR) node runs the Ceph Manager daemon (ceph-mgr), which is responsible for keeping track of runtime metrics and the current state of the Ceph cluster, including storage utilization, current performance metrics, and system load. Usually, Manager nodes are colocated (that is, on the same host machine) with Monitor nodes.

Object Storage Device nodes

Each Object Storage Device (OSD) node runs the Ceph OSD daemon (ceph-osd), which interacts with logical disks attached to the node. Ceph stores data on OSD nodes. Ceph can run with very few OSD nodes (the default is three), but production clusters realize better performance at modest scales, for example, with 50 OSDs in a storage cluster. Having multiple OSDs in a storage cluster enables system administrators to define isolated failure domains within a CRUSH map.

Metadata Server nodes

Each Metadata Server (MDS) node runs the MDS daemon (ceph-mds), which manages metadata related to files stored on the Ceph File System (CephFS). The MDS daemon also coordinates access to the shared cluster.

Additional resources

For more information about Red Hat Ceph Storage, see What is Red Hat Ceph Storage?

17.2. Installing Red Hat Ceph Storage

AMQ Broker multi-site, fault-tolerant architectures use Red Hat Ceph Storage 3. By replicating data across data centers, a Red Hat Ceph Storage cluster effectively creates a shared store available to brokers in separate data centers. You configure your brokers to use the shared store high-availability (HA) policy and store messaging data in the Red Hat Ceph Storage cluster.

Red Hat Ceph Storage clusters intended for production use should have a minimum of:

  • Three Monitor (MON) nodes
  • Three Manager (MGR) nodes
  • Three Object Storage Device (OSD) nodes containing multiple OSD daemons
  • Three Metadata Server (MDS) nodes
Important

You can run the OSD, MON, MGR, and MDS nodes on either the same or separate physical or virtual machines. However, to ensure fault tolerance within your Red Hat Ceph Storage cluster, it is good practice to distribute each of these types of nodes across distinct data centers. In particular, you must ensure that in the event of a single data center outage, your storage cluster still has a minimum of two available MON nodes. Therefore, if you have three MON nodes in you cluster, each of these nodes must run on separate host machines in separate data centers. Do not run two MON nodes in a single data center, because failure of this data center will leave your storage cluster with only one remaining MON node. In this situation, the storage cluster can no longer operate.

The procedures linked-to from this section show you how to install a Red Hat Ceph Storage 3 cluster that includes MON, MGR, OSD, and MDS nodes.

Prerequisites

Procedure

17.3. Configuring a Red Hat Ceph Storage cluster

This example procedure shows how to configure your Red Hat Ceph storage cluster for fault tolerance. You create CRUSH buckets to aggregate your Object Storage Device (OSD) nodes into data centers that reflect your real-life, physical installation. In addition, you create a rule that tells CRUSH how to replicate data in your storage pools. These steps update the default CRUSH map that was created by your Ceph installation.

Prerequisites

  • You have already installed a Red Hat Ceph Storage cluster. For more information, see Installing Red Hat Ceph Storage.
  • You should understand how Red Hat Ceph Storage uses Placement Groups (PGs) to organize large numbers of data objects in a pool, and how to calculate the number of PGs to use in your pool. For more information, see Placement Groups (PGs).
  • You should understand how to set the number of object replicas in a pool. For more information, Set the Number of Object Replicas.

Procedure

  1. Create CRUSH buckets to organize your OSD nodes. Buckets are lists of OSDs, based on physical locations such as data centers. In Ceph, these physical locations are known as failure domains.

    ceph osd crush add-bucket dc1 datacenter
    ceph osd crush add-bucket dc2 datacenter
  2. Move the host machines for your OSD nodes to the data center CRUSH buckets that you created. Replace host names host1-host4 with the names of your host machines.

    ceph osd crush move host1 datacenter=dc1
    ceph osd crush move host2 datacenter=dc1
    ceph osd crush move host3 datacenter=dc2
    ceph osd crush move host4 datacenter=dc2
  3. Ensure that the CRUSH buckets you created are part of the default CRUSH tree.

    ceph osd crush move dc1 root=default
    ceph osd crush move dc2 root=default
  4. Create a rule to map storage object replicas across your data centers. This helps to prevent data loss and enables your cluster to stay running in the event of a single data center outage.

    The command to create a rule uses the following syntax: ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>. An example is shown below.

    ceph osd crush rule create-replicated multi-dc default datacenter hdd
    Note

    In the preceding command, if your storage cluster uses solid-state drives (SSD), specify ssd instead of hdd (hard disk drives).

  5. Configure your Ceph data and metadata pools to use the rule that you created. Initially, this might cause data to be backfilled to the storage destinations determined by the CRUSH algorithm.

    ceph osd pool set cephfs_data crush_rule multi-dc
    ceph osd pool set cephfs_metadata crush_rule multi-dc
  6. Specify the numbers of Placement Groups (PGs) and Placement Groups for Placement (PGPs) for your metadata and data pools. The PGP value should be equal to the PG value.

    ceph osd pool set cephfs_metadata pg_num 128
    ceph osd pool set cephfs_metadata pgp_num 128
    
    ceph osd pool set cephfs_data pg_num 128
    ceph osd pool set cephfs_data pgp_num 128
  7. Specify the numbers of replicas to be used by your data and metadata pools.

    ceph osd pool set cephfs_data min_size 1
    ceph osd pool set cephfs_metadata min_size 1
    
    ceph osd pool set cephfs_data size 2
    ceph osd pool set cephfs_metadata size 2

The following figure shows the Red Hat Ceph Storage cluster created by the preceding example procedure. The storage cluster has OSDs organized into CRUSH buckets corresponding to data centers.

broker disaster recovery 1

The following figure shows a possible layout of the first data center, including your broker servers. Specifically, the data center hosts:

  • The servers for two live-backup broker pairs
  • The OSD nodes that you assigned to the first data center in the preceding procedure
  • Single Metadata Server, Monitor and Manager nodes. The Monitor and Manager nodes are usually co-located on the same machine.
broker disaster recovery 2
Important

You can run the OSD, MON, MGR, and MDS nodes on either the same or separate physical or virtual machines. However, to ensure fault tolerance within your Red Hat Ceph Storage cluster, it is good practice to distribute each of these types of nodes across distinct data centers. In particular, you must ensure that in the event of a single data center outage, you storage cluster still has a minimum of two available MON nodes. Therefore, if you have three MON nodes in you cluster, each of these nodes must run on separate host machines in separate data centers.

The following figure shows a complete example topology. To ensure fault tolerance in your storage cluster, the MON, MGR, and MDS nodes are distributed across three separate data centers.

broker disaster recovery 10
Note

Locating the host machines for certain OSD nodes in the same data center as your broker servers does not mean that you store messaging data on those specific OSD nodes. You configure the brokers to store messaging data in a specified directory in the Ceph File System. The Metadata Server nodes in your cluster then determine how to distribute the stored data across all available OSDs in your data centers and handle replication of this data across data centers. the sections that follow show how to configure brokers to store messaging data on the Ceph File System.

The figure below illustrates replication of data between the two data centers that have broker servers.

broker disaster recovery 3

Additional resources

For more information about:

  • Administrating CRUSH for your Red Hat Ceph Storage cluster, see CRUSH Administration.
  • The full set of attributes that you can set on a storage pool, see Pool Values.

17.4. Mounting the Ceph File System on your broker servers

Before you can configure brokers in your messaging system to store messaging data in your Red Hat Ceph Storage cluster, you first need to mount a Ceph File System (CephFS).

The procedure linked-to from this section shows you how to mount the CephFS on your broker servers.

Prerequisites

Procedure

For instructions on mounting the Ceph File System on your broker servers, see Mounting the Ceph File System as a kernel client.

17.5. Configuring brokers in a multi-site, fault-tolerant messaging system

To configure your brokers as part of a multi-site, fault-tolerant messaging system, you need to:

17.5.1. Adding backup brokers

Within each of your data centers, you need to add idle backup brokers that can take over from live master-slave broker groups that shut down in the event of a data center outage. You should replicate the configuration of live master brokers in your idle backup brokers. You also need to configure your backup brokers to accept client connections in the same way as your existing brokers.

In a later procedure, you see how to configure an idle backup broker to join an existing master-slave broker group. You must locate the idle backup broker in a separate data center to that of the live master-slave broker group. It is also recommended that you manually start the idle backup broker only in the event of a data center failure.

The following figure shows an example topology.

broker disaster recovery 4

Additional resources

17.5.2. Configuring brokers as Ceph clients

When you have added the backup brokers that you need for a fault-tolerant system, you must configure all of the broker servers with the Ceph client role. The client role enable brokers to store data in your Red Hat Ceph Storage cluster.

To learn how to configure Ceph clients, see Installing the Ceph Client Role.

17.5.3. Configuring shared store high availability

The Red Hat Ceph Storage cluster effectively creates a shared store that is available to brokers in different data centers. To ensure that messages remain available to broker clients in the event of a failure, you configure each broker in your live-backup group to use:

  • The shared store high availability (HA) policy
  • The same journal, paging, and large message directories in the Ceph File System

The following procedure shows how to configure the shared store HA policy on the master, slave, and idle backup brokers of your live-backup group.

Procedure

  1. Edit the broker.xml configuration file of each broker in the live-backup group. Configure each broker to use the same paging, bindings, journal, and large message directories in the Ceph File System.

    # Master Broker - DC1
    <paging-directory>mnt/cephfs/broker1/paging</paging-directory>
    <bindings-directory>/mnt/cephfs/data/broker1/bindings</bindings-directory>
    <journal-directory>/mnt/cephfs/data/broker1/journal</journal-directory>
    <large-messages-directory>mnt/cephfs/data/broker1/large-messages</large-messages-directory>
    
    # Slave Broker - DC1
    <paging-directory>mnt/cephfs/broker1/paging</paging-directory>
    <bindings-directory>/mnt/cephfs/data/broker1/bindings</bindings-directory>
    <journal-directory>/mnt/cephfs/data/broker1/journal</journal-directory>
    <large-messages-directory>mnt/cephfs/data/broker1/large-messages</large-messages-directory>
    
    # Backup Broker (Idle) - DC2
    <paging-directory>mnt/cephfs/broker1/paging</paging-directory>
    <bindings-directory>/mnt/cephfs/data/broker1/bindings</bindings-directory>
    <journal-directory>/mnt/cephfs/data/broker1/journal</journal-directory>
    <large-messages-directory>mnt/cephfs/data/broker1/large-messages</large-messages-directory>
  2. Configure the backup broker as a master within it’s HA policy, as shown below. This configuration setting ensures that the backup broker immediately becomes the master when you manually start it. Because the broker is an idle backup, the failover-on-shutdown parameter that you can specify for an active master broker does not apply in this case.

    <configuration>
        <core>
            ...
            <ha-policy>
                <shared-store>
                    <master>
                    </master>
                </shared-store>
            </ha-policy>
            ...
        </core>
    </configuration>

Additional resources

17.6. Configuring clients in a multi-site, fault-tolerant messaging system

An internal client application is one that is running on a machine located in the same data center as the broker server. The following figure shows this topology.

broker disaster recovery 5

An external client application is one running on a machine located outside the broker data center. The following figure shows this topology.

broker disaster recovery 6

The following sub-sections describe show examples of configuring your internal and external client applications to connect to a backup broker in another data center in the event of a data center outage.

17.6.1. Configuring internal clients

If you experience a data center outage, internal client applications will shut down along with your brokers. To mitigate this situation, you must have another instance of the client application available in a separate data center. In the event of a data center outage, you manually start your backup client to connect to a backup broker that you have also manually started.

To enable the backup client to connect to a backup broker, you need to configure the client connection similarly to that of the client in your primary data center.

Example

A basic connection configuration for an AMQ Core Protocol JMS client to a master-slave broker group is shown below. In this example, host1 and host2 are the host servers for the master and slave brokers.

<ConnectionFactory connectionFactory = new ActiveMQConnectionFactory(“(tcp://host1:port,tcp://host2:port)?ha=true&retryInterval=100&retryIntervalMultiplier=1.0&reconnectAttempts=-1”);

To configure a backup client to connect to a backup broker in the event of a data center outage, use a similar connection configuration, but specify only the host name of your backup broker server. In this example, the backup broker server is host3.

<ConnectionFactory connectionFactory = new ActiveMQConnectionFactory(“(tcp://host3:port)?ha=true&retryInterval=100&retryIntervalMultiplier=1.0&reconnectAttempts=-1”);

Additional resources

For more information about configuring broker and client network connections, see:

17.6.2. Configuring external clients

To enable an external broker client to continue producing or consuming messaging data in the event of a data center outage, you must configure the client to fail over to a broker in another data center. In the case of a multi-site, fault-tolerant system, you configure the client to fail over to the backup broker that you manually start in the event of an outage.

Examples

Shown below are examples of configuring the AMQ Core Protocol JMS and AMQP JMS clients to fail over to a backup broker in the event that the primary master-slave group is unavailable. In these examples, host1 and host2 are the host servers for the primary master and slave brokers, while host3 is the host server for the backup broker that you manually start in the event of a data center outage.

  • To configure an AMQ Core Protocol JMS client, include the backup broker on the ordered list of brokers that the client attempts to connect to.

    <ConnectionFactory connectionFactory = new ActiveMQConnectionFactory(“(tcp://host1:port,tcp://host2:port,tcp://host3:port)?ha=true&retryInterval=100&retryIntervalMultiplier=1.0&reconnectAttempts=-1”);
  • To configure an AMQP JMS client, include the backup broker in the failover URI that you configure on the client.

    failover:(amqp://host1:port,amqp://host2:port,amqp://host3:port)?jms.clientID=foo&failover.maxReconnectAttempts=20

Additional resources

For more information about configuring failover on:

17.7. Verifying storage cluster health during a data center outage

When you have configured your Red Hat Ceph Storage cluster for fault tolerance, the cluster continues to run in a degraded state without losing data, even when one of your data centers fails.

This procedure shows how to verify the status of your cluster while it runs in a degraded state.

Procedure

  1. To verify the status of your Ceph storage cluster, use the health or status commands:

    # ceph health
    # ceph status
  2. To watch the ongoing events of the cluster on the command line, open a new terminal. Then, enter:

    # ceph -w

When you run any of the preceding commands, you see output indicating that the storage cluster is still running, but in a degraded state. Specifically, you should see a warning that resembles the following:

health: HEALTH_WARN
        2 osds down
        Degraded data redundancy: 42/84 objects degraded (50.0%), 16 pgs unclean, 16 pgs degraded

Additional resources

  • For more information about monitoring the health of your Red Hat Ceph Storage cluster, see Monitoring.

17.8. Maintaining messaging continuity during a data center outage

The following procedure shows you how to keep brokers and associated messaging data available to clients during a data center outage. Specifically, when a data center fails, you need to:

  • Manually start any idle backup brokers that you created to take over from brokers in your failed data center.
  • Connect internal or external clients to the new active brokers.

Prerequisites

Procedure

  1. For each master-slave broker pair in the failed data center, manually start the idle backup broker that you added.

    broker disaster recovery 7
  2. Reestablish client connections.

    1. If you were using an internal client in the failed data center, manually start the backup client that you created. As described in Configuring clients in a multi-site, fault-tolerant messaging system, you must configure the client to connect to the backup broker that you manually started.

      The following figure shows the new topology.

      broker disaster recovery 8
    2. If you have an external client, manually connect the external client to the new active broker or observe that the clients automatically fails over to the new active broker, based on its configuration. For more information, see Configuring external clients.

      The following figure shows the new topology.

      broker disaster recovery 9

17.9. Restarting a previously failed data center

When a previously failed data center is back online, follow these steps to restore the original state of your messaging system:

  • Restart the servers that host the nodes of your Red Hat Ceph Storage cluster
  • Restart the brokers in your messaging system
  • Re-establish connections from your client applications to your restored brokers

The following sub-sections show to perform these steps.

17.9.1. Restarting storage cluster servers

When you restart Monitor, Metadata Server, Manager, and Object Storage Device (OSD) nodes in a previously failed data center, your Red Hat Ceph Storage cluster self-heals to restore full data redundancy. During this process, Red Hat Ceph Storage automatically backfills data to the restored OSD nodes, as needed.

To verify that your storage cluster is automatically self-healing and restoring full data redundancy, use the commands previously shown in Verifying storage cluster health during a data center outage. When you re-execute these commands, you see that the percentage degradation indicated by the previous HEALTH_WARN message starts to improve until it returns to 100%.

17.9.2. Restarting broker servers

The following procedure shows how to restart your broker servers when your storage cluster is no longer operating in a degraded state.

Procedure

  1. Stop any client applications connected to backup brokers that you manually started when the data center outage occurred.
  2. Stop the backup brokers that you manually started.

    1. On Linux:

      BROKER_INSTANCE_DIR/bin/artemis stop
    2. On Windows:

      BROKER_INSTANCE_DIR\bin\artemis-service.exe stop
  3. In your previously failed data center, restart the original master and slave brokers.

    1. On Linux:

      BROKER_INSTANCE_DIR/bin/artemis run
    2. On Windows:

      BROKER_INSTANCE_DIR\bin\artemis-service.exe start

The original master broker automatically resumes its role as master when you restart it.

17.9.3. Reestablishing client connections

When you have restarted your broker servers, reconnect your client applications to those brokers. The following subsections describe how to reconnect both internal and external client applications.

17.9.3.1. Reconnecting internal clients

Internal clients are those running in the same, previously failed data center as the restored brokers. To reconnect internal clients, restart them. Each client application reconnects to the restored master broker that is specified in its connection configuration.

For more information about configuring broker and client network connections, see:

17.9.3.2. Reconnecting external clients

External clients are those running outside the data center that previously failed. Based on your client type, and the information in Configuring external broker clients, you either configured the client to automatically fail over to a backup broker, or you manually established this connection. When you restore your previously failed data center, you reestablish a connection from your client to the restored master broker in a similar way, as described below.

  • If you configured your external client to automatically fail over to a backup broker, the client automatically fails back to the original master broker when you shut down the backup broker and restart the original master broker.
  • If you manually connected the external client to a backup broker when a data center outage occurred, you must manually reconnect the client to the original master broker that you restart.