Chapter 15. Handling a data center failure

As a storage administrator, you can take preventive measures to avoid a data center failure. These preventive measures include:

  • Configuring the data center infrastructure.
  • Setting up failure domains within the CRUSH map hierarchy.
  • Designating failure nodes within the domains.

15.1. Prerequisites

  • A healthy running Red Hat Ceph Storage cluster.
  • Root-level access to all nodes in the storage cluster.

15.2. Avoiding a data center failure

Configuring the data center infrastructure

Each data center within a stretch cluster can have a different storage cluster configuration to reflect local capabilities and dependencies. Set up replication between the data centers to help preserve the data. If one data center fails, the other data centers in the storage cluster contain copies of the data.
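
For example, assuming a hypothetical replicated pool named data, three-way replication can be requested so that a copy of every object survives the loss of one data center. This is only a sketch; the pool name is an assumption, and the copies are spread across the data centers only once a CRUSH rule that uses the datacenter failure domain is in place:

    ceph osd pool set data size 3
    ceph osd pool set data min_size 2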

Setting up failure domains within the CRUSH map hierarchy

Failure, or failover, domains are redundant copies of domains within the storage cluster. If an active domain fails, the failure domain becomes the active domain.

By default, the CRUSH map lists all nodes in a storage cluster within a flat hierarchy. However, for best results, create a logical hierarchical structure within the CRUSH map. The hierarchy designates the domains to which each node belongs and the relationships among those domains within the storage cluster, including the failure domains. Defining the failure domains for each domain within the hierarchy improves the reliability of the storage cluster.

When planning a storage cluster that contains multiple data centers, place the nodes within the CRUSH map hierarchy so that if one data center goes down, the rest of the storage cluster stays up and running.
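
Before rearranging the hierarchy, it can help to review the current CRUSH map in a readable form. As a minimal sketch, the map can be extracted and decompiled with crushtool; the file names crushmap.bin and crushmap.txt are arbitrary:

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt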

Designating failure nodes within the domains

If you plan to use three-way replication for data within the storage cluster, consider the location of the nodes within the failure domain. If an outage occurs within a data center, it is possible that some data might reside in only one copy. When this scenario happens, there are two options:

  • Leave the data in read-only status with the standard settings.
  • Live with only one copy for the duration of the outage.

With the standard settings, and because of the randomness of data placement across the nodes, not all of the data is affected by such an outage. However, if any data exists in only one copy, the storage cluster reverts to read-only mode.
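
Which of the two options applies is typically governed by the min_size setting of the affected pools. As a hedged sketch, assuming a hypothetical replicated pool named data with three copies, lowering min_size to 1 allows I/O to continue on data that has only one remaining copy for the duration of the outage:

    ceph osd pool set data min_size 1

Once the failed data center is back and recovery has completed, restore the original value, for example with ceph osd pool set data min_size 2.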

15.3. Handling a data center failure

Red Hat Ceph Storage can withstand catastrophic failures of the infrastructure, such as losing one of the data centers in a stretch cluster. For the standard object store use case, all three data centers can be configured independently, with replication set up between them. In this scenario, the storage cluster configuration in each of the data centers might be different, reflecting the local capabilities and dependencies.

Consider the logical structure of the placement hierarchy. A proper CRUSH map can be used, reflecting the hierarchical structure of the failure domains within the infrastructure. Using logical hierarchical definitions improves the reliability of the storage cluster compared to the default flat definitions. Failure domains are defined in the CRUSH map. The default CRUSH map contains all nodes in a flat hierarchy. In a three data center environment, such as a stretch cluster, the placement of nodes should be managed so that one data center can go down while the storage cluster stays up and running. Consider which failure domain a node resides in when using 3-way replication for the data.

In the example below, the resulting map is derived from the initial setup of the storage cluster with 6 OSD nodes. In this example, all nodes have only one disk and hence one OSD. All of the nodes are arranged under the default root, which is the standard root of the hierarchy tree. Because a reweight value below 1 is assigned to two of the OSDs, these OSDs receive fewer chunks of data than their weight would otherwise allow. These nodes were introduced later with bigger disks than the initial OSD disks. This does not affect the ability of the data placement to withstand the failure of a group of nodes.

Example

[ceph: root@host01 /]# ceph osd tree
ID WEIGHT  TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.33554 root default
-2 0.04779     host host03
 0 0.04779         osd.0            up  1.00000          1.00000
-3 0.04779     host host02
 1 0.04779         osd.1            up  1.00000          1.00000
-4 0.04779     host host01
 2 0.04779         osd.2            up  1.00000          1.00000
-5 0.04779     host host04
 3 0.04779         osd.3            up  1.00000          1.00000
-6 0.07219     host host06
 4 0.07219         osd.4            up  0.79999          1.00000
-7 0.07219     host host05
 5 0.07219         osd.5            up  0.79999          1.00000

Using logical hierarchical definitions to group the nodes into the same data center achieves data placement that can withstand the failure of a data center. Possible definition types, such as root, datacenter, rack, row, and host, allow the failure domains of the three data center stretch cluster to be reflected:

  • Nodes host01 and host02 reside in data center 1 (DC1)
  • Nodes host03 and host05 reside in data center 2 (DC2)
  • Nodes host04 and host06 reside in data center 3 (DC3)
  • All data centers belong to the same structure (allDC)

Since all OSDs in a host already belong to the host definition, no change is needed there. All the other assignments can be adjusted during runtime of the storage cluster by:

  • Defining the bucket structure with the following commands:

    ceph osd crush add-bucket allDC root
    ceph osd crush add-bucket DC1 datacenter
    ceph osd crush add-bucket DC2 datacenter
    ceph osd crush add-bucket DC3 datacenter
  • Moving the nodes into the appropriate place within this structure by modifying the CRUSH map:

    ceph osd crush move DC1 root=allDC
    ceph osd crush move DC2 root=allDC
    ceph osd crush move DC3 root=allDC
    ceph osd crush move host01 datacenter=DC1
    ceph osd crush move host02 datacenter=DC1
    ceph osd crush move host03 datacenter=DC2
    ceph osd crush move host05 datacenter=DC2
    ceph osd crush move host04 datacenter=DC3
    ceph osd crush move host06 datacenter=DC3

Within this structure, any new hosts can be added, as well as new disks. By placing the OSDs at the right place in the hierarchy, the CRUSH algorithm places redundant copies into different failure domains within the structure.
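
As a sketch, a new node could be attached to one of the data centers by creating a host bucket for it and moving it under the corresponding datacenter bucket; the host name host07 and its placement in DC1 are assumptions:

    ceph osd crush add-bucket host07 host
    ceph osd crush move host07 datacenter=DC1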

The above example results in the following:

Example

[ceph: root@host01 /]# ceph osd tree
ID  WEIGHT  TYPE NAME               UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -8 6.00000 root allDC
 -9 2.00000     datacenter DC1
 -4 1.00000         host host01
  2 1.00000             osd.2            up  1.00000          1.00000
 -3 1.00000         host host02
  1 1.00000             osd.1            up  1.00000          1.00000
-10 2.00000     datacenter DC2
 -2 1.00000         host host03
  0 1.00000             osd.0            up  1.00000          1.00000
 -7 1.00000         host host05
  5 1.00000             osd.5            up  0.79999          1.00000
-11 2.00000     datacenter DC3
 -6 1.00000         host host06
  4 1.00000             osd.4            up  0.79999          1.00000
 -5 1.00000         host host04
  3 1.00000             osd.3            up  1.00000          1.00000
 -1       0 root default

The listing above shows the resulting CRUSH map by displaying the OSD tree. It is now easy to see how the hosts belong to a data center and how all data centers belong to the same top-level structure, while the locations remain clearly distinguished.
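
For the replicas to actually be spread across the three data centers, a replication rule that uses the allDC root and the datacenter failure domain can be created and assigned to the pools. This is a minimal sketch; the rule name allDC_rule and the pool name data are assumptions:

    ceph osd crush rule create-replicated allDC_rule allDC datacenter
    ceph osd pool set data crush_rule allDC_rule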

Note

Placing the data in the proper locations according to the map works properly only within a healthy cluster. Misplacement might happen when some OSDs are not available. Such misplacements are corrected automatically once it is possible to do so.
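
The progress of this automatic correction, including the number of misplaced objects and the recovery activity, can be watched with the cluster status commands, for example:

    ceph -s
    ceph health detail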

Additional Resources

  • See the CRUSH administration chapter in the Red Hat Ceph Storage Storage Strategies Guide for more information.