Chapter 3. General Principles

When selecting hardware for Red Hat Ceph Storage, examine the following general principles. These principles will help save time, avoid common mistakes, save money and achieve a more effective solution.

3.1. Identifying a Performance Use Case

One of the most important steps in a successful Ceph deployment is identifying a price-to-performance profile suitable for the cluster’s use case and workload. It is important to choose the right hardware for the use case. For example, choosing IOPS-optimized hardware for a cold storage application increases hardware costs unnecessarily, whereas choosing capacity-optimized hardware for its more attractive price point in an IOPS-intensive workload will likely lead to users complaining about slow performance.

The primary use cases for Ceph are:

  • IOPS-optimized: IOPS-optimized deployments are suitable for cloud computing operations, such as running MySQL or MariaDB instances as virtual machines on OpenStack. IOPS-optimized deployments require higher-performance storage, such as 15k RPM SAS drives and separate SSD journals, to handle frequent write operations. Some high-IOPS scenarios use all-flash storage to improve IOPS and total throughput.
  • Throughput-optimized: Throughput-optimized deployments are suitable for serving up significant amounts of data, such as graphic, audio and video content. Throughput-optimized deployments require networking hardware, controllers and hard disk drives with acceptable total throughput characteristics. In cases where write performance is a requirement, SSD journals will substantially improve write performance.
  • Capacity-optimized: Capacity-optimized deployments are suitable for storing significant amounts of data as inexpensively as possible. Capacity-optimized deployments typically trade performance for a more attractive price point. For example, capacity-optimized deployments often use slower and less expensive SATA drives and co-locate journals rather than using SSDs for journaling.

This document provides examples of Red Hat tested hardware suitable for these use cases.

3.2. Considering Storage Density

Hardware planning should include distributing Ceph daemons and other processes that use Ceph across many hosts to maintain high availability in the event of hardware faults. Balance storage density considerations with the need to rebalance the cluster in the event of hardware faults. A common hardware selection mistake is to use very high storage density in small clusters, which can overload networking during backfill and recovery operations.
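The trade-off between density and recovery load can be estimated with simple arithmetic. The following Python sketch is illustrative only: it assumes identical hosts, evenly distributed data, and recovery traffic spread evenly over the surviving hosts, and it ignores CRUSH placement rules. The function names and the example capacities are hypothetical.

```python
def failed_host_fraction(num_hosts: int) -> float:
    """Share of the cluster's data held by one host (even distribution assumed)."""
    if num_hosts < 2:
        raise ValueError("need at least two hosts to recover from a host failure")
    return 1.0 / num_hosts


def recovery_tb_per_survivor(num_hosts: int, used_tb_per_host: float) -> float:
    """Data each surviving host must absorb after one host fails (idealized)."""
    if num_hosts < 2:
        raise ValueError("need at least two hosts to recover from a host failure")
    return used_tb_per_host / (num_hosts - 1)


# The same 400 TB of stored data, packed densely or spread out (hypothetical numbers).
dense = recovery_tb_per_survivor(num_hosts=4, used_tb_per_host=100.0)
spread = recovery_tb_per_survivor(num_hosts=16, used_tb_per_host=25.0)
print(f"4 dense hosts:    {dense:.1f} TB per surviving host")
print(f"16 smaller hosts: {spread:.1f} TB per surviving host")
```

Under these assumptions, each surviving host in the dense layout must absorb about twenty times more data than in the spread-out layout, which is why high density in a small cluster can saturate the network during backfill and recovery.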

3.3. Use Identical Hardware

Create pools and define CRUSH hierarchies such that the OSD hardware within the pool is identical. That is:

  • Same controller.
  • Same drive size.
  • Same RPMs.
  • Same seek times.
  • Same I/O.
  • Same network throughput.
  • Same journal configuration.

Using the same hardware within a pool provides a consistent performance profile, simplifies provisioning and streamlines troubleshooting.
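One way to audit this rule is to compare the hardware attributes of every OSD in a pool. The sketch below is a minimal example and not a Ceph tool: the OsdHardware record, its field names, and the inventory data are all hypothetical, and in practice the values would come from your own asset inventory.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class OsdHardware:
    """Hypothetical record of the attributes that should match within a pool."""
    controller: str
    drive_size_tb: float
    rpm: int
    network_gbps: int
    journal: str


def mixed_pools(pool_osds: Dict[str, List[OsdHardware]]) -> Dict[str, Set[OsdHardware]]:
    """Return only the pools that contain more than one distinct hardware profile."""
    result = {}
    for pool, specs in pool_osds.items():
        distinct = set(specs)
        if len(distinct) > 1:
            result[pool] = distinct
    return result


# Hypothetical inventory: the "video" pool mixes two drive sizes.
sata_4tb = OsdHardware("ctrl-a", 4.0, 7200, 10, "co-located")
sata_8tb = OsdHardware("ctrl-a", 8.0, 7200, 10, "co-located")
inventory = {
    "archive": [sata_4tb, sata_4tb, sata_4tb],
    "video": [sata_4tb, sata_8tb],
}
for pool, profiles in mixed_pools(inventory).items():
    print(f"pool {pool!r} has {len(profiles)} different hardware profiles")
```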

3.4. Using 10Gbps Ethernet as the Production Minimum

Carefully consider bandwidth requirements for the cluster network, be mindful of network link oversubscription, and segregate the intra-cluster traffic from the client-to-cluster traffic.

Important

1Gbps isn’t suitable for production clusters.

In the case of a drive failure, replicating 1TB of data across a 1Gbps network takes 3 hours, and replicating 3TB (a typical drive configuration) takes 9 hours. By contrast, with a 10Gbps network, the replication times would be 20 minutes and 1 hour respectively. Remember that when an OSD fails, the cluster recovers by replicating the data it contained to other OSDs within the pool.

  failed OSD(s)
  -------------
   total OSDs
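The replication times quoted above can be approximated with a short calculation. The Python sketch below is an idealized estimate: it assumes decimal units (1TB = 10^12 bytes), a single link used exclusively for recovery, and an efficiency factor that is a hypothetical parameter standing in for protocol overhead and competing traffic.

```python
def replication_hours(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Estimate hours needed to copy data_tb terabytes over one link.

    efficiency is the assumed usable fraction of the nominal link rate.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    bits = data_tb * 1e12 * 8                      # terabytes to bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


for size_tb in (1, 3):
    for gbps in (1, 10):
        hours = replication_hours(size_tb, gbps)
        print(f"{size_tb}TB over {gbps}Gbps: at least {hours:.2f} hours")
```

At full line rate, 1TB over 1Gbps works out to roughly 2.2 hours. The higher figures quoted in this section are consistent with a link that delivers less than its nominal rate in practice.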

The failure of a larger domain such as a rack means that the cluster will utilize considerably more bandwidth. Administrators usually prefer that a cluster recovers as quickly as possible.

At a minimum, a single 10Gbps Ethernet link should be used for storage hardware. If the Ceph nodes have many drives each, add additional 10Gbps Ethernet links for connectivity and throughput.

Important

Set up front-side and back-side networks on separate NICs.

Ceph supports a public (front-side) network and a cluster (back-side) network. The public network handles client traffic and communication with Ceph monitors. The cluster (back-side) network handles OSD heartbeats, replication, backfilling and recovery traffic. Red Hat recommends allocating bandwidth to the cluster (back-side) network such that it is a multiple of the front-side network using osd_pool_default_size as the basis for your multiple on replicated pools. Red Hat also recommends running the public and cluster networks on separate NICs.
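As a rough illustration of that sizing rule, the sketch below multiplies the public network bandwidth by the replica count. The function name is hypothetical, the default of 3 mirrors Ceph's default osd_pool_default_size, and treating the replica count itself as the multiplier is one reading of the recommendation rather than a formula taken from this guide.

```python
def cluster_network_gbps(public_gbps: float, osd_pool_default_size: int = 3) -> float:
    """Suggested cluster (back-side) bandwidth as a multiple of the public network.

    Assumption: the multiple equals the replica count of the replicated pools.
    """
    if osd_pool_default_size < 1:
        raise ValueError("osd_pool_default_size must be at least 1")
    return public_gbps * osd_pool_default_size


# A 10Gbps public network with three-way replication.
print(cluster_network_gbps(10))
```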

When building a cluster consisting of multiple racks (common for large clusters), consider utilizing as much network bandwidth as possible between switches in a "fat tree" design for optimal performance. A typical 10Gbps Ethernet switch has 48 10Gbps ports and four 40Gbps ports. Use the 40Gbps ports on the spine for maximum throughput. Alternatively, consider aggregating unused 10Gbps ports with QSFP+ and SFP+ cables into more 40Gbps ports to connect to another rack and spine routers.
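Link oversubscription for such a switch can be checked with a quick calculation. The sketch below uses the port counts from the typical switch described above; the function name is hypothetical, and the result assumes every downlink port is in use at full rate.

```python
def oversubscription_ratio(down_ports: int, down_gbps: float,
                           up_ports: int, up_gbps: float) -> float:
    """Ratio of total downlink bandwidth to total uplink bandwidth."""
    if up_ports <= 0 or up_gbps <= 0:
        raise ValueError("uplink capacity must be positive")
    return (down_ports * down_gbps) / (up_ports * up_gbps)


# 48 x 10Gbps ports toward the nodes, 4 x 40Gbps ports toward the spine.
ratio = oversubscription_ratio(48, 10, 4, 40)
print(f"oversubscription {ratio:.0f}:1")  # prints "oversubscription 3:1"
```

A ratio above 1 means the nodes in a rack can collectively generate more traffic than the uplinks can carry, which matters most during recovery from a rack-level failure.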

Important

For network optimization, Red Hat recommends using jumbo frames for a better CPU/bandwidth ratio, and a non-blocking network switch back-plane. Red Hat Ceph Storage requires the same MTU value throughout all networking devices in the communication path, end-to-end for both public and cluster networks. Verify that the MTU value is the same on all nodes and networking equipment in the environment before using a Red Hat Ceph Storage cluster in production.

See the Verifying and configuring the MTU value section in the Red Hat Ceph Storage Configuration Guide for more details.
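A consistency check of this kind can be scripted. The following Python sketch is a minimal example and not the procedure from the Configuration Guide: it assumes Linux hosts, where the MTU of an interface can be read from /sys/class/net/<interface>/mtu, and it assumes the collected values are gathered by some means of your own. The function names, node names, and the value of 9000 for jumbo frames are illustrative assumptions.

```python
from pathlib import Path
from typing import Dict


def read_local_mtu(interface: str) -> int:
    """Read the MTU of a local interface from Linux sysfs."""
    return int(Path(f"/sys/class/net/{interface}/mtu").read_text().strip())


def mtu_mismatches(mtus: Dict[str, int], expected: int) -> Dict[str, int]:
    """Return the entries whose MTU differs from the expected value."""
    return {name: mtu for name, mtu in mtus.items() if mtu != expected}


# Hypothetical values collected from nodes and switches on the cluster network.
collected = {
    "osd-node-1:eth1": 9000,
    "osd-node-2:eth1": 9000,
    "switch-a:port-12": 1500,
}
for name, mtu in mtu_mismatches(collected, expected=9000).items():
    print(f"{name} has MTU {mtu}, expected 9000")
```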

3.5. Avoid RAID

Ceph can replicate or erasure code objects. RAID duplicates this functionality on the block level and reduces available capacity. Consequently, RAID is an unnecessary expense. Additionally, a degraded RAID will have a negative impact on performance.

Red Hat recommends that each hard drive be exported separately from the RAID controller as a single volume with write-back caching enabled. This requires a battery-backed or non-volatile flash memory device on the storage controller. It is important to make sure the battery is working, as most controllers will disable write-back caching if the memory on the controller can be lost as a result of a power failure. Periodically check the batteries and replace them if necessary, as they degrade over time. See the storage controller vendor’s documentation for details. Typically, the storage controller vendor provides storage management utilities to monitor and adjust the storage controller configuration without any downtime.

Using Just a Bunch of Drives (JBOD) in independent drive mode with Ceph is supported when using all Solid State Drives (SSDs), or for configurations with high numbers of drives per controller, for example, 60 drives attached to one controller. In this scenario, write-back caching can become a source of I/O contention. Because JBOD disables write-back caching, it is ideal here. One advantage of using JBOD mode is the ease of adding or replacing drives and then exposing the drive to the operating system immediately after it is physically plugged in.

3.6. Summary

Common mistakes in hardware selection for Ceph include:

  • Repurposing underpowered legacy hardware for use with Ceph.
  • Using dissimilar hardware in the same pool.
  • Using 1Gbps networks instead of 10Gbps or greater.
  • Neglecting to set up both public and cluster networks.
  • Using RAID instead of JBOD.
  • Selecting drives on a price basis without regard to performance or throughput.
  • Journaling on OSD data drives when the use case calls for an SSD journal.
  • Having a disk controller with insufficient throughput characteristics.

Use the examples in this document of Red Hat tested configurations for different workloads to avoid some of the foregoing hardware selection mistakes.