Chapter 2. General Principles
2.1. Identifying a Performance Use Case
One of the most important steps in a successful Ceph deployment is identifying a price/performance profile suitable for the cluster’s use case and workload, and then choosing hardware to match it. For example, choosing IOPS-optimized hardware for a cold storage application increases hardware costs unnecessarily, whereas choosing capacity-optimized hardware for its more attractive price point in an IOPS-intensive workload will likely lead to users complaining about slow performance.
The primary use cases for Ceph are:
- IOPS-optimized: IOPS-optimized deployments are suitable for cloud computing operations, such as running MySQL or MariaDB instances as virtual machines on OpenStack. They require higher-performance storage, such as SAS drives with separate SSD journals, to handle frequent write operations. Some high-IOPS scenarios use all-flash storage to improve IOPS and total throughput. Additionally, the storage industry is evolving with Non-Volatile Memory Express (NVMe) for SSDs, which should improve performance substantially.
- Throughput-optimized: Throughput-optimized deployments are suitable for serving significant amounts of data, such as graphic, audio, and video content. They require networking hardware, controllers, and SAS drives with acceptable total throughput characteristics. Where write performance is a requirement, SSD journals will substantially improve it.
- Capacity-optimized: Capacity-optimized deployments are suitable for storing significant amounts of data as inexpensively as possible. They typically trade performance for a more attractive price point. For example, capacity-optimized deployments often use slower and less expensive SATA drives and co-locate journals rather than using SSDs for journaling.
The foregoing use cases aren’t exhaustive. Ceph is also evolving with its technology preview "BlueStore" for rotating disks, which may avoid the added cost of SSD journaling in some use cases when BlueStore is production ready. "PMStore," when it is production ready, should improve the performance of solid state drives and flash memory too.
2.2. Considering Storage Density
Hardware planning should include distributing Ceph daemons and other processes that use Ceph across many hosts to maintain high availability in the event of hardware faults. This means that you will need to balance storage density considerations with the need to rebalance (backfill) your cluster in the event of hardware faults. A common mistake is to use very high storage density in small clusters, which can overload networking.
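To make the density trade-off concrete, the sketch below estimates how much data each surviving host must absorb when one host fails. It assumes data is spread evenly across hosts and that the host is the failure domain. The function name and the example capacities are illustrative only and do not come from any Ceph tool.

```python
def backfill_per_survivor_tb(hosts: int, used_tb_per_host: float) -> float:
    """Data (TB) each surviving host receives when one host fails.

    Assumes an even data distribution and a host-level failure domain.
    """
    if hosts < 2:
        raise ValueError("need at least two hosts to recover from a host failure")
    return used_tb_per_host / (hosts - 1)


# The same 216 TB of stored data in two hypothetical layouts.
dense = backfill_per_survivor_tb(hosts=4, used_tb_per_host=54.0)
spread = backfill_per_survivor_tb(hosts=12, used_tb_per_host=18.0)

print(f"4 dense hosts:    {dense:.2f} TB backfilled per surviving host")
print(f"12 smaller hosts: {spread:.2f} TB backfilled per surviving host")
```

Under these assumptions, the denser layout pushes far more recovery traffic through each surviving host's network links, which is the overload risk described above.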
2.3. Using Identical Hardware
Create pools and define CRUSH hierarchies such that the OSD hardware within the pool is identical. That is:
- Same controller.
- Same drive size.
- Same RPMs.
- Same seek times.
- Same I/O.
- Same network throughput.
- Same journal configuration.
Using the same hardware within a pool provides a consistent performance profile, simplifies provisioning and streamlines troubleshooting.
2.4. Using 10Gb Ethernet as the Production Minimum
Carefully consider bandwidth requirements for your cluster network, be mindful of network link oversubscription, and segregate the intra-cluster traffic from the client-to-cluster traffic.
1Gbps isn’t suitable for production clusters.
In the case of a drive failure, replicating 1TB of data across a 1Gbps network takes 3 hours, and 3TB (a typical drive configuration) takes 9 hours. By contrast, with a 10Gbps network, the replication times would be 20 minutes and 1 hour, respectively. Remember that when an OSD fails, the cluster recovers by replicating the data it contained to other OSDs within the pool.
failed OSD(s) / total OSDs
The failure of a larger domain such as a rack means that your cluster will utilize considerably more bandwidth. Administrators usually prefer that a cluster recovers as quickly as possible.
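The link-speed comparison above can be explored with a small calculation. The sketch below computes a best-case transfer time from data size and link speed. The `efficiency` parameter is an assumption introduced here to stand in for protocol overhead and competing traffic; it is not a Ceph setting. At full line rate the results are lower bounds, and the rounder figures quoted in this section sit above them, plausibly because real recovery never gets the whole link.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_tb terabytes over a link_gbps link.

    efficiency (0 < e <= 1) is an assumed fraction of line rate that is
    actually available to recovery traffic; 1.0 is the theoretical best case.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = data_tb * 1e12 * 8                      # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


for tb in (1, 3):
    for gbps in (1, 10):
        hours = transfer_hours(tb, gbps)
        print(f"{tb} TB over {gbps} Gbps: {hours:.2f} h at line rate")
```

Passing a lower `efficiency` value, for example 0.7, gives a rough feel for how much longer recovery takes on a shared or oversubscribed link.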
At a minimum, a single 10Gbps Ethernet link should be used for storage hardware. If your Ceph nodes have many drives each, add additional 10Gbps Ethernet links for connectivity and throughput.
Set up front-side and back-side networks on separate NICs.
Ceph supports a public (front-side) network and a cluster (back-side) network. The public network handles client traffic and communication with Ceph monitors. The cluster (back-side) network handles OSD heartbeats, replication, backfilling and recovery traffic. Red Hat recommends allocating bandwidth to the cluster (back-side) network such that it is a multiple of the front-side network, using osd pool default size as the basis for your multiple on replicated pools. Red Hat also recommends running the public and cluster networks on separate NICs.
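One possible reading of this sizing guidance is sketched below. It assumes the multiple is taken directly from the replica count (osd pool default size). The helper name and the example bandwidth are hypothetical, and a different multiple may suit your workload better.

```python
def cluster_network_gbps(public_gbps: float, osd_pool_default_size: int) -> float:
    """Suggested back-side bandwidth as a multiple of front-side bandwidth.

    Assumption: the multiple equals the replica count of replicated pools.
    """
    if osd_pool_default_size < 1:
        raise ValueError("replica count must be at least 1")
    return public_gbps * osd_pool_default_size


# Example: a 10 Gbps public network with three replicas per object.
suggested = cluster_network_gbps(public_gbps=10, osd_pool_default_size=3)
print(f"Suggested cluster network bandwidth: {suggested} Gbps")
```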
When building a cluster consisting of multiple racks (common for large clusters), consider utilizing as much network bandwidth between switches as possible in a "fat tree" design for optimal performance. A typical 10Gbps Ethernet switch has 48 10Gbps ports and four 40Gbps ports. If you only use one 40Gbps port for connectivity, you can only connect 4 servers at full speed (i.e., 10Gbps x 4). Use all of your 40Gbps ports for maximum throughput. If you have unused 10Gbps ports, you can aggregate them (with QSFP+ to 4x SFP+ cables) into more 40Gbps ports to connect to other racks and to spine routers.
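The uplink arithmetic for the example switch can be checked with a short sketch. The function names are illustrative, and the calculation assumes every server-facing port could be driven at full rate at the same time.

```python
def oversubscription_ratio(downlink_ports: int, downlink_gbps: float,
                           uplink_ports: int, uplink_gbps: float) -> float:
    """Ratio of total server-facing bandwidth to total uplink bandwidth."""
    if uplink_ports < 1:
        raise ValueError("at least one uplink is required")
    return (downlink_ports * downlink_gbps) / (uplink_ports * uplink_gbps)


def servers_at_full_speed(uplink_ports: int, uplink_gbps: float,
                          server_gbps: float) -> int:
    """Number of servers whose full link rate fits within the uplinks."""
    return int((uplink_ports * uplink_gbps) // server_gbps)


# The switch described above: 48 x 10 Gbps ports and up to four 40 Gbps uplinks.
for uplinks in (1, 4):
    ratio = oversubscription_ratio(48, 10, uplinks, 40)
    full = servers_at_full_speed(uplinks, 40, 10)
    print(f"{uplinks} uplink(s): {ratio:.0f}:1 oversubscribed, "
          f"{full} servers at full speed")
```

With a single uplink the sketch reproduces the four-server limit mentioned above; using all four uplinks reduces the oversubscription but does not eliminate it.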
For network optimization, we recommend jumbo frames for a better CPU/bandwidth ratio. We also recommend a non-blocking network switch back-plane.
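A rough way to see why jumbo frames help the CPU/bandwidth ratio is to count how many packets a transfer needs at each MTU, since every packet carries a fixed processing cost. The sketch below assumes TCP over IPv4 with no header options (40 bytes of headers per packet) and compares the standard 1500-byte MTU with a 9000-byte jumbo MTU. The constant and function names are illustrative.

```python
IP_TCP_HEADER_BYTES = 40  # assumption: IPv4 (20) + TCP (20), no options


def packets_needed(payload_bytes: int, mtu: int) -> int:
    """Packets required to carry payload_bytes at the given MTU."""
    per_packet = mtu - IP_TCP_HEADER_BYTES
    if per_packet <= 0:
        raise ValueError("MTU too small to carry any payload")
    return -(-payload_bytes // per_packet)  # integer ceiling division


one_gb = 10**9
standard = packets_needed(one_gb, mtu=1500)
jumbo = packets_needed(one_gb, mtu=9000)
print(f"1 GB at MTU 1500: {standard} packets")
print(f"1 GB at MTU 9000: {jumbo} packets")
print(f"Jumbo frames need about {standard / jumbo:.1f}x fewer packets")
```

Jumbo frames only help when every device on the path, including switches and all NICs, is configured for the larger MTU.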
You may deploy a Ceph cluster across geographic regions; however, this is NOT RECOMMENDED UNLESS you use a dedicated network connection between datacenters. Ceph prefers consistency and acknowledges writes synchronously. Using the internet (packet-switched with many hops) between geographically separate datacenters will introduce significant write latency.
2.5. Avoiding RAID
Ceph already replicates or erasure-codes objects, so adding RAID duplicates that protection, reduces available capacity, and is an unnecessary expense. Additionally, a degraded RAID array will have a negative impact on performance. If you have systems with RAID controllers, configure them to present each drive individually (JBOD, or a single-drive RAID 0 volume per disk).