Chapter 4. Deploying a Cluster

The initial deployment of a production cluster follows the same procedure as deploying a proof-of-concept system; the only material difference is that the initial deployment uses production-grade hardware. First, follow the prerequisites section of the Installation Guide for Red Hat Enterprise Linux and execute the appropriate steps for each node. The following sections provide additional guidance relevant to production clusters.

4.1. Naming Hosts

When naming hosts, consider their use case and performance profile. For example, if the hosts will store client data, consider naming them according to their hardware configuration and performance profile:

  • data-ssd-1, data-ssd-2
  • hot-storage-1, hot-storage-2
  • sata-1, sata-2
  • sas-ssd-1, sas-ssd-2

The naming convention may make it easier to manage the cluster and troubleshoot hardware issues as they arise.

If the host contains hardware for multiple use cases—​for example, the host contains SSDs for data, SAS drives with SSDs for journals, and SATA drives with co-located journals for cold storage—​choose a generic name for the host. For example:

  • osd-node-1, osd-node-2

Generic host names can be extended when using logical host names in the CRUSH hierarchy as needed. For example:

  • osd-node-1-ssd, osd-node-1-sata, osd-node-1-sas-ssd, osd-node-1-bucket-index
  • osd-node-2-ssd, osd-node-2-sata, osd-node-2-sas-ssd, osd-node-2-bucket-index

See Using Logical Host Names in a CRUSH Map for additional details.

4.2. Tuning the Kernel

Production clusters benefit from tuning the operating system, specifically limits and memory allocation. Ensure that adjustments are set for all nodes within the cluster. Consult Red Hat support for additional guidance.

4.2.1. Adjusting TCMalloc

Under heavy multi-threaded memory allocation workloads, TCMalloc can consume significant amounts of CPU and reduce IOPS when it doesn’t have enough thread cache available. Red Hat recommends increasing the amount of thread cache beyond the default 32MB.

To change the TCMalloc cache setting, edit /etc/sysconfig/ceph, and use the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES setting to adjust the cache size. For example, increasing the cache from 64MB to 128MB can substantially increase IOPS while reducing CPU overhead.
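
For instance, a 128MB thread cache could be set as follows. This is a sketch only; the byte value shown is 128MB expressed in bytes, and the Ceph daemons must be restarted afterward for the new value to take effect:

```shell
# /etc/sysconfig/ceph
# 128 MB thread cache, expressed in bytes (128 * 1024 * 1024)
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
```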

To release the memory that TCMalloc has allocated, but which is not being used by the Ceph daemon itself, execute the following:

# ceph tell osd.* heap release

4.2.2. Reserving Free Memory for OSDs

To help prevent insufficient memory-related errors during OSD memory allocation requests, set the vm.min_free_kbytes option in the sysctl.conf file on OSD nodes. This option specifies the amount of physical memory to keep in reserve. The recommended settings are based on the amount of system RAM. For example:

  • For 64GB RAM, reserve 1GB.

    vm.min_free_kbytes = 1048576
  • For 128GB RAM, reserve 2GB.

    vm.min_free_kbytes = 2097152
  • For 256GB RAM, reserve 3GB.

    vm.min_free_kbytes = 3145728
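
The recommended values above are simply the reserved RAM expressed in kilobytes, which can be confirmed with shell arithmetic:

```shell
# vm.min_free_kbytes takes a value in kilobytes:
echo "$(( 1 * 1024 * 1024 ))"   # 1 GB -> 1048576
echo "$(( 2 * 1024 * 1024 ))"   # 2 GB -> 2097152
echo "$(( 3 * 1024 * 1024 ))"   # 3 GB -> 3145728
```

After adding the setting to the sysctl.conf file, run sysctl -p to apply it without a reboot.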

4.2.3. Increasing File Descriptors

The Ceph Object Gateway may hang if it runs out of file descriptors. Modify /etc/security/limits.conf on Ceph Object Gateway nodes to increase the file descriptors for the Ceph Object Gateway. For example:

ceph       soft    nofile    unlimited

4.2.4. Adjusting ulimit On Large Clusters

For system administrators that will run Ceph administrator commands on large clusters—​for example, 1024 OSDs or more—​create an /etc/security/limits.d/50-ceph.conf file on each node that will run administrator commands with the following contents:

<username>       soft    nproc     unlimited

Replace <username> with the name of the non-root account that will run Ceph administrator commands.

Note

The root user’s ulimit is already set to "unlimited" by default on RHEL.

4.2.5. Adjusting PID Count

Hosts with high numbers of OSDs may spawn a lot of threads, especially during recovery and re-balancing. Many Linux kernels default to a relatively small maximum number of threads. Check the default settings to see if they are suitable.

cat /proc/sys/kernel/pid_max

Consider setting kernel.pid_max to a higher number of threads. The theoretical maximum is 4,194,303 threads. For example, add the following to the /etc/sysctl.conf file to set it to the maximum:

kernel.pid_max = 4194303

To effect the changes without rebooting, execute:

# sysctl -p

To verify the changes, execute:

# sysctl -a | grep kernel.pid_max

4.3. Configuring Ansible Groups

This procedure is only pertinent for deploying Ceph using Ansible. The ceph-ansible package is already configured with a default osds group. If the cluster will only have one use case and storage policy, proceed with the procedure documented in the Installing Ceph Using Ansible section of the Installation Guide for Red Hat Enterprise Linux.

If the cluster will support multiple use cases and storage policies, create a group for each one. See the Configuring OSD Settings section of the Installation Guide for Red Hat Enterprise Linux for high level details.

For each use case, copy /usr/share/ceph-ansible/group_vars/osds.sample to a file named after the group. For example, if the cluster has IOPS-optimized, throughput-optimized, and capacity-optimized use cases, create a separate file representing the group for each use case:

cd /usr/share/ceph-ansible/group_vars/
cp osds.sample osds-iops
cp osds.sample osds-throughput
cp osds.sample osds-capacity

Then, configure each file according to the use case.

Once the group variable files are configured, edit the site.yml file to ensure that it includes each new group. For example:

- hosts: osds-iops
  gather_facts: false
  become: True
  roles:
  - ceph-osd

- hosts: osds-throughput
  gather_facts: false
  become: True
  roles:
  - ceph-osd

- hosts: osds-capacity
  gather_facts: false
  become: True
  roles:
  - ceph-osd

Finally, in the /etc/ansible/hosts file, place the OSD nodes associated to a group under the corresponding group name. For example:

[osds-iops]
<ceph-host-name> devices="[ '<device_1>', '<device_2>' ]"

[osds-throughput]
<ceph-host-name> devices="[ '<device_1>', '<device_2>' ]"

[osds-capacity]
<ceph-host-name> devices="[ '<device_1>', '<device_2>' ]"
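
As an illustration only, using hypothetical host and device names, a populated inventory might look like this:

```
[osds-iops]
osd-node-1 devices="[ '/dev/nvme0n1', '/dev/nvme1n1' ]"

[osds-throughput]
osd-node-2 devices="[ '/dev/sdb', '/dev/sdc' ]"
```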

4.4. Deploying Ceph

Once the prerequisites and initial tuning are complete, deploy the Ceph cluster. When deploying a production cluster, Red Hat recommends setting up the initial monitor cluster and enough OSD nodes to reach an active + clean state. See Storage Cluster Installation for details.

Then, install the Ceph CLI client on an administration node. See Ceph CLI installation for details.

Once the initial cluster is running, consider adding the settings in the following sections to the Ceph configuration file.

Note

If deployment uses a tool such as Ansible, add the following settings to the deployment tool’s configuration. For example, see overriding Ceph default settings for examples on how to modify Ceph settings using Ansible.
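
With ceph-ansible, for instance, Ceph configuration settings can be supplied through the ceph_conf_overrides variable, typically in the group_vars/all.yml file. A sketch, using backfill limits as a hypothetical payload:

```
ceph_conf_overrides:
  osd:
    osd_max_backfills: 1
    osd_recovery_max_active: 1
```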

4.4.1. Setting the Journal Size

Set the journal size for the Ceph cluster. Configuration tools such as Ansible may have a default value. Generally, the journal size should be the product of the synchronization interval and the slower of the disk and network throughput, multiplied by two (2).
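
As a worked example, with a hypothetical synchronization interval of 5 seconds and a sustained throughput of 100MB/s (the slower of disk and network), the journal size works out to:

```shell
# journal size (MB) = 2 * sync interval (s) * slower throughput (MB/s)
echo "$(( 2 * 5 * 100 ))"   # -> 1000 MB
```

The resulting value, in megabytes, would then be set as osd_journal_size in the Ceph configuration file.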

See Journal Settings for details.

4.4.2. Adjusting Backfill & Recovery Settings

Both backfill and recovery operations place heavy demands on I/O, which can degrade performance for end users. To help accommodate I/O demand during a cluster expansion or recovery, set the following options and values in the Ceph configuration file:

[osd]
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
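
These values can also be injected into running OSD daemons without a restart; a sketch, assuming a running cluster and an administrative keyring:

```shell
# Apply the limits to all running OSDs at runtime
ceph tell osd.\* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'
```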

4.4.3. Adjusting the Cluster Map Size

When the cluster has thousands of OSDs, download the cluster map and check its file size. By default, the ceph-osd daemon caches 500 previous osdmaps. Even with deduplication, the map may consume a lot of memory per daemon. Tuning the cache size in the Ceph configuration file may help reduce memory consumption significantly. For example:

[global]
osd_map_message_max = 10

[osd]
osd_map_cache_size = 20
osd_map_max_advance = 10
osd_map_share_max_epochs = 10
osd_pg_epoch_persisted_max_stale = 10

4.4.4. Adjusting Scrubbing

By default, Ceph performs light scrubbing daily and deep scrubbing weekly. Light scrubbing checks object sizes and checksums to ensure that PGs are storing the same object data. Over time, disk sectors can go bad irrespective of object sizes and checksums. Deep scrubbing checks an object’s content with that of its replicas to ensure that the actual contents are the same. In this respect, deep scrubbing ensures data integrity in the manner of fsck, but the procedure imposes an I/O penalty on the cluster. Even light scrubbing can impact I/O.

The default settings may allow Ceph OSDs to initiate scrubbing at inopportune times such as peak operating times or periods with heavy loads. End users may experience latency and poor performance when scrubbing operations conflict with end user operations.

To prevent end users from experiencing poor performance, Ceph provides a number of scrubbing settings that can limit scrubbing to periods with lower loads or during off-peak hours. See scrubbing for details.

If the cluster experiences high loads during the day and low loads late at night, consider restricting scrubbing to night time hours. For example:

[osd]
osd_scrub_begin_hour = 23   # 23:00H, or 11:00 PM
osd_scrub_end_hour = 6      # 06:00H, or 6:00 AM

If time constraints are not an effective method of determining a scrubbing schedule, consider using the osd_scrub_load_threshold option. The default value is 0.5, but it can be lowered for low-load conditions. For example:

[osd]
osd_scrub_load_threshold = 0.25

4.4.5. Expanding the Cluster

Once the initial cluster is running and in an active+clean state, add additional OSD nodes and Ceph Object Gateway nodes to the cluster. Apply the steps detailed in Tuning the Kernel to each node. See Adding and Removing OSD Nodes for details on adding nodes.

For each OSD node added to the cluster, add OSDs to the cluster for each drive in the node that will store client data. See Adding an OSD for additional details. When using Ansible to add OSD nodes, refer to Configuring Ansible Groups, and add the OSD nodes to the appropriate group if the cluster will support multiple use cases.

For each Ceph Object Gateway node, install a gateway instance. See Ceph Object Gateway Installation for details.

Once the cluster returns to an active+clean state, remove any overrides and proceed with Developing Storage Strategies.

Note

Step 3 of Adding a Node and Step 10 of Adding an OSD With the Command Line Interface will be revisited in topics beginning with Developing CRUSH Hierarchies.