Chapter 4. Configuring a Cluster

The initial configuring of a production cluster is identical to configuring a proof-of-concept system. The only material difference is that the initial deployment will use production-grade hardware. First, follow the Requirements for Installing Red Hat Ceph Storage chapter of the Red Hat Ceph Storage 3 Installation Guide for Red Hat Enterprise Linux and execute the appropriate steps for each node. The following sections provide additional guidance relevant to production clusters.

4.1. Naming Hosts

When naming hosts, consider their use case and performance profile. For example, if the hosts will store client data, consider naming them according to their hardware configuration and performance profile. For example:

  • data-ssd-1, data-ssd-2
  • hot-storage-1, hot-storage-2
  • sata-1, sata-2
  • sas-ssd-1, sas-ssd-2

The naming convention may make it easier to manage the cluster and troubleshoot hardware issues as they arise.

If the host contains hardware for multiple use cases—​for example, the host contains SSDs for data, SAS drives with SSDs for journals, and SATA drives with co-located journals for cold storage—​choose a generic name for the host. For example:

  • osd-node-1 osd-node-2

Generic host names can be extended when using logical host names in the CRUSH hierarchy as needed. For example:

  • osd-node-1-ssd osd-node-1-sata osd-node-1-sas-ssd osd-node-1-bucket-index
  • osd-node-2-ssd osd-node-2-sata osd-node-2-sas-ssd osd-node-2-bucket-index

See Using Logical Host Names in a CRUSH Map for additional details.

4.2. Tuning the Kernel

Production clusters benefit from tuning the operating system, specifically limits and memory allocation. Ensure that adjustments are set for all nodes within the cluster. Consult Red Hat support for additional guidance.

4.2.1. Reserving Free Memory for OSDs

To help prevent insufficient memory-related errors during OSD memory allocation requests, set the vm.min_free_kbytes option in the sysctl.conf file on OSD nodes. This option specifies the amount of physical memory to keep in reserve. The recommended settings are based on the amount of system RAM. For example:

  • For 64GB RAM, reserve 1GB.

    vm.min_free_kbytes = 1048576
  • For 128GB RAM, reserve 2GB.

    vm.min_free_kbytes = 2097152
  • For 256GB RAM, reserve 3GB.

    vm.min_free_kbytes = 3145728

4.2.2. Increasing File Descriptors

The Ceph Object Gateway may hang if it runs out of file descriptors. Modify /etc/security/limits.conf on Ceph Object Gateway nodes to increase the file descriptors for the Ceph Object Gateway. For example:

ceph       soft    nproc     unlimited

4.2.3. Adjusting ulimit on Large Clusters

For system administrators that will run Ceph administrator commands on large clusters—​for example, 1024 OSDs or more—​create an /etc/security/limits.d/50-ceph.conf file on each node that will run administrator commands with the following contents:

<username>       soft    nproc     unlimited

Replace <username> with the name of the non-root account that will run Ceph administrator commands.

Note

The root user’s ulimit is already set to "unlimited" by default on RHEL.

4.3. Configuring Ansible Groups

This procedure is only pertinent for deploying Ceph using Ansible. The ceph-ansible package is already configured with a default osds group. If the cluster will only have one use case and storage policy, proceed with the procedure documented in the Installing a Red Hat Ceph Storage Cluster section of the Red Hat Ceph Storage 3 Installation Guide for Red Hat Enterprise Linux. If the cluster will support multiple use cases and storage policies, create a group for each one. Each use case should copy /usr/share/ceph-ansible/group_vars/osd.sample to a file named for the group name. For example, if the cluster has IOPS-optimized, throughput-optimized and capacity-optimized use cases, create separate files representing the groups for each use case. For example:

cd /usr/share/ceph-ansible/group_vars/
cp osds.sample osds-iops
cp osds.sample osds-throughput
cp osds.sample osds-capacity

Then, configure each file according to the use case.

Once the group variable files are configured, edit the site.yml file to ensure that it includes each new group. For example:

- hosts: osds-iops
  gather_facts: false
  become: True
  roles:
  - ceph-osd

- hosts: osds-throughput
  gather_facts: false
  become: True
  roles:
  - ceph-osd

- hosts: osds-capacity
  gather_facts: false
  become: True
  roles:
  - ceph-osd

Finally, in the /etc/ansible/hosts file, place the OSD nodes associated to a group under the corresponding group name. For example:

[osds-iops]
<ceph-host-name> devices="[ '<device_1>', '<device_2>' ]"

[osds-throughput]
<ceph-host-name> devices="[ '<device_1>', '<device_2>' ]"

[osds-capacity]
<ceph-host-name> devices="[ '<device_1>', '<device_2>' ]"

4.4. Configuring Ceph

Generally, administrators should configure the Red Hat Ceph Storage cluster before the initial deployment using the Ceph Ansible configuration files found in the /usr/share/ceph-ansible/group_vars directory.

As indicated in the Installing a Red Hat Ceph Storage Cluster section of the Red Hat Ceph Storage installation guide:

  • For Monitors, create a mons.yml file from the sample.mons.yml file.
  • For OSDs, create an osds.yml file from the sample.osds.yml file.
  • For the cluster, create an all.yml file from the sample.all.yml file.

Modify the settings as directed in the Installation Guide.

Also refer to Installing the Ceph Object Gateway, and create an rgws.yml file from sample.rgws.yml.

NOTE
The settings in the foregoing files may take precedence over settings in ceph_conf_overrides.

To configure settings with no corresponding values in the mons.yml, osds.yml, or rgws.yml file, add configuration settings to the ceph_conf_overrides section of the all.yml file. For example:

ceph_conf_overrides:
   global:
      osd_pool_default_pg_num: <number>

See the Configuration File Structure for details on configuration file sections.

Important

There are syntactic differences between specifying a Ceph configuration setting in an Ansible configuration file and how it renders in the Ceph configuration file.

In RHCS version 3.1 and earlier, the Ceph configuration file uses ini style notion. Sections like [global] in the Ceph configuration file should be specified as global:, indented on their own lines. It is also possible to specify configuration sections for specific daemon instances. For example, placing osd.1: in the ceph_conf_overrides section of the all.yml file will render as [osd.1] in the Ceph configuration file and the settings under that section will apply to osd.1 only.

Ceph configuration settings SHOULD contain dashes (-) or underscores (_) rather than spaces, and should terminate with a colon (:), not an equal sign (=).

Before deploying the Ceph cluster, consider the following configuration settings. When setting Ceph configuration settings, Red Hat recommends setting the values in the ceph-ansible configuration files, which will generate a Ceph configuration file automatically.

4.4.1. Setting the Journal Size

Set the journal size for the Ceph cluster. Configuration tools such as Ansible may have a default value. Generally, the journal size should find the product of the synchronization interval and the slower of the disk and network throughput, and multiply the product by two (2).

For details, see the Journal Settings section in the Configuration Guide for Red Hat Ceph Storage 3.

4.4.2. Adjusting Backfill & Recovery Settings

I/O is negatively impacted by both backfilling and recovery operations, leading to poor performance and unhappy end users. To help accommodate I/O demand during a cluster expansion or recovery, set the following options and values in the Ceph Configuration file:

[osd]
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1

4.4.3. Adjusting the Cluster Map Size

For Red Hat Ceph Storage version 2 and earlier, when the cluster has thousands of OSDs, download the cluster map and check its file size. By default, the ceph-osd daemon caches 500 previous osdmaps. Even with deduplication, the map may consume a lot of memory per daemon. Tuning the cache size in the Ceph configuration file may help reduce memory consumption significantly. For example:

[global]
osd_map_message_max=10

[osd]
osd_map_cache_size=20
osd_map_max_advance=10
osd_map_share_max_epochs=10
osd_pg_epoch_persisted_max_stale=10

For Red Hat Ceph Storage version 3 and later, the ceph-manager daemon handles PG queries, so the cluster map should not impact performance.

4.4.4. Adjusting Scrubbing

By default, Ceph performs light scrubbing daily and deep scrubbing weekly. Light scrubbing checks object sizes and checksums to ensure that PGs are storing the same object data. Over time, disk sectors can go bad irrespective of object sizes and checksums. Deep scrubbing checks an object’s content with that of its replicas to ensure that the actual contents are the same. In this respect, deep scrubbing ensures data integrity in the manner of fsck, but the procedure imposes an I/O penalty on the cluster. Even light scrubbing can impact I/O.

The default settings may allow Ceph OSDs to initiate scrubbing at inopportune times such as peak operating times or periods with heavy loads. End users may experience latency and poor performance when scrubbing operations conflict with end user operations.

To prevent end users from experiencing poor performance, Ceph provides a number of scrubbing settings that can limit scrubbing to periods with lower loads or during off-peak hours. For details, see the Scrubbing section in the Configuration Guide for Red Hat Ceph Storage 3.

If the cluster experiences high loads during the day and low loads late at night, consider restricting scrubbing to night time hours. For example:

[osd]
osd_scrub_begin_hour = 23   #23:01H, or 10:01PM.
osd_scrub_end_hour = 6      #06:01H or 6:01AM.

If time constraints aren’t an effective method of determining a scrubbing schedule, consider using the osd_scrub_load_threshold. The default value is 0.5, but it could be modified for low load conditions. For example:

[osd]
osd_scrub_load_threshold = 0.25

4.4.5. Increase objecter_inflight_ops

In RHCS 3.0 and earlier releases, consider increasing objecter_inflight_ops to the default size for version 3.1 and later releases to improve scalability.

objecter_inflight_ops = 24576

4.4.6. Increase rgw_thread_pool_size

In RHCS 3.0 and earlier releases, consider increasing rgw_thread_pool_size to the default size for version 3.1 and later releases to improve scalability. For example:

rgw_thread_pool_size = 512

4.4.7. Set the filestore_merge_threshold

In RHCS 3.0 and earlier releases, consider increasing filestore_merge_threshold to the default size for version 3.1 and later releases to improve scalability. For example:

filestore_merge_threshold = -10

4.4.8. Adjusting Garbage Collection Settings

The Ceph Object Gateway allocates storage for new and overwritten objects immediately. Additionally, the parts of a multi-part upload also consume storage.

The Ceph Object Gateway purges the storage space used for deleted objects in the Ceph Storage cluster some time after the gateway deletes the objects from the bucket index. Similarly, the gateway will delete data associated with a multi-part upload after the multi-part upload completes or when the upload has gone inactive or failed to complete for a configurable time frame. The process of purging the deleted object data from the Ceph Storage cluster is known as Garbage Collection or GC.

To view the queue of objects awaiting garbage collection, execute the following:

# radosgw-admin gc list

Garbage collection is a background activity that may execute continuously or during times of low loads, depending upon how the administrator configures the Ceph Object Gateway. By default, the Ceph Object Gateway conducts GC operations continuously. Since GC operations are a normal part of Ceph Object Gateway operations, especially with object delete operations, objects eligible for garbage collection exist most of the time.

Some workloads may temporarily or permanently outpace the rate of garbage collection activity. This is especially true of delete-heavy workloads, where many objects get stored for a short period of time and then deleted. For these types of workloads, administrators can increase the priority of garbage collection operations relative to other operations with the following configuration parameters.

The rgw_gc_obj_min_wait configuration setting waits a minimum length of time (in seconds) before purging a deleted object’s data. By default, it is set to two hours. Under delete heavy workloads, this setting may consume too much storage or leave a large number of deleted objects to purge. To address such a situation, consider decreasing this setting to a shorter time interval.

The rgw_gc_processor_period configuration setting is the GC cycle run time. That is, the amount of time between the start of consecutive runs of GC threads. If GC runs takes more than this period, the Ceph Object Gateway will not wait before running a GC cycle again.

The rgw_gc_max_concurrent_io configuration setting specifies the maximum number of concurrent IO operations that the gateway garbage collection thread will use when purging deleted data. Under delete heavy workloads, consider increasing this setting to a larger number of concurrent IO operations.

The rgw_gc_max_trim_chunk configuration setting specifies the maximum number of keys to remove from the garbage collector log in a single operation. Under delete heavy operations, consider increasing the maximum number of keys so that more objects are purged during each GC operation.

Note

These garbage collection configuration parameters are for Red Hat Ceph Storage 3.2 and higher.

Note

In testing, during an evenly balanced delete-write workload, such as 50% delete and 50% write operations, the cluster fills completely in eleven hours. This is because Ceph Object Gateway garbage collection fails to keep pace with the delete operations. The cluster status switches to the HEALTH_ERR state if this happens. Aggressive settings for parallel garbage collection tunables significantly delayed the onset of cluster fill in testing and can be helpful for many workloads. Typical real-world cluster workloads are not likely to cause a cluster fill primarily due to garbage collection.

Administrators may also modify Ceph configuration settings at runtime after deployment. See Setting a Specific Configuration Setting at Runtime for details.