Chapter 8. Adding and Removing OSD Nodes

One of the outstanding features of Ceph is the ability to add or remove Ceph OSD nodes at run time. This means you can resize cluster capacity or replace hardware without taking down the storage cluster. The ability to serve Ceph clients while the cluster is in a degraded state also has operational benefits; for example, you can add, remove, or replace hardware during regular business hours rather than working overtime or on weekends.

However, adding and removing Ceph OSD nodes can have a significant impact on performance, and you should consider the performance impact of adding, removing or replacing hardware on the storage cluster before you act.

From a capacity perspective, removing a node removes the OSDs contained within the node and effectively reduces the capacity of the storage cluster. Adding a node adds the OSDs contained within the node, and effectively expands the capacity of the storage cluster. Whether you are expanding or contracting cluster capacity, adding or removing Ceph OSD nodes will induce backfilling as the cluster rebalances. During that rebalancing time period, Ceph uses additional resources which can impact cluster performance.

The following diagram contains Ceph nodes where each node has four OSDs. In a storage cluster of four nodes, with 16 OSDs, removing a node removes 4 OSDs and cuts capacity by 25%. In a storage cluster of three nodes, with 12 OSDs, adding a node adds 4 OSDs and increases capacity by 33%.

[Diagram: Ceph storage clusters with four OSDs per node, showing the capacity change when a node is removed from a four-node cluster or added to a three-node cluster]

In a production Ceph storage cluster, a Ceph OSD node has a particular hardware configuration that facilitates a particular type of storage strategy. See the Red Hat Ceph Storage Strategies Guide for more details.

Since a Ceph OSD node is part of a CRUSH hierarchy, the performance impact of adding or removing a node typically affects the performance of pools that use that CRUSH hierarchy, that is, the CRUSH ruleset.
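
For example, to see which CRUSH ruleset a given pool uses and which rulesets exist in the cluster, you can query the pool and the CRUSH map. The pool name rbd below is only an example, and on newer Ceph releases the pool property is named crush_rule rather than crush_ruleset:

    # ceph osd pool get rbd crush_ruleset
    # ceph osd crush rule ls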

8.1. Performance Factors

The following factors typically have an impact on cluster performance when adding or removing Ceph OSD nodes:

  1. Current Client Load on Affected Pools: Ceph clients place load on the I/O interface to the Ceph cluster; namely, a pool. A pool maps to a CRUSH ruleset. The underlying CRUSH hierarchy allows Ceph to place data across failure domains. If the Ceph OSD node you are adding or removing serves a pool that is under high client load, that client load can significantly lengthen recovery time and reduce performance; an example of checking per-pool client load follows this list. More specifically, since write operations require data replication for durability, write-intensive client loads will increase the time for the cluster to recover.
  2. Capacity Added or Removed: Generally, the capacity you are adding or removing as a percentage of the overall cluster will have an impact on the cluster’s time to recover. Additionally, the storage density of the node you add or remove may have an impact on the time to recover; for example, a node with 36 OSDs will typically take longer to recover than a node with 12 OSDs. When removing nodes, you MUST ensure that you have sufficient spare capacity so that you will not reach the full ratio or near full ratio. If the storage cluster reaches the full ratio, Ceph will suspend write operations to prevent data loss.
  3. Pools and CRUSH Ruleset: A Ceph OSD node belongs to at least one CRUSH hierarchy, and at least one pool uses that hierarchy through its ruleset. Each pool that uses the CRUSH hierarchy (ruleset) where you add or remove a Ceph OSD node will experience a performance impact.
  4. Pool Type and Durability: Replication pools tend to use more network bandwidth to replicate deep copies of the data, whereas erasure coded pools tend to use more CPU to calculate k+m coding chunks. The more copies of the data (that is, a larger pool size) or the more k+m chunks, the longer it will take for the cluster to recover.
  5. Total Throughput Characteristics: Drives, controllers and network interface cards all have throughput characteristics that may impact the recovery time. Generally, nodes with higher throughput characteristics (for example, 10 Gbps networking and SSDs) will recover faster than nodes with lower throughput characteristics (for example, 1 Gbps networking and SATA drives).
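
To gauge the current client load on the pools that a node addition or removal will affect, as noted in the first factor above, you can watch the per-pool I/O statistics with the standard Ceph CLI, for example:

    # ceph osd pool stats
    # ceph -s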

8.2. Recommendations

The failure of a node may preclude removing its OSDs one at a time before replacing the node. When circumstances allow it, however, we recommend reducing any negative performance impact by adding or removing one OSD at a time within a node and allowing the cluster to recover before proceeding to the next OSD. For details on removing an OSD, see Section 6.4.1, “Removing an OSD with the Command Line Interface”. When adding a Ceph node, we also recommend adding one OSD at a time. See Section 6.3, “Adding an OSD” for details.

When adding or removing Ceph OSD nodes, keep in mind that other ongoing processes also have an impact on performance. To reduce the impact on client I/O, we recommend the following:

  1. Calculate capacity: Before removing a Ceph OSD node, ensure that the storage cluster can backfill the contents of all its OSDs WITHOUT reaching the full ratio. Reaching the full ratio will cause the cluster to refuse write operations.
  2. Temporarily Disable Scrubbing: Scrubbing is essential to ensuring the durability of the storage cluster’s data; however, it is resource intensive. Before adding or removing a Ceph OSD node, disable scrubbing and deep scrubbing and let the current scrubbing operations complete before proceeding, for example:

    # ceph osd set noscrub
    # ceph osd set nodeep-scrub

    Once you have added or removed a Ceph OSD node and the storage cluster has returned to an active+clean state, unset the noscrub and nodeep-scrub settings, as shown in the example after this list. See Chapter 4, Overrides for additional details.

  3. Limit Backfill and Recovery: If you have reasonable data durability, for example, osd pool default size = 3 and osd pool default min size = 2, there is nothing wrong with operating in a degraded state. You can tune the Ceph storage cluster for the fastest possible recovery time, but this will impact Ceph client I/O performance significantly. To maintain the highest Ceph client I/O performance, limit the backfill and recovery operations and allow them to take longer, for example:

    osd_max_backfills = 1
    osd_recovery_max_active = 1
    osd_recovery_op_priority = 1

    You can also set sleep and delay parameters such as osd_recovery_sleep.
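
If you prefer to apply these limits at runtime rather than in the Ceph configuration file, a minimal sketch using the Ceph CLI looks like the following; the values simply mirror the settings above and assume the OSD daemons are running and reachable:

    # ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

Once the storage cluster has returned to an active+clean state, remove the scrubbing overrides set earlier:

    # ceph osd unset noscrub
    # ceph osd unset nodeep-scrub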

Finally, if you are expanding the size of the storage cluster, you may need to increase the number of placement groups. If so, we recommend making incremental increases in the number of placement groups, because increasing it by a significant amount all at once will cause performance to degrade considerably.
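
For example, to grow a pool’s placement group count in small steps, set pg_num and pgp_num together; the pool name rbd and the target value 256 below are only illustrative:

    # ceph osd pool set rbd pg_num 256
    # ceph osd pool set rbd pgp_num 256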

8.3. Removing a Node

Before removing a Ceph OSD node, ensure that your cluster can backfill the contents of all its OSDs WITHOUT reaching the full ratio. Reaching the full ratio will cause the cluster to refuse write operations.

  1. Use the following commands to check cluster capacity:

    # ceph df
    # rados df
    # ceph osd df
  2. Temporarily disable scrubbing.

    # ceph osd set noscrub
    # ceph osd set nodeep-scrub
  3. Limit backfill and recovery.

    osd_max_backfills = 1
    osd_recovery_max_active = 1
    osd_recovery_op_priority = 1

    See Setting a Specific Configuration Setting at Runtime for details.

  4. Remove each Ceph OSD on the node from the Ceph Storage Cluster.

    When removing an OSD node from a Ceph cluster, Red Hat recommends removing one OSD at a time within the node and allowing the cluster to recover to an active+clean state before proceeding to the next OSD.

    See Removing an OSD for details on removing an OSD; a minimal sketch of the per-OSD removal sequence also follows this procedure.

    After removing an OSD, verify that the cluster is not approaching the near full ratio.

    # ceph -s
    # ceph df

    Repeat this process for each OSD on the node until you have removed all of them from the Ceph Storage cluster.

  5. Once all OSDs are removed from the OSD node, you can remove the OSD node bucket from the CRUSH map.

    # ceph osd crush rm {bucket-name}
  6. Finally, remove the node from Calamari. Navigate to the following URL:

    http://{calamari-fqdn}/api/v2/server/{problematic-host-fqdn}

    Then click the Delete button to delete the node from Calamari.
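
For reference, the per-OSD removal referenced in step 4 typically follows a sequence like the one below. This is only a sketch: the OSD ID 7 and the systemd service name are examples, and you should follow Removing an OSD for the authoritative procedure on your version of Ceph.

    # ceph osd out 7                     # stop new data from being placed on the OSD
    # ceph -w                            # watch until the cluster returns to active+clean
    # systemctl stop ceph-osd@7          # stop the OSD daemon on its node
    # ceph osd crush remove osd.7        # remove the OSD from the CRUSH map
    # ceph auth del osd.7                # delete the OSD's authentication key
    # ceph osd rm 7                      # remove the OSD from the cluster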

8.4. Adding a Node

To add an OSD node to a Ceph cluster, first provision the node. See Configuring a Host for details. Ensure that other nodes in the cluster can reach the new host by its short host name.
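
For example, from an existing cluster node you can verify that the new host resolves and responds by its short host name; the host name new-osd-node below is only a placeholder:

    # getent hosts new-osd-node
    # ping -c 3 new-osd-node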

  1. Temporarily disable scrubbing.

    # ceph osd set noscrub
    # ceph osd set nodeep-scrub
  2. Limit backfill and recovery.

    osd_max_backfills = 1
    osd_recovery_max_active = 1
    osd_recovery_op_priority = 1

    See Setting a Specific Configuration Setting at Runtime for details.

  3. Add the new node to the CRUSH Map.

    # ceph osd crush add-bucket {bucket-name} {type}

    See Add a Bucket and Move a Bucket for details on placing the node at an appropriate location in the CRUSH hierarchy; an example appears after this procedure.

  4. Add a Ceph OSD for each storage disk on the node to the Ceph Storage Cluster.

    When adding an OSD node to a Ceph cluster, Red Hat recommends adding one OSD at a time within the node and allowing the cluster to recover to an active+clean state before proceeding to the next OSD.

    See Adding an OSD for details on adding an OSD.
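
For reference, placing a new node in the CRUSH hierarchy as described in step 3 typically looks like the following. The bucket name node4 and the root name default are examples only; use names that match your cluster:

    # ceph osd crush add-bucket node4 host       # create a bucket of type host for the new node
    # ceph osd crush move node4 root=default     # move the host bucket under the default root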