Chapter 14. Handling a node failure

As a storage administrator, you can experience a whole node failing within the storage cluster, and handling a node failure is similar to handling a disk failure. With a node failure, instead of Ceph recovering placement groups (PGs) for only one disk, all PGs on the disks within that node must be recovered. Ceph will detect that the OSDs are all down and automatically start the recovery process, known as self-healing.

There are three node failure scenarios. Here is the high-level workflow for each scenario when replacing a node:

  • Replacing the node, but using the root and Ceph OSD disks from the failed node.

    1. Disable backfilling.
    2. Replace the node, taking the disks from old node, and adding them to the new node.
    3. Enable backfilling.
  • Replacing the node, reinstalling the operating system, and using the Ceph OSD disks from the failed node.

    1. Disable backfilling.
    2. Create a backup of the Ceph configuration.
    3. Replace the node and add the Ceph OSD disks from failed node.
    4. Configuring disks as JBOD.
    5. Install the operating system.
    6. Restore the Ceph configuration.
    7. Add the new node to the storage cluster using the Ceph Orchestrator commands and Ceph daemons are placed automatically on the respective node.
    8. Enable backfilling.
  • Replacing the node, reinstalling the operating system, and using all new Ceph OSDs disks.

    1. Disable backfilling.
    2. Remove all OSDs on the failed node from the storage cluster.
    3. Create a backup of the Ceph configuration.
    4. Replace the node and add the Ceph OSD disks from failed node.

      1. Configuring disks as JBOD.
    5. Install the operating system.
    6. Add the new node to the storage cluster using the Ceph Orchestrator commands and Ceph daemons are placed automatically on the respective node.
    7. Enable backfilling.

14.1. Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • A failed node.

14.2. Considerations before adding or removing a node

One of the outstanding features of Ceph is the ability to add or remove Ceph OSD nodes at run time. This means that you can resize the storage cluster capacity or replace hardware without taking down the storage cluster.

The ability to serve Ceph clients while the storage cluster is in a degraded state also has operational benefits. For example, you can add or remove or replace hardware during regular business hours, rather than working overtime or on weekends. However, adding and removing Ceph OSD nodes can have a significant impact on performance.

Before you add or remove Ceph OSD nodes, consider the effects on storage cluster performance:

  • Whether you are expanding or reducing the storage cluster capacity, adding or removing Ceph OSD nodes induces backfilling as the storage cluster rebalances. During that rebalancing time period, Ceph uses additional resources, which can impact storage cluster performance.
  • In a production Ceph storage cluster, a Ceph OSD node has a particular hardware configuration that facilitates a particular type of storage strategy.
  • Since a Ceph OSD node is part of a CRUSH hierarchy, the performance impact of adding or removing a node typically affects the performance of pools that use the CRUSH ruleset.

Additional Resources

14.3. Performance considerations

The following factors typically affect a storage cluster’s performance when adding or removing Ceph OSD nodes:

  • Ceph clients place load on the I/O interface to Ceph; that is, the clients place load on a pool. A pool maps to a CRUSH ruleset. The underlying CRUSH hierarchy allows Ceph to place data across failure domains. If the underlying Ceph OSD node involves a pool that is experiencing high client load, the client load could significantly affect recovery time and reduce performance. Because write operations require data replication for durability, write-intensive client loads in particular can increase the time for the storage cluster to recover.
  • Generally, the capacity you are adding or removing affects the storage cluster’s time to recover. In addition, the storage density of the node you add or remove might also affect recovery times. For example, a node with 36 OSDs typically takes longer to recover than a node with 12 OSDs.
  • When removing nodes, you MUST ensure that you have sufficient spare capacity so that you will not reach full ratio or near full ratio. If the storage cluster reaches full ratio, Ceph will suspend write operations to prevent data loss.
  • A Ceph OSD node maps to at least one Ceph CRUSH hierarchy, and the hierarchy maps to at least one pool. Each pool that uses a CRUSH ruleset experiences a performance impact when Ceph OSD nodes are added or removed.
  • Replication pools tend to use more network bandwidth to replicate deep copies of the data, whereas erasure coded pools tend to use more CPU to calculate k+m coding chunks. The more copies that exist of the data, the longer it takes for the storage cluster to recover. For example, a larger pool or one that has a greater number of k+m chunks will take longer to recover than a replication pool with fewer copies of the same data.
  • Drives, controllers and network interface cards all have throughput characteristics that might impact the recovery time. Generally, nodes with higher throughput characteristics, such as 10 Gbps and SSDs, recover more quickly than nodes with lower throughput characteristics, such as 1 Gbps and SATA drives.

14.4. Recommendations for adding or removing nodes

Red Hat recommends adding or removing one OSD at a time within a node and allowing the storage cluster to recover before proceeding to the next OSD. This helps to minimize the impact on storage cluster performance. Note that if a node fails, you might need to change the entire node at once, rather than one OSD at a time.

To remove an OSD:

To add an OSD:

When adding or removing Ceph OSD nodes, consider that other ongoing processes also affect storage cluster performance. To reduce the impact on client I/O, Red Hat recommends the following:

Calculate capacity

Before removing a Ceph OSD node, ensure that the storage cluster can backfill the contents of all its OSDs without reaching the full ratio. Reaching the full ratio will cause the storage cluster to refuse write operations.

Temporarily disable scrubbing

Scrubbing is essential to ensuring the durability of the storage cluster’s data; however, it is resource intensive. Before adding or removing a Ceph OSD node, disable scrubbing and deep-scrubbing and let the current scrubbing operations complete before proceeding.

ceph osd set noscrub
ceph osd set nodeep-scrub

Once you have added or removed a Ceph OSD node and the storage cluster has returned to an active+clean state, unset the noscrub and nodeep-scrub settings.

ceph osd unset noscrub
ceph osd unset nodeep-scrub

Limit backfill and recovery

If you have reasonable data durability, there is nothing wrong with operating in a degraded state. For example, you can operate the storage cluster with osd_pool_default_size = 3 and osd_pool_default_min_size = 2. You can tune the storage cluster for the fastest possible recovery time, but doing so significantly affects Ceph client I/O performance. To maintain the highest Ceph client I/O performance, limit the backfill and recovery operations and allow them to take longer.

osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1

You can also consider setting the sleep and delay parameters such as, osd_recovery_sleep.

Increase the number of placement groups

Finally, if you are expanding the size of the storage cluster, you may need to increase the number of placement groups. If you determine that you need to expand the number of placement groups, Red Hat recommends making incremental increases in the number of placement groups. Increasing the number of placement groups by a significant amount will cause a considerable degradation in performance.

14.5. Adding a Ceph OSD node

To expand the capacity of the Red Hat Ceph Storage cluster, you can add an OSD node.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • A provisioned node with a network connection.

Procedure

  1. Verify that other nodes in the storage cluster can reach the new node by its short host name.
  2. Temporarily disable scrubbing:

    Example

    [ceph: root@host01 /]# ceph osd set noscrub
    [ceph: root@host01 /]# ceph osd set nodeep-scrub

  3. Limit the backfill and recovery features:

    Syntax

    ceph tell DAEMON_TYPE.* injectargs --OPTION_NAME VALUE [--OPTION_NAME VALUE]

    Example

    [ceph: root@host01 /]# ceph tell osd.* injectargs --osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1

  4. Extract the cluster’s public SSH keys to a folder:

    Syntax

    ceph cephadm get-pub-key > ~/PATH

    Example

    [ceph: root@host01 /]# ceph cephadm get-pub-key > ~/ceph.pub

  5. Copy ceph cluster’s public SSH keys to the root user’s authorized_keys file on the new host:

    Syntax

    ssh-copy-id -f -i ~/PATH root@HOST_NAME_2

    Example

    [ceph: root@host01 /]# ssh-copy-id -f -i ~/ceph.pub root@host02

  6. Add the new node to the CRUSH map:

    Syntax

    ceph orch host add NODE_NAME IP_ADDRESS

    Example

    [ceph: root@host01 /]# ceph orch host add host02 10.10.128.70

  7. Add an OSD for each disk on the node to the storage cluster.

Important

When adding an OSD node to a Red Hat Ceph Storage cluster, Red Hat recommends adding one OSD daemon at a time and allowing the cluster to recover to an active+clean state before proceeding to the next OSD.

Additional Resources

14.6. Removing a Ceph OSD node

To reduce the capacity of a storage cluster, remove an OSD node.

Warning

Before removing a Ceph OSD node, ensure that the storage cluster can backfill the contents of all OSDs without reaching the full ratio. Reaching the full ratio will cause the storage cluster to refuse write operations.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to all nodes in the storage cluster.

Procedure

  1. Check the storage cluster’s capacity:

    Syntax

    ceph df
    rados df
    ceph osd df

  2. Temporarily disable scrubbing:

    Syntax

    ceph osd set noscrub
    ceph osd set nodeep-scrub

  3. Limit the backfill and recovery features:

    Syntax

    ceph tell DAEMON_TYPE.* injectargs --OPTION_NAME VALUE [--OPTION_NAME VALUE]

    Example

    [ceph: root@host01 /]# ceph tell osd.* injectargs --osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1

  4. Remove each OSD on the node from the storage cluster:

    • Using Removing the OSD daemons using the Ceph Orchestrator.

      Important

      When removing an OSD node from the storage cluster, Red Hat recommends removing one OSD at a time within the node and allowing the cluster to recover to an active+clean state before proceeding to remove the next OSD.

      1. After you remove an OSD, check to verify that the storage cluster is not getting to the near-full ratio:

        Syntax

        ceph -s
        ceph df

      2. Repeat this step until all OSDs on the node are removed from the storage cluster.
  5. Once all OSDs are removed, remove the host:

Additional Resources

14.7. Simulating a node failure

To simulate a hard node failure, power off the node and reinstall the operating system.

Prerequisites

  • A healthy running Red Hat Ceph Storage cluster.
  • Root-level access to all nodes on the storage cluster.

Procedure

  1. Check the storage cluster’s capacity to understand the impact of removing the node:

    Example

    [ceph: root@host01 /]# ceph df
    [ceph: root@host01 /]# rados df
    [ceph: root@host01 /]# ceph osd df

  2. Optionally, disable recovery and backfilling:

    Example

    [ceph: root@host01 /]# ceph osd set noout
    [ceph: root@host01 /]# ceph osd set noscrub
    [ceph: root@host01 /]# ceph osd set nodeep-scrub

  3. Shut down the node.
  4. If you are changing the host name, remove the node from CRUSH map:

    Example

    [ceph: root@host01 /]# ceph osd crush rm host03

  5. Check the status of the storage cluster:

    Example

    [ceph: root@host01 /]# ceph -s

  6. Reinstall the operating system on the node.
  7. Add the new node:

  8. Optionally, enable recovery and backfilling:

    Example

    [ceph: root@host01 /]# ceph osd unset noout
    [ceph: root@host01 /]# ceph osd unset noscrub
    [ceph: root@host01 /]# ceph osd unset nodeep-scrub

  9. Check Ceph’s health:

    Example

    [ceph: root@host01 /]# ceph -s

Additional Resources