Chapter 3. Handling a node failure

As a storage administrator, you might encounter the failure of an entire node within the storage cluster. Handling a node failure is similar to handling a disk failure, except that instead of recovering placement groups (PGs) for only one disk, Ceph must recover all PGs on the disks within that node. Ceph detects that all of the node's OSDs are down and automatically starts the recovery process, known as self-healing.

There are three node failure scenarios. Here is the high-level workflow for each scenario when replacing a node:

  • Replacing the node, but using the root and Ceph OSD disks from the failed node.

    1. Disable backfilling.
    2. Replace the node, taking the disks from the old node and adding them to the new node.
    3. Enable backfilling.
  • Replacing the node, reinstalling the operating system, and using the Ceph OSD disks from the failed node.

    1. Disable backfilling.
    2. Create a backup of the Ceph configuration.
    3. Replace the node and add the Ceph OSD disks from the failed node.

      1. Configure the disks as JBOD.
    4. Install the operating system.
    5. Restore the Ceph configuration.
    6. Run ceph-ansible.
    7. Enable backfilling.
  • Replacing the node, reinstalling the operating system, and using all new Ceph OSD disks.

    1. Disable backfilling.
    2. Remove all OSDs on the failed node from the storage cluster.
    3. Create a backup of the Ceph configuration.
    4. Replace the node and add the new Ceph OSD disks.

      1. Configure the disks as JBOD.
    5. Install the operating system.
    6. Run ceph-ansible.
    7. Enable backfilling.
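
The three workflows above differ only in which disks are reused. The following is a minimal sketch of that decision, written as a lookup helper. The function name and its two parameters are hypothetical, and the step text is paraphrased from the lists above; it is an aid for reading the workflows, not a tool that performs them.

```python
def replacement_workflow(reuse_root_disk, reuse_osd_disks):
    """Return the high-level steps for one of the three scenarios listed above."""
    if reuse_root_disk and reuse_osd_disks:
        middle = ["Replace the node, moving the disks from the old node to the new node"]
    elif reuse_osd_disks:
        middle = [
            "Create a backup of the Ceph configuration",
            "Replace the node and add the Ceph OSD disks (configured as JBOD)",
            "Install the operating system",
            "Restore the Ceph configuration",
            "Run ceph-ansible",
        ]
    elif not reuse_root_disk:
        middle = [
            "Remove all OSDs on the failed node from the storage cluster",
            "Create a backup of the Ceph configuration",
            "Replace the node and add the Ceph OSD disks (configured as JBOD)",
            "Install the operating system",
            "Run ceph-ansible",
        ]
    else:
        # Reusing the root disk with new OSD disks is not one of the three scenarios.
        raise ValueError("combination not covered by the documented scenarios")
    # Every scenario is bracketed by disabling and re-enabling backfilling.
    return ["Disable backfilling", *middle, "Enable backfilling"]


for step in replacement_workflow(reuse_root_disk=False, reuse_osd_disks=True):
    print(step)
```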

3.1. Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • A failed node.

3.2. Considerations before adding or removing a node

One of the outstanding features of Ceph is the ability to add or remove Ceph OSD nodes at run time. This means that you can resize the storage cluster capacity or replace hardware without taking down the storage cluster.

The ability to serve Ceph clients while the storage cluster is in a degraded state also has operational benefits. For example, you can add, remove, or replace hardware during regular business hours, rather than working overtime or on weekends. However, adding and removing Ceph OSD nodes can have a significant impact on performance.

Before you add or remove Ceph OSD nodes, consider the effects on storage cluster performance:

  • Whether you are expanding or reducing the storage cluster capacity, adding or removing Ceph OSD nodes induces backfilling as the storage cluster rebalances. During that rebalancing time period, Ceph uses additional resources, which can impact storage cluster performance.
  • In a production Ceph storage cluster, a Ceph OSD node has a particular hardware configuration that facilitates a particular type of storage strategy.
  • Because a Ceph OSD node is part of a CRUSH hierarchy, adding or removing a node typically affects the performance of every pool whose CRUSH ruleset uses that hierarchy.

3.3. Performance considerations

The following factors typically affect a storage cluster’s performance when adding or removing Ceph OSD nodes:

  • Ceph clients place load on the I/O interface to Ceph; that is, the clients place load on a pool. A pool maps to a CRUSH ruleset. The underlying CRUSH hierarchy allows Ceph to place data across failure domains. If the underlying Ceph OSD node involves a pool that is experiencing high client load, the client load could significantly affect recovery time and reduce performance. Because write operations require data replication for durability, write-intensive client loads in particular can increase the time for the storage cluster to recover.
  • Generally, the capacity you are adding or removing affects the storage cluster’s time to recover. In addition, the storage density of the node you add or remove might also affect recovery times. For example, a node with 36 OSDs typically takes longer to recover than a node with 12 OSDs.
  • When removing nodes, you MUST ensure that you have sufficient spare capacity so that you will not reach full ratio or near full ratio. If the storage cluster reaches full ratio, Ceph will suspend write operations to prevent data loss.
  • A Ceph OSD node maps to at least one Ceph CRUSH hierarchy, and the hierarchy maps to at least one pool. Each pool that uses a CRUSH ruleset experiences a performance impact when Ceph OSD nodes are added or removed.
  • Replication pools tend to use more network bandwidth to replicate deep copies of the data, whereas erasure coded pools tend to use more CPU to calculate k+m coding chunks. The more copies that exist of the data, the longer it takes for the storage cluster to recover. For example, a larger pool or one that has a greater number of k+m chunks will take longer to recover than a replication pool with fewer copies of the same data.
  • Drives, controllers and network interface cards all have throughput characteristics that might impact the recovery time. Generally, nodes with higher throughput characteristics, such as 10 Gbps and SSDs, recover more quickly than nodes with lower throughput characteristics, such as 1 Gbps and SATA drives.
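
To make the capacity and throughput factors above concrete, the following back-of-the-envelope sketch divides the amount of data to be recovered by a sustained throughput figure. The function name, the 2 TiB of data per OSD, and the 500 MiB/s throughput are all hypothetical values chosen for illustration. Real recovery is also shaped by client load, pool type, and the backfill limits in effect, so treat the result as a rough lower bound rather than a prediction.

```python
def estimate_recovery_hours(data_tib, throughput_mib_s):
    """Rough lower bound: data to recover divided by sustained throughput."""
    if throughput_mib_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = data_tib * 1024 * 1024 / throughput_mib_s  # TiB -> MiB, then MiB / (MiB/s)
    return seconds / 3600


# Hypothetical nodes holding 2 TiB per OSD, recovering at 500 MiB/s.
small_node = estimate_recovery_hours(12 * 2, 500)
dense_node = estimate_recovery_hours(36 * 2, 500)
print(f"12 OSDs: {small_node:.1f} h, 36 OSDs: {dense_node:.1f} h")
```

With the same throughput, the estimate scales linearly with the amount of data on the node, which is why a denser node takes longer to recover.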

3.4. Recommendations for adding or removing nodes

Red Hat recommends adding or removing one OSD at a time within a node and allowing the storage cluster to recover before proceeding to the next OSD. This helps to minimize the impact on storage cluster performance. Note that if a node fails, you might need to change the entire node at once, rather than one OSD at a time.
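
The one-OSD-at-a-time recommendation can be sketched as a simple loop that applies a change and then waits for the storage cluster to recover before moving on. In this sketch the function name and the `apply_change` and `is_recovered` callables are hypothetical placeholders that you would supply; nothing here calls Ceph directly.

```python
import time


def process_one_at_a_time(osd_ids, apply_change, is_recovered,
                          poll_seconds=30.0, sleep=time.sleep):
    """Apply a change to each OSD in turn, waiting for recovery in between."""
    for osd_id in osd_ids:
        apply_change(osd_id)          # for example, add or remove this OSD
        while not is_recovered():     # for example, all PGs are active+clean
            sleep(poll_seconds)


# Dry run with stand-in callables that only record what would happen.
log = []
recovery_checks = iter([False, True, True])
process_one_at_a_time([1, 2], log.append, lambda: next(recovery_checks),
                      poll_seconds=0, sleep=lambda _: log.append("wait"))
print(log)
```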

When adding or removing Ceph OSD nodes, consider that other ongoing processes also affect storage cluster performance. To reduce the impact on client I/O, Red Hat recommends the following:

Calculate capacity

Before removing a Ceph OSD node, ensure that the storage cluster can backfill the contents of all its OSDs without reaching the full ratio. Reaching the full ratio will cause the storage cluster to refuse write operations.
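
As an illustration of this check, the sketch below projects the storage cluster's utilization after a node's raw capacity is taken away, assuming the data that node held is backfilled onto the remaining OSDs. The function names are hypothetical, and the 0.85 threshold is an assumed near-full ratio; use the ratios configured on your storage cluster. The projection also assumes data is spread evenly, so an individual OSD can still fill up earlier than the cluster-wide figure suggests.

```python
def utilization_after_removal(used_bytes, total_bytes, node_capacity_bytes):
    """Projected raw utilization once the node's capacity is gone."""
    remaining = total_bytes - node_capacity_bytes
    if remaining <= 0:
        raise ValueError("removing the node leaves no capacity")
    return used_bytes / remaining


def safe_to_remove(used_bytes, total_bytes, node_capacity_bytes, nearfull_ratio=0.85):
    """True if the projected utilization stays below the assumed near-full ratio."""
    return utilization_after_removal(used_bytes, total_bytes, node_capacity_bytes) < nearfull_ratio


TB = 10**12
# Hypothetical cluster: 100 TB raw, 60 TB used, removing a 20 TB node.
print(utilization_after_removal(60 * TB, 100 * TB, 20 * TB))
print(safe_to_remove(60 * TB, 100 * TB, 20 * TB))
```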

Temporarily disable scrubbing

Scrubbing is essential to ensuring the durability of the storage cluster’s data; however, it is resource intensive. Before adding or removing a Ceph OSD node, disable scrubbing and deep scrubbing and let the current scrubbing operations complete before proceeding.

ceph osd set noscrub
ceph osd set nodeep-scrub

Once you have added or removed a Ceph OSD node and the storage cluster has returned to an active+clean state, unset the noscrub and nodeep-scrub settings.
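
One way to avoid leaving the scrub flags set by accident is to pair the set and unset commands in a context manager. The following is a sketch under assumptions: the `scrubbing_disabled` name and the injectable `run` parameter are hypothetical, and the commands are the `ceph osd set` and `ceph osd unset` forms used in the procedures in this chapter. It does not wait for in-progress scrubs to finish, and the caller is responsible for waiting until the storage cluster is active+clean before leaving the block.

```python
import subprocess
from contextlib import contextmanager

SCRUB_FLAGS = ("noscrub", "nodeep-scrub")


def _run(cmd):
    # Execute one command and raise if it exits with a non-zero status.
    subprocess.run(cmd, check=True)


@contextmanager
def scrubbing_disabled(run=_run):
    """Set the scrub flags on entry and unset them on exit, even on error."""
    for flag in SCRUB_FLAGS:
        run(["ceph", "osd", "set", flag])
    try:
        yield
    finally:
        for flag in SCRUB_FLAGS:
            run(["ceph", "osd", "unset", flag])


# Dry run: print the commands instead of executing them.
with scrubbing_disabled(run=lambda cmd: print(" ".join(cmd))):
    print("... add or remove the node, then wait for active+clean ...")
```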

Limit backfill and recovery

If you have reasonable data durability, there is nothing wrong with operating in a degraded state. For example, you can operate the storage cluster with osd_pool_default_size = 3 and osd_pool_default_min_size = 2. You can tune the storage cluster for the fastest possible recovery time, but doing so significantly affects Ceph client I/O performance. To maintain the highest Ceph client I/O performance, limit the backfill and recovery operations and allow them to take longer.

osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1

You can also consider setting the sleep and delay parameters, such as osd_recovery_sleep.
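
The settings above can be applied at run time with the `ceph tell ... injectargs` command shown in the procedures later in this chapter. The sketch below only builds that command line from a dictionary of option names; the `injectargs_command` helper is hypothetical and does not execute anything.

```python
def injectargs_command(options, daemon_type="osd"):
    """Build a 'ceph tell DAEMON_TYPE.* injectargs' argument list."""
    cmd = ["ceph", "tell", f"{daemon_type}.*", "injectargs"]
    for name, value in options.items():
        # Option names use underscores in configuration and dashes on the command line.
        cmd += [f"--{name.replace('_', '-')}", str(value)]
    return cmd


throttled = {
    "osd_max_backfills": 1,
    "osd_recovery_max_active": 1,
    "osd_recovery_op_priority": 1,
}
print(" ".join(injectargs_command(throttled)))
```

Passing the result as an argument list to a process launcher, rather than as a single shell string, also keeps the shell from trying to expand the `osd.*` pattern.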

Increase the number of placement groups

Finally, if you are expanding the size of the storage cluster, you may need to increase the number of placement groups. If you determine that you need to expand the number of placement groups, Red Hat recommends making incremental increases in the number of placement groups. Increasing the number of placement groups by a significant amount will cause a considerable degradation in performance.

Note

See the KnowledgeBase article "How do I increase placement group (PG) count in a Ceph Cluster" for additional details.
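
The idea of incremental increases can be sketched as a schedule of intermediate placement group counts. The helper name and the step size of 32 in the example are hypothetical; this chapter does not prescribe a specific increment, so choose one that suits your storage cluster and let it recover between steps.

```python
def pg_increase_steps(current, target, max_step):
    """Return the intermediate PG counts from current up to target."""
    if target < current:
        raise ValueError("target must not be smaller than the current PG count")
    if max_step <= 0:
        raise ValueError("max_step must be positive")
    steps = []
    value = current
    while value < target:
        value = min(value + max_step, target)  # never overshoot the target
        steps.append(value)
    return steps


print(pg_increase_steps(128, 256, 32))
```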

3.5. Adding a Ceph OSD node

To expand the capacity of the Red Hat Ceph Storage cluster, add an OSD node.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • A provisioned node with a network connection.
  • Installation of Red Hat Enterprise Linux 8.
  • Review the Requirements for Installing Red Hat Ceph Storage chapter in the Red Hat Ceph Storage Installation Guide.

Procedure

  1. Verify that other nodes in the storage cluster can reach the new node by its short host name.
  2. Temporarily disable scrubbing:

    Example

    [root@mon ~]# ceph osd set noscrub
    [root@mon ~]# ceph osd set nodeep-scrub

  3. Limit the backfill and recovery features:

    Syntax

    ceph tell DAEMON_TYPE.* injectargs --OPTION_NAME VALUE [--OPTION_NAME VALUE]

    Example

    [root@mon ~]# ceph tell osd.* injectargs --osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1

  4. Add the new node to the CRUSH map:

    Syntax

    ceph osd crush add-bucket BUCKET_NAME BUCKET_TYPE

    Example

    [root@mon ~]# ceph osd crush add-bucket node2 host

  5. Add an OSD for each disk on the node to the storage cluster.

    • Using Ansible.
    • Using the command-line interface.

      Important

      When adding an OSD node to a Red Hat Ceph Storage cluster, Red Hat recommends adding one OSD at a time within the node and allowing the cluster to recover to an active+clean state before proceeding to the next OSD.

  6. Enable scrubbing:

    Syntax

    ceph osd unset noscrub
    ceph osd unset nodeep-scrub

  7. Set the backfill and recovery features to default:

    Syntax

    ceph tell DAEMON_TYPE.* injectargs --OPTION_NAME VALUE [--OPTION_NAME VALUE]

    Example

    [root@mon ~]# ceph tell osd.* injectargs --osd-max-backfills 1 --osd-recovery-max-active 3 --osd-recovery-op-priority 3
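
Step 5 asks you to wait for an active+clean state between OSDs. The following sketch shows one way to express that test, assuming you have already collected the number of placement groups in each state into a dictionary. The function name and the input shape are hypothetical; how you gather the counts is left open.

```python
def all_active_clean(pg_state_counts):
    """True only if there is at least one PG and every PG is exactly active+clean."""
    total = sum(pg_state_counts.values())
    # Strict match: a state such as active+clean+scrubbing does not count,
    # which is acceptable here because scrubbing is disabled during the procedure.
    return total > 0 and pg_state_counts.get("active+clean", 0) == total


print(all_active_clean({"active+clean": 152}))
print(all_active_clean({"active+clean": 150, "active+undersized+degraded": 2}))
```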

3.6. Removing a Ceph OSD node

To reduce the capacity of a storage cluster, remove an OSD node.

Warning

Before removing a Ceph OSD node, ensure that the storage cluster can backfill the contents of all OSDs without reaching the full ratio. Reaching the full ratio will cause the storage cluster to refuse write operations.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to all nodes in the storage cluster.

Procedure

  1. Check the storage cluster’s capacity:

    Syntax

    ceph df
    rados df
    ceph osd df

  2. Temporarily disable scrubbing:

    Syntax

    ceph osd set noscrub
    ceph osd set nodeep-scrub

  3. Limit the backfill and recovery features:

    Syntax

    ceph tell DAEMON_TYPE.* injectargs --OPTION_NAME VALUE [--OPTION_NAME VALUE]

    Example

    [root@mon ~]# ceph tell osd.* injectargs --osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1

  4. Remove each OSD on the node from the storage cluster:

    • Using Ansible.
    • Using the command-line interface.

      Important

      When removing an OSD node from the storage cluster, Red Hat recommends removing one OSD at a time within the node and allowing the cluster to recover to an active+clean state before proceeding to remove the next OSD.

      1. After you remove an OSD, verify that the storage cluster is not approaching the near-full ratio:

        Syntax

        ceph -s
        ceph df

      2. Repeat this step until all OSDs on the node are removed from the storage cluster.
  5. Once all OSDs are removed, remove the host bucket from the CRUSH map:

    Syntax

    ceph osd crush rm BUCKET_NAME

    Example

    [root@mon ~]# ceph osd crush rm node2

  6. Enable scrubbing:

    Syntax

    ceph osd unset noscrub
    ceph osd unset nodeep-scrub

  7. Set the backfill and recovery features to default:

    Syntax

    ceph tell DAEMON_TYPE.* injectargs --OPTION_NAME VALUE [--OPTION_NAME VALUE]

    Example

    [root@mon ~]# ceph tell osd.* injectargs --osd-max-backfills 1 --osd-recovery-max-active 3 --osd-recovery-op-priority 3
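
Step 5 removes the host bucket only after all of its OSDs are gone. The sketch below checks that precondition against a CRUSH tree held as a dictionary. It assumes a structure with a `nodes` list whose entries carry `name`, `type`, and `children` keys, which is how the JSON form of the tree (for example from `ceph osd tree --format json`) is expected to look; treat that shape as an assumption and verify it against your release. The function name is hypothetical.

```python
def host_bucket_is_empty(tree, host_name):
    """Return True if the named host bucket exists and has no children."""
    for node in tree.get("nodes", []):
        if node.get("type") == "host" and node.get("name") == host_name:
            return not node.get("children")
    raise KeyError(f"host bucket {host_name!r} not found in the CRUSH tree")


# Hypothetical tree: node1 still holds osd.0, node2 has been emptied.
example_tree = {
    "nodes": [
        {"id": -1, "name": "default", "type": "root", "children": [-2, -3]},
        {"id": -2, "name": "node1", "type": "host", "children": [0]},
        {"id": -3, "name": "node2", "type": "host", "children": []},
        {"id": 0, "name": "osd.0", "type": "osd"},
    ]
}
print(host_bucket_is_empty(example_tree, "node2"))
```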

3.7. Simulating a node failure

To simulate a hard node failure, power off the node and reinstall the operating system.

Prerequisites

  • A healthy running Red Hat Ceph Storage cluster.
  • Root-level access to all nodes on the storage cluster.

Procedure

  1. Check the storage cluster’s capacity to understand the impact of removing the node:

    Example

    [root@ceph1 ~]# ceph df
    [root@ceph1 ~]# rados df
    [root@ceph1 ~]# ceph osd df

  2. Optionally, set the noout flag so that Ceph does not mark the node's OSDs out and start rebalancing, and disable scrubbing:

    Example

    [root@ceph1 ~]# ceph osd set noout
    [root@ceph1 ~]# ceph osd set noscrub
    [root@ceph1 ~]# ceph osd set nodeep-scrub

  3. Shut down the node.
  4. If you are changing the host name, remove the node from the CRUSH map:

    Example

    [root@ceph1 ~]# ceph osd crush rm ceph3

  5. Check the status of the storage cluster:

    Example

    [root@ceph1 ~]# ceph -s

  6. Reinstall the operating system on the node.
  7. Add an Ansible user and generate the SSH keys:

    Example

    [root@ceph3 ~]# useradd ansible
    [root@ceph3 ~]# passwd ansible
    [root@ceph3 ~]# cat << EOF > /etc/sudoers.d/ansible
    ansible ALL = (root) NOPASSWD:ALL
    Defaults:ansible !requiretty
    EOF
    [root@ceph3 ~]# su - ansible
    [ansible@ceph3 ~]$ ssh-keygen

  8. From the Ansible administration node, copy the SSH keys for the ansible user to the reinstalled node:

    Example

    [ansible@admin ~]$ ssh-copy-id ceph3

  9. From the Ansible administration node, run the Ansible playbook again:

    Example

    [ansible@admin ~]$ cd /usr/share/ceph-ansible
    [ansible@admin ~]$ ansible-playbook site.yml -i hosts
    
    PLAY RECAP ********************************************************************
    ceph1                      : ok=368  changed=2    unreachable=0    failed=0
    ceph2                      : ok=284  changed=0    unreachable=0    failed=0
    ceph3                      : ok=284  changed=15   unreachable=0    failed=0

  10. Optionally, unset the noout flag and re-enable scrubbing:

    Example

    [root@ceph3 ~]# ceph osd unset noout
    [root@ceph3 ~]# ceph osd unset noscrub
    [root@ceph3 ~]# ceph osd unset nodeep-scrub

  11. Check Ceph’s health:

    Example

    [root@ceph3 ~]# ceph -s
        cluster 1e0c9c34-901d-4b46-8001-0d1f93ca5f4d
         health HEALTH_OK
         monmap e1: 3 mons at {ceph1=192.168.122.81:6789/0,ceph2=192.168.122.82:6789/0,ceph3=192.168.122.83:6789/0}
                election epoch 36, quorum 0,1,2 ceph1,ceph2,ceph3
         osdmap e95: 3 osds: 3 up, 3 in
                flags sortbitwise
          pgmap v1190: 152 pgs, 12 pools, 1024 MB data, 441 objects
                3197 MB used, 293 GB / 296 GB avail
                     152 active+clean
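
The health check in step 11 can be sketched as a small text test that looks for `HEALTH_OK` and confirms that every OSD is both up and in. This sketch assumes plain-text status output laid out like the example above; the layout differs between releases, so for scripting the JSON output (`ceph -s --format json`) is likely a more stable input. The function name is hypothetical.

```python
import re


def cluster_is_healthy(status_text):
    """True if the status text reports HEALTH_OK and all OSDs are up and in."""
    health = re.search(r"health:?\s+(HEALTH_[A-Z_]+)", status_text)
    osds = re.search(r"(\d+) osds: (\d+) up[^,]*, (\d+) in", status_text)
    if health is None or osds is None:
        return False  # unrecognized layout: do not report success
    total, up, in_cluster = (int(n) for n in osds.groups())
    return health.group(1) == "HEALTH_OK" and total == up == in_cluster


sample = """
     health HEALTH_OK
     osdmap e95: 3 osds: 3 up, 3 in
"""
print(cluster_is_healthy(sample))
```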

Additional Resources