Chapter 3. Handling a node failure

As a storage administrator, you might experience a whole node failing within the storage cluster. Handling a node failure is similar to handling a disk failure, except that instead of Ceph recovering placement groups (PGs) for only one disk, all PGs on the disks within that node must be recovered. Ceph detects that the OSDs are all down and automatically starts the recovery process, known as self-healing.

There are three node failure scenarios. Here is the high-level workflow for each scenario when replacing a node:

  • Replacing the node, but using the root and Ceph OSD disks from the failed node.

    1. Disable backfilling.
    2. Replace the node, taking the disks from the old node and adding them to the new node.
    3. Enable backfilling.
  • Replacing the node, reinstalling the operating system, and using the Ceph OSD disks from the failed node.

    1. Disable backfilling.
    2. Create a backup of the Ceph configuration.
    3. Replace the node and add the Ceph OSD disks from the failed node.

      1. Configure the disks as JBOD.
    4. Install the operating system.
    5. Restore the Ceph configuration.
    6. Run ceph-ansible.
    7. Enable backfilling.
  • Replacing the node, reinstalling the operating system, and using all new Ceph OSD disks.

    1. Disable backfilling.
    2. Remove all OSDs on the failed node from the storage cluster.
    3. Create a backup of the Ceph configuration.
    4. Replace the node and add the new Ceph OSD disks.

      1. Configure the disks as JBOD.
    5. Install the operating system.
    6. Run ceph-ansible.
    7. Enable backfilling.

3.1. Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • A failed node.

3.2. Considerations before adding or removing a node

One of the outstanding features of Ceph is the ability to add or remove Ceph OSD nodes at run time. This means you can resize the storage cluster capacity or replace hardware without taking down the storage cluster. The ability to serve Ceph clients while the cluster is in a degraded state also has operational benefits. For example, you can add, remove, or replace hardware during regular business hours rather than working overtime or weekends. However, adding and removing Ceph OSD nodes can have a significant impact on performance, and you must consider that impact on the storage cluster before you act.

From a capacity perspective, removing a node removes the OSDs contained within the node and effectively reduces the capacity of the storage cluster. Adding a node adds the OSDs contained within the node, and effectively expands the capacity of the storage cluster. Whether you are expanding or reducing the storage cluster capacity, adding or removing Ceph OSD nodes will induce backfilling as the cluster rebalances. During that rebalancing time period, Ceph uses additional resources which can impact storage cluster performance.

Imagine a storage cluster that contains Ceph nodes where each node has four OSDs. In a storage cluster of four nodes, with 16 OSDs, removing a node removes 4 OSDs and cuts capacity by 25%. In a storage cluster of three nodes, with 12 OSDs, adding a node adds 4 OSDs and increases capacity by 33%.
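The percentages above can be reproduced with a small shell helper. This is only a sketch: the function name capacity_change_pct is hypothetical, and it assumes every OSD has the same size, so that OSD counts stand in for capacity.

```shell
# capacity_change_pct OSDS_BEFORE OSDS_CHANGED
# Prints the capacity change as a whole-number percentage (rounded down)
# of the capacity before the change. Assumes equally sized OSDs.
capacity_change_pct() {
    echo $(( $2 * 100 / $1 ))
}

capacity_change_pct 16 4   # removing 4 of 16 OSDs
capacity_change_pct 12 4   # adding 4 OSDs to an existing 12
```

The first call corresponds to the four-node example and the second to the three-node example.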

In a production Ceph storage cluster, a Ceph OSD node has a particular hardware configuration that facilitates a particular type of storage strategy. For more details, see the Storage Strategies Guide for Red Hat Ceph Storage 3.

Since a Ceph OSD node is part of a CRUSH hierarchy, the performance impact of adding or removing a node typically affects the performance of pools that use that CRUSH hierarchy, that is, the CRUSH ruleset.

3.3. Performance considerations

The following factors typically have an impact on a storage cluster’s performance when adding or removing Ceph OSD nodes:

Current Client Load on Affected Pools:

Ceph clients place load on the I/O interface to Ceph; namely, load on a pool. A pool maps to a CRUSH ruleset. The underlying CRUSH hierarchy allows Ceph to place data across failure domains. If the affected Ceph OSD node serves a pool that is under high client load, that load can significantly lengthen recovery time and reduce performance. More specifically, since write operations require data replication for durability, write-intensive client loads will increase the time for the storage cluster to recover.

Capacity Added or Removed:

Generally, the capacity you are adding or removing as a percentage of the overall cluster will have an impact on the storage cluster’s time to recover. Additionally, the storage density of the node you add or remove may have an impact on the time to recover. For example, a node with 36 OSDs will typically take longer to recover than a node with 12 OSDs. When removing nodes, you MUST ensure that you have sufficient spare capacity so that you will not reach the full ratio or near full ratio. If the storage cluster reaches the full ratio, Ceph will suspend write operations to prevent data loss.

Pools and CRUSH Ruleset:

A Ceph OSD node maps to at least one Ceph CRUSH hierarchy, and the hierarchy maps to at least one pool. Each pool that uses the CRUSH hierarchy (ruleset) where you add or remove a Ceph OSD node will experience a performance impact.

Pool Type and Durability:

Replication pools tend to use more network bandwidth to replicate deep copies of the data, whereas erasure coded pools tend to use more CPU to calculate k+m coding chunks. The more copies of the data there are, that is, the larger the pool size or the greater the number of k+m chunks, the longer it will take for the storage cluster to recover.

Total Throughput Characteristics:

Drives, controllers, and network interface cards all have throughput characteristics that may impact the recovery time. Generally, nodes with higher throughput characteristics, for example, 10 Gbps and SSDs, will recover faster than nodes with lower throughput characteristics, for example, 1 Gbps and SATA drives.
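To get a feel for how network throughput alone bounds recovery time, the following sketch computes an idealized transfer time. The function name and the 4000 GB figure are hypothetical. The result is a lower bound only, because it ignores disk speed, protocol overhead, client load, and any backfill throttling.

```shell
# ideal_transfer_seconds DATA_GB LINK_GBPS
# Idealized lower bound: 1 GB is treated as 8 gigabits, and the link is
# assumed to be fully available to recovery traffic.
ideal_transfer_seconds() {
    echo $(( $1 * 8 / $2 ))
}

ideal_transfer_seconds 4000 10   # hypothetical 4000 GB over a 10 Gbps link
ideal_transfer_seconds 4000 1    # the same data over a 1 Gbps link
```

Real recovery times are usually considerably longer than these figures, but the tenfold ratio between the two links illustrates why the network matters.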

3.4. Recommendations for adding or removing nodes

The failure of a node may preclude removing one OSD at a time before changing the node. When circumstances allow, you can reduce the negative performance impact of adding or removing Ceph OSD nodes: Red Hat recommends adding or removing one OSD at a time within a node and allowing the cluster to recover before proceeding to the next OSD. For details on removing an OSD:

When adding a Ceph node, Red Hat also recommends adding one OSD at a time. For details on adding an OSD:

When adding or removing Ceph OSD nodes, consider that other ongoing processes will have an impact on performance too. To reduce the impact on client I/O, Red Hat recommends the following:

Calculate capacity:

Before removing a Ceph OSD node, ensure that the storage cluster can backfill the contents of all its OSDs WITHOUT reaching the full ratio. Reaching the full ratio will cause the cluster to refuse write operations.
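As a rough sketch of this capacity check, the helpers below project raw utilization after a node is removed. The function names are hypothetical, and the inputs are raw figures you would read from ceph df and ceph osd df. The sketch assumes that data redistributes evenly across the remaining OSDs. The default limit of 85 percent is an assumption chosen to stay below a typical near full ratio; check the ratios configured on your own cluster.

```shell
# projected_use_pct USED TOTAL NODE
# USED  = raw space currently used in the whole cluster
# TOTAL = raw capacity of the whole cluster
# NODE  = raw capacity of the node to be removed
# Prints the projected utilization as an integer percentage (rounded down).
projected_use_pct() {
    remaining=$(( $2 - $3 ))
    [ "$remaining" -gt 0 ] || return 1
    echo $(( $1 * 100 / remaining ))
}

# safe_to_remove USED TOTAL NODE [LIMIT_PCT]
# Succeeds only if the projected utilization stays below LIMIT_PCT.
# The default of 85 is an assumption, not a value taken from this guide.
safe_to_remove() {
    limit=${4:-85}
    pct=$(projected_use_pct "$1" "$2" "$3") || return 1
    [ "$pct" -lt "$limit" ]
}
```

For example, safe_to_remove 400 1600 400 succeeds because 400 units of data fit comfortably into the remaining 1200, whereas safe_to_remove 1100 1600 400 fails.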

Temporarily Disable Scrubbing:

Scrubbing is essential to ensuring the durability of the storage cluster’s data; however, it is resource intensive. Before adding or removing a Ceph OSD node, disable scrubbing and deep scrubbing and let the current scrubbing operations complete before proceeding, for example:

ceph osd set noscrub
ceph osd set nodeep-scrub

Once you have added or removed a Ceph OSD node and the storage cluster has returned to an active+clean state, unset the noscrub and nodeep-scrub settings.
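Clearing the two flags can be sketched as follows. The wrapper function and the CEPH variable are hypothetical conveniences. Setting CEPH=echo prints the commands instead of running them, which is useful for a dry run.

```shell
# CEPH can be overridden (for example CEPH=echo) to preview the commands.
CEPH="${CEPH:-ceph}"

# Clear the scrub flags that were set before the maintenance work.
unset_scrub_flags() {
    for flag in noscrub nodeep-scrub; do
        "$CEPH" osd unset "$flag" || return 1
    done
}
```

Call unset_scrub_flags only after the storage cluster has returned to an active+clean state.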

Limit Backfill and Recovery:

If you have reasonable data durability, for example, osd pool default size = 3 and osd pool default min size = 2, there is nothing wrong with operating in a degraded state. You can tune the storage cluster for the fastest possible recovery time, but this will impact Ceph client I/O performance significantly. To maintain the highest Ceph client I/O performance, limit the backfill and recovery operations and allow them to take longer, for example:

osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1

You can also set sleep and delay parameters such as osd_recovery_sleep.
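The same options can be applied at run time with ceph tell and injectargs, where the option names are written with hyphens and a leading double dash. The following sketch converts NAME=VALUE settings into that form. The helper name to_injectargs is hypothetical.

```shell
# to_injectargs NAME=VALUE [NAME=VALUE ...]
# Converts configuration-file style names (osd_max_backfills=1) into the
# command-line form (--osd-max-backfills 1) and prints them on one line.
to_injectargs() {
    args=""
    for setting in "$@"; do
        name=${setting%%=*}
        value=${setting#*=}
        flag=$(printf '%s' "$name" | tr '_' '-')
        args="$args --$flag $value"
    done
    printf '%s\n' "${args# }"
}

# Prints the arguments that would follow: ceph tell osd.* injectargs
to_injectargs osd_max_backfills=1 osd_recovery_max_active=1 osd_recovery_op_priority=1
```

Settings injected at run time are not persistent, so keep the configuration file entries if the limits must survive a daemon restart.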

Finally, if you are expanding the size of the storage cluster, you may need to increase the number of placement groups. If you determine that you need to expand the number of placement groups, Red Hat recommends making incremental increases in the number of placement groups. Increasing the number of placement groups by a significant number will cause performance to degrade considerably.
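One way to plan incremental increases is to list the intermediate placement group counts in advance. The sketch below is illustrative only: the function name pg_steps is hypothetical, and the step size is an assumption that you should choose to suit your own cluster.

```shell
# pg_steps CURRENT TARGET STEP
# Prints one intermediate pg_num value per line, ending exactly at TARGET.
pg_steps() {
    current=$1
    target=$2
    step=$3
    [ "$step" -gt 0 ] || return 1
    while [ "$current" -lt "$target" ]; do
        current=$(( current + step ))
        if [ "$current" -gt "$target" ]; then
            current=$target
        fi
        echo "$current"
    done
}

pg_steps 128 256 32   # hypothetical pool growing from 128 to 256 PGs
```

Each value could then be applied with ceph osd pool set POOL_NAME pg_num VALUE followed by ceph osd pool set POOL_NAME pgp_num VALUE, waiting for the storage cluster to settle before moving to the next value.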

3.5. Adding a Ceph OSD node

To expand the capacity of the Red Hat Ceph Storage cluster, add an OSD node.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • A provisioned node with a network connection.
  • Installation of Red Hat Enterprise Linux 7 or Ubuntu 16.04.
  • Review the Requirements for Installing Red Hat Ceph Storage chapter in the Installation Guide for Red Hat Enterprise Linux or Ubuntu.

Procedure

  1. Verify that other nodes in the storage cluster can reach the new node by its short host name.
  2. Temporarily disable scrubbing:

    [root@monitor ~]# ceph osd set noscrub
    [root@monitor ~]# ceph osd set nodeep-scrub
  3. Limit the backfill and recovery features:

    Syntax

    ceph tell $DAEMON_TYPE.* injectargs --$OPTION_NAME $VALUE [--$OPTION_NAME $VALUE]

    Example

    [root@monitor ~]# ceph tell osd.* injectargs --osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1

  4. Add the new node to the CRUSH map:

    Syntax

    ceph osd crush add-bucket $BUCKET_NAME $BUCKET_TYPE

    Example

    [root@monitor ~]# ceph osd crush add-bucket node2 host

  5. Add an OSD for each disk on the node to the storage cluster.

    • Using Ansible.
    • Using the command-line interface.

      Important

      When adding an OSD node to a Red Hat Ceph Storage cluster, Red Hat recommends adding one OSD at a time within the node and allowing the cluster to recover to an active+clean state before proceeding to the next OSD.
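Waiting for the active+clean state between OSDs can be scripted. The following sketch reads a placement group summary on standard input and succeeds only when every PG is active+clean. The function name is hypothetical, and the sketch assumes that ceph pg stat prints a line resembling "152 pgs: 152 active+clean; ...", which may differ between releases.

```shell
# all_pgs_active_clean
# Reads a PG summary line on stdin, for example:
#   152 pgs: 152 active+clean; 1024 MB data, 3197 MB used
# Succeeds only if the active+clean count equals the total PG count.
all_pgs_active_clean() {
    awk '
        {
            total = ""; clean = 0
            for (i = 1; i < NF; i++) {
                state = $(i + 1)
                sub(/[;,]$/, "", state)       # drop trailing punctuation
                if (state == "pgs:") total = $i
                if (state == "active+clean") clean = $i
            }
        }
        END { exit !(total != "" && total == clean) }
    '
}

# Possible use between OSD additions (requires a running cluster):
#   until ceph pg stat | all_pgs_active_clean; do sleep 30; done
```

Checking the PG states rather than the overall health status is deliberate, because the noscrub and nodeep-scrub flags set earlier in the procedure keep the cluster from reporting HEALTH_OK.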


3.6. Removing a Ceph OSD node

To reduce the capacity of a storage cluster, remove an OSD node.

Warning

Before removing a Ceph OSD node, ensure that the storage cluster can backfill the contents of all OSDs WITHOUT reaching the full ratio. Reaching the full ratio will cause the cluster to refuse write operations.

Prerequisites

  • A running Red Hat Ceph Storage cluster.

Procedure

  1. Check the storage cluster’s capacity:

    [root@monitor ~]# ceph df
    [root@monitor ~]# rados df
    [root@monitor ~]# ceph osd df
  2. Temporarily disable scrubbing:

    [root@monitor ~]# ceph osd set noscrub
    [root@monitor ~]# ceph osd set nodeep-scrub
  3. Limit the backfill and recovery features:

    Syntax

    ceph tell $DAEMON_TYPE.* injectargs --$OPTION_NAME $VALUE [--$OPTION_NAME $VALUE]

    Example

    [root@monitor ~]# ceph tell osd.* injectargs --osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1

  4. Remove each OSD on the node from the storage cluster:

    • Using Ansible.
    • Using the command-line interface.

      Important

      When removing an OSD node from the storage cluster, Red Hat recommends removing one OSD at a time within the node and allowing the cluster to recover to an active+clean state before proceeding to the next OSD.

      1. After removing an OSD, verify that the storage cluster is not approaching the near-full ratio:

        [root@monitor ~]# ceph -s
        [root@monitor ~]# ceph df
      2. Repeat this step until all OSDs on the node are removed from the storage cluster.
  5. Once all OSDs are removed, remove the host bucket from the CRUSH map:

    Syntax

    ceph osd crush rm $BUCKET_NAME

    Example

    [root@monitor ~]# ceph osd crush rm node2


3.7. Simulating a node failure

To simulate a hard node failure, power off the node and reinstall the operating system.

Prerequisites

  • A healthy running Red Hat Ceph Storage cluster.

Procedure

  1. Check the storage capacity to understand what removing the node means for the storage cluster:

    # ceph df
    # rados df
    # ceph osd df
  2. Optionally, disable recovery and backfilling:

    # ceph osd set noout
    # ceph osd set noscrub
    # ceph osd set nodeep-scrub
  3. Shut down the node.
  4. If the host name will change, then remove the node from the CRUSH map:

    [root@ceph1 ~]# ceph osd crush rm ceph3
  5. Check the status of the cluster:

    [root@ceph1 ~]# ceph -s
  6. Reinstall the operating system on the node.
  7. Add an Ansible user and SSH keys:

    [root@ceph3 ~]# useradd ansible
    [root@ceph3 ~]# passwd ansible
    [root@ceph3 ~]# cat << EOF > /etc/sudoers.d/ansible
    ansible ALL = (root) NOPASSWD:ALL
    Defaults:ansible !requiretty
    EOF
    [root@ceph3 ~]# su - ansible
    [ansible@ceph3 ~]$ ssh-keygen
  8. From the administration node, copy the SSH keys for the ansible user:

    [ansible@admin ~]$ ssh-copy-id ceph3
  9. From the administration node, re-run the Ansible playbook:

    [ansible@admin ~]$ cd /usr/share/ceph-ansible
    [ansible@admin ~]$ ansible-playbook site.yml

    Example Output

    PLAY RECAP ********************************************************************
    ceph1                      : ok=368  changed=2    unreachable=0    failed=0
    ceph2                      : ok=284  changed=0    unreachable=0    failed=0
    ceph3                      : ok=284  changed=15   unreachable=0    failed=0

  10. Optionally, enable recovery and backfilling:

    [root@ceph3 ~]# ceph osd unset noout
    [root@ceph3 ~]# ceph osd unset noscrub
    [root@ceph3 ~]# ceph osd unset nodeep-scrub
  11. Check Ceph’s health:

    [root@ceph3 ~]# ceph -s
        cluster 1e0c9c34-901d-4b46-8001-0d1f93ca5f4d
         health HEALTH_OK
         monmap e1: 3 mons at {ceph1=192.168.122.81:6789/0,ceph2=192.168.122.82:6789/0,ceph3=192.168.122.83:6789/0}
                election epoch 36, quorum 0,1,2 ceph1,ceph2,ceph3
         osdmap e95: 3 osds: 3 up, 3 in
                flags sortbitwise
          pgmap v1190: 152 pgs, 12 pools, 1024 MB data, 441 objects
                3197 MB used, 293 GB / 296 GB avail
                     152 active+clean
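Beyond reading the health line, you can confirm that every OSD rejoined the cluster by comparing the counts on the osdmap line. The sketch below parses status output on standard input. The function name is hypothetical, and the parsing assumes a line of the form shown in the sample output above, that is, "3 osds: 3 up, 3 in".

```shell
# osds_all_up_in
# Reads cluster status output on stdin and succeeds only if the number of
# OSDs reported as up and in both equal the total number of OSDs.
osds_all_up_in() {
    awk '
        /osds:/ {
            for (i = 1; i < NF; i++) {
                word = $(i + 1)
                sub(/[;,]$/, "", word)        # drop trailing punctuation
                if (word == "osds:") total = $i
                else if (word == "up") up = $i
                else if (word == "in") in_count = $i
            }
            found = 1
        }
        END { exit !(found && total == up && total == in_count) }
    '
}

# Possible use on a running cluster:
#   ceph -s | osds_all_up_in && echo "all OSDs are up and in"
```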
