Chapter 7. Management of OSDs using the Ceph Orchestrator

As a storage administrator, you can use the Ceph Orchestrators to manage OSDs of a Red Hat Ceph Storage cluster.

7.1. Ceph OSDs

When a Red Hat Ceph Storage cluster is up and running, you can add OSDs to the storage cluster at runtime.

A Ceph OSD generally consists of one ceph-osd daemon for one storage drive and its associated journal within a node. If a node has multiple storage drives, then map one ceph-osd daemon for each drive.

Red Hat recommends checking the capacity of a cluster regularly to see if it is reaching the upper end of its storage capacity. As a storage cluster reaches its near full ratio, add one or more OSDs to expand the storage cluster’s capacity.

When you want to reduce the size of a Red Hat Ceph Storage cluster or replace the hardware, you can also remove an OSD at runtime. If the node has multiple storage drives, you might also need to remove one of the ceph-osd daemon for that drive. Generally, it’s a good idea to check the capacity of the storage cluster to see if you are reaching the upper end of its capacity. Ensure that when you remove an OSD that the storage cluster is not at its near full ratio.

Important

Do not let a storage cluster reach the full ratio before adding an OSD. OSD failures that occur after the storage cluster reaches the near full ratio can cause the storage cluster to exceed the full ratio. Ceph blocks write access to protect the data until you resolve the storage capacity issues. Do not remove OSDs without considering the impact on the full ratio first.

7.2. Ceph OSD node configuration

Configure Ceph OSDs and their supporting hardware similarly as a storage strategy for the pool(s) that will use the OSDs. Ceph prefers uniform hardware across pools for a consistent performance profile. For best performance, consider a CRUSH hierarchy with drives of the same type or size.

If you add drives of dissimilar size, adjust their weights accordingly. When you add the OSD to the CRUSH map, consider the weight for the new OSD. Hard drive capacity grows approximately 40% per year, so newer OSD nodes might have larger hard drives than older nodes in the storage cluster, that is, they might have a greater weight.

Before doing a new installation, review the Requirements for Installing Red Hat Ceph Storage chapter in the Installation Guide.

7.3. Automatically tuning OSD memory

The OSD daemons adjust the memory consumption based on the osd_memory_target configuration option. The option osd_memory_target sets OSD memory based upon the available RAM in the system.

If Red Hat Ceph Storage is deployed on dedicated nodes that do not share memory with other services, Cephadm automatically adjusts the per-OSD consumption based on the total amount of RAM and the number of deployed OSDs.

By default, the osd_memory_target_autotune parameter is set to false in Red Hat Ceph Storage 5. You can enable this option globally as follows within the Cephadm shell:

Example

[ceph: root@host01 /]# ceph config set osd osd_memory_target_autotune true

Once the storage cluster is upgraded to Red Hat Ceph Storage 5.0, for cluster maintenance such as addition of OSDs or replacement of OSDs, Red Hat recommends setting osd_memory_target_autotune parameter to true to autotune osd memory as per system memory.

Cephadm starts with a fraction mgr/cephadm/autotune_memory_target_ratio, which defaults to 0.7 of the total RAM in the system, subtract off any memory consumed by non-autotuned daemons such as non-OSDS and for OSDs for which osd_memory_target_autotune is false, and then divide by the remaining OSDs.

The osd_memory_target parameter is calculated as follows:

Syntax

osd_memory_target = TOTAL_RAM_OF_THE_OSD * (1048576) * (0.7)/ NUMBER_OF_OSDS_IN_THE_OSD_NODE

For example, if a node has 24 OSDs and has 251G RAM space, then osd_memory_target is 7860684936.

The final targets are reflected in the configuration database with options. You can view the limits and the current memory consumed by each daemon from the ceph orch ps output under MEM LIMIT column.

You can manually set a specific memory target for an OSD in the storage cluster.

Example

[ceph: root@host01 /]# ceph config set osd.123 osd_memory_target 7860684936

You can exclude an OSD from memory autotuning.

Example

[ceph: root@host01 /]# ceph config set osd.123 osd_memory_target_autotune false

7.4. Listing devices for Ceph OSD deployment

You can check the list of available devices before deploying OSDs using the Ceph Orchestrator. The commands are used to print a list of devices discoverable by Cephadm. A storage device is considered available if all of the following conditions are met:

  • The device must have no partitions.
  • The device must not have any LVM state.
  • The device must not be mounted.
  • The device must not contain a file system.
  • The device must not contain a Ceph BlueStore OSD.
  • The device must be larger than 5 GB.
Note

Ceph will not provision an OSD on a device that is not available.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Hosts are added to the cluster.
  • All manager and monitor daemons are deployed.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]#cephadm shell

  2. List the available devices to deploy OSDs:

    Syntax

    ceph orch device ls [--hostname=HOSTNAME_1 HOSTNAME_2] [--wide] [--refresh]

    Example

    [ceph: root@host01 /]# ceph orch device ls --wide --refresh

    Using the --wide option provides all details relating to the device, including any reasons that the device might not be eligible for use as an OSD. This option does not support NVMe devices.

  3. Optional: To enable Health, Ident, and Fault fields in the output of ceph orch device ls, run the following commands:

    Note

    These fields are supported by libstoragemgmt library and currently supports SCSI, SAS, and SATA devices.

    1. As root user, check your hardware’s compatibility with libstoragemgmt library to avoid unplanned interruption to services:

      Example

      [root@host01 ~]# cephadm shell lsmcli ldl

      In the output, you see the Health Status as Good with the respective SCSI VPD 0x83 ID.

      Note

      If you do not get this information, then enabling the fields might cause erratic behavior of devices.

    2. Enable libstoragemgmt support:

      Example

      [ceph: root@host01 /]# ceph config set mgr mgr/cephadm/device_enhanced_scan true

      Once this is enabled, ceph orch device ls will give the output of Health field as Good.

Verification

  • List the devices:

    Example

    [ceph: root@host01 /]# ceph orch device ls

7.5. Zapping devices for Ceph OSD deployment

You need to check the list of available devices before deploying OSDs. If there is no space available on the devices, you can clear the data on the devices by zapping them.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Hosts are added to the cluster.
  • All manager and monitor daemons are deployed.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. List the available devices to deploy OSDs:

    Syntax

    ceph orch device ls [--hostname=HOSTNAME_1 HOSTNAME_2] [--wide] [--refresh]

    Example

    [ceph: root@host01 /]# ceph orch device ls --wide --refresh

  3. Clear the data of a device:

    Syntax

    ceph orch device zap HOSTNAME FILE_PATH --force

    Example

    [ceph: root@host01 /]# ceph orch device zap host02 /dev/sdb --force

Verification

  • Verify the space is available on the device:

    Example

    [ceph: root@host01 /]# ceph orch device ls

    You will see that the field under Available is Yes.

Additional Resources

7.6. Deploying Ceph OSDs on all available devices

You can deploy all OSDS on all the available devices. Cephadm allows the Ceph Orchestrator to discover and deploy the OSDs on any available and unused storage device.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Hosts are added to the cluster.
  • All manager and monitor daemons are deployed.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. List the available devices to deploy OSDs:

    Syntax

    ceph orch device ls [--hostname=HOSTNAME_1 HOSTNAME_2] [--wide] [--refresh]

    Example

    [ceph: root@host01 /]# ceph orch device ls --wide --refresh

  3. Deploy OSDs on all available devices:

    Example

    [ceph: root@host01 /]# ceph orch apply osd --all-available-devices

    The effect of ceph orch apply is persistent which means that the Orchestrator automatically finds the device, adds it to the cluster, and creates new OSDs. This occurs under the following conditions:

    • New disks or drives are added to the system.
    • Existing disks or drives are zapped.
    • An OSD is removed and the devices are zapped.

      You can disable automatic creation of OSDs on all the available devices by using the --unmanaged parameter.

      Example

      [ceph: root@host01 /]# ceph orch apply osd --all-available-devices --unmanaged=true

      Setting the parameter --unmanaged to true disables the creation of OSDs and also there is no change if you apply a new OSD service.

      Note

      The command ceph orch daemon add creates new OSDs, but does not add an OSD service.

Verification

  • List the service:

    Example

    [ceph: root@host01 /]# ceph orch ls

  • View the details of the node and devices:

    Example

    [ceph: root@host01 /]# ceph osd tree

Additional Resources

7.7. Deploying Ceph OSDs on specific devices and hosts

You can deploy all the Ceph OSDs on specific devices and hosts using the Ceph Orchestrator.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Hosts are added to the cluster.
  • All manager and monitor daemons are deployed.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. List the available devices to deploy OSDs:

    Syntax

    ceph orch device ls [--hostname=HOSTNAME_1 HOSTNAME_2] [--wide] [--refresh]

    Example

    [ceph: root@host01 /]# ceph orch device ls --wide --refresh

  3. Deploy OSDs on specific devices and hosts:

    Syntax

    ceph orch daemon add osd HOSTNAME:_DEVICE_PATH_

    Example

    [ceph: root@host01 /]# ceph orch daemon add osd host02:/dev/sdb

Verification

  • List the service:

    Example

    [ceph: root@host01 /]# ceph orch ls osd

  • View the details of the node and devices:

    Example

    [ceph: root@host01 /]# ceph osd tree

  • List the hosts, daemons, and processes:

    Syntax

    ceph orch ps --service_name=SERVICE_NAME

    Example

    [ceph: root@host01 /]# ceph orch ps --service_name=osd

Additional Resources

7.8. Advanced service specifications and filters for deploying OSDs

Service Specification of type OSD is a way to describe a cluster layout using the properties of disks. It gives the user an abstract way to tell Ceph which disks should turn into an OSD with the required configuration without knowing the specifics of device names and paths. For each device and each host, define a yaml file or a json file.

General settings for OSD specifications

  • service_type: 'osd': This is mandatory to create OSDS
  • service_id: Use the service name or identification you prefer. A set of OSDs is created using the specification file. This name is used to manage all the OSDs together and represent an Orchestrator service.
  • placement: This is used to define the hosts on which the OSDs needs to be deployed.

    You can use on the following options:

    • host_pattern: '*' - A host name pattern used to select hosts.
    • label: 'osd_host' - A label used in the hosts where OSD needs to be deployed.
    • hosts: 'host01', 'host02' - An explicit list of host names where OSDs needs to be deployed.
  • selection of devices: The devices where OSDs are created. This allows to separate an OSD from different devices. You can create only BlueStore OSDs which have three components:

    • OSD data: contains all the OSD data
    • WAL: BlueStore internal journal or write-ahead Log
    • DB: BlueStore internal metadata
  • data_devices: Define the devices to deploy OSD. In this case, OSDs are created in a collocated schema. You can use filters to select devices and folders.
  • wal_devices: Define the devices used for WAL OSDs. You can use filters to select devices and folders.
  • db_devices: Define the devices for DB OSDs. You can use the filters to select devices and folders.
  • encrypted: An optional parameter to encrypt information on the OSD which can set to either True or False
  • unmanaged: An optional parameter, set to False by default. You can set it to True if you do not want the Orchestrator to manage the OSD service.
  • block_wal_size: User-defined value, in bytes.
  • block_db_size: User-defined value, in bytes.
  • osds_per_device: User-defined value for deploying more than one OSD per device.

Filters for specifying devices

Filters are used in conjunction with the data_devices, wal_devices and db_devices parameters.

Name of the filter

Description

Syntax

Example

Model

Target specific disks. You can get details of the model by running lsblk -o NAME,FSTYPE,LABEL,MOUNTPOINT,SIZE,MODEL command or smartctl -i /DEVIVE_PATH

Model: DISK_MODEL_NAME

Model: MC-55-44-XZ

Vendor

Target specific disks

Vendor: DISK_VENDOR_NAME

Vendor: Vendor Cs

Size Specification

Includes disks of an exact size

size: EXACT

size: '10G'

Size Specification

Includes disks size of which is within the range

size: LOW:HIGH

size: '10G:40G'

Size Specification

Includes disks less than or equal to in size

size: :HIGH

size: ':10G'

Size Specification

Includes disks equal to or greater than in size

size: LOW:

size: '40G:'

Rotational

Rotational attribute of the disk. 1 matches all disks that are rotational and 0 matches all the disks that are non-rotational. If rotational =1, then OSD is configured with SSD or NVME. If rotational=0 then the OSD is configured with HDD.

rotational: 0 or 1

rotational: 0

All

Considers all the available disks

all: true

all: true

Limiter

When you specified valid filters but want to limit the amount of matching disks you can use the ‘limit’ directive. It should be used only as a last resort.

limit: NUMBER

limit: 2

Note

To create an OSD with non-collocated components in the same host, you have to specify the different type of devices used and the devices should be on the same host.

Note

The devices used for deploying OSDs must be supported by libstoragemgmt.

Additional Resources

7.9. Deploying Ceph OSDs using advanced service specifications

The service specification of type OSD is a way to describe a cluster layout using the properties of disks. It gives the user an abstract way to tell Ceph which disks should turn into an OSD with the required configuration without knowing the specifics of device names and paths.

You can deploy the OSD for each device and each host by defining a yml file or a json file.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Hosts are added to the cluster.
  • All manager and monitor daemons are deployed.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. On the monitor node, navigate the following directory:

    Syntax

    cd /var/lib/ceph/DAEMON_PATH/

    Example

    [ceph: root@host01 /]# cd /var/lib/ceph/osd/

  3. Create the osd_spec.yml file:

    Example

    [ceph: root@host01 osd]# touch osd_spec.yml

  4. Edit the osd_spec.yml file to include the following details:

    Syntax

    service_type: osd
    service_id: SERVICE_ID
    placement:
      host_pattern: '*' # optional
    data_devices: # optional
      model: DISK_MODEL_NAME # optional
      paths:
      - /DEVICE_PATH
    osds_per_device: NUMBER_OF_DEVICES # optional
    db_devices: # optional
      size: # optional
      all: true # optional
      paths:
       - /DEVICE_PATH
    encrypted: true

    1. Simple scenario: In this case, all the nodes have the same set-up.

      Example

      service_type: osd
      service_id: osd_spec_default
      placement:
        host_pattern: '*'
      data_devices:
        all: true
        paths:
        - /dev/sdb
      encrypted: true

      Example

      service_type: osd
      service_id: osd_spec_default
      placement:
        host_pattern: '*'
      data_devices:
        size: '80G'
      db_devices:
        size: '40G:'
        paths:
         - /dev/sdc

    2. Advanced scenario: This would create the desired layout by using all HDDs as data_devices with two SSD assigned as dedicated DB or WAL devices. The remaining SSDs are data_devices that have the NVMEs vendors assigned as dedicated DB or WAL devices.

      Example

      service_type: osd
      service_id: osd_spec_hdd
      placement:
        host_pattern: '*'
      data_devices:
        rotational: 0
      db_devices:
        model: Model-name
        limit: 2
      ---
      service_type: osd
      service_id: osd_spec_ssd
      placement:
        host_pattern: '*'
      data_devices:
        model: Model-name
      db_devices:
        vendor: Vendor-name

    3. Advanced scenario with non-uniform nodes: This applies different OSD specs to different hosts depending on the host_pattern key.

      Example

      service_type: osd
      service_id: osd_spec_node_one_to_five
      placement:
        host_pattern: 'node[1-5]'
      data_devices:
        rotational: 1
      db_devices:
        rotational: 0
      ---
      service_type: osd
      service_id: osd_spec_six_to_ten
      placement:
        host_pattern: 'node[6-10]'
      data_devices:
        model: Model-name
      db_devices:
        model: Model-name

    4. Advanced scenario with dedicated WAL and DB devices:

      Example

      service_type: osd
      service_id: osd_using_paths
      placement:
        hosts:
          - host01
          - host02
      data_devices:
        paths:
          - /dev/sdb
      db_devices:
        paths:
          - /dev/sdc
      wal_devices:
        paths:
          - /dev/sdd

    5. Advanced scenario with multiple OSDs per device:

      Example

      service_type: osd
      service_id: multiple_osds
      placement:
        hosts:
          - host01
          - host02
      osds_per_device: 4
      data_devices:
        paths:
          - /dev/sdb

  5. Before deploying OSDs, do a dry run:

    Note

    This step gives a preview of the deployment, without deploying the daemons.

    Example

    [ceph: root@host01 osd]# ceph orch apply -i osd_spec.yml --dry-run

  6. Deploy OSDs using service specification:

    Syntax

    ceph orch apply -i FILE_NAME.yml

    Example

    [ceph: root@host01 osd]# ceph orch apply -i osd_spec.yml

Verification

  • List the service:

    Example

    [ceph: root@host01 /]# ceph orch ls osd

  • View the details of the node and devices:

    Example

    [ceph: root@host01 /]# ceph osd tree

Additional Resources

7.10. Removing the OSD daemons using the Ceph Orchestrator

You can remove the OSD from a cluster by using Cephadm.

Removing an OSD from a cluster involves two steps:

  1. Evacuates all placement groups (PGs) from the cluster.
  2. Removes the PG-free OSDs from the cluster.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Hosts are added to the cluster.
  • Ceph Monitor, Ceph Manager and Ceph OSD daemons are deployed on the storage cluster.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. Check the device and the node from which the OSD has to be removed:

    Example

    [ceph: root@host01 /]# ceph osd tree

  3. Remove the OSD:

    Syntax

    ceph orch osd rm OSD_ID [--replace] [--force]

    Example

    [ceph: root@host01 /]# ceph orch osd rm 0

    Note

    If you remove the OSD from the storage cluster without an option, such as --replace, the device is removed from the storage cluster completely. If you want to use the same device for deploying OSDs, you have to first zap the device before adding it to the storage cluster.

  4. Optional: To remove multiple OSDs from a specific node, run the following command:

    Syntax

    ceph orch osd rm OSD_ID OSD_ID

    Example

    [ceph: root@host01 /]# ceph orch osd rm 2 5

  5. Check the status of the OSD removal:

    Example

    [ceph: root@host01 /]# ceph orch osd rm status
    OSD_ID  HOST     STATE     PG_COUNT  REPLACE  FORCE  DRAIN_STARTED_AT
    2       host01  draining  124       False    False  2021-09-07 16:26:07.142980
    5       host01  draining  107       False    False  2021-09-07 16:26:08.330371

    When no PGs are left on the OSD, it is decommissioned and removed from the cluster.

Verification

  • Verify the details of the devices and the nodes from which the Ceph OSDs are removed:

    Example

    [ceph: root@host01 /]# ceph osd tree

Additional Resources

7.11. Replacing the OSDs using the Ceph Orchestrator

You can replace the OSDs from the cluster by preserving the OSD ID using the ceph orch rm command. The OSD is not permanently removed from the CRUSH hierarchy, but is assigned the destroyed flag. This flag is used to determine the OSD IDs that can be reused in the next OSD deployment. The ‘destroyed’ flag is used to determine which OSD ids is reused in the next OSD deployment.

If you use OSD specification for deployment, your newly added disks is assigned the OSD IDs of their replaced counterparts.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Hosts are added to the cluster.
  • Monitor, Manager and OSD daemons are deployed on the storage cluster.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. Check the device and the node from which the OSD has to be replaced:

    Example

    [ceph: root@host01 /]# ceph osd tree

  3. Replace the OSD:

    Syntax

    ceph orch osd rm OSD_ID --replace

    Example

    [ceph: root@host01 /]# ceph orch osd rm 0 --replace

  4. Check the status of the OSD replacement:

    Example

    [ceph: root@host01 /]# ceph orch osd rm status

Verification

  • Verify the details of the devices and the nodes from which the Ceph OSDS are replaced:

    Example

    [ceph: root@host01 /]# ceph osd tree

Additional Resources

7.12. Stopping the removal of the OSDs using the Ceph Orchestrator

You can stop the removal of only the OSDs that are queued for removal. This resets the initial state of the OSD and takes it off the removal queue.

If the OSD is in the process of removal, then you cannot stop the process.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Hosts are added to the cluster.
  • Monitor, Manager and OSD daemons are deployed on the cluster.
  • Remove OSD process initiated.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. Check the device and the node from which the OSD was initiated to be removed:

    Example

    [ceph: root@host01 /]# ceph osd tree

  3. Stop the removal of the queued OSD:

    Syntax

    ceph orch osd rm stop OSD_ID

    Example

    [ceph: root@host01 /]# ceph orch osd rm stop 0

  4. Check the status of the OSD removal:

    Example

    [ceph: root@host01 /]# ceph orch osd rm status

Verification

  • Verify the details of the devices and the nodes from which the Ceph OSDs were queued for removal:

    Example

    [ceph: root@host01 /]# ceph osd tree

Additional Resources

7.13. Activating the OSDs using the Ceph Orchestrator

You can activate the OSDs in the cluster in cases where the operating system of the host was reinstalled.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Hosts are added to the cluster.
  • Monitor, Manager and OSD daemons are deployed on the storage cluster.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. After the operating system of the host is reinstalled, activate the OSDs:

    Syntax

    ceph cephadm osd activate HOSTNAME

    Example

    [ceph: root@host01 /]# ceph cephadm osd activate host03

Verification

  • List the service:

    Example

    [ceph: root@host01 /]# ceph orch ls

  • List the hosts, daemons, and processes:

    Syntax

    ceph orch ps --service_name=SERVICE_NAME

    Example

    [ceph: root@host01 /]# ceph orch ps --service_name=osd

7.13.1. Observing the data migration

When you add or remove an OSD to the CRUSH map, Ceph begins rebalancing the data by migrating placement groups to the new or existing OSD(s). You can observe the data migration using ceph-w command.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Recently added or removed an OSD.

Procedure

  1. To observe the data migration:

    Example

    [ceph: root@host01 /]# ceph -w

  2. Watch as the placement group states change from active+clean to active, some degraded objects, and finally active+clean when migration completes.
  3. To exit the utility, press Ctrl + C.

7.14. Recalculating the placement groups

Placement groups (PGs) define the spread of any pool data across the available OSDs. A placement group is built upon the given redundancy algorithm to be used. For a 3-way replication, the redundancy is defined to use three different OSDs. For erasure-coded pools, the number of OSDs to use is defined by the number of chunks.

When defining a pool the number of placement groups defines the grade of granularity the data is spread with across all available OSDs. The higher the number the better the equalization of capacity load can be. However, since handling the placement groups is also important in case of reconstruction of data, the number is significant to be carefully chosen upfront. To support calculation a tool is available to produce agile environments.

During lifetime of a storage cluster a pool may grow above the initially anticipated limits. With the growing number of drives a recalculation is recommended. The number of placement groups per OSD should be around 100. When adding more OSDs to the storage cluster the number of PGs per OSD will lower over time. Starting with 120 drives initially in the storage cluster and setting the pg_num of the pool to 4000 will end up in 100 PGs per OSD, given with the replication factor of three. Over time, when growing to ten times the number of OSDs, the number of PGs per OSD will go down to ten only. Because small number of PGs per OSD will tend to an unevenly distributed capacity, consider adjusting the PGs per pool.

Adjusting the number of placement groups can be done online. Recalculating is not only a recalculation of the PG numbers, but will involve data relocation, which will be a lengthy process. However, the data availability will be maintained at any time.

Very high numbers of PGs per OSD should be avoided, because reconstruction of all PGs on a failed OSD will start at once. A high number of IOPS is required to perform reconstruction in a timely manner, which might not be available. This would lead to deep I/O queues and high latency rendering the storage cluster unusable or will result in long healing times.

Additional Resources

  • See the PG calculator for calculating the values by a given use case.
  • See the Erasure Code Pools chapter in the Red Hat Ceph Storage Strategies Guide for more information.

7.15. Using the Ceph Manager balancer module

The balancer is a module for Ceph Manager (ceph-mgr) that optimizes the placement of placement groups (PGs) across OSDs in order to achieve a balanced distribution, either automatically or in a supervised fashion.

Currently the balancer module cannot be disabled. It can only be turned off to customize the configuration.

Prerequisites

  • A running Red Hat Ceph Storage cluster.

Procedure

  1. Ensure the balancer module is enabled:

    Example

    [ceph: root@host01 /]# ceph mgr module enable balancer

  2. Turn on the balancer module:

    Example

    [ceph: root@host01 /]# ceph balancer on

  3. The default mode is crush-compat. The mode can be changed with:

    Example

    [ceph: root@host01 /]# ceph balancer mode upmap

    or

    Example

    [ceph: root@host01 /]# ceph balancer mode crush-compat

Status

The current status of the balancer can be checked at any time with:

Example

[ceph: root@host01 /]# ceph balancer status

Automatic balancing

By default, when turning on the balancer module, automatic balancing is used:

Example

[ceph: root@host01 /]# ceph balancer on

The balancer can be turned back off again with:

Example

[ceph: root@host01 /]# ceph balancer off

This will use the crush-compat mode, which is backward compatible with older clients and will make small changes to the data distribution over time to ensure that OSDs are equally utilized.

Throttling

No adjustments will be made to the PG distribution if the cluster is degraded, for example, if an OSD has failed and the system has not yet healed itself.

When the cluster is healthy, the balancer throttles its changes such that the percentage of PGs that are misplaced, or need to be moved, is below a threshold of 5% by default. This percentage can be adjusted using the max_misplaced setting. For example, to increase the threshold to 7%:

Example

[ceph: root@host01 /]# ceph config-key set mgr/balancer/target_max_misplaced .07

Supervised optimization

The balancer operation is broken into a few distinct phases:

  1. Building a plan.
  2. Evaluating the quality of the data distribution, either for the current PG distribution, or the PG distribution that would result after executing a plan.
  3. Executing the plan.

    • To evaluate and score the current distribution:

      Example

      [ceph: root@host01 /]# ceph balancer eval

    • To evaluate the distribution for a single pool:

      Syntax

      ceph balancer eval POOL_NAME

      Example

      [ceph: root@host01 /]# ceph balancer eval rbd

    • To see greater detail for the evaluation:

      Example

      [ceph: root@host01 /]# ceph balancer eval-verbose ...

    • To generate a plan using the currently configured mode:

      Syntax

      ceph balancer optimize PLAN_NAME

      Replace PLAN_NAME with a custom plan name.

      Example

      [ceph: root@host01 /]# ceph balancer optimize rbd_123

    • To see the contents of a plan:

      Syntax

      ceph balancer show PLAN_NAME

      Example

      [ceph: root@host01 /]# ceph balancer show rbd_123

    • To discard old plans:

      Syntax

      ceph balancer rm PLAN_NAME

      Example

      [ceph: root@host01 /]# ceph balancer rm rbd_123

    • To see currently recorded plans use the status command:

      [ceph: root@host01 /]# ceph balancer status
    • To calculate the quality of the distribution that would result after executing a plan:

      Syntax

      ceph balancer eval PLAN_NAME

      Example

      [ceph: root@host01 /]# ceph balancer eval rbd_123

    • To execute the plan:

      Syntax

      ceph balancer execute PLAN_NAME

      Example

      [ceph: root@host01 /]# ceph balancer execute rbd_123

      Note

      Only execute the plan if is expected to improve the distribution. After execution, the plan will be discarded.

7.16. Using the Ceph manager crash module

Using the Ceph manager crash module, you can collect information about daemon crashdumps and store it in the Red Hat Ceph Storage cluster for further analysis.

By default, daemon crashdumps are dumped in /var/lib/ceph/crash. You can configure with the option crash dir. Crash directories are named by time, date, and a randomly-generated UUID, and contain a metadata file meta and a recent log file, with a crash_id that is the same.

You can use ceph-crash.service to submit these crash automatically and persist in the Ceph Monitors. The ceph-crash.service watches watches the crashdump directory and uploads them with ceph crash post.

The RECENT_CRASH heath message is one of the most common health messages in a Ceph cluster. This health message means that one or more Ceph daemons has crashed recently, and the crash has not yet been archived or acknowledged by the administrator. This might indicate a software bug, a hardware problem like a failing disk, or some other problem. The option mgr/crash/warn_recent_interval controls the time period of what recent means, which is two weeks by default. You can disable the warnings by running the following command:

Example

[ceph: root@host01 /]# ceph config set mgr/crash/warn_recent_interval 0

The option mgr/crash/retain_interval controls the period for which you want to retain the crash reports before they are automatically purged. The default for this option is one year.

Prerequisites

  • A running Red Hat Ceph Storage cluster.

Procedure

  1. Ensure the crash module is enabled:

    Example

    [ceph: root@host01 /]# ceph mgr module ls | more
    {
        "always_on_modules": [
            "balancer",
            "crash",
            "devicehealth",
            "orchestrator_cli",
            "progress",
            "rbd_support",
            "status",
            "volumes"
        ],
        "enabled_modules": [
            "dashboard",
            "pg_autoscaler",
            "prometheus"
        ]

  2. Save a crash dump: The metadata file is a JSON blob stored in the crash dir as meta. You can invoke the ceph command -i - option, which reads from stdin.

    Example

    [ceph: root@host01 /]# ceph crash post -i meta

  3. List the timestamp or the UUID crash IDs for all the new and archived crash info:

    Example

    [ceph: root@host01 /]# ceph crash ls

  4. List the timestamp or the UUID crash IDs for all the new crash information:

    Example

    [ceph: root@host01 /]# ceph crash ls-new

  5. List the timestamp or the UUID crash IDs for all the new crash information:

    Example

    [ceph: root@host01 /]# ceph crash ls-new

  6. List the summary of saved crash information grouped by age:

    Example

    [ceph: root@host01 /]# ceph crash stat
    8 crashes recorded
    8 older than 1 days old:
    2021-05-20T08:30:14.533316Z_4ea88673-8db6-4959-a8c6-0eea22d305c2
    2021-05-20T08:30:14.590789Z_30a8bb92-2147-4e0f-a58b-a12c2c73d4f5
    2021-05-20T08:34:42.278648Z_6a91a778-bce6-4ef3-a3fb-84c4276c8297
    2021-05-20T08:34:42.801268Z_e5f25c74-c381-46b1-bee3-63d891f9fc2d
    2021-05-20T08:34:42.803141Z_96adfc59-be3a-4a38-9981-e71ad3d55e47
    2021-05-20T08:34:42.830416Z_e45ed474-550c-44b3-b9bb-283e3f4cc1fe
    2021-05-24T19:58:42.549073Z_b2382865-ea89-4be2-b46f-9a59af7b7a2d
    2021-05-24T19:58:44.315282Z_1847afbc-f8a9-45da-94e8-5aef0738954e

  7. View the details of the saved crash:

    Syntax

    ceph crash info CRASH_ID

    Example

    [ceph: root@host01 /]# ceph crash info 2021-05-24T19:58:42.549073Z_b2382865-ea89-4be2-b46f-9a59af7b7a2d
    {
        "assert_condition": "session_map.sessions.empty()",
        "assert_file": "/builddir/build/BUILD/ceph-16.1.0-486-g324d7073/src/mon/Monitor.cc",
        "assert_func": "virtual Monitor::~Monitor()",
        "assert_line": 287,
        "assert_msg": "/builddir/build/BUILD/ceph-16.1.0-486-g324d7073/src/mon/Monitor.cc: In function 'virtual Monitor::~Monitor()' thread 7f67a1aeb700 time 2021-05-24T19:58:42.545485+0000\n/builddir/build/BUILD/ceph-16.1.0-486-g324d7073/src/mon/Monitor.cc: 287: FAILED ceph_assert(session_map.sessions.empty())\n",
        "assert_thread_name": "ceph-mon",
        "backtrace": [
            "/lib64/libpthread.so.0(+0x12b30) [0x7f679678bb30]",
            "gsignal()",
            "abort()",
            "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f6798c8d37b]",
            "/usr/lib64/ceph/libceph-common.so.2(+0x276544) [0x7f6798c8d544]",
            "(Monitor::~Monitor()+0xe30) [0x561152ed3c80]",
            "(Monitor::~Monitor()+0xd) [0x561152ed3cdd]",
            "main()",
            "__libc_start_main()",
            "_start()"
        ],
        "ceph_version": "16.1.0-486.el8cp",
        "crash_id": "2021-05-24T19:58:42.549073Z_b2382865-ea89-4be2-b46f-9a59af7b7a2d",
        "entity_name": "mon.ceph-adm4",
        "os_id": "rhel",
        "os_name": "Red Hat Enterprise Linux",
        "os_version": "8.3 (Ootpa)",
        "os_version_id": "8.3",
        "process_name": "ceph-mon",
        "stack_sig": "957c21d558d0cba4cee9e8aaf9227b3b1b09738b8a4d2c9f4dc26d9233b0d511",
        "timestamp": "2021-05-24T19:58:42.549073Z",
        "utsname_hostname": "host02",
        "utsname_machine": "x86_64",
        "utsname_release": "4.18.0-240.15.1.el8_3.x86_64",
        "utsname_sysname": "Linux",
        "utsname_version": "#1 SMP Wed Feb 3 03:12:15 EST 2021"
    }

  8. Remove saved crashes older than KEEP days: Here, KEEP must be an integer.

    Syntax

    ceph crash prune KEEP

    Example

    [ceph: root@host01 /]# ceph crash prune 60

  9. Archive a crash report so that it is no longer considered for the RECENT_CRASH health check and does not appear in the crash ls-new output. It appears in the crash ls.

    Syntax

    ceph crash archive CRASH_ID

    Example

    [ceph: root@host01 /]# ceph crash archive 2021-05-24T19:58:42.549073Z_b2382865-ea89-4be2-b46f-9a59af7b7a2d

  10. Archive all crash reports:

    Example

    [ceph: root@host01 /]# ceph crash archive-all

  11. Remove the crash dump:

    Syntax

    ceph crash rm CRASH_ID

    Example

    [ceph: root@host01 /]# ceph crash rm 2021-05-24T19:58:42.549073Z_b2382865-ea89-4be2-b46f-9a59af7b7a2d

Additional Resources