Chapter 2. The Ceph File System Metadata Server

As a storage administrator, you can learn about the different states of the Ceph File System (CephFS) Metadata Server (MDS), MDS ranking mechanics, configuring the MDS standby daemon, and cache size limits. Knowing these concepts enables you to configure the MDS daemons for a storage environment.

2.1. Prerequisites

  • A running and healthy Red Hat Ceph Storage cluster.
  • Installation of the Ceph Metadata Server daemons (ceph-mds). See the Management of MDS service using the Ceph Orchestrator section in the Red Hat Ceph Storage File System Guide for details on configuring MDS daemons.

2.2. Metadata Server daemon states

The Metadata Server (MDS) daemons operate in two states:

  • Active - manages metadata for files and directories stored on the Ceph File System.
  • Standby - serves as a backup, and becomes active when an active MDS daemon becomes unresponsive.

By default, a Ceph File System uses only one active MDS daemon. However, systems with many clients benefit from multiple active MDS daemons.

You can configure the file system to use multiple active MDS daemons so that you can scale metadata performance for larger workloads. The active MDS daemons dynamically share the metadata workload when metadata load patterns change. Note that systems with multiple active MDS daemons still require standby MDS daemons to remain highly available.

What Happens When the Active MDS Daemon Fails

When the active MDS becomes unresponsive, a Ceph Monitor daemon waits a number of seconds equal to the value specified in the mds_beacon_grace option. If the active MDS is still unresponsive after the specified time period has passed, the Ceph Monitor marks the MDS daemon as laggy. One of the standby daemons becomes active, depending on the configuration.

Note

To change the value of mds_beacon_grace, add this option to the Ceph configuration file and specify the new value.
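
The timeout check described above can be sketched as a small shell function. This is a simplified model for illustration only, not how the Ceph Monitor is implemented; the function name is_laggy, its arguments, and the sample times are all hypothetical.

```shell
#!/bin/sh
# Simplified model of the beacon-grace check (illustrative only).
# is_laggy LAST_BEACON NOW GRACE
#   LAST_BEACON, NOW: times in whole seconds (for example, from `date +%s`)
#   GRACE: the mds_beacon_grace value in whole seconds
# Succeeds (exit status 0) when more than GRACE seconds have passed
# since the last beacon.
is_laggy() {
    last_beacon=$1
    now=$2
    grace=$3
    [ $(( now - last_beacon )) -gt "$grace" ]
}

# Hypothetical values: last beacon at t=100, checked at t=120, grace of 15 s.
if is_laggy 100 120 15; then
    echo "laggy"
else
    echo "responsive"
fi
```

On clusters that use the centralized configuration database, the option can usually also be changed with ceph config set, for example ceph config set global mds_beacon_grace 30. Treat the section name and the value as assumptions to verify against your release.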

2.3. Metadata Server ranks

Each Ceph File System (CephFS) has a number of ranks, one by default. Rank numbering starts at zero.

Ranks define how the metadata workload is shared between multiple Metadata Server (MDS) daemons. The number of ranks is the maximum number of MDS daemons that can be active at one time. Each MDS daemon handles a subset of the CephFS metadata that is assigned to that rank.

Each MDS daemon initially starts without a rank. The Ceph Monitor assigns a rank to the daemon. The MDS daemon can only hold one rank at a time. Daemons only lose ranks when they are stopped.

The max_mds setting controls how many ranks will be created.

The actual number of ranks in the CephFS is only increased if a spare daemon is available to accept the new rank.
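
The relationship between max_mds and the daemons that are actually available can be sketched as follows. This is a simplified model under the assumption that every running MDS daemon can take a rank; the helper name effective_ranks is hypothetical and the function is not part of Ceph.

```shell
#!/bin/sh
# Simplified model (an assumption, not Ceph source code): the number of
# ranks that can actually be filled is the smaller of max_mds and the
# number of MDS daemons available to hold a rank.
# effective_ranks MAX_MDS AVAILABLE_DAEMONS
effective_ranks() {
    max_mds=$1
    daemons=$2
    if [ "$daemons" -lt "$max_mds" ]; then
        echo "$daemons"
    else
        echo "$max_mds"
    fi
}

effective_ranks 2 3   # max_mds 2, three daemons: prints 2 (one daemon stays standby)
effective_ranks 3 2   # max_mds 3, two daemons: prints 2 (no spare daemon for rank 2)
```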

Rank States

Ranks can be:

  • Up - A rank that is assigned to an MDS daemon.
  • Failed - A rank that is not associated with any MDS daemon.
  • Damaged - A rank that is damaged; its metadata is corrupted or missing. Damaged ranks are not assigned to any MDS daemons until the operator fixes the problem, and uses the ceph mds repaired command on the damaged rank.
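
As a minimal sketch, the repair step for a damaged rank could be wrapped in a helper that validates the rank and prints the command for review before you run it. The helper name repaired_cmd is hypothetical, and the FS_NAME:RANK argument form is an assumption to confirm against the command help in your release.

```shell
#!/bin/sh
# Build the command that clears the damaged flag on one rank.
# repaired_cmd FS_NAME RANK   (hypothetical helper; prints, does not execute)
repaired_cmd() {
    fs_name=$1
    rank=$2
    case $rank in
        ''|*[!0-9]*) echo "rank must be a non-negative integer" >&2; return 1 ;;
    esac
    echo "ceph mds repaired ${fs_name}:${rank}"
}

# Example: rank 0 of a file system named cephfs.
repaired_cmd cephfs 0
```

Run the printed command only after the underlying metadata problem has been fixed.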

2.4. Metadata Server cache size limits

You can limit the size of the Ceph File System (CephFS) Metadata Server (MDS) cache by:

  • A memory limit: Use the mds_cache_memory_limit option. Red Hat recommends a value between 8 GB and 64 GB for mds_cache_memory_limit. Setting more cache can cause issues with recovery. This limit is approximately 66% of the desired maximum memory use of the MDS.

    Important

    Red Hat recommends using memory limits instead of inode count limits.

  • Inode count: Use the mds_cache_size option. By default, limiting the MDS cache by inode count is disabled.
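
The 66% rule of thumb for the memory limit can be worked through with a few lines of shell arithmetic. This is a sketch under stated assumptions: the helper name cache_limit_bytes and the 24 GiB target are hypothetical, and the target is given in GiB (1024-based units).

```shell
#!/bin/sh
# cache_limit_bytes TARGET_GIB
# Print roughly 66% of TARGET_GIB (whole GiB) in bytes, following the
# rule of thumb above. Integer arithmetic, so the result is rounded down.
cache_limit_bytes() {
    target_bytes=$(( $1 * 1024 * 1024 * 1024 ))
    echo $(( target_bytes * 66 / 100 ))
}

# Hypothetical target: the MDS should use at most about 24 GiB of memory.
limit=$(cache_limit_bytes 24)
echo "$limit"
# The value could then be applied with, for example:
#   ceph config set mds mds_cache_memory_limit "$limit"
```

For a 24 GiB target this prints 17008070492, which is about 17 GB and falls inside the recommended range.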

In addition, you can specify a cache reservation by using the mds_cache_reservation option for MDS operations. The cache reservation is expressed as a percentage of the memory or inode limit and is set to 5% by default. The intent of this parameter is to have the MDS maintain an extra reserve of memory for its cache for new metadata operations to use. As a consequence, the MDS should in general operate below its memory limit because it will recall old state from clients to drop unused metadata in its cache.

The mds_cache_reservation option replaces the mds_health_cache_threshold option in all situations, except when MDS nodes send a health alert to the Ceph Monitors indicating the cache is too large. By default, mds_health_cache_threshold is 150% of the maximum cache size.

Be aware that the cache limit is not a hard limit. Potential bugs in the CephFS client or MDS or misbehaving applications might cause the MDS to exceed its cache size. The mds_health_cache_threshold option configures the storage cluster health warning message, so that operators can investigate why the MDS cannot shrink its cache.
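
To make the default percentages concrete, the following sketch computes the cache size that the reservation leaves available and the size at which the health warning would appear. The helper names and the 8 GiB limit are hypothetical, and the model is a simplification of the behavior described above. The options themselves may be stored as fractions rather than percentages, so check the option descriptions in your release before setting them.

```shell
#!/bin/sh
# cache_target_bytes LIMIT_BYTES RESERVATION_PERCENT
# Size the MDS tries to stay under: the limit minus the reservation.
cache_target_bytes() {
    echo $(( $1 * (100 - $2) / 100 ))
}

# cache_warn_bytes LIMIT_BYTES THRESHOLD_PERCENT
# Cache size at which the health warning is raised.
cache_warn_bytes() {
    echo $(( $1 * $2 / 100 ))
}

limit=$(( 8 * 1024 * 1024 * 1024 ))   # hypothetical 8 GiB limit
cache_target_bytes "$limit" 5         # default 5% reservation
cache_warn_bytes "$limit" 150         # default 150% threshold
```

With an 8 GiB limit, the first call prints 8160437862 (about 7.6 GiB) and the second prints 12884901888 (12 GiB).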

2.5. File system affinity

You can configure a Ceph File System (CephFS) to prefer a particular Ceph Metadata Server (MDS) over another Ceph MDS. For example, you might have an MDS running on newer, faster hardware that you want to prefer over a standby MDS running on older, slower hardware. You can specify this preference by setting the mds_join_fs option, which enforces this file system affinity.

When a rank fails, Ceph Monitors give preference to MDS standby daemons whose mds_join_fs value equals the name of the file system with the failed rank. The standby-replay daemons are selected before choosing another standby daemon. If no standby daemon has mds_join_fs set to that file system name, then the Ceph Monitors choose an ordinary standby for the replacement, or any other available standby as a last resort. The Ceph Monitors periodically examine Ceph File Systems to see if a standby with a stronger affinity is available to replace a Ceph MDS that has a lower affinity.

2.6. Management of MDS service using the Ceph Orchestrator

As a storage administrator, you can use the Ceph Orchestrator with Cephadm in the backend to deploy the MDS service. By default, a Ceph File System (CephFS) uses only one active MDS daemon. However, systems with many clients benefit from multiple active MDS daemons.

2.6.1. Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to all the nodes.
  • Hosts are added to the cluster.
  • All manager, monitor, and OSD daemons are deployed.

2.6.2. Deploying the MDS service using the command line interface

Using the Ceph Orchestrator, you can deploy the Metadata Server (MDS) service using the placement specification in the command line interface. Ceph File System (CephFS) requires one or more MDS daemons.

Note

Ensure you have at least two pools, one for Ceph file system (CephFS) data and one for CephFS metadata.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Hosts are added to the cluster.
  • All manager, monitor, and OSD daemons are deployed.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. There are two ways of deploying MDS daemons using placement specification:

Method 1

  • Use ceph fs volume to create the MDS daemons. This creates the CephFS volume and pools associated with the CephFS, and also starts the MDS service on the hosts.

    Syntax

    ceph fs volume create FILESYSTEM_NAME --placement="NUMBER_OF_DAEMONS HOST_NAME_1 HOST_NAME_2 HOST_NAME_3"

    Note

    By default, replicated pools are created for this command.

    Example

    [ceph: root@host01 /]# ceph fs volume create test --placement="2 host01 host02"

Method 2

  • Create the pools, CephFS, and then deploy MDS service using placement specification:

    1. Create the pools for CephFS:

      Syntax

      ceph osd pool create DATA_POOL [PG_NUM]
      ceph osd pool create METADATA_POOL [PG_NUM]

      Example

      [ceph: root@host01 /]# ceph osd pool create cephfs_data 64
      [ceph: root@host01 /]# ceph osd pool create cephfs_metadata 64

      Typically, the metadata pool can start with a conservative number of Placement Groups (PGs) as it generally has far fewer objects than the data pool. It is possible to increase the number of PGs if needed. The pool sizes range from 64 PGs to 512 PGs. Size the data pool in proportion to the number and sizes of files you expect in the file system.

      Important

      For the metadata pool, consider using:

      • A higher replication level because any data loss to this pool can make the whole file system inaccessible.
      • Storage with lower latency such as Solid-State Drive (SSD) disks because this directly affects the observed latency of file system operations on clients.

    2. Create the file system for the data pools and metadata pools:

      Syntax

      ceph fs new FILESYSTEM_NAME METADATA_POOL DATA_POOL

      Example

      [ceph: root@host01 /]# ceph fs new test cephfs_metadata cephfs_data

    3. Deploy MDS service using the ceph orch apply command:

      Syntax

      ceph orch apply mds FILESYSTEM_NAME --placement="NUMBER_OF_DAEMONS HOST_NAME_1 HOST_NAME_2 HOST_NAME_3"

      Example

      [ceph: root@host01 /]# ceph orch apply mds test --placement="2 host01 host02"

Verification

  • List the service:

    Example

    [ceph: root@host01 /]# ceph orch ls

  • Check the CephFS status:

    Example

    [ceph: root@host01 /]# ceph fs ls
    [ceph: root@host01 /]# ceph fs status

  • List the hosts, daemons, and processes:

    Syntax

    ceph orch ps --daemon_type=DAEMON_NAME

    Example

    [ceph: root@host01 /]# ceph orch ps --daemon_type=mds

2.6.3. Deploying the MDS service using the service specification

Using the Ceph Orchestrator, you can deploy the MDS service using the service specification.

Note

Ensure you have at least two pools, one for the Ceph File System (CephFS) data and one for the CephFS metadata.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Hosts are added to the cluster.
  • All manager, monitor, and OSD daemons are deployed.

Procedure

  1. Create the mds.yaml file:

    Example

    [root@host01 ~]# touch mds.yaml

  2. Edit the mds.yaml file to include the following details:

    Syntax

    service_type: mds
    service_id: FILESYSTEM_NAME
    placement:
      hosts:
      - HOST_NAME_1
      - HOST_NAME_2
      - HOST_NAME_3

    Example

    service_type: mds
    service_id: fs_name
    placement:
      hosts:
      - host01
      - host02

  3. Mount the YAML file under a directory in the container:

    Example

    [root@host01 ~]# cephadm shell --mount mds.yaml:/var/lib/ceph/mds/mds.yaml

  4. Navigate to the directory:

    Example

    [ceph: root@host01 /]# cd /var/lib/ceph/mds/

  5. Deploy MDS service using service specification:

    Syntax

    ceph orch apply -i FILE_NAME.yaml

    Example

    [ceph: root@host01 mds]# ceph orch apply -i mds.yaml

  6. Once the MDS service is deployed and functional, create the CephFS:

    Syntax

    ceph fs new CEPHFS_NAME METADATA_POOL DATA_POOL

    Example

    [ceph: root@host01 /]# ceph fs new test metadata_pool data_pool

Verification

  • List the service:

    Example

    [ceph: root@host01 /]# ceph orch ls

  • List the hosts, daemons, and processes:

    Syntax

    ceph orch ps --daemon_type=DAEMON_NAME

    Example

    [ceph: root@host01 /]# ceph orch ps --daemon_type=mds
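
The first two steps of the procedure above, creating and filling in mds.yaml, could be combined into one small helper. This is a minimal sketch; the function name write_mds_spec is hypothetical, and it emits only the fields shown in the example specification.

```shell
#!/bin/sh
# write_mds_spec FILESYSTEM_NAME HOST...
# Print an MDS service specification for the given file system and hosts.
write_mds_spec() {
    fs_name=$1
    shift
    printf 'service_type: mds\n'
    printf 'service_id: %s\n' "$fs_name"
    printf 'placement:\n'
    printf '  hosts:\n'
    for host in "$@"; do
        printf '  - %s\n' "$host"
    done
}

# Reproduce the example specification on standard output.
# Redirect to a file to save it: write_mds_spec fs_name host01 host02 > mds.yaml
write_mds_spec fs_name host01 host02
```

Generating the file this way keeps the indentation consistent, which matters because the specification is YAML.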

2.6.4. Removing the MDS service using the Ceph Orchestrator

You can remove the service using the ceph orch rm command. Alternatively, you can remove the file system and the associated pools.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to all the nodes.
  • Hosts are added to the cluster.
  • At least one MDS daemon deployed on the hosts.

Procedure

  • There are two ways of removing MDS daemons from the cluster:

Method 1

  • Remove the CephFS volume, associated pools, and the services:

    1. Log into the Cephadm shell:

      Example

      [root@host01 ~]# cephadm shell

    2. Set the configuration parameter mon_allow_pool_delete to true:

      Example

      [ceph: root@host01 /]# ceph config set mon mon_allow_pool_delete true

    3. Remove the file system:

      Syntax

      ceph fs volume rm FILESYSTEM_NAME --yes-i-really-mean-it

      Example

      [ceph: root@host01 /]# ceph fs volume rm cephfs-new --yes-i-really-mean-it

      This command will remove the file system, its data, and metadata pools. It also tries to remove the MDS using the enabled ceph-mgr Orchestrator module.

Method 2

  • Use the ceph orch rm command to remove the MDS service from the entire cluster:

    1. List the service:

      Example

      [ceph: root@host01 /]# ceph orch ls

    2. Remove the service:

      Syntax

      ceph orch rm SERVICE_NAME

      Example

      [ceph: root@host01 /]# ceph orch rm mds.test

Verification

  • List the hosts, daemons, and processes:

    Syntax

    ceph orch ps

    Example

    [ceph: root@host01 /]# ceph orch ps

2.7. Configuring file system affinity

Set the Ceph File System (CephFS) affinity for a particular Ceph Metadata Server (MDS).

Prerequisites

  • A healthy and running Ceph File System.
  • Root-level access to a Ceph Monitor node.

Procedure

  1. Check the current state of a Ceph File System:

    Example

    [root@mon ~]# ceph fs dump
    dumped fsmap epoch 399
    ...
    Filesystem 'cephfs01' (27)
    ...
    e399
    max_mds 1
    in      0
    up      {0=20384}
    failed
    damaged
    stopped
    ...
    [mds.a{0:20384} state up:active seq 239 addr [v2:127.0.0.1:6854/966242805,v1:127.0.0.1:6855/966242805]]
    
    Standby daemons:
    
    [mds.b{-1:10420} state up:standby seq 2 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]

  2. Set the file system affinity:

    Syntax

    ceph config set STANDBY_DAEMON mds_join_fs FILE_SYSTEM_NAME

    Example

    [root@mon ~]# ceph config set mds.b mds_join_fs cephfs01

    After a Ceph MDS failover event, the file system favors the standby daemon for which the affinity is set.

    Example

    [root@mon ~]# ceph fs dump
    dumped fsmap epoch 405
    e405
    ...
    Filesystem 'cephfs01' (27)
    ...
    max_mds 1
    in      0
    up      {0=10420}
    failed
    damaged
    stopped
    ...
    [mds.b{0:10420} state up:active seq 274 join_fscid=27 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]
    
    Standby daemons:
    
    [mds.a{-1:10720} state up:standby seq 2 addr [v2:127.0.0.1:6854/1340357658,v1:127.0.0.1:6855/1340357658]]

    The mds.b daemon now has the join_fscid=27 in the file system dump output.

    Important

    If a file system is in a degraded or undersized state, then no failover will occur to enforce the file system affinity.

Additional Resources

  • See the File system affinity section in the Red Hat Ceph Storage File System Guide for more details.

2.8. Configuring multiple active Metadata Server daemons

Configure multiple active Metadata Server (MDS) daemons to scale metadata performance for large systems.

Important

Do not convert all standby MDS daemons to active ones. A Ceph File System (CephFS) requires at least one standby MDS daemon to remain highly available.

Prerequisites

  • Ceph administration capabilities on the MDS node.
  • Root-level access to a Ceph Monitor node.

Procedure

  1. Set the max_mds parameter to the desired number of active MDS daemons:

    Syntax

    ceph fs set NAME max_mds NUMBER

    Example

    [root@mon ~]# ceph fs set cephfs max_mds 2

    This example increases the number of active MDS daemons to two in the CephFS called cephfs.

    Note

    Ceph only increases the actual number of ranks in the CephFS if a spare MDS daemon is available to take the new rank.

  2. Verify the number of active MDS daemons:

    Syntax

    ceph fs status NAME

    Example

    [root@mon ~]# ceph fs status cephfs
    cephfs - 0 clients
    ======
    +------+--------+-------+---------------+-------+-------+--------+--------+
    | RANK | STATE  |  MDS  |    ACTIVITY   |  DNS  |  INOS |  DIRS  |  CAPS  |
    +------+--------+-------+---------------+-------+-------+--------+--------+
    |  0   | active | node1 | Reqs:    0 /s |   10  |   12  |   12   |   0    |
    |  1   | active | node2 | Reqs:    0 /s |   10  |   12  |   12   |   0    |
    +------+--------+-------+---------------+-------+-------+--------+--------+
    +-----------------+----------+-------+-------+
    |       POOL      |   TYPE   |  USED | AVAIL |
    +-----------------+----------+-------+-------+
    | cephfs_metadata | metadata | 4638  | 26.7G |
    |   cephfs_data   |   data   |    0  | 26.7G |
    +-----------------+----------+-------+-------+
    
    +-------------+
    | STANDBY MDS |
    +-------------+
    |    node3    |
    +-------------+

2.9. Configuring the number of standby daemons

Each Ceph File System (CephFS) can specify the required number of standby daemons to be considered healthy. This number also includes the standby-replay daemon waiting for a rank failure.

Prerequisites

  • Root-level access to a Ceph Monitor node.

Procedure

  • Set the expected number of standby daemons for a particular CephFS:

    Syntax

    ceph fs set FS_NAME standby_count_wanted NUMBER

    Note

    Setting the NUMBER to zero disables the daemon health check.

    Example

    [root@mon ~]# ceph fs set cephfs standby_count_wanted 2

    This example sets the expected standby daemon count to two.

2.10. Configuring the standby-replay Metadata Server

Configure each Ceph File System (CephFS) by adding a standby-replay Metadata Server (MDS) daemon. Doing this reduces failover time if the active MDS becomes unavailable.

This specific standby-replay daemon follows the active MDS’s metadata journal. The standby-replay daemon is only used by the active MDS of the same rank, and is not available to other ranks.

Important

If using standby-replay, then every active MDS must have a standby-replay daemon.

Prerequisites

  • Root-level access to a Ceph Monitor node.

Procedure

  • Set the standby-replay for a particular CephFS:

    Syntax

    ceph fs set FS_NAME allow_standby_replay 1

    Example

    [root@mon ~]# ceph fs set cephfs allow_standby_replay 1

    In this example, the Boolean value is 1, which enables the standby-replay daemons to be assigned to the active Ceph MDS daemons.

2.11. Ephemeral pinning policies

An ephemeral pin is a static partition of subtrees, and can be set with a policy using extended attributes. A policy can automatically set ephemeral pins on directories. When an ephemeral pin is set on a directory, the directory is automatically assigned to a particular rank so that pinned directories are uniformly distributed across all Ceph MDS ranks. The rank is determined by a consistent hash of the directory’s inode number. Ephemeral pins do not persist when the directory’s inode is dropped from the file system cache. When failing over a Ceph Metadata Server (MDS), the ephemeral pin is recorded in its journal so the Ceph MDS standby server does not lose this information.

Note: Installation of the attr package is a prerequisite for the ephemeral pinning policies.

There are two types of policies for using ephemeral pins:

Distributed

This policy enforces that all of a directory’s immediate children must be ephemerally pinned. For example, use a distributed policy to spread a user’s home directory across the entire Ceph File System cluster. Enable this policy by setting the ceph.dir.pin.distributed extended attribute.

setfattr -n ceph.dir.pin.distributed -v 1 DIRECTORY_PATH

Random

This policy enforces a chance that any descendant subdirectory might be ephemerally pinned. You can customize the percentage of directories that can be ephemerally pinned. Enable this policy by setting the ceph.dir.pin.random extended attribute to a percentage written as a decimal fraction. Red Hat recommends setting this percentage to a value smaller than 1% (0.01). Having too many subtree partitions can cause slow performance. You can set the maximum percentage by setting the mds_export_ephemeral_random_max Ceph MDS configuration option. The parameters mds_export_ephemeral_distributed and mds_export_ephemeral_random are already enabled.

setfattr -n ceph.dir.pin.random -v PERCENTAGE DIRECTORY_PATH
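
Before applying the random policy, the chosen value can be checked against the recommendation above. The following is a minimal sketch: the helper name check_random_pin and the 0.001 value are hypothetical, and DIRECTORY_PATH remains a placeholder for a directory on a mounted CephFS.

```shell
#!/bin/sh
# check_random_pin FRACTION
# Succeeds when FRACTION is greater than 0 and smaller than 0.01 (1%),
# the bound recommended above. awk is used because POSIX shell
# arithmetic has no floating point.
check_random_pin() {
    awk -v f="$1" 'BEGIN { exit !(f + 0 > 0 && f + 0 < 0.01) }'
}

fraction=0.001   # hypothetical: pin about 0.1% of descendant directories
if check_random_pin "$fraction"; then
    # Print the command for review; replace DIRECTORY_PATH before running it.
    echo "setfattr -n ceph.dir.pin.random -v $fraction DIRECTORY_PATH"
else
    echo "fraction $fraction is outside the recommended range" >&2
fi
```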

2.12. Manually pinning directory trees to a particular rank

Sometimes it might be desirable to override the dynamic balancer with explicit mappings of metadata to a particular Ceph Metadata Server (MDS) rank. You can do this manually to evenly spread the load of an application or to limit the impact of users' metadata requests on the Ceph File System cluster. A manual pin on a directory is also known as an export pin, and you set it with the ceph.dir.pin extended attribute.

A directory’s export pin is inherited from its closest parent directory, but can be overwritten by setting an export pin on that directory. Setting an export pin on a directory affects all of its sub-directories, for example:

[root@client ~]# mkdir -p a/b
[root@client ~]# setfattr -n ceph.dir.pin -v 1 a/
[root@client ~]# setfattr -n ceph.dir.pin -v 0 a/b

  1. After the mkdir command, directories a/ and a/b both start without an export pin set.
  2. After the first setfattr command, directories a/ and a/b are pinned to rank 1.
  3. After the second setfattr command, directory a/b is pinned to rank 0, and directory a/ and the rest of its sub-directories are still pinned to rank 1.
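
The inheritance rule described above, where the closest pinned parent wins, can be modeled with a short shell function. This is an illustrative sketch that works on a list of directory=rank pairs rather than reading extended attributes from a live file system; the helper name effective_pin and the use of -1 for "no export pin" are conventions of this sketch.

```shell
#!/bin/sh
# effective_pin TARGET PIN...
# Each PIN is "directory=rank". Walk from TARGET up through its parent
# directories and print the rank of the closest directory that has a pin,
# or -1 when neither TARGET nor any parent is pinned.
effective_pin() {
    target=${1%/}
    shift
    while [ -n "$target" ]; do
        for pin in "$@"; do
            dir=${pin%%=*}
            if [ "${dir%/}" = "$target" ]; then
                echo "${pin#*=}"
                return 0
            fi
        done
        case $target in
            */*) target=${target%/*} ;;   # strip the last path component
            *)   target='' ;;             # reached the top of the relative path
        esac
    done
    echo "-1"
}

# The pins from the example above: a/ on rank 1, a/b on rank 0.
effective_pin a/b/c a=1 a/b=0   # prints 0 (inherited from a/b)
effective_pin a/d   a=1 a/b=0   # prints 1 (inherited from a/)
effective_pin e     a=1 a/b=0   # prints -1 (no pin on e or its parents)
```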

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • A running Ceph File System.
  • Root-level access to the CephFS client.
  • Installation of the attr package.

Procedure

  • Set the export pin on a directory:

    Syntax

    setfattr -n ceph.dir.pin -v RANK PATH_TO_DIRECTORY

    Example

    [root@client ~]# setfattr -n ceph.dir.pin -v 2 cephfs/home

Additional Resources

  • See the Ephemeral pinning policies section in the Red Hat Ceph Storage File System Guide for details on automatically setting pins.

2.13. Decreasing the number of active Metadata Server daemons

Decrease the number of active Ceph File System (CephFS) Metadata Server (MDS) daemons.

Prerequisites

  • The rank that you will remove must be active first, meaning that you must have the same number of MDS daemons as specified by the max_mds parameter.
  • Root-level access to a Ceph Monitor node.

Procedure

  1. Verify that the number of active MDS daemons is the same as specified by the max_mds parameter:

    Syntax

    ceph fs status NAME

    Example

    [root@mon ~]# ceph fs status cephfs
    cephfs - 0 clients
    
    +------+--------+-------+---------------+-------+-------+--------+--------+
    | RANK | STATE  |  MDS  |    ACTIVITY   |  DNS  |  INOS |  DIRS  |  CAPS  |
    +------+--------+-------+---------------+-------+-------+--------+--------+
    |  0   | active | node1 | Reqs:    0 /s |   10  |   12  |   12   |   0    |
    |  1   | active | node2 | Reqs:    0 /s |   10  |   12  |   12   |   0    |
    +------+--------+-------+---------------+-------+-------+--------+--------+
    +-----------------+----------+-------+-------+
    |       POOL      |   TYPE   |  USED | AVAIL |
    +-----------------+----------+-------+-------+
    | cephfs_metadata | metadata | 4638  | 26.7G |
    |   cephfs_data   |   data   |    0  | 26.7G |
    +-----------------+----------+-------+-------+
    
    +-------------+
    | Standby MDS |
    +-------------+
    |    node3    |
    +-------------+

  2. On a node with administration capabilities, change the max_mds parameter to the desired number of active MDS daemons:

    Syntax

    ceph fs set NAME max_mds NUMBER

    Example

    [root@mon ~]# ceph fs set cephfs max_mds 1

  3. Wait for the storage cluster to stabilize to the new max_mds value by watching the Ceph File System status.
  4. Verify the number of active MDS daemons:

    Syntax

    ceph fs status NAME

    Example

    [root@mon ~]# ceph fs status cephfs
    cephfs - 0 clients
    
    +------+--------+-------+---------------+-------+-------+--------+--------+
    | RANK | STATE  |  MDS  |    ACTIVITY   |  DNS  |  INOS |  DIRS  |  CAPS  |
    +------+--------+-------+---------------+-------+-------+--------+--------+
    |  0   | active | node1 | Reqs:    0 /s |   10  |   12  |   12   |   0    |
    +------+--------+-------+---------------+-------+-------+--------+--------+
    +-----------------+----------+-------+-------+
    |       POOL      |   TYPE   |  USED | AVAIL |
    +-----------------+----------+-------+-------+
    | cephfs_metadata | metadata | 4638  | 26.7G |
    |   cephfs_data   |   data   |    0  | 26.7G |
    +-----------------+----------+-------+-------+
    
    +-------------+
    | Standby MDS |
    +-------------+
    |    node3    |
    |    node2    |
    +-------------+
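
Step 3 of the procedure above, waiting for the storage cluster to stabilize, can be scripted by counting the active ranks in the status output. This is a sketch under the assumption that ceph fs status prints the bordered table layout shown in the examples; other releases may format the output differently. The function names are hypothetical, and the polling loop has no timeout.

```shell
#!/bin/sh
# count_active_ranks
# Read `ceph fs status` table output on standard input and print the
# number of rows whose STATE column is "active".
count_active_ranks() {
    awk -F'|' 'NF > 3 { gsub(/ /, "", $3); if ($3 == "active") n++ } END { print n + 0 }'
}

# wait_for_active_ranks FS_NAME WANTED [INTERVAL_SECONDS]
# Poll until the number of active ranks equals WANTED (default: every 5 s).
wait_for_active_ranks() {
    while [ "$(ceph fs status "$1" | count_active_ranks)" -ne "$2" ]; do
        sleep "${3:-5}"
    done
}

# Demonstration on a fragment of the sample output above (no cluster needed).
count_active_ranks <<'EOF'
| RANK | STATE  |  MDS  |    ACTIVITY   |  DNS  |  INOS |  DIRS  |  CAPS  |
|  0   | active | node1 | Reqs:    0 /s |   10  |   12  |   12   |   0    |
|  1   | active | node2 | Reqs:    0 /s |   10  |   12  |   12   |   0    |
EOF
```

On a live cluster, wait_for_active_ranks cephfs 1 would return once a single active rank remains.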

2.14. Additional Resources