Chapter 2. The Ceph File System Metadata Server

As a storage administrator, you can learn about the different states of the Ceph File System (CephFS) Metadata Server (MDS), CephFS MDS ranking mechanics, how to configure the MDS standby daemon, and cache size limits. Knowing these concepts enables you to configure the MDS daemons for a storage environment.

2.1. Prerequisites

  • A running and healthy Red Hat Ceph Storage cluster.
  • Installation of the Ceph Metadata Server daemons (ceph-mds).

2.2. Metadata Server daemon states

The Metadata Server (MDS) daemons operate in two states:

  • Active - manages metadata for files and directories stored on the Ceph File System.
  • Standby - serves as a backup, and becomes active when an active MDS daemon becomes unresponsive.

By default, a Ceph File System uses only one active MDS daemon. However, systems with many clients benefit from multiple active MDS daemons.

You can configure the file system to use multiple active MDS daemons so that you can scale metadata performance for larger workloads. The active MDS daemons dynamically share the metadata workload when metadata load patterns change. Note that systems with multiple active MDS daemons still require standby MDS daemons to remain highly available.

What Happens When the Active MDS Daemon Fails

When the active MDS becomes unresponsive, a Ceph Monitor daemon waits a number of seconds equal to the value specified in the mds_beacon_grace option. If the active MDS is still unresponsive after the specified time period has passed, the Ceph Monitor marks the MDS daemon as laggy. One of the standby daemons becomes active, depending on the configuration.

Note

To change the value of mds_beacon_grace, add this option to the Ceph configuration file and specify the new value.

2.3. Metadata Server ranks

Each Ceph File System (CephFS) has a number of ranks, one by default. Rank numbering starts at zero.

Ranks define how the metadata workload is shared between multiple Metadata Server (MDS) daemons. The number of ranks is the maximum number of MDS daemons that can be active at one time. Each active MDS daemon handles the subset of the CephFS metadata that is assigned to its rank.

Each MDS daemon initially starts without a rank. The Ceph Monitor assigns a rank to the daemon. The MDS daemon can only hold one rank at a time. Daemons only lose ranks when they are stopped.

The max_mds setting controls how many ranks will be created.

The actual number of ranks in the CephFS is only increased if a spare daemon is available to accept the new rank.
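
The relationship between max_mds, the available daemons, and the resulting ranks can be sketched as simple arithmetic. The following shell function is a simplified model and not part of Ceph; the name mds_rank_plan and its arguments are hypothetical. It assumes that every running MDS daemon can accept a rank.

```shell
#!/bin/sh
# Sketch only: estimate how many ranks become active, and how many daemons
# remain on standby, for a given max_mds and number of running MDS daemons.
# mds_rank_plan is a hypothetical helper, not a Ceph command.
mds_rank_plan() {
    max_mds=$1      # value of the max_mds setting
    daemons=$2      # MDS daemons that are running and able to take a rank

    # A rank is only created when a spare daemon can accept it.
    if [ "$daemons" -lt "$max_mds" ]; then
        active=$daemons
    else
        active=$max_mds
    fi
    standby=$(( daemons - active ))
    echo "active=$active standby=$standby"
}

mds_rank_plan 2 3   # prints: active=2 standby=1
mds_rank_plan 4 3   # prints: active=3 standby=0
```

In the second call, max_mds asks for four ranks but only three daemons exist, so only three ranks are created and no standby daemon remains.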

Rank States

Ranks can be:

  • Up - A rank that is assigned to an MDS daemon.
  • Failed - A rank that is not associated with any MDS daemon.
  • Damaged - A rank that is damaged; its metadata is corrupted or missing. Damaged ranks are not assigned to any MDS daemons until the operator fixes the problem and uses the ceph mds repaired command on the damaged rank.

2.4. Metadata Server cache size limits

You can limit the size of the Ceph File System (CephFS) Metadata Server (MDS) cache by:

  • A memory limit: Use the mds_cache_memory_limit option. IMPORTANT: Red Hat recommends using memory limits instead of inode count limits.
  • Inode count: Use the mds_cache_size option. By default, limiting the MDS cache by inode count is disabled.

In addition, you can specify a cache reservation for MDS operations by using the mds_cache_reservation option. The cache reservation is expressed as a percentage of the memory or inode limit and is set to 5% by default. The intent of this parameter is to have the MDS maintain an extra reserve of memory in its cache for new metadata operations to use. As a consequence, the MDS should in general operate below its memory limit, because it recalls old state from clients in order to drop unused metadata from its cache.

The mds_cache_reservation option replaces the mds_health_cache_threshold option in all situations, except when MDS nodes send a health alert to the Ceph Monitors indicating that the cache is too large. By default, mds_health_cache_threshold is 150% of the maximum cache size.

Be aware that the cache limit is not a hard limit. Potential bugs in the CephFS client or MDS, or misbehaving applications, might cause the MDS to exceed its cache size. The mds_health_cache_threshold option configures the storage cluster health warning message, so that operators can investigate why the MDS cannot shrink its cache.
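
To make the percentages concrete, the following sketch derives the working numbers implied by a memory limit. It is plain shell arithmetic and does not query a cluster. The function name mds_cache_plan and the 4 GiB limit are illustrative assumptions; the 5% and 150% values are the defaults quoted above.

```shell
#!/bin/sh
# Sketch only: derive the reserve, the approximate operating target, and the
# health warning point from an MDS cache memory limit.
# mds_cache_plan is a hypothetical helper, not a Ceph command.
mds_cache_plan() {
    limit_bytes=$1            # value given to mds_cache_memory_limit
    reservation_pct=${2:-5}   # mds_cache_reservation, as a percentage
    threshold_pct=${3:-150}   # mds_health_cache_threshold, as a percentage

    reserve=$(( limit_bytes * reservation_pct / 100 ))
    target=$(( limit_bytes - reserve ))
    warn_at=$(( limit_bytes * threshold_pct / 100 ))

    echo "reserve=$reserve target=$target warn_at=$warn_at"
}

# Example with an assumed 4 GiB limit (4294967296 bytes).
mds_cache_plan $(( 4 * 1024 * 1024 * 1024 ))
# prints: reserve=214748364 target=4080218932 warn_at=6442450944
```

Under these assumptions, the MDS tries to stay near 3.8 GiB of cache, and the health warning is raised only when the cache grows to about 6 GiB.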

2.5. Configuring multiple active Metadata Server daemons

Configure multiple active Metadata Server (MDS) daemons to scale metadata performance for large systems.

Important

Do not convert all standby MDS daemons to active ones. A Ceph File System (CephFS) requires at least one standby MDS daemon to remain highly available.

Important

The scrubbing process is not currently supported when multiple active MDS daemons are configured.

Prerequisites

  • Ceph administration capabilities on the MDS node.

Procedure

  1. Set the max_mds parameter to the desired number of active MDS daemons:

    Syntax

    ceph fs set NAME max_mds NUMBER

    Example

    [root@mon ~]# ceph fs set cephfs max_mds 2

    This example increases the number of active MDS daemons to two in the CephFS called cephfs.

    Note

    Ceph only increases the actual number of ranks in the CephFS if a spare MDS daemon is available to take the new rank.

  2. Verify the number of active MDS daemons:

    Syntax

    ceph fs status NAME

    Example

    [root@mon ~]# ceph fs status cephfs
    cephfs - 0 clients
    ======
    +------+--------+-------+---------------+-------+-------+
    | Rank | State  |  MDS  |    Activity   |  dns  |  inos |
    +------+--------+-------+---------------+-------+-------+
    |  0   | active | node1 | Reqs:    0 /s |   10  |   12  |
    |  1   | active | node2 | Reqs:    0 /s |   10  |   12  |
    +------+--------+-------+---------------+-------+-------+
    +-----------------+----------+-------+-------+
    |       Pool      |   type   |  used | avail |
    +-----------------+----------+-------+-------+
    | cephfs_metadata | metadata | 4638  | 26.7G |
    |   cephfs_data   |   data   |    0  | 26.7G |
    +-----------------+----------+-------+-------+
    
    +-------------+
    | Standby MDS |
    +-------------+
    |    node3    |
    +-------------+
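
When the verification needs to be scripted, the active ranks can be counted from the status output. The following sketch assumes the table layout shown in the example above; other releases may print a different layout, in which case the pattern needs adjusting. The function name count_active_ranks is hypothetical.

```shell
#!/bin/sh
# Sketch only: count the rows of `ceph fs status` table output whose State
# column reads "active". Reads the status text on standard input.
# count_active_ranks is a hypothetical helper, not a Ceph command.
count_active_ranks() {
    awk -F'|' 'NF > 3 { gsub(/ /, "", $3); if ($3 == "active") n++ }
               END { print n + 0 }'
}

# Demonstration with rows copied from the example output above.
count_active_ranks <<'EOF'
|  0   | active | node1 | Reqs:    0 /s |   10  |   12  |
|  1   | active | node2 | Reqs:    0 /s |   10  |   12  |
|    node3    |
EOF
# prints: 2

# On a live cluster, assuming the file system is named cephfs:
#   ceph fs status cephfs | count_active_ranks
```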

2.6. Configuring the number of standby daemons

Each Ceph File System (CephFS) can specify the number of standby daemons that it requires to be considered healthy. This number also includes the standby-replay daemon waiting for a rank failure.

Prerequisites

  • User access to the Ceph Monitor node.

Procedure

  1. Set the expected number of standby daemons for a particular CephFS:

    Syntax

    ceph fs set FS_NAME standby_count_wanted NUMBER

    Note

    Setting the NUMBER to zero disables the daemon health check.

    Example

    [root@mon ~]# ceph fs set cephfs standby_count_wanted 2

    This example sets the expected standby daemon count to two.
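
The effect of standby_count_wanted on the health check can be modeled as a comparison between the wanted and the available standby daemons. The following sketch is a simplified model of that behavior, not Ceph code; the function name standby_health and its messages are hypothetical.

```shell
#!/bin/sh
# Sketch only: model of the standby daemon health check described above.
# standby_health is a hypothetical helper, not a Ceph command.
standby_health() {
    wanted=$1       # standby_count_wanted for the file system
    available=$2    # standby and standby-replay daemons currently available

    if [ "$wanted" -eq 0 ]; then
        # A value of zero disables the check.
        echo "check disabled"
    elif [ "$available" -lt "$wanted" ]; then
        echo "warn: $(( wanted - available )) more standby daemon(s) wanted"
    else
        echo "ok"
    fi
}

standby_health 2 1   # prints: warn: 1 more standby daemon(s) wanted
standby_health 2 2   # prints: ok
standby_health 0 0   # prints: check disabled
```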

2.7. Configuring the standby-replay Metadata Server

You can configure each Ceph File System (CephFS) with a standby-replay Metadata Server (MDS) daemon. Doing this reduces failover time if the active MDS becomes unavailable.

This specific standby-replay daemon follows the active MDS’s metadata journal. The standby-replay daemon is only used by the active MDS of the same rank, and is not available to other ranks.

Important

If using standby-replay, then every active MDS must have a standby-replay daemon.

Prerequisites

  • User access to the Ceph Monitor node.

Procedure

  1. Enable standby-replay for a particular CephFS:

    Syntax

    ceph fs set FS_NAME allow_standby_replay 1

    Example

    [root@mon ~]# ceph fs set cephfs allow_standby_replay 1

    In this example, the Boolean value is 1, which enables the standby-replay daemons to be assigned to the active Ceph MDS daemons.

    Note

    Setting the allow_standby_replay Boolean value back to 0 only prevents new standby-replay daemons from being assigned. To also stop the running daemons, mark them as failed with the ceph mds fail command.

2.8. Decreasing the number of active Metadata Server daemons

Decrease the number of active Ceph File System (CephFS) Metadata Server (MDS) daemons.

Prerequisites

  • The rank that you will remove must be active first, meaning that you must have the same number of MDS daemons as specified by the max_mds parameter.

Procedure

  1. Verify that the number of active MDS daemons matches the value specified by the max_mds parameter:

    Syntax

    ceph fs status NAME

    Example

    [root@mon ~]# ceph fs status cephfs
    cephfs - 0 clients
    
    +------+--------+-------+---------------+-------+-------+
    | Rank | State  |  MDS  |    Activity   |  dns  |  inos |
    +------+--------+-------+---------------+-------+-------+
    |  0   | active | node1 | Reqs:    0 /s |   10  |   12  |
    |  1   | active | node2 | Reqs:    0 /s |   10  |   12  |
    +------+--------+-------+---------------+-------+-------+
    +-----------------+----------+-------+-------+
    |       Pool      |   type   |  used | avail |
    +-----------------+----------+-------+-------+
    | cephfs_metadata | metadata | 4638  | 26.7G |
    |   cephfs_data   |   data   |    0  | 26.7G |
    +-----------------+----------+-------+-------+
    
    +-------------+
    | Standby MDS |
    +-------------+
    |    node3    |
    +-------------+

  2. On a node with administration capabilities, change the max_mds parameter to the desired number of active MDS daemons:

    Syntax

    ceph fs set NAME max_mds NUMBER

    Example

    [root@mon ~]# ceph fs set cephfs max_mds 1

  3. Wait for the storage cluster to stabilize to the new max_mds value by watching the Ceph File System status.
  4. Verify the number of active MDS daemons:

    Syntax

    ceph fs status NAME

    Example

    [root@mon ~]# ceph fs status cephfs
    cephfs - 0 clients
    
    +------+--------+-------+---------------+-------+-------+
    | Rank | State  |  MDS  |    Activity   |  dns  |  inos |
    +------+--------+-------+---------------+-------+-------+
    |  0   | active | node1 | Reqs:    0 /s |   10  |   12  |
    +------+--------+-------+---------------+-------+-------+
    +-----------------+----------+-------+-------+
    |       Pool      |   type   |  used | avail |
    +-----------------+----------+-------+-------+
    | cephfs_metadata | metadata | 4638  | 26.7G |
    |   cephfs_data   |   data   |    0  | 26.7G |
    +-----------------+----------+-------+-------+
    
    +-------------+
    | Standby MDS |
    +-------------+
    |    node3    |
    |    node2    |
    +-------------+
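
The wait in step 3 can be scripted by polling the file system status until the number of active ranks matches the new max_mds value. The following sketch assumes the table layout shown in the examples above; other releases may print a different layout. The function name wait_for_active_ranks, the retry count, and the delay are hypothetical choices.

```shell
#!/bin/sh
# Sketch only: poll `ceph fs status` until the number of active ranks equals
# the expected value, or give up after a number of attempts.
# wait_for_active_ranks is a hypothetical helper, not a Ceph command.
wait_for_active_ranks() {
    fs_name=$1          # file system name, for example cephfs
    target=$2           # expected number of active MDS daemons
    tries=${3:-30}      # polling attempts before giving up (assumed default)
    delay=${4:-2}       # seconds between attempts (assumed default)

    while [ "$tries" -gt 0 ]; do
        # Count table rows whose State column reads "active".
        active=$(ceph fs status "$fs_name" |
            awk -F'|' 'NF > 3 { gsub(/ /, "", $3); if ($3 == "active") n++ }
                       END { print n + 0 }')
        if [ "$active" -eq "$target" ]; then
            echo "active ranks: $active"
            return 0
        fi
        tries=$(( tries - 1 ))
        sleep "$delay"
    done
    echo "timed out waiting for $target active ranks" >&2
    return 1
}

# Usage after lowering max_mds to 1 on the file system named cephfs:
#   wait_for_active_ranks cephfs 1
```

The function returns a non-zero status on timeout, so a calling script can stop before it continues with further changes.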

2.9. Additional Resources