Ceph: 1 MDSs behind on trimming and MDS stuck in "Client Replay".

Solution Verified - Updated -

Issue

1 MDSs behind on trimming and MDS stuck in Client Replay.

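If the MDS is persistently stuck in Client Replay, CephFS will not service any requests. The status will look like the output below. Do not act on a transient Client Replay status: an MDS can pass through this state briefly while recovering, so confirm that the state persists before proceeding.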

Other symptoms are a persistent absence of Ops In Flight on the MDS, and session ls output that shows only completed requests.

System Status:

$ ceph status
  cluster:
    id:     e8abfxxx-Redacted-Cluster-FSID-yyy68a61ce18
    health: HEALTH_WARN
            2 MDSs behind on trimming

  services:
    mon: 3 daemons, quorum a,b,c (age 6w)
    mgr: a(active, since 6w)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 8d), 3 in (since 2y)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 6.03M objects, 1.3 TiB
    usage:   4.0 TiB used, 2.0 TiB / 6.0 TiB avail
    pgs:     97 active+clean

  io:
    client:   22 MiB/s rd, 28 MiB/s wr, 544 op/s rd, 1.08k op/s wr

$ ceph health detail
HEALTH_WARN 1 filesystem is degraded; 2 clients failing to respond to cache pressure; 1 MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] MDS_CLIENT_RECALL: 2 clients failing to respond to cache pressure
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Client ip-10-2-103-131:csi-cephfs-node failing to respond to cache pressure client_id: 31330220
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Client ip-10-2-114-50:csi-cephfs-node failing to respond to cache pressure client_id: 34512838
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Behind on trimming (2026/256) max_segments: 256, num_segments: 2026
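
The two warnings above carry the values needed for the later checks: the trimming backlog and the client IDs to look up in session ls. A minimal sketch of extracting them, assuming the `ceph health detail` text has been captured to a string and that its lines have the same shape as this example. The helper name and the regular expressions are illustrative only and are not part of any Ceph tooling.

```python
import re

# Hypothetical patterns written against the exact line shapes shown above.
TRIM_RE = re.compile(r"Behind on trimming \((\d+)/(\d+)\)")
CLIENT_RE = re.compile(r"failing to respond to cache pressure client_id: (\d+)")


def parse_health_detail(text):
    """Return the trimming backlog and the client IDs found in health detail text."""
    trim = None
    match = TRIM_RE.search(text)
    if match:
        # The pair is printed as (num_segments/max_segments).
        trim = {
            "num_segments": int(match.group(1)),
            "max_segments": int(match.group(2)),
        }
    client_ids = [int(cid) for cid in CLIENT_RE.findall(text)]
    return {"trim": trim, "client_ids": client_ids}


sample = """\
[WRN] MDS_CLIENT_RECALL: 2 clients failing to respond to cache pressure
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Client ip-10-2-103-131:csi-cephfs-node failing to respond to cache pressure client_id: 31330220
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Client ip-10-2-114-50:csi-cephfs-node failing to respond to cache pressure client_id: 34512838
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Behind on trimming (2026/256) max_segments: 256, num_segments: 2026
"""

result = parse_health_detail(sample)
print(result["trim"])
print(result["client_ids"])
```

In this example the journal holds 2026 segments against a limit of 256, and the client IDs to search for are 31330220 and 34512838.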

MDS Status:

$ ceph tell mds.ocs-storagecluster-cephfilesystem:0 status
{
    "cluster_fsid": "XXX",
    "whoami": 0,
    "id": 19987341,
    "want_state": "up:clientreplay",
    "state": "up:clientreplay",    ***
    "fs_name": "ocs-storagecluster-cephfilesystem",
    "clientreplay_status": {    ***
        "clientreplay_queue": 125048,
        "active_replay": 0
    },
    "rank_uptime": 191060.81145907301,
    "mdsmap_epoch": 8735,
    "osdmap_epoch": 4421,
    "osdmap_epoch_barrier": 3296,
    "uptime": 191061.807527136
}
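
Because a transient Client Replay state should not be acted on, one way to separate transient from persistent is to take the status output more than once and require the same result each time. The sketch below assumes each sample is the parsed JSON from the status command above (without the *** markers). Treating a non-empty `clientreplay_queue` with `active_replay` at 0 as "stalled" is an assumption of this sketch, not a documented Ceph rule, and both function names are hypothetical.

```python
def clientreplay_stalled(status):
    """True when the rank is in up:clientreplay with queued but no active replays."""
    if status.get("state") != "up:clientreplay":
        return False
    replay = status.get("clientreplay_status", {})
    # Assumption: queued requests with nothing actively replaying means no progress.
    return replay.get("clientreplay_queue", 0) > 0 and replay.get("active_replay", 0) == 0


def persistently_stalled(samples):
    """True only if at least two samples exist and every one of them is stalled."""
    return len(samples) >= 2 and all(clientreplay_stalled(s) for s in samples)


# Values copied from the example output above; the second sample is hypothetical
# and stands for the same command run again some minutes later.
first = {
    "state": "up:clientreplay",
    "clientreplay_status": {"clientreplay_queue": 125048, "active_replay": 0},
    "rank_uptime": 191060.81145907301,
}
second = dict(first, rank_uptime=191660.8)

print(persistently_stalled([first, second]))
```

The `rank_uptime` in the example, about 191060 seconds or roughly 53 hours, is a further indication that this Client Replay state is not transient.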

MDS Ops In Flight. Note that there are none:

{
    "ops": [],
    "num_ops": 0  <-- No Ops In Flight
}

MDS Session LS. Search for one of the client IDs reported as failing to respond to cache pressure (34512838 in this example). The relevant fields are marked with ***.

},
    {
        "id": 34512838,    *** 
        "entity": {
            "name": {
                "type": "client",
                "num": 34512838    ***
            },
            "addr": {
                "type": "v1",
                "addr": "10.130.26.1:0",
                "nonce": 2742038109
            }
        },
        "state": "open",
        "num_leases": 0,
        "num_caps": 185261,
        "request_load_avg": 0,
        "uptime": 238059.837427689,
        "requests_in_flight": 0,           ***
        "num_completed_requests": 173119,  ***
        "num_completed_flushes": 0,
        "reconnecting": false,
        "recall_caps": {
            "value": 2541549.9746376052,
            "halflife": 60
        },
        "release_caps": {
            "value": 0,
            "halflife": 60
        },
        "recall_caps_throttle": {
            "value": 55969.648738104304,
            "halflife": 1.5
        },
        "recall_caps_throttle2o": {
            "value": 14508.98512486678,
            "halflife": 0.5
        },
        "session_cache_liveness": {
            "value": 0,
            "halflife": 300
        },
        "cap_acquisition": {
            "value": 0,
            "halflife": 10
        },
        "delegated_inos": [],
        "inst": "client.34512838 v1:10.130.26.1:0/2742038109",    ***

If these symptoms are a match, proceed to the resolution.
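
The symptoms described above can be combined into one check. This is a minimal sketch, assuming the status, Ops In Flight, and session ls outputs have already been parsed from JSON into Python objects. The function names are hypothetical, and the rule simply mirrors this article's symptom list: state is up:clientreplay, there are no Ops In Flight, and the flagged client sessions show only completed requests.

```python
def session_only_completed(session):
    """True when a session has nothing in flight but has completed requests recorded."""
    return (
        session.get("requests_in_flight", 0) == 0
        and session.get("num_completed_requests", 0) > 0
    )


def symptoms_match(status, ops_dump, sessions, client_ids):
    """Check the three symptoms from this article against parsed MDS output."""
    in_clientreplay = status.get("state") == "up:clientreplay"
    no_ops = ops_dump.get("num_ops") == 0 and not ops_dump.get("ops")
    by_id = {s["id"]: s for s in sessions}
    flagged = [by_id[cid] for cid in client_ids if cid in by_id]
    # Require at least one flagged client to be found in the session list.
    sessions_idle = bool(flagged) and all(session_only_completed(s) for s in flagged)
    return in_clientreplay and no_ops and sessions_idle


# Values copied from the example outputs in this article.
status = {"state": "up:clientreplay"}
ops_dump = {"ops": [], "num_ops": 0}
sessions = [
    {
        "id": 34512838,
        "state": "open",
        "num_caps": 185261,
        "requests_in_flight": 0,
        "num_completed_requests": 173119,
    }
]

print(symptoms_match(status, ops_dump, sessions, [34512838]))
```

A single positive result is not enough on its own. Repeat the check over time to rule out a transient Client Replay state.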

Environment

Red Hat OpenShift Container Storage (OCS) 4.x
Red Hat OpenShift Container Platform (OCP) 4.x
Red Hat OpenShift Data Foundation (ODF) 4.x
Red Hat Ceph Storage (RHCS) 4.x
Red Hat Ceph Storage (RHCS) 5.x
Red Hat Ceph Storage (RHCS) 6.x
Red Hat Ceph Storage (RHCS) 7.x
Ceph File System (CephFS)
