Ceph: 1 MDSs behind on trimming and MDS stuck in "Client Replay".
Issue
1 MDSs behind on trimming and MDS stuck in Client Replay.
If the MDS is persistently stuck in Client Replay, the CephFS will not service any requests. The status will be as shown below, but be careful not to act on a transient Client Replay status.
Other symptoms will be persistently no Ops In Flight for the MDS, and the output of session ls will only show completed requests.
System Status:
$ ceph status
  cluster:
    id:     e8abfxxx-Redacted-Cluster-FSID-yyy68a61ce18
    health: HEALTH_WARN
            2 MDSs behind on trimming

  services:
    mon: 3 daemons, quorum a,b,c (age 6w)
    mgr: a(active, since 6w)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 8d), 3 in (since 2y)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 6.03M objects, 1.3 TiB
    usage:   4.0 TiB used, 2.0 TiB / 6.0 TiB avail
    pgs:     97 active+clean

  io:
    client: 22 MiB/s rd, 28 MiB/s wr, 544 op/s rd, 1.08k op/s wr
$ ceph health detail
HEALTH_WARN 1 filesystem is degraded; 2 clients failing to respond to cache pressure; 1 MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] MDS_CLIENT_RECALL: 2 clients failing to respond to cache pressure
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Client ip-10-2-103-131:csi-cephfs-node failing to respond to cache pressure client_id: 31330220
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Client ip-10-2-114-50:csi-cephfs-node failing to respond to cache pressure client_id: 34512838
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Behind on trimming (2026/256) max_segments: 256, num_segments: 2026
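In the MDS_TRIM warning above, num_segments (2026) is the number of journal segments still pending trim versus the configured limit max_segments (256). If needed, the currently configured limit can be checked read-only with, for example:
$ ceph config get mds mds_log_max_segments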
MDS Status:
$ ceph tell mds.ocs-storagecluster-cephfilesystem:0 status
{
"cluster_fsid": "XXX",
"whoami": 0,
"id": 19987341,
"want_state": "up:clientreplay",
"state": "up:clientreplay", ***
"fs_name": "ocs-storagecluster-cephfilesystem",
"clientreplay_status": { ***
"clientreplay_queue": 125048,
"active_replay": 0
},
"rank_uptime": 191060.81145907301,
"mdsmap_epoch": 8735,
"osdmap_epoch": 4421,
"osdmap_epoch_barrier": 3296,
"uptime": 191061.807527136
}
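In this output, clientreplay_queue is the number of re-sent client requests still waiting to be replayed. If it does not shrink over time while active_replay stays at 0, the MDS is stuck rather than slowly making progress. One way to spot-check this, assuming the same rank 0 MDS as above, is to poll the status, for example:
$ watch -n 30 "ceph tell mds.ocs-storagecluster-cephfilesystem:0 status | grep -A 2 clientreplay_status"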
MDS Ops In Flight, notice there are none.
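Depending on the Ceph version, the output below can typically be produced with ceph tell against rank 0 (or with dump_ops_in_flight via the daemon's admin socket), for example:
$ ceph tell mds.ocs-storagecluster-cephfilesystem:0 ops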
{
"ops": [],
"num_ops": 0 <-- No Ops In Flight
}
MDS Session LS, search for one of the client IDs failing to respond to cache pressure. This example: 34512838
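The session list can typically be dumped with ceph tell against rank 0; if jq is available, the session of interest can be filtered out directly. Both commands are examples, using the client id 34512838 from this case:
$ ceph tell mds.ocs-storagecluster-cephfilesystem:0 session ls
$ ceph tell mds.ocs-storagecluster-cephfilesystem:0 session ls | jq '.[] | select(.id == 34512838)'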
***
},
{
"id": 34512838, ***
"entity": {
"name": {
"type": "client",
"num": 34512838 ***
},
"addr": {
"type": "v1",
"addr": "10.130.26.1:0",
"nonce": 2742038109
}
},
"state": "open",
"num_leases": 0,
"num_caps": 185261,
"request_load_avg": 0,
"uptime": 238059.837427689,
"requests_in_flight": 0, ***
"num_completed_requests": 173119, ***
"num_completed_flushes": 0,
"reconnecting": false,
"recall_caps": {
"value": 2541549.9746376052,
"halflife": 60
},
"release_caps": {
"value": 0,
"halflife": 60
},
"recall_caps_throttle": {
"value": 55969.648738104304,
"halflife": 1.5
},
"recall_caps_throttle2o": {
"value": 14508.98512486678,
"halflife": 0.5
},
"session_cache_liveness": {
"value": 0,
"halflife": 300
},
"cap_acquisition": {
"value": 0,
"halflife": 10
},
"delegated_inos": [],
"inst": "client.34512838 v1:10.130.26.1:0/2742038109", ***
If these symptoms are a match, proceed to the resolution.
Environment
Red Hat OpenShift Container Storage (OCS) 4.x
Red Hat OpenShift Cluster Platform (OCP) 4.x
Red Hat OpenShift Data Foundation (ODF) 4.x
Red Hat Ceph Storage (RHCS) 4.x
Red Hat Ceph Storage (RHCS) 5.x
Red Hat Ceph Storage (RHCS) 6.x
Red Hat Ceph Storage (RHCS) 7.x
Ceph File System (CephFS)