Ceph: 1 MDSs behind on trimming and MDS stuck in "Client Replay".
Issue
1 MDSs behind on trimming and MDS stuck in Client Replay.
If the MDS is persistently stuck in Client Replay, the CephFS will not service any requests. The status will be as shown below, but be careful not to act on a transient Client Replay status.
Other symptoms will be persistently no Ops In Flight for the MDS, and the output of session ls will only show completed requests.
System Status:
$ ceph status
  cluster:
    id:     e8abfxxx-Redacted-Cluster-FSID-yyy68a61ce18
    health: HEALTH_WARN
            2 MDSs behind on trimming

  services:
    mon: 3 daemons, quorum a,b,c (age 6w)
    mgr: a(active, since 6w)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 8d), 3 in (since 2y)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 6.03M objects, 1.3 TiB
    usage:   4.0 TiB used, 2.0 TiB / 6.0 TiB avail
    pgs:     97 active+clean

  io:
    client: 22 MiB/s rd, 28 MiB/s wr, 544 op/s rd, 1.08k op/s wr
$ ceph health detail
HEALTH_WARN 1 filesystem is degraded; 2 clients failing to respond to cache pressure; 1 MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] MDS_CLIENT_RECALL: 2 clients failing to respond to cache pressure
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Client ip-10-2-103-131:csi-cephfs-node failing to respond to cache pressure client_id: 31330220
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Client ip-10-2-114-50:csi-cephfs-node failing to respond to cache pressure client_id: 34512838
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Behind on trimming (2026/256) max_segments: 256, num_segments: 2026
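In the MDS_TRIM warning above, num_segments (2026) is the number of journal segments still pending trim versus the configured limit max_segments (256). If needed, the currently configured limit can be checked read-only with, for example:
$ ceph config get mds mds_log_max_segments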
MDS Status:
$ ceph tell mds.ocs-storagecluster-cephfilesystem:0 status
{
"cluster_fsid": "XXX",
"whoami": 0,
"id": 19987341,
"want_state": "up:clientreplay",
"state": "up:clientreplay", ***
"fs_name": "ocs-storagecluster-cephfilesystem",
"clientreplay_status": { ***
"clientreplay_queue": 125048,
"active_replay": 0
},
"rank_uptime": 191060.81145907301,
"mdsmap_epoch": 8735,
"osdmap_epoch": 4421,
"osdmap_epoch_barrier": 3296,
"uptime": 191061.807527136
}
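In this output, clientreplay_queue is the number of re-sent client requests still waiting to be replayed. If it does not shrink over time while active_replay stays at 0, the MDS is stuck rather than slowly making progress. One way to spot-check this, assuming the same rank 0 MDS as above, is to poll the status, for example:
$ watch -n 30 "ceph tell mds.ocs-storagecluster-cephfilesystem:0 status | grep -A 2 clientreplay_status"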
MDS Ops In Flight, notice there are none.
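Depending on the Ceph version, the output below can typically be produced with ceph tell against rank 0 (or with dump_ops_in_flight via the daemon's admin socket), for example:
$ ceph tell mds.ocs-storagecluster-cephfilesystem:0 ops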
{
"ops": [],
"num_ops": 0 <-- No Ops In Flight
}
MDS Session LS, search for one of the client IDs failing to respond to cache pressure. This example: 34512838
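The session list can typically be dumped with ceph tell against rank 0; if jq is available, the session of interest can be filtered out directly. Both commands are examples, using the client id 34512838 from this case:
$ ceph tell mds.ocs-storagecluster-cephfilesystem:0 session ls
$ ceph tell mds.ocs-storagecluster-cephfilesystem:0 session ls | jq '.[] | select(.id == 34512838)'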
***
},
{
"id": 34512838, ***
"entity": {
"name": {
"type": "client",
"num": 34512838 ***
},
"addr": {
"type": "v1",
"addr": "10.130.26.1:0",
"nonce": 2742038109
}
},
"state": "open",
"num_leases": 0,
"num_caps": 185261,
"request_load_avg": 0,
"uptime": 238059.837427689,
"requests_in_flight": 0, ***
"num_completed_requests": 173119, ***
"num_completed_flushes": 0,
"reconnecting": false,
"recall_caps": {
"value": 2541549.9746376052,
"halflife": 60
},
"release_caps": {
"value": 0,
"halflife": 60
},
"recall_caps_throttle": {
"value": 55969.648738104304,
"halflife": 1.5
},
"recall_caps_throttle2o": {
"value": 14508.98512486678,
"halflife": 0.5
},
"session_cache_liveness": {
"value": 0,
"halflife": 300
},
"cap_acquisition": {
"value": 0,
"halflife": 10
},
"delegated_inos": [],
"inst": "client.34512838 v1:10.130.26.1:0/2742038109", ***
If these symptoms are a match, proceed to the resolution.
Environment
Red Hat OpenShift Container Storage (OCS) 4.x
Red Hat OpenShift Cluster Platform (OCP) 4.x
Red Hat OpenShift Data Foundation (ODF) 4.x
Red Hat Ceph Storage (RHCS) 4.x
Red Hat Ceph Storage (RHCS) 5.x
Red Hat Ceph Storage (RHCS) 6.x
Red Hat Ceph Storage (RHCS) 7.x
Ceph File System (CephFS)