Ceph: MDS has the cluster in a "HEALTH_WARN" with "1 clients failing to respond to capability release".
Issue
- MDS has the cluster in a HEALTH_WARN with "1 clients failing to respond to capability release".
- Operations like ls -latrh hang on one directory but work fine in other directories.
- Applications are hanging when accessing a directory or file.
- You see client.123456 isn't responding to mclientcaps(revoke), ino 0x00faed9f92a6 pending pLsFc issued pAsLsXsFscr log entries.
- You see client_request(client.123456:789012 getattr AsLsXsFs #0x00faed9f92a6 [...]) currently failed to rdlock, waiting log entries.
Example:
-bash 5.1 $ ceph -s
  cluster:
    id:     7153xxxx-Redacted-Cluster-ID-yyyy49cf8043
    health: HEALTH_WARN
            1 clients failing to respond to capability release
  services:
    mon: 5 daemons, quorum edon-1,edon-2,edon-3,edon-4,edon-5 (age 23h)
    mgr: edon-5(active, since 11d), standbys: edon-1, edon-4, edon-2, edon-3
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 768 osds: 768 up (since 2d), 768 in (since 2d); 268 remapped pgs
    rgw: 6 daemons active (6 hosts, 2 zones)
  data:
    volumes: 1/1 healthy
    pools:   45 pools, 25841 pgs
    objects: 4.59G objects, 507 TiB
    usage:   1.8 PiB used, 2.0 PiB / 3.8 PiB avail
    pgs:     25836 active+clean
             4     active+clean+scrubbing+deep
             1     active+clean+scrubbing
  io:
    client: 221 MiB/s rd, 88 MiB/s wr, 7.41k op/s rd, 5.36k op/s wr
-bash 5.1 $ ceph health detail
HEALTH_WARN 1 clients failing to respond to capability release
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
mds.edon-4(mds.0): Client client-46.MyBigCo.org failing to respond to capability release client_id: 146372720
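When several clients are flagged, it can help to pull the client IDs out of the health output programmatically. The following is a minimal Python sketch, not an official tool: the regular expression is modeled only on the sample line above, so other Ceph releases may format the message differently, and the function name late_release_clients is hypothetical.

```python
import re
from typing import Dict, List

# Pattern modeled on the sample "ceph health detail" line shown above.
# Assumption: other Ceph releases may word or order these fields differently.
LATE_RELEASE = re.compile(
    r"^\s*(?P<mds>mds\.[^\s(]+)\(mds\.(?P<rank>\d+)\): "
    r"Client (?P<client>\S+) failing to respond to capability release "
    r"client_id: (?P<client_id>\d+)"
)


def late_release_clients(health_detail: str) -> List[Dict[str, str]]:
    """Return one dict per client reported under MDS_CLIENT_LATE_RELEASE."""
    found = []
    for line in health_detail.splitlines():
        match = LATE_RELEASE.match(line)
        if match:
            found.append(match.groupdict())
    return found


# Sample text copied from the example output in this article.
sample = """\
HEALTH_WARN 1 clients failing to respond to capability release
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
mds.edon-4(mds.0): Client client-46.MyBigCo.org failing to respond to capability release client_id: 146372720
"""

clients = late_release_clients(sample)
for entry in clients:
    print(entry["mds"], entry["client"], entry["client_id"])
```

The extracted client_id is the number that appears after "client." in the MDS log entries, which makes it easier to search the logs for the same client.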
To confirm that this KCS matches your issue, also check the active MDS logs for warnings like these:
{Timestamp} 0 log_channel(cluster) log [WRN] : slow request 61.089276 seconds old, received at 2023-10-30T13:51:50.613596-0300: client_request(client.151050512:13077400 getattr AsLsXsFs #0x100090b7c94 2023-10-30T13:51:50.610669-0300 caller_uid=1001, caller_gid=1001{}) currently failed to rdlock, waiting
{Timestamp} 0 log_channel(cluster) log [WRN] : client.151135072 isn't responding to mclientcaps(revoke), ino 0x100090b7c94 pending pAsLsXsFs issued pAsLsXsFsx, sent 61.621197 seconds ago
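The two warnings above can be correlated by inode number. The following Python sketch extracts the client ID and inode from each line and lists which capabilities are in "issued" but not in "pending". It is an illustration under stated assumptions: the patterns are modeled only on the sample lines, the function names are hypothetical, and the capability strings are assumed to follow the usual CephFS notation of an optional leading "p" followed by an uppercase group letter and lowercase mode letters.

```python
import re
from typing import Dict, Optional, Set

# Patterns modeled on the two sample MDS log lines above (assumption:
# the wording may differ between Ceph releases).
REVOKE_RE = re.compile(
    r"client\.(?P<client_id>\d+) isn't responding to mclientcaps\(revoke\), "
    r"ino (?P<ino>0x[0-9a-fA-F]+) pending (?P<pending>[A-Za-z]+) "
    r"issued (?P<issued>[A-Za-z]+)"
)
SLOW_RE = re.compile(
    r"client_request\(client\.(?P<client_id>\d+):(?P<tid>\d+) (?P<op>\w+) "
    r".*?#(?P<ino>0x[0-9a-fA-F]+).*currently (?P<state>.+)$"
)


def split_caps(caps: str) -> Set[str]:
    """Split a string such as 'pAsLsXsFsx' into {'p', 'As', 'Ls', 'Xs', 'Fs', 'Fx'}."""
    result = set()
    if caps.startswith("p"):  # assumed: a leading 'p' stands alone
        result.add("p")
        caps = caps[1:]
    group = ""
    for ch in caps:
        if ch.isupper():
            group = ch  # start of a new capability group
        else:
            result.add(group + ch)  # one mode letter within the current group
    return result


def parse_revoke(line: str) -> Optional[Dict[str, object]]:
    """Parse an "isn't responding to mclientcaps(revoke)" warning."""
    match = REVOKE_RE.search(line)
    if match is None:
        return None
    info: Dict[str, object] = dict(match.groupdict())
    issued = split_caps(match.group("issued"))
    pending = split_caps(match.group("pending"))
    info["being_revoked"] = sorted(issued - pending)
    return info


def parse_slow_request(line: str) -> Optional[Dict[str, str]]:
    """Parse a slow client_request warning."""
    match = SLOW_RE.search(line)
    return match.groupdict() if match else None


# Sample lines copied from this article.
slow_line = (
    "{Timestamp} 0 log_channel(cluster) log [WRN] : slow request 61.089276 seconds old, "
    "received at 2023-10-30T13:51:50.613596-0300: client_request(client.151050512:13077400 "
    "getattr AsLsXsFs #0x100090b7c94 2023-10-30T13:51:50.610669-0300 caller_uid=1001, "
    "caller_gid=1001{}) currently failed to rdlock, waiting"
)
revoke_line = (
    "{Timestamp} 0 log_channel(cluster) log [WRN] : client.151135072 isn't responding to "
    "mclientcaps(revoke), ino 0x100090b7c94 pending pAsLsXsFs issued pAsLsXsFsx, "
    "sent 61.621197 seconds ago"
)

slow = parse_slow_request(slow_line)
revoke = parse_revoke(revoke_line)
print("blocked request from client", slow["client_id"], "on inode", slow["ino"])
print("unanswered revoke for client", revoke["client_id"], "on inode", revoke["ino"])
print("capabilities in issued but not pending:", revoke["being_revoked"])
```

In the sample lines, the blocked getattr comes from client.151050512, the unanswered revoke is addressed to client.151135072, and both reference inode 0x100090b7c94.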
It may also be useful to look at the client logs (/var/log/messages and/or dmesg -T).
Environment
Red Hat OpenShift Container Storage (OCS) 4.x
Red Hat OpenShift Cluster Platform (OCP) 4.x
Red Hat OpenShift Data Foundation (ODF) 4.x
Red Hat Ceph Storage (RHCS) 4.x
Red Hat Ceph Storage (RHCS) 5.x
Red Hat Ceph Storage (RHCS) 6.x
Ceph File System (CephFS)