Ceph: MDS has the cluster in a "HEALTH_WARN" with "1 clients failing to respond to capability release".


Issue

  • MDS has the cluster in a HEALTH_WARN with 1 clients failing to respond to capability release
  • Operations such as ls -latrh never finish on one directory, while the same operations work fine on other directories.
  • Applications are hanging when accessing a directory or file.
  • The MDS log contains entries such as: client.123456 isn't responding to mclientcaps(revoke), ino 0x00faed9f92a6 pending pLsFc issued pAsLsXsFscr
  • The MDS log contains entries such as: client_request(client.123456:789012 getattr AsLsXsFs #0x00faed9f92a6 [...]) currently failed to rdlock, waiting
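
The capability strings in these log entries can be decoded mechanically. The following is a minimal sketch, not an official tool. It assumes the usual CephFS notation (p for pin; A, L, X and F for the auth, link, xattr and file groups; lowercase letters for the bits held in each group) and it assumes that the capabilities listed under "issued" but not under "pending" are the ones the MDS is still waiting for the client to release. The function names expand_caps and revoked_caps are hypothetical helpers, not part of Ceph.

```shell
#!/bin/sh
# Hypothetical helpers for reading CephFS capability strings.

# expand_caps "pAsLsXsFscr" prints "p As Ls Xs Fs Fc Fr":
# each lowercase bit is paired with the group letter that precedes it.
expand_caps() {
    rest=$1 group="" out=""
    while [ -n "$rest" ]; do
        ch=${rest%"${rest#?}"}      # first character of $rest
        rest=${rest#?}              # drop that character
        case $ch in
            p)       out="$out p" ;;          # pin has no group letter
            A|L|X|F) group=$ch ;;             # start of a new group
            *)       out="$out $group$ch" ;;  # bit inside the current group
        esac
    done
    printf '%s\n' "${out# }"
}

# revoked_caps PENDING ISSUED prints the caps present in ISSUED but
# absent from PENDING (assumption: these are awaiting release).
revoked_caps() {
    pending=" $(expand_caps "$1") "
    revoked=""
    for cap in $(expand_caps "$2"); do
        case $pending in
            *" $cap "*) ;;                       # still granted, nothing to release
            *) revoked="$revoked $cap" ;;
        esac
    done
    printf '%s\n' "${revoked# }"
}
```

For the sample entry above, revoked_caps pLsFc pAsLsXsFscr prints As Xs Fs Fr.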

Example:

-bash 5.1 $ ceph -s
  cluster:
    id:     7153xxxx-Redacted-Cluster-ID-yyyy49cf8043
    health: HEALTH_WARN
            1 clients failing to respond to capability release

  services:
    mon: 5 daemons, quorum edon-1,edon-2,edon-3,edon-4,edon-5 (age 23h)
    mgr: edon-5(active, since 11d), standbys: edon-1, edon-4, edon-2, edon-3
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 768 osds: 768 up (since 2d), 768 in (since 2d); 268 remapped pgs
    rgw: 6 daemons active (6 hosts, 2 zones)

  data:
    volumes: 1/1 healthy
    pools:   45 pools, 25841 pgs
    objects: 4.59G objects, 507 TiB
    usage:   1.8 PiB used, 2.0 PiB / 3.8 PiB avail
    pgs:     25836 active+clean
             4     active+clean+scrubbing+deep
             1     active+clean+scrubbing

  io:
    client:   221 MiB/s rd, 88 MiB/s wr, 7.41k op/s rd, 5.36k op/s wr


-bash 5.1 $ ceph health detail
HEALTH_WARN 1 clients failing to respond to capability release
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
    mds.edon-4(mds.0): Client client-46.MyBigCo.org failing to respond to capability release client_id: 146372720
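
When several clients are listed, the numeric client_id values can be pulled out of this output for later correlation with the MDS log. This is a minimal sketch that assumes the detail lines keep the format shown above; late_release_client_ids is a hypothetical helper name, not a Ceph command.

```shell
#!/bin/sh
# Hypothetical helper: read "ceph health detail" output on stdin and
# print one client_id per late-release client.
late_release_client_ids() {
    # Only the per-client detail lines end with "client_id: <number>";
    # the summary lines do not match and are skipped.
    sed -n 's/.*failing to respond to capability release client_id: \([0-9][0-9]*\).*/\1/p'
}

# Usage (assumes the ceph CLI is available on this host):
#   ceph health detail | late_release_client_ids
```

With the output above, the helper prints 146372720.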

To confirm that this article matches your issue, also check the log of the active MDS for warnings like the following:

{Timestamp}  0 log_channel(cluster) log [WRN] : slow request 61.089276 seconds old, received at 2023-10-30T13:51:50.613596-0300: client_request(client.151050512:13077400 getattr AsLsXsFs #0x100090b7c94 2023-10-30T13:51:50.610669-0300 caller_uid=1001, caller_gid=1001{}) currently failed to rdlock, waiting
{Timestamp}  0 log_channel(cluster) log [WRN] : client.151135072 isn't responding to mclientcaps(revoke), ino 0x100090b7c94 pending pAsLsXsFs issued pAsLsXsFsx, sent 61.621197 seconds ago
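
The two warnings are easier to correlate once they are reduced to their key fields. In the sample above both lines name inode 0x100090b7c94: one client is waiting for an rdlock while a different client has not answered the revoke. The sketch below assumes the log lines follow the format shown above; slow_requests and unresponsive_revokes are hypothetical helper names, and the log file path is left as a placeholder.

```shell
#!/bin/sh
# Hypothetical helpers: read MDS log lines on stdin and print key fields.

# Slow client requests: requesting client, operation, inode, current state.
slow_requests() {
    sed -n 's/.*client_request(client\.\([0-9][0-9]*\):[0-9][0-9]* \([a-z_]*\) .*#\(0x[0-9a-f]*\) .*currently \(.*\)$/client=\1 op=\2 ino=\3 state=\4/p'
}

# Clients that have not answered a capability revoke: client, inode,
# and the pending/issued capability strings.
# "isn.t" matches "isn't" without needing a quote inside the sed script.
unresponsive_revokes() {
    sed -n 's/.*client\.\([0-9][0-9]*\) isn.t responding to mclientcaps(revoke), ino \(0x[0-9a-f]*\) pending \([^ ,]*\) issued \([^ ,]*\).*/client=\1 ino=\2 pending=\3 issued=\4/p'
}

# Usage (MDS_LOG is a placeholder for the active MDS log file):
#   slow_requests < "$MDS_LOG"
#   unresponsive_revokes < "$MDS_LOG"
```

Matching the ino= values between the two outputs shows which unresponsive client is blocking which request.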

It can also help to check the logs on the affected client (/var/log/messages and/or the output of dmesg -T).
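
On a client that mounts CephFS with the kernel driver, the relevant kernel messages are normally prefixed with "ceph:" or "libceph:". The sketch below filters for those prefixes. It is an assumption that the client uses the kernel driver; a ceph-fuse client does not log through the kernel. The name ceph_kernel_messages is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: keep only lines that carry a kernel CephFS
# client prefix ("ceph:" or "libceph:"); reads stdin.
ceph_kernel_messages() {
    grep -E '(lib)?ceph:'
}

# Usage on the client host:
#   dmesg -T | ceph_kernel_messages
#   ceph_kernel_messages < /var/log/messages
```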

Environment

Red Hat OpenShift Container Storage (OCS) 4.x
Red Hat OpenShift Container Platform (OCP) 4.x
Red Hat OpenShift Data Foundation (ODF) 4.x
Red Hat Ceph Storage (RHCS) 4.x
Red Hat Ceph Storage (RHCS) 5.x
Red Hat Ceph Storage (RHCS) 6.x
Ceph File System (CephFS)
