Ceph: Cluster in HEALTH_WARN with "1 MDSs report slow requests"; object in a deadlock between unlink and rename.

Issue

Cluster is in HEALTH_WARN with "1 MDSs report slow requests"; an object is caught in a deadlock between an unlink and a rename request.

Example:

[root@edon-0 ~]# ceph -s
  cluster:
    id:     b613dxxx-redacted-cluster-ID-xxx9a4423a75
    health: HEALTH_WARN
            1 MDSs report slow requests
            1 MDSs behind on trimming      <--- This will only appear if the issue remains unresolved for many hours

  services:
    mon: 3 daemons, quorum edon-8,edon-4,edon-7 (age 13d)
    mgr: edon-3.xxxyyy(active, since 13d), standbys: edon-2.xxxyyy
    mds: 5/5 daemons up, 3 standby
    osd: 91 osds: 91 up (since 13d), 91 in (since 9M)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 2306 pgs
    objects: 33.58M objects, 39 TiB
    usage:   159 TiB used, 839 TiB / 998 TiB avail
    pgs:     2306 active+clean

  io:
    client:   3.7 MiB/s rd, 2.5 MiB/s wr, 8 op/s rd, 257 op/s wr

[root@edon-0 ~]# ceph health detail
HEALTH_WARN 1 MDSs report slow requests
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
    mds.root.edon-3.wnboxv(mds.1): 3 slow requests are blocked > 30 secs
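
If it is not yet clear whether the requests are merely slow or permanently stuck, re-check the warning periodically; in the unlink/rename deadlock described here the blocked count does not drain on its own. A minimal sketch (the 30-second interval is an arbitrary choice):

[root@edon-0 ~]# watch -n 30 'ceph health detail | grep -A 2 MDS_SLOW_REQUEST'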

[root@edon-0 ~]# ceph fs status
root - 88 clients
====
RANK  STATE             MDS                ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    active  root.edon-1.amvgfe  Reqs:    8 /s  76.1k  75.0k  1501   38.9k  
 1    active  root.edon-3.wnboxv  Reqs:  121 /s  5321k  5309k  12.1k   494k  
 2    active  root.edon-7.oqqvka  Reqs:    0 /s  5370   3575    289   3710   
 3    active  root.edon-9.qvfdrs  Reqs:    1 /s  8184k  8178k  84.2k   434k  
 4    active  root.edon-5.lbrqru  Reqs:  556 /s   319k   307k  23.2k  45.9k  
    POOL       TYPE     USED  AVAIL  
cephfs.meta  metadata  46.2G  12.8T  
cephfs.data    data     124T   246T  
      STANDBY MDS        
root.edon-2.rdnzhn  
root.edon-6.kfqjor  
root.edon-4.pjrzki  
MDS version: ceph version 17.2.6-100.el9cp (ea4e3ef8df2cf26540aae06479df031dcfc80343) quincy (stable)
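
The warning above points at rank 1 (root.edon-3.wnboxv). To confirm that only that rank is affected, the blocked-op count can be checked on every active rank. A hedged sketch, assuming a single filesystem so that the bare rank numbers shown by ceph fs status resolve to the active daemons:

[root@edon-0 ~]# for rank in 0 1 2 3 4; do echo -n "mds.${rank}: "; ceph tell mds.${rank} dump_blocked_ops | grep -c description; done   <--- rank list taken from 'ceph fs status' above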

[root@edon-0 ~]# ceph tell mds.1 dump_blocked_ops | grep initiated_at | sort   (If the output is extremely long, pipe it to head -20)
            "initiated_at": "2023-08-07T11:43:12.239105+0000",
            "initiated_at": "2023-08-07T11:43:12.239425+0000",
            "initiated_at": "2023-08-07T11:43:12.239488+0000",

[root@edon-0 ~]# ceph tell mds.1 dump_blocked_ops | grep description
            "description": "client_request(client.61253774:24406803 unlink #0x100013e7b0d/file17 2023-08-07T11:43:12.237640+0000 caller_uid=842788, caller_gid=667140{})",
            "description": "client_request(mds.1:36824 rename #0x100013e7b0d/file17 #0x60c/20009ac9901 caller_uid=0, caller_gid=0{})",
            "description": "client_request(mds.1:36825 rename #0x100013e7b0d/file17 #0x60c/20009ac9901 caller_uid=0, caller_gid=0{})",

Please note that all 3 requests reference the same object (file17) and that they were all initiated at virtually the same time.
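
To see which client mount issued the blocked unlink, the client id from the first description (client.61253774) can be matched against the MDS session list; the parent directory inode (0x100013e7b0d) can likewise be inspected with the MDS "dump inode" command, which expects a decimal inode number. A hedged sketch (grep context width and the temporary file name are arbitrary):

[root@edon-0 ~]# ceph tell mds.1 session ls > /tmp/mds1_sessions.json   <--- file name chosen for illustration
[root@edon-0 ~]# grep -B 2 -A 20 '"id": 61253774' /tmp/mds1_sessions.json
[root@edon-0 ~]# ceph tell mds.1 dump inode $(printf '%d' 0x100013e7b0d)   <--- converts the hex inode to decimal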

Environment

Red Hat Ceph Storage (RHCS) 5.3.4
Red Hat Ceph Storage (RHCS) 5.3.5
Red Hat Ceph Storage (RHCS) 6.0.0
Red Hat Ceph Storage (RHCS) 6.1.0
Red Hat Ceph Storage (RHCS) 6.1.1
