Ceph: MGR process hung - pybind/cephfs: holds GIL during rmdir

Solution Verified - Updated -

Issue

MGR process hung - pybind/cephfs: holds GIL during rmdir
MGR process missing from ceph status output.

When a CephFS client issues a large recursive delete, the active MGR blocks on a lock for an extremely long time (becomes wedged) and disappears from the output of ceph status.
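
The title attributes the hang to the cephfs Python binding holding the GIL during rmdir. The sketch below illustrates the general effect only; it is not Ceph code. It assumes Linux and a standard GIL-enabled CPython. `usleep` stands in for a slow blocking C call, and `ticks_during` and `compare` are hypothetical helper names. `ctypes.PyDLL` keeps the GIL held for the duration of the call, while `ctypes.CDLL` releases it.

```python
import ctypes
import threading
import time


def ticks_during(call, tick_interval=0.005):
    """Count how often a background thread runs while call() executes."""
    count = 0
    stop = threading.Event()

    def ticker():
        nonlocal count
        while not stop.is_set():
            count += 1
            time.sleep(tick_interval)

    worker = threading.Thread(target=ticker, daemon=True)
    worker.start()
    time.sleep(0.05)          # give the ticker time to start
    before = count
    call()                    # the blocking C call under test
    after = count
    stop.set()
    worker.join()
    return after - before


def compare(microseconds=300_000):
    """Return (ticks with GIL held, ticks with GIL released)."""
    gil_held = ctypes.PyDLL(None)      # GIL is NOT released during calls
    gil_released = ctypes.CDLL(None)   # GIL is released during calls
    held = ticks_during(lambda: gil_held.usleep(microseconds))
    released = ticks_during(lambda: gil_released.usleep(microseconds))
    return held, released


if __name__ == "__main__":
    held, released = compare()
    print(f"ticks while GIL held: {held}, while GIL released: {released}")
```

While the GIL is held, the background thread makes little or no progress. In this sketch that plays the role of the other Python threads inside a process whose blocking call never gives up the GIL.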

Expected:

$ ceph -s
  cluster:
    id:     8d232xxx-Redacted-Cluster-ID-yyyba00794f2
    health: OK

  services:
    mon: 3 daemons, quorum edon3,edon2,edon1 (age 2d)
    mgr: edon3(active, since 63m), standbys: edon2, edon1
    mds: 2/2 daemons up, 1 standby
    osd: 324 osds: 324 up (since 23h), 324 in (since 23h)

  data:
    volumes: 1/1 healthy
    pools:   13 pools, 10625 pgs
    objects: 106.42M objects, 41 TiB
    usage:   131 TiB used, 485 TiB / 616 TiB avail
    pgs:     10625 active+clean

In the output above, edon3 is the active MGR and the other two MGRs are on standby, which is expected for this system.
In the output below, edon2 is the active MGR and only edon1 is listed as a standby.
The MGR on node edon3 became unresponsive (wedged), so the MGR on edon2 took over as the active MGR.
Because the edon3 MGR is wedged, it also disappeared from the ceph status output.

$ ceph -s
  cluster:
    id:     8d232xxx-Redacted-Cluster-ID-yyyba00794f2
    health: OK

  services:
    mon: 3 daemons, quorum edon3,edon2,edon1 (age 2d)
    mgr: edon2(active, since 63m), standbys: edon1      <- Where is edon3?
    mds: 2/2 daemons up, 1 standby
    osd: 324 osds: 324 up (since 23h), 324 in (since 23h)

  data:
    volumes: 1/1 healthy
    pools:   13 pools, 10625 pgs
    objects: 106.42M objects, 41 TiB
    usage:   131 TiB used, 485 TiB / 616 TiB avail
    pgs:     10625 active+clean

  io:
    client:   16 MiB/s rd, 94 MiB/s wr, 865 op/s rd, 1.87k op/s wr
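
The comparison of the two outputs above can be automated. The following is a minimal sketch, assuming the JSON from `ceph mgr dump --format json` exposes the active daemon as `active_name` and the standbys as a `standbys` list of objects with a `name` field; verify these field names on your release. The function name `missing_mgrs` and the expected-name list are hypothetical.

```python
def missing_mgrs(mgr_dump, expected):
    """Return expected MGR names that are neither active nor standby.

    mgr_dump: dict parsed from `ceph mgr dump --format json`
              (field names are an assumption, see the text above).
    expected: iterable of MGR names that should be present.
    """
    seen = set()
    active = mgr_dump.get("active_name")
    if active:
        seen.add(active)
    for standby in mgr_dump.get("standbys", []):
        seen.add(standby["name"])
    return sorted(set(expected) - seen)


# Sample data mirroring the second status output: edon3 is absent.
sample_dump = {"active_name": "edon2", "standbys": [{"name": "edon1"}]}
print(missing_mgrs(sample_dump, ["edon1", "edon2", "edon3"]))
```

In practice the dictionary would come from parsing the command's JSON output with `json.loads`. A non-empty result points at a MGR that has dropped out of the map, as edon3 has here.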

A secondary symptom: with one MGR unresponsive, Ceph erroneously reports a MON clock skew condition.

Environment

Red Hat OpenShift Container Storage (OCS) 4.x
Red Hat OpenShift Container Platform (OCP) 4.x
Red Hat OpenShift Data Foundation (ODF) 4.x
Red Hat Ceph Storage (RHCS) 4.x
Red Hat Ceph Storage (RHCS) 5.x
Red Hat Ceph Storage (RHCS) 6.x
