Ceph: MGR process hung - pybind/cephfs: holds GIL during rmdir
Issue
MGR process hung - pybind/cephfs: holds GIL during rmdir
MGR process missing from ceph status output.
When a CephFS client issues a large recursive delete, the active MGR blocks on a mutex for an extremely long time (becomes wedged) and disappears from the output of ceph status.
Expected:
$ ceph -s
  cluster:
    id: 8d232xxx-Redacted-Cluster-ID-yyyba00794f2
    health: OK
  services:
    mon: 3 daemons, quorum edon3,edon2,edon1 (age 2d)
    mgr: edon3(active, since 63m), standbys: edon2, edon1
    mds: 2/2 daemons up, 1 standby
    osd: 324 osds: 324 up (since 23h), 324 in (since 23h)
  data:
    volumes: 1/1 healthy
    pools: 13 pools, 10625 pgs
    objects: 106.42M objects, 41 TiB
    usage: 131 TiB used, 485 TiB / 616 TiB avail
    pgs: 10625 active+clean
In the output above, edon3 is the active MGR and the other two MGRs are standbys, which is expected for this system.
In the output below, edon2 is the active MGR and only edon1 is a standby.
The MGR on node edon3 became unresponsive (wedged), which caused the MGR on edon2 to take over as the active MGR.
Because the edon3 MGR is wedged, it also disappeared from the ceph status output.
$ ceph -s
  cluster:
    id: 8d232xxx-Redacted-Cluster-ID-yyyba00794f2
    health: OK
  services:
    mon: 3 daemons, quorum edon3,edon2,edon1 (age 2d)
    mgr: edon2(active, since 63m), standbys: edon1 <- Where is edon3?
    mds: 2/2 daemons up, 1 standby
    osd: 324 osds: 324 up (since 23h), 324 in (since 23h)
  data:
    volumes: 1/1 healthy
    pools: 13 pools, 10625 pgs
    objects: 106.42M objects, 41 TiB
    usage: 131 TiB used, 485 TiB / 616 TiB avail
    pgs: 10625 active+clean
  io:
    client: 16 MiB/s rd, 94 MiB/s wr, 865 op/s rd, 1.87k op/s wr
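The comparison between the two status outputs above can also be done programmatically. The following is a minimal sketch, not part of the original article: it assumes the JSON form of ceph mgr dump exposes an active_name field and a standbys list whose entries carry a name field, and that the operator supplies the set of MGR names expected on the cluster. The helper names missing_mgrs and fetch_mgr_dump are hypothetical.

```python
import json
import subprocess


def missing_mgrs(mgr_dump, expected):
    """Return the expected MGR names that are neither active nor standby."""
    # Collect every standby name reported in the dump.
    seen = {standby["name"] for standby in mgr_dump.get("standbys", [])}
    # Add the active MGR, if one is reported.
    active = mgr_dump.get("active_name")
    if active:
        seen.add(active)
    return set(expected) - seen


def fetch_mgr_dump():
    """Run `ceph mgr dump` and parse its JSON output (assumes a working ceph CLI)."""
    result = subprocess.run(
        ["ceph", "mgr", "dump", "--format", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Sample shaped like the second status output: edon2 active, edon1 standby.
sample = {"active_name": "edon2", "standbys": [{"name": "edon1"}]}
print(sorted(missing_mgrs(sample, {"edon1", "edon2", "edon3"})))  # ['edon3']
```

On a live cluster, missing_mgrs(fetch_mgr_dump(), {"edon1", "edon2", "edon3"}) would perform the same check against real data. A non-empty result only shows that a MGR is absent from the map; it does not by itself confirm that the cause is the wedged condition described here.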
A secondary symptom: with one MGR unresponsive, Ceph will erroneously report a MON clock skew condition.
Environment
Red Hat OpenShift Container Storage (OCS) 4.x
Red Hat OpenShift Container Platform (OCP) 4.x
Red Hat OpenShift Data Foundation (ODF) 4.x
Red Hat Ceph Storage (RHCS) 4.x
Red Hat Ceph Storage (RHCS) 5.x
Red Hat Ceph Storage (RHCS) 6.x