Ceph: OSD Slow Requests and OSDs flapping after deleting large RGW objects
Issue
OSD Slow Requests and OSDs flapping after deleting large RGW objects
After deleting a large amount of S3/Swift data within a short period, a Ceph cluster may experience the following:
- OSD slow op warnings
- Laggy PGs
- OSDs flapping
- radosgw-admin gc list stalls.
- The OSD logs will have these errors:
TIMESTAMP THREAD_NAME /builddir/build/BUILD/ceph-16.2.0/src/cls/queue/cls_queue_src.cc:243: ERROR: No space left in queue
TIMESTAMP THREAD_NAME osd.927 260489 get_health_metrics reporting 223 slow ops, oldest is osd_op(client.64814084,0173667 5.bf 5:fde4dd55:gc::gc.261head [call version.check_conds in=74b,call rgw_gc.rgw_gc_queue_enqueue in=653b] snapc 0=[] RETRY=4 ondisk+retry+write+known_if_redirected e260479)
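The two log signatures above can be picked out of an OSD log with a short script. The following is a minimal sketch, not a supported tool: the function name scan_osd_log and both regular expressions are illustrative assumptions derived only from the sample lines shown in this article, and log formats may differ between releases.

```python
import re

# Patterns derived from the sample log lines in this article (assumed, not exhaustive).
QUEUE_FULL = re.compile(r"cls_queue_src\.cc:\d+: ERROR: No space left in queue")
SLOW_OPS = re.compile(r"(osd\.\d+) .*reporting (\d+) slow ops")


def scan_osd_log(lines):
    """Count queue-full errors and record the highest slow-op count seen per OSD."""
    queue_full = 0
    slow_ops = {}
    for line in lines:
        if QUEUE_FULL.search(line):
            queue_full += 1
        match = SLOW_OPS.search(line)
        if match:
            osd, count = match.group(1), int(match.group(2))
            # Keep the worst (largest) slow-op count reported by each OSD.
            slow_ops[osd] = max(slow_ops.get(osd, 0), count)
    return queue_full, slow_ops
```

The function accepts any iterable of strings, so an open log file can be passed directly. A non-zero queue-full count together with slow ops whose oldest operation is an rgw_gc enqueue call would be consistent with the symptoms described here.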
The issue can be reproduced by creating three 900 GB S3 objects and later deleting them. Some time after the deletion (from minutes to a couple of hours), the above symptoms will be observed. Creating 3,000 objects of 900 MB each and later deleting them all at once also triggers the same issue.
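Both reproduction scenarios delete the same total volume of data (2.7 TB), which may help explain why they behave alike. The sketch below gives a rough estimate of how many RADOS objects each scenario leaves for garbage collection. It rests on assumptions not stated in this article: sizes are taken as decimal units, data is striped at 4 MiB (the default rgw_obj_stripe_size), and head objects and multipart layout are ignored. The function name approx_rados_objects is illustrative only.

```python
STRIPE_SIZE = 4 * 1024 * 1024  # assumed: default rgw_obj_stripe_size (4 MiB)


def approx_rados_objects(object_count, object_size_bytes, stripe_size=STRIPE_SIZE):
    """Rough number of RADOS objects backing `object_count` equally sized S3 objects."""
    # Integer ceiling division: a partially filled last stripe still needs an object.
    stripes_per_object = (object_size_bytes + stripe_size - 1) // stripe_size
    return object_count * stripes_per_object


few_large = approx_rados_objects(3, 900 * 10**9)      # three 900 GB objects
many_small = approx_rados_objects(3000, 900 * 10**6)  # 3,000 objects of 900 MB each
print(few_large, many_small)
```

Under these assumptions the two estimates land within a fraction of a percent of each other, so the amount of cleanup work handed to garbage collection is comparable regardless of whether the data was stored as a few very large objects or many smaller ones.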
Environment
Red Hat Ceph Storage (RHCS) 4
Red Hat Ceph Storage (RHCS) 5
Red Hat Ceph Storage (RHCS) 6