Ceph: OSD Slow Requests and OSDs flapping after deleting large RGW objects

Issue

After a large amount of S3/Swift data is deleted in a short period, a Ceph cluster may show the following symptoms:

  • OSD slow op warnings
  • Laggy PGs
  • OSDs flapping
  • radosgw-admin gc list stalls
  • The OSD logs contain errors such as the following:
TIMESTAMP THREAD_NAME /builddir/build/BUILD/ceph-16.2.0/src/cls/queue/cls_queue_src.cc:243: ERROR: No space left in queue
TIMESTAMP THREAD_NAME osd.927 260489 get_health_metrics reporting 223 slow ops, oldest is osd_op(client.64814084.0:173667 5.bt 5:fde4dd55:gc::gc.26:head [call version.check_conds in=74b,call rgw_gc.rgw_gc_queue_enqueue in=653b] snapc 0=[] RETRY=4 ondisk+retry+write+known_if_redirected e260479)
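
To check whether an OSD log shows the two signatures above, a small scan along the following lines may help. This is a minimal sketch, not a Red Hat tool: the function name scan_osd_log and the log path in the usage note are assumptions, and the matching relies only on the "No space left in queue" and "reporting N slow ops" phrases quoted in this article.

```python
import re

# Phrases taken from the log excerpts in this article.
QUEUE_FULL_TEXT = "No space left in queue"
SLOW_OPS_PATTERN = re.compile(r"reporting (\d+) slow ops")


def scan_osd_log(lines):
    """Count GC queue-full errors and track the highest slow-op count seen.

    `lines` is any iterable of log lines (hypothetical helper, for illustration).
    """
    queue_full_errors = 0
    max_slow_ops = 0
    for line in lines:
        if QUEUE_FULL_TEXT in line:
            queue_full_errors += 1
        match = SLOW_OPS_PATTERN.search(line)
        if match:
            max_slow_ops = max(max_slow_ops, int(match.group(1)))
    return {"queue_full_errors": queue_full_errors, "max_slow_ops": max_slow_ops}


if __name__ == "__main__":
    sample = [
        "TIMESTAMP THREAD_NAME cls_queue_src.cc:243: ERROR: No space left in queue",
        "TIMESTAMP THREAD_NAME osd.927 260489 reporting 223 slow ops, oldest is osd_op",
    ]
    print(scan_osd_log(sample))
```

To scan a real log, pass an open file object instead of the sample list, for example open("/var/log/ceph/ceph-osd.927.log"). That path is an assumption; the log location differs between package-based and containerized deployments.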

The issue can be reproduced by creating three 900 GB S3 objects and then deleting them. The symptoms above appear some time after the deletion, from a few minutes to a couple of hours later. Creating 3,000 objects of 900 MB each and deleting them all at once triggers the same issue.
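
The two reproduction scenarios delete the same total volume of data, which the arithmetic below illustrates. This is a rough sketch under stated assumptions: GB and MB are read as decimal units, and the 4 MiB stripe size is an assumed value for illustration rather than one taken from this article. The helper names are hypothetical, and the object-count estimate ignores multipart layout and head objects.

```python
GB = 10**9  # assumption: decimal gigabytes
MB = 10**6  # assumption: decimal megabytes
ASSUMED_STRIPE_BYTES = 4 * 1024 * 1024  # assumption: 4 MiB RADOS stripe size


def deleted_bytes(object_count, object_size_bytes):
    """Total bytes removed when every object in the scenario is deleted."""
    return object_count * object_size_bytes


def approx_rados_objects(object_count, object_size_bytes, stripe=ASSUMED_STRIPE_BYTES):
    """Rough count of RADOS objects backing the S3 objects (ceiling division)."""
    per_object = -(-object_size_bytes // stripe)
    return object_count * per_object


if __name__ == "__main__":
    scenarios = {
        "3 x 900 GB": (3, 900 * GB),
        "3000 x 900 MB": (3000, 900 * MB),
    }
    for name, (count, size) in scenarios.items():
        total_tb = deleted_bytes(count, size) / 10**12
        print(name, total_tb, "TB,", approx_rados_objects(count, size), "RADOS objects (approx.)")
```

Under these assumptions both cases remove about 2.7 TB, which suggests that the total volume queued for cleanup matters more than the number of S3 objects deleted.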

Environment

Red Hat Ceph Storage (RHCS) 4
Red Hat Ceph Storage (RHCS) 5
Red Hat Ceph Storage (RHCS) 6
