After upgrade from RHCS 3 to RHCS 4, cannot add pools, or add/remove OSDs - PGs stuck peering / unknown


Issue

After an upgrade from Red Hat Ceph Storage (RHCS) 3.x (based on upstream Luminous) to RHCS 4.x (based on upstream Nautilus), routine cluster maintenance operations fail to complete as expected.
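
To confirm the cluster's post-upgrade state, the daemon versions and the minimum required OSD release can be checked from a node with the client.admin keyring. This is a diagnostic sketch; exact output varies by cluster:

    # Show which Ceph release each daemon class is running
    ceph versions

    # Show the minimum OSD release the cluster enforces; once a Nautilus
    # upgrade is finalized this is expected to read 'nautilus'
    ceph osd dump | grep require_osd_release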

Attempts to:

  • add/remove OSDs (or OSD nodes)
  • add new pools

Result in one or more of the following symptoms (see the diagnostic commands after this list):

  • PGs stuck in peering, activating, or unknown states that never clear
  • Recovery traffic shown in ceph -s is very low compared to the expected throughput
  • Excessive Ceph process memory consumption and potential 'Out of Memory' (OOM) kill events
  • Ceph OSD log entries showing BADAUTHORIZER
  • Ceph MGR log entries showing 'could not get service secret for service'
  • Ceph cluster logs and ceph -s show a high volume of 'slow request' entries of type delayed, queued for pg, or started
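
These symptoms can be confirmed with standard Ceph diagnostics. The commands below are a sketch, assuming access to a node with the client.admin keyring and default, non-containerized log locations under /var/log/ceph:

    # Overall cluster state: stuck PG counts, recovery throughput, slow requests
    ceph -s
    ceph health detail

    # List PGs stuck in inactive states (peering, activating, unknown)
    ceph pg dump_stuck inactive

    # Search OSD logs for authentication failures (log paths are an
    # assumption; containerized deployments log elsewhere)
    grep BADAUTHORIZER /var/log/ceph/ceph-osd.*.log

    # Search the MGR log for the service-secret errors
    grep 'could not get service secret' /var/log/ceph/ceph-mgr.*.log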

Environment

Red Hat Ceph Storage cluster upgraded from RHCS 3.x to RHCS 4.x.
