After upgrade from RHCS 3 to RHCS 4, cannot add pools, or add/remove OSDs - PGs stuck peering / unknown
Issue
After an upgrade from Red Hat Ceph Storage (RHCS) 3.x (based on upstream Luminous) to RHCS 4.x (based on upstream Nautilus), typical cluster maintenance activities do not go as planned.
Attempts to:
- add/remove OSDs (or OSD nodes)
- add new Pools
Result in:
- PGs stuck in `peering`, `activating`, or `unknown` states that never clear
- Cluster 'recovery' traffic shown in `ceph -s` is very low compared to the expected throughput
- Excessive Ceph process memory consumption and potential 'Out of Memory' kill events
- Ceph OSD log entries showing `BADAUTHORIZER`
- Ceph MGR log entries showing 'could not get service secret for service'
- Ceph cluster logs and `ceph -s` show a high volume of 'slow request' entries of type `delayed`, `queued for pg`, or `started`
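The symptoms above can be confirmed with standard Ceph diagnostic commands. A minimal triage sketch, run from a MON or admin node with a working admin keyring (log paths assume the default `/var/log/ceph/` location):

```shell
# Overall cluster state: PG states, recovery throughput, and slow request counts
ceph -s

# List PGs stuck in inactive states (covers peering/activating/unknown)
ceph pg dump_stuck inactive

# Detailed health output, including slow request details
ceph health detail

# Search OSD logs for authentication failures
grep -r 'BADAUTHORIZER' /var/log/ceph/ceph-osd.*.log

# Search the MGR log for service secret errors
grep 'could not get service secret' /var/log/ceph/ceph-mgr.*.log
```

If several of these checks match at once after an RHCS 3.x to 4.x upgrade, the cluster is likely exhibiting the condition described in this article.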
Environment
Red Hat Ceph Storage cluster, after being upgraded from RHCS 3.x to RHCS 4.x.