Chapter 7. Known issues
This section describes known issues in Red Hat OpenShift Container Storage 4.7.
RGW metrics are no longer available if the active MGR changes in the RHCS cluster
When the active MGR goes down in external cluster mode, OpenShift Container Platform (OCP) stops collecting metrics from the Red Hat Ceph Storage (RHCS) cluster, even after the MGR comes back online. This means RADOS Object Gateway (RGW) metrics are no longer collected once the connection to the current active MGR is lost.
For Red Hat OpenShift Container Storage 4.7, the workaround is as follows:
Once the external RHCS cluster has an active MGR again, run the Python script ceph-external-cluster-details-exporter.py once more and collect the JSON output file. On the OCP side, update the external secret named rook-ceph-external-cluster-details with the contents of the newly collected JSON file. This triggers a reconciliation, and OCP starts picking up the metrics again.
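As a rough sketch of the secret update, the following Python snippet wraps the exporter's JSON output in a Secret manifest. The data key external_cluster_details and the openshift-storage namespace are assumptions that this section does not state; check them against the existing secret before applying anything.

```python
import base64
import json

SECRET_NAME = "rook-ceph-external-cluster-details"  # name given in this section
NAMESPACE = "openshift-storage"        # assumption: default OCS namespace
DATA_KEY = "external_cluster_details"  # assumption: verify on the existing secret


def build_secret_manifest(exporter_json: str) -> dict:
    """Wrap the exporter's JSON output in a Kubernetes Secret manifest."""
    json.loads(exporter_json)  # fail early if the output is not valid JSON
    encoded = base64.b64encode(exporter_json.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": SECRET_NAME, "namespace": NAMESPACE},
        "data": {DATA_KEY: encoded},
    }


if __name__ == "__main__":
    # Hypothetical, shortened stand-in for the real exporter output.
    sample = '[{"name": "example-entry", "kind": "ConfigMap", "data": {}}]'
    print(json.dumps(build_secret_manifest(sample), indent=2))
```

The printed manifest could be saved to a file and applied with oc apply -f, which should trigger the reconciliation described above.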
OSD keys in Vault are not deleted during OpenShift Container Storage cluster uninstallation
Currently, Key Encryption Keys for OSDs are only soft-deleted from Vault during OpenShift Container Storage cluster deletion when version 2 of the Vault Key/Value (K/V) secrets engine API is used for cluster-wide encryption with KMS. This means the key metadata is still visible, and any version of the key can be retrieved.
Workaround: Manually delete the metadata for the key using the vault kv metadata delete command.
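When several OSD keys are left behind, the deletion can be scripted. The sketch below builds one vault kv metadata delete invocation per key path and only prints the commands by default. The key paths are hypothetical placeholders; list the real paths in your Vault backend first, and make sure the vault CLI is already authenticated.

```python
import subprocess


def metadata_delete_commands(key_paths):
    """Return one `vault kv metadata delete` argv per key path."""
    return [["vault", "kv", "metadata", "delete", path] for path in key_paths]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    paths = ["ocs/example-osd-key-1", "ocs/example-osd-key-2"]
    run_all(metadata_delete_commands(paths))
```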
MDS reports oversized cache
Previously, Rook did not apply mds_cache_memory_limit during upgrades. This means OpenShift Container Storage 4.2 clusters that did not have that option applied were not updated with the correct value, which is typically half the size of the pod’s memory limit. Therefore, MDSs in standby-replay may report an oversized cache.
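To illustrate the "half of the pod's memory limit" rule of thumb, the following sketch computes the expected mds_cache_memory_limit in bytes from a memory limit string. The parser is a simplification that only handles whole numbers with binary suffixes (Ki, Mi, Gi, Ti) or plain byte counts; it is not a full Kubernetes quantity parser.

```python
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def memory_limit_bytes(quantity: str) -> int:
    """Convert a quantity such as "8Gi", or a plain byte count, to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def expected_mds_cache_limit(pod_memory_limit: str) -> int:
    """Half of the pod memory limit, following the rule of thumb above."""
    return memory_limit_bytes(pod_memory_limit) // 2


if __name__ == "__main__":
    print(expected_mds_cache_limit("8Gi"))  # 4294967296
```

For an MDS pod with an 8Gi memory limit, the expected cache limit is therefore 4 GiB.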
Storage cluster phase is Ready when both flexibleScaling and arbiter are enabled
A storage cluster CR that enables both arbiter and flexible scaling has an incorrect specification. Despite this, the user sees the storage cluster in the READY state, even though logs or messages contain the error "arbiter and flexibleScaling both can not be enabled". This does not affect functionality.
Arbiter nodes cannot be labelled with the OpenShift Container Storage node label
Arbiter nodes are treated as valid non-arbiter nodes if they are labelled with the OpenShift Container Storage node label, cluster.ocs.openshift.io/openshift-storage. This means the placement of the non-arbiter resources becomes undetermined. To work around this issue, do not label the arbiter nodes with the OpenShift Container Storage node label, so that only arbiter resources are placed on the arbiter nodes.
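A quick check for this condition might look like the sketch below. It takes a mapping of node names to labels (for example, extracted from oc get nodes -o json) and the names of the nodes you have reserved for the arbiter, and reports arbiter nodes that carry the storage label. The node names are hypothetical, and how you identify arbiter nodes is left to you.

```python
STORAGE_LABEL = "cluster.ocs.openshift.io/openshift-storage"


def mislabelled_arbiter_nodes(node_labels, arbiter_nodes):
    """Return arbiter node names that carry the storage node label.

    node_labels maps node name -> dict of labels;
    arbiter_nodes is an iterable of node names reserved for the arbiter.
    """
    return sorted(
        name for name in arbiter_nodes
        if STORAGE_LABEL in node_labels.get(name, {})
    )


if __name__ == "__main__":
    # Hypothetical cluster with one wrongly labelled arbiter node.
    labels = {
        "worker-0": {STORAGE_LABEL: ""},
        "worker-1": {STORAGE_LABEL: ""},
        "arbiter-0": {STORAGE_LABEL: ""},
    }
    print(mislabelled_arbiter_nodes(labels, ["arbiter-0"]))  # ['arbiter-0']
```

Any node this reports should have the storage label removed.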
noobaa-db-pg-0 pod does not migrate to other nodes when the hosting node goes down
NooBaa will not work when a node is down, as migration of the noobaa-db-pg-0 pod is blocked.
Clone operations with a greater size than the parent PVC result in an endless loop
Ceph CSI does not support restoring a snapshot or creating clones with a size greater than the parent PVC. Therefore, clone operations with a greater size result in an endless loop. To work around this issue, delete the pending PVC. To get a larger PVC, complete one of the following based on the operation you are using:
- If using Snapshots, restore the existing snapshot to create a volume of the same size as the parent PVC, then attach it to a pod and expand the PVC to the required size. For more information, see Volume snapshots.
- If using Clone, clone the parent PVC to create a volume of the same size as the parent PVC, then attach it to a pod and expand the PVC to the required size. For more information, see Volume cloning.
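The two-step workaround above can be sketched as a pair of manifests: first a PVC created at the parent's size from a clone or snapshot source, then a patch that expands it. All names, the storage class, and the access mode below are hypothetical and must be adapted to your cluster.

```python
import json


def same_size_pvc(name, parent_size, storage_class, source_kind, source_name):
    """Build a PVC manifest that clones or restores at the parent's size."""
    data_source = {"kind": source_kind, "name": source_name}
    if source_kind == "VolumeSnapshot":
        # Snapshots live in their own API group; PVC sources do not need one.
        data_source["apiGroup"] = "snapshot.storage.k8s.io"
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "storageClassName": storage_class,
            "dataSource": data_source,
            "accessModes": ["ReadWriteOnce"],  # assumption: match the parent PVC
            "resources": {"requests": {"storage": parent_size}},
        },
    }


def expansion_patch(required_size):
    """Merge patch that grows the PVC after it is attached to a pod."""
    return {"spec": {"resources": {"requests": {"storage": required_size}}}}


if __name__ == "__main__":
    clone = same_size_pvc("example-clone", "10Gi", "example-storage-class",
                          "PersistentVolumeClaim", "example-parent")
    print(json.dumps(clone, indent=2))
    print(json.dumps(expansion_patch("20Gi")))
```

The key point is that the initial request matches the parent's size exactly; the larger size is only requested afterwards through expansion.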
Ceph status is HEALTH_WARN after disk replacement
After disk replacement, the warning "1 daemons have recently crashed" is seen even if all OSD pods are up and running. This warning changes the Ceph status to HEALTH_WARN when it should be HEALTH_OK. To work around this issue, rsh to the ceph-tools pod and silence the warning. The Ceph health then returns to HEALTH_OK.
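One common way to clear a "recently crashed" warning in Ceph is to archive the crash reports. The sketch below builds the corresponding oc rsh invocations; it assumes that archiving is an acceptable way to silence the warning in your environment, that the tools pod runs in the openshift-storage namespace, and it uses a hypothetical pod name. Review the crash list before archiving anything.

```python
def crash_silence_commands(tools_pod, namespace="openshift-storage"):
    """Return oc argv lists that list, archive, and re-check crash reports."""
    rsh = ["oc", "-n", namespace, "rsh", tools_pod]
    return [
        rsh + ["ceph", "crash", "ls"],           # review the recorded crashes
        rsh + ["ceph", "crash", "archive-all"],  # archive them to clear the warning
        rsh + ["ceph", "health"],                # confirm the resulting status
    ]


if __name__ == "__main__":
    # Hypothetical pod name; look up the real ceph-tools pod first.
    for argv in crash_silence_commands("rook-ceph-tools-example"):
        print(" ".join(argv))
```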
Device replacement action cannot be performed through the user interface for an encrypted OpenShift Container Storage cluster
On an encrypted OpenShift Container Storage cluster, the discovery result CR discovers the device backed by a Ceph OSD (Object Storage Daemon) differently from the one reported in the Ceph alerts. When clicking the alert, the user is presented with a Disk not found message. Due to the mismatch, the console UI cannot enable the disk replacement option for an OpenShift Container Storage user. To work around this issue, use the CLI procedure for failed device replacement in the Replacing Devices guide.
Newly restored PVC cannot be mounted
A newly restored PVC cannot be mounted if some of the OCP nodes are running a Red Hat Enterprise Linux version earlier than 8.2 and the snapshot from which it was restored has been deleted. To avoid this issue, do not delete the snapshot from which the PVC is restored until the restored PVC is deleted.
The status of the disk is replacement ready before start replacement is clicked
The user interface cannot differentiate between a new disk failure, on the same or a different node, and the previously failed disk if both disks have the same name. Because of this, disk replacement is not allowed, as the user interface considers the newly failed disk to be already replaced. To work around this issue, follow these steps:
- On the OpenShift Container Platform Web Console, click Administrator.
- Click Home → Search.
- In the resources dropdown, search for TemplateInstance and make sure to choose the openshift-storage namespace.
- Delete all template instances.
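The same cleanup can likely be done from the command line instead of the web console. The sketch below builds the oc invocations to list and then delete all TemplateInstance objects in the openshift-storage namespace; it is an alternative that this section does not document, so treat it as an assumption and review the list before deleting.

```python
def template_instance_cleanup_commands(namespace="openshift-storage"):
    """Return oc argv lists that list, then delete, all template instances."""
    return [
        ["oc", "get", "templateinstance", "-n", namespace],
        ["oc", "delete", "templateinstance", "--all", "-n", namespace],
    ]


if __name__ == "__main__":
    for argv in template_instance_cleanup_commands():
        print(" ".join(argv))
```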