Chapter 4. Bug fixes

This section describes notable bug fixes introduced in Red Hat OpenShift Container Storage 4.7.

MGR pod restarts even if the MONs are down

Previously, when the nodes restarted, the MGR pod might get stuck in the pod initialization state, which made it impossible to create new persistent volumes (PVs). With this update, the MGR pod restarts even if the MONs are down.

(BZ#2005515)

Multicloud Object Gateway is now available when hugepages are enabled on OpenShift Container Platform

Previously, the Multicloud Object Gateway (MCG) database (DB) pod crashed because Postgres failed to run on Kubernetes when hugepages were enabled. With this update, hugepages are disabled for the MCG Postgres pods, so the MCG DB pods no longer crash.

(BZ#1968438)

PodDisruptionBudget alert no longer continuously shown

Previously, the PodDisruptionBudget alert, which is an OpenShift Container Platform alert, was continuously shown for object storage devices (OSDs). The underlying issue has been fixed, and the alert is no longer shown.

(BZ#1788126)

must-gather log collection no longer fails

Previously, the copy pod did not retry flushing the data at regular intervals, which caused the must-gather command to fail after the default 10-minute timeout. With this update, the copy pod retries collecting the data generated by the must-gather command at regular intervals, and the must-gather command now runs to completion.

(BZ#1884546)

A PVC can now be created from a volume snapshot in the absence of a volumesnapshotclass

Previously, a PVC could not be created from a volume snapshot in the absence of a volumesnapshotclass because the status of the volume snapshot changed to a not-ready state when the volumesnapshotclass was deleted. This issue has been fixed in OpenShift Container Platform 4.7.0 and higher.
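
For reference, the following is a minimal sketch of restoring a PVC from an existing volume snapshot; the PVC name, snapshot name, storage class, and size shown here are placeholders and are not part of the original fix.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: restored-pvc                              # placeholder PVC name
    spec:
      storageClassName: ocs-storagecluster-ceph-rbd   # example storage class
      dataSource:
        name: my-snapshot                             # placeholder VolumeSnapshot name
        kind: VolumeSnapshot
        apiGroup: snapshot.storage.k8s.io
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi                               # placeholder size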

(BZ#1902711)

Core dumps are now propagated if a process crashes

Previously, core dumps were not propagated if a process crashed. With this release, a log collector, a sidecar container that runs next to the main Ceph daemon, has been introduced. The shareProcessNamespace flag is enabled on these pods, which allows signals to be intercepted between containers so that the core dumps can be generated.
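
As an illustration only, a shared process namespace is enabled at the pod level as in the following sketch; the pod, container, and image names are placeholders, and the real pods are created and managed by Rook.

    apiVersion: v1
    kind: Pod
    metadata:
      name: ceph-daemon-example        # placeholder; real pods are created by Rook
    spec:
      shareProcessNamespace: true      # containers in the pod can see each other's processes
      containers:
        - name: ceph-daemon            # main Ceph daemon container (placeholder image)
          image: example.com/ceph:latest
        - name: log-collector          # sidecar that can observe the daemon's signals and crashes
          image: example.com/log-collector:latest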

(BZ#1904917)

Multiple OSD removal job no longer fails

Previously, when the job for removing multiple OSDs was triggered, the template included the comma-separated OSD IDs in the job name, which made the name invalid and caused the job template to fail. With this update, the OSD IDs have been removed from the job name to maintain a valid format. The job name has been changed from ocs-osd-removal-${FAILED_OSD_IDS} to ocs-osd-removal-job.
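
For context, Kubernetes object names must be valid DNS-1123 subdomains, so a name such as ocs-osd-removal-1,2 is rejected. The following sketch only illustrates the fixed naming; the image and the way the IDs are passed to the container are placeholders, not the exact template shipped with the product.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: ocs-osd-removal-job                        # fixed name with no comma-separated IDs
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: operator
              image: example.com/ocs-operator:latest   # placeholder image
              args: ["--failed-osd-ids=1,2"]           # illustrative only: IDs travel as an argument, not in the name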

(BZ#1908678)

Increased mon failover timeout

Previously, the mons would begin to fail over while they were still coming up. With this update, the mon failover timeout has been increased to 15 minutes on IBM Cloud.

(BZ#1922421)

Rook now refuses to deploy an OSD and shows a message when it detects unclean disks from a previous OpenShift Container Storage installation

Previously, if a disk that had not been cleaned from a previous installation of OpenShift Container Storage was reused, Rook failed abruptly. With this update, Rook detects that the disk belongs to a different cluster and rejects OSD deployment on that disk with an error message.

(BZ#1922954)

mon failover no longer makes Ceph inaccessible

Previously, if a mon went down while another mon was failing over, the mons lost quorum. When mons lose quorum, Ceph becomes inaccessible. This update prevents voluntary mon drains while a mon is failing over, so that Ceph does not become inaccessible.
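
As a general illustration of how voluntary drains can be blocked in Kubernetes (this is a generic PodDisruptionBudget sketch with placeholder names, not necessarily the exact object that Rook creates), a budget with maxUnavailable set to 0 makes the eviction API refuse to drain the selected pods:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: block-mon-drain-example    # placeholder name
    spec:
      maxUnavailable: 0                # no voluntary evictions allowed for the matching pods
      selector:
        matchLabels:
          app: rook-ceph-mon           # label used by the mon pods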

(BZ#1935065)

cephcsi node plugin pods no longer occupy ports for GRPC metrics

Previously, the cephcsi pods exposed GRPC metrics for debugging purposes, so the cephcsi node plugin pods used port 9090 for RBD and port 9091 for CephFS. As a result, the cephcsi pods failed to come up when those ports were unavailable. With this release, GRPC metrics are disabled by default because they are required only for debugging, and cephcsi no longer uses ports 9090 and 9091 on the nodes where the node plugin pods run.
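
For reference, upstream Rook exposes a toggle for this behavior in its operator configuration. The following sketch assumes the CSI_ENABLE_GRPC_METRICS key documented by upstream Rook; in OpenShift Container Storage this setting is managed by the operator rather than edited by hand.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: rook-ceph-operator-config    # upstream Rook operator configuration
      namespace: openshift-storage
    data:
      CSI_ENABLE_GRPC_METRICS: "false"   # assumed upstream key; GRPC metrics stay off unless debugging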

(BZ#1937245)

rook-ceph-mds now registers the pod IP on the monitor servers

Previously, rook-ceph-mds did not register the pod IP on the monitor servers, so every mount on the filesystem timed out and PVCs could not be provisioned, resulting in CephFS volume provisioning failures. With this release, the --public-addr=podIP argument is added to the MDS pod when the host network is not enabled, and CephFS volume provisioning no longer fails.
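
As a rough illustration of how the pod IP can be wired into that argument (the real MDS pods are generated by Rook; the pod name, image, and command layout below are placeholders):

    apiVersion: v1
    kind: Pod
    metadata:
      name: rook-ceph-mds-example          # placeholder; real MDS pods are created by Rook
    spec:
      hostNetwork: false                   # the argument is only added when host networking is disabled
      containers:
        - name: mds
          image: example.com/ceph:latest   # placeholder image
          command: ["ceph-mds"]
          args:
            - --public-addr=$(POD_IP)      # registers the pod IP with the monitors
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP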

(BZ#1939272)

Errors no longer occur in must-gather due to failed rule evaluation

Previously, the recording rule record: cluster:ceph_disk_latency:join_ceph_node_disk_irate1m was not evaluated because many-to-many matching is not allowed in Prometheus. As a result, the failed rule evaluation caused errors in must-gather and in the deployment. With this release, the query for the recording rule has been updated to eliminate the many-to-many match scenarios, so the Prometheus rule evaluations no longer fail and no errors are seen in the deployment.
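
For background, the following is a generic sketch with hypothetical metric and rule names, not the actual OpenShift Container Storage rule: a vector match fails as many-to-many when both sides can have several series per match group, and it is typically resolved by deduplicating one side and declaring the join direction.

    groups:
      - name: example.rules                      # hypothetical rule group
        rules:
          - record: example:disk_latency:join    # hypothetical recording rule
            expr: |
              # Deduplicate the right-hand side so each (instance, device) pair has
              # at most one series, then declare the join direction with group_left
              # so the match is many-to-one instead of the disallowed many-to-many.
              example_disk_latency_seconds
                * on (instance, device) group_left ()
              topk by (instance, device) (1, example_disk_metadata)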

(BZ#1904302)