ocs-operator pod fails to reconcile the cluster because StorageClassName is set to null


Environment

  • Red Hat OpenShift Container Platform (RHOCP) 4
  • Red Hat OpenShift Container Storage (OCS) 4.3+

Issue

  • After the OCS cluster is upgraded, the noobaa-endpoint pod restarts repeatedly and ends up in the CrashLoopBackOff state.
  • The OCS cluster is upgraded to the next version, but the Ceph cluster is not upgraded.
  • ocs-operator fails to reconcile the cluster because StorageClassName is set to null in the StorageClassDeviceSet.

Resolution

  • Take a backup of StorageCluster:
$ oc get StorageCluster ocs-storagecluster -oyaml > StorageCluster.yaml
  • Modify the StorageCluster to explicitly specify the desired StorageClass for creating OSD PVCs by replacing the null value of storageClassName with the name of that StorageClass:
$ oc edit StorageCluster ocs-storagecluster
         <...>
         spec:
           storageDeviceSets:
           - config: {}
             count: 1
             dataPVCTemplate:
               metadata:
                 creationTimestamp: null
               spec:
                 accessModes:
                 - ReadWriteOnce
                 resources:
                   requests:
                     storage: 2Ti
                 storageClassName: null                   <----------------- Change here
                 volumeMode: Block
               status: {}
             name: ocs-deviceset
             placement: {}
             portable: true
             replica: 3
             resources: {}
           version: 4.3.0
         <...>
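
Where an interactive edit is inconvenient, the same change could be applied as a JSON patch. The following is a minimal sketch and not part of the original solution. It assumes the StorageCluster lives in the openshift-storage namespace (the namespace shown in the operator log under Diagnostic Steps) and that the affected device set is the first entry (index 0). The function names build_storageclass_patch and patch_storagecluster are hypothetical helpers introduced here for illustration.

```shell
# Hypothetical helper: print a JSON patch that sets storageClassName on one
# StorageDeviceSet. Arguments: <storage-class-name> [device-set-index, default 0].
# The "add" operation is used because it also overwrites an existing member.
build_storageclass_patch() {
  sc_name="$1"
  set_index="${2:-0}"
  printf '[{"op":"add","path":"/spec/storageDeviceSets/%s/dataPVCTemplate/spec/storageClassName","value":"%s"}]\n' \
    "$set_index" "$sc_name"
}

# Hypothetical wrapper: apply the patch to the StorageCluster.
# The namespace is an assumption; adjust it if your cluster differs.
patch_storagecluster() {
  oc patch storagecluster ocs-storagecluster -n openshift-storage \
    --type json -p "$(build_storageclass_patch "$@")"
}
```

Run `oc get storageclass` to list the available StorageClasses, then call `patch_storagecluster <storage-class-name>` with the one intended for the OSD PVCs. Take the StorageCluster backup first, as described above.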

Root Cause

  • OCS v4.2 did not validate this field, so StorageCluster creation went through even without a StorageClassName. OCS v4.3 introduced the check in the ocs-operator, which then refuses to reconcile a StorageCluster that fails it.

  • The OCS management console incorrectly set an empty string when a StorageClass was selected while deploying the OCS cluster. The bug fix introduced a check that rejects an empty string as the StorageClassName for the StorageClassDeviceSet.

  • The issue was identified as a bug in RHOCP v4.3 and tracked by the Red Hat Engineering team under BZ-1812448.

  • The bug has been fixed in RHOCP v4.4 and later backported to RHOCP v4.3 as per Errata RHBA-2020:1437. If this issue still occurs after updating, open a support case in the Red Hat Customer Portal referring to this solution.

Diagnostic Steps

  • Check the ocs-operator logs for the following error:
$ oc logs ocs-operator-<pod-suffix> | grep 'no StorageClass specified'
2020-06-11T16:16:53.986834199Z{"level":"error","ts":"2020-06-11T16:16:53.986Z","logger":"controller_storagecluster","msg":"Failed to validate StorageDeviceSets","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","error":"failed to validate StorageDeviceSet 0: no StorageClass specified", ...}
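
To confirm which device sets are affected, the exported StorageCluster can be scanned for an unset storageClassName. The following is a minimal sketch, assuming the export was saved as StorageCluster.yaml as in the Resolution section. The function name find_unset_storage_class is a hypothetical helper, and the pattern treats only null, ~, empty quotes, or a missing value as unset.

```shell
# Hypothetical helper: list lines (with line numbers) in an exported
# StorageCluster YAML where storageClassName is null, ~, "", '' or blank.
# Exits non-zero when no such line is found (standard grep behaviour).
find_unset_storage_class() {
  grep -nE "^[[:space:]]*storageClassName:[[:space:]]*(null|~|\"\"|'')?[[:space:]]*\$" "$1"
}
```

For example, `find_unset_storage_class StorageCluster.yaml` prints each offending line with its line number, and prints nothing when every device set names a StorageClass.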

