OCS operator in CrashLoopBackOff state after deployment

Solution Verified

Environment

  • Red Hat OpenShift Data Foundation 4.x

Issue

  • The ocs-operator pod is in CrashLoopBackOff state after a new ODF deployment
  • The ocs-operator CSV remains in Installing state

Resolution

  • Enable the CSISnapshot Capability in the cluster
  • Review the YAML of the clusterversion after adding CSISnapshot to additionalEnabledCapabilities
$ oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2023-10-22T21:22:59Z"
    generation: 6
    name: version
    resourceVersion: "179277544"
    uid: 2xxxx3-exx5-4xxf-axx6-exxxxxxxx5
  spec:
    capabilities:
      additionalEnabledCapabilities:
      - CSISnapshot
      baselineCapabilitySet: None
...
...

  status:
    availableUpdates: null
    capabilities:
      enabledCapabilities:
      - CSISnapshot
      - Console
      - marketplace
      - openshift-samples
      knownCapabilities:
      - CSISnapshot
..
      - marketplace
      - openshift-samples
  • Review the status of the ocs-operator pod after enabling the CSISnapshot capability
$ oc get po | grep ocs-operator
ocs-operator-5f7ffb7765-7r4l7                                     1/1     Running             3088 (3d22h ago)   19d
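
One way to enable the capability is a merge patch against the ClusterVersion object. The following is a minimal sketch, assuming the object is named version (as in the YAML output above) and that additionalEnabledCapabilities has no other entries yet. The variable PATCH and the function enable_csisnapshot are illustrative names, not part of the original solution.

```shell
# Merge patch that sets spec.capabilities.additionalEnabledCapabilities to [CSISnapshot].
# Assumption: the list is currently empty. A merge patch replaces the whole list,
# so include any capabilities that are already listed there.
PATCH='{"spec":{"capabilities":{"additionalEnabledCapabilities":["CSISnapshot"]}}}'

# Hypothetical helper: applies the patch to the ClusterVersion named "version".
# Requires cluster-admin privileges.
enable_csisnapshot() {
  oc patch clusterversion version --type merge -p "$PATCH"
}
```

Call enable_csisnapshot once, then re-run oc get clusterversion -o yaml and confirm that CSISnapshot appears under status.capabilities.enabledCapabilities.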

Root Cause

  • The 'VolumeSnapshotClass' API was not available in the cluster, so the ocs-operator could not access it.
  • The VolumeSnapshot CRD was missing from the cluster because the CSISnapshot capability was disabled at install time.
  • This caused the operator to enter the CrashLoopBackOff (CLBO) state.
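
To confirm this root cause on a cluster, it can help to check whether the snapshot API group is served at all. The following is a minimal sketch, assuming the output of oc api-resources --api-group=snapshot.storage.k8s.io -o name is piped in. The function name check_snapshot_api is illustrative.

```shell
# Hypothetical helper: reads resource names (one per line) from stdin and
# reports whether the VolumeSnapshotClass resource is among them.
check_snapshot_api() {
  if grep -q '^volumesnapshotclasses\.snapshot\.storage\.k8s\.io$'; then
    echo "VolumeSnapshotClass API is served"
  else
    echo "VolumeSnapshotClass API is missing"
    return 1
  fi
}
```

On a live cluster this could be used as: oc api-resources --api-group=snapshot.storage.k8s.io -o name | check_snapshot_api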

Diagnostic Steps

  • Check the state of the ocs-operator pod
$ oc get pods -n openshift-storage | grep operator
noobaa-operator-d484cdd-574lf                                     1/1     Running             0              17h
ocs-operator-6cd7cc845c-2kvz8                                     0/1     ContainerCreating   0              3s
odf-operator-controller-manager-7656c9d4fb-tmjrh                  2/2     Running             0              12d
rook-ceph-operator-6f8c69f9bf-5vnrl                         
  • Review the ocs-operator pod logs for the error: no matches for kind "VolumeSnapshotClass" in version "snapshot.storage.k8s.io/v1"
{"level":"error","ts":"2023-12-12T21:50:52Z","logger":"controllers.StorageCluster","msg":"Failed to 'Get' SnapshotClass.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","SnapshotClass":{"name":"ocs-storagecluster-cephfsplugin-snapclass"},"error":"no matches for kind \"VolumeSnapshotClass\" in version \"snapshot.storage.k8s.io/v1\"","stacktrace":"github.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).createSnapshotClasses\n\t/remote-source/app/controllers/storagecluster/volumesnapshotterclasses.go:116\ngithub.com/red-hat-storage/ocs-
...
(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235"}

{"level":"error","ts":"2023-12-12T21:50:54Z","logger":"controller-runtime.source","msg":"if kind is a CRD, it should be installed before calling Start","kind":"VolumeSnapshotClass.snapshot.storage.k8s.io","error":"no matches for kind \"VolumeSnapshotClass\" in version \"snapshot.storage.k8s.io/v1\""
..
source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:547\nsigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:136"}
  • Describe the ocs-operator pod to check its state
$ oc describe pod ocs-operator-6cd7cc845c-2kvz8 -n openshift-storage
..

Command:
      ocs-operator
    Args:
      --enable-leader-election
      --health-probe-bind-address=:8081
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
...
Events:
  Type     Reason      Age                     From     Message
  ----     ------      ----                    ----     -------
  Warning  BackOff     8m (x13983 over 2d23h)  kubelet  Back-off restarting failed container ocs-operator in pod ocs-operator-6cd7cc845c-2kvz8_openshift-storage(4d98fe56-59e3-4f66-8840-f0a1d5984260)
  Warning  ProbeError  53s (x398 over 2d23h)   kubelet  Readiness probe error: Get "http://10.129.0.115:8081/readyz": dial tcp 10.129.0.115:8081: connect: connection refused
body:
  • Review the YAML of the clusterversion to check which capabilities are enabled
$ oc get clusterversion -o yaml

---snip---
  status:
    availableUpdates: null
    capabilities:
      enabledCapabilities:
      - Console
      - marketplace
      - openshift-samples
      knownCapabilities:
      - CSISnapshot
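
The capability check above can also be scripted. The following is a minimal sketch, assuming the enabled capabilities are supplied as a space-separated list. The function name has_capability is illustrative, and the sample list is the one from the output above.

```shell
# Hypothetical helper: succeeds if capability $1 is in the space-separated list on stdin.
has_capability() {
  tr ' ' '\n' | grep -qx -- "$1"
}

# Example with the enabledCapabilities list shown above (CSISnapshot is absent).
if echo "Console marketplace openshift-samples" | has_capability CSISnapshot; then
  echo "CSISnapshot is enabled"
else
  echo "CSISnapshot is not enabled"
fi
```

On a live cluster the list could be produced with: oc get clusterversion version -o jsonpath='{.status.capabilities.enabledCapabilities[*]}'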

