Troubleshooting OpenShift Container Storage
How to troubleshoot errors and issues in OpenShift Container Storage
storage-docs@redhat.com
Abstract
Chapter 1. Overview
Troubleshooting OpenShift Container Storage is written to help administrators understand how to troubleshoot and fix their Red Hat OpenShift Container Storage cluster.
Most troubleshooting tasks focus on either a fix or a workaround. This document is divided into chapters based on the errors that an administrator may encounter:
- Chapter 2, Downloading log files and diagnostic information using must-gather shows you how to use the must-gather utility in OpenShift Container Storage.
- Chapter 3, Commonly required logs for troubleshooting shows you how to obtain commonly required log files for OpenShift Container Storage.
- Chapter 5, Troubleshooting alerts and errors in OpenShift Container Storage shows you how to identify the encountered error and perform required actions.
Chapter 2. Downloading log files and diagnostic information using must-gather
If Red Hat OpenShift Container Storage is unable to automatically resolve a problem, use the must-gather tool to collect log files and diagnostic information so that you or Red Hat support can review the problem and determine a solution.
Procedure
Run the
must-gather
command from the client connected to the Openshift Container Storage cluster:$ oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.4 --dest-dir=<directory-name> --node-name=<node-name>
where <node-name> is any master node in Ready state.
Note--node-name
is optional and needs to be used specifically for situations when either one or more worker nodes are not in Ready state.This collects the following information in the specified directory:
- Collects all OpenShift Container Storage cluster related Custom Resources (CRs) with their namespaces.
- Collects pod logs of all the OpenShift Container Storage related pods.
- Collects output of some standard Ceph commands like Status, Cluster health, and others.
Chapter 3. Commonly required logs for troubleshooting
Some of the commonly used logs for troubleshooting OpenShift Container Storage are listed, along with the commands to generate them.
Generating logs for a specific pod:
$ oc logs <pod-name> -n <namespace>
Generating logs for Ceph or OpenShift Container Storage cluster:
$ oc logs rook-ceph-operator-<ID> -n openshift-storage
Generating logs for plugin pods like cephfs or rbd to detect any problem in the PVC mount of the app-pod:
$ oc logs csi-cephfsplugin-<ID> -n openshift-storage -c csi-cephfsplugin
$ oc logs csi-rbdplugin-<ID> -n openshift-storage -c csi-rbdplugin
To generate logs for all the containers in the CSI pod:
$ oc logs csi-cephfsplugin-<ID> -n openshift-storage --all-containers
$ oc logs csi-rbdplugin-<ID> -n openshift-storage --all-containers
Generating logs for cephfs or rbd provisioner pods to detect problems if PVC is not in BOUND state:
$ oc logs csi-cephfsplugin-provisioner-<ID> -n openshift-storage -c csi-cephfsplugin
$ oc logs csi-rbdplugin-provisioner-<ID> -n openshift-storage -c csi-rbdplugin
To generate logs for all the containers in the CSI pod:
$ oc logs csi-cephfsplugin-provisioner-<ID> -n openshift-storage --all-containers
$ oc logs csi-rbdplugin-provisioner-<ID> -n openshift-storage --all-containers
Generating OpenShift Container Storage logs using cluster-info command:
$ oc cluster-info dump -n openshift-storage --output-directory=<directory-name>
Additional resources
Chapter 4. Overriding the cluster-wide default node selector for OpenShift Container Storage post deployment
When a cluster-wide default node selector is used for Openshift Container Storage, the pods generated by CSI daemonsets are able to start only on the nodes that match the selector. To be able to use Openshift Container Storage from nodes which do not match the selector, override the cluster-wide default node selector
by performing the following steps in the command line interface :
Procedure
Specify a blank node selector for the openshift-storage namespace.
$ oc annotate namespace openshift-storage openshift.io/node-selector=
Delete the original pods generated by the DaemonSets.
oc delete pod -l app=csi-cephfsplugin -n openshift-storage oc delete pod -l app=csi-rbdplugin -n openshift-storage
Chapter 5. Troubleshooting alerts and errors in OpenShift Container Storage
5.1. Resolving alerts and errors
Red Hat OpenShift Container Storage can detect and automatically resolve a number of common failure scenarios. However, some problems require administrator intervention.
To know the errors currently firing, check one of the following locations:
- Monitoring → Alerting → Firing option
- Home → Overview → Cluster tab
- Home → Overview → Persistent Storage tab
- Home → Overview → Object Service tab
Copy the error displayed and search it in the following section to know its severity and resolution:
Name:
Message:
Description: Severity: Warning Resolution: Fix Procedure: Inspect the user interface and log, and verify if an update is in progress.
|
Name:
Message:
Description: Severity: Warning Resolution: Fix Procedure: Inspect the user interface and log, and verify if an update is in progress.
|
Name:
Message:
Description: Severity: Crtical Resolution: Fix Procedure: Remove unnecessary data or expand the cluster. |
Name:
Fixed:
Description: Severity: Warning Resolution: Fix Procedure: Remove unnecessary data or expand the cluster. |
Name:
Message:
Description: Severity: Warning Resolution: Workaround Procedure: Resolving NooBaa Bucket Error State |
Name:
Message:
Description: Severity: Warning Resolution: Fix |
Name:
Message:
Description: Severity: Warning Resolution: Fix |
Name:
Message:
Description: Severity: Warning Resolution: Fix |
Name:
Message:
Description: Severity: Warning Resolution: Fix |
Name:
Message:
Description: Severity: Warning Resolution: Workaround Procedure: Resolving NooBaa Bucket Error State |
Name:
Message:
Description: Severity: Warning Resolution: Fix |
Name:
Message:
Description: Severity: Warning Resolution: Fix |
Name:
Message:
Description: Severity: Warning Resolution: Fix |
Name:
Message: Description: `Minimum required replicas for storage metadata service not available. Might affect the working of storage cluster.` Severity: Warning Resolution: Contact Red Hat support Procedure:
|
Name:
Message:
Description: Severity: Critical Resolution: Contact Red Hat support Procedure:
|
Name:
Message:
Description: Severity: Critical Resolution: Contact Red Hat support Procedure:
|
Name:
Message:
Description: Severity: Critical Resolution: Contact Red Hat support Procedure:
|
Name:
Message:
Description: Severity: Warning Resolution: Contact Red Hat support Procedure:
|
Name:
Message:
Description: Severity: Warning Resolution: Contact Red Hat support |
Name:
Message:
Description: Severity: Critical Resolution: Contact Red Hat support |
Name:
Message:
Description: Severity: Critical Resolution: Contact Red Hat support |
Name:
Message:
Description: Severity: Warning Resolution: Contact Red Hat support |
Name:
Message:
Description: Severity: Warning Resolution: Contact Red Hat support |
Name:
Message:
Description: Severity: Critical Resolution: Contact Red Hat support |
5.2. Resolving NooBaa Bucket Error State
Procedure
- Log in to OpenShift Web Console and click Object Service.
- In the Details card, click the link under System Name field.
- In the left pane, click Buckets option and search for the bucket in error state.
- Click on it’s Bucket Name. Error encountered in bucket is displayed.
Depending on the specific error of the bucket, perform one or both of the following:
For space related errors:
- In the left pane, click Resources option.
- Click on the resource in error state.
- Scale the resource by adding more agents.
For resource health errors:
- In the left pane, click Resources option.
- Click on the resource in error state.
- Connectivity error means the backing service is not available and needs to be restored.
- For access/permissions errors, update the connection’s Access Key and Secret Key.
5.3. Resolving NooBaa Bucket Exceeding Quota State
To resolve A NooBaa Bucket Is In Exceeding Quota State error perform one of the following:
- Cleanup some of the data on the bucket.
Increase the bucket quota by performing the following steps:
- Log in to OpenShift Web Console and click Object Service.
- In the Details card, click the link under System Name field.
- In the left pane, click Buckets option and search for the bucket in error state.
- Click on it’s Bucket Name. Error encountered in bucket is displayed.
- Click Bucket Policies → Edit Quota and increase the quota.
5.4. Resolving NooBaa Bucket Capacity or Quota State
Procedure
- Log in to OpenShift Web Console and click Object Service.
- In the Details card, click the link under System Name field.
- In the left pane, click Resources option and search for the PV pool resource.
- For the PV pool resource with low capacity status, click on it’s Resource Name.
- Edit the pool configuration and increase the number of agents.
5.5. Recovering pods
When a first node (say NODE1
) goes to NotReady state because of some issue, the hosted pods that are using PVC with ReadWriteOnce (RWO) access mode try to move to the second node (say NODE2
) but get stuck due to multi-attach error. In such a case, you can recover MON, OSD, and application pods by using the following steps.
Procedure
-
Power off
NODE1
(from AWS or vSphere side) and ensure thatNODE1
is completely down. Force delete the pods on
NODE1
by using the following command:$ oc delete pod <pod-name> --grace-period=0 --force
5.6. Recovering from EBS volume detach
When an OSD or MON elastic block storage (EBS) volume where the OSD disk resides is detached from the worker Amazon EC2 instance, the volume gets reattached automatically within one or two minutes. However, the OSD pod gets into a CrashLoopBackOff
state. To recover and bring back the pod to Running
state, you must restart the EC2 instance.
Chapter 6. Checking for Local Storage Operator deployments
OpenShift Container Storage clusters with Local Storage Operator are deployed using local storage devices. To find out if your existing cluster with OpenShift Container Storage was deployed using local storage devices, use the following procedure:
Prerequisites
-
OpenShift Container Storage is installed and running in the
openshift-storage
namespace.
Procedure
By checking the storage class associated with your OpenShift Container Storage cluster’s persistent volume claims (PVCs), you can tell if your cluster was deployed using local storage devices.
Check the storage class associated with OpenShift Container Storage cluster’s PVCs with the following command:
$ oc get pvc -n openshift-storage
Check the output. For clusters with Local Storage Operator, the PVCs associated with
ocs-deviceset
use the storage classlocalblock
. The output looks similar to the following:NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE db-noobaa-db-0 Bound pvc-d96c747b-2ab5-47e2-b07e-1079623748d8 50Gi RWO ocs-storagecluster-ceph-rbd 114s ocs-deviceset-0-0-lzfrd Bound local-pv-7e70c77c 1769Gi RWO localblock 2m10s ocs-deviceset-1-0-7rggl Bound local-pv-b19b3d48 1769Gi RWO localblock 2m10s ocs-deviceset-2-0-znhk8 Bound local-pv-e9f22cdc 1769Gi RWO localblock 2m10s
Additional Resources