Menu Close
Settings Close

Language and Page Formatting Options

Chapter 6. Troubleshooting alerts and errors in OpenShift Data Foundation

6.1. Resolving alerts and errors

Red Hat OpenShift Data Foundation can detect and automatically resolve a number of common failure scenarios. However, some problems require administrator intervention.

To know the errors currently firing, check one of the following locations:

  • ObserveAlertingFiring option
  • HomeOverviewCluster tab
  • StorageOpenshift Data FoundationStorage Systemstorage system link in the pop up → OverviewBlock and File tab
  • StorageOpenshift Data FoundationStorage Systemstorage system link in the pop up → OverviewObject tab

Copy the error displayed and search it in the following section to know its severity and resolution:

Name: CephMonVersionMismatch

Message: There are multiple versions of storage services running.

Description: There are {{ $value }} different versions of Ceph Mon components running.

Severity: Warning

Resolution: Fix

Procedure: Inspect the user interface and log, and verify if an update is in progress.

  • If an update in progress, this alert is temporary.
  • If an update is not in progress, restart the upgrade process.

Name: CephOSDVersionMismatch

Message: There are multiple versions of storage services running.

Description: There are {{ $value }} different versions of Ceph OSD components running.

Severity: Warning

Resolution: Fix

Procedure: Inspect the user interface and log, and verify if an update is in progress.

  • If an update in progress, this alert is temporary.
  • If an update is not in progress, restart the upgrade process.

Name: CephClusterCriticallyFull

Message: Storage cluster is critically full and needs immediate expansion

Description: Storage cluster utilization has crossed 85%.

Severity: Crtical

Resolution: Fix

Procedure: Remove unnecessary data or expand the cluster.

Name: CephClusterNearFull

Fixed: Storage cluster is nearing full. Expansion is required.

Description: Storage cluster utilization has crossed 75%.

Severity: Warning

Resolution: Fix

Procedure: Remove unnecessary data or expand the cluster.

Name: NooBaaBucketErrorState

Message: A NooBaa Bucket Is In Error State

Description: A NooBaa bucket {{ $labels.bucket_name }} is in error state for more than 6m

Severity: Warning

Resolution: Workaround

Procedure: Resolving NooBaa Bucket Error State

Name: NooBaaNamespaceResourceErrorState

Message: A NooBaa Namespace Resource Is In Error State

Description: A NooBaa namespace resource {{ $labels.namespace_resource_name }} is in error state for more than 5m

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Error State

Name: NooBaaNamespaceBucketErrorState

Message: A NooBaa Namespace Bucket Is In Error State

Description: A NooBaa namespace bucket {{ $labels.bucket_name }} is in error state for more than 5m

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Error State

Name: NooBaaBucketExceedingQuotaState

Message: A NooBaa Bucket Is In Exceeding Quota State

Description: A NooBaa bucket {{ $labels.bucket_name }} is exceeding its quota - {{ printf "%0.0f" $value }}% used message: A NooBaa Bucket Is In Exceeding Quota State

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Exceeding Quota State

Name: NooBaaBucketLowCapacityState

Message: A NooBaa Bucket Is In Low Capacity State

Description: A NooBaa bucket {{ $labels.bucket_name }} is using {{ printf "%0.0f" $value }}% of its capacity

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

Name: NooBaaBucketNoCapacityState

Message: A NooBaa Bucket Is In No Capacity State

Description: A NooBaa bucket {{ $labels.bucket_name }} is using all of its capacity

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

Name: NooBaaBucketReachingQuotaState

Message: A NooBaa Bucket Is In Reaching Quota State

Description: A NooBaa bucket {{ $labels.bucket_name }} is using {{ printf "%0.0f" $value }}% of its quota

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

Name: NooBaaResourceErrorState

Message: A NooBaa Resource Is In Error State

Description: A NooBaa resource {{ $labels.resource_name }} is in error state for more than 6m

Severity: Warning

Resolution: Workaround

Procedure: Resolving NooBaa Bucket Error State

Name: NooBaaSystemCapacityWarning100

Message: A NooBaa System Approached Its Capacity

Description: A NooBaa system approached its capacity, usage is at 100%

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

Name: NooBaaSystemCapacityWarning85

Message: A NooBaa System Is Approaching Its Capacity

Description: A NooBaa system is approaching its capacity, usage is more than 85%

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

Name: NooBaaSystemCapacityWarning95

Message: A NooBaa System Is Approaching Its Capacity

Description: A NooBaa system is approaching its capacity, usage is more than 95%

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

Name: CephMdsMissingReplicas

Message: Insufficient replicas for storage metadata service.

Description: `Minimum required replicas for storage metadata service not available.

Might affect the working of storage cluster.`

Severity: Warning

Resolution: Contact Red Hat support

Procedure:

  1. Check for alerts and operator status.
  2. If the issue cannot be identified, contact Red Hat support.

Name: CephMgrIsAbsent

Message: Storage metrics collector service not available anymore.

Description: Ceph Manager has disappeared from Prometheus target discovery.

Severity: Critical

Resolution: Contact Red Hat support

Procedure:

  1. Inspect the user interface and log, and verify if an update is in progress.

    • If an update in progress, this alert is temporary.
    • If an update is not in progress, restart the upgrade process.
  2. Once the upgrade is complete, check for alerts and operator status.
  3. If the issue persistents or cannot be identified, contact Red Hat support.

Name: CephNodeDown

Message: Storage node {{ $labels.node }} went down

Description: Storage node {{ $labels.node }} went down. Please check the node immediately.

Severity: Critical

Resolution: Contact Red Hat support

Procedure:

  1. Check which node stopped functioning and its cause.
  2. Take appropriate actions to recover the node. If node cannot be recovered:

Name: CephClusterErrorState

Message: Storage cluster is in error state

Description: Storage cluster is in error state for more than 10m.

Severity: Critical

Resolution: Contact Red Hat support

Procedure:

  1. Check for alerts and operator status.
  2. If the issue cannot be identified, download log files and diagnostic information using must-gather.
  3. Open a Support Ticket with Red Hat Support with an attachment of the output of must-gather.

Name: CephClusterWarningState

Message: Storage cluster is in degraded state

Description: Storage cluster is in warning state for more than 10m.

Severity: Warning

Resolution: Contact Red Hat support

Procedure:

  1. Check for alerts and operator status.
  2. If the issue cannot be identified, download log files and diagnostic information using must-gather.
  3. Open a Support Ticket with Red Hat Support with an attachment of the output of must-gather.

Name: CephDataRecoveryTakingTooLong

Message: Data recovery is slow

Description: Data recovery has been active for too long.

Severity: Warning

Resolution: Contact Red Hat support

Name: CephOSDDiskNotResponding

Message: Disk not responding

Description: Disk device {{ $labels.device }} not responding, on host {{ $labels.host }}.

Severity: Critical

Resolution: Contact Red Hat support

Name: CephOSDDiskUnavailable

Message: Disk not accessible

Description: Disk device {{ $labels.device }} not accessible on host {{ $labels.host }}.

Severity: Critical

Resolution: Contact Red Hat support

Name: CephPGRepairTakingTooLong

Message: Self heal problems detected

Description: Self heal operations taking too long.

Severity: Warning

Resolution: Contact Red Hat support

Name: CephMonHighNumberOfLeaderChanges

Message: Storage Cluster has seen many leader changes recently.

Description: 'Ceph Monitor "{{ $labels.job }}": instance {{ $labels.instance }} has seen {{ $value printf "%.2f" }} leader changes per minute recently.'

Severity: Warning

Resolution: Contact Red Hat support

Name: CephMonQuorumAtRisk

Message: Storage quorum at risk

Description: Storage cluster quorum is low.

Severity: Critical

Resolution: Contact Red Hat support

Name: ClusterObjectStoreState

Message: Cluster Object Store is in unhealthy state. Please check Ceph cluster health.

Description: Cluster Object Store is in unhealthy state for more than 15s. Please check Ceph cluster health.

Severity: Critical

Resolution: Contact Red Hat support

Procedure:

Name: CephOSDFlapping

Message: Storage daemon osd.x has restarted 5 times in the last 5 minutes. Please check the pod events or Ceph status to find out the cause.

Description: Storage OSD restarts more than 5 times in 5 minutes.

Severity: Critical

Resolution: Contact Red Hat support

Name: OdfPoolMirroringImageHealth

Message: Mirroring image(s) (PV) in the pool <pool-name> are in Warning state for more than a 1m. Mirroring might not work as expected.

Description: Disaster recovery is failing for one or a few applications.

Severity: Warning

Resolution: Contact Red Hat support

Name: OdfMirrorDaemonStatus

Message: Mirror daemon is unhealthy.

Description: Disaster recovery is failing for the entire cluster. Mirror daemon is in unhealthy status for more than 1m. Mirroring on this cluster is not working as expected.

Severity: Critical

Resolution: Contact Red Hat support

6.2. Resolving cluster health issues

There is a finite set of possible health messages that a Red Hat Ceph Storage cluster can raise that show in the OpenShift Data Foundation user interface. These are defined as health checks which have unique identifiers. The identifier is a terse pseudo-human-readable string that is intended to enable tools to make sense of health checks, and present them in a way that reflects their meaning. Click the health code below for more information and troubleshooting.

Health codeDescription

MON_DISK_LOW

One or more Ceph Monitors are low on disk space.

6.2.1. MON_DISK_LOW

This alert triggers if the available space on the file system storing the monitor database as a percentage, drops below mon_data_avail_warn (default: 15%). This may indicate that some other process or user on the system is filling up the same file system used by the monitor. It may also indicate that the monitor’s database is large.

Note

The paths to the file system differ depending on the deployment of your mons. You can find the path to where the mon is deployed in storagecluster.yaml.

Example paths:

  • Mon deployed over PVC path: /var/lib/ceph/mon
  • Mon deployed over hostpath: /var/lib/rook/mon

In order to clear up space, view the high usage files in the file system and choose which to delete. To view the files, run:

# du -a <path-in-the-mon-node> |sort -n -r |head -n10

Replace <path-in-the-mon-node> with the path to the file system where mons are deployed.

6.3. Resolving NooBaa Bucket Error State

Procedure

  1. In the OpenShift Web Console, click StorageOpenShift Data Foundation.
  2. In the Status card of the Overview tab, click Storage System and then click the storage system link from the pop up that appears.
  3. Click the Object tab.
  4. In the Details card, click the link under System Name field.
  5. In the left pane, click Buckets option and search for the bucket in error state. If the bucket in error state is a namespace bucket, be sure to click the Namespace Buckets pane.
  6. Click on it’s Bucket Name. Error encountered in bucket is displayed.
  7. Depending on the specific error of the bucket, perform one or both of the following:

    1. For space related errors:

      1. In the left pane, click Resources option.
      2. Click on the resource in error state.
      3. Scale the resource by adding more agents.
    2. For resource health errors:

      1. In the left pane, click Resources option.
      2. Click on the resource in error state.
      3. Connectivity error means the backing service is not available and needs to be restored.
      4. For access/permissions errors, update the connection’s Access Key and Secret Key.

6.4. Resolving NooBaa Bucket Exceeding Quota State

To resolve A NooBaa Bucket Is In Exceeding Quota State error perform one of the following:

  • Cleanup some of the data on the bucket.
  • Increase the bucket quota by performing the following steps:

    1. In the OpenShift Web Console, click StorageOpenShift Data Foundation.
    2. In the Status card of the Overview tab, click Storage System and then click the storage system link from the pop up that appears.
    3. Click the Object tab.
    4. In the Details card, click the link under System Name field.
    5. In the left pane, click Buckets option and search for the bucket in error state.
    6. Click on its Bucket Name. Error encountered in bucket is displayed.
    7. Click Bucket PoliciesEdit Quota and increase the quota.

6.5. Resolving NooBaa Bucket Capacity or Quota State

Procedure

  1. In the OpenShift Web Console, click StorageOpenShift Data Foundation.
  2. In the Status card of the Overview tab, click Storage System and then click the storage system link from the pop up that appears.
  3. Click the Object tab.
  4. In the Details card, click the link under System Name field.
  5. In the left pane, click the Resources option and search for the PV pool resource.
  6. For the PV pool resource with low capacity status, click on it’s Resource Name.
  7. Edit the pool configuration and increase the number of agents.

6.6. Recovering pods

When a first node (say NODE1) goes to NotReady state because of some issue, the hosted pods that are using PVC with ReadWriteOnce (RWO) access mode try to move to the second node (say NODE2) but get stuck due to multi-attach error. In such a case, you can recover MON, OSD, and application pods by using the following steps.

Procedure

  1. Power off NODE1 (from AWS or vSphere side) and ensure that NODE1 is completely down.
  2. Force delete the pods on NODE1 by using the following command:

    $ oc delete pod <pod-name> --grace-period=0 --force

6.7. Recovering from EBS volume detach

When an OSD or MON elastic block storage (EBS) volume where the OSD disk resides is detached from the worker Amazon EC2 instance, the volume gets reattached automatically within one or two minutes. However, the OSD pod gets into a CrashLoopBackOff state. To recover and bring back the pod to Running state, you must restart the EC2 instance.

6.8. Enabling and disabling debug logs for rook-ceph-operator

Enable the debug logs for the rook-ceph-operator to obtain information about failures that help in troubleshooting issues.

Procedure

Enabling the debug logs
  1. Edit the configmap of the rook-ceph-operator.

    $ oc edit configmap rook-ceph-operator-config
  2. Add the ROOK_LOG_LEVEL: DEBUG parameter in the rook-ceph-operator-config yaml file to enable the debug logs for rook-ceph-operator.

    …
    data:
      # The logging level for the operator: INFO | DEBUG
      ROOK_LOG_LEVEL: DEBUG

    Now, the rook-ceph-operator logs consist of the debug information.

Disabling the debug logs
  1. Edit the configmap of the rook-ceph-operator.

    $ oc edit configmap rook-ceph-operator-config
  2. Add the ROOK_LOG_LEVEL: INFO parameter in the rook-ceph-operator-config yaml file to disable the debug logs for rook-ceph-operator.

    …
    data:
      # The logging level for the operator: INFO | DEBUG
      ROOK_LOG_LEVEL: INFO