Chapter 6. Troubleshooting alerts and errors in OpenShift Data Foundation

6.1. Resolving alerts and errors

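Red Hat OpenShift Data Foundation can detect and automatically resolve many common failure scenarios. However, some problems require administrator intervention.

To see which errors are currently firing, check one of the following locations: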

  • Observe → Alerting → Firing option
  • Home → Overview → Cluster tab
  • Storage → Data Foundation → Storage Systems → storage system link in the pop up → Overview → Block and File tab
  • Storage → Data Foundation → Storage Systems → storage system link in the pop up → Overview → Object tab

Copy the error displayed and search for it in the following section to find its severity and resolution:

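Name: CephMonVersionMismatch

Message: There are multiple versions of storage services running.

Description: There are {{ $value }} different versions of Ceph Mon components running.

Severity: Warning

Resolution: Fix

Procedure: Inspect the user interface and log, and verify whether an update is in progress.

  • If an update is in progress, this alert is temporary.
  • If no update is in progress, restart the upgrade process.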

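Name: CephOSDVersionMismatch

Message: There are multiple versions of storage services running.

Description: There are {{ $value }} different versions of Ceph OSD components running.

Severity: Warning

Resolution: Fix

Procedure: Inspect the user interface and log, and verify whether an update is in progress.

  • If an update is in progress, this alert is temporary.
  • If no update is in progress, restart the upgrade process.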
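
The `{{ $value }}` placeholder in the two version-mismatch descriptions above is the number of distinct component versions running. As a minimal sketch of that idea, assuming a hypothetical mapping from daemon name to version string (the daemon names and version numbers below are invented for illustration and are not read from a cluster):

```python
def count_distinct_versions(daemon_versions):
    """Count how many different version strings appear across the daemons."""
    return len(set(daemon_versions.values()))


def has_version_mismatch(daemon_versions):
    """A mismatch exists when more than one distinct version is running."""
    return count_distinct_versions(daemon_versions) > 1


# Hypothetical example data: two monitors already updated, one not yet.
mons = {"mon.a": "17.2.6", "mon.b": "17.2.6", "mon.c": "16.2.10"}
print(count_distinct_versions(mons), has_version_mismatch(mons))
```

During an update the count is expected to be above 1 for a while, which is why the procedure treats the alert as temporary in that case.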

Name: CephClusterCriticallyFull

Message: Storage cluster is critically full and needs immediate expansion

Description: Storage cluster utilization has crossed 85%.

Severity: Critical

Resolution: Fix

Procedure: Remove unnecessary data or expand the cluster.

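Name: CephClusterNearFull

Message: Storage cluster is nearing full. Expansion is required.

Description: Storage cluster utilization has crossed 75%.

Severity: Warning

Resolution: Fix

Procedure: Remove unnecessary data or expand the cluster.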
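
The two capacity alerts above differ only in their utilization threshold (75% and 85%). A minimal sketch of that mapping follows; the function name is hypothetical, and it returns only the most severe matching alert, which is a simplification for illustration rather than a description of how the monitoring stack evaluates its rules:

```python
def cluster_fullness_alert(utilization_percent):
    """Return the name of the most severe capacity alert for a utilization value, or None."""
    if utilization_percent > 85:
        return "CephClusterCriticallyFull"
    if utilization_percent > 75:
        return "CephClusterNearFull"
    return None


for value in (50, 80, 90):
    print(value, cluster_fullness_alert(value))
```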

Name: NooBaaBucketErrorState

Message: A NooBaa Bucket Is In Error State

Description: A NooBaa bucket {{ $labels.bucket_name }} is in error state for more than 6m

Severity: Warning

Resolution: Workaround

Procedure: Resolving NooBaa Bucket Error State
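
Several descriptions in this section include a hold time such as `6m`, `10m`, or `15s`: the condition must persist for longer than that before the alert fires. The sketch below illustrates the idea with two hypothetical helpers, `parse_duration` and `has_persisted`; it handles only the simple `<number><unit>` forms that appear in this section and is not the monitoring stack's own implementation.

```python
import re

# Seconds per unit for the duration suffixes used in this section.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert a duration such as '6m' or '15s' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def has_persisted(first_seen, now, hold):
    """True when a condition first seen at `first_seen` has lasted more than `hold`.

    `first_seen` and `now` are timestamps in seconds.
    """
    return now - first_seen > parse_duration(hold)


print(parse_duration("6m"), has_persisted(0, 400, "6m"))
```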

Name: NooBaaNamespaceResourceErrorState

Message: A NooBaa Namespace Resource Is In Error State

Description: A NooBaa namespace resource {{ $labels.namespace_resource_name }} is in error state for more than 5m

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Error State

Name: NooBaaNamespaceBucketErrorState

Message: A NooBaa Namespace Bucket Is In Error State

Description: A NooBaa namespace bucket {{ $labels.bucket_name }} is in error state for more than 5m

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Error State

Name: NooBaaBucketExceedingQuotaState

Message: A NooBaa Bucket Is In Exceeding Quota State

Description: A NooBaa bucket {{ $labels.bucket_name }} is exceeding its quota - {{ printf "%0.0f" $value }}% used

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Exceeding Quota State

Name: NooBaaBucketLowCapacityState

Message: A NooBaa Bucket Is In Low Capacity State

Description: A NooBaa bucket {{ $labels.bucket_name }} is using {{ printf "%0.0f" $value }}% of its capacity

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State
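
The descriptions in this section are templates: `{{ $labels.bucket_name }}` is replaced by the bucket's name, and `{{ printf "%0.0f" $value }}` by the measured value rounded to a whole number. As a rough illustration of what the rendered text looks like, the following sketch reproduces the NooBaaBucketLowCapacityState description with Python string formatting. The function name and the sample bucket name are hypothetical, and Python's `%` operator stands in for the template engine here.

```python
def render_low_capacity_description(bucket_name, value):
    """Render the low-capacity description with the value shown without decimals."""
    return "A NooBaa bucket %s is using %0.0f%% of its capacity" % (bucket_name, value)


print(render_low_capacity_description("bucket-a", 87.4))
```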

Name: NooBaaBucketNoCapacityState

Message: A NooBaa Bucket Is In No Capacity State

Description: A NooBaa bucket {{ $labels.bucket_name }} is using all of its capacity

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

Name: NooBaaBucketReachingQuotaState

Message: A NooBaa Bucket Is In Reaching Quota State

Description: A NooBaa bucket {{ $labels.bucket_name }} is using {{ printf "%0.0f" $value }}% of its quota

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

Name: NooBaaResourceErrorState

Message: A NooBaa Resource Is In Error State

Description: A NooBaa resource {{ $labels.resource_name }} is in error state for more than 6m

Severity: Warning

Resolution: Workaround

Procedure: Resolving NooBaa Bucket Error State

Name: NooBaaSystemCapacityWarning100

Message: A NooBaa System Approached Its Capacity

Description: A NooBaa system approached its capacity, usage is at 100%

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

Name: NooBaaSystemCapacityWarning85

Message: A NooBaa System Is Approaching Its Capacity

Description: A NooBaa system is approaching its capacity, usage is more than 85%

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

Name: NooBaaSystemCapacityWarning95

Message: A NooBaa System Is Approaching Its Capacity

Description: A NooBaa system is approaching its capacity, usage is more than 95%

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State
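
The three NooBaaSystemCapacityWarning alerts above form a ladder of usage thresholds (more than 85%, more than 95%, and 100%). The sketch below maps a usage percentage to the highest rung reached. The function name is hypothetical, and reporting only the highest rung is a simplification for illustration.

```python
def noobaa_capacity_alert(usage_percent):
    """Return the highest NooBaa system capacity warning reached, or None."""
    if usage_percent >= 100:
        return "NooBaaSystemCapacityWarning100"
    if usage_percent > 95:
        return "NooBaaSystemCapacityWarning95"
    if usage_percent > 85:
        return "NooBaaSystemCapacityWarning85"
    return None


for usage in (70, 90, 97, 100):
    print(usage, noobaa_capacity_alert(usage))
```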

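Name: CephMdsMissingReplicas

Message: Insufficient replicas for storage metadata service.

Description: Minimum required replicas for storage metadata service not available. Might affect the working of storage cluster.

Severity: Warning

Resolution: Contact Red Hat Support

Procedure:

  1. Check for alerts and operator status.
  2. If the issue cannot be identified, contact Red Hat Support.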

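Name: CephMgrIsAbsent

Message: Storage metrics collector service not available anymore.

Description: Ceph Manager has disappeared from Prometheus target discovery.

Severity: Critical

Resolution: Contact Red Hat Support

Procedure:

  1. Inspect the user interface and log, and verify whether an update is in progress.

    • If an update is in progress, this alert is temporary.
    • If no update is in progress, restart the upgrade process.
  2. Once the upgrade is complete, check for alerts and operator status.
  3. If the issue persists or cannot be identified, contact Red Hat Support.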

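Name: CephNodeDown

Message: Storage node {{ $labels.node }} went down

Description: Storage node {{ $labels.node }} went down. Please check the node immediately.

Severity: Critical

Resolution: Contact Red Hat Support

Procedure:

  1. Check which node stopped functioning and determine the cause.
  2. Take appropriate action to recover the node. If the node cannot be recovered, contact Red Hat Support.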

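Name: CephClusterErrorState

Message: Storage cluster is in error state

Description: Storage cluster is in error state for more than 10m.

Severity: Critical

Resolution: Contact Red Hat Support

Procedure:

  1. Check for alerts and operator status.
  2. If the issue cannot be identified, download log files and diagnostic information using must-gather.
  3. Open a support ticket with Red Hat Support and attach the output of must-gather.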

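Name: CephClusterWarningState

Message: Storage cluster is in degraded state

Description: Storage cluster is in warning state for more than 10m.

Severity: Warning

Resolution: Contact Red Hat Support

Procedure:

  1. Check for alerts and operator status.
  2. If the issue cannot be identified, download log files and diagnostic information using must-gather.
  3. Open a support ticket with Red Hat Support and attach the output of must-gather.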

Name: CephDataRecoveryTakingTooLong

Message: Data recovery is slow

Description: Data recovery has been active for too long.

Severity: Warning

Resolution: Contact Red Hat Support

Name: CephOSDDiskNotResponding

Message: Disk not responding

Description: Disk device {{ $labels.device }} not responding, on host {{ $labels.host }}.

Severity: Critical

Resolution: Contact Red Hat Support

Name: CephOSDDiskUnavailable

Message: Disk not accessible

Description: Disk device {{ $labels.device }} not accessible on host {{ $labels.host }}.

Severity: Critical

Resolution: Contact Red Hat Support

Name: CephPGRepairTakingTooLong

Message: Self heal problems detected

Description: Self heal operations taking too long.

Severity: Warning

Resolution: Contact Red Hat Support

Name: CephMonHighNumberOfLeaderChanges

Message: Storage Cluster has seen many leader changes recently.

Description: 'Ceph Monitor "{{ $labels.job }}": instance {{ $labels.instance }} has seen {{ $value | printf "%.2f" }} leader changes per minute recently.'

Severity: Warning

Resolution: Contact Red Hat Support

Name: CephMonQuorumAtRisk

Message: Storage quorum at risk

Description: Storage cluster quorum is low.

Severity: Critical

Resolution: Contact Red Hat Support

Name: ClusterObjectStoreState

Message: Cluster Object Store is in unhealthy state. Please check Ceph cluster health.

Description: Cluster Object Store is in unhealthy state for more than 15s. Please check Ceph cluster health.

Severity: Critical

Resolution: Contact Red Hat Support

Procedure:

Name: CephOSDFlapping

Message: Storage daemon osd.x has restarted 5 times in the last 5 minutes. Please check the pod events or Ceph status to find out the cause.

Description: Storage OSD restarts more than 5 times in 5 minutes.

Severity: Critical

Resolution: Contact Red Hat Support
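
The flapping condition above ("restarts more than 5 times in 5 minutes") is a count over a sliding time window. The following sketch shows that check on a plain list of restart timestamps. The function name and parameters are hypothetical; in a real cluster the monitoring stack evaluates the alert, not user code.

```python
def is_flapping(restart_times, now, window_seconds=300, max_restarts=5):
    """True when more than `max_restarts` restarts fall within the last `window_seconds`.

    `restart_times` and `now` are timestamps in seconds.
    """
    recent = [t for t in restart_times if 0 <= now - t <= window_seconds]
    return len(recent) > max_restarts


# Six restarts inside a five-minute window count as flapping.
print(is_flapping([0, 60, 120, 180, 240, 290], now=300))
```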

Name: OdfPoolMirroringImageHealth

Message: Mirroring image(s) (PV) in the pool <pool-name> are in Warning state for more than a 1m. Mirroring might not work as expected.

Description: Disaster recovery is failing for one or a few applications.

Severity: Warning

Resolution: Contact Red Hat Support

Name: OdfMirrorDaemonStatus

Message: Mirror daemon is unhealthy.

Description: Disaster recovery is failing for the entire cluster. Mirror daemon is in unhealthy status for more than 1m. Mirroring on this cluster is not working as expected.

Severity: Critical

Resolution: Contact Red Hat Support
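
Because every entry in this section has the same fields, the copy-and-search workflow it describes can also be sketched as a lookup table. The example below is a hand-copied subset of the entries above; the names `ALERTS` and `lookup` are hypothetical and are not part of any OpenShift Data Foundation API.

```python
# Hypothetical subset of this section: alert name -> (severity, resolution).
ALERTS = {
    "CephMonVersionMismatch": ("Warning", "Fix"),
    "CephClusterCriticallyFull": ("Critical", "Fix"),
    "NooBaaBucketErrorState": ("Warning", "Workaround"),
    "CephNodeDown": ("Critical", "Contact Red Hat Support"),
    "OdfMirrorDaemonStatus": ("Critical", "Contact Red Hat Support"),
}


def lookup(alert_name):
    """Return (severity, resolution) for a known alert, or None if it is not listed."""
    return ALERTS.get(alert_name.strip())


print(lookup("CephNodeDown"))
```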