Chapter 1. Initial Troubleshooting

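As a storage administrator, you can do the initial troubleshooting of a Red Hat Ceph Storage cluster before contacting Red Hat support. This chapter describes how to identify problems, diagnose the health of the storage cluster, mute health alerts, read the Ceph logs, and generate an sos report.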

Prerequisites

  • A running Red Hat Ceph Storage cluster.

1.1. Identifying problems

To determine the possible causes of an error in the Red Hat Ceph Storage cluster, answer the questions in the Procedure section.

Prerequisites

  • A running Red Hat Ceph Storage cluster.

Procedure

  1. Certain problems can arise when using unsupported configurations. Ensure that your configuration is supported.
  2. Do you know which Ceph component causes the problem?

    1. No. Follow the Diagnosing the health of a Ceph storage cluster procedure in the Red Hat Ceph Storage Troubleshooting Guide.
    2. Ceph Monitors. See the Troubleshooting Ceph Monitors section in the Red Hat Ceph Storage Troubleshooting Guide.
    3. Ceph OSDs. See the Troubleshooting Ceph OSDs section in the Red Hat Ceph Storage Troubleshooting Guide.
    4. Ceph placement groups. See the Troubleshooting Ceph placement groups section in the Red Hat Ceph Storage Troubleshooting Guide.
    5. Multi-site Ceph Object Gateway. See the Troubleshooting a multi-site Ceph Object Gateway section in the Red Hat Ceph Storage Troubleshooting Guide.
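
The decision in step 2 amounts to a lookup from the suspected component to a section of the Troubleshooting Guide. The following is a minimal Python sketch of that lookup. The function name `next_section` and the dictionary keys are hypothetical labels chosen for illustration and are not part of any Ceph tooling; only the section titles come from the procedure above.

```python
# Section titles are copied from the procedure above. The keys
# ("mon", "osd", "pg", "rgw-multisite") are hypothetical labels.
SECTIONS = {
    "mon": "Troubleshooting Ceph Monitors",
    "osd": "Troubleshooting Ceph OSDs",
    "pg": "Troubleshooting Ceph placement groups",
    "rgw-multisite": "Troubleshooting a multi-site Ceph Object Gateway",
}
FALLBACK = "Diagnosing the health of a Ceph storage cluster"


def next_section(component=None):
    """Return the Troubleshooting Guide section to read next."""
    # An unknown or unspecified component falls back to the general diagnosis.
    return SECTIONS.get(component, FALLBACK)
```

Calling `next_section("osd")` returns the OSD section title, and calling it with no argument returns the general diagnosis procedure.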

1.2. Diagnosing the health of a storage cluster

This procedure lists basic steps to diagnose the health of a Red Hat Ceph Storage cluster.

Prerequisites

  • A running Red Hat Ceph Storage cluster.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. Check the overall status of the storage cluster:

    Example

    [ceph: root@host01 /]# ceph health detail

    If the command returns HEALTH_WARN or HEALTH_ERR, see Understanding Ceph health for details.

  3. Monitor the logs of the storage cluster:

    Example

    [ceph: root@host01 /]# ceph -W cephadm

  4. To capture the logs of the cluster to a file, run the following commands:

    Example

    [ceph: root@host01 /]# ceph config set global log_to_file true
    [ceph: root@host01 /]# ceph config set global mon_cluster_log_to_file true

    The logs are located by default in the /var/log/ceph/CLUSTER_FSID/ directory. Check the Ceph logs for any error messages listed in Understanding Ceph logs.

  5. If the logs do not include a sufficient amount of information, increase the debugging level and try to reproduce the action that failed. See Configuring logging for details.
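
After logging to files is enabled, checking the logs for error messages can be partly automated. The following Python sketch scans every `*.log` file in a directory for warning and error lines. It assumes that the lines of interest carry a `[WRN]` or `[ERR]` severity tag, as the health output elsewhere in this chapter does; the function name `find_problem_lines` is hypothetical, and the sketch is not a replacement for reading the logs.

```python
import re
from pathlib import Path

# Assumption: relevant log lines contain a severity tag such as [WRN] or [ERR].
SEVERITY = re.compile(r"\[(WRN|ERR)\]")


def find_problem_lines(log_dir):
    """Yield (file name, line) pairs for warning and error lines in *.log files."""
    for path in sorted(Path(log_dir).glob("*.log")):
        # errors="replace" keeps the scan going if a log holds undecodable bytes.
        with path.open(errors="replace") as handle:
            for line in handle:
                if SEVERITY.search(line):
                    yield path.name, line.rstrip("\n")
```

Pass the log directory as the argument, for example the /var/log/ceph/CLUSTER_FSID/ directory with CLUSTER_FSID replaced by the actual FSID of the storage cluster.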

1.3. Understanding Ceph health

The ceph health command returns information about the status of the Red Hat Ceph Storage cluster:

  • HEALTH_OK indicates that the cluster is healthy.
  • HEALTH_WARN indicates a warning. In some cases, the Ceph status returns to HEALTH_OK automatically, for example, when the Red Hat Ceph Storage cluster finishes the rebalancing process. However, consider further troubleshooting if the cluster stays in the HEALTH_WARN state for a longer time.
  • HEALTH_ERR indicates a more serious problem that requires your immediate attention.

Use the ceph health detail and ceph -s commands to get a more detailed output.
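
When the detailed output has to be processed by a script, it can be split into the overall status and the individual health checks. The following Python sketch assumes the plain-text layout shown in the example output in section 1.4: a first line that starts with the status, one `[SEVERITY] CODE: summary` line per check, and indented detail lines below each check. The function name `parse_health_detail` and the `SAMPLE` text are illustrative.

```python
import re

# Matches check lines such as "[WRN] OSD_DOWN: 1 osds down".
CHECK_LINE = re.compile(r"^\[(?P<sev>\w+)\]\s+(?P<code>[A-Z0-9_]+):\s*(?P<summary>.*)$")


def parse_health_detail(text):
    """Split `ceph health detail` text into an overall status and its checks."""
    lines = [line for line in text.splitlines() if line.strip()]
    # The first word of the first line is the overall status.
    status = lines[0].split()[0] if lines else "UNKNOWN"
    checks = []
    for line in lines[1:]:
        match = CHECK_LINE.match(line)
        if match:
            checks.append({"code": match["code"], "severity": match["sev"],
                           "summary": match["summary"], "detail": []})
        elif checks:
            # Indented lines belong to the most recent check.
            checks[-1]["detail"].append(line.strip())
    return status, checks


# Shortened sample in the layout assumed above.
SAMPLE = """\
HEALTH_WARN 1 osds down
[WRN] OSD_DOWN: 1 osds down
    osd.1 (root=default,host=host01) is down
"""
```

Calling `parse_health_detail(SAMPLE)` yields the status `HEALTH_WARN` and a single check with the code `OSD_DOWN`, which is the health check code used when muting alerts.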

Note

A health warning is displayed if no mgr daemon is running. If the last mgr daemon of a Red Hat Ceph Storage cluster was removed, you can manually deploy a mgr daemon on a random host of the Red Hat Ceph Storage cluster. See Manually deploying a mgr daemon in the Red Hat Ceph Storage 6 Administration Guide.

1.4. Muting health alerts of a Ceph cluster

In certain scenarios, you might want to temporarily mute some warnings because you are already aware of them and cannot act on them right away. You can mute health checks so that they do not affect the overall reported status of the Ceph cluster.

Alerts are specified using the health check codes. For example, when an OSD is brought down for maintenance, OSD_DOWN warnings are expected. You can mute the warning until the maintenance is over, because otherwise it keeps the cluster in HEALTH_WARN instead of HEALTH_OK for the entire duration of the maintenance.

Most health mutes also disappear if the extent of an alert gets worse. For example, if there is one OSD down, and the alert is muted, the mute disappears if one or more additional OSDs go down. This is true for any health alert that involves a count indicating how much or how many of something is triggering the warning or error.
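
The count rule can be illustrated with a small model. The following Python sketch is not Ceph code. The `Mute` class and the `mute_still_applies` function are hypothetical names, and the sketch models only the rule described above: a mute holds as long as the count is no higher than it was when the alert was muted.

```python
from dataclasses import dataclass


@dataclass
class Mute:
    code: str            # health check code, for example OSD_DOWN
    count_at_mute: int   # how many items triggered the alert when it was muted


def mute_still_applies(mute, current_count):
    """Return True while the alert is no worse than when it was muted.

    Only the count rule is modeled here. Expiry of a mute and clearing
    of the alert itself are deliberately left out.
    """
    return current_count <= mute.count_at_mute
```

With `Mute("OSD_DOWN", 1)`, a current count of 1 keeps the mute in place, while a current count of 2 means that the mute no longer applies and the warning is reported again.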

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to the nodes.
  • A health warning message.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. Check the health of the Red Hat Ceph Storage cluster by running the ceph health detail command:

    Example

    [ceph: root@host01 /]# ceph health detail
    
    HEALTH_WARN 1 osds down; 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
    [WRN] OSD_DOWN: 1 osds down
        osd.1 (root=default,host=host01) is down
    [WRN] OSD_FLAGS: 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
        osd.1 has flags noup

    You can see that the storage cluster is in HEALTH_WARN status because one of the OSDs is down.

  3. Mute the alert:

    Syntax

    ceph health mute HEALTH_MESSAGE

    Example

    [ceph: root@host01 /]# ceph health mute OSD_DOWN

  4. Optional: A health check mute can have a time to live (TTL) associated with it, such that the mute automatically expires after the specified period of time has elapsed. Specify the TTL as an optional duration argument in the command:

    Syntax

    ceph health mute HEALTH_MESSAGE DURATION

    DURATION can be specified in s, sec, m, min, h, or hour.

    Example

    [ceph: root@host01 /]# ceph health mute OSD_DOWN 10m

    In this example, the alert OSD_DOWN is muted for 10 minutes.

  5. Verify that the Red Hat Ceph Storage cluster status has changed to HEALTH_OK:

    Example

    [ceph: root@host01 /]# ceph -s
      cluster:
        id:     81a4597a-b711-11eb-8cb8-001a4a000740
        health: HEALTH_OK
                (muted: OSD_DOWN(9m) OSD_FLAGS(9m))
    
      services:
        mon: 3 daemons, quorum host01,host02,host03 (age 33h)
        mgr: host01.pzhfuh(active, since 33h), standbys: host02.wsnngf, host03.xwzphg
        osd: 11 osds: 10 up (since 4m), 11 in (since 5d)
    
      data:
        pools:   1 pools, 1 pgs
        objects: 13 objects, 0 B
        usage:   85 MiB used, 165 GiB / 165 GiB avail
        pgs:     1 active+clean

    In this example, you can see that the alerts OSD_DOWN and OSD_FLAGS are muted and that the mutes remain active for nine more minutes.

  6. Optional: You can retain the mute even after the alert is cleared by making it sticky.

    Syntax

    ceph health mute HEALTH_MESSAGE DURATION --sticky

    Example

    [ceph: root@host01 /]# ceph health mute OSD_DOWN 1h --sticky

  7. You can remove the mute by running the following command:

    Syntax

    ceph health unmute HEALTH_MESSAGE

    Example

    [ceph: root@host01 /]# ceph health unmute OSD_DOWN
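
If the mute commands are issued from a script, the arguments can be validated before the command is run. The following Python sketch builds the argument list for the `ceph health mute` syntax shown in this procedure and accepts only the DURATION units listed in step 4. The function names `duration_seconds` and `mute_command` are hypothetical, and restricting the input to an integer followed by one of those units is an assumption of this sketch; the ceph command itself might accept other forms.

```python
import re

# Unit suffixes listed in step 4, mapped to seconds.
UNIT_SECONDS = {"s": 1, "sec": 1, "m": 60, "min": 60, "h": 3600, "hour": 3600}
DURATION = re.compile(r"^(\d+)([a-z]+)$")


def duration_seconds(duration):
    """Convert a string such as "10m" to seconds, or raise ValueError."""
    match = DURATION.match(duration)
    if not match or match.group(2) not in UNIT_SECONDS:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def mute_command(code, duration=None, sticky=False):
    """Build the argument list for `ceph health mute`."""
    argv = ["ceph", "health", "mute", code]
    if duration is not None:
        duration_seconds(duration)  # validation only; ceph receives the original string
        argv.append(duration)
    if sticky:
        argv.append("--sticky")
    return argv
```

For example, `mute_command("OSD_DOWN", "1h", sticky=True)` produces the same arguments as the sticky example in step 6. The resulting list can be passed to `subprocess.run` inside the Cephadm shell.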

1.5. Understanding Ceph logs

Ceph stores its logs in the /var/log/ceph/CLUSTER_FSID/ directory after logging to files is enabled.

The CLUSTER_NAME.log is the main storage cluster log file that includes global events. By default, the log file name is ceph.log. Only the Ceph Monitor nodes include the main storage cluster log.

Each Ceph OSD and Monitor has its own log file, named CLUSTER_NAME-osd.NUMBER.log and CLUSTER_NAME-mon.HOSTNAME.log.
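
When a log directory contains many files, the naming scheme can be used to sort them by daemon. The following Python sketch classifies file names according to the three patterns described above. The function name `classify_log` is hypothetical, and the sketch assumes that the cluster name contains no hyphen or dot.

```python
import re

# Patterns follow the naming described above: CLUSTER_NAME.log,
# CLUSTER_NAME-osd.NUMBER.log, and CLUSTER_NAME-mon.HOSTNAME.log.
# Assumption: CLUSTER_NAME contains neither "-" nor ".".
OSD_LOG = re.compile(r"^(?P<cluster>[^-.]+)-osd\.(?P<id>\d+)\.log$")
MON_LOG = re.compile(r"^(?P<cluster>[^-.]+)-mon\.(?P<host>.+)\.log$")
CLUSTER_LOG = re.compile(r"^(?P<cluster>[^-.]+)\.log$")


def classify_log(file_name):
    """Return (kind, fields) for a log file name; kind is "other" if unknown."""
    for kind, pattern in (("osd", OSD_LOG), ("mon", MON_LOG), ("cluster", CLUSTER_LOG)):
        match = pattern.match(file_name)
        if match:
            return kind, match.groupdict()
    return "other", {}
```

With the default cluster name, `classify_log("ceph.log")` identifies the main storage cluster log, and `classify_log("ceph-osd.3.log")` identifies the log of the OSD with the number 3.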

When you increase the debugging level for Ceph subsystems, Ceph generates new log files for those subsystems as well.

1.6. Generating an sos report

You can run the sos report command to collect the configuration details, system information, and diagnostic information of a Red Hat Ceph Storage cluster from a Red Hat Enterprise Linux host. The Red Hat Support team uses this information for further troubleshooting of the storage cluster.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to the nodes.

Procedure

  1. Install the sos package:

    Example

    [root@host01 ~]# dnf install sos

  2. Run the sos report command to get the system information of the storage cluster:

    Example

    [root@host01 ~]# sos report -a --all-logs

    The report is saved in the /var/tmp directory.

    Run the following command for specific Ceph daemon information:

    Example

    [root@host01 ~]# sos report --all-logs -e ceph_mgr,ceph_common,ceph_mon,ceph_osd,ceph_ansible,ceph_mds,ceph_rgw
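
If the report is collected from a script, the plug-in list can be assembled programmatically. The following Python sketch builds the argument list for the daemon-specific command shown above. The function name `sos_report_command` is hypothetical; the `--all-logs` and `-e` options are the ones used in the example.

```python
def sos_report_command(plugins=(), all_logs=True):
    """Build the argument list for an `sos report` run."""
    argv = ["sos", "report"]
    if all_logs:
        argv.append("--all-logs")
    if plugins:
        # -e takes a comma-separated list of plug-ins to enable.
        argv += ["-e", ",".join(plugins)]
    return argv
```

For example, `sos_report_command(["ceph_mon", "ceph_osd"])` limits the additional plug-ins to the Monitor and OSD ones. The resulting list can be passed to `subprocess.run` as the root user on the host.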
