Chapter 8. Troubleshooting Ceph placement groups

This section contains information about fixing the most common errors related to the Ceph Placement Groups (PGs).

8.1. Prerequisites

  • Verify your network connection.
  • Ensure that Monitors are able to form a quorum.
  • Ensure that all healthy OSDs are up and in, and the backfilling and recovery processes are finished.

8.2. Most common Ceph placement groups errors

The following table lists the most common error messages that are returned by the ceph health detail command. The table provides links to corresponding sections that explain the errors and point to specific procedures to fix the problems.

In addition, you can list placement groups that are stuck in a state that is not optimal. See Section 8.3, “Listing placement groups stuck in stale, inactive, or unclean state” for details.

8.2.1. Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • A running Ceph Object Gateway.

8.2.2. Placement group error messages

A table of common placement group error messages, and a potential fix.

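| Health status | Error message | See |
|---------------|---------------|-----|
| HEALTH_ERR | pgs down | Placement groups are down |
| HEALTH_ERR | pgs inconsistent | Inconsistent placement groups |
| HEALTH_ERR | scrub errors | Inconsistent placement groups |
| HEALTH_WARN | pgs stale | Stale placement groups |
| HEALTH_WARN | unfound | Unfound objects |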

8.2.3. Stale placement groups

The ceph health command lists some Placement Groups (PGs) as stale:

HEALTH_WARN 24 pgs stale; 3/300 in osds are down

What This Means

The Monitor marks a placement group as stale when it does not receive any status update from the primary OSD of the placement group’s acting set, or when other OSDs report that the primary OSD is down.

Usually, PGs enter the stale state after you start the storage cluster and until the peering process completes. However, when the PGs remain stale for longer than expected, it might indicate that the primary OSD for those PGs is down or not reporting PG statistics to the Monitor. When the primary OSD storing stale PGs is back up, Ceph starts to recover the PGs.

The mon_osd_report_timeout setting determines how long the Monitors wait for a report from an OSD before they mark that OSD as down. In upstream Ceph, this parameter defaults to 900 seconds.

To Troubleshoot This Problem

  1. Identify which PGs are stale and on what OSDs they are stored. The error message includes information similar to the following example:

    Example

    [ceph: root@host01 /]# ceph health detail
    HEALTH_WARN 24 pgs stale; 3/300 in osds are down
    ...
    pg 2.5 is stuck stale+active+remapped, last acting [2,0]
    ...
    osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
    osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
    osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861

  2. Troubleshoot any problems with the OSDs that are marked as down. For details, see Down OSDs.
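The first step can be partly scripted. The following is a minimal sketch, not a supported tool, that filters ceph health detail output for stale PGs and down OSDs. It assumes the line formats shown in the example above. The function names stale_pgs and down_osds are hypothetical, and the sample text is copied from that example so that the sketch runs without a cluster.

```shell
# Print the ID of every PG whose reported state contains "stale".
# Reads `ceph health detail` output on standard input.
stale_pgs() {
    awk '$1 == "pg" && /stale/ { print $2 }'
}

# Print the name of every OSD that is reported as down.
down_osds() {
    awk '$1 ~ /^osd\.[0-9]+$/ && $2 == "is" && $3 == "down" { print $1 }'
}

# Sample input copied from the example above.
sample='HEALTH_WARN 24 pgs stale; 3/300 in osds are down
pg 2.5 is stuck stale+active+remapped, last acting [2,0]
osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861'

printf '%s\n' "$sample" | stale_pgs
printf '%s\n' "$sample" | down_osds
```

Inside the Cephadm shell, the same filters could be applied to live output, for example with ceph health detail | stale_pgs.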

8.2.4. Inconsistent placement groups

Some placement groups are marked as active+clean+inconsistent and the ceph health detail command returns an error message similar to the following one:

HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 0.6 is active+clean+inconsistent, acting [0,1,2]
2 scrub errors

What This Means

When Ceph detects inconsistencies in one or more replicas of an object in a placement group, it marks the placement group as inconsistent. The most common inconsistencies are:

  • Objects have an incorrect size.
  • Objects are missing from one replica after a recovery finished.

In most cases, errors during scrubbing cause inconsistency within placement groups.

To Troubleshoot This Problem

  1. Log in to the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. Determine which placement group is in the inconsistent state:

    [ceph: root@host01 /]# ceph health detail
    HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
    pg 0.6 is active+clean+inconsistent, acting [0,1,2]
    2 scrub errors

  3. Determine why the placement group is inconsistent.

    1. Start the deep scrubbing process on the placement group:

      Syntax

      ceph pg deep-scrub ID

      Replace ID with the ID of the inconsistent placement group, for example:

      [ceph: root@host01 /]# ceph pg deep-scrub 0.6
      instructing pg 0.6 on osd.0 to deep-scrub

    2. Search the output of the ceph -w command for any messages related to that placement group:

      Syntax

      ceph -w | grep ID

      Replace ID with the ID of the inconsistent placement group, for example:

      [ceph: root@host01 /]# ceph -w | grep 0.6
      2022-05-26 01:35:36.778215 osd.106 [ERR] 0.6 deep-scrub stat mismatch, got 636/635 objects, 0/0 clones, 0/0 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 1855455/1854371 bytes.
      2022-05-26 01:35:36.788334 osd.106 [ERR] 0.6 deep-scrub 1 errors

  4. If the output includes any error messages similar to the following ones, you can repair the inconsistent placement group. See Repairing inconsistent placement groups for details.

    Syntax

    PG.ID shard OSD: soid OBJECT missing attr , missing attr ATTRIBUTE_TYPE
    PG.ID shard OSD: soid OBJECT digest 0 != known digest DIGEST, size 0 != known size SIZE
    PG.ID shard OSD: soid OBJECT size 0 != known size SIZE
    PG.ID deep-scrub stat mismatch, got MISMATCH
    PG.ID shard OSD: soid OBJECT candidate had a read error, digest 0 != known digest DIGEST

  5. If the output includes any error messages similar to the following ones, it is not safe to repair the inconsistent placement group because you can lose data. Open a support ticket in this situation. See Contacting Red Hat support for details.

    PG.ID shard OSD: soid OBJECT digest DIGEST != known digest DIGEST
    PG.ID shard OSD: soid OBJECT omap_digest DIGEST != known omap_digest DIGEST
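The distinction made in steps 4 and 5 can be expressed as a simple pattern check. The following is a rough sketch only, assuming that log lines follow the templates listed above. The function name classify_scrub_error is hypothetical, and the sketch deliberately answers unsafe or unknown whenever a line does not clearly match the repairable list. It is not a substitute for reading the log or for contacting support.

```shell
# Classify one scrub error line from the cluster log.
# Prints "repairable", "unsafe", or "unknown".
classify_scrub_error() {
    line=$1
    # Any omap_digest mismatch is on the "do not repair" list.
    case $line in
        *"omap_digest "*" != known omap_digest "*) echo unsafe; return ;;
    esac
    # A digest mismatch is on the repairable list only when the shard digest is 0.
    case $line in
        *" digest "*" != known digest "*)
            case $line in
                *" digest 0 != known digest "*) ;;
                *) echo unsafe; return ;;
            esac ;;
    esac
    case $line in
        *"missing attr"*|*" digest 0 != known digest "*|*" size 0 != known size "*|*"stat mismatch"*)
            echo repairable ;;
        *)
            echo unknown ;;
    esac
}

# Log line copied from the ceph -w example above.
classify_scrub_error '0.6 deep-scrub stat mismatch, got 636/635 objects, 0/0 clones, 0/0 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 1855455/1854371 bytes.'
```

The unsafe checks run first so that a line that matches both lists is never reported as repairable.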

8.2.5. Unclean placement groups

The ceph health command returns an error message similar to the following one:

HEALTH_WARN 197 pgs stuck unclean

What This Means

Ceph marks a placement group as unclean if it has not achieved the active+clean state for the number of seconds specified in the mon_pg_stuck_threshold parameter in the Ceph configuration file. The default value of mon_pg_stuck_threshold is 300 seconds.

If a placement group is unclean, it contains objects that are not replicated the number of times specified in the osd_pool_default_size parameter. The default value of osd_pool_default_size is 3, which means that Ceph creates three replicas.

Usually, unclean placement groups indicate that some OSDs might be down.

To Troubleshoot This Problem

  1. Determine which OSDs are down:

    [ceph: root@host01 /]# ceph osd tree

  2. Troubleshoot and fix any problems with the OSDs. See Down OSDs for details.

8.2.6. Inactive placement groups

The ceph health command returns an error message similar to the following one:

HEALTH_WARN 197 pgs stuck inactive

What This Means

Ceph marks a placement group as inactive if it has not been active for the number of seconds specified in the mon_pg_stuck_threshold parameter in the Ceph configuration file. The default value of mon_pg_stuck_threshold is 300 seconds.

Usually, inactive placement groups indicate that some OSDs might be down.

To Troubleshoot This Problem

  1. Determine which OSDs are down:

    [ceph: root@host01 /]# ceph osd tree

  2. Troubleshoot and fix any problems with the OSDs. See Down OSDs for details.

8.2.7. Placement groups are down

The ceph health detail command reports that some placement groups are down:

HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
...
pg 0.5 is down+peering
pg 1.4 is down+peering
...
osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651

What This Means

In certain cases, the peering process can be blocked, which prevents a placement group from becoming active and usable. Usually, a failure of an OSD causes the peering failures.

To Troubleshoot This Problem

Determine what blocks the peering process:

Syntax

ceph pg ID query

Replace ID with the ID of the placement group that is down:

Example

[ceph: root@host01 /]# ceph pg 0.5 query

{ "state": "down+peering",
  ...
  "recovery_state": [
       { "name": "Started\/Primary\/Peering\/GetInfo",
         "enter_time": "2021-08-06 14:40:16.169679",
         "requested_info_from": []},
       { "name": "Started\/Primary\/Peering",
         "enter_time": "2021-08-06 14:40:16.169659",
         "probing_osds": [
               0,
               1],
         "blocked": "peering is blocked due to down osds",
         "down_osds_we_would_probe": [
               1],
         "peering_blocked_by": [
               { "osd": 1,
                 "current_lost_at": 0,
                 "comment": "starting or marking this osd lost may let us proceed"}]},
       { "name": "Started",
         "enter_time": "2021-08-06 14:40:16.169513"}
   ]
}

The recovery_state section includes information on why the peering process is blocked.
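When the query output is long, the blocking OSDs can be pulled out with a text filter. The following is a minimal sketch, assuming the line layout of the example above, in which each "osd" key under "peering_blocked_by" sits on its own line. The function name blocking_osds is hypothetical, and the sample is a trimmed copy of that example.

```shell
# Print the OSDs listed under "peering_blocked_by".
# Reads `ceph pg ID query` output on standard input.
blocking_osds() {
    awk '
        # Start collecting unless the array is empty and closed on the same line.
        /"peering_blocked_by"/ { if ($0 !~ /\]/) inside = 1; next }
        inside && match($0, /"osd": *[0-9]+/) {
            id = substr($0, RSTART, RLENGTH)
            gsub(/[^0-9]/, "", id)
            print "osd." id
        }
        # The first closing bracket ends the array.
        inside && /\]/ { inside = 0 }
    '
}

# Sample input trimmed from the example above.
sample='         "blocked": "peering is blocked due to down osds",
         "down_osds_we_would_probe": [
               1],
         "peering_blocked_by": [
               { "osd": 1,
                 "current_lost_at": 0,
                 "comment": "starting or marking this osd lost may let us proceed"}]},'

printf '%s\n' "$sample" | blocking_osds
```

A JSON-aware tool is a more robust choice where one is installed; this filter depends on the output being printed one key per line.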

Additional Resources

  • The Ceph OSD peering section in the Red Hat Ceph Storage Administration Guide.

8.2.8. Unfound objects

The ceph health command returns an error message similar to the following one, containing the unfound keyword:

HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)

What This Means

Ceph marks objects as unfound when it knows these objects or their newer copies exist but it is unable to find them. As a consequence, Ceph cannot recover such objects and proceed with the recovery process.

An Example Situation

A placement group stores data on osd.1 and osd.2.

  1. osd.1 goes down.
  2. osd.2 handles some write operations.
  3. osd.1 comes up.
  4. A peering process between osd.1 and osd.2 starts, and the objects missing on osd.1 are queued for recovery.
  5. Before Ceph copies new objects, osd.2 goes down.

As a result, osd.1 knows that these objects exist, but there is no OSD that has a copy of the objects.

In this scenario, Ceph is waiting for the failed node to be accessible again, and the unfound objects block the recovery process.

To Troubleshoot This Problem

  1. Log in to the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. Determine which placement group contains unfound objects:

    [ceph: root@host01 /]# ceph health detail
    HEALTH_WARN 1 pgs recovering; 1 pgs stuck unclean; recovery 5/937611 objects degraded (0.001%); 1/312537 unfound (0.000%)
    pg 3.8a5 is stuck unclean for 803946.712780, current state active+recovering, last acting [320,248,0]
    pg 3.8a5 is active+recovering, acting [320,248,0], 1 unfound
    recovery 5/937611 objects degraded (0.001%); 1/312537 unfound (0.000%)

  3. List more information about the placement group:

    Syntax

    ceph pg ID query

    Replace ID with the ID of the placement group containing the unfound objects:

    Example

    [ceph: root@host01 /]# ceph pg 3.8a5 query
    { "state": "active+recovering",
      "epoch": 10741,
      "up": [
            320,
            248,
            0],
      "acting": [
            320,
            248,
            0],
    <snip>
      "recovery_state": [
            { "name": "Started\/Primary\/Active",
              "enter_time": "2021-08-28 19:30:12.058136",
              "might_have_unfound": [
                    { "osd": "0",
                      "status": "already probed"},
                    { "osd": "248",
                      "status": "already probed"},
                    { "osd": "301",
                      "status": "already probed"},
                    { "osd": "362",
                      "status": "already probed"},
                    { "osd": "395",
                      "status": "already probed"},
                    { "osd": "429",
                      "status": "osd is down"}],
              "recovery_progress": { "backfill_targets": [],
                  "waiting_on_backfill": [],
                  "last_backfill_started": "0\/\/0\/\/-1",
                  "backfill_info": { "begin": "0\/\/0\/\/-1",
                      "end": "0\/\/0\/\/-1",
                      "objects": []},
                  "peer_backfill_info": [],
                  "backfills_in_flight": [],
                  "recovering": [],
                  "pg_backend": { "pull_from_peer": [],
                      "pushing": []}},
              "scrub": { "scrubber.epoch_start": "0",
                  "scrubber.active": 0,
                  "scrubber.block_writes": 0,
                  "scrubber.finalizing": 0,
                  "scrubber.waiting_on": 0,
                  "scrubber.waiting_on_whom": []}},
            { "name": "Started",
              "enter_time": "2021-08-28 19:30:11.044020"}],

    The might_have_unfound section includes OSDs where Ceph tried to locate the unfound objects:

    • The already probed status indicates that Ceph cannot locate the unfound objects in that OSD.
    • The osd is down status indicates that Ceph cannot contact that OSD.

  4. Troubleshoot the OSDs that are marked as down. See Down OSDs for details.
  5. If you are unable to fix the problem that causes the OSD to be down, open a support ticket. See Contacting Red Hat Support for service for details.
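To see at a glance which OSDs in might_have_unfound still need attention, the query output can be filtered for entries with the osd is down status. The following is a minimal sketch, assuming the layout of the example above, where each "osd" key is followed by its "status" key. The function name unprobed_down_osds is hypothetical, and the sample is a trimmed copy of that example.

```shell
# Print every OSD in "might_have_unfound" whose status is "osd is down".
# Reads `ceph pg ID query` output on standard input.
unprobed_down_osds() {
    awk '
        # Remember the most recent OSD ID (quoted or unquoted).
        match($0, /"osd": *"?[0-9]+"?/) {
            id = substr($0, RSTART, RLENGTH)
            gsub(/[^0-9]/, "", id)
        }
        /"status": *"osd is down"/ && id != "" { print "osd." id }
    '
}

# Sample input trimmed from the example above.
sample='              "might_have_unfound": [
                    { "osd": "0",
                      "status": "already probed"},
                    { "osd": "248",
                      "status": "already probed"},
                    { "osd": "429",
                      "status": "osd is down"}],'

printf '%s\n' "$sample" | unprobed_down_osds
```

Each OSD printed this way is a candidate for the Down OSDs procedure referenced in step 4.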

8.3. Listing placement groups stuck in stale, inactive, or unclean state

After a failure, placement groups enter states like degraded or peering. These states indicate normal progression through the failure recovery process.

However, if a placement group stays in one of these states for a longer time than expected, it can be an indication of a larger problem. The Monitors report when placement groups get stuck in a state that is not optimal.

The mon_pg_stuck_threshold option in the Ceph configuration file determines the number of seconds after which placement groups are considered inactive, unclean, or stale.

The following table lists these states together with a short explanation.

| State | What it means | Most common causes | See |
|-------|---------------|--------------------|-----|
| inactive | The PG has not been able to service read/write requests. | Peering problems | Inactive placement groups |
| unclean | The PG contains objects that are not replicated the desired number of times. Something is preventing the PG from recovering. | Unfound objects; OSDs are down; incorrect configuration | Unclean placement groups |
| stale | The status of the PG has not been updated by a ceph-osd daemon. | OSDs are down | Stale placement groups |

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to the node.

Procedure

  1. Log in to the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. List the stuck PGs:

    Example

    [ceph: root@host01 /]# ceph pg dump_stuck inactive
    [ceph: root@host01 /]# ceph pg dump_stuck unclean
    [ceph: root@host01 /]# ceph pg dump_stuck stale
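The three commands can be wrapped in a loop. The following is a minimal sketch; the function name list_stuck_pgs and the CEPH_CMD variable are hypothetical. CEPH_CMD exists only so that the sketch can be dry-run without a cluster, and it defaults to ceph when unset.

```shell
# Run `ceph pg dump_stuck` once for each stuck state.
# CEPH_CMD is a hypothetical override for dry runs; it defaults to `ceph`.
list_stuck_pgs() {
    for state in inactive unclean stale; do
        echo "== $state =="
        # Left unquoted on purpose so that a value such as "echo ceph" splits into words.
        ${CEPH_CMD:-ceph} pg dump_stuck "$state"
    done
}

# Dry run: print the commands instead of contacting a cluster.
CEPH_CMD='echo ceph'
list_stuck_pgs
```

Inside the Cephadm shell, leave CEPH_CMD unset and call list_stuck_pgs directly.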

8.4. Listing placement group inconsistencies

Use the rados utility to list inconsistencies in various replicas of objects. Use the --format=json-pretty option to produce more detailed output.

This section covers the listing of:

  • Inconsistent placement groups in a pool
  • Inconsistent objects in a placement group
  • Inconsistent snapshot sets in a placement group

Prerequisites

  • A running Red Hat Ceph Storage cluster in a healthy state.
  • Root-level access to the node.

Procedure

  • List all the inconsistent placement groups in a pool:

    Syntax

    rados list-inconsistent-pg POOL --format=json-pretty

    Example

    [ceph: root@host01 /]# rados list-inconsistent-pg data --format=json-pretty
    [0.6]

  • List inconsistent objects in a placement group with ID:

    Syntax

    rados list-inconsistent-obj PLACEMENT_GROUP_ID

    Example

    [ceph: root@host01 /]# rados list-inconsistent-obj 0.6
    {
        "epoch": 14,
        "inconsistents": [
            {
                "object": {
                    "name": "image1",
                    "nspace": "",
                    "locator": "",
                    "snap": "head",
                    "version": 1
                },
                "errors": [
                    "data_digest_mismatch",
                    "size_mismatch"
                ],
                "union_shard_errors": [
                    "data_digest_mismatch_oi",
                    "size_mismatch_oi"
                ],
                "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
                "shards": [
                    {
                        "osd": 0,
                        "errors": [],
                        "size": 968,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0xe978e67f"
                    },
                    {
                        "osd": 1,
                        "errors": [],
                        "size": 968,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0xe978e67f"
                    },
                    {
                        "osd": 2,
                        "errors": [
                            "data_digest_mismatch_oi",
                            "size_mismatch_oi"
                        ],
                        "size": 0,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0xffffffff"
                    }
                ]
            }
        ]
    }

    The following fields are important to determine what causes the inconsistency:

    • name: The name of the object with inconsistent replicas.
    • nspace: The namespace that is a logical separation of a pool. It’s empty by default.
    • locator: The key that is used as the alternative of the object name for placement.
    • snap: The snapshot ID of the object. The only writable version of the object is called head. If an object is a clone, this field includes its sequential ID.
    • version: The version ID of the object with inconsistent replicas. Each write operation to an object increments it.
    • errors: A list of errors that indicate inconsistencies between shards without determining which shard or shards are incorrect. See the shard array to further investigate the errors.

      • data_digest_mismatch: The digest of the replica read from one OSD is different from the other OSDs.
      • size_mismatch: The size of a clone or the head object does not match the expectation.
      • read_error: This error indicates inconsistencies caused most likely by disk errors.
    • union_shard_errors: The union of all errors specific to shards. These errors are connected to a faulty shard. The errors that end with oi indicate that you have to compare the information from the faulty object with the selected object information. See the shard array to further investigate the errors.

      In the above example, the object replica stored on osd.2 has a different digest than the replicas stored on osd.0 and osd.1. Specifically, the digest calculated from the shard read from osd.2 is 0xffffffff, whereas the expected digest is 0xe978e67f. In addition, the size of the replica read from osd.2 is 0, while the size reported by osd.0 and osd.1 is 968.

  • List inconsistent sets of snapshots:

    Syntax

    rados list-inconsistent-snapset PLACEMENT_GROUP_ID

    Example

    [ceph: root@host01 /]# rados list-inconsistent-snapset 0.23 --format=json-pretty
    {
        "epoch": 64,
        "inconsistents": [
            {
                "name": "obj5",
                "nspace": "",
                "locator": "",
                "snap": "0x00000001",
                "headless": true
            },
            {
                "name": "obj5",
                "nspace": "",
                "locator": "",
                "snap": "0x00000002",
                "headless": true
            },
            {
                "name": "obj5",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "ss_attr_missing": true,
                "extra_clones": true,
                "extra clones": [
                    2,
                    1
                ]
            }
        ]
    }

    The command returns the following errors:

    • ss_attr_missing: One or more attributes are missing. Attributes are information about snapshots encoded into a snapshot set as a list of key-value pairs.
    • ss_attr_corrupted: One or more attributes fail to decode.
    • clone_missing: A clone is missing.
    • snapset_mismatch: The snapshot set is inconsistent by itself.
    • head_mismatch: The snapshot set indicates that head exists or not, but the scrub results report otherwise.
    • headless: The head of the snapshot set is missing.
    • size_mismatch: The size of a clone or the head object does not match the expectation.
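For the rados list-inconsistent-obj output shown earlier in this procedure, the faulty shards are the entries with a non-empty errors array. The following is a minimal sketch that prints those OSDs, assuming the pretty-printed layout of that example with one key per line. The function name faulty_shards is hypothetical, and the sample is a trimmed copy of the shards array from the example.

```shell
# Print each OSD whose shard entry has a non-empty "errors" array.
# Reads pretty-printed `rados list-inconsistent-obj` output on standard input.
faulty_shards() {
    awk '
        # Remember the OSD ID of the shard entry being read.
        match($0, /"osd": *[0-9]+/) {
            id = substr($0, RSTART, RLENGTH)
            gsub(/[^0-9]/, "", id)
            next
        }
        # "errors": [] is an empty array; anything else opens a list of errors.
        /"errors": *\[/ {
            if (id != "" && $0 !~ /\[\]/) print "osd." id
            id = ""
        }
    '
}

# Sample input trimmed from the list-inconsistent-obj example.
sample='                "errors": [
                    "data_digest_mismatch",
                    "size_mismatch"
                ],
                "shards": [
                    {
                        "osd": 0,
                        "errors": [],
                        "size": 968
                    },
                    {
                        "osd": 2,
                        "errors": [
                            "data_digest_mismatch_oi",
                            "size_mismatch_oi"
                        ],
                        "size": 0
                    }
                ]'

printf '%s\n' "$sample" | faulty_shards
```

The object-level errors array is skipped because no OSD ID has been read at that point. A JSON-aware tool is preferable where one is available.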

8.5. Repairing inconsistent placement groups

Due to an error during deep scrubbing, some placement groups can include inconsistencies. Ceph reports such placement groups as inconsistent:

HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 0.6 is active+clean+inconsistent, acting [0,1,2]
2 scrub errors

Warning

You can repair only certain inconsistencies.

Do not repair the placement groups if the Ceph logs include the following errors:

PG.ID shard OSD: soid OBJECT digest DIGEST != known digest DIGEST
PG.ID shard OSD: soid OBJECT omap_digest DIGEST != known omap_digest DIGEST

Open a support ticket instead. See Contacting Red Hat Support for service for details.

Prerequisites

  • Root-level access to the Ceph Monitor node.

Procedure

  • Repair the inconsistent placement groups:

    Syntax

    ceph pg repair ID

    Replace ID with the ID of the inconsistent placement group.

8.6. Increasing the placement group count

An insufficient placement group (PG) count impacts the performance of the Ceph cluster and data distribution. It is one of the main causes of the nearfull osds error messages.

The recommended ratio is between 100 and 300 PGs per OSD. This ratio can decrease when you add more OSDs to the cluster.
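As a quick check against this ratio, the PG load of a pool can be estimated with integer arithmetic. The following is a minimal sketch under one assumption that the text above does not state: that every replica of a PG counts toward the per-OSD figure, so the estimate is pg_num multiplied by the replica count and divided by the OSD count. The function name pgs_per_osd and the numbers in the example are made up for illustration.

```shell
# Estimate the number of PG replicas per OSD for a single pool:
#   pg_num * replica_count / osd_count   (integer division)
pgs_per_osd() {
    echo $(( $1 * $2 / $3 ))
}

# Illustrative numbers only: 1024 PGs, 3 replicas, 30 OSDs.
pgs_per_osd 1024 3 30    # prints 102
```

For a cluster with several pools, add up the per-pool results before comparing the total with the recommended range.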

The pg_num and pgp_num parameters determine the PG count. These parameters are configured per pool, and therefore, you must adjust each pool with a low PG count separately.

Important

Increasing the PG count is the most intensive process that you can perform on a Ceph cluster. This process might have a serious performance impact if not done in a slow and methodical way. Once you increase pgp_num, you cannot stop or reverse the process and you must complete it. Consider increasing the PG count outside of business-critical processing hours, and alert all clients about the potential performance impact. Do not change the PG count if the cluster is in the HEALTH_ERR state.

Prerequisites

  • A running Red Hat Ceph Storage cluster in a healthy state.
  • Root-level access to the node.

Procedure

  1. Reduce the impact of data redistribution and recovery on individual OSDs and OSD hosts:

    1. Lower the value of the osd_max_backfills, osd_recovery_max_active, and osd_recovery_op_priority parameters:

      [ceph: root@host01 /]# ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'

    2. Disable the shallow and deep scrubbing:

      [ceph: root@host01 /]# ceph osd set noscrub
      [ceph: root@host01 /]# ceph osd set nodeep-scrub

  2. Use the Ceph Placement Groups (PGs) per Pool Calculator to calculate the optimal value of the pg_num and pgp_num parameters.
  3. Increase the pg_num value in small increments until you reach the desired value.

    1. Determine the starting increment value. Use a very low value that is a power of two, and increase it when you determine the impact on the cluster. The optimal value depends on the pool size, OSD count, and client I/O load.
    2. Increment the pg_num value:

      Syntax

      ceph osd pool set POOL pg_num VALUE

      Specify the pool name and the new value, for example:

      Example

      [ceph: root@host01 /]# ceph osd pool set data pg_num 4

    3. Monitor the status of the cluster:

      Example

      [ceph: root@host01 /]# ceph -s

      The PG states change from creating to active+clean. Wait until all PGs are in the active+clean state.

  4. Increase the pgp_num value in small increments until you reach the desired value:

    1. Determine the starting increment value. Use a very low value that is a power of two, and increase it when you determine the impact on the cluster. The optimal value depends on the pool size, OSD count, and client I/O load.
    2. Increment the pgp_num value:

      Syntax

      ceph osd pool set POOL pgp_num VALUE

      Specify the pool name and the new value, for example:

      [ceph: root@host01 /]# ceph osd pool set data pgp_num 4

    3. Monitor the status of the cluster:

      [ceph: root@host01 /]# ceph -s

      The PG states change through peering, wait_backfill, backfilling, recovering, and others. Wait until all PGs are in the active+clean state.

  5. Repeat the previous steps for all pools with insufficient PG count.
  6. Set osd_max_backfills, osd_recovery_max_active, and osd_recovery_op_priority to their default values:

    [ceph: root@host01 /]# ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 3 --osd_recovery_op_priority 3'

  7. Enable the shallow and deep scrubbing:

    [ceph: root@host01 /]# ceph osd unset noscrub
    [ceph: root@host01 /]# ceph osd unset nodeep-scrub
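The incremental increases in steps 3 and 4 can be planned ahead of time. The following is a minimal sketch that only prints the sequence of intermediate values from a current count to a target count; it does not change the cluster. The function name pg_steps and the numbers in the example are hypothetical.

```shell
# Print the intermediate values for raising pg_num (or pgp_num)
# from CURRENT to TARGET in increments of STEP.
# Usage: pg_steps CURRENT TARGET STEP
pg_steps() {
    current=$1
    target=$2
    step=$3
    # Refuse a non-positive step, which would loop forever.
    [ "$step" -gt 0 ] || return 1
    while [ "$current" -lt "$target" ]; do
        current=$(( $current + $step ))
        # Never overshoot the target on the last increment.
        if [ "$current" -gt "$target" ]; then
            current=$target
        fi
        echo "$current"
    done
}

# Illustrative numbers only: from 128 to 256 in increments of 32.
pg_steps 128 256 32
```

Each printed value could then be passed to ceph osd pool set POOL pg_num VALUE, waiting for all PGs to reach active+clean before applying the next one, and the same sequence repeated for pgp_num.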

8.7. Additional Resources