Chapter 7. Troubleshooting Placement Groups

This chapter describes how to fix the most common errors related to Ceph placement groups (PGs).

7.2. Listing Placement Groups in stale, inactive, or unclean State

After a failure, placement groups enter states like degraded or peering. These states indicate normal progression through the failure recovery process.

However, if a placement group stays in one of these states longer than expected, it can indicate a larger problem. The Monitors report when placement groups get stuck in a state that is not optimal.

The following table lists these states together with a short explanation.

State | What it means | Most common causes | See
inactive | The PG has not been able to service read/write requests. | Peering problems | Section 7.1.4, “Inactive Placement Groups”
unclean | The PG contains objects that are not replicated the desired number of times. Something is preventing the PG from recovering. | unfound objects; OSDs are down; Incorrect configuration | Section 7.1.3, “Unclean Placement Groups”
stale | The status of the PG has not been updated by a ceph-osd daemon. | OSDs are down | Section 7.1.1, “Stale Placement Groups”

The mon_pg_stuck_threshold parameter in the Ceph configuration file determines the number of seconds after which placement groups are considered inactive, unclean, or stale.

List the stuck PGs:

# ceph pg dump_stuck inactive
# ceph pg dump_stuck unclean
# ceph pg dump_stuck stale
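
The three commands above can also be run in one pass. The following is a minimal Python sketch, not part of the Ceph tooling: the helper names run_ceph and dump_stuck_all are made up for illustration, and the sketch assumes the ceph command is on the PATH of a node with admin credentials. It returns the raw command output without interpreting it.

```python
import subprocess

# The three stuck states discussed in this section.
STUCK_STATES = ("inactive", "unclean", "stale")


def run_ceph(args):
    """Run `ceph <args>` and return its standard output as text."""
    result = subprocess.run(
        ["ceph", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def dump_stuck_all(runner=run_ceph):
    """Return a mapping of state -> raw output of `ceph pg dump_stuck <state>`.

    `runner` is injectable so the function can be exercised without a cluster.
    """
    return {state: runner(["pg", "dump_stuck", state]) for state in STUCK_STATES}

# Usage on a cluster node (not executed here):
#   for state, output in dump_stuck_all().items():
#       print(state, output, sep="\n")
```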

7.3. Listing Inconsistencies

Use the rados utility to list inconsistencies between the replicas of an object. Use the --format=json-pretty option to produce more detailed output.

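You can list inconsistent placement groups in a pool, inconsistent objects in a placement group, and inconsistent snapshot sets in a placement group.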

Listing Inconsistent Placement Groups in a Pool

rados list-inconsistent-pg <pool> --format=json-pretty

For example, list all inconsistent placement groups in a pool named data:

# rados list-inconsistent-pg data --format=json-pretty
[0.6]

Listing Inconsistent Objects in a Placement Group

rados list-inconsistent-obj <placement-group-id>

For example, list inconsistent objects in a placement group with ID 0.6:

# rados list-inconsistent-obj 0.6 --format=json-pretty
{
    "epoch": 14,
    "inconsistents": [
        {
            "object": {
                "name": "image1",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 1
            },
            "errors": [
                "data_digest_mismatch",
                "size_mismatch"
            ],
            "union_shard_errors": [
                "data_digest_mismatch_oi",
                "size_mismatch_oi"
            ],
            "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
            "shards": [
                {
                    "osd": 0,
                    "errors": [],
                    "size": 968,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xe978e67f"
                },
                {
                    "osd": 1,
                    "errors": [],
                    "size": 968,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xe978e67f"
                },
                {
                    "osd": 2,
                    "errors": [
                        "data_digest_mismatch_oi",
                        "size_mismatch_oi"
                    ],
                    "size": 0,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xffffffff"
                }
            ]
        }
    ]
}

The following fields are important to determine what causes the inconsistency:

  • name: The name of the object with inconsistent replicas.
  • nspace: The namespace that is a logical separation of a pool. It’s empty by default.
  • locator: The key that is used instead of the object name for placement.
  • snap: The snapshot ID of the object. The only writable version of the object is called head. If an object is a clone, this field includes its sequential ID.
  • version: The version ID of the object with inconsistent replicas. Each write operation to an object increments it.
  • errors: A list of errors that indicate inconsistencies between shards without determining which shard or shards are incorrect. See the shards array to further investigate the errors.

    • data_digest_mismatch: The digest of the replica read from one OSD is different from the other OSDs.
    • size_mismatch: The size of a clone or the head object does not match the expectation.
    • read_error: This error indicates inconsistencies caused most likely by disk errors.
  • union_shard_errors: The union of all errors specific to shards. These errors are connected to a faulty shard. The errors that end with oi indicate that you have to compare the information from the faulty shard with the information in the selected_object_info field. See the shards array to further investigate the errors.

    In the above example, the object replica stored on osd.2 has a different digest than the replicas stored on osd.0 and osd.1. Specifically, the digest calculated from the shard read from osd.2 is 0xffffffff, not the expected 0xe978e67f. In addition, the size of the replica read from osd.2 is 0, while the size reported by osd.0 and osd.1 is 968.
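
The comparison described above can be automated. The following Python sketch assumes the JSON layout shown in the example output (an inconsistents list whose entries carry an object and a shards array); the function name faulty_shards is made up for illustration, and the sample is a trimmed copy of that example.

```python
import json


def faulty_shards(report):
    """Return (object name, osd id, shard errors, differing fields) per faulty shard.

    A field is reported as differing when its value on the faulty shard
    does not appear on any shard with an empty "errors" list.
    """
    findings = []
    for item in report.get("inconsistents", []):
        name = item["object"]["name"]
        shards = item.get("shards", [])
        clean = [s for s in shards if not s["errors"]]
        for shard in shards:
            if not shard["errors"]:
                continue
            diffs = {}
            for field in ("size", "data_digest", "omap_digest"):
                clean_values = {c.get(field) for c in clean}
                if clean and shard.get(field) not in clean_values:
                    diffs[field] = (shard.get(field), sorted(clean_values, key=str))
            findings.append((name, shard["osd"], shard["errors"], diffs))
    return findings


# Trimmed copy of the example output for placement group 0.6.
SAMPLE = json.loads("""
{"epoch": 14, "inconsistents": [{
  "object": {"name": "image1", "nspace": "", "locator": "", "snap": "head", "version": 1},
  "errors": ["data_digest_mismatch", "size_mismatch"],
  "shards": [
    {"osd": 0, "errors": [], "size": 968,
     "omap_digest": "0xffffffff", "data_digest": "0xe978e67f"},
    {"osd": 1, "errors": [], "size": 968,
     "omap_digest": "0xffffffff", "data_digest": "0xe978e67f"},
    {"osd": 2, "errors": ["data_digest_mismatch_oi", "size_mismatch_oi"], "size": 0,
     "omap_digest": "0xffffffff", "data_digest": "0xffffffff"}
  ]}]}
""")

for name, osd, errors, diffs in faulty_shards(SAMPLE):
    print(f"{name}: osd.{osd} reports {', '.join(errors)}")
    for field, (bad, good) in diffs.items():
        print(f"  {field}: {bad} (clean shards: {good})")
```

For the sample, only osd.2 is reported, with differences in size and data_digest; omap_digest is identical on all three shards and is therefore not listed.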

Listing Inconsistent Snapshot Sets in a Placement Group

rados list-inconsistent-snapset <placement-group-id>

For example, list inconsistent sets of snapshots (snapsets) in a placement group with ID 0.23:

# rados list-inconsistent-snapset 0.23 --format=json-pretty
{
    "epoch": 64,
    "inconsistents": [
        {
            "name": "obj5",
            "nspace": "",
            "locator": "",
            "snap": "0x00000001",
            "headless": true
        },
        {
            "name": "obj5",
            "nspace": "",
            "locator": "",
            "snap": "0x00000002",
            "headless": true
        },
        {
            "name": "obj5",
            "nspace": "",
            "locator": "",
            "snap": "head",
            "ss_attr_missing": true,
            "extra_clones": true,
            "extra clones": [
                2,
                1
            ]
        }
    ]
}

The command returns the following errors:

  • ss_attr_missing: One or more attributes are missing. Attributes are information about snapshots encoded into a snapshot set as a list of key-value pairs.
  • ss_attr_corrupted: One or more attributes fail to decode.
  • clone_missing: A clone is missing.
  • snapset_mismatch: The snapshot set is inconsistent by itself.
  • head_mismatch: The snapshot set indicates whether head exists, but the scrub results report otherwise.
  • headless: The head of the snapshot set is missing.
  • size_mismatch: The size of a clone or the head object does not match the expectation.
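
To get a quick overview of a larger report, the error flags can be collected per entry. The following Python sketch assumes the layout shown in the example output above, where each error appears as a key set to true; the function name snapset_errors is made up for illustration, and the sample mirrors that example.

```python
def snapset_errors(report):
    """Return (name, snap, sorted list of flags set to true) for each entry."""
    summary = []
    for entry in report.get("inconsistents", []):
        # Only boolean true values are treated as error flags; lists such as
        # "extra clones" and string fields are ignored.
        flags = sorted(key for key, value in entry.items() if value is True)
        summary.append((entry["name"], entry["snap"], flags))
    return summary


# Mirrors the example output for placement group 0.23.
SAMPLE = {
    "epoch": 64,
    "inconsistents": [
        {"name": "obj5", "nspace": "", "locator": "",
         "snap": "0x00000001", "headless": True},
        {"name": "obj5", "nspace": "", "locator": "",
         "snap": "0x00000002", "headless": True},
        {"name": "obj5", "nspace": "", "locator": "", "snap": "head",
         "ss_attr_missing": True, "extra_clones": True, "extra clones": [2, 1]},
    ],
}

for name, snap, flags in snapset_errors(SAMPLE):
    print(f"{name}@{snap}: {', '.join(flags)}")
```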

7.4. Repairing Inconsistent Placement Groups

Due to an error during deep scrubbing, some placement groups can include inconsistencies. Ceph reports such placement groups as inconsistent:

HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 0.6 is active+clean+inconsistent, acting [0,1,2]
2 scrub errors

Warning

You can repair only certain inconsistencies. Do not repair the placement groups if the Ceph logs include the following errors:

<pg.id> shard <osd>: soid <object> digest <digest> != known digest <digest>
<pg.id> shard <osd>: soid <object> omap_digest <digest> != known omap_digest <digest>

Open a support ticket instead. See Chapter 9, Contacting Red Hat Support Service for details.

Repair the inconsistent placement groups:

ceph pg repair <id>

Replace <id> with the ID of the inconsistent placement group.
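
Before running the repair command, the log check from the warning above can be scripted. The following Python sketch builds a regular expression directly from the two message templates in the warning; the exact spacing and field contents in real logs are an assumption, so treat a negative result as a hint, not proof. The function name blocking_lines is made up for illustration.

```python
import re

# Derived from the two templates in the warning:
#   <pg.id> shard <osd>: soid <object> digest <digest> != known digest <digest>
#   <pg.id> shard <osd>: soid <object> omap_digest <digest> != known omap_digest <digest>
# The backreference \1 requires the same digest type on both sides.
UNREPAIRABLE = re.compile(
    r"shard \S+: soid .+? (omap_digest|digest) \S+ != known \1 \S+"
)


def blocking_lines(log_lines):
    """Return the log lines that match either digest-mismatch template."""
    return [line.rstrip("\n") for line in log_lines if UNREPAIRABLE.search(line)]

# Usage (not executed here; the log path depends on your deployment):
#   with open(path_to_ceph_log) as handle:
#       if blocking_lines(handle):
#           print("Do not repair; open a support ticket instead.")
```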

7.5. Increasing the PG Count

An insufficient placement group (PG) count impacts the performance of the Ceph cluster and its data distribution. It is one of the main causes of the nearfull osds error messages.

The recommended ratio is between 100 and 300 PGs per OSD. This ratio can decrease when you add more OSDs to the cluster.
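
As a rough check against this range, the current ratio can be estimated from the pool settings. The following Python sketch assumes that every replica of a PG counts toward the OSD that stores it, so the ratio is the sum of pg_num times the replica count over all pools, divided by the number of OSDs. The function names and the example cluster are hypothetical.

```python
def pgs_per_osd(pools, osd_count):
    """Estimate the average number of PG replicas per OSD.

    pools: iterable of (pg_num, replica_count) tuples, one per pool.
    """
    if osd_count <= 0:
        raise ValueError("osd_count must be positive")
    total_replicas = sum(pg_num * size for pg_num, size in pools)
    return total_replicas / osd_count


def within_recommended(ratio, low=100, high=300):
    """Check the ratio against the 100 to 300 range given above."""
    return low <= ratio <= high


# Hypothetical cluster: two 3-replica pools (256 and 128 PGs) on 10 OSDs.
example_pools = [(256, 3), (128, 3)]
ratio = pgs_per_osd(example_pools, 10)
print(f"{ratio:.1f} PGs per OSD, within range: {within_recommended(ratio)}")
```

For the hypothetical cluster this gives (256 + 128) × 3 / 10 = 115.2 PGs per OSD, which is inside the recommended range.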

The pg_num and pgp_num parameters determine the PG count. These parameters are configured per pool, so you must adjust each pool with a low PG count separately.

Important

Increasing the PG count is the most intensive process that you can perform on a Ceph cluster. This process might have a serious performance impact if not done in a slow and methodical way. Once you increase pgp_num, you cannot stop or reverse the process; you must complete it.

Consider increasing the PG count outside business-critical processing hours, and alert all clients about the potential performance impact.

Do not change the PG count if the cluster is in the HEALTH_ERR state.

Procedure: Increasing the PG Count

  1. Reduce the impact of data redistribution and recovery on individual OSDs and OSD hosts:

    1. Lower the value of the osd_max_backfills, osd_recovery_max_active, and osd_recovery_op_priority parameters:

      # ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'
    2. Disable the shallow and deep scrubbing:

      # ceph osd set noscrub
      # ceph osd set nodeep-scrub
  2. Use the Ceph Placement Groups (PGs) per Pool Calculator to calculate the optimal value of the pg_num and pgp_num parameters.
  3. Increase the pg_num value in small increments until you reach the desired value.

    1. Determine the starting increment value. Use a very low value that is a power of two, and increase it when you determine the impact on the cluster. The optimal value depends on the pool size, OSD count, and client I/O load.
    2. Increment the pg_num value:

      ceph osd pool set <pool> pg_num <value>

      Specify the pool name and the new value, for example:

      # ceph osd pool set data pg_num 4
    3. Monitor the status of the cluster:

      # ceph -s

      The state of the PGs changes from creating to active+clean. Wait until all PGs are in the active+clean state.

  4. Increase the pgp_num value in small increments until you reach the desired value:

    1. Determine the starting increment value. Use a very low value that is a power of two, and increase it when you determine the impact on the cluster. The optimal value depends on the pool size, OSD count, and client I/O load.
    2. Increment the pgp_num value:

      ceph osd pool set <pool> pgp_num <value>

      Specify the pool name and the new value, for example:

      # ceph osd pool set data pgp_num 4
    3. Monitor the status of the cluster:

      # ceph -s

      The state of the PGs changes through peering, wait_backfill, backfilling, recovering, and others. Wait until all PGs are in the active+clean state.

  5. Repeat the previous steps for all pools with insufficient PG count.
  6. Set osd_max_backfills, osd_recovery_max_active, and osd_recovery_op_priority to their default values:

    # ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 3 --osd_recovery_op_priority 3'
  7. Enable the shallow and deep scrubbing:

    # ceph osd unset noscrub
    # ceph osd unset nodeep-scrub
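
The incremental increases in steps 3 and 4 can be planned ahead of time. The following Python sketch only computes the sequence of values and prints the matching commands; it does not contact the cluster. The function names, the pool name data, and the numbers 64, 128, and 16 are hypothetical, and a fixed increment is assumed even though the procedure lets you raise it once you know the impact.

```python
def pg_steps(current, target, increment):
    """Return the successive values to set when moving from current to target."""
    if increment <= 0:
        raise ValueError("increment must be positive")
    if target < current:
        raise ValueError("target must not be lower than current")
    steps = []
    value = current
    while value < target:
        # Never overshoot the target on the last step.
        value = min(value + increment, target)
        steps.append(value)
    return steps


def set_commands(pool, parameter, steps):
    """Render one `ceph osd pool set` command per step."""
    return [f"ceph osd pool set {pool} {parameter} {value}" for value in steps]


# Hypothetical pool "data" growing from 64 to 128 PGs in increments of 16.
steps = pg_steps(64, 128, 16)
for parameter in ("pg_num", "pgp_num"):
    for command in set_commands("data", parameter, steps):
        print(command)
```

As in the procedure, run each printed command on its own and wait until ceph -s shows all PGs in the active+clean state before running the next one.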
