Why does a degraded Ceph cluster (on Firefly) stop recovering and get stuck with degraded PGs after an OSD goes down?

Solution In Progress

Issue

  • Why does a degraded Ceph cluster (on Firefly) stop recovering and get stuck with degraded PGs after an OSD goes down?

  • After removing a failed OSD on a three-node Ceph cluster, data movement/rebalancing started between the remaining OSDs but then stalled, leaving the Ceph cluster stuck with degraded PGs.
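
  • For reference, the commonly documented Firefly-era sequence for removing a failed OSD looks roughly like the sketch below (osd.0 is used purely as an example ID; the exact commands run on this cluster are not recorded in this report):

# Mark the OSD out so data rebalances away from it
ceph osd out osd.0
# Stop the OSD daemon on its host (sysvinit-style service on Firefly-era nodes)
/etc/init.d/ceph stop osd.0
# Remove the OSD from the CRUSH map so it no longer receives placements
ceph osd crush remove osd.0
# Delete its authentication key
ceph auth del osd.0
# Finally remove it from the OSD map
ceph osd rm 0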

  • The 'osd_pool_default_size' is set to 3 and 'osd_pool_default_min_size' to 2.
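
  • Note that the 'osd_pool_default_*' values only take effect when a pool is created; the settings actually in force on each pool can be confirmed directly, as in the sketch below (the pool name 'rbd' is just an example):

# Show size and min_size for a single pool
ceph osd pool get rbd size
ceph osd pool get rbd min_size
# Or list the replication settings of every pool at once
ceph osd dump | grep 'replicated size'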

  • A 'ceph -s' shows the following:

# ceph -s
    cluster 16ce9ce1-aa5f-445f-b994-5699730f364a
     health HEALTH_WARN 326 pgs degraded; 366 pgs stuck unclean; recovery 975/83301 objects degraded (1.170%)
     monmap e1: 3 mons at {mon-01=172.28.225.72:6789/0,mon-02=172.28.225.73:6789/0,mon-03=172.28.225.74:6789/0}, election epoch 18, quorum 0,1,2 mon-01,mon-02,mon-03
     osdmap e540: 29 osds: 29 up, 29 in
      pgmap v2727568: 9408 pgs, 19 pools, 135 GB data, 27767 objects
            403 GB used, 80437 GB / 80840 GB avail
            975/83301 objects degraded (1.170%)
                9042 active+clean
                 326 active+degraded
                  40 active+remapped
  client io 20363 B/s wr, 1 op/s
  • This is the cluster's current state; no further recovery is occurring.
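
  • To narrow down which PGs are stuck and which OSDs they map to, the usual diagnostic commands are sketched below (the PG ID 3.7f is only a placeholder):

# List the degraded/unclean PGs together with their acting OSD sets
ceph health detail
ceph pg dump_stuck unclean
# Query a single stuck PG for its peering and recovery state
ceph pg 3.7f query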

  • A 'ceph osd tree' shows:

# ceph osd tree
# id    weight  type name       up/down reweight
-1      81.6    root default
-2      27.2            host node-c01
0       2.72                    osd.0   DNE
1       2.72                    osd.1   up      1
2       2.72                    osd.2   up      1
3       2.72                    osd.3   up      1
4       2.72                    osd.4   up      1
5       2.72                    osd.5   up      1
6       2.72                    osd.6   up      1
7       2.72                    osd.7   up      1
8       2.72                    osd.8   up      1
9       2.72                    osd.9   up      1
-3      27.2            host node-02
10      2.72                    osd.10  up      1
11      2.72                    osd.11  up      1
12      2.72                    osd.12  up      1
13      2.72                    osd.13  up      1
14      2.72                    osd.14  up      1
15      2.72                    osd.15  up      1
16      2.72                    osd.16  up      1
17      2.72                    osd.17  up      1
18      2.72                    osd.18  up      1
19      2.72                    osd.19  up      1
-4      27.2            host node3-03
20      2.72                    osd.20  up      1
21      2.72                    osd.21  up      1
22      2.72                    osd.22  up      1
23      2.72                    osd.23  up      1
24      2.72                    osd.24  up      1
25      2.72                    osd.25  up      1
26      2.72                    osd.26  up      1
27      2.72                    osd.27  up      1
28      2.72                    osd.28  up      1
29      2.72                    osd.29  up      1
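
  • In the tree above, osd.0 is still listed under host node-c01 as DNE (it has a CRUSH entry but no longer exists in the OSD map), and the 27.2 host weight still includes its 2.72. Whether this is related to the stalled recovery is not established here, but the stale entry can be inspected, and removed if it is unwanted, roughly as follows:

# Check whether osd.0 still appears in the CRUSH map
ceph osd crush dump | grep '"osd\.0"'
# If the entry is unwanted, drop it from the CRUSH hierarchy
# (this changes placement and will trigger further rebalancing)
ceph osd crush remove osd.0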

Environment

  • Red Hat Ceph Enterprise 1.2.3

  • Inktank Ceph Enterprise 1.2
