A degraded Ceph cluster (on Firefly) stops recovering and gets stuck with degraded PGs after an OSD goes down. Why?

Solution In Progress

Issue

  • A degraded Ceph cluster (on Firefly) stops recovering and gets stuck with degraded PGs after an OSD goes down. Why?

  • After removing a failed OSD from a three-node Ceph cluster, data movement/rebalancing started between the remaining OSDs but then stalled, leaving the Ceph cluster stuck with degraded PGs.

  • The 'osd_pool_default_size' is set to 3 and 'osd_pool_default_min_size' to 2.
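
  • Since 'osd_pool_default_size' and 'osd_pool_default_min_size' only take effect at pool-creation time, it is worth confirming the effective per-pool values. A minimal check (the pool name 'rbd' is only an example; repeat for each of the 19 pools), with the output that would be expected if the pools inherited the defaults:

# ceph osd lspools
# ceph osd pool get rbd size
size: 3
# ceph osd pool get rbd min_size
min_size: 2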

  • A 'ceph -s' shows the following:

# ceph -s
    cluster 16ce9ce1-aa5f-445f-b994-5699730f364a
     health HEALTH_WARN 326 pgs degraded; 366 pgs stuck unclean; recovery 975/83301 objects degraded (1.170%)
     monmap e1: 3 mons at {mon-01=172.28.225.72:6789/0,mon-02=172.28.225.73:6789/0,mon-03=172.28.225.74:6789/0}, election epoch 18, quorum 0,1,2 mon-01,mon-02,mon-03
     osdmap e540: 29 osds: 29 up, 29 in
      pgmap v2727568: 9408 pgs, 19 pools, 135 GB data, 27767 objects
            403 GB used, 80437 GB / 80840 GB avail
            975/83301 objects degraded (1.170%)
                9042 active+clean
                 326 active+degraded
                  40 active+remapped
  client io 20363 B/s wr, 1 op/s
  • The above is the current state; no further recovery is occurring.
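
  • To identify exactly which PGs are degraded or stuck and which OSDs they map to, the health detail and stuck-PG listings can be dumped; the PG id '3.5a' in the query below is only a placeholder. In the query output, an 'up' or 'acting' set shorter than the pool size of 3 typically means CRUSH is not finding a third OSD for that PG:

# ceph health detail
# ceph pg dump_stuck unclean
# ceph pg 3.5a query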

  • A 'ceph osd tree' shows:

# ceph osd tree
# id    weight  type name       up/down reweight
-1      81.6    root default
-2      27.2            host node-c01
0       2.72                    osd.0   DNE
1       2.72                    osd.1   up      1
2       2.72                    osd.2   up      1
3       2.72                    osd.3   up      1
4       2.72                    osd.4   up      1
5       2.72                    osd.5   up      1
6       2.72                    osd.6   up      1
7       2.72                    osd.7   up      1
8       2.72                    osd.8   up      1
9       2.72                    osd.9   up      1
-3      27.2            host node-02
10      2.72                    osd.10  up      1
11      2.72                    osd.11  up      1
12      2.72                    osd.12  up      1
13      2.72                    osd.13  up      1
14      2.72                    osd.14  up      1
15      2.72                    osd.15  up      1
16      2.72                    osd.16  up      1
17      2.72                    osd.17  up      1
18      2.72                    osd.18  up      1
19      2.72                    osd.19  up      1
-4      27.2            host node3-03
20      2.72                    osd.20  up      1
21      2.72                    osd.21  up      1
22      2.72                    osd.22  up      1
23      2.72                    osd.23  up      1
24      2.72                    osd.24  up      1
25      2.72                    osd.25  up      1
26      2.72                    osd.26  up      1
27      2.72                    osd.27  up      1
28      2.72                    osd.28  up      1
29      2.72                    osd.29  up      1
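
  • Note that osd.0 is reported as 'DNE' (does not exist) under host node-c01: it has been removed from the OSD map but is still present as a CRUSH item, and its CRUSH weight of 2.72 still counts toward the host. A minimal cleanup sketch, assuming the removal of the failed OSD was only partial (this is a verification step, not necessarily the resolution of the stuck PGs):

# ceph osd crush dump | grep 'osd\.0'
# ceph osd crush remove osd.0
# ceph auth del osd.0
# ceph osd rm 0

  • Removing the stale CRUSH entry changes the CRUSH map and will trigger a further round of data movement.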

Environment

  • Red Hat Ceph Enterprise 1.2.3

  • Inktank Ceph Enterprise 1.2
