OpenStack FFU from 10 to 13 times out when Ceph has more than 250 PGs allocated on one or more OSDs

Environment

OpenStack FFU from OSP10 to OSP13.

Issue

When running a Fast Forward Upgrade of OpenStack from 10 to 13, the Ceph upgrade might time out if the max PGs per OSD warning limit is exceeded.

Resolution

Create a Heat environment file with the following snippet:

parameter_defaults:
  CephConfigOverrides:
    global:
      mon_max_pg_per_osd: 400

The new limit needs to be higher than the highest number of PGs allocated on any given OSD. In this example, it was set to 400.

The newly created environment file should be passed with -e to the overcloud ffwd-upgrade prepare, overcloud ceph-upgrade run, and overcloud ffwd-upgrade converge commands. It should also be reused for any subsequent overcloud deploy command run in the future, or merged into one of the pre-existing custom environment files used to configure the Ceph deployment.
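
For example, assuming the snippet above is saved as /home/stack/ceph-pg-override.yaml (a hypothetical path), the ceph-upgrade step would be invoked roughly as follows, with the existing environment files kept in place:

$ openstack overcloud ceph-upgrade run \
    --templates \
    -e <existing custom environment files> \
    -e /home/stack/ceph-pg-override.yaml

The same -e argument is appended in the same way to the ffwd-upgrade prepare and ffwd-upgrade converge commands.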

Root Cause

The number of PGs allocated on one or more of the OSDs in the Ceph Jewel cluster, initially deployed with OSP10, is higher than 250, which is the default limit set in Ceph Luminous.
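
Once the cluster is running Luminous, the value in effect can be checked directly on a monitor node. The following is a sketch that assumes the monitor is named after the node's short hostname; adjust mon.<id> as needed:

# ceph daemon mon.$(hostname -s) config get mon_max_pg_per_osd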

Diagnostic Steps

The following command can be used to gather the exact number of PGs allocated on every OSD:

# ceph osd df tree

It will produce an output similar to the following:

# ceph osd df tree
ID CLASS WEIGHT  REWEIGHT SIZE    USE     DATA    OMAP META    AVAIL   %USE  VAR  PGS TYPE NAME       
-1       0.17532        -  180GiB 73.5GiB 99.9MiB   0B 73.4GiB  106GiB 40.84 1.00   - root default    
-7       0.05844        - 60.0GiB 24.3GiB 33.3MiB   0B 24.3GiB 35.6GiB 40.59 0.99   -     host ceph-0 
 0   hdd 0.01169  1.00000 12.0GiB 3.82GiB 6.56MiB   0B 3.81GiB 8.18GiB 31.82 0.78  60         osd.0   
 4   hdd 0.01169  1.00000 12.0GiB 8.62GiB 6.75MiB   0B 8.61GiB 3.38GiB 71.84 1.76  71         osd.4   
 6   hdd 0.01169  1.00000 12.0GiB 3.06GiB 6.56MiB   0B 3.05GiB 8.94GiB 25.48 0.62  43         osd.6   
10   hdd 0.01169  1.00000 12.0GiB 5.65GiB 6.69MiB   0B 5.64GiB 6.35GiB 47.08 1.15  53         osd.10  
13   hdd 0.01169  1.00000 12.0GiB 3.20GiB 6.75MiB   0B 3.20GiB 8.79GiB 26.71 0.65  61         osd.13  
-3       0.05844        - 60.0GiB 25.2GiB 33.3MiB   0B 25.1GiB 34.8GiB 41.97 1.03   -     host ceph-1 
 2   hdd 0.01169  1.00000 12.0GiB 1.33GiB 6.56MiB   0B 1.32GiB 10.7GiB 11.09 0.27  41         osd.2   
 5   hdd 0.01169  1.00000 12.0GiB 5.18GiB 6.88MiB   0B 5.17GiB 6.82GiB 43.17 1.06  50         osd.5   
 8   hdd 0.01169  1.00000 12.0GiB 10.7GiB 6.56MiB   0B 10.7GiB 1.30GiB 89.20 2.18  73         osd.8   
11   hdd 0.01169  1.00000 12.0GiB 3.65GiB 6.69MiB   0B 3.64GiB 8.34GiB 30.43 0.75  66         osd.11  
14   hdd 0.01169  1.00000 12.0GiB 4.32GiB 6.62MiB   0B 4.31GiB 7.68GiB 35.99 0.88  58         osd.14  
-5       0.05844        - 60.0GiB 24.0GiB 33.3MiB   0B 23.9GiB 36.0GiB 39.95 0.98   -     host ceph-2 
 1   hdd 0.01169  1.00000 12.0GiB 5.02GiB 6.56MiB   0B 5.01GiB 6.98GiB 41.83 1.02  58         osd.1   
 3   hdd 0.01169  1.00000 12.0GiB 7.06GiB 6.62MiB   0B 7.05GiB 4.94GiB 58.85 1.44  65         osd.3   
 7   hdd 0.01169  1.00000 12.0GiB 3.81GiB 6.75MiB   0B 3.81GiB 8.18GiB 31.79 0.78  62         osd.7   
 9   hdd 0.01169  1.00000 12.0GiB 3.97GiB 6.62MiB   0B 3.96GiB 8.03GiB 33.06 0.81  58         osd.9   
12   hdd 0.01169  1.00000 12.0GiB 4.10GiB 6.75MiB   0B 4.10GiB 7.89GiB 34.20 0.84  45         osd.12  
                    TOTAL  180GiB 73.5GiB 99.9MiB   0B 73.4GiB  106GiB 40.84                          
MIN/MAX VAR: 0.27/2.18  STDDEV: 18.96

The number of PGs hosted on each OSD is shown in the PGS column.
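
To find the single highest value across all OSDs (the number the new mon_max_pg_per_osd limit must exceed), the JSON output of the same command can be parsed. This is a minimal sketch that assumes the nodes/pgs fields of ceph osd df -f json, which may vary slightly between Ceph releases:

# ceph osd df -f json | python -c 'import json,sys; print(max(n["pgs"] for n in json.load(sys.stdin)["nodes"]))'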
