Why are BlueStore OSDs getting full due to metadata usage in RHCS?

Solution Verified - Updated -

Environment

  • Red Hat Enterprise Linux 7.x
  • Red Hat Ceph Storage 3.3.x
  • Ceph 12.2.12-79.el7cp
  • Ceph 12.2.12-84.el7cp
  • Ceph 12.2.12-101.el7cp
  • Ceph 12.2.12-115.el7cp
  • Container tags 3-37 through 3-45

Issue

  • OSDs show high metadata usage.
  • The ceph osd df tree output shows high disk usage even though the OSD pools hold little or no data.

Resolution

Upgrade the cluster to the RHCS 3.3z6 release, which fixes the bluefs log growing exponentially and not being compacted in RHCS 3.x builds.

As a workaround:

  • Periodically run online compaction on the affected OSDs using the ceph tell command:

    # ceph tell osd.<osdid> compact
    

    Online compaction can likewise be triggered for all OSDs with the following command:

    • Non-containerized environment:
    # ceph tell osd.\* compact
    
    • Containerized environment:
    # docker exec <mon-container> ceph tell osd.\* compact
    
  • For RGW workloads, we recommend a minimum of 500 GB of disk per OSD, as RocksDB uses a large amount of space before compaction.
  • Configure OSD pools with an appropriate PG count.
  • Make sure RGW buckets are properly sharded.
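
The per-OSD compaction command above can also be run in a loop, one OSD at a time, instead of using osd.\* for the whole cluster at once. The following is a minimal sketch, not a tested procedure: the CEPH and PAUSE variables and the compact_osds function are names introduced here for illustration, and the only Ceph commands assumed are ceph osd ls and ceph tell osd.<osdid> compact.

```shell
#!/bin/sh
# Sketch: compact OSDs sequentially.
# CEPH and PAUSE are hypothetical knobs for this example, not Ceph settings.
# For a containerized cluster, CEPH could be set to
#   "docker exec <mon-container> ceph"
CEPH="${CEPH:-ceph}"
PAUSE="${PAUSE:-0}"    # seconds to wait between OSDs

compact_osds() {
    # Reads one OSD id per line from stdin.
    while read -r id; do
        [ -n "$id" ] || continue
        # $CEPH is left unquoted on purpose so that a multi-word
        # command prefix is split into separate arguments.
        $CEPH tell "osd.$id" compact </dev/null \
            || echo "compaction failed for osd.$id" >&2
        sleep "$PAUSE"
    done
}

# Usage, on a node with an admin keyring:
#   ceph osd ls | compact_osds
```

Running the compactions sequentially keeps only one OSD busy with compaction at a time; whether that matters depends on the cluster load.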

Root Cause

  • The bluefs log is not compacted due to a known Ceph bug. The upstream patches PR#17354 and PR#35473 address this issue.
  • The bluefs log can grow rapidly to a large size due to the factors below. Make sure your cluster has none of these conditions:
    • The block device is on a faster device (SSD/NVMe) whereas block.db is on a slower device (HDD)
    • Small block.db size (1 GB)
    • The block.db size is less than 4% of the block device
    • RGW buckets are not (properly) sharded
    • Low PG count for OSD pools
    • No data on pools
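
The 4% sizing condition listed above is simple arithmetic and can be checked with a small helper. This is a sketch only: db_pct and db_size_ok are hypothetical function names introduced here, and both take sizes in bytes (for example as reported by blockdev --getsize64 for the respective devices).

```shell
#!/bin/sh
# Sketch: compare block.db size against the block device size.
# Arguments to both functions: BLOCK_BYTES DB_BYTES

# Print block.db size as a percentage of the block device.
db_pct() {
    awk -v b="$1" -v d="$2" 'BEGIN {
        if (b <= 0) exit 2
        printf "%.2f\n", (d / b) * 100
    }'
}

# Return 0 when block.db is at least 4% of the block device,
# 1 when it is smaller, 2 when the block size is not positive.
db_size_ok() {
    awk -v b="$1" -v d="$2" 'BEGIN {
        if (b <= 0) exit 2
        exit !(d * 100 >= b * 4)
    }'
}

# Usage with a 100 GiB block device and a 1 GiB block.db:
#   db_pct 107374182400 1073741824
#   db_size_ok 107374182400 1073741824 || echo "block.db below 4%"
```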

Diagnostic Steps

  • Verify whether the OSDs have high metadata usage and are close to full capacity:

    # ceph osd df tree
    
    ID CLASS WEIGHT  REWEIGHT SIZE    USE     DATA    OMAP META    AVAIL   %USE  VAR  PGS TYPE NAME                      
    -1       0.13715        -  110GiB  100GiB  165MiB   0B 99.8GiB 9.97GiB 90.93 1.00   - root default                    
    -7       0.03918        - 40.0GiB 36.1GiB 59.9MiB   0B 36.0GiB 3.91GiB 90.22 0.99   -     host pd-cephcontainer-osd01
     3   hdd 0.00980  1.00000 10.0GiB 9.89GiB 12.8MiB   0B 9.87GiB  112MiB 98.91 1.09   0         osd.3                  
     7   hdd 0.00980  1.00000 10.0GiB 9.88GiB 14.8MiB   0B 9.87GiB  117MiB 98.85 1.09   5         osd.7                  
    11   hdd 0.00980  1.00000 10.0GiB 9.88GiB 15.9MiB   0B 9.86GiB  119MiB 98.84 1.09   7         osd.11                  
    15   hdd 0.00980  1.00000 10.0GiB 6.42GiB 16.4MiB   0B 6.41GiB 3.57GiB 64.27 0.71   6         osd.15                  
    -3       0.03918        - 30.0GiB 27.2GiB 44.3MiB   0B 27.2GiB 2.76GiB 90.80 1.00   -     host pd-cephcontainer-osd02
     0   hdd 0.00980  1.00000 10.0GiB 7.47GiB 16.4MiB   0B 7.45GiB 2.53GiB 74.69 0.82  11         osd.0                  
     4   hdd 0.00980  1.00000 10.0GiB 9.88GiB 13.9MiB   0B 9.87GiB  117MiB 98.86 1.09   1         osd.4                  
     8   hdd 0.00980  1.00000 10.0GiB 9.88GiB   14MiB   0B 9.87GiB  119MiB 98.84 1.09   0         osd.8                  
    12   hdd 0.00980        0      0B      0B      0B   0B      0B      0B     0    0   0         osd.12                  
    -9       0.01959        - 10.0GiB 7.04GiB 16.6MiB   0B 7.02GiB 2.96GiB 70.38 0.77   -     host pd-cephcontainer-osd03
     5   hdd 0.00980        0      0B      0B      0B   0B      0B      0B     0    0   0         osd.5                  
    13   hdd 0.00980  1.00000 10.0GiB 7.04GiB 16.6MiB   0B 7.02GiB 2.96GiB 70.38 0.77  18         osd.13                  
    -5       0.03918        - 30.0GiB 29.6GiB 44.7MiB   0B 29.6GiB  348MiB 98.87 1.09   -     host pd-cephcontainer-osd04
     2   hdd 0.00980  1.00000 10.0GiB 9.88GiB 16.5MiB   0B 9.87GiB  117MiB 98.86 1.09  11         osd.2                  
     6   hdd 0.00980  1.00000 10.0GiB 9.88GiB 15.9MiB   0B 9.87GiB  116MiB 98.86 1.09   4         osd.6                  
    10   hdd 0.00980  1.00000 10.0GiB 9.88GiB 12.2MiB   0B 9.87GiB  114MiB 98.88 1.09   0         osd.10                  
    14   hdd 0.00980        0      0B      0B      0B   0B      0B      0B     0    0   0         osd.14                  
     1             0        0      0B      0B      0B   0B      0B      0B     0    0   0 osd.1                          
     9             0        0      0B      0B      0B   0B      0B      0B     0    0   0 osd.9                         
    
  • OSDs fail to start and crash with the following call trace:

     ceph version 12.2.12-79.el7cp (baaa55a74a953625c1acd175dab49779fabb84a2) luminous (stable)
     1: (()+0x4029b1) [0x55ce83fda9b1]
     2: (()+0xf630) [0x7f62c9878630]
     3: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0x452) [0x55ce83dada02]
     4: (BlueFS::_replay(bool)+0x48d) [0x55ce83dc14bd]
     5: (BlueFS::mount()+0x1d4) [0x55ce83dc50a4]
     6: (BlueStore::_open_db(bool)+0x1857) [0x55ce83e1ca87]
     7: (BlueStore::_fsck(bool, bool)+0x3c7) [0x55ce83e52657]
     8: (main()+0xf04) [0x55ce83d04cc4]
     9: (__libc_start_main()+0xf5) [0x7f62c824a545]
     10: (()+0x1c44bf) [0x55ce83d9c4bf]
     NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
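
The ceph osd df tree check in the first diagnostic step can be scripted to list only the OSDs that are nearly full while holding more metadata than data. The sketch below assumes the 14-column Luminous layout shown in the sample output (DATA in column 7, META in column 9, %USE in column 11, sizes written without a space before the unit); other releases may format the output differently. The function name flag_meta_heavy and the default threshold of 85 are choices made for this example.

```shell
#!/bin/sh
# Sketch: flag OSD rows where %USE >= LIMIT and META > DATA.
# Reads "ceph osd df tree" output on stdin; LIMIT defaults to 85.
flag_meta_heavy() {
    awk -v limit="${1:-85}" '
    # Convert a size such as "9.87GiB" or "0B" to bytes.
    function bytes(s,    n, u) {
        n = s + 0                       # leading numeric part
        u = s; gsub(/[0-9.]/, "", u)    # remaining unit suffix
        if (u == "KiB") return n * 1024
        if (u == "MiB") return n * 1024 ^ 2
        if (u == "GiB") return n * 1024 ^ 3
        if (u == "TiB") return n * 1024 ^ 4
        return n
    }
    # OSD rows with a device class have exactly 14 fields;
    # root and host rows, and OSDs without a class, are skipped.
    NF == 14 && $NF ~ /^osd\.[0-9]+$/ {
        if ($11 + 0 >= limit && bytes($9) > bytes($7))
            printf "%s use=%s%% data=%s meta=%s\n", $NF, $11, $7, $9
    }'
}

# Usage:
#   ceph osd df tree | flag_meta_heavy 90
```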
    
