Why are BlueStore OSDs getting full due to metadata usage in RHCS?
Environment
- Red Hat Enterprise Linux 7.x
- Red Hat Ceph Storage 3.3.x
- Ceph 12.2.12-79.el7cp
- Ceph 12.2.12-84.el7cp
- Ceph 12.2.12-101.el7cp
- Ceph 12.2.12-115.el7cp
- Container image tags 3-37 through 3-45
Issue
- We are seeing high metadata usage for OSDs. The ceph osd df tree output shows high disk usage even though the OSD pools contain little or no data.
Resolution
Upgrade the cluster to the RHCS 3.3z6 release, which fixes the bluefs log growing exponentially and not being compacted in RHCS 3.x builds.
As a workaround:
- Periodically issue an online compaction for the affected OSDs using the ceph tell command:
# ceph tell osd.<osdid> compact
Similarly, online compaction can be triggered for all OSDs at once with the commands below:
- Non-containerized environment :
# ceph tell osd.\* compact
- Containerized environment :
# docker exec <mon-container> ceph tell osd.\* compact
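The per-OSD compaction step above can be scripted so that only OSDs over a usage threshold are compacted. The sketch below is illustrative and not part of this solution: it parses ceph osd df-style output with awk, using an embedded sample in place of live cluster output, and prints the OSD names that would then be passed to ceph tell.

```shell
#!/bin/sh
# Illustrative sketch: pick out OSDs whose %USE exceeds a threshold from
# `ceph osd df` output. The variable below is sample data standing in for
# a live cluster; on a real node you would pipe `ceph osd df` instead.
sample='ID CLASS WEIGHT  REWEIGHT SIZE    USE     DATA    OMAP META    AVAIL   %USE  VAR  PGS
3  hdd   0.00980 1.00000  10.0GiB 9.89GiB 12.8MiB 0B   9.87GiB 112MiB  98.91 1.09 0
15 hdd   0.00980 1.00000  10.0GiB 6.42GiB 16.4MiB 0B   6.41GiB 3.57GiB 64.27 0.71 6'

# Column 11 is %USE; print an osd.<id> name for every row above 90%.
echo "$sample" | awk 'NR > 1 && $11 > 90 { print "osd." $1 }'

# On a live cluster, each printed name would then be compacted:
#   ceph tell osd.<osdid> compact
```

Filtering this way avoids issuing unnecessary compactions against OSDs that are not affected.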
- For RGW workloads, we recommend a minimum of 500 GB of disk per OSD, as RocksDB uses a large amount of space before compaction.
- Configure OSD pools with appropriate PG count.
- Make sure RGW buckets are properly sharded.
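A quick way to spot under-sharded buckets is radosgw-admin bucket limit check. The sketch below is not from this solution: the JSON is a stand-in for live output, and the field names (objects_per_shard, fill_status) are assumptions about that command's format that should be verified on your build.

```shell
#!/bin/sh
# Sketch: flag buckets whose shards are over-filled. The JSON here is a
# stand-in for `radosgw-admin bucket limit check` output; the field names
# (fill_status, objects_per_shard) are assumptions about that format.
limit_check='{"buckets": [{"bucket": "logs", "num_shards": 1,
  "objects_per_shard": 120000, "fill_status": "OVER 100.000000%"}]}'

# Pull out the fill status for each bucket; anything "OVER" needs resharding.
echo "$limit_check" | grep -o '"fill_status": "[^"]*"'
```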
Root Cause
- The bluefs log is not being compacted due to a known Ceph bug. The upstream patches PR#17354 and PR#35473 address this issue.
- The bluefs log can grow rapidly to a large size due to any of the factors below. Make sure your cluster does not have any of these inconsistencies.
- The block device is on a faster device (SSD/NVMe) while block.db is on a slower device (HDD)
- A small block.db disk size (for example, 1 GB)
- The block.db size is less than 4% of the block device size
- RGW buckets are not (properly) sharded
- A low PG count on OSD pools
- No data on pools
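The 4% sizing guideline above can be sanity-checked with simple arithmetic. The sketch below is illustrative; the 4000 GiB device size is a hypothetical example, not a value from this solution.

```shell
#!/bin/sh
# Sketch of the 4% sizing rule above: for a given block (data) device size,
# compute the minimum recommended block.db size. 4000 GiB is a hypothetical
# example device size, not a value from this solution.
block_dev_gib=4000
awk -v s="$block_dev_gib" 'BEGIN { printf "block.db >= %.0f GiB\n", s * 0.04 }'
```

By the same arithmetic, a 1 GB block.db only satisfies the 4% rule for a data device of about 25 GB, which is why such small block.db partitions are listed as a contributing factor.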
Diagnostic Steps
- Verify whether OSDs show high metadata usage and are nearly at full capacity:
# ceph osd df tree
ID CLASS WEIGHT  REWEIGHT SIZE    USE     DATA    OMAP META    AVAIL   %USE  VAR  PGS TYPE NAME
-1       0.13715        - 110GiB  100GiB  165MiB  0B   99.8GiB 9.97GiB 90.93 1.00   - root default
-7       0.03918        - 40.0GiB 36.1GiB 59.9MiB 0B   36.0GiB 3.91GiB 90.22 0.99   -     host pd-cephcontainer-osd01
 3   hdd 0.00980  1.00000 10.0GiB 9.89GiB 12.8MiB 0B   9.87GiB 112MiB  98.91 1.09   0         osd.3
 7   hdd 0.00980  1.00000 10.0GiB 9.88GiB 14.8MiB 0B   9.87GiB 117MiB  98.85 1.09   5         osd.7
11   hdd 0.00980  1.00000 10.0GiB 9.88GiB 15.9MiB 0B   9.86GiB 119MiB  98.84 1.09   7         osd.11
15   hdd 0.00980  1.00000 10.0GiB 6.42GiB 16.4MiB 0B   6.41GiB 3.57GiB 64.27 0.71   6         osd.15
-3       0.03918        - 30.0GiB 27.2GiB 44.3MiB 0B   27.2GiB 2.76GiB 90.80 1.00   -     host pd-cephcontainer-osd02
 0   hdd 0.00980  1.00000 10.0GiB 7.47GiB 16.4MiB 0B   7.45GiB 2.53GiB 74.69 0.82  11         osd.0
 4   hdd 0.00980  1.00000 10.0GiB 9.88GiB 13.9MiB 0B   9.87GiB 117MiB  98.86 1.09   1         osd.4
 8   hdd 0.00980  1.00000 10.0GiB 9.88GiB 14MiB   0B   9.87GiB 119MiB  98.84 1.09   0         osd.8
12   hdd 0.00980        0      0B      0B      0B 0B        0B      0B     0    0   0         osd.12
-9       0.01959        - 10.0GiB 7.04GiB 16.6MiB 0B   7.02GiB 2.96GiB 70.38 0.77   -     host pd-cephcontainer-osd03
 5   hdd 0.00980        0      0B      0B      0B 0B        0B      0B     0    0   0         osd.5
13   hdd 0.00980  1.00000 10.0GiB 7.04GiB 16.6MiB 0B   7.02GiB 2.96GiB 70.38 0.77  18         osd.13
-5       0.03918        - 30.0GiB 29.6GiB 44.7MiB 0B   29.6GiB 348MiB  98.87 1.09   -     host pd-cephcontainer-osd04
 2   hdd 0.00980  1.00000 10.0GiB 9.88GiB 16.5MiB 0B   9.87GiB 117MiB  98.86 1.09  11         osd.2
 6   hdd 0.00980  1.00000 10.0GiB 9.88GiB 15.9MiB 0B   9.87GiB 116MiB  98.86 1.09   4         osd.6
10   hdd 0.00980  1.00000 10.0GiB 9.88GiB 12.2MiB 0B   9.87GiB 114MiB  98.88 1.09   0         osd.10
14   hdd 0.00980        0      0B      0B      0B 0B        0B      0B     0    0   0         osd.14
 1             0        0      0B      0B      0B 0B        0B      0B     0    0   0         osd.1
 9             0        0      0B      0B      0B 0B        0B      0B     0    0   0         osd.9
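Beyond the df output above, the bluefs log size itself can be read from the OSD admin socket with ceph daemon osd.&lt;osdid&gt; perf dump on the OSD node. The sketch below is illustrative: the JSON is a stand-in for live output, and the bluefs "log_bytes" counter name is an assumption to verify on your build.

```shell
#!/bin/sh
# Sketch: check the BlueFS log size for an OSD. The JSON is a stand-in for
# `ceph daemon osd.<osdid> perf dump` output run on the OSD node; the
# bluefs "log_bytes" counter name is assumed, so verify it on your build.
perf_dump='{"bluefs": {"log_bytes": 17179869184, "db_used_bytes": 19327352832}}'

# Extract the log size in bytes; a multi-GiB value on a mostly empty
# cluster points at the uncompacted bluefs log described in Root Cause.
echo "$perf_dump" | grep -o '"log_bytes": [0-9]*'
```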
- OSDs fail to start and crash with the call trace below:
ceph version 12.2.12-79.el7cp (baaa55a74a953625c1acd175dab49779fabb84a2) luminous (stable)
 1: (()+0x4029b1) [0x55ce83fda9b1]
 2: (()+0xf630) [0x7f62c9878630]
 3: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0x452) [0x55ce83dada02]
 4: (BlueFS::_replay(bool)+0x48d) [0x55ce83dc14bd]
 5: (BlueFS::mount()+0x1d4) [0x55ce83dc50a4]
 6: (BlueStore::_open_db(bool)+0x1857) [0x55ce83e1ca87]
 7: (BlueStore::_fsck(bool, bool)+0x3c7) [0x55ce83e52657]
 8: (main()+0xf04) [0x55ce83d04cc4]
 9: (__libc_start_main()+0xf5) [0x7f62c824a545]
 10: (()+0x1c44bf) [0x55ce83d9c4bf]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.