"Ceph Mon store is getting too big" - 1 OSD Down - #nautilus 14.2.4
  cluster:
    id:     e390486f-603c-4ca4-9c8a-4c14cc8d7985
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum host1,host2,host3 (age 6d)
    mgr: host1(active, since 3w)
    mds: cephfs:2 {0=host1=up:active,1=host2=up:active} 1 up:standby
    osd: 58 osds: 58 up (since 5d), 58 in (since 3w)
    rgw: 2 daemons active (host2.rgw0, host3.rgw0)

  data:
    pools:   7 pools, 3104 pgs
    objects: 30.46M objects, 33 TiB
    usage:   99 TiB used, 307 TiB / 405 TiB avail
    pgs:     3104 active+clean

  io:
    client: 148 MiB/s rd, 2.6 MiB/s wr, 163 op/s rd, 71 op/s wr

That is our cluster in its healthy state. If any one of the 58 OSDs goes down, the store.db under /var/lib/ceph/mon/ceph- fills up very quickly, with *.sst files piling up inside it.
It grows by about 100G in 30 minutes, which slows the cluster down and can break it outright. What is the root cause of this? Is there a way to avoid the pileup of *.sst files when an OSD goes down?
I know we are running an older version of Ceph (Nautilus 14.2.4). We are planning to migrate soon, but until then we need a band-aid for this.
Attached are 2 files:
sstFilesPileup.txt - contains the long list of *.sst files piled up in /var/lib/ceph/mon/ceph-/store.db
cephMonLogs.txt - contains the ceph mon logs, which are full of "stuck undersized" messages
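
For anyone who wants to put numbers on the growth described above, here is a minimal sketch of how the *.sst pileup could be measured. It is only an illustration, not something taken from the attachments: the function name sst_usage is made up, and the store.db path has to be supplied by the caller. It counts the *.sst files directly inside the given directory and adds up their sizes.

```python
import os


def sst_usage(store_dir):
    """Return (file_count, total_bytes) for *.sst files directly inside store_dir.

    store_dir is supplied by the caller (the mon's store.db directory);
    this helper is hypothetical and only reads file metadata.
    """
    count = 0
    total = 0
    with os.scandir(store_dir) as entries:
        for entry in entries:
            # Only regular files ending in .sst are counted; LOG, MANIFEST
            # and other files in the directory are ignored.
            if entry.is_file() and entry.name.endswith(".sst"):
                count += 1
                total += entry.stat().st_size
    return count, total


# Example use (path is a placeholder to be filled in by the caller):
#   n, size = sst_usage(path_to_store_db)
#   print(f"{n} sst files, {size / 2**30:.2f} GiB")
```

Running this every few minutes while an OSD is down would show how fast the store grows and whether it shrinks again once the OSD is back up.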
Kindly advise.
Thanks
Kireet