Ceph - OSD reboots every few minutes with FAILED assert(clone_size.count(clone))
Issue
- A single OSD reboots (the ceph-osd daemon crashes and restarts) every few minutes. When this OSD is marked "OUT" and another OSD backfills to take its place, the new OSD also begins to reboot continuously.
- In the OSD log (/var/log/ceph/ceph-osd.*.log), prior to the OSD logging the assert, we can see that Ceph is scrubbing a PG:
Oct 8 11:00:55 str-yyz-02-01 ceph-osd: -4> 2015-10-08 11:00:55.862853 7f33c3e26700 2 osd.91 pg_epoch: 66350 pg[5.2d6( v 66302'153650 (48156'150649,66302'153650] local-les=66350 n=1553 ec=65 les/c 66350/66350 66349/66349/66349) [91,69,32] r=0 lpr=66349 crt=66302'153647 lcod 0'0 mlcod 0'0 active+clean+scrubbing] scrub osd.91 has 24 items
Oct 8 11:00:55 str-yyz-02-01 ceph-osd: -3> 2015-10-08 11:00:55.862877 7f33c3e26700 2 osd.91 pg_epoch: 66350 pg[5.2d6( v 66302'153650 (48156'150649,66302'153650] local-les=66350 n=1553 ec=65 les/c 66350/66350 66349/66349/66349) [91,69,32] r=0 lpr=66349 crt=66302'153647 lcod 0'0 mlcod 0'0 active+clean+scrubbing] scrub replica 32 has 24 items
Oct 8 11:00:55 str-yyz-02-01 ceph-osd: -2> 2015-10-08 11:00:55.862885 7f33c3e26700 2 osd.91 pg_epoch: 66350 pg[5.2d6( v 66302'153650 (48156'150649,66302'153650] local-les=66350 n=1553 ec=65 les/c 66350/66350 66349/66349/66349) [91,69,32] r=0 lpr=66349 crt=66302'153647 lcod 0'0 mlcod 0'0 active+clean+scrubbing] scrub replica 69 has 24 items
Oct 8 11:00:55 str-yyz-02-01 ceph-osd: -1> 2015-10-08 11:00:55.863074 7f33c3e26700 2 osd.91 pg_epoch: 66350 pg[5.2d6( v 66302'153650 (48156'150649,66302'153650] local-les=66350 n=1553 ec=65 les/c 66350/66350 66349/66349/66349) [91,69,32] r=0 lpr=66349 crt=66302'153647 lcod 0'0 mlcod 0'0 active+clean+scrubbing]
Oct 8 11:02:31 str-yyz-02-01 ceph-osd: 0> 2015-10-08 11:02:31.048794 7f5782775700 -1 osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7f5782775700 time 2015-10-08 11:02:31.047352
osd/osd_types.cc: 3543: FAILED assert(clone_size.count(clone))

ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
1: (SnapSet::get_clone_bytes(snapid_t) const+0xb6) [0x707b46]
2: (ReplicatedPG::_scrub(ScrubMap&)+0x9e8) [0x7c0198]
3: (PG::scrub_compare_maps()+0x5b6) [0x755306]
4: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1d9) [0x758999]
5: (PG::scrub(ThreadPool::TPHandle&)+0x19a) [0x75b96a]
6: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x657309]
7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0xa56dd1]
8: (ThreadPool::WorkThread::entry()+0x10) [0xa57cc0]
9: (()+0x8182) [0x7f579ce6e182]
10: (clone()+0x6d) [0x7f579b5e0fbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
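For context, the assert fires in SnapSet::get_clone_bytes(), which looks up the byte count that the object's snapset metadata recorded for a given clone. The following is a minimal sketch of the failing lookup implied by the backtrace, simplified rather than copied from the Ceph source (the real code lives in osd/osd_types.cc):

// Minimal sketch of the failing lookup; simplified, not the verbatim Ceph code.
#include <cassert>
#include <cstdint>
#include <map>

using snapid_t = uint64_t;  // stand-in for Ceph's snapid_t

struct SnapSet {
    // Size recorded for each clone in the object's snapset metadata.
    std::map<snapid_t, uint64_t> clone_size;

    uint64_t get_clone_bytes(snapid_t clone) const {
        // When the snapset has no clone_size entry for this clone, the
        // OSD aborts here: "FAILED assert(clone_size.count(clone))".
        assert(clone_size.count(clone));
        return clone_size.find(clone)->second;
    }
};

int main() {
    SnapSet ss;
    ss.clone_size[0x966f] = 4194304;  // snapid and size taken from the logs above
    // Looking up a snapid that has no clone_size entry would abort the
    // process, which is how the scrubbing OSD dies.
    return ss.get_clone_bytes(0x966f) == 4194304 ? 0 : 1;
}

In other words, the crash is tied to the data, not the daemon: scrub walks the PG, reaches an object whose snapset metadata lacks a size entry for one of its clones, and the OSD asserts. This would explain why the problem appears to follow the object to whichever OSD backfills the PG next.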
- Also prior to the assert, the following error, which identifies the actual object and snapset at issue, may be observed in the OSD log:
Oct 8 06:46:59 str-yyz-02-01 ceph-osd: 2015-10-08 06:46:59.844990 7fa3158a3700 0 log [ERR] : 5.2d6 shard 91: soid **4993e2d6/rbd_data.ed0c0f561c681.0000000000000a53/966f//5** size 0 != known size 4194304
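The highlighted soid pinpoints the damaged object; reading its fields left to right as hash/object-name/snapid/key/pool gives object hash 4993e2d6, RBD data object rbd_data.ed0c0f561c681.0000000000000a53, snapshot id 966f (hex), an empty key, and pool 5. As a quick sanity check that this is the same PG seen asserting above, the PG id is derived from the low bits of the object hash. A minimal sketch, assuming a power-of-two pg_num of 1024 for pool 5 (the pg_num is not shown in the logs):

#include <cassert>
#include <cstdint>
#include <cstdio>

int main() {
    uint32_t hash   = 0x4993e2d6;  // object hash from the soid above
    uint32_t pg_num = 1024;        // assumed power-of-two pg_num for pool 5
    // For a power-of-two pg_num, Ceph's stable-mod placement reduces to a
    // bit mask, yielding the placement-seed half of the PG id.
    uint32_t ps = hash & (pg_num - 1);
    printf("pg = 5.%x\n", ps);     // prints "pg = 5.2d6", matching the scrub lines
    assert(ps == 0x2d6);
    return 0;
}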
Environment
- Red Hat Ceph Storage 1.2
- Red Hat Ceph Storage 1.2.3
- Red Hat Ceph Storage 1.3
- Red Hat Enterprise Linux 6
- Red Hat Enterprise Linux 7
- Ubuntu Precise 12.04
- Ubuntu Trusty 14.04