Ceph - OSD reboots every few minutes with FAILED assert(clone_size.count(clone))

Issue

  • A single OSD reboots every few minutes. When this OSD is marked as "OUT" and another OSD backfills/takes its place, the new OSD also begins to reboot continuously.
  • In the OSD log (/var/log/ceph/ceph-osd.*.log), shortly before the OSD logs the assert, Ceph can be seen scrubbing a PG (a minimal sketch of the failing check follows this list):
Oct  8 11:00:55 str-yyz-02-01 ceph-osd:     -4> 2015-10-08 11:00:55.862853 7f33c3e26700  2 osd.91 pg_epoch: 66350 pg[5.2d6( v 66302'153650 (48156'150649,66302'153650] local-les=66350 n=1553 ec=65 les/c 66350/66350 66349/66349/66349) [91,69,32] r=0 lpr=66349 crt=66302'153647 lcod 0'0 mlcod 0'0 active+clean+scrubbing] scrub   osd.91 has 24 items
Oct  8 11:00:55 str-yyz-02-01 ceph-osd:     -3> 2015-10-08 11:00:55.862877 7f33c3e26700  2 osd.91 pg_epoch: 66350 pg[5.2d6( v 66302'153650 (48156'150649,66302'153650] local-les=66350 n=1553 ec=65 les/c 66350/66350 66349/66349/66349) [91,69,32] r=0 lpr=66349 crt=66302'153647 lcod 0'0 mlcod 0'0 active+clean+scrubbing] scrub replica 32 has 24 items
Oct  8 11:00:55 str-yyz-02-01 ceph-osd:     -2> 2015-10-08 11:00:55.862885 7f33c3e26700  2 osd.91 pg_epoch: 66350 pg[5.2d6( v 66302'153650 (48156'150649,66302'153650] local-les=66350 n=1553 ec=65 les/c 66350/66350 66349/66349/66349) [91,69,32] r=0 lpr=66349 crt=66302'153647 lcod 0'0 mlcod 0'0 active+clean+scrubbing] scrub replica 69 has 24 items
Oct  8 11:00:55 str-yyz-02-01 ceph-osd:     -1> 2015-10-08 11:00:55.863074 7f33c3e26700  2 osd.91 pg_epoch: 66350 pg[5.2d6( v 66302'153650 (48156'150649,66302'153650] local-les=66350 n=1553 ec=65 les/c 66350/66350 66349/66349/66349) [91,69,32] r=0 lpr=66349 crt=66302'153647 lcod 0'0 mlcod 0'0 active+clean+scrubbing] 
Oct  8 11:02:31 str-yyz-02-01 ceph-osd:      0> 2015-10-08 11:02:31.048794 7f5782775700 -1 osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7f5782775700 time 2015-10-08 11:02:31.047352
osd/osd_types.cc: 3543: FAILED assert(clone_size.count(clone))

ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
1: (SnapSet::get_clone_bytes(snapid_t) const+0xb6) [0x707b46]
2: (ReplicatedPG::_scrub(ScrubMap&)+0x9e8) [0x7c0198]
3: (PG::scrub_compare_maps()+0x5b6) [0x755306]
4: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1d9) [0x758999]
5: (PG::scrub(ThreadPool::TPHandle&)+0x19a) [0x75b96a]
6: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x657309]
7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0xa56dd1]
8: (ThreadPool::WorkThread::entry()+0x10) [0xa57cc0]
9: (()+0x8182) [0x7f579ce6e182]
10: (clone()+0x6d) [0x7f579b5e0fbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
  • Also prior to the assert, the following error, which identifies the affected object and snapset, may be observed in the OSD log (a sketch for decoding this object identifier follows this list):
Oct  8 06:46:59 str-yyz-02-01 ceph-osd: 2015-10-08 06:46:59.844990 7fa3158a3700  0 log [ERR] : 5.2d6 shard 91: soid 4993e2d6/rbd_data.ed0c0f561c681.0000000000000a53/966f//5 size 0 != known size 4194304
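
What the assert means: SnapSet::get_clone_bytes() looks up the clone being scrubbed in the snapset's per-clone size map and aborts the OSD if no entry exists, i.e. the object's snapset metadata is missing a clone that the scrub encountered. Below is a minimal, self-contained C++ sketch of that check. It is an illustration based on the assert text and backtrace, not the actual Ceph source; the SnapSetSketch type and the snapid 0x966e are invented, and 0x966f is taken from the scrub error above on the assumption that that field is the clone's snapshot id.

#include <cassert>
#include <cstdint>
#include <iostream>
#include <map>

typedef uint64_t snapid_t;   // stand-in for Ceph's snapid_t

struct SnapSetSketch {
    // Per-object map of clone snapid -> clone size in bytes.
    std::map<snapid_t, uint64_t> clone_size;

    uint64_t get_clone_bytes(snapid_t clone) const {
        // Mirrors the check reported at osd/osd_types.cc:3543: every clone
        // reached during scrub must have a clone_size entry; a missing entry
        // trips the assert and aborts the whole OSD process.
        assert(clone_size.count(clone));
        return clone_size.find(clone)->second;
    }
};

int main() {
    SnapSetSketch ss;
    ss.clone_size[0x966e] = 4194304;                      // healthy clone entry (invented snapid)
    std::cout << ss.get_clone_bytes(0x966e) << std::endl; // prints 4194304
    ss.get_clone_bytes(0x966f);                           // no entry: assertion failure, as in the log
    return 0;
}

Because the inconsistency is in the object's on-disk snapset metadata rather than in any one OSD process, restarting the OSD or letting another OSD backfill the PG does not help: the next scrub of the same PG hits the same assert, which matches the behaviour described above.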

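To act on that error, it helps to pull the object name and snapshot id out of the soid string. The short sketch below splits the soid from the log line above; the field layout it assumes (object hash, object name, snapshot id in hex, an empty locator/namespace field, pool id) is an interpretation of this particular log line, not something taken from Ceph documentation.

#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    // soid copied from the scrub error in the Issue section above.
    const std::string soid =
        "4993e2d6/rbd_data.ed0c0f561c681.0000000000000a53/966f//5";

    // Split the identifier on '/'.
    std::vector<std::string> f;
    std::stringstream ss(soid);
    for (std::string part; std::getline(ss, part, '/'); )
        f.push_back(part);

    // Assumed layout: hash / object name / snap id (hex) / empty field / pool id.
    std::cout << "object: " << f[1] << "\n"
              << "snapid: 0x" << f[2]
              << " (" << std::strtoull(f[2].c_str(), nullptr, 16) << ")\n"
              << "pool:   " << f[4] << "\n";
    return 0;
}
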
Environment

  • Red Hat Ceph Storage 1.2
  • Red Hat Ceph Storage 1.2.3
  • Red Hat Ceph Storage 1.3
  • Red Hat Enterprise Linux 6
  • Red Hat Enterprise Linux 7
  • Ubuntu Precise 12.04
  • Ubuntu Trusty 14.04
