Ceph - OSD taking suicide timeout during deep scrubbing


Environment

  • Red Hat Ceph Storage 1.3.1
  • Red Hat Enterprise Linux 7.x
  • RadosGW

Issue

  • The Ceph cluster is running RadosGW with very large bucket indexes containing a large number of objects (an example of checking per-bucket object counts follows this list).
  • During deep scrubbing of a PG, the OSD hits a suicide timeout because it has been waiting on the deep_scrub thread for over 60 seconds. Prior to the assert, a deep_scrub kicks off for a placement group:
2015-11-18 18:21:04.385672 7fc1c21c4700 10 osd.933 pg_epoch: 821902 pg[5.143e( v 821849'118406 (811813'115381,821849'118406] local-les=821901 n=14374 ec=22869 les/c 821901/821902 821900/821900/817050) [1351,470,933] r=2 lpr=821900 pi=664979-821899/79 luod=0'0 crt=821849'118406 lcod 0'0 active] be_deep_scrub cc40143e/.dir.default.120331675.42828/head//5 seed 4294967295
  • The OSD will assert a little over 60 seconds later with the following failure:
2015-11-18 18:22:05.040871 7fc1cc1d8700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fc1c21c4700' had timed out after 60
2015-11-18 18:22:05.040875 7fc1cc1d8700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fc1c21c4700' had suicide timed out after 60
2015-11-18 18:22:05.042788 7fc1cc1d8700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fc1cc1d8700 time 2015-11-18 18:22:05.040885
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xb1a76f]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2e1) [0xa55e71]
 3: (ceph::HeartbeatMap::is_healthy()+0xb7) [0xa56647]
 4: (OSD::handle_osd_ping(MOSDPing*)+0x7a7) [0x671587]
 5: (OSD::heartbeat_dispatch(Message*)+0x48b) [0x67262b]
 6: (Messenger::ms_deliver_dispatch(Message*)+0x77) [0xbdcdd7]
 7: (DispatchQueue::entry()+0x44a) [0xbd9f7a]
 8: (DispatchQueue::DispatchThread::entry()+0xd) [0xaf9e7d]
 9: (()+0x7e9a) [0x7fc1e1830e9a]
 10: (clone()+0x6d) [0x7fc1e05244bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
  • The above assert is from upstream Hammer 0.94.5, but this issue is present in Red Hat Ceph Storage 1.3.1 as well.
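
To confirm that the affected buckets carry very large indexes, the per-bucket object count can be inspected with radosgw-admin. This is a minimal sketch; the bucket name is a placeholder and the exact output fields can vary between releases.

# Dump statistics for a single bucket; the "num_objects" counter under "usage"
# reflects how many objects the bucket index has to track.
radosgw-admin bucket stats --bucket=<bucket_name>

# Or dump statistics for every bucket and look for unusually large object counts.
radosgw-admin bucket stats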

Resolution

  • Increase the 'osd_scrub_thread_suicide_timeout' config option from the default value of 60 to the recommended value of 300 (see the example after this list).
  • The default value will be increased to 300 seconds upstream; the change is tracked in the upstream Ceph Tracker and the downstream Red Hat Bugzilla.
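
A minimal sketch of applying the change, assuming a standard /etc/ceph/ceph.conf layout and the ceph CLI on an admin node; adjust the restart procedure to match your deployment:

# In /etc/ceph/ceph.conf on the OSD nodes, under the [osd] section:
[osd]
osd scrub thread suicide timeout = 300

# Optionally attempt to inject the value at runtime; depending on the release,
# the scrub thread pool may only pick the new value up after an OSD restart.
ceph tell osd.* injectargs '--osd_scrub_thread_suicide_timeout 300'

# Verify the value a given OSD is running with (run on the node hosting that OSD):
ceph daemon osd.933 config show | grep osd_scrub_thread_suicide_timeout

The OSD id (933) is taken from the log excerpts in this article; substitute the id of any OSD you want to check.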

Root Cause

  • osd_scrub_thread_suicide_timeout was introduced in the Hammer 0.94.3 release with a default value of 60 seconds. When a scrubbing thread is unresponsive for 60 seconds, the OSD asserts with a suicide timeout.
  • When a deep scrub scans the objects of an extremely large bucket, the scrubbing thread does not respond, so the 60-second osd_scrub_thread_suicide_timeout is reached.
  • In previous versions of Ceph the 'suicide timeout' was not a configurable option. The option ensures that poorly behaving OSDs terminate themselves instead of running in a degraded state and slowing traffic; it was introduced to prevent a dying OSD disk from impacting the cluster. A Bugzilla was opened for this issue.
  • Red Hat released an erratum which included this option.

Diagnostic Steps

  • During deep scrubbing of a PG, the OSD hits a suicide timeout because it has been waiting on the deep_scrub thread for over 60 seconds. Prior to the assert, a deep_scrub kicks off for a placement group:
2015-11-18 18:21:04.385672 7fc1c21c4700 10 osd.933 pg_epoch: 821902 pg[5.143e( v 821849'118406 (811813'115381,821849'118406] local-les=821901 n=14374 ec=22869 les/c 821901/821902 821900/821900/817050) [1351,470,933] r=2 lpr=821900 pi=664979-821899/79 luod=0'0 crt=821849'118406 lcod 0'0 active] be_deep_scrub cc40143e/.dir.default.120331675.42828/head//5 seed 4294967295
  • The OSD will assert a little over 60 seconds later with the following failure:
2015-11-18 18:22:05.040871 7fc1cc1d8700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fc1c21c4700' had timed out after 60
2015-11-18 18:22:05.040875 7fc1cc1d8700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fc1c21c4700' had suicide timed out after 60
2015-11-18 18:22:05.042788 7fc1cc1d8700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fc1cc1d8700 time 2015-11-18 18:22:05.040885
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xb1a76f]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2e1) [0xa55e71]
 3: (ceph::HeartbeatMap::is_healthy()+0xb7) [0xa56647]
 4: (OSD::handle_osd_ping(MOSDPing*)+0x7a7) [0x671587]
 5: (OSD::heartbeat_dispatch(Message*)+0x48b) [0x67262b]
 6: (Messenger::ms_deliver_dispatch(Message*)+0x77) [0xbdcdd7]
 7: (DispatchQueue::entry()+0x44a) [0xbd9f7a]
 8: (DispatchQueue::DispatchThread::entry()+0xd) [0xaf9e7d]
 9: (()+0x7e9a) [0x7fc1e1830e9a]
 10: (clone()+0x6d) [0x7fc1e05244bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
  • Validate that the deep_scrub thread matches the disk_tp thread that caused the suicide timeout (a grep sketch for this mapping follows this list).
  • Example of mapping the deep scrub thread to the timeout:
2015-11-18 18:21:04.385672 7fc1c21c4700 10 osd.933 pg_epoch: 821902 pg[5.143e( v 821849'118406 (811813'115381,821849'118406] local-les=821901 n=14374 ec=22869 les/c 821901/821902 821900/821900/817050) [1351,470,933] r=2 lpr=821900 pi=664979-821899/79 luod=0'0 crt=821849'118406 lcod 0'0 active] be_deep_scrub cc40143e/.dir.default.120331675.42828/head//5 seed 4294967295
  • Deep scrub thread is 7fc1c21c4700
2015-11-18 18:22:05.040875 7fc1cc1d8700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fc1c21c4700' had suicide timed out after 60
  • The thread hitting the timeout is 'OSD::disk_tp thread 0x7fc1c21c4700', i.e. the same thread ID.
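
A minimal sketch of performing this mapping with grep against the OSD log; the log path and OSD id are taken from the excerpts above, and debug osd = 10 or higher must be set for the be_deep_scrub line to appear:

# Show the deep scrub activity logged by the thread named in the timeout message.
grep 7fc1c21c4700 /var/log/ceph/ceph-osd.933.log | grep be_deep_scrub

# Show the heartbeat_map messages that reference the same thread id.
grep 7fc1c21c4700 /var/log/ceph/ceph-osd.933.log | grep 'timed out'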
