OSD continues reporting heartbeat-timeouts due to mark_down vs accept race condition

Solution In Progress

Issue

  • OSDs repeatedly log heartbeat_check timeout messages at irregular intervals
  • Once the messages start, only a restart of the relevant OSD stops them from being logged
  • OSD logs show that osd.x has never received a reply from osd.y on either front or back since the first ping was sent:
7f26973c1700 -1 osd.x 409046 heartbeat_check: no reply from <IP_ADDR:PORT> osd.y ever on either front or back, first ping sent <TIMESTAMP> (cutoff <TIMESTAMP>)
7f26973c1700 -1 osd.x 409046 heartbeat_check: no reply from <IP_ADDR:PORT> osd.y ever on either front or back, first ping sent <TIMESTAMP> (cutoff <TIMESTAMP>)
  • With debug_osd=30, the OSD logs show the following messages (last_rx_back is updated while last_rx_front stays at 0.000000):
7f55b5535700 25 osd.x 409978 handle_osd_ping got reply from osd.y first_tx <TIMESTAMP> last_tx <TIMESTAMP> last_rx_back <TIMESTAMP> -> <TIMESTAMP> last_rx_front 0.000000
7f55b1d66700 10 osd.x 409978 tick_without_osd_lock
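
To gauge how often the symptom occurs, the heartbeat_check lines shown above can be counted per reporting OSD and peer. The following is a minimal sketch, not a supported tool: the regular expression is derived only from the log excerpt in this article, and the names HEARTBEAT_RE and count_never_replied, as well as the log path passed on the command line, are hypothetical.

```python
import re
import sys
from collections import Counter

# Pattern based on the heartbeat_check excerpt in this article (an assumption;
# other Ceph releases may format the line differently).
HEARTBEAT_RE = re.compile(
    r"(?P<reporter>osd\.\S+) \d+ heartbeat_check: no reply from "
    r"(?P<addr>\S+) (?P<peer>osd\.\S+) ever on either front or back"
)


def count_never_replied(lines):
    """Count 'no reply ... ever' messages per (reporting OSD, peer OSD)."""
    counts = Counter()
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            counts[(match.group("reporter"), match.group("peer"))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical path): python count_heartbeat.py <path to OSD log>
    with open(sys.argv[1], errors="replace") as log_file:
        for (reporter, peer), n in count_never_replied(log_file).most_common():
            print(f"{reporter} -> {peer}: {n} message(s)")
```

A pair whose count keeps growing until the reporting OSD is restarted matches the behaviour described in this issue.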

Environment

  • Red Hat Ceph Storage 3.2z2
