OSD continues reporting heartbeat-timeouts due to mark_down vs accept race condition
Issue
- OSDs report heartbeat_check timeout messages repeatedly at irregular intervals
- Once the messages start, only restarting the affected OSD stops them from being logged
- OSD logs show that osd.x received no reply from osd.y on either front or back, first ping sent:
7f26973c1700 -1 osd.x 409046 heartbeat_check: no reply from <IP_ADDR:PORT> osd.y ever on either front or back, first ping sent <TIMESTAMP> (cutoff <TIMESTAMP>)
7f26973c1700 -1 osd.x 409046 heartbeat_check: no reply from <IP_ADDR:PORT> osd.y ever on either front or back, first ping sent <TIMESTAMP> (cutoff <TIMESTAMP>)
- OSD logs with the debug setting debug_osd=30 show the following messages:
7f55b5535700 25 osd.x 409978 handle_osd_ping got reply from osd.y first_tx <TIMESTAMP> last_tx <TIMESTAMP> last_rx_back <TIMESTAMP> -> <TIMESTAMP> last_rx_front 0.000000
7f55b1d66700 10 osd.x 409978 tick_without_osd_lock
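
To gauge how often the symptom above occurs and which peers are involved, the heartbeat_check lines can be tallied per OSD pair. The following is a minimal sketch, not a Red Hat tool: the function name count_heartbeat_timeouts and the regular expression are assumptions based only on the log format shown above, and may need adjusting for other Ceph releases.

```python
import re
from collections import Counter

# Assumed pattern, derived from the sample heartbeat_check lines above:
# "<reporter> <epoch> heartbeat_check: no reply from <addr> <peer> ..."
HEARTBEAT_RE = re.compile(
    r"(?P<reporter>osd\.\S+) \d+ heartbeat_check: "
    r"no reply from \S+ (?P<peer>osd\.\S+)"
)


def count_heartbeat_timeouts(lines):
    """Count heartbeat_check timeout lines per (reporter, peer) pair."""
    counts = Counter()
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            counts[(match.group("reporter"), match.group("peer"))] += 1
    return counts
```

A possible usage, assuming the OSD log lives at a path such as /var/log/ceph/ceph-osd.<id>.log, is to open the file and pass the file object to count_heartbeat_timeouts, then print the pairs returned by most_common(). A pair whose count keeps growing until the reporting OSD is restarted matches the behavior described in this issue.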
Environment
- Red Hat Ceph Storage 3.2z2