5.6. Increasing the restart delay for large Ceph clusters

During deployment, Ceph services such as OSDs and Monitors are restarted, and the deployment does not continue until each service is running again. By default, Ansible checks up to 5 times (the retries) whether the service has started and waits 15 seconds between checks (the delay). If the service does not restart, the deployment stops so that the operator can intervene.

Depending on the size of the Ceph cluster, you may need to increase the retry or delay values. The exact names of these parameters and their defaults are as follows:

 health_mon_check_retries: 5
 health_mon_check_delay: 15
 health_osd_check_retries: 5
 health_osd_check_delay: 15

Procedure

  1. Update the CephAnsibleExtraConfig parameter to change the default delay and retry values:

    parameter_defaults:
      CephAnsibleExtraConfig:
        health_osd_check_delay: 40
        health_osd_check_retries: 30
        health_mon_check_delay: 20
        health_mon_check_retries: 10

    With this example, Ansible checks the Ceph OSDs up to 30 times and waits 40 seconds between checks, and checks the Ceph MONs up to 10 times and waits 20 seconds between checks.

  2. To apply the changes, pass the updated YAML file with the -e option of the openstack overcloud deploy command.
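
When choosing values, it can help to estimate how long the deployment might wait for a service before it stops. The following is a minimal sketch, assuming the maximum wait is roughly retries multiplied by delay and ignoring the time each check itself takes. The helper name max_wait_seconds is hypothetical and is not part of ceph-ansible or director.

```python
# Rough upper bound on how long Ansible waits for a restarted Ceph service.
# Assumption: total wait is about retries * delay seconds; the time spent
# running each individual check is ignored.

def max_wait_seconds(retries: int, delay: int) -> int:
    """Return the approximate maximum wait in seconds (hypothetical helper)."""
    return retries * delay

# (retries, delay) pairs: the defaults listed in this section.
defaults = {"mon": (5, 15), "osd": (5, 15)}
# (retries, delay) pairs: the CephAnsibleExtraConfig example from step 1.
example = {"mon": (10, 20), "osd": (30, 40)}

for label, settings in (("default", defaults), ("example", example)):
    for service, (retries, delay) in settings.items():
        total = max_wait_seconds(retries, delay)
        print(f"{label} {service}: {retries} x {delay}s = {total}s")
```

Under this assumption, the defaults allow about 75 seconds per service, while the example values allow about 1200 seconds for OSDs and 200 seconds for MONs.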