Chapter 4. Create a Cluster

If at any point you run into trouble and you want to start over, execute the following to purge the configuration:

ceph-deploy purgedata <ceph-node> [<ceph-node>]
ceph-deploy forgetkeys

To purge the Ceph packages too, you may also execute:

ceph-deploy purge <ceph-node> [<ceph-node>]

If you execute purge, you must re-install Ceph.
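
For example, assuming three nodes named node1, node2 and node3 (substitute your own host names):

ceph-deploy purgedata node1 node2 node3
ceph-deploy forgetkeys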

On your Calamari admin node, from the directory you created to hold your configuration details, perform the following steps using ceph-deploy.

  1. Create the cluster:

    ceph-deploy new <initial-monitor-node(s)>

    For example:

    ceph-deploy new node1

    Check the output of ceph-deploy with ls and cat in the current directory. You should see a Ceph configuration file, a monitor secret keyring, and a log file of the ceph-deploy procedures.
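
    For example, assuming the default cluster name of ceph, a listing of the directory might look similar to the following (exact file names can vary with your ceph-deploy version):

    ls
    ceph.conf  ceph.log  ceph.mon.keyring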

    At this stage, you may begin editing your Ceph configuration file.

    Note

    If you choose not to use ceph-deploy, you will have to deploy Ceph manually, or refer to the Ceph manual deployment documentation and configure a deployment tool (e.g., Chef, Juju, Puppet) to perform each operation that ceph-deploy performs for you.

  2. Add the public_network and cluster_network settings under the [global] section of your Ceph configuration file.

    public_network = <ip-address>/<netmask>
    cluster_network = <ip-address>/<netmask>

    These settings distinguish which network is public (front-side) and which network is for the cluster (back-side). Ensure that your nodes have interfaces configured for these networks. We do not recommend using the same NIC for the public and cluster networks.
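
    For example, assuming a front-side subnet of 192.168.0.0/24 and a back-side subnet of 192.168.1.0/24 (substitute your own networks):

    public_network = 192.168.0.0/24
    cluster_network = 192.168.1.0/24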

  3. Turn on IPv6 if you intend to use it by adding the following setting under the [global] section of your Ceph configuration file.

    ms_bind_ipv6 = true
  4. Add or adjust the osd journal size setting under the [global] section of your Ceph configuration file.

    osd_journal_size = 10000

    We recommend a general setting of 10 GB. Ceph’s default osd_journal_size is 0, so you will need to set this in your ceph.conf file. To size the journal, take the product of the filestore_max_sync_interval and the expected throughput, and multiply that product by two (2). The expected throughput should account for both disk throughput (i.e., the sustained data transfer rate) and network throughput. For example, a 7200 RPM disk will likely provide approximately 100 MB/s. Taking the min() of the disk and network throughput should provide a reasonable expected throughput.
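
    As an illustrative calculation, assume an expected throughput of 100 MB/s (the min() of a 7200 RPM disk at roughly 100 MB/s and a 1 Gb/s network at roughly 125 MB/s) and a filestore_max_sync_interval of 5 seconds (its usual default):

    journal size = 2 * (expected throughput * filestore_max_sync_interval)
                 = 2 * (100 MB/s * 5 s)
                 = 1000 MB

    The general 10 GB setting recommended above comfortably exceeds this figure and leaves headroom for longer sync intervals or faster media.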

  5. Set the number of copies to store (default is 3) and the minimum number of copies required for writing data in a degraded state (default is 2) under the [global] section of your Ceph configuration file. We recommend the default values for production clusters.

    osd_pool_default_size = 3
    osd_pool_default_min_size = 2

    For a quick start, you may wish to set osd_pool_default_size to 2 and osd_pool_default_min_size to 1 so that you can achieve an active+clean state with only two OSDs.

    These settings establish the networking bandwidth requirements for the cluster network, and the ability to write data with eventual consistency (i.e., you can write data to a cluster in a degraded state if it has min_size copies of the data already).
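
    For the quick start described above, the corresponding settings would be:

    osd_pool_default_size = 2
    osd_pool_default_min_size = 1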

  6. Set the default number of placement groups (osd_pool_default_pg_num) and placement groups for placement (osd_pool_default_pgp_num) for a pool under the [global] section of your Ceph configuration file. The number you specify depends upon the number of OSDs in your cluster. For small clusters (< 5 OSDs) we recommend 128 placement groups per pool. The osd_pool_default_pg_num and osd_pool_default_pgp_num values should be equal.

    osd_pool_default_pg_num = <n>
    osd_pool_default_pgp_num = <n>
    • If you have fewer than 5 OSDs, set pg_num and pgp_num to 128.
    • If you have between 5 and 10 OSDs, set pg_num and pgp_num to 512.
    • If you have between 10 and 50 OSDs, set pg_num and pgp_num to 4096.
    • If you have more than 50 OSDs, you need to understand the tradeoffs and how to calculate the pg_num and pgp_num values. Generally, you may use the formula:

                   (OSDs * 100)
      Total PGs =  ------------
                    pool size

      Where the pool size in the formula above is the osd_pool_default_size value you set in the preceding step. For best results, round the result of this formula up to the nearest power of two. Rounding up is optional, but it helps CRUSH balance objects evenly across placement groups.
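
      As a hypothetical example, with 40 OSDs and an osd_pool_default_size of 3:

                   (40 * 100)
      Total PGs =  ----------  =  1333.33
                       3

      Rounding 1333.33 up to the nearest power of two gives 2048, so you would set osd_pool_default_pg_num and osd_pool_default_pgp_num to 2048.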

  7. Set the maximum number of placement groups per OSD. The Ceph Storage Cluster has a default maximum value of 300 placement groups per OSD. You can set a different maximum value in your Ceph configuration file.

    mon_pg_warn_max_per_osd = <n>

    Multiple pools can use the same CRUSH ruleset. When an OSD has too many placement groups associated with it, Ceph performance may degrade due to resource use and load. This setting warns you, but you may adjust it to your needs and the capabilities of your hardware.
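
    For example, to raise the warning threshold to 400 placement groups per OSD (an illustrative value; choose one appropriate to your hardware):

    mon_pg_warn_max_per_osd = 400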

  8. Set a CRUSH leaf type to the largest serviceable failure domain for your replicas under the [global] section of your Ceph configuration file. The default value is 1, or host, which means that CRUSH will map replicas to OSDs on separate hosts. For example, if you want to make three object replicas, and you have three racks of chassis/hosts, you can set osd_crush_chooseleaf_type to 3, and CRUSH will place each copy of an object on OSDs in different racks. For example:

    osd_crush_chooseleaf_type = 3

    The default CRUSH hierarchy types are:

    • type 0 osd
    • type 1 host
    • type 2 chassis
    • type 3 rack
    • type 4 row
    • type 5 pdu
    • type 6 pod
    • type 7 room
    • type 8 datacenter
    • type 9 region
    • type 10 root
  9. Set max_open_files so that Ceph sets the maximum number of open file descriptors at the OS level, which helps prevent Ceph OSD Daemons from running out of file descriptors.

    max_open_files = 131072
  10. We recommend having settings for clock drift in your Ceph configuration in addition to setting up NTP on your monitor nodes, because clock drift is a common reason monitors fail to achieve a consensus on the state of the cluster. We also recommend setting the report timeout and the down out interval in the Ceph configuration file so that you have a reference point for how long an OSD can be down before the cluster starts re-balancing.

    mon_clock_drift_allowed = .15
    mon_clock_drift_warn_backoff = 30
    mon_osd_down_out_interval = 300
    mon_osd_report_timeout = 300
  11. Set the full_ratio and near_full_ratio to acceptable values. By default, the cluster is full at 95% and near full at 85%. You may also set backfill_full_ratio so that OSDs do not accept backfill requests when they are already near capacity.

    mon_osd_full_ratio = .75
    mon_osd_nearfull_ratio = .65
    osd_backfill_full_ratio = .65

    Consider the amount of storage capacity that becomes unavailable when a large-grained failure domain such as a rack fails (e.g., due to the failure of a power distribution unit or a rack switch). If you have stringent high availability requirements, weigh the cost/benefit tradeoff of keeping that amount of extra capacity in reserve. As a best practice, set the near full ratio so that you receive "near full" warnings well before the cluster approaches the full ratio, giving you ample time to provision additional hardware for your cluster. "Near full" warnings may be annoying, but they are not as annoying as an interruption of service.
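
    As an illustrative calculation, assume your cluster spans four racks of equal capacity and you want to survive the loss of one rack without exceeding a full ratio of .95. Losing one rack removes 25% of the raw capacity, so (assuming CRUSH can re-replicate the affected data onto the remaining racks) all data must fit within the surviving 75%:

      maximum safe utilization = full ratio * surviving capacity fraction
                               = .95 * .75
                               = .7125 (roughly 71% of total raw capacity)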

    Important

    When your cluster reaches its full ratio, Ceph prevents clients from accessing the cluster to ensure data durability. This results in a service interruption, so you should carefully consider the implications of capacity planning and of reaching full capacity, especially in view of failure.

In summary, your initial Ceph configuration file should have at least the following settings with appropriate values assigned after the = sign:

[global]
fsid = <cluster-id>
mon_initial_members = <hostname>[, <hostname>]
mon_host = <ip-address>[, <ip-address>]
public_network = <network>[, <network>]
cluster_network = <network>[, <network>]
ms_bind_ipv6 = [true | false]
max_open_files = 131072
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_journal_size = <n>
filestore_xattr_use_omap = true
osd_pool_default_size = <n>  # Write an object n times.
osd_pool_default_min_size = <n> # Allow writing n copies in a degraded state.
osd_pool_default_pg_num = <n>
osd_pool_default_pgp_num = <n>
osd_crush_chooseleaf_type = <n>
mon_osd_full_ratio = <n>
mon_osd_nearfull_ratio = <n>
osd_backfill_full_ratio = <n>
mon_clock_drift_allowed = .15
mon_clock_drift_warn_backoff = 30
mon_osd_down_out_interval = 300
mon_osd_report_timeout = 300
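
For reference, a filled-in configuration for a small hypothetical cluster might look like the following. The host name, IP addresses and fsid shown are examples only; ceph-deploy new generates the fsid, mon_initial_members and mon_host entries for you.

[global]
fsid = a7f64266-0894-4f1e-a635-d0aeaca0e993
mon_initial_members = node1
mon_host = 192.168.0.1
public_network = 192.168.0.0/24
cluster_network = 192.168.1.0/24
ms_bind_ipv6 = false
max_open_files = 131072
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_journal_size = 10000
filestore_xattr_use_omap = true
osd_pool_default_size = 3
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 128
osd_pool_default_pgp_num = 128
osd_crush_chooseleaf_type = 1
mon_osd_full_ratio = .75
mon_osd_nearfull_ratio = .65
osd_backfill_full_ratio = .65
mon_clock_drift_allowed = .15
mon_clock_drift_warn_backoff = 30
mon_osd_down_out_interval = 300
mon_osd_report_timeout = 300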