Chapter 6. Create a Cluster

If at any point you run into trouble and you want to start over, execute the following to purge the configuration:

ceph-deploy purgedata <ceph-node> [<ceph-node>]
ceph-deploy forgetkeys

To purge the Ceph packages too, you may also execute:

ceph-deploy purge <ceph-node> [<ceph-node>]

If you execute purge, you must re-install Ceph.

On your Calamari admin node, from the directory you created to hold your configuration details, perform the following steps using ceph-deploy.

  1. Create the cluster:

    ceph-deploy new <initial-monitor-node(s)>

    For example:

    ceph-deploy new node1

    Check the output of ceph-deploy with ls and cat in the current directory. You should see a Ceph configuration file, a monitor secret keyring, and a log file of the ceph-deploy procedures.
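
    For example, the directory listing might look something like this (exact file names can vary with the ceph-deploy version):

    ls
    ceph-deploy-ceph.log  ceph.conf  ceph.mon.keyring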

    At this stage, you may begin editing your Ceph configuration file.

    Note

    If you choose not to use ceph-deploy, you will have to deploy Ceph manually, or refer to the Ceph manual deployment documentation and configure a deployment tool (e.g., Chef, Juju, Puppet) to perform each operation that ceph-deploy performs for you.

  2. Add the public_network and cluster_network settings under the [global] section of your Ceph configuration file.

    public_network = <ip-address>/<netmask>
    cluster_network = <ip-address>/<netmask>

    These settings distinguish which network is public (front-side) and which network is for the cluster (back-side). Ensure that your nodes have interfaces configured for these networks. We do not recommend using the same NIC for the public and cluster networks.
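
    For example, a minimal sketch assuming a front-side network of 192.168.0.0/24 and a dedicated back-side network of 10.0.0.0/24 (example addresses only):

    public_network = 192.168.0.0/24
    cluster_network = 10.0.0.0/24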

  3. Turn on IPv6 if you intend to use it.

    ms_bind_ipv6 = true

  4. Add or adjust the osd_journal_size setting under the [global] section of your Ceph configuration file.

    osd_journal_size = 10000

    We recommend a general setting of 10 GB. Ceph’s default osd_journal_size is 0, so you will need to set this in your ceph.conf file. The journal size should be at least twice the product of filestore_max_sync_interval and the expected throughput. The expected throughput should account for both disk throughput (i.e., sustained data transfer rate) and network throughput; for example, a 7200 RPM disk can sustain approximately 100 MB/s. Taking the min() of the disk and network throughput should provide a reasonable expected throughput.
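
    As a hypothetical worked example, assuming filestore_max_sync_interval = 5 seconds and an expected throughput of min(100 MB/s disk, ~125 MB/s for 1GbE) = 100 MB/s:

    # minimum journal size = 2 * filestore_max_sync_interval * expected throughput
    #                      = 2 * 5 s * 100 MB/s = 1000 MB
    osd_journal_size = 10000    # the recommended 10 GB leaves ample headroom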

  5. Set the number of copies to store (default is 3) and the default minimum required write data when in a degraded state (default is 2) under the [global] section of your Ceph configuration file. We recommend the default values for production clusters.

    osd_pool_default_size = 3
    osd_pool_default_min_size = 2

    For a quick start, you may wish to set osd_pool_default_size to 2 and osd_pool_default_min_size to 1 so that you can achieve an active+clean state with only two OSDs.

    These settings establish the networking bandwidth requirements for the cluster network, and the ability to write data with eventual consistency (i.e., you can write data to a cluster in a degraded state if it has min_size copies of the data already).
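
    For example, the quick start values described above would appear under the [global] section as:

    osd_pool_default_size = 2        # quick start only; use 3 for production
    osd_pool_default_min_size = 1    # quick start only; use 2 for production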

  6. Set the maximum number of placement groups per OSD. The Ceph Storage Cluster has a default maximum value of 300 placement groups per OSD. You can set a different maximum value in your Ceph configuration file, where n is the maximum number of PGs per OSD:

    mon_pg_warn_max_per_osd = n

    Multiple pools can use the same CRUSH ruleset. When an OSD has too many placement groups associated with it, Ceph performance may degrade due to resource usage and load. This setting only generates a warning; you may adjust it to suit your needs and the capabilities of your hardware.
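
    As a hypothetical illustration, a cluster of 10 OSDs storing 3 replicas and using the default limit of 300 placement groups per OSD begins to warn once its pools collectively exceed 1000 placement groups:

    # 10 OSDs * 300 PGs/OSD = 3000 PG copies cluster-wide
    # 3000 PG copies / 3 replicas = 1000 PGs before the warning triggers
    mon_pg_warn_max_per_osd = 300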

  7. Set a CRUSH leaf type to the largest serviceable failure domain for your replicas under the [global] section of your Ceph configuration file. The default value is 1, or host, which means that CRUSH will map replicas to OSDs on separate hosts. For example, if you want to make three object replicas and you have three racks of chassis/hosts, you can set osd_crush_chooseleaf_type to 3, and CRUSH will place each copy of an object on OSDs in different racks:

    osd_crush_chooseleaf_type = 3

    The default CRUSH hierarchy types are:

    • type 0 osd
    • type 1 host
    • type 2 chassis
    • type 3 rack
    • type 4 row
    • type 5 pdu
    • type 6 pod
    • type 7 room
    • type 8 datacenter
    • type 9 region
    • type 10 root
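
    For example, to keep the default host-level separation or to choose rack-level separation as described above (pick one):

    osd_crush_chooseleaf_type = 1    # host (default): each replica on a separate host
    osd_crush_chooseleaf_type = 3    # rack: each replica in a separate rack
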
  8. Set max_open_files so that Ceph sets the maximum number of open file descriptors at the OS level, which helps prevent Ceph OSD Daemons from running out of file descriptors.

    max_open_files = 131072

In summary, your initial Ceph configuration file should have at least the following settings with appropriate values assigned after the = sign:

[global]
fsid = <cluster-id>
mon_initial_members = <hostname>[, <hostname>]
mon_host = <ip-address>[, <ip-address>]
public_network = <network>[, <network>]
cluster_network = <network>[, <network>]
ms_bind_ipv6 = [true | false]
max_open_files = 131072
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_journal_size = <n>
filestore_xattr_use_omap = true
osd_pool_default_size = <n>  # Write an object n times.
osd_pool_default_min_size = <n> # Allow writing n copies in a degraded state.
osd_crush_chooseleaf_type = <n>