Chapter 2. Storage Cluster Quick Start

This Quick Start sets up a Red Hat Ceph Storage cluster using ceph-deploy on your Calamari admin node, creating a small Ceph cluster so that you can explore Ceph functionality. As a first exercise, create a Ceph Storage Cluster with one Ceph Monitor and some Ceph OSD Daemons, each on separate nodes. Once the cluster reaches an active + clean state, it is ready for use.

2.1. Executing ceph-deploy

When executing ceph-deploy to install Red Hat Ceph Storage, ceph-deploy retrieves Ceph packages from the /opt/calamari/ directory on the Calamari administration host. To do so, ceph-deploy needs to read the .cephdeploy.conf file created by the ice_setup utility. Therefore, ensure that you execute ceph-deploy from the local working directory created in the Create a Working Directory section, for example ~/ceph-config/:

cd ~/ceph-config
Important

Execute ceph-deploy commands as a regular user, not as root or by using sudo. The Create a Ceph Deploy User and Enable Password-less SSH steps enable ceph-deploy to execute as root without sudo and without connecting to Ceph nodes as the root user. You might still need to execute ceph CLI commands as root or by using sudo.

2.2. Create a Cluster

If at any point you run into trouble and you want to start over, execute the following to purge the configuration:

ceph-deploy purge <ceph-node> [<ceph-node>]
ceph-deploy purgedata <ceph-node> [<ceph-node>]
ceph-deploy forgetkeys

If you execute the foregoing procedure, you must re-install Ceph.

On your Calamari admin node, from the directory you created for holding your configuration details, perform the following steps using ceph-deploy.

  1. Create the cluster:

    ceph-deploy new <initial-monitor-node(s)>

    For example:

    ceph-deploy new node1

    Check the output of ceph-deploy with ls and cat in the current directory. You should see a Ceph configuration file, a monitor secret keyring, and a log file of the ceph-deploy procedures.
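
    For example, for a cluster with the default name ceph, you would typically see something similar to the following (the exact file names can vary between ceph-deploy versions):

    ls
    ceph.conf  ceph.mon.keyring  ceph-deploy-ceph.log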

2.3. Modify the Ceph Configuration File

At this stage, you may begin editing your Ceph configuration file (ceph.conf).

Note

If you choose not to use ceph-deploy, you will have to deploy Ceph manually or configure a deployment tool (e.g., Chef, Juju, or Puppet) to perform each operation that ceph-deploy performs for you. To deploy Ceph manually, please see our Knowledgebase article.

  1. Add the public_network and cluster_network settings under the [global] section of your Ceph configuration file.

    public_network = <ip-address>/<netmask>
    cluster_network = <ip-address>/<netmask>

    These settings distinguish which network is public (front-side) and which network is for the cluster (back-side). Ensure that your nodes have interfaces configured for these networks. We do not recommend using the same NIC for the public and cluster networks. Please see the Network Configuration Settings for details on the public and cluster networks.

  2. Turn on IPv6 if you intend to use it.

    ms_bind_ipv6 = true

    Please see Bind for more details.

  3. Add or adjust the osd journal size setting under the [global] section of your Ceph configuration file.

    osd_journal_size = 10000

    We recommend a general setting of 10 GB. Ceph’s default osd_journal_size is 0, so you will need to set this in your ceph.conf file. The journal size should be at least twice the product of the filestore_max_sync_interval option and the expected throughput. The expected throughput should account for both the disk throughput (i.e., the sustained data transfer rate) and the network throughput; for example, a 7200 RPM disk will likely deliver approximately 100 MB/s. Taking the min() of the disk and network throughput should provide a reasonable expected throughput (see the worked example after this list). Please see Journal Settings for more details.

  4. Set the number of copies to store (default is 3) and the default minimum required to write data when in a degraded state (default is 2) under the [global] section of your Ceph configuration file. We recommend the default values for production clusters.

    osd_pool_default_size = 3
    osd_pool_default_min_size = 2

    For a quick start, you may wish to set osd_pool_default_size to 2 and osd_pool_default_min_size to 1 so that you can achieve an active+clean state with only two OSDs.

    These settings establish the networking bandwidth requirements for the cluster network, and the ability to write data with eventual consistency (i.e., you can write data to a cluster in a degraded state if it has min_size copies of the data already). Please see Settings for more details.

  5. Set a CRUSH leaf type to the largest serviceable failure domain for your replicas under the [global] section of your Ceph configuration file. The default value is 1, or host, which means that CRUSH will map replicas to OSDs on separate hosts. For example, if you want to make three object replicas and you have three racks of chassis/hosts, you can set osd_crush_chooseleaf_type to 3, and CRUSH will place each copy of an object on OSDs in different racks.

    osd_crush_chooseleaf_type = 3

    The default CRUSH hierarchy types are:

    • type 0 osd
    • type 1 host
    • type 2 chassis
    • type 3 rack
    • type 4 row
    • type 5 pdu
    • type 6 pod
    • type 7 room
    • type 8 datacenter
    • type 9 region
    • type 10 root

      Please see Settings for more details.

  6. Set max_open_files so that Ceph will set the maximum open file descriptors at the OS level to help prevent Ceph OSD Daemons from running out of file descriptors.

    max_open_files = 131072

    Please see the General Configuration Reference for more details.
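
To illustrate the journal sizing rule from step 3, consider a hypothetical calculation that assumes a sustained expected throughput of 100 MB/s (the min() of disk and network throughput) and the default filestore_max_sync_interval of 5 seconds:

# minimum journal size = 2 x expected throughput x filestore_max_sync_interval
#                      = 2 x 100 MB/s x 5 s = 1000 MB
osd_journal_size = 1000

The recommended setting of 10000 (10 GB) comfortably exceeds this minimum and leaves headroom for faster disks or longer sync intervals.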

In summary, your initial Ceph configuration file should have at least the following settings with appropriate values assigned after the = sign:

[global]
fsid = <cluster-id>
mon_initial_members = <hostname>[, <hostname>]
mon_host = <ip-address>[, <ip-address>]
public_network = <network>[, <network>]
cluster_network = <network>[, <network>]
ms_bind_ipv6 = [true | false]
max_open_files = 131072
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_journal_size = <n>
filestore_xattr_use_omap = true
osd_pool_default_size = <n>  # Write an object n times.
osd_pool_default_min_size = <n> # Allow writing n copies in a degraded state.
osd_crush_chooseleaf_type = <n>
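
As an illustration only, a filled-in configuration for the small exemplary cluster in this Quick Start might look similar to the following. The fsid, hostnames, and networks shown here are hypothetical; ceph-deploy generates the fsid, mon_initial_members, and mon_host values for you:

[global]
fsid = a7f64266-0894-4f1e-a635-d0aeaca0e993
mon_initial_members = node1
mon_host = 192.168.0.1
public_network = 192.168.0.0/24
cluster_network = 10.0.0.0/24
ms_bind_ipv6 = false
max_open_files = 131072
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_journal_size = 10000
filestore_xattr_use_omap = true
osd_pool_default_size = 3
osd_pool_default_min_size = 2
osd_crush_chooseleaf_type = 3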

2.4. Install Ceph with the ISO

To install Ceph from a local repository, use the --repo argument first to ensure that ceph-deploy points to the .cephdeploy.conf file generated by ice_setup (e.g., in the exemplary ~/ceph-config directory, the /root directory, or ~). Otherwise, you may not receive packages from the local repository. Use --release=<daemon-name> to specify which daemon package you wish to install, and then install the packages. Ideally, run ceph-deploy from the directory where you keep your configuration (e.g., the exemplary ~/ceph-config) so that you can maintain a {cluster-name}.log file with all the commands you have executed with ceph-deploy.

ceph-deploy install --repo --release=[ceph-mon|ceph-osd] <ceph-node> [<ceph-node> ...]
ceph-deploy install --<daemon> <ceph-node> [<ceph-node> ...]

For example:

ceph-deploy install --repo --release=ceph-mon monitor1 monitor2 monitor3
ceph-deploy install --mon monitor1 monitor2 monitor3
ceph-deploy install --repo --release=ceph-osd srv1 srv2 srv3
ceph-deploy install --osd srv1 srv2 srv3

The ceph-deploy utility will install the appropriate Ceph daemon on each node.

Note

If you use ceph-deploy purge, you must re-execute this step to re-install Ceph.

2.5. Install Ceph by Using CDN

When installing Ceph on remote nodes from the CDN (not ISO), you must specify which Ceph daemon you wish to install on the node by passing one of --mon or --osd to ceph-deploy.

ceph-deploy install [--mon|--osd] <ceph-node> [<ceph-node> ...]

For example:

ceph-deploy install --mon monitor1 monitor2 monitor3
ceph-deploy install --osd srv1 srv2 srv3
Note

If you use ceph-deploy purge, you must re-execute this step to re-install Ceph.

2.6. Install ceph-selinux

With Red Hat Ceph Storage 1.3.2 or later, a new ceph-selinux package can be installed on Ceph nodes. This package provides SELinux support for Ceph, so SELinux no longer needs to run in permissive or disabled mode.

Once installed, ceph-selinux adds the SELinux policy for Ceph and also relabels files on the cluster accordingly. Ceph processes are labeled with the ceph_exec_t SELinux context.

To install ceph-selinux, use the following command:

ceph-deploy pkg --install ceph-selinux <nodes>

For example:

ceph-deploy pkg --install ceph-selinux node1 node2 node3
Note

All Ceph daemons will be down while the ceph-selinux package is being installed. Therefore, your cluster will not be able to serve any data at this point. This operation is necessary to update the metadata of the files located on the underlying file system and to make Ceph daemons run with the correct context. It may take several minutes depending on the size and speed of the underlying storage.

If SELinux was in permissive mode, run the following command as root to set it to enforcing mode again:

# setenforce 1

To configure SELinux persistently, modify the /etc/selinux/config configuration file.
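
For example, to keep SELinux in enforcing mode across reboots, ensure the following lines are set in /etc/selinux/config:

SELINUX=enforcing
SELINUXTYPE=targeted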

For more information about SELinux, see the SELinux User’s and Administrator’s Guide for Red Hat Enterprise Linux 7.

2.7. Add Initial Monitors

Add the initial monitor(s) and gather the keys.

ceph-deploy mon create-initial

Once you complete the process, your local directory should have the following keyrings:

  • <cluster-name>.client.admin.keyring
  • <cluster-name>.bootstrap-osd.keyring
  • <cluster-name>.bootstrap-mds.keyring
  • <cluster-name>.bootstrap-rgw.keyring

2.8. Connect Monitor Hosts to Calamari

Once you have added the initial monitor(s), you need to connect the monitor hosts to Calamari. From your admin node, execute:

ceph-deploy calamari connect --master '<FQDN for the Calamari admin node>' <ceph-node>[<ceph-node> ...]

For example, using the exemplary node1 from above, you would execute:

ceph-deploy calamari connect --master '<FQDN for the Calamari admin node>' node1
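
If, hypothetically, the Calamari admin node's FQDN were admin.example.com, the command would be:

ceph-deploy calamari connect --master 'admin.example.com' node1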

If you expand your monitor cluster with additional monitors, you will have to connect the hosts that contain them to Calamari, too.

2.9. Make your Calamari Admin Node a Ceph Admin Node

After you create your initial monitors, you can use the Ceph CLI to check on your cluster. However, you would have to specify the monitor and admin keyring each time, along with the path to the directory holding your configuration. You can simplify your CLI usage by making the admin node a Ceph admin client.

Note

You will also need to install ceph-common on the Calamari node. ceph-deploy install --cli does this.

ceph-deploy install --cli <node-name>
ceph-deploy admin <node-name>

For example:

ceph-deploy install --cli admin-node
ceph-deploy admin admin-node

The ceph-deploy utility will copy the ceph.conf and ceph.client.admin.keyring files to the /etc/ceph directory. When ceph-deploy is talking to the local admin host (admin-node), it must be reachable by its hostname (e.g., hostname -s). If necessary, modify /etc/hosts to add the name of the admin host. If you do not have an /etc/ceph directory, you should install ceph-common.

You may then use the Ceph CLI.

Once you have added your new Ceph monitors, Ceph will begin synchronizing the monitors and form a quorum. You can check the quorum status by executing the following as root:

# ceph quorum_status --format json-pretty
Note

Your cluster will not achieve an active + clean state until you add enough OSDs to store the configured number of object replicas, spread across the required CRUSH failure domains.

2.10. Adjust CRUSH Tunables

Red Hat Ceph Storage CRUSH tunables default to bobtail, which refers to an older release of Ceph. This setting guarantees that older Ceph clusters are compatible with older Linux kernels. However, if you run your Ceph cluster on Red Hat Enterprise Linux 7, reset the CRUSH tunables to optimal. As root, execute the following:

# ceph osd crush tunables optimal

See the CRUSH Tunables chapter in the Storage Strategies guides for details on the CRUSH tunables.

2.11. Add OSDs

Before creating OSDs, consider the following:

  • We recommend using the XFS file system, which is the default file system.
Warning

Use the default XFS file system options that the ceph-deploy utility uses to format the OSD disks. Deviating from the default values can cause stability problems with the storage cluster.

For example, setting the directory block size higher than the default value of 4096 bytes can cause memory allocation deadlock errors in the file system. For more details, view the Red Hat Knowledgebase article regarding these errors.

  • Red Hat recommends using SSDs for journals. It is common to partition SSDs to serve multiple OSDs. Ensure that the number of SSD partitions does not exceed the SSD’s sequential write limits. Also, ensure that SSD partitions are properly aligned, or their write performance will suffer.
  • Red Hat recommends deleting the partition table of a Ceph OSD drive by using the ceph-deploy disk zap command before executing the ceph-deploy osd prepare command:

    ceph-deploy disk zap <ceph_node>:<disk_device>

    For example:

    ceph-deploy disk zap node2:/dev/sdb

From your administration node, use ceph-deploy osd prepare to prepare the OSDs:

ceph-deploy osd prepare <ceph_node>:<disk_device> [<ceph_node>:<disk_device>]

For example:

ceph-deploy osd prepare node2:/dev/sdb

The prepare command creates two partitions on a disk device; one partition is for OSD data, and the other is for the journal.
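
If you want to place the journal on a separate SSD partition, as recommended above, you can pass the journal device as an optional third field to the prepare command. For example, using the same hypothetical sdd data disk and ssdb1 SSD journal partition that appear in the activation example later in this section:

ceph-deploy osd prepare node2:sdd:ssdb1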

Once you prepare OSDs, activate the OSDs:

ceph-deploy osd activate <ceph_node>:<data_partition>

For example:

ceph-deploy osd activate node2:/dev/sdb1
Note

In the ceph-deploy osd activate command, specify a particular disk partition, for example /dev/sdb1.

It is also possible to use a disk device that is used whole, without a partition table. In that case, a partition on an additional disk must be used to serve as the journal store:

ceph-deploy osd activate <ceph_node>:<disk_device>:<journal_partition>

In the following example, sdd is a spinning hard drive that Ceph uses entirely for OSD data. ssdb1 is a partition of an SSD drive, which Ceph uses to store the journal for the OSD:

ceph-deploy osd activate node{2,3,4}:sdd:ssdb1

To achieve the active + clean state, you must add as many OSDs as the osd_pool_default_size = <n> parameter specifies in the Ceph configuration file.

For information on creating encrypted OSD nodes, see the Encrypted OSDs subsection in the Adding OSDs by Using ceph-deploy section in the Administration Guide for Red Hat Ceph Storage 2.

2.12. Connect OSD Hosts to Calamari

Once you have added the initial OSDs, you need to connect the OSD hosts to Calamari.

ceph-deploy calamari connect --master '<FQDN for the Calamari admin node>' <ceph-node>[<ceph-node> ...]

For example, using the exemplary node2, node3 and node4 from above, you would execute:

ceph-deploy calamari connect --master '<FQDN for the Calamari admin node>' node2 node3 node4

As you expand your cluster with additional OSD hosts, you will have to connect the hosts that contain them to Calamari, too.

2.13. Create a CRUSH Hierarchy

You can run a Ceph cluster with a flat node-level hierarchy (default). This is NOT RECOMMENDED. We recommend adding named buckets of various types to your default CRUSH hierarchy. This will allow you to establish a larger-grained failure domain, usually consisting of racks, rows, rooms and data centers.

ceph osd crush add-bucket <bucket-name> <bucket-type>

For example:

ceph osd crush add-bucket dc1 datacenter
ceph osd crush add-bucket room1 room
ceph osd crush add-bucket row1 row
ceph osd crush add-bucket rack1 rack
ceph osd crush add-bucket rack2 rack
ceph osd crush add-bucket rack3 rack

Then, place the buckets into a hierarchy:

ceph osd crush move dc1 root=default
ceph osd crush move room1 datacenter=dc1
ceph osd crush move row1 room=room1
ceph osd crush move rack1 row=row1
ceph osd crush move node2 rack=rack1
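
To continue the example so that it matches the three-rack layout used in the next section (a hypothetical layout; adjust it to your physical topology), you would also place the remaining rack buckets and OSD hosts:

ceph osd crush move rack2 row=row1
ceph osd crush move rack3 row=row1
ceph osd crush move node3 rack=rack2
ceph osd crush move node4 rack=rack3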

2.14. Add OSD Hosts/Chassis to the CRUSH Hierarchy

Once you have added OSDs and created a CRUSH hierarchy, add the OSD hosts/chassis to the CRUSH hierarchy so that CRUSH can distribute objects across failure domains. For example:

ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=row1 rack=rack1 host=node2
ceph osd crush set osd.1 1.0 root=default datacenter=dc1 room=room1 row=row1 rack=rack2 host=node3
ceph osd crush set osd.2 1.0 root=default datacenter=dc1 room=room1 row=row1 rack=rack3 host=node4

The foregoing example uses three different racks for the exemplary hosts (assuming that is how they are physically configured). Since the exemplary Ceph configuration file specified "rack" as the largest failure domain by setting osd_crush_chooseleaf_type = 3, CRUSH can write each object replica to an OSD residing in a different rack. Assuming osd_pool_default_min_size = 2, this means (assuming sufficient storage capacity) that the Ceph cluster can continue operating if an entire rack were to fail (e.g., failure of a power distribution unit or rack router).

2.15. Check CRUSH Hierarchy

Check your work to ensure that the CRUSH hierarchy is accurate.

ceph osd tree

If you are not satisfied with the results of your CRUSH hierarchy, you may move any component of your hierarchy with the move command.

ceph osd crush move <bucket-to-move> <bucket-type>=<parent-bucket>

If you want to remove a bucket (node) or OSD (leaf) from the CRUSH hierarchy, use the remove command:

ceph osd crush remove <bucket-name>

2.16. Check Cluster Health

To ensure that the OSDs in your cluster are peering properly, execute:

ceph health
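
Once enough OSDs are up and in and all placement groups reach the active + clean state, the command returns:

HEALTH_OK

Until then, the command typically reports HEALTH_WARN along with details about placement groups that are still being created or peered.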

You may also check on the health of your cluster using the Calamari dashboard.

2.17. List and Create a Pool

You can manage pools using Calamari, or using the Ceph command line. Verify that you have pools for writing and reading data:

ceph osd lspools

You can bind to any of the pools listed using the admin user and client.admin key. To create a pool, use the following syntax:

ceph osd pool create <pool-name> <pg-num> [<pgp-num>] [replicated] [crush-ruleset-name]

For example:

ceph osd pool create mypool 512 512 replicated replicated_ruleset
Note

To find the rule set names available, execute ceph osd crush rule list. To calculate the pg-num and pgp-num see Ceph Placement Groups (PGs) per Pool Calculator.
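
As a rough, hypothetical illustration of what the calculator does: a commonly cited guideline is approximately (100 x number of OSDs) / replica count, rounded up to the nearest power of two. For example, with 3 OSDs and osd_pool_default_size = 3, that gives (100 x 3) / 3 = 100, which rounds up to a pg-num of 128. Use the calculator for authoritative values, particularly when several pools share the same OSDs.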

2.18. Storing and Retrieving Object Data

To perform storage operations with the Ceph Storage Cluster, all Ceph clients, regardless of type, must:

  1. Connect to the cluster.
  2. Create an I/O context to a pool.
  3. Set an object name.
  4. Execute a read or write operation for the object.

The Ceph client retrieves the latest cluster map, and the CRUSH algorithm calculates how to map the object to a placement group and then how to assign the placement group to a Ceph OSD Daemon dynamically. Client types such as the Ceph Block Device and the Ceph Object Gateway perform the last two steps transparently.

To find the object location, all you need is the object name and the pool name. For example:

ceph osd map <poolname> <object-name>
Note

The rados CLI tool in the following example is for Ceph administrators only.

Exercise: Locate an Object

As an exercise, let’s create an object. Specify an object name, a path to a test file containing some object data, and a pool name using the rados put command on the command line. For example:

echo <Test-data> > testfile.txt
rados put <object-name> <file-path> --pool=<pool-name>
rados put test-object-1 testfile.txt --pool=data

To verify that the Ceph Storage Cluster stored the object, execute the following:

rados -p data ls

Now, identify the object location:

ceph osd map <pool-name> <object-name>
ceph osd map data test-object-1

Ceph should output the object’s location. For example:

osdmap e537 pool 'data' (0) object 'test-object-1' -> pg 0.d1743484 (0.4) -> up [1,0] acting [1,0]
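
In this illustrative output, e537 is the OSD map epoch, 'data' (0) is the pool name and pool ID, pg 0.d1743484 (0.4) shows the object’s hash within the pool and the placement group (0.4) it maps to, and up [1,0] and acting [1,0] list the OSDs serving that placement group, with osd.1 acting as the primary.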

To remove the test object, simply delete it using the rados rm command. For example:

rados rm test-object-1 --pool=data

As the cluster size changes, the object location may change dynamically. One benefit of Ceph’s dynamic rebalancing is that Ceph relieves you from having to perform the migration manually.