Chapter 11. OSD BlueStore (Technology Preview)

OSD BlueStore is a new back end for the OSD daemons. Unlike the currently used FileStore back end, BlueStore stores objects directly on the Ceph block devices without any file system interface.

Important

BlueStore is provided as a Technology Preview only in Red Hat Ceph Storage 2. Technology Preview features are not supported with Red Hat production service level agreements (SLAs), might not be functionally complete, and Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

See the support scope for Red Hat Technology Preview features for more details.

Also, note that it will not be possible to preserve data when updating BlueStore OSD nodes to future versions of Red Hat Ceph Storage, because the on-disk data format is undergoing rapid development. In this release, BlueStore is provided mainly to benchmark BlueStore OSDs and Red Hat does not recommend storing any important data on OSD nodes with the BlueStore back end.

BlueStore is generally available and ready for production use as of Red Hat Ceph Storage 3.2. In addition, BlueStore is the default back end for newly installed clusters that use Red Hat Ceph Storage 3.2 or later versions. For details, see the BlueStore chapter in the Red Hat Ceph Storage 3 Administration Guide.

BlueStore stores the OSD metadata in the RocksDB key-value database that contains:

  • object metadata
  • write-ahead log (WAL)
  • Ceph omap data
  • allocator metadata

BlueStore includes the following features and enhancements:

No large double-writes
BlueStore first writes any new data to unallocated space on a block device, and then commits a RocksDB transaction that updates the object metadata to reference the new region of the disk. Only when the write operation is below a configurable size threshold does it fall back to a write-ahead journaling scheme, similar to what is used now.
Multi-device support

BlueStore can use multiple block devices for storing different data, for example:

  • Hard Disk Drive (HDD) for the data
  • Solid-state Drive (SSD) for metadata
  • Non-volatile Memory (NVM), Non-volatile random-access memory (NVRAM), or persistent memory for the RocksDB write-ahead log (WAL)

Note

The ceph-disk utility does not yet provision multiple devices. To use multiple devices, OSDs must be set up manually.

Efficient block device usage
Because BlueStore does not use any file system, it minimizes the need to clear the storage device cache.
Flexible allocator
The block allocation policy is pluggable, allowing BlueStore to implement different policies for different types of storage devices. For example, hard disks and SSDs are handled differently.
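
As a rough illustration of the write path described under "No large double-writes", the following Python sketch models the two cases. It is a toy model, not BlueStore code: the ToyStore class, its fields, and the 64 KiB SMALL_WRITE_THRESHOLD value are assumptions made for this example. The real threshold is a configurable setting.

```python
from dataclasses import dataclass, field

# Hypothetical value chosen for illustration only; the real threshold is configurable.
SMALL_WRITE_THRESHOLD = 64 * 1024


@dataclass
class ToyStore:
    """Toy model of the two write paths; not the actual BlueStore implementation."""
    device: dict = field(default_factory=dict)    # offset -> data on the block device
    metadata: dict = field(default_factory=dict)  # object name -> (offset, length); stands in for RocksDB
    wal: list = field(default_factory=list)       # journaled small writes
    next_free: int = 0                            # start of unallocated space
    bytes_written: int = 0                        # total bytes written, including journal copies

    def write(self, name: str, data: bytes) -> str:
        if len(data) < SMALL_WRITE_THRESHOLD:
            # Small write: journal the data first, so it is written twice in total.
            self.wal.append((name, data))
            self.bytes_written += len(data)
            path = "wal"
        else:
            # Large write: no journal copy, the data goes straight to unallocated space.
            path = "direct"
        offset = self.next_free
        self.device[offset] = data
        self.next_free += len(data)
        self.bytes_written += len(data)
        # Committing the metadata makes the object reference the new region.
        self.metadata[name] = (offset, len(data))
        return path
```

In this model, a write of SMALL_WRITE_THRESHOLD bytes or more increases bytes_written by exactly its own length, while a smaller write counts twice because of the journal copy.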

Adding a new Ceph OSD node with the BlueStore back end

To install a new Ceph OSD node with the BlueStore back end by using the Ansible automation application:

  1. Add a new OSD node to the /etc/ansible/hosts file under the [osds] section, for example:

    [osds]
    <osd_host_name>

    For details, see Before You Start….

  2. Append the following settings to the group_vars/all file:

    osd_objectstore: bluestore
    ceph_conf_overrides:
      global:
        enable experimental unrecoverable data corrupting features: 'bluestore rocksdb'

  3. Add the following setting to the group_vars/osds file:

    bluestore: true
  4. Run the ansible-playbook command:

    ansible-playbook site.yml

  5. Verify the status of the Ceph cluster. The output will include the following warning message:

    ceph -s
    
    2016-03-25 13:03:31.846668 7f313ad2b700 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
    2016-03-25 13:03:31.855052 7f313ad2b700 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
       cluster 179c40e3-8b3e-4ed0-9153-fefd638349a2
        health HEALTH_OK
        monmap e1: 1 mons at {rbd-mirroring-b4dae55c-34e3-4eb6-a84d-1b621af31c75=192.168.0.44:6789/0}
               election epoch 3, quorum 0 rbd-mirroring-b4dae55c-34e3-4eb6-a84d-1b621af31c75
        osdmap e9: 2 osds: 2 up, 2 in
               flags sortbitwise
        pgmap v13: 64 pgs, 1 pools, 0 bytes data, 0 objects
               2052 MB used, 38705 MB / 40757 MB avail
                     64 active+clean
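
When the verification step is repeated across several clusters, the check can be scripted. The following Python sketch parses status text in the format shown above. The summarize_status function is a hypothetical helper, not part of Ceph, and its regular expressions assume the output layout of this release, which might change in later versions.

```python
import re


def summarize_status(text: str) -> dict:
    """Extract health, OSD counts, and experimental-feature warnings from `ceph -s` text."""
    features = set()
    # Each warning line lists the enabled features as a comma-separated string.
    for match in re.finditer(r"experimental features are enabled:\s*([\w,]+)", text):
        features.update(name for name in match.group(1).split(",") if name)

    health = re.search(r"^\s*health\s+(\S+)", text, re.MULTILINE)
    osds = re.search(r"(\d+) osds: (\d+) up, (\d+) in", text)

    return {
        "health": health.group(1) if health else None,
        "experimental_features": sorted(features),
        "osds": (
            {"total": int(osds.group(1)), "up": int(osds.group(2)), "in": int(osds.group(3))}
            if osds
            else None
        ),
    }
```

For the output shown in step 5, the helper reports HEALTH_OK, the bluestore and rocksdb features, and two OSDs that are both up and in. When feeding it live output, capture both the standard output and the standard error stream of ceph -s, because the warning lines might not be written to standard output.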