Chapter 6. Data Retention

Red Hat Ceph Storage stores user data, but usually in an indirect manner. Customer data retention may involve other applications such as the Red Hat OpenStack Platform.

6.1. Ceph Storage Cluster

The Ceph Storage Cluster—often referred to as the Reliable Autonomic Distributed Object Store or RADOS—stores data as objects within pools. In most cases, these objects are the atomic units representing client data such as Ceph Block Device images, Ceph Object Gateway objects, or Ceph Filesystem files. However, custom applications built on top of librados may bind to a pool and store data too.

Cephx controls access to the pools storing object data. However, Ceph Storage Cluster users are typically Ceph clients, and not end users. Consequently, end users generally DO NOT have the ability to write, read or delete objects directly in a Ceph Storage Cluster pool.
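
For illustration only, the following is a minimal python-rados sketch of how a librados-based application, acting as a cephx-authenticated Ceph client rather than an end user, might bind to a pool and create, read, and delete objects directly. The pool name app-pool and the cephx user appuser are assumed placeholders, not names taken from this guide.

    # Minimal sketch: a Ceph client (not an end user) binds to a pool and
    # performs object I/O. "app-pool" and "appuser" are illustrative assumptions.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf', rados_id='appuser')
    cluster.connect()

    ioctx = cluster.open_ioctx('app-pool')       # bind to the pool
    ioctx.write_full('object-key', b'payload')   # create or overwrite an object
    print(ioctx.read('object-key'))              # read the object back
    ioctx.remove_object('object-key')            # delete the object

    ioctx.close()
    cluster.shutdown()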

6.2. Ceph Block Device

The most popular use of Red Hat Ceph Storage is the Ceph Block Device interface, also referred to as RADOS Block Device or RBD. It creates virtual volumes and images, including the virtual disks backing compute instances, and stores them as a series of objects within pools. Ceph assigns these objects to placement groups and distributes or places them pseudo-randomly in OSDs throughout the cluster.

Depending upon the application consuming the Ceph Block Device interface—usually Red Hat OpenStack Platform—end users may create, modify and delete volumes and images. Ceph handles the CRUD operations of each individual object.
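
As a hedged sketch of what these operations look like at the client library level, the python-rbd bindings can create and remove an image backed by Ceph Storage cluster objects. The pool name volumes and image name volume01 are assumed placeholders.

    # Sketch: create and delete an RBD image with the python-rbd bindings.
    # Pool "volumes" and image "volume01" are illustrative assumptions.
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('volumes')

    rbd_inst = rbd.RBD()
    rbd_inst.create(ioctx, 'volume01', 10 * 1024 ** 3)  # 10 GiB image, striped into RADOS objects
    rbd_inst.remove(ioctx, 'volume01')                  # destroys the image's backing objects

    ioctx.close()
    cluster.shutdown()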

Deleting volumes and images destroys the corresponding objects in an unrecoverable manner. However, residual data artifacts may continue to reside on storage media until overwritten. Data may also remain in backup archives.

6.3. Ceph Filesystem

The Ceph Filesystem interface creates virtual filesystems and stores them as a series of objects within pools. Ceph assigns these objects to placement groups and distributes or places them pseudo-randomly in OSDs throughout the cluster.

Typically, the Ceph Filesystem uses two pools:

  • Metadata: The metadata pool stores the data of the metadata server (mds), which generally consists of inodes; that is, the file ownership, permissions, creation date/time, last modified/accessed date/time, parent directory, etc.
  • Data: The data pool stores file data. Ceph may store a file as one or more objects, typically representing smaller chunks of file data such as extents.

Depending upon the application consuming the Ceph Filesystem interface—usually Red Hat OpenStack Platform—end users may create, modify and delete files in a Ceph filesystem. Ceph handles the CRUD operations of each individual object representing the file.
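
The following minimal sketch, assuming the python-cephfs (libcephfs) bindings are available, shows a client working at the file level while Ceph maps the file contents to objects in the data pool and its inode to the metadata pool. The file path is an assumed placeholder.

    # Sketch: file-level operations through the libcephfs Python bindings.
    # The file path is an illustrative assumption.
    import cephfs

    fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
    fs.mount()

    fd = fs.open(b'/retention-example.txt', 'w', 0o644)  # create the file
    fs.write(fd, b'example payload', 0)                  # write at offset 0
    fs.close(fd)

    fs.unlink(b'/retention-example.txt')                 # delete the file and its backing objects

    fs.shutdown()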

Deleting files destroys the corresponding objects in an unrecoverable manner. However, residual data artifacts may continue to reside on storage media until overwritten. Data may also remain in backup archives.

6.4. Ceph Object Gateway

From a data security and retention perspective, the Ceph Object Gateway interface has some important differences compared to the Ceph Block Device and Ceph Filesystem interfaces. Because the Ceph Object Gateway provides services directly to end users, it may store:

  • User Authentication Information: User authentication information generally consists of user IDs, user access keys and user secrets. It may also comprise a user’s name and email address if provided. Ceph Object Gateway will retain user authentication data unless the user is explicitly deleted from the system.
  • User Data: User data generally comprises user- or administrator-created buckets or containers, and the user-created S3 or Swift objects contained within them. The Ceph Object Gateway interface creates one or more Ceph Storage cluster objects for each S3 or Swift object and stores the corresponding Ceph Storage cluster objects within a data pool. Ceph assigns the Ceph Storage cluster objects to placement groups and distributes or places them pseudo-randomly in OSDs throughout the cluster. The Ceph Object Gateway may also store an index of the objects contained within a bucket or container to enable services such as listing the contents of an S3 bucket or Swift container. Additionally, when implementing multi-part uploads, the Ceph Object Gateway may temporarily store partial uploads of S3 or Swift objects.

    End users may create, modify and delete buckets or containers in a Ceph Object Gateway, as well as the objects contained within them; see the sketch after this list. Ceph handles the CRUD operations of each individual Ceph Storage cluster object representing the S3 or Swift object.

    Deleting S3 or Swift objects destroys the corresponding Ceph Storage cluster objects in an unrecoverable manner. However, residual data artifacts may continue to reside on storage media until overwritten. Data may also remain in backup archives.

  • Logging: The Ceph Object Gateway also stores logs of the operations that users intend to perform and the operations that have executed. This data provides traceability about who created, modified or deleted a bucket or container, or an S3 or Swift object residing in an S3 bucket or Swift container. When users delete their data, the logging information is not affected and will remain in storage until deleted by a system administrator or removed automatically by an expiration policy.
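
As referenced in the User Data item above, the following is a minimal sketch of end-user S3 operations against a Ceph Object Gateway using the boto3 client. The endpoint URL, credentials, bucket name, and object key are assumed placeholders, not values from this guide.

    # Sketch: an end user creating and deleting a bucket and an S3 object
    # through the Ceph Object Gateway. All names and credentials are
    # illustrative assumptions.
    import boto3

    s3 = boto3.client(
        's3',
        endpoint_url='http://rgw.example.com:8080',
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
    )

    s3.create_bucket(Bucket='retention-demo')
    s3.put_object(Bucket='retention-demo', Key='report.csv', Body=b'col1,col2\n1,2\n')
    s3.delete_object(Bucket='retention-demo', Key='report.csv')  # destroys the backing cluster objects
    s3.delete_bucket(Bucket='retention-demo')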

Bucket Lifecycle

Ceph Object Gateway also supports bucket lifecycle features, including object expiration. Data retention regulations, such as the General Data Protection Regulation, may require administrators to set object expiration policies and to disclose them to end users, among other compliance factors.
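
As a hedged example of such a policy, the following sketch sets an object expiration rule through the S3 bucket lifecycle API, which the Ceph Object Gateway supports. The endpoint, credentials, bucket name and the 365-day period are assumed placeholders; actual retention periods must follow the applicable regulations and organizational policy.

    # Sketch: setting an object expiration policy on a bucket via the S3
    # lifecycle API. The 365-day period is an illustrative assumption.
    import boto3

    s3 = boto3.client(
        's3',
        endpoint_url='http://rgw.example.com:8080',
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
    )

    s3.put_bucket_lifecycle_configuration(
        Bucket='retention-demo',
        LifecycleConfiguration={
            'Rules': [
                {
                    'ID': 'expire-after-one-year',
                    'Filter': {'Prefix': ''},     # apply to every object in the bucket
                    'Status': 'Enabled',
                    'Expiration': {'Days': 365},  # delete objects 365 days after creation
                }
            ]
        },
    )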

Multisite

Ceph Object Gateway is often deployed in a multisite context, whereby a user stores an object at one site and the Ceph Object Gateway creates a replica of the object in another cluster, possibly at another geographic location. For example, if a primary cluster fails, a secondary cluster may resume operations. In another example, a secondary cluster may be in a different geographic location, such as an edge network or content delivery network, so that a client may access the closest cluster to improve response time, throughput and other performance characteristics. In multisite scenarios, administrators must ensure that each site has implemented security measures. Additionally, if data is geographically distributed in a multisite scenario, administrators must be aware of any regulatory implications when the data crosses political boundaries.