Chapter 1. What is the Ceph File System (CephFS)?

The Ceph File System (CephFS) is a file system compatible with POSIX standards that uses a Ceph Storage Cluster to store its data. The Ceph File System uses the same Ceph Storage Cluster system as the Ceph Block Device, Ceph Object Gateway, or librados API.

Important

The Ceph File System is a Technology Preview only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs), might not be functionally complete, and Red Hat does not recommend to use them for production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information on Red Hat Technology Preview features support scope, see https://access.redhat.com/support/offerings/techpreview/.

In addition, see Section 1.2, “Limitations” for details on current CephFS limitations and experimental features.

diag 8b55be25ec97b381bf1b2ff6346b4d46

To run the Ceph File System, you must have a running Ceph Storage Cluster with at least one Ceph Metadata Server (MDS) running. For details on installing the Ceph Storage Cluster, see the Installation Guide for Red Hat Enterprise Linux or Installation Guide for Ubuntu. See Chapter 2, Installing and Configuring Ceph Metadata Servers (MDS) for details on installing the Ceph Metadata Server.

1.1. Features

The Ceph File System introduces the following features and enhancements:

Scalability
The Ceph File System is highly scalable because clients read directly from and write to all OSD nodes.
Shared File System
The Ceph File System is a shared file system so multiple clients can work on the same file system at once.
High Availability
The Ceph File System provides a cluster of Ceph Metadata Servers (MDS). One is active and others are in standby mode. If the active MDS terminates unexpectedly, one of the standby MDS becomes active. As a result, client mounts continue working through a server failure. This behavior makes the Ceph File System highly available.
File and Directory Layouts
The Ceph File System allows users to configure file and directory layouts to use multiple pools.
POSIX Access Control Lists (ACL)

The Ceph File System supports the POSIX Access Control Lists (ACL). ACL are enabled by default with the Ceph File Systems mounted as kernel clients with kernel version kernel-3.10.0-327.18.2.el7.

To use ACL with the Ceph File Systems mounted as FUSE clients, you must enabled them. See Section 1.2, “Limitations” for details.

Client Quotas

The Ceph File System FUSE client supports setting quotas on any directory in a system. The quota can restrict the number of bytes or the number of files stored beneath that point in the directory hierarchy.

To enable the client quotas, set the client quota option to true in the Ceph configuration file:

[client]

client quota = true

1.2. Limitations

The Ceph File System is provided as a Technical Preview and as such, there are several limitations:

Access Control Lists (ACL) support in FUSE clients

To use the ACL feature with the Ceph File System mounted as a FUSE client, you must enable it. To do so, add the following options to the Ceph configuration file:

[client]

fuse_default_permission=0
client_acl_type=posix_acl

Then restart the Ceph services.

Snapshots

Creating snapshots is not enabled by default because this feature is still experimental and it can cause the MDS or client nodes to terminate unexpectedly.

If you understand the risks and still wish to enable snapshots, use:

ceph mds set allow_new_snaps true --yes-i-really-mean-it
Multiple active MDS

By default, only configurations with one active MDS are supported. Having more active MDS can cause the Ceph File System to fail.

If you understand the risks and still wish to use multiple active MDS, increase the value of the max_mds option and set the allow_multimds option to true in the Ceph configuration file.

Multiple Ceph File Systems

By default, creation of multiple Ceph File Systems in one cluster is disabled. An attempt to create an additional Ceph File System fails with the following error:

Error EINVAL: Creation of multiple filesystems is disabled.

Creating multiple Ceph File Systems in one cluster is not fully supported yet and can cause the MDS or client nodes to terminate unexpectedly.

If you understand the risks and still wish to enable multiple Ceph file systems, use:

ceph fs flag set enable_multiple true  --yes-i-really-mean-it
FUSE clients cannot be mounted permanently on Red Hat Enterprise Linux 7.2
The util-linux package shipped with Red Hat Enterprise Linux 7.2 does not support mounting CephFS FUSE clients in /etc/fstab. Red Hat Enterprise Linux 7.3 includes a new version of util-linux that supports mounting CephFS FUSE clients permanently.
The kernel clients in Red Hat Enterprise Linux 7.3 do not support the pool_namespace layout setting
As a consequence, files written from FUSE clients with a namespace set might not be accessible from Red Hat Enterprise Linux 7.3 kernel clients. Attempts to read or set the ceph.file.layout.pool_namespace extended attribute fail with the "No such attribute" error.

1.3. Differences from POSIX Compliance

The Ceph File System aims to adhere to POSIX semantics wherever possible. For example, in contrast to many other common network file systems like NFS, CephFS maintains strong cache coherency across clients. The goal is for processes using the file system to behave the same when they are on different hosts as when they are on the same host.

However, there are a few places where CephFS diverges from strict POSIX semantics for various reasons:

  • If a client’s attempt to write a file fails, the write operations are not necessarily atomic. That is, the client might call the write() system call on a file opened with the O_SYNC flag with an 8MB buffer and then terminates unexpectedly and the write operation can be only partially applied. Almost all file systems, even local file systems, have this behavior.
  • In situations when the write operations occur simultaneously, a write operation that exceeds object boundaries is not necessarily atomic. For example, writer A writes "aa|aa" and writer B writes "bb|bb" simultaneously (where "|" is the object boundary) and "aa|bb" is written rather than the proper "aa|aa" or "bb|bb".
  • POSIX includes the telldir() and seekdir() system calls that allow you to obtain the current directory offset and seek back to it. Because CephFS can fragment directories at any time, it is difficult to return a stable integer offset for a directory. As such, calling the seekdir() system call to a non-zero offset might often work but is not guaranteed to do so. Calling seekdir() to offset 0 will always work. This is an equivalent to the rewinddir() system call.
  • Sparse files propagate incorrectly to the st_blocks field of the stat() system call. Because CephFS does not explicitly track which parts of a file are allocated or written, the st_blocks field is always populated by the file size divided by the block size. This behavior causes utilities, such as du, to overestimate consumed space.
  • When the mmap() system call maps a file into memory on multiple hosts, write operations are not coherently propagated to caches of other hosts. That is, if a page is cached on host A, and then updated on host B, host A page is not coherently invalidated.
  • CephFS clients present a hidden .snap directory that is used to access, create, delete, and rename snapshots. Although the this directory is excluded from the readdir() system call, any process that tries to create a file or directory with the same name returns an error. You can change the name of this hidden directory at mount time with the -o snapdirname=.<new_name> option or by using the client_snapdir configuration option.