Ceph - OSD becomes non-writable with the error "No space left on device" while still showing plenty of free space

Environment

  • Red Hat Ceph Storage

Issue

  • A Ceph OSD becomes non-writable with the error "No space left on device" even though the OSD still shows plenty of free space.
root@ceph-storage1:/var/lib/ceph/osd/ceph-6# df -h
Filesystem                        Size  Used Avail Use% Mounted on
/dev/sda2                         2.7T  313G  2.2T  13% /
udev                               32G   12K   32G   1% /dev
tmpfs                              13G  364K   13G   1% /run
none                              5.0M     0  5.0M   0% /run/lock
none                               32G     0   32G   0% /run/shm
/dev/sdd1                         2.8T  1.8T 1010G  64% /var/lib/ceph/osd/ceph-8
/dev/sdc1                         2.8T  1.7T  1.2T  60% /var/lib/ceph/osd/ceph-7
/dev/sde1                         3.7T  158G  3.5T   5% /var/lib/ceph/osd/ceph-15
/dev/sdf1                         3.7T  185G  3.5T   5% /var/lib/ceph/osd/ceph-16
/dev/sdg1                         7.3T  155G  7.2T   3% /var/lib/ceph/osd/ceph-17
/dev/sdb1                         2.8T  1.9T  937G  67% /var/lib/ceph/osd/ceph-6

Resolution

  • Ceph is a user-space application that accepts object reads and writes from clients and performs them as file operations on the XFS-formatted partition.
  • As such, the amount of data churn and fragmentation depends entirely on the usage patterns of your users and applications.
  • On filestore OSDs, Ceph does perform directory splitting and merging, but only at certain (configurable) thresholds, to prevent an excessively high file count in any given directory.
  • Ceph, being a user-space application, does nothing to minimize or correct fragmentation on the underlying XFS partition.
  • If your monitoring shows that an OSD data partition is fragmented to the point that ENOSPC is likely, you may defragment the data partition. The threshold for and timing of these checks are up to your comfort level, as Red Hat does not provide specific guidance on this.
  • On rare occasions with specific client use patterns, ENOSPC can be returned upon write attempts long before all actual raw space is consumed on a filestore partition. There are many situations which can lead to this.
  • For more information please refer to XFS: Error: No Space left on the device, even though plenty of space is available.
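
As a rough illustration of the monitoring and defragmentation mentioned above, the following shell sketch reads the file fragmentation factor reported by `xfs_db -r -c frag` and suggests an `xfs_fsr` run when it exceeds a limit. The helper names (`frag_factor`, `over_threshold`, `check_osd`) and the default limit of 50 are hypothetical and chosen only for this example; as noted above, Red Hat does not prescribe a threshold. The device and mount point in the usage comment are taken from the `df -h` output in the Issue section.

```shell
#!/bin/sh
# Sketch only: helper names and the default limit are assumptions,
# not part of Ceph or xfsprogs.

# frag_factor: read `xfs_db -c frag` output on stdin and print the
# fragmentation factor as a bare number (without the % sign).
frag_factor() {
    sed -n 's/.*fragmentation factor \([0-9.]*\)%.*/\1/p'
}

# over_threshold FACTOR LIMIT: succeed when FACTOR is greater than LIMIT.
# awk is used because plain sh cannot compare decimal numbers.
over_threshold() {
    awk -v f="$1" -v t="$2" 'BEGIN { exit !(f + 0 > t + 0) }'
}

# check_osd DEVICE MOUNTPOINT [LIMIT]
# Opens the device read-only (-r), reports the fragmentation factor,
# and prints a defragmentation hint when it is above LIMIT (default 50,
# an arbitrary value picked for this example).
check_osd() {
    dev=$1
    mnt=$2
    limit=${3:-50}
    factor=$(xfs_db -r -c frag "$dev" | frag_factor)
    echo "$dev: fragmentation factor ${factor}%"
    if over_threshold "$factor" "$limit"; then
        echo "consider defragmenting: xfs_fsr -v $mnt"
    fi
}

# Example (run as root on the OSD node):
#   check_osd /dev/sdb1 /var/lib/ceph/osd/ceph-6 50
```

The sketch only prints a hint rather than running `xfs_fsr` itself, since defragmentation adds I/O load on the OSD and the right time to run it depends on the cluster.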

Root Cause

  • In environments with high data churn, an OSD can return a "No space left on device" (ENOSPC) error while it still appears to have free space.
  • This can be due to fragmentation, as an XFS filesystem created on a Ceph OSD is just as susceptible to fragmentation as any other filesystem.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
