[Engineering Notes] I/O Limits: block sizes, alignment and I/O hints

Overview:


The Linux I/O stack has been enhanced to consume vendor-provided "I/O Limits" information that allows Linux tools (parted, lvm, mkfs.*, etc.) to optimize the placement of, and access to, data. I/O that is not properly aligned relative to the device's "I/O Limits" will result in reduced performance or, in the worst case, application failure (see: "Direct I/O best practices" in "Userspace access" below).

Not all storage devices export this "I/O Limits" information yet. Such "legacy" devices will work fine because the defaults of the various RHEL6 tools conservatively align all I/O on a 4K (or larger power of 2) boundary. Use of this "I/O Limits" information enables 4K sector devices to be fully supported for data volumes. Booting from 4K sector devices is planned but not yet supported. The kernel provides both block device ioctl and sysfs access to each device's various "I/O Limits".

I/O Limits


Certain 4K sector devices may use a 4K 'physical_block_size' internally but expose a finer-grained 512 byte 'logical_block_size' to Linux. This discrepancy introduces potential for misaligned I/O. Linux will attempt to start all data areas on a naturally aligned ('physical_block_size') boundary by making sure it accounts for any 'alignment_offset' if the beginning of the Linux block device is offset from the underlying physical alignment.

Storage vendors can also supply "I/O hints" about a device's preferred minimum unit for random I/O ('minimum_io_size') and streaming I/O ('optimal_io_size'). For example, these hints may correspond to a RAID device's chunk size and stripe size respectively.

Userspace access


Direct I/O best practices


Users must always take care to use properly aligned and sized I/O. This is especially important for Direct I/O access. Direct I/O should start on a 'logical_block_size' boundary and be sized in multiples of the 'logical_block_size'. With native 4K devices ('logical_block_size' is 4K) it is critical that applications perform Direct I/O that is a multiple of the device's 'logical_block_size'. Applications that perform 512-byte aligned I/O rather than 4K aligned I/O will break with native 4K devices. Applications may consult a device's "I/O Limits" to ensure they are using properly aligned and sized I/O. The "I/O Limits" are exposed through both sysfs and block device ioctl interfaces (also see: libblkid).
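The alignment rule above can be expressed as a small check. This is only a sketch: the function name and parameters are hypothetical, and it covers the offset and length rules stated in this section, nothing else.

```python
def is_direct_io_aligned(offset, length, logical_block_size):
    """Return True if a request starts on a logical block boundary and
    covers a whole number of logical blocks (all values in bytes)."""
    if logical_block_size <= 0:
        raise ValueError("logical_block_size must be positive")
    return (length > 0
            and offset % logical_block_size == 0
            and length % logical_block_size == 0)


# A 512-byte request is acceptable on a 512-byte device...
print(is_direct_io_aligned(offset=512, length=512, logical_block_size=512))    # True
# ...but the same request is misaligned on a native 4K device.
print(is_direct_io_aligned(offset=512, length=512, logical_block_size=4096))   # False
print(is_direct_io_aligned(offset=8192, length=4096, logical_block_size=4096)) # True
```

An application could run such a check against the 'logical_block_size' it reads from the device before issuing Direct I/O.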

sysfs interface


/sys/block/<disk>/alignment_offset
/sys/block/<disk>/<partition>/alignment_offset
/sys/block/<disk>/queue/physical_block_size
/sys/block/<disk>/queue/logical_block_size
/sys/block/<disk>/queue/minimum_io_size
/sys/block/<disk>/queue/optimal_io_size

The kernel will still export these sysfs attributes for "legacy" devices that do not provide "I/O Limits" information, for example:

alignment_offset:    0
physical_block_size: 512
logical_block_size:  512
minimum_io_size:     512
optimal_io_size:     0
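As an illustration, the sysfs attributes listed above could be read with a few lines of Python. This is a minimal sketch: the helper name and the 'sysfs_root' parameter (useful for testing against a copy of the tree) are hypothetical, and only the paths shown in this section are assumed to exist.

```python
from pathlib import Path

# Attributes that live under /sys/block/<disk>/queue/
QUEUE_ATTRS = ("physical_block_size", "logical_block_size",
               "minimum_io_size", "optimal_io_size")


def read_io_limits(disk, partition=None, sysfs_root="/sys/block"):
    """Return the "I/O Limits" of a disk (or one of its partitions) as a dict of ints."""
    disk_dir = Path(sysfs_root) / disk
    # alignment_offset exists per disk and per partition; the rest are per queue.
    align_dir = disk_dir / partition if partition else disk_dir
    limits = {"alignment_offset": int((align_dir / "alignment_offset").read_text())}
    for name in QUEUE_ATTRS:
        limits[name] = int((disk_dir / "queue" / name).read_text())
    return limits


if __name__ == "__main__":
    import sys
    # Usage: pass one or more disk names (for example "sda") on the command line.
    for disk in sys.argv[1:]:
        print(disk, read_io_limits(disk))
```

For a "legacy" device, the returned values would match the defaults shown above.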

block device ioctls


BLKALIGNOFF: alignment_offset
BLKPBSZGET: physical_block_size
BLKSSZGET: logical_block_size
BLKIOMIN: minimum_io_size
BLKIOOPT: optimal_io_size
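The same values can be queried through the ioctls listed above. The following is a hedged sketch in Python: the request numbers are the _IO(0x12, nr) values from <linux/fs.h> as encoded on x86 (other architectures may encode them differently), and the function name and injectable 'ioctl' parameter are hypothetical.

```python
import fcntl
import os
import struct

# Assumption: x86 encoding of _IO(0x12, nr) from <linux/fs.h>.
BLKSSZGET = 0x1268    # logical_block_size
BLKIOMIN = 0x1278     # minimum_io_size
BLKIOOPT = 0x1279     # optimal_io_size
BLKALIGNOFF = 0x127A  # alignment_offset
BLKPBSZGET = 0x127B   # physical_block_size

IOCTLS = {
    "alignment_offset": BLKALIGNOFF,
    "physical_block_size": BLKPBSZGET,
    "logical_block_size": BLKSSZGET,
    "minimum_io_size": BLKIOMIN,
    "optimal_io_size": BLKIOOPT,
}


def query_io_limits(fd, ioctl=fcntl.ioctl):
    """Query each limit on an open block device file descriptor."""
    limits = {}
    for name, request in IOCTLS.items():
        # Each ioctl fills in one 4-byte integer; pass a buffer and unpack the reply.
        raw = ioctl(fd, request, struct.pack("i", 0))
        limits[name] = struct.unpack("i", raw)[0]
    return limits


if __name__ == "__main__":
    import sys
    # Usage: pass block device paths (for example /dev/sda); requires read access.
    for path in sys.argv[1:]:
        fd = os.open(path, os.O_RDONLY)
        try:
            print(path, query_io_limits(fd))
        finally:
            os.close(fd)
```

C programs would normally use the symbolic names from <linux/fs.h> instead of hard-coded numbers.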

Standards


ATA


ATA devices must report appropriate information via the IDENTIFY DEVICE command. ATA devices only report "I/O Limits" for 'physical_block_size', 'logical_block_size' and 'alignment_offset'. The additional "I/O Hints" are outside the scope of the ATA Command Set.

SCSI


The kernel's "I/O Limits" support requires at least version 3 of the SCSI Primary Commands protocol (SPC-3). Linux will only send a READ CAPACITY(16) and "extended inquiry" (which gains access to the BLOCK LIMITS VPD page) to devices which claim conformance to SPC-3.

1) READ CAPACITY(16) provides the block sizes and alignment offset:
LOGICAL BLOCK LENGTH IN BYTES:
/sys/block/<disk>/queue/logical_block_size

LOGICAL BLOCKS PER PHYSICAL BLOCK EXPONENT is used to derive:
/sys/block/<disk>/queue/physical_block_size

LOWEST ALIGNED LOGICAL BLOCK ADDRESS:
/sys/block/<disk>/alignment_offset
/sys/block/<disk>/<partition>/alignment_offset

2) BLOCK LIMITS VPD provides the "I/O hints":
OPTIMAL TRANSFER LENGTH GRANULARITY and OPTIMAL TRANSFER LENGTH are used
to derive:
/sys/block/<disk>/queue/minimum_io_size
/sys/block/<disk>/queue/optimal_io_size
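The derivations above might be sketched as follows. This is an illustration, not the kernel's code: it assumes the BLOCK LIMITS fields and the lowest aligned LBA are expressed in logical blocks, it omits the sanity checks the kernel applies, and the function and parameter names are hypothetical.

```python
def limits_from_scsi(logical_block_length, phys_block_exponent, lowest_aligned_lba,
                     opt_xfer_granularity, opt_xfer_length):
    """Translate READ CAPACITY(16) and BLOCK LIMITS VPD fields into byte values."""
    lbs = logical_block_length
    return {
        "logical_block_size": lbs,
        # One physical block holds 2**exponent logical blocks.
        "physical_block_size": lbs << phys_block_exponent,
        "alignment_offset": lowest_aligned_lba * lbs,
        "minimum_io_size": opt_xfer_granularity * lbs,
        "optimal_io_size": opt_xfer_length * lbs,
    }


# A 512-byte logical / 4K physical device whose first aligned LBA is 7,
# with a 128-block granularity and a 512-block optimal transfer length.
print(limits_from_scsi(512, 3, 7, 128, 512))
```

With these example inputs the physical block size comes out as 4096 bytes and the alignment offset as 3584 bytes.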

The sg3_utils package provides the 'sg_inq' utility that can be used to access the BLOCK LIMITS VPD page (0xb0), using:

#  sg_inq -p 0xb0 <device>

Stacking I/O Limits


All layers of the Linux I/O stack have been engineered to propagate the various "I/O Limits" up the stack. When a layer consumes an attribute or aggregates many devices, it must expose appropriate "I/O Limits" so that upper-layer devices or tools will have an accurate view of the storage as it is transformed. Some practical examples are:

  • only one layer in the I/O stack should adjust for a non-zero 'alignment_offset'; once a layer adjusts for it, that layer will export a device with an 'alignment_offset' of zero
  • a striped Device Mapper (DM) device, created with LVM, must export a 'minimum_io_size' and 'optimal_io_size' relative to the stripe count (number of disks) and user-provided chunk size
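For the striped case, a plausible mapping follows the RAID example given under "I/O Limits": the chunk size becomes 'minimum_io_size' and the full stripe (chunk size times stripe count) becomes 'optimal_io_size'. The sketch below assumes that mapping; the function name is hypothetical.

```python
def striped_io_hints(chunk_size, stripe_count):
    """Return the "I/O hints" (in bytes) a striped device would export,
    assuming chunk size -> minimum_io_size and full stripe -> optimal_io_size."""
    if chunk_size <= 0 or stripe_count <= 0:
        raise ValueError("chunk_size and stripe_count must be positive")
    return {
        "minimum_io_size": chunk_size,
        "optimal_io_size": chunk_size * stripe_count,
    }


# Four disks striped with a 64 KiB chunk size.
print(striped_io_hints(chunk_size=64 * 1024, stripe_count=4))
# {'minimum_io_size': 65536, 'optimal_io_size': 262144}
```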

Linux Device Mapper (DM) and Software RAID (MD) device drivers can be used to arbitrarily combine devices with different "I/O Limits". The kernel's block layer goes to great lengths to reasonably combine the "I/O Limits" of the individual devices. The kernel will not prevent combining heterogeneous devices, but the user should be aware of the risk associated with doing so.

For instance, a 512 byte device and a 4K device may be combined into a single logical DM device; the resulting DM device would have a 'logical_block_size' of 4K. Filesystems layered on such a hybrid device assume that 4K will be written atomically, but in reality a 4K write will span 8 LBAs when issued to the 512 byte device. Using a 4K 'logical_block_size' for the higher-level DM device therefore increases the potential for a partial write to the 512 byte device if there is a system crash.
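The arithmetic behind this example can be sketched as below. It assumes the combined 'logical_block_size' is the largest of the component sizes, which matches the 4K result described above; the function names are hypothetical.

```python
def stacked_logical_block_size(component_sizes):
    """Logical block size of a device built on several components,
    assuming the largest component size wins."""
    return max(component_sizes)


def lbas_spanned(write_size, device_logical_block_size):
    """Number of logical blocks one write covers on a given component."""
    return write_size // device_logical_block_size


combined = stacked_logical_block_size([512, 4096])
print(combined)                      # 4096
print(lbas_spanned(combined, 512))   # 8 LBAs on the 512 byte device
print(lbas_spanned(combined, 4096))  # 1 LBA on the 4K device
```

The 8-LBA case is where a crash can leave only part of a 4K write on disk.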

If combining multiple devices' "I/O Limits" results in a conflict, the block layer may report a warning that the device is susceptible to partial writes and/or is misaligned.

Logical Volume Manager (LVM)


LVM provides userspace tools that are used to manage the kernel's DM devices. LVM will shift the start of the data area that a given DM device will use to account for a non-zero 'alignment_offset' associated with any device LVM manages. This means LVM logical volumes will be properly aligned (alignment_offset=0). LVM will adjust for any 'alignment_offset' by default, but this may be disabled through lvm.conf's 'data_alignment_offset_detection'. Disabling this is not recommended.

LVM will also detect the "I/O hints" for a device. The start of a device's data area will be a multiple of the 'minimum_io_size' or 'optimal_io_size' exposed in sysfs. 'minimum_io_size' is used if 'optimal_io_size' is undefined (0). LVM will automatically determine these "I/O hints" by default, but this may be disabled through lvm.conf's 'data_alignment_detection'. Disabling this is not recommended.
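Taken together, the two adjustments described in this section could be modelled roughly as follows. This is a simplified illustration rather than LVM's actual code: 'earliest_start' (the first byte the data area could occupy) and the function name are hypothetical, and the alignment shift is modelled as a simple addition.

```python
def data_area_start(earliest_start, minimum_io_size, optimal_io_size,
                    alignment_offset=0):
    """Pick a data area start (in bytes) that honours the "I/O hints"
    and is then shifted by any non-zero alignment_offset."""
    # Prefer optimal_io_size; fall back to minimum_io_size when it is 0.
    granularity = optimal_io_size or minimum_io_size
    if granularity <= 0:
        raise ValueError("no usable I/O hint")
    # Round earliest_start up to the next multiple of the granularity.
    aligned = -(-earliest_start // granularity) * granularity
    return aligned + alignment_offset


KIB = 1024
print(data_area_start(192 * KIB, 64 * KIB, 256 * KIB))    # 262144
print(data_area_start(192 * KIB, 64 * KIB, 0))            # 196608
print(data_area_start(192 * KIB, 64 * KIB, 0, 3584))      # 200192
```

The third call shows the data area being shifted by a 3584-byte 'alignment_offset' on top of the hint-based alignment.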

Partition and Filesystem tools


util-linux-ng's libblkid and fdisk


The libblkid library provided with the util-linux-ng package includes a programmatic API to access a device's "I/O Limits". libblkid allows applications, especially those that use Direct I/O, to properly size their I/O requests. util-linux-ng's fdisk uses libblkid to determine a device's "I/O Limits" for optimal placement of all partitions. If a device doesn't provide "I/O Limits" information fdisk will align all partitions on a 1MB boundary.

parted and libparted


parted's libparted also uses libblkid's "I/O Limits" API. The RHEL6 installer (anaconda) uses libparted. This means that all partitions created with either the installer or parted will be properly aligned.
The default alignment for all partitions created on a device that doesn't appear to provide "I/O Limits" information will be 1MB.

The heuristic parted uses is:

  • 1) Always use the reported 'alignment_offset' as the offset for the start of the first primary partition.
  • 2a) If 'optimal_io_size' is defined (not 0) align all partitions on an 'optimal_io_size' boundary.
  • 2b) If 'optimal_io_size' is undefined (0) and 'alignment_offset' is 0 and 'minimum_io_size' is a power of 2: use a 1MB default alignment.
    • this is the catch-all for "legacy" devices which don't appear to provide "I/O hints"; so in the default case all partitions will align on a 1MB boundary.
    • NOTE: parted can't distinguish between a "legacy" device and a modern device that provides "I/O hints" with alignment_offset=0 and optimal_io_size=0. Such a device might be a single SAS 4K device. So in the worst case, less than 1MB of space is lost at the start of the disk.
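The heuristic above can be sketched as a small function. It only encodes steps 1, 2a and 2b as listed; any other combination of values is not covered by the steps above, so the sketch returns None for the alignment in that case. The function names are hypothetical and this is not libparted's code.

```python
ONE_MIB = 1024 * 1024


def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def parted_alignment(alignment_offset, minimum_io_size, optimal_io_size):
    """Return (offset, alignment) in bytes following steps 1, 2a and 2b.
    alignment is None when none of the listed rules applies."""
    offset = alignment_offset                      # step 1
    if optimal_io_size != 0:                       # step 2a
        return offset, optimal_io_size
    if alignment_offset == 0 and is_power_of_two(minimum_io_size):
        return offset, ONE_MIB                     # step 2b ("legacy" default)
    return offset, None                            # not covered by the steps above


print(parted_alignment(0, 512, 0))           # (0, 1048576): "legacy" device
print(parted_alignment(0, 65536, 262144))    # (0, 262144): striped device
print(parted_alignment(3584, 512, 0))        # (3584, None)
```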

Filesystem tools


mkfs.ext[234], mkfs.xfs, and mkfs.gfs2 have been enhanced to consume a device's "I/O Limits". Linux filesystems are not allowed to be formatted with a block size that is smaller than the underlying storage's 'logical_block_size'. mkfs.ext[234] and mkfs.xfs also use the "I/O hints" to lay out on-disk data structures and data areas relative to the underlying storage's 'minimum_io_size' and 'optimal_io_size'; this allows filesystems to be optimally formatted for various RAID (striped) layouts.
