3.8. Configuring Error Behavior

When an error occurs during an I/O operation, the XFS driver responds in one of two ways:
  • Continue retries until either:
    • the I/O operation succeeds, or
    • an I/O operation retry count or time limit is exceeded.
  • Consider the error permanent and shut down the file system.
XFS currently recognizes the following error conditions for which you can configure the desired behavior specifically:
  • EIO: Error while trying to write to the device
  • ENOSPC: No space left on the device
  • ENODEV: Device cannot be found
All other possible error conditions, which do not have specific handlers defined, share a single, global configuration.
You can define when XFS deems an error permanent by setting a maximum number of retries, a maximum retry time in seconds, or both. XFS stops retrying as soon as either limit is reached.
There is also an option to immediately cancel the retries when unmounting the file system, regardless of any other configuration. This allows the unmount operation to succeed even in case of persistent errors.
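The stopping rules described above can be modelled as a small shell function. This is a simplified sketch, not the XFS driver's code: the function name xfs_error_is_permanent and its arguments are hypothetical, the inclusive handling of the limits is an assumption, and -1 is treated as "no limit" and 0 as "stop immediately", matching the values documented for the max_retries and retry_timeout_seconds options.

```shell
# Hypothetical model of the retry decision; not taken from the XFS driver.
# usage: xfs_error_is_permanent MAX_RETRIES TIMEOUT_SECONDS \
#            RETRIES_DONE ELAPSED_SECONDS UNMOUNTING FAIL_AT_UNMOUNT
# Exit status 0 means "stop retrying", 1 means "retry again".
xfs_error_is_permanent() {
    max_retries=$1
    timeout=$2
    retries_done=$3
    elapsed=$4
    unmounting=$5
    fail_at_unmount=$6

    # Retry-count limit; -1 disables it, 0 stops on the first error.
    if [ "$max_retries" -ne -1 ] && [ "$retries_done" -ge "$max_retries" ]; then
        return 0
    fi

    # Time limit in seconds; -1 disables it.
    if [ "$timeout" -ne -1 ] && [ "$elapsed" -ge "$timeout" ]; then
        return 0
    fi

    # Unmount override: cancels retries regardless of the two limits.
    if [ "$unmounting" -eq 1 ] && [ "$fail_at_unmount" -eq 1 ]; then
        return 0
    fi

    return 1
}
```

For example, with a retry limit of 3 and no time limit, the model reports the error as permanent once three retries have been made, because reaching either limit is enough.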

3.8.1. Configuration Files for Specific and Undefined Conditions

Configuration files controlling error behavior are located in the /sys/fs/xfs/device/error/ directory.
The /sys/fs/xfs/device/error/metadata/ directory contains subdirectories for each specific error condition:
  • /sys/fs/xfs/device/error/metadata/EIO/ for the EIO error condition
  • /sys/fs/xfs/device/error/metadata/ENODEV/ for the ENODEV error condition
  • /sys/fs/xfs/device/error/metadata/ENOSPC/ for the ENOSPC error condition
Each of these directories contains the following configuration files:
  • /sys/fs/xfs/device/error/metadata/condition/max_retries: controls the maximum number of times that XFS retries the operation.
  • /sys/fs/xfs/device/error/metadata/condition/retry_timeout_seconds: controls the time limit, in seconds, after which XFS stops retrying the operation.
All other possible error conditions, apart from those listed above, share a common configuration in these files:
  • /sys/fs/xfs/device/error/metadata/default/max_retries: controls the maximum number of retries
  • /sys/fs/xfs/device/error/metadata/default/retry_timeout_seconds: controls the time limit for retrying
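To review the current settings before changing them, the files listed above can be read in a loop. The following is a minimal sketch: the function name xfs_show_error_config and the XFS_SYSFS_ROOT variable are hypothetical helpers introduced here so that the function can be tried against a test directory; by default it reads from /sys/fs/xfs.

```shell
# Print the retry settings of every error condition for one XFS device.
# usage: xfs_show_error_config DEVICE
# XFS_SYSFS_ROOT is a hypothetical override; it defaults to /sys/fs/xfs.
xfs_show_error_config() {
    dev=$1
    root=${XFS_SYSFS_ROOT:-/sys/fs/xfs}

    # One subdirectory per condition: EIO, ENODEV, ENOSPC, and default.
    for dir in "$root/$dev"/error/metadata/*/; do
        [ -d "$dir" ] || continue
        cond=$(basename "$dir")
        for file in max_retries retry_timeout_seconds; do
            if [ -r "$dir$file" ]; then
                printf '%s %s=%s\n' "$cond" "$file" "$(cat "$dir$file")"
            fi
        done
    done
}
```

Calling xfs_show_error_config sda prints one line per condition and file, in the form condition file=value.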

3.8.2. Setting File System Behavior for Specific and Undefined Conditions

To set the maximum number of retries, write the desired number to the max_retries file.
  • For specific conditions:
    # echo value > /sys/fs/xfs/device/error/metadata/condition/max_retries
  • For undefined conditions:
    # echo value > /sys/fs/xfs/device/error/metadata/default/max_retries
value is a number between -1 and the maximum possible value of int, the C signed integer type. This is 2147483647 on 64-bit Linux.
To set the time limit, write the desired number of seconds to the retry_timeout_seconds file.
  • For specific conditions:
    # echo value > /sys/fs/xfs/device/error/metadata/condition/retry_timeout_seconds
  • For undefined conditions:
    # echo value > /sys/fs/xfs/device/error/metadata/default/retry_timeout_seconds
value is a number between -1 and 86400, which is the number of seconds in a day.
For both the max_retries and retry_timeout_seconds options, -1 means to retry forever and 0 means to stop immediately.
device is the name of the device, as found in the /dev/ directory; for example, sda.
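Because an out-of-range value is easy to mistype, it can help to check the value against the documented limits before writing it. The following sketch does that: the function name xfs_set_retry and the XFS_SYSFS_ROOT variable are hypothetical and exist only for this example, and the range checks simply restate the limits given above (-1 to 2147483647 for max_retries, -1 to 86400 for retry_timeout_seconds).

```shell
# Validate a retry setting against the documented range, then write it.
# usage: xfs_set_retry DEVICE CONDITION FILE VALUE
# XFS_SYSFS_ROOT is a hypothetical override; it defaults to /sys/fs/xfs.
xfs_set_retry() {
    dev=$1
    cond=$2
    file=$3
    value=$4
    root=${XFS_SYSFS_ROOT:-/sys/fs/xfs}

    # Upper bound depends on which file is being written.
    case $file in
        max_retries)           max=2147483647 ;;
        retry_timeout_seconds) max=86400 ;;
        *) echo "unknown setting: $file" >&2; return 2 ;;
    esac

    # Accept -1 (retry forever) or a non-negative integer up to max.
    case $value in
        -1) ;;
        ''|*[!0-9]*)
            echo "not an integer of -1 or greater: $value" >&2
            return 2 ;;
        *)
            # The length check guards against shell integer overflow.
            if [ "${#value}" -gt 10 ] || [ "$value" -gt "$max" ]; then
                echo "$value exceeds the maximum of $max for $file" >&2
                return 2
            fi ;;
    esac

    printf '%s\n' "$value" > "$root/$dev/error/metadata/$cond/$file"
}
```

For example, run as root, xfs_set_retry sda EIO max_retries 10 limits EIO errors on sda to ten retries, and xfs_set_retry sda default retry_timeout_seconds 300 sets a five-minute time limit for undefined conditions.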

Note

The default behavior for each error condition depends on the error context. Some errors, like ENODEV, are considered fatal and unrecoverable regardless of the retry count, so their default value is 0.

3.8.3. Setting Unmount Behavior

If the fail_at_unmount option is set, the file system overrides all other error configurations during unmount and immediately unmounts the file system without retrying the I/O operation. This allows the unmount operation to succeed even in case of persistent errors.
To set the unmount behavior:
# echo value > /sys/fs/xfs/device/error/fail_at_unmount
value is either 1 or 0:
  • 1 means to cancel retrying immediately if an error is found.
  • 0 means to respect the max_retries and retry_timeout_seconds options.
device is the name of the device, as found in the /dev/ directory; for example, sda.
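The same write can be wrapped in a function that rejects anything other than 0 or 1. This is a minimal sketch: the function name xfs_set_fail_at_unmount and the XFS_SYSFS_ROOT variable are hypothetical and used only for illustration.

```shell
# Set the unmount behavior for one XFS device.
# usage: xfs_set_fail_at_unmount DEVICE VALUE
# XFS_SYSFS_ROOT is a hypothetical override; it defaults to /sys/fs/xfs.
xfs_set_fail_at_unmount() {
    dev=$1
    value=$2
    root=${XFS_SYSFS_ROOT:-/sys/fs/xfs}

    # Only 0 (respect the retry limits) and 1 (cancel retries) are valid.
    case $value in
        0|1) ;;
        *)
            echo "fail_at_unmount must be 0 or 1, got: $value" >&2
            return 2 ;;
    esac

    printf '%s\n' "$value" > "$root/$dev/error/fail_at_unmount"
}
```

For example, run as root, xfs_set_fail_at_unmount sda 1 makes a later unmount of the file system on sda cancel any pending retries.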

Important

Set the fail_at_unmount option to the desired value before attempting to unmount the file system. After an unmount operation has started, the configuration files and directories may be unavailable.