What is the meaning of "lost page write due to I/O error" in RHEL 5, 6?

Solution Verified - Updated -

Environment

  • Red Hat Enterprise Linux 5
  • Red Hat Enterprise Linux 6

Issue

  • Messages similar to the following are seen in the logs:

    Aug  3 12:24:37 localhost kernel: lost page write due to I/O error on dm-29
    
  • Should I be concerned about messages indicating "lost page write due to I/O error"?

Resolution

  • Check the switch and storage array controllers for errors or link failures.
  • Review the messages that precede these in /var/log/messages for clues as to what caused the lost page write. The message is usually accompanied by more descriptive errors.
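
The log review described above can be sketched as a small script. This is a minimal illustration, not a supported tool: the function names and the `before` parameter are hypothetical, and it assumes the message text matches the example in the Issue section.

```python
import re
from collections import deque

# Matches the kernel message quoted in the Issue section and captures the device name.
LOST_WRITE = re.compile(r"lost page write due to I/O error on (\S+)")


def context_before_lost_writes(lines, before=5):
    """Return a list of (device, preceding_lines), one entry per lost-page-write message.

    `before` (hypothetical knob) is how many earlier log lines to keep as context.
    """
    recent = deque(maxlen=before)  # rolling window of the most recent lines
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        match = LOST_WRITE.search(line)
        if match:
            hits.append((match.group(1), list(recent)))
        recent.append(line)
    return hits


def scan_log(path="/var/log/messages", before=5):
    """Print each hit with its context; reading the log usually requires root."""
    with open(path, errors="replace") as log:
        for device, context in context_before_lost_writes(log, before):
            print("lost page write on", device)
            for earlier in context:
                print("    " + earlier)
```

Calling `scan_log()` on the affected host prints each occurrence together with the lines logged just before it, which is where SCSI, multipath, or HBA errors would normally appear.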

Root Cause

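  • This is a serious error and may indicate that data loss has occurred. The kernel logs it when writing a cached page back to the block device fails, so the contents of that page did not reach storage. There can be many root causes, depending on the specific storage configuration.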

  • If device-mapper-multipath was in use, all paths may have been lost while queue_if_no_path was not explicitly set on the multipath map, or the retries allowed by no_path_retry were exhausted. To determine whether one of these was the case:

    • If queue_if_no_path is configured for a device, it appears in the "features" field of that device in the output of multipath -ll:

      mpatha (360014380125989a10000400001300000) dm-6 HP,HSV300
      size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
      |-+- policy='round-robin 0' prio=50 status=active
      | |- 0:0:0:7 sdg  8:96   active ready running
      | `- 0:0:3:7 sdah 66:16  active ready running
      
    • no_path_retry may be specified in /etc/multipath.conf in a device section. It may also be a default setting for that device type; if so, it should be listed in /usr/share/doc/device-mapper-multipath-<version>/multipath.conf.defaults. Search for your device vendor/product and check whether that device block contains a value for no_path_retry. If it is not listed in either location, then no_path_retry is not enabled on the device.

  • If multipath was not in use, a WRITE command to the SCSI device most likely timed out.

  • If SAN storage was involved, paths to LUNs may have been lost because of a cable pull, a reconfiguration of storage or switches, or a link failure.

  • An asynchronous read is serviced from the page cache. If the page to be read is not yet marked PG_uptodate, the read fails, and the filesystem's journaling capabilities are needed to resynchronize the cache and mark those pages as valid again.
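
The queue_if_no_path check described above can also be done programmatically. The following is a minimal sketch that assumes the multipath -ll output has the same layout as the sample shown earlier (map alias first, features in single quotes); other multipath-tools versions may print it differently. The function name is hypothetical.

```python
import re

# Header line of a map, e.g. "mpatha (3600...) dm-6 HP,HSV300".
# Assumes an alias precedes the WWID, as in the sample output.
MAP_HEADER = re.compile(r"^(\S+)\s+\(([^)\s]+)\)\s+(dm-\d+)\b")
# The features field, e.g. features='1 queue_if_no_path'.
FEATURES = re.compile(r"features='([^']*)'")


def queueing_by_map(multipath_ll_output):
    """Map each multipath device name to True if its features list queue_if_no_path."""
    result = {}
    current = None
    for line in multipath_ll_output.splitlines():
        header = MAP_HEADER.match(line.strip())
        if header:
            current = header.group(1)
            result[current] = False  # assume not queueing until the features line says so
            continue
        features = FEATURES.search(line)
        if features and current is not None:
            result[current] = "queue_if_no_path" in features.group(1).split()
    return result
```

The function takes the text of the command output, for example captured with `subprocess` on the affected host. A map that reports False had no queueing in effect when the output was taken, so I/O to it fails as soon as all paths are lost.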

Diagnostic Steps

  • Follow basic steps for data recovery, such as unmounting any filesystem that resides on the affected device and running fsck on it.
