What diagnostic steps can I follow to determine the root cause of a storage failure in RHEL?

Solution Unverified - Updated -

Environment

  • Red Hat Enterprise Linux (RHEL) 3, 4, 5, 6

Issue

  • Storage failed with a specific message indicating that data corruption has occurred or is suspected.

Resolution

The following diagnostic steps help you gather data about storage issues, compile it into a timeline, and search for relevant information.

Diagnostic Steps

1. Obtain the /var/log/messages file(s), which contain the console messages from the start of boot up to the first error.

  a) If there are multiple files, combine all of them into one file, oldest first, with a simple command such as the following.  (NOTE: Be careful to combine the files in the correct order - use the 'head' and 'tail' commands to verify the result):

# cat messages.4 messages.3 messages.2 messages.1 messages > messages-complete
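When the number of rotated files is not known in advance, the ordering can be derived from the file names instead of being typed by hand. The following is a minimal sketch, not a definitive tool: `combine_messages` is a hypothetical helper name, and it assumes the numeric rotation scheme shown above (messages, messages.1, messages.2, ...), where a higher suffix means an older file. Systems that rotate with date suffixes would need a different sort.

```shell
# Hypothetical helper: concatenate rotated syslog files oldest-first.
# Assumes files are named messages, messages.1, messages.2, ...
# Usage: combine_messages <log directory> <output file>
combine_messages() {
    dir=$1
    out=$2
    # Extract the numeric suffixes and sort them in descending order,
    # so that the oldest file (highest number) is written first.
    for n in $(ls "$dir" | sed -n 's/^messages\.\([0-9][0-9]*\)$/\1/p' | sort -rn); do
        cat "$dir/messages.$n"
    done > "$out"
    # The unsuffixed file is the newest, so it is appended last.
    if [ -f "$dir/messages" ]; then
        cat "$dir/messages" >> "$out"
    fi
}
```

For example, `combine_messages /var/log messages-complete`, followed by `head` and `tail` on messages-complete to confirm that the timestamps run from oldest to newest.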

  b) Look for the first occurrence of the error with the following command, and note the timestamp of the occurrence:

# grep -n "specific error message" messages-complete | head -1
262790:Aug  4 23:18:23 localhost kernel: attempt to access beyond end of device

  c) Determine the timestamp of the boot that preceded the above error.  The start of the boot may be found by looking for messages containing "Linux version" or "syslogd restart".  (NOTE: If the machine had been up for a long time, the logs may not contain the boot sequence that preceded the first error, in which case it may not be possible to provide a root cause analysis (RCA).)  In the example below, the boot at "Aug  4 17:18:26" is the one that preceded the first error at "Aug  4 23:18:23" above:

# grep -n "Linux version" messages-complete
230798:Aug  3 07:52:19 localhost kernel: Linux version 2.6.18-238.9.1.el5 (mockbuild@x86-002.build.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50)) #1 SMP Fri Mar 18 12:42:39 EDT 2011
249323:Aug  3 16:27:43 localhost kernel: Linux version 2.6.18-238.9.1.el5 (mockbuild@x86-002.build.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50)) #1 SMP Fri Mar 18 12:42:39 EDT 2011
250320:Aug  4 17:18:26 localhost kernel: Linux version 2.6.18-238.9.1.el5 (mockbuild@x86-002.build.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50)) #1 SMP Fri Mar 18 12:42:39 EDT 2011
264985:Aug  5 12:19:24 localhost kernel: Linux version 2.6.18-238.9.1.el5 (mockbuild@x86-002.build.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50)) #1 SMP Fri Mar 18 12:42:39 EDT 2011
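Picking the right boot out of a long grep listing by eye is error-prone, so the lookup can be done in a single pass. The sketch below assumes that "Linux version" marks each boot, as described above; `find_boot_and_error` is a hypothetical helper name, and it matches the error text as a fixed string rather than as a regular expression.

```shell
# Hypothetical helper: print "<boot line> <error line>" for the first line
# containing the error text and the last "Linux version" line before it.
# Prints nothing and returns 1 if no error, or no preceding boot, is found.
# Usage: find_boot_and_error <combined log> <error text>
find_boot_and_error() {
    file=$1
    pattern=$2
    awk -v pat="$pattern" '
        /Linux version/ { boot = NR }    # remember the most recent boot
        index($0, pat) {                 # first line containing the error text
            if (boot) { print boot, NR; found = 1 }
            exit
        }
        END { exit(found ? 0 : 1) }
    ' "$file"
}
```

A non-zero return status corresponds to the case noted above in which the log holds no boot sequence ahead of the first error.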

  d) Trim the combined messages file so that it contains only the messages from the time of boot to the time of the first error.  In the above example, this is every message from Aug 4 17:18:26 through Aug 4 23:18:23.  Often a "good enough" way to trim the file is to grep for the lines containing the day of the first error or, if the machine was up for a long time, the day of the error and the previous few days.  For example:

# egrep '(Aug  4|Aug  3)' messages-complete > messages-trimmed

Or, a simple awk script may be used, with the line numbers as follows.  For the above example, the boot started at line 250320 and our error was at line 262790, so our awk script would be:

# awk '{ if ((NR>=250320)&&(NR<=262790)) print }' messages-complete > messages-trimmed
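On very large combined logs, the same trim can stop reading as soon as the last wanted line has been printed. This is a small sketch using sed, with `trim_lines` as a hypothetical helper name; it also rejects a reversed range, which would otherwise silently produce a misleading file.

```shell
# Hypothetical helper: print lines <first>..<last> of a file, inclusive,
# and quit at <last> instead of scanning the remainder of the file.
# Usage: trim_lines <file> <first line> <last line>
trim_lines() {
    file=$1
    first=$2
    last=$3
    if [ "$first" -gt "$last" ]; then
        echo "trim_lines: first line must not exceed last line" >&2
        return 1
    fi
    sed -n "${first},${last}p;${last}q" "$file"
}
```

With the line numbers from the example above: `trim_lines messages-complete 250320 262790 > messages-trimmed`.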

  e) Look for reboot / shutdown messages in the log.  If a reboot or shutdown message is between the boot time and the time of the first error, the error may be related to a shutdown / teardown path.

# grep -n "shutting down" messages-trimmed
260851:Aug  4 23:08:16 localhost shutdown[28967]: shutting down for system reboot

2. Examine the messages for any indication that storage has been changed or reconfigured.  Some combinations of kernel and storage device report changes to the underlying storage with messages similar to the following:

  a) Capacity change on block storage

Aug  4 20:19:24 localhost kernel: sda: detected capacity change from 10737418240 to 8589934592

  b) LUN mapping change (RHEL 5.6 or above) on block storage; see https://access.redhat.com/kb/docs/DOC-53124

Aug  4 18:50:03 localhost kernel: sd 3:0:0:7: Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.

Example:

# egrep '(LUN assignments|capacity change)' messages-trimmed
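Raw byte counts are hard to compare by eye, so it can help to print the direction and size of each capacity change. The sketch below assumes the message layout shown in the example above ("<device>: detected capacity change from <old> to <new>"); `summarize_capacity_changes` is a hypothetical helper name, and other kernels may word the message differently.

```shell
# Hypothetical helper: for each "detected capacity change" message, print
# the device, whether it shrank or grew, and the difference in bytes.
# Usage: summarize_capacity_changes <log file>
summarize_capacity_changes() {
    awk '
        /detected capacity change from/ {
            for (i = 2; i <= NF - 6; i++) {
                if ($i == "detected") {
                    dev = $(i - 1)
                    sub(/:$/, "", dev)          # "sda:" becomes "sda"
                    before = $(i + 4) + 0       # value after "from"
                    after  = $(i + 6) + 0       # value after "to"
                    verb = (after < before) ? "shrank" : "grew"
                    diff = (after < before) ? before - after : after - before
                    printf "%s %s by %.0f bytes\n", dev, verb, diff
                    break
                }
            }
        }
    ' "$1"
}
```

For the sample message above, the device went from 10737418240 to 8589934592 bytes, a reduction of 2147483648 bytes (2 GiB). A device that shrinks underneath a mounted filesystem is a plausible source of "attempt to access beyond end of device" errors.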

3. Examine the period of time just before the event, and fill in the timeline with any interesting messages.  A simple command which may help to narrow the scope is as follows (NOTE: 500 lines is just a guess; you may need to increase it, in which case change both the -B value and the head count):

# grep -B500 "specific error message" messages-trimmed | head -500
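Because the line count appears twice in the command above, it is easy to change one value and forget the other. The sketch below takes the count as a single argument and relies on the `-m1` option of GNU grep to stop at the first match; `context_before_first_error` is a hypothetical helper name. Unlike the `head` variant, the output always ends with the matching line itself.

```shell
# Hypothetical helper: print up to <lines> lines of context leading up to
# the first match of <pattern>, followed by the matching line.
# Usage: context_before_first_error <file> <pattern> [lines, default 500]
context_before_first_error() {
    file=$1
    pattern=$2
    lines=${3:-500}
    # -m1 stops after the first matching line; -B prints the lines before it.
    grep -m1 -B"$lines" -e "$pattern" "$file"
}
```

For example: `context_before_first_error messages-trimmed "specific error message" 1000`.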

Make a short list of any interesting messages that preceded the error.  For each message, search for its meaning, or ask an expert.  For a list of some common block storage messages, see https://access.redhat.com/kb/docs/DOC-64557.

4. If there are no obvious indications that storage has changed, research other possible events around the time of the first error, some of which may include:

  a) Device driver, device-mapper-multipath, or udev events (/var/log/messages, other logs)

  b) Cron jobs which kick off various processes

  c) Power or weather related events which may cause hardware glitches

  d) IT change control actions, such as network or storage reconfiguration

5. Obtain storage logs from the storage vendor.

  a) Compare the time on the host with the time in the storage logs.  NOTE: If the time zones differ, or the clocks disagree, the log file times must be adjusted accordingly.  Convert time zones with a tool like the following: http://www.timezoneconverter.com/cgi-bin/tzc.tzc.   Also, even if the times are fairly close, it is good to check the current time on both the storage and the host (for example, with 'date' on the host) and record any small differences.

  b) Using the host timeframe gathered in Step #1 and the time zone information gathered in Step 5 (a), create a range of interesting times to examine in the storage logs.

  c) Use any information in the storage vendor logs to fill in the timeline further.
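The time adjustments in Step 5 can also be done on the host itself with GNU date, which accepts a numeric UTC offset in its input. The sketch below is illustrative only: `to_utc` and `clock_skew_seconds` are hypothetical helper names, and the year and the -0400 offset in the usage example are assumptions, since syslog timestamps carry neither a year nor a time zone and both must be supplied by hand.

```shell
# Hypothetical helper: normalize a timestamp with an explicit UTC offset
# to UTC, so that host and storage log times share one reference.
# Usage: to_utc '<YYYY-MM-DD HH:MM:SS +/-hhmm>'
to_utc() {
    date -u -d "$1" '+%Y-%m-%d %H:%M:%S'
}

# Hypothetical helper: print how many seconds the first clock reading is
# ahead of the second (negative if it is behind).
# Usage: clock_skew_seconds '<host reading>' '<storage reading>'
clock_skew_seconds() {
    echo $(( $(date -u -d "$1" +%s) - $(date -u -d "$2" +%s) ))
}
```

For example, `to_utc '2011-08-04 23:18:23 -0400'` prints `2011-08-05 03:18:23`. Converting both the host timeframe and the storage log entries to UTC first, then applying any recorded skew, keeps the range of interesting times consistent between the two sets of logs.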

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
