RHEL 6.5 - "Kernel: multipath, not tainted" alarms

Hi, this is my first post on the RHEL forums! I felt the need to post because, since the most recent kernel patch, I have been seeing the following errors:

Feb 10 05:33:18 know-iprd-testaql1 kernel: INFO: task multipathd:4620 blocked for more than 120 seconds.
Feb 10 05:33:18 know-iprd-testaql1 kernel:      Not tainted 2.6.32-431.3.1.el6.x86_64 #1
Feb 10 05:33:18 know-iprd-testaql1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 10 05:33:18 know-iprd-testaql1 kernel: multipathd    D 0000000000000005     0  4620      1 0x00000000
Feb 10 05:33:18 know-iprd-testaql1 kernel: ffff88105ac4f968 0000000000000082 0000000000000000 ffff880b48cf8880
Feb 10 05:33:18 know-iprd-testaql1 kernel: ffff881061c305e0 ffff881063c21000 ffffffff8137ffe0 ffff880b48ebee40
Feb 10 05:33:18 know-iprd-testaql1 kernel: ffff88105a828638 ffff88105ac4ffd8 000000000000fbc8 ffff88105a828638
Feb 10 05:33:18 know-iprd-testaql1 kernel: Call Trace:
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff8137ffe0>] ? scsi_done+0x0/0x60
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff815287c5>] schedule_timeout+0x215/0x2e0
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff810841dd>] ? del_timer+0x7d/0xe0
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff81528443>] wait_for_common+0x123/0x180
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff8152855d>] wait_for_completion+0x1d/0x20
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff8126d9ec>] blk_execute_rq+0x8c/0xf0
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff8126d730>] ? blk_rq_map_user+0x1a0/0x280
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff81271f15>] sg_io+0x215/0x3d0
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff812728a0>] scsi_cmd_ioctl+0x400/0x470
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff81272961>] scsi_cmd_blk_ioctl+0x51/0x70
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffffa00bc5bf>] sd_ioctl+0xaf/0x110 [sd_mod]
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff8126f637>] __blkdev_driver_ioctl+0x67/0x80
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff8126fabd>] blkdev_ioctl+0x1ed/0x6e0
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff811c46dc>] block_ioctl+0x3c/0x40
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff8119db52>] vfs_ioctl+0x22/0xa0
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff81282715>] ? _atomic_dec_and_lock+0x55/0x80
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff8119dcf4>] do_vfs_ioctl+0x84/0x580
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff8118a7e1>] ? __fput+0x1a1/0x210
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff8119e271>] sys_ioctl+0x81/0xa0
Feb 10 05:33:18 know-iprd-testaql1 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Feb 10 05:35:18 know-iprd-testaql1 kernel: INFO: task multipathd:4620 blocked for more than 120 seconds.
Feb 10 05:35:18 know-iprd-testaql1 kernel:      Not tainted 2.6.32-431.3.1.el6.x86_64 #1
Feb 10 05:35:18 know-iprd-testaql1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 10 05:35:18 know-iprd-testaql1 kernel: multipathd    D 0000000000000005     0  4620      1 0x00000000
Feb 10 05:35:18 know-iprd-testaql1 kernel: ffff88105ac4f968 0000000000000082 0000000000000000 ffff880b48cf8880
Feb 10 05:35:18 know-iprd-testaql1 kernel: ffff881061c305e0 ffff881063c21000 ffffffff8137ffe0 ffff880b48ebee40
Feb 10 05:35:18 know-iprd-testaql1 kernel: ffff88105a828638 ffff88105ac4ffd8 000000000000fbc8 ffff88105a828638
Feb 10 05:35:18 know-iprd-testaql1 kernel: Call Trace:
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff8137ffe0>] ? scsi_done+0x0/0x60
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff815287c5>] schedule_timeout+0x215/0x2e0
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff810841dd>] ? del_timer+0x7d/0xe0
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff81528443>] wait_for_common+0x123/0x180
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff8152855d>] wait_for_completion+0x1d/0x20
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff8126d9ec>] blk_execute_rq+0x8c/0xf0
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff8126d730>] ? blk_rq_map_user+0x1a0/0x280
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff81271f15>] sg_io+0x215/0x3d0
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff812728a0>] scsi_cmd_ioctl+0x400/0x470
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff81272961>] scsi_cmd_blk_ioctl+0x51/0x70
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffffa00bc5bf>] sd_ioctl+0xaf/0x110 [sd_mod]
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff8126f637>] __blkdev_driver_ioctl+0x67/0x80
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff8126fabd>] blkdev_ioctl+0x1ed/0x6e0
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff811c46dc>] block_ioctl+0x3c/0x40
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff8119db52>] vfs_ioctl+0x22/0xa0
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff81282715>] ? _atomic_dec_and_lock+0x55/0x80
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff8119dcf4>] do_vfs_ioctl+0x84/0x580
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff8118a7e1>] ? __fput+0x1a1/0x210
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff8119e271>] sys_ioctl+0x81/0xa0
Feb 10 05:35:18 know-iprd-testaql1 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Feb 10 05:36:17 know-iprd-testaql1 multipathd: sdq: couln't get asymmetric access state
Feb 10 05:36:17 know-iprd-testaql1 kernel: qla2xxx [0000:06:00.0]-801c:0: Abort command issued nexus=0:1:4 --  1 2002.
Feb 10 05:36:17 know-iprd-testaql1 multipathd: data04: load table [0 419430400 multipath 1 queue_if_no_path 0 3 2 round-robin 0 1 1 65:0 1 round-robin 0 4 1 65:192 1 66:128 1 67:64
1 68:0 1 round-robin 0 3 1 8:64 1 68:192 1 69:128 1]
Feb 10 05:36:37 know-iprd-testaql1 multipathd: data04: load table [0 419430400 multipath 1 queue_if_no_path 0 2 1 round-robin 0 4 1 65:192 1 66:128 1 67:64 1 68:0 1 round-robin 0 4
1 65:0 1 8:64 1 68:192 1 69:128 1]

I don't know whether this was an issue prior to the patch, sorry. What I do know is that it worries me... should it?
What is actually broken here, if anything? Is there a fix?

Many thanks,
Mark Hands

Responses

Hey Mark - welcome to the forum!

Whenever I see "blocked for more than 120 seconds", I become concerned.

I suspect that it's not your kernel that has broken anything (although it is possible), but multipath might have been patched at the same time.

ls -l /etc/multipath.conf
grep -v ^# /etc/multipath.conf
multipath -ll -v2
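
To see whether multipath itself was updated alongside the kernel, something like this should show the recent package history (a sketch, assuming standard RHEL 6 rpm/yum; device-mapper-multipath is the usual package name - adjust if yours differs):

rpm -q --last device-mapper-multipath kernel
yum history list device-mapper-multipath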

Also - could you let us know what type of storage array you are using (to validate the multipath config) and whether you are booting from SAN (as the initrd would likely need some attention too).

Thanks James :)

We are using an IBM Storwize V7000 SAN, and booting from local drives on an IBM Blade.
Here's the (very long) output from those commands:

[root@know-iprd-testaql1 ~]# ls -l /etc/multipath.conf
-rw-r--r-- 1 root root 3911 Feb 10 08:05 /etc/multipath.conf
[root@know-iprd-testaql1 ~]# grep -v ^# /etc/multipath.conf

defaults {
        user_friendly_names yes
        path_grouping_policy multibus
}
blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z]"
        devnode "^dcssblk[0-9]*"
        device {
                vendor "DGC"
                product "LUNZ"
        }
        device {
                vendor "IBM"
                product "S/390.*"
        }
        # don't count normal SATA devices as multipaths
        device {
                vendor  "ATA"
        }
        # don't count 3ware devices as multipaths
        device {
                vendor  "3ware"
        }
        device {
                vendor  "AMCC"
        }
        # nor highpoint devices
        device {
                vendor  "HPT"
        }
        wwid "3600508e0000000005b7549c3c11df80b"
        device {
                vendor MATSHITA
                product DVD-ROM_UJ8A0
        }
        device {
                vendor IMM
                product Virtual_CD_DVD
        }
        wwid "*"
}
blacklist_exceptions {
        wwid "360050768028102742800000000000007"
        wwid "360050768028102742800000000000008"
        wwid "360050768028102742800000000000009"
        wwid "36005076802810274280000000000000a"
        wwid "36005076802810274280000000000000b"
        wwid "36005076802810274280000000000000c"
        wwid "36005076802810274280000000000000d"
        wwid "36005076802810274280000000000000e"
        wwid "36005076802810274280000000000000f"
        wwid "360050768028102742800000000000010"
        wwid "360050768028102742800000000000011"
        wwid "360050768028102742800000000000012"
}
multipaths {
        multipath {
                uid 0
                gid 0
                wwid "360050768028102742800000000000007"
                mode 0600
                alias data00
        }
        multipath {
                uid 0
                gid 0
                wwid "360050768028102742800000000000008"
                mode 0600
                alias data01
        }
        multipath {
                uid 0
                gid 0
                wwid "360050768028102742800000000000009"
                mode 0600
                alias data02
        }
        multipath {
                uid 0
                gid 0
                wwid "36005076802810274280000000000000a"
                mode 0600
                alias data03
        }
        multipath {
                uid 0
                gid 0
                wwid "36005076802810274280000000000000b"
                mode 0600
                alias data04
        }
        multipath {
                uid 0
                gid 0
                wwid "36005076802810274280000000000000c"
                mode 0600
                alias data05
        }
        multipath {
                uid 0
                gid 0
                wwid "36005076802810274280000000000000d"
                mode 0600
                alias data06
        }
        multipath {
                uid 0
                gid 0
                wwid "36005076802810274280000000000000e"
                mode 0600
                alias data07
        }
        multipath {
                uid 0
                gid 0
                wwid "36005076802810274280000000000000f"
                mode 0600
                alias data08
        }
        multipath {
                uid 0
                gid 0
                wwid "360050768028102742800000000000010"
                mode 0600
                alias data09
        }
        multipath {
                uid 0
                gid 0
                wwid "360050768028102742800000000000011"
                mode 0600
                alias data10
        }
        multipath {
                uid 0
                gid 0
                wwid "360050768028102742800000000000012"
                mode 0600
                alias data11
        }
}

[root@know-iprd-testaql1 ~]# multipath -ll -v2
data01 (360050768028102742800000000000008) dm-2 IBM,2145
size=200G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 0:0:0:1  sdb  8:16   active ready running
| |- 0:0:1:1  sdn  8:208  active ready running
| |- 1:0:2:1  sdbv 68:144 active ready running
| `- 1:0:3:1  sdch 69:80  active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 0:0:2:1  sdz  65:144 active ready running
  |- 1:0:0:1  sdax 67:16  active ready running
  |- 0:0:3:1  sdal 66:80  active ready running
  `- 1:0:1:1  sdbj 67:208 active ready running
data00 (360050768028102742800000000000007) dm-3 IBM,2145
size=200G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 0:0:2:0  sdy  65:128 active ready running
| |- 1:0:0:0  sdaw 67:0   active ready running
| |- 0:0:3:0  sdak 66:64  active ready running
| `- 1:0:1:0  sdbi 67:192 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 0:0:0:0  sda  8:0    active ready running
  |- 0:0:1:0  sdm  8:192  active ready running
  |- 1:0:2:0  sdbu 68:128 active ready running
  `- 1:0:3:0  sdcg 69:64  active ready running
data09 (360050768028102742800000000000010) dm-17 IBM,2145
size=200G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 0:0:0:9  sdj  8:144  active ready running
| |- 0:0:1:9  sdv  65:80  active ready running
| |- 1:0:2:9  sdcd 69:16  active ready running
| `- 1:0:3:9  sdcp 69:208 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 0:0:2:9  sdah 66:16  active ready running
  |- 0:0:3:9  sdat 66:208 active ready running
  |- 1:0:0:9  sdbf 67:144 active ready running
  `- 1:0:1:9  sdbr 68:80  active ready running
data11 (360050768028102742800000000000012) dm-20 IBM,2145
size=200G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 0:0:0:11 sdl  8:176  active ready running
| |- 0:0:1:11 sdx  65:112 active ready running
| |- 1:0:3:11 sdcr 69:240 active ready running
| `- 1:0:2:11 sdcf 69:48  active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 0:0:2:11 sdaj 66:48  active ready running
  |- 0:0:3:11 sdav 66:240 active ready running
  |- 1:0:0:11 sdbh 67:176 active ready running
  `- 1:0:1:11 sdbt 68:112 active ready running
data08 (36005076802810274280000000000000f) dm-15 IBM,2145
size=200G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 0:0:2:8  sdag 66:0   active ready running
| |- 0:0:3:8  sdas 66:192 active ready running
| |- 1:0:0:8  sdbe 67:128 active ready running
| `- 1:0:1:8  sdbq 68:64  active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 0:0:0:8  sdi  8:128  active ready running
  |- 0:0:1:8  sdu  65:64  active ready running
  |- 1:0:2:8  sdcc 69:0   active ready running
  `- 1:0:3:8  sdco 69:192 active ready running
data10 (360050768028102742800000000000011) dm-18 IBM,2145
size=200G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 0:0:2:10 sdai 66:32  active ready running
| |- 0:0:3:10 sdau 66:224 active ready running
| |- 1:0:0:10 sdbg 67:160 active ready running
| `- 1:0:1:10 sdbs 68:96  active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 0:0:0:10 sdk  8:160  active ready running
  |- 0:0:1:10 sdw  65:96  active ready running
  |- 1:0:2:10 sdce 69:32  active ready running
  `- 1:0:3:10 sdcq 69:224 active ready running
data07 (36005076802810274280000000000000e) dm-13 IBM,2145
size=200G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 0:0:0:7  sdh  8:112  active ready running
| |- 0:0:1:7  sdt  65:48  active ready running
| |- 1:0:2:7  sdcb 68:240 active ready running
| `- 1:0:3:7  sdcn 69:176 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 0:0:2:7  sdaf 65:240 active ready running
  |- 0:0:3:7  sdar 66:176 active ready running
  |- 1:0:0:7  sdbd 67:112 active ready running
  `- 1:0:1:7  sdbp 68:48  active ready running
data06 (36005076802810274280000000000000d) dm-9 IBM,2145
size=200G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 0:0:2:6  sdae 65:224 active ready running
| |- 0:0:3:6  sdaq 66:160 active ready running
| |- 1:0:0:6  sdbc 67:96  active ready running
| `- 1:0:1:6  sdbo 68:32  active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 0:0:0:6  sdg  8:96   active ready running
  |- 0:0:1:6  sds  65:32  active ready running
  |- 1:0:2:6  sdca 68:224 active ready running
  `- 1:0:3:6  sdcm 69:160 active ready running
data05 (36005076802810274280000000000000c) dm-11 IBM,2145
size=200G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 0:0:0:5  sdf  8:80   active ready running
| |- 0:0:1:5  sdr  65:16  active ready running
| |- 1:0:2:5  sdbz 68:208 active ready running
| `- 1:0:3:5  sdcl 69:144 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 0:0:2:5  sdad 65:208 active ready running
  |- 0:0:3:5  sdap 66:144 active ready running
  |- 1:0:0:5  sdbb 67:80  active ready running
  `- 1:0:1:5  sdbn 68:16  active ready running
data04 (36005076802810274280000000000000b) dm-7 IBM,2145
size=200G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 0:0:2:4  sdac 65:192 active ready running
| |- 0:0:3:4  sdao 66:128 active ready running
| |- 1:0:0:4  sdba 67:64  active ready running
| `- 1:0:1:4  sdbm 68:0   active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 0:0:0:4  sde  8:64   active ready running
  |- 0:0:1:4  sdq  65:0   active ready running
  |- 1:0:2:4  sdby 68:192 active ready running
  `- 1:0:3:4  sdck 69:128 active ready running
data03 (36005076802810274280000000000000a) dm-5 IBM,2145
size=200G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 0:0:0:3  sdd  8:48   active ready running
| |- 0:0:1:3  sdp  8:240  active ready running
| |- 1:0:2:3  sdbx 68:176 active ready running
| `- 1:0:3:3  sdcj 69:112 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 0:0:2:3  sdab 65:176 active ready running
  |- 1:0:0:3  sdaz 67:48  active ready running
  |- 0:0:3:3  sdan 66:112 active ready running
  `- 1:0:1:3  sdbl 67:240 active ready running
data02 (360050768028102742800000000000009) dm-4 IBM,2145
size=200G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 0:0:2:2  sdaa 65:160 active ready running
| |- 1:0:0:2  sday 67:32  active ready running
| |- 0:0:3:2  sdam 66:96  active ready running
| `- 1:0:1:2  sdbk 67:224 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 0:0:0:2  sdc  8:32   active ready running
  |- 0:0:1:2  sdo  8:224  active ready running
  |- 1:0:2:2  sdbw 68:160 active ready running
  `- 1:0:3:2  sdci 69:96  active ready running

In anticipation of your next reply.........
Mark

I should preface by stating that you should absolutely open a case if this is urgent/important. We love to help folks on the forum, but certainly don't want to impact your business while we help you get through this ;-)

I have not worked with the V7000 on RHEL yet, but 8 paths to each LUN seems odd (although it is absolutely possible). The fact that they have different priorities also seems odd. I would check IBM's recommendations for the multipath config. I would also check whether the patching left behind an old version of multipath.conf, as it appears something updated that file this morning. If you use Satellite and configuration channels, there may be a copy in /var/lib/rhncfg/backups/

find /etc/ -name "multipath*" -exec ls -lh {} \;
find /var/ -name "multipath*" -exec ls -lh {} \;

Please take a look at this
http://pic.dhe.ibm.com/infocenter/storwize/ic/topic/com.ibm.storwize.v7000.doc/svc_linux_settings.html

specifically the global-device section:

    device {
        vendor "IBM"
        product "2145"
        path_grouping_policy group_by_prio
        getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        features "1 queue_if_no_path"
        prio alua
        path_checker tur
        failback immediate
        no_path_retry "5"
        rr_min_io 1
#       polling_interval 30
        dev_loss_tmo 120
    }

You will likely NOT want to use getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n" - as that will change the "common name" (e.g. data02) that you have implemented elsewhere in your config, I believe.

This is brilliant, James. Thank you so much. It's leading me down some good paths.
I will apply some of the changes recommended in these best-practice docs, and report back after a day or two's soak testing.

P.S. this is not critical production kit, merely a test environment :)

Cool - then we're all here to help ;-) (some folks don't realize that we are not actually support).

It shouldn't change the devices' aliases. The getuid_callout should just be for pulling devid (typically the LUN serial) information from the SCSI device's code-page. The devid is what gets populated into the parenthesis-surrounded field of the multipath -ll output (and is what the OP is using to bind his aliases - the "data02" labels - to).
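
For example, running that callout by hand against one of the paths should print the same WWID that appears in parentheses in the multipath -ll output - e.g., sdq is a data04 path, so it should return 36005076802810274280000000000000b:

/lib/udev/scsi_id --whitelisted --device=/dev/sdq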

The "dev_loss_tmo 120" declaration is what's getting triggered in the original "NFO: task multipathd:4620 blocked for more than 120 seconds" logged message. Basically, an I/O that was sent down one of his eight paths to the LUN failed to complete within the time specified by the "dev_loss_tmo 120" declaration. Assuming the failed I/O was able to be shunted to another path and successfully committed, it should result in no damage to the data. Mostly, what it should be pointing to is that there have been I/O failures down a given path in the path-group.

Looking further into the logs, it looks like the /dev/sdq devnode/path (a member of the "data04" metadevice) is what experienced the error. The fact that sdq is currently showing as good means that it was likely a self-correcting/transient condition. That said, if these are regular occurrences, it's likely going to be a good idea to talk to the SAN admin to see if either the array (or just one of the HBAs on the array) or the fabric has been experiencing issues that would have caused that path to become unavailable for more than 120s at a go.
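
If you want to keep an eye on whether any path drops again, a couple of quick checks (a sketch; this assumes the RHEL 6 multipathd interactive socket is available, and qla2xxx/multipathd are the strings already appearing in your logs):

echo 'show paths' | multipathd -k
grep -E 'multipathd|qla2xxx' /var/log/messages | tail -50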

Hi Tom, thanks for this info. I do think a check of all Fibre cabling between my Bladecenter and SAN is in order here too. I will get that sorted.

Hi again. Unfortunately, after applying all the multipath fixes in that document, and following the further advice on the IBM website to create a specific rule for IBM SAN drives:

vi /etc/udev/rules.d/99-ibm-2145.rules

...it has errored again with the exact same message.
Further reading led me to this page -> https://access.redhat.com/site/solutions/23366
...but I am unclear how to modify the file it talks about, as modules.conf does not exist, and I don't know which file to modify under /etc/modprobe.d/.
Any further help appreciated.
Thanks

I'll have to do some more digging. I'm surprised that UDEV is still recommended - and that recommendation might be release-specific.
Queue depth is typically tuned on a per-HBA basis (e.g. if you had a database host that had SAN storage, and also direct fibre channel tape devices, you would tune the cards differently).
Do you know what kind of HBAs you have installed? Are you using the included driver, or the stack provided by the vendor (Emulex or QLogic, etc.)?

lspci | egrep -i 'fibre|fiber'

Queue depth is an important tuning parameter, but in general a misconfigured queue depth would make a box run poorly rather than completely stop I/O. When I have seen that hung_task_timeout message, it has typically indicated that the OS/kernel was not able to access something on disk.
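
For reference, you can usually read the current per-LUN queue depth straight out of sysfs without changing anything (standard sysfs layout; the device address below is just one of yours as an example):

grep . /sys/class/scsi_device/*/device/queue_depth
cat /sys/bus/scsi/devices/0:0:0:1/queue_depth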

Does the V7000 have an active/active multipath layout, or primary/standby? Are you able to see whether the path (on the array itself) is transferring between the array controllers? (It's been about 3 years since I have messed around with one of those arrays.)

Hopefully I get some time to dig deeper on this and I can help you out.

Hi James, this server uses a QLogic HBA. Results from that command:

[root@know-iprd-testaql1 ~]# lspci | egrep -i 'fibre|fiber'
06:00.0 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
06:00.1 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)

It is an active/active layout to a single SAN surround with 2 nodes.
As I replied above to Tom Jones (famous Welsh singer?), I will also get the SAN<>Bladecenter fibre cabling checked out as this may be a case of a faulty cable after all.

Also, are you suggesting I should have not bothered with the udev changes I applied?

I don't have any particular stance on UDEV. UDEV is quite perplexing to me, actually. It seemed as though releases had relied on UDEV heavily for a while, then they found some security issue with it.. and something changed ... and suddenly I don't hear about it for quite some time - then it pops up on the forum ;-)
We HAD to use UDEV for a while when Oracle ASM was not supported on RHEL 6. We have regretted the decision to use it (for this particular function).

So - to answer your question: I'm not sure if UDEV was/is necessary. IBM's documentation is usually pretty awesome, but the docs we found might be a bit dated. We use Emulex, and the included driver stack is actually sufficient for us.
Did you happen across this doc?
https://www-304.ibm.com/partnerworld/wps/servlet/download/DownloadServlet?id=6bQjQiT_N1PiPCA$cnt&attachmentName=Linux_IO_performance.pdf&token=MTM5MjIzMzk2NjUwMA==&locale=en_ALL_ZZ

Sorry to add such ambiguity into the conversation! ;-) I wish I had a better answer.

Unless you're using an older EL5-based system, you generally don't want to muck with /etc/modprobe.conf. Its use is deprecated in EL6 and even in later EL5s.

That said, in general, if a given file doesn't exist in /etc/modprobe.d, you can just go ahead and create it, populating it with the suggested information. The fun of /etc/modprobe.d files is that, while there may be naming conventions promulgated by a module-maker, there's not really any specific name you must give to your files: absent a vendor filename recommendation, just pick something meaningful to you (i.e., "self-documenting").
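
As a purely illustrative sketch (the filename is arbitrary, and the option and value are placeholders - use whatever the Red Hat solution and IBM actually recommend for your setup), a qla2xxx queue-depth override in /etc/modprobe.d would look something like:

# /etc/modprobe.d/qla2xxx.conf - filename chosen for readability
options qla2xxx ql2xmaxqdepth=32

For a driver that loads at boot you would likely also need to rebuild the initramfs (dracut -f) and reboot for the new option to take effect.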
