Hung I/O to a filesystem on an MD RAID5/6 storage device.


Environment

  • Red Hat Enterprise Linux 6.3
  • kernel-2.6.32-279.el6
  • MD RAID5/6 volume

Issue

  • Writes to an MD RAID5 volume do not always complete, hanging I/O.
  • The NFS server was not responding to NFS client machines, and the following message appeared in /var/log/messages on the NFS clients:
"server "nfsserver.com" not responding, timed out"
  • The following hung_task messages were seen:
INFO: task jbd2/md0-8:2070 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/md0-8    D 0000000000000000     0  2070      2 0x00000000
 ffff88066bf5fb30 0000000000000046 0000000000000000 ffff880c6d33e800
 ffff880c6d33ea28 0000000000000286 ffff880c6d33e928 0000000000000002
 ffff88066c75c5f8 ffff88066bf5ffd8 000000000000fb88 ffff88066c75c5f8
Call Trace:
 [<ffffffff811140c0>] ? sync_page+0x0/0x50
 [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
 [<ffffffff811140fd>] sync_page+0x3d/0x50
 [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
 [<ffffffff81114333>] wait_on_page_bit+0x73/0x80
 [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
 [<ffffffff8112a835>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffff811147ab>] wait_on_page_writeback_range+0xfb/0x190
 [<ffffffff8111486f>] filemap_fdatawait+0x2f/0x40
 [<ffffffffa007de69>] jbd2_journal_commit_transaction+0x7e9/0x14b0 [jbd2]
 [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
 [<ffffffff8107eabb>] ? try_to_del_timer_sync+0x7b/0xe0
 [<ffffffffa0083f78>] kjournald2+0xb8/0x220 [jbd2]
 [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0083ec0>] ? kjournald2+0x0/0x220 [jbd2]
 [<ffffffff81091d66>] kthread+0x96/0xa0
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffff81091cd0>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20

Resolution

Upgrade to kernel-2.6.32-358.el6, released with errata RHSA-2013:0496.

Root Cause

The I/O failed to complete because the MD RAID5 device became deadlocked. The MD RAID5 driver uses several lists and bits of state information to track the state of a RAID5 stripe. Under some rare conditions, when reading in part of a modified stripe, the stripe information could be placed on a list in the wrong state, leaving MD convinced that no new I/O should be started and deadlocking the RAID5 MD device.
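
To make the deadlock concrete, below is a simplified, kernel-context sketch of the stripe bookkeeping involved. It paraphrases the upstream 2.6.32-era drivers/md/raid5.c rather than quoting the RHEL source; the _sketch suffixes are added here, while the list and field names (delayed_list, handle_list, hold_list, preread_active_stripes, the STRIPE_* bits) are the same ones examined in the Diagnostic Steps below. A stripe that needs its old data read in first is parked on delayed_list; the raid5 daemon only drains delayed_list once preread_active_stripes falls below IO_THRESHOLD, and that counter normally only drops as stripes carrying STRIPE_PREREAD_ACTIVE are handled.

/*
 * Simplified sketch of the raid5 stripe bookkeeping, paraphrased from the
 * upstream 2.6.32-era drivers/md/raid5.c (not the verbatim RHEL 6 code).
 */

/* Called when the last reference to a stripe is dropped. */
static void __release_stripe_sketch(struct r5conf *conf, struct stripe_head *sh)
{
        if (test_bit(STRIPE_HANDLE, &sh->state)) {
                if (test_bit(STRIPE_DELAYED, &sh->state))
                        /*
                         * Deadlock window: if STRIPE_PREREAD_ACTIVE is also
                         * set at this point, the stripe is parked on
                         * delayed_list while still being counted in
                         * preread_active_stripes.  (The upstream fix adds an
                         * additional !test_bit(STRIPE_PREREAD_ACTIVE, ...)
                         * check to this branch.)
                         */
                        list_add_tail(&sh->lru, &conf->delayed_list);
                else
                        list_add_tail(&sh->lru, &conf->handle_list);
                md_wakeup_thread(conf->mddev->thread);
        } else if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
                /* Normally the preread counter drains here as stripes finish. */
                atomic_dec(&conf->preread_active_stripes);
        }
}

/*
 * The raid5 daemon only moves delayed stripes onto the handling lists once
 * the preread counter has drained below IO_THRESHOLD.
 */
static void raid5_activate_delayed_sketch(struct r5conf *conf)
{
        if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
                while (!list_empty(&conf->delayed_list)) {
                        struct stripe_head *sh =
                                list_entry(conf->delayed_list.next,
                                           struct stripe_head, lru);
                        list_del_init(&sh->lru);
                        clear_bit(STRIPE_DELAYED, &sh->state);
                        if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
                                atomic_inc(&conf->preread_active_stripes);
                        list_add_tail(&sh->lru, &conf->hold_list);
                }
        }
}

Once stripes that already account for preread_active_stripes end up on delayed_list, the counter can never drain (those stripes are never handled) and delayed_list can never be activated (the counter never drops below the threshold). That is exactly the state captured in the Diagnostic Steps below.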

Diagnostic Steps

  • In a vmcore, the issue can be seen by finding md's r5conf structure and then checking the state of the stripe_head structures. The r5conf will have a non-zero preread_active_stripes, the hold_list and handle_list will be empty, and the stripe_heads with STRIPE_PREREAD_ACTIVE set will be on the delayed_list:
crash> p ((struct r5conf *)0xffff880c6d33e800)->preread_active_stripes
$16 = {
  counter = 3
}

crash> p &((struct r5conf *)0xffff880c6d33e800)->delayed_list
$17 = (struct list_head *) 0xffff880c6d33e878

crash> list -H 0xffff880c6d33e878 -o stripe_head.lru -s stripe_head.state
  • At least one stripe_head struct will have STRIPE_PREREAD_ACTIVE set in its state value; see the sketch below for how to interpret the state bits.
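
The following is an illustrative, kernel-context helper showing the condition being checked by hand above. It is not part of the shipped code, and the bit numbers behind STRIPE_PREREAD_ACTIVE and STRIPE_DELAYED vary between kernel trees, so take them from the drivers/md/raid5.h that matches the installed kernel before decoding the printed state value.

/*
 * Illustrative only -- not from the shipped kernel.  A stripe sitting on
 * delayed_list should not normally still have STRIPE_PREREAD_ACTIVE set;
 * seeing both bits together in stripe_head.state is the signature of this
 * deadlock.  Look up the bit numbers in the drivers/md/raid5.h matching the
 * running kernel and compare (1UL << bit) against the state value printed
 * by the "list ... -s stripe_head.state" command above.
 */
static bool stripe_state_is_deadlocked(unsigned long state)
{
        return test_bit(STRIPE_PREREAD_ACTIVE, &state) &&
               test_bit(STRIPE_DELAYED, &state);
}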

Detailed vmcore analysis

  • Analysis #1

- Found pid 2070 as the blocked task triggering the hung task panic.

backtrace <begin>
PID: 2070   TASK: ffff88066c75c040  CPU: 0   COMMAND: "jbd2/md0-8"
 #0 [ffff88066bf5fa70] schedule at ffffffff814fd7e2
 #1 [ffff88066bf5fb38] io_schedule at ffffffff814fdfc3
 #2 [ffff88066bf5fb58] sync_page at ffffffff811140fd
 #3 [ffff88066bf5fb68] __wait_on_bit at ffffffff814fe97f
 #4 [ffff88066bf5fbb8] wait_on_page_bit at ffffffff81114333
 #5 [ffff88066bf5fc18] wait_on_page_writeback_range at ffffffff811147ab
 #6 [ffff88066bf5fd18] filemap_fdatawait at ffffffff8111486f
 #7 [ffff88066bf5fd28] jbd2_journal_commit_transaction at ffffffffa007de69 [jbd2]
 #8 [ffff88066bf5fe68] kjournald2 at ffffffffa0083f78 [jbd2]
 #9 [ffff88066bf5fee8] kthread at ffffffff81091d66
#10 [ffff88066bf5ff48] kernel_thread at ffffffff8100c14a
crash>
backtrace <end>

END: Basic hung vmcore analysis - looks fairly standard. Perhaps just waiting on I/O to complete? However, there is no in-progress I/O in the vmcore, so we cannot simply be waiting on I/O. It looks like we are looking for a missed wakeup.

crash> dev -d
MAJOR GENDISK            NAME       REQUEST_QUEUE    TOTAL ASYNC  SYNC   DRV
    8 ffff880c6e167c00   sdb        ffff88066c686378     0     0     0     0
    8 ffff880c707cec00   sdc        ffff88066c6eeee0     0     0     0     0
    8 ffff880c7076ac00   sdd        ffff88066c6ee3b8     0     0     0     0
    8 ffff880c707eb400   sde        ffff88066c6f0f20     0     0     0     0
    8 ffff880c7077d400   sdf        ffff88066c6f03f8     0     0     0     0
    8 ffff880c707f0400   sdg        ffff88066c6f2f60     0     0     0     0
    8 ffff880c70510000   sdh        ffff88066c6f2438     0     0     0     0
    8 ffff880c704dfc00   sdj        ffff88066c6f4478     0     0     0     0
    8 ffff880c704e7000   sdi        ffff88066c6f4fa0     0     0     0     0
    8 ffff880c704c9400   sdk        ffff88066c6f6fe0     0     0     0     0
    8 ffff880c704c5800   sdl        ffff88066c6f64b8     0     0     0     0
    8 ffff880c704a3400   sdo        ffff88066c6fb060     0     0     0     0
    8 ffff880c704c1000   sdm        ffff88066c6f9020     0     0     0     0
    8 ffff880c7049f000   sdp        ffff88066c6fa538     0     0     0     0
    8 ffff880c704ac800   sdn        ffff88066c6f84f8     0     0     0     0
    8 ffff880c6e12f800   sda        ffff88066c686ea0     0     0     0     0
  253 ffff88066d7c8c00   dm-0       ffff88066c6fd0a0     0     0     0     0
    9 ffff880c6ddcb800   md0        ffff880c6f4d2b60     0     0     0     0
    9 ffff88066cd6c800   md1        ffff88066c6fc578     0     0     0     0
  253 ffff88066e31dc00   dm-1       ffff88066c4a0338     0     0     0     0
  253 ffff88066c484c00   dm-2       ffff88066f0b30e0     0     0     0     0
  253 ffff88066e31d400   dm-3       ffff88066f0b25b8     0     0     0     0
  253 ffff88066ceecc00   dm-4       ffff88066eefd120     0     0     0     0
  253 ffff88066e717400   dm-5       ffff88066eefc5f8     0     0     0     0
crash>

Look at how long things have been blocked, and what else is blocked.
crash> ps -l | ~/crash/bin/ps_dt | grep UN
[ 0 00:00:06.483] [UN]  PID: 2833   TASK: ffff88066ef5f540  CPU: 20  COMMAND: "nfsd"
[ 0 00:00:10.744] [UN]  PID: 6754   TASK: ffff88066c784aa0  CPU: 0   COMMAND: "flush-9:0"
[ 0 00:00:41.195] [UN]  PID: 2831   TASK: ffff88066f226aa0  CPU: 20  COMMAND: "nfsd"
[ 0 00:02:06.308] [UN]  PID: 2830   TASK: ffff88066f227500  CPU: 7   COMMAND: "nfsd"
[ 0 00:02:45.716] [UN]  PID: 2070   TASK: ffff88066c75c040  CPU: 0   COMMAND: "jbd2/md0-8"   <--------- triggered hung task panic
[ 0 00:03:06.098] [UN]  PID: 2835   TASK: ffff88066ef5e080  CPU: 10  COMMAND: "nfsd"

Only 2835 has been blocked longer:

crash> bt 2835
PID: 2835   TASK: ffff88066ef5e080  CPU: 10  COMMAND: "nfsd"
 #0 [ffff88066f1e5a00] schedule at ffffffff814fd7e2
 #1 [ffff88066f1e5ac8] io_schedule at ffffffff814fdfc3
 #2 [ffff88066f1e5ae8] sync_page at ffffffff811140fd
 #3 [ffff88066f1e5af8] __wait_on_bit at ffffffff814fe97f
 #4 [ffff88066f1e5b48] wait_on_page_bit at ffffffff81114333
 #5 [ffff88066f1e5ba8] wait_on_page_writeback_range at ffffffff811147ab
 #6 [ffff88066f1e5ca8] filemap_write_and_wait_range at ffffffff81114978
 #7 [ffff88066f1e5cd8] vfs_fsync_range at ffffffff811a9fae
 #8 [ffff88066f1e5d28] vfs_fsync at ffffffff811aa07d
 #9 [ffff88066f1e5d38] nfsd_commit at ffffffffa0327e0b [nfsd]
#10 [ffff88066f1e5d78] nfsd4_commit at ffffffffa0333608 [nfsd]
#11 [ffff88066f1e5d88] nfsd4_proc_compound at ffffffffa0332f68 [nfsd]
#12 [ffff88066f1e5dd8] nfsd_dispatch at ffffffffa032043e [nfsd]
#13 [ffff88066f1e5e18] svc_process_common at ffffffffa02975d4 [sunrpc]
#14 [ffff88066f1e5e98] svc_process at ffffffffa0297c10 [sunrpc]
#15 [ffff88066f1e5eb8] nfsd at ffffffffa0320b62 [nfsd]
#16 [ffff88066f1e5ee8] kthread at ffffffff81091d66
#17 [ffff88066f1e5f48] kernel_thread at ffffffff8100c14a
  • Analysis #2
The task that caused the hung_task panic, 2070, was waiting for PG_writeback to clear on (struct page *) 0xffffea002b11b2b8.
In the vmcore, PG_writeback is still set, so the task should still be waiting.

Here are the associated inode and dentries, as well as the mount point.
crash> files -d ffff880c6d97e800
     DENTRY           INODE           SUPERBLK     TYPE PATH
ffff880c6d97e800 ffff880c550940c0 ffff880c6ea8d800 REG  /nfs/home/arachako/ccsg-complaints/training/models/themes/gms/example_predict_vectors
crash> files -d ffff880c6d97e740
     DENTRY           INODE           SUPERBLK     TYPE PATH
ffff880c6d97e740 ffff880c550940c0 ffff880c6ea8d800 REG

Now get the associated mount point:
crash> mount | grep ffff880c6ea8d800
ffff880c6ed10e80 ffff880c6ea8d800 ext4   /dev/md0  /nfs/home 


Details
=======

Here's the panic task

crash> bt
PID: 2070   TASK: ffff88066c75c040  CPU: 0   COMMAND: "jbd2/md0-8"
 #0 [ffff88066bf5fa70] schedule at ffffffff814fd7e2
 #1 [ffff88066bf5fb38] io_schedule at ffffffff814fdfc3
 #2 [ffff88066bf5fb58] sync_page at ffffffff811140fd
 #3 [ffff88066bf5fb68] __wait_on_bit at ffffffff814fe97f
 #4 [ffff88066bf5fbb8] wait_on_page_bit at ffffffff81114333
 #5 [ffff88066bf5fc18] wait_on_page_writeback_range at ffffffff811147ab
 #6 [ffff88066bf5fd18] filemap_fdatawait at ffffffff8111486f
 #7 [ffff88066bf5fd28] jbd2_journal_commit_transaction at ffffffffa007de69 [jbd2]
 #8 [ffff88066bf5fe68] kjournald2 at ffffffffa0083f78 [jbd2]
 #9 [ffff88066bf5fee8] kthread at ffffffff81091d66
#10 [ffff88066bf5ff48] kernel_thread at ffffffff8100c14a

This task is simply waiting for PG_writeback to clear: the page is under writeback, so we are waiting for that writeback to complete.

Let's get the page and see what we can find out. Use the transition between these two frames to locate the page:
 #4 [ffff88066bf5fbb8] wait_on_page_bit at ffffffff81114333
 #5 [ffff88066bf5fc18] wait_on_page_writeback_range at ffffffff811147ab

<exercise for the reader>

crash> bt -f | grep wait_on_page_writeback_range
 #5 [ffff88066bf5fc18] wait_on_page_writeback_range at ffffffff811147ab
crash> px (0xffff88066bf5fc18 - 1*8)
$29 = 0xffff88066bf5fc10
crash> rd 0xffff88066bf5fc10
ffff88066bf5fc10:  ffff88066bf5fd10                    ...k....
crash> px (0xffff88066bf5fd10 - 0xd8)
$30 = 0xffff88066bf5fc38
crash> rd 0xffff88066bf5fc38
ffff88066bf5fc38:  ffffea002b11b2b8                    ...+....
crash> p (struct page *)0xffffea002b11b2b8
$31 = (struct page *) 0xffffea002b11b2b8
crash> source ~/crash/kernel/pageflags 
crash> rhel6-pageflags 0xffffea002b11b2b8
 PG_referenced(2) PG_uptodate(3) PG_lru(5) PG_private(11) PG_writeback(13)

The page still has PG_writeback set, so we should indeed still be waiting. Why is there no I/O in flight, though?
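
For context on why a stuck PG_writeback points at a lost write rather than a missed wakeup: the bit is only cleared by end_page_writeback(), which the filesystem's bio completion handler calls once the block device reports the write done, and that same call is what wakes the wait_on_page_bit() sleepers. A minimal kernel-context sketch of that contract follows; the helper name is made up for illustration, while end_page_writeback() itself is the real mm API.

#include <linux/pagemap.h>
#include <linux/page-flags.h>

/*
 * Sketch of the completion contract both blocked tasks rely on.  In the
 * normal path, the filesystem's bio end_io handler runs something like this
 * for each written page once the underlying device completes the write:
 */
static void example_write_done(struct page *page)
{
        /* clears PG_writeback and wakes wait_on_page_bit(page, PG_writeback)
         * sleepers such as jbd2/md0-8 (pid 2070) and nfsd (pid 2835) above */
        end_page_writeback(page);
}
/*
 * With the write stuck inside MD RAID5 (see Root Cause), the bio handed to
 * /dev/md0 never completes, this call never happens, and both tasks sleep
 * indefinitely even though no I/O is outstanding at the member disks.
 */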

Now get the inode:
crash> p $31->mapping->host
$32 = (struct inode *) 0xffff880c550940c0

Get the dentries:

crash> p &$32->i_dentry
$33 = (struct list_head *) 0xffff880c550940f0
crash> list dentry.d_alias -H 0xffff880c550940f0
ffff880c6d97e800
ffff880c6d97e740

Simple check to make sure the dentry points back to the inode:
crash> dentry.d_inode ffff880c6d97e800
  d_inode = 0xffff880c550940c0
crash> dentry.d_inode ffff880c6d97e740
  d_inode = 0xffff880c550940c0

crash> files -d ffff880c6d97e800
     DENTRY           INODE           SUPERBLK     TYPE PATH
ffff880c6d97e800 ffff880c550940c0 ffff880c6ea8d800 REG  /nfs/home/arachako/ccsg-complaints/training/models/themes/gms/example_predict_vectors
crash> files -d ffff880c6d97e740
     DENTRY           INODE           SUPERBLK     TYPE PATH
ffff880c6d97e740 ffff880c550940c0 ffff880c6ea8d800 REG

Now get the associated mount point:
crash> mount | grep ffff880c6ea8d800
ffff880c6ed10e80 ffff880c6ea8d800 ext4   /dev/md0  /nfs/home
  • Analysis #3
The other task, 2835, had been blocked a bit longer (over 3 minutes) but for some reason did not trigger the hung task panic. It was waiting for the same PG_writeback bit to clear on the same page.


Details
=======

Let's look at the other task, which was blocked longer. It is unclear why this one didn't trigger the hung_task panic.

crash> bt
PID: 2835   TASK: ffff88066ef5e080  CPU: 10  COMMAND: "nfsd"
 #0 [ffff88066f1e5a00] schedule at ffffffff814fd7e2
 #1 [ffff88066f1e5ac8] io_schedule at ffffffff814fdfc3
 #2 [ffff88066f1e5ae8] sync_page at ffffffff811140fd
 #3 [ffff88066f1e5af8] __wait_on_bit at ffffffff814fe97f
 #4 [ffff88066f1e5b48] wait_on_page_bit at ffffffff81114333
 #5 [ffff88066f1e5ba8] wait_on_page_writeback_range at ffffffff811147ab
 #6 [ffff88066f1e5ca8] filemap_write_and_wait_range at ffffffff81114978
 #7 [ffff88066f1e5cd8] vfs_fsync_range at ffffffff811a9fae
 #8 [ffff88066f1e5d28] vfs_fsync at ffffffff811aa07d
 #9 [ffff88066f1e5d38] nfsd_commit at ffffffffa0327e0b [nfsd]
#10 [ffff88066f1e5d78] nfsd4_commit at ffffffffa0333608 [nfsd]
#11 [ffff88066f1e5d88] nfsd4_proc_compound at ffffffffa0332f68 [nfsd]
#12 [ffff88066f1e5dd8] nfsd_dispatch at ffffffffa032043e [nfsd]
#13 [ffff88066f1e5e18] svc_process_common at ffffffffa02975d4 [sunrpc]
#14 [ffff88066f1e5e98] svc_process at ffffffffa0297c10 [sunrpc]
#15 [ffff88066f1e5eb8] nfsd at ffffffffa0320b62 [nfsd]
#16 [ffff88066f1e5ee8] kthread at ffffffff81091d66
#17 [ffff88066f1e5f48] kernel_thread at ffffffff8100c14a

Get the page.

crash> bt -f | grep wait_on_page_writeback_range
 #5 [ffff88066f1e5ba8] wait_on_page_writeback_range at ffffffff811147ab
crash> px (0xffff88066f1e5ba8 - 1*8)
$34 = 0xffff88066f1e5ba0
crash> rd 0xffff88066f1e5ba0
ffff88066f1e5ba0:  ffff88066f1e5ca0                    .\.o....
crash> px (0xffff88066f1e5ca0  - 0xd8)
$35 = 0xffff88066f1e5bc8
crash> rd 0xffff88066f1e5bc8
ffff88066f1e5bc8:  ffffea002b11b2b8                    ...+....
crash> p (struct page *)0xffffea002b11b2b8

Same page as the panic task, 2070.

So this page should definitely have been submitted for I/O. Somewhere that I/O must have gotten stuck.

crash> p $38->mapping->a_ops->writepages
$47 = (int (*)(struct address_space *, struct writeback_control *)) 0xffffffffa00a94f0 <ext4_da_writepages>
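
To tie the ext4 side back to the MD deadlock described under Root Cause, the rough shape of the write path is sketched below. This paraphrases the upstream 2.6.32-era drivers/md/raid5.c flow; the function names (get_active_stripe, add_stripe_bio, release_stripe) come from that tree and are shown only to illustrate where the page's bio gets attached to a stripe, not as the exact RHEL code.

/*
 * Kernel-context sketch (paraphrased, not the verbatim RHEL source) of how
 * the stuck page's write reaches the stripe machinery examined in the
 * Diagnostic Steps section.
 */
static void raid5_write_path_sketch(struct r5conf *conf, struct bio *bi)
{
        /* ext4_da_writepages() built 'bi' against /dev/md0; raid5's
         * make_request maps it onto a stripe and queues the stripe: */
        struct stripe_head *sh;

        sh = get_active_stripe(conf, bi->bi_sector, 0, 0, 0);
        add_stripe_bio(sh, bi, /* dd_idx */ 0, /* forwrite */ 1);
        set_bit(STRIPE_HANDLE, &sh->state);
        release_stripe(sh);

        /*
         * Later, handle_stripe() sees a partial-stripe write, decides the old
         * data must be read in first, and marks the stripe STRIPE_DELAYED
         * unless STRIPE_PREREAD_ACTIVE is already set.  The delayed stripe is
         * then parked by __release_stripe(); with the buggy list placement
         * described in Root Cause it never leaves delayed_list, so this bio
         * never completes and PG_writeback on the page is never cleared.
         */
}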

