InfiniBand driver causes hung processes in rwsem_down_failed_common on RHEL 6 kernel versions earlier than kernel-2.6.32-462.el6
Issue
- The InfiniBand driver can deadlock with itself if a page fault occurs while it holds the mm_struct.mmap_sem lock.
- Messages like the following appear in the system log file:
INFO: task cl5936_main_lin:21734 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cl5936_main_l D 000000000000000a 0 21734 21730 0x00000080
ffff8804307cdca0 0000000000000086 0000000000000000 ffff880880018e18
00000037ffffffc8 ffff880880021b40 0000000000000000 ffff880800000041
ffff880736fd5078 ffff8804307cdfd8 000000000000f4e8 ffff880736fd5078
Call Trace:
[<ffffffff814eefb5>] rwsem_down_failed_common+0x95/0x1d0
[<ffffffff811238a1>] ? __alloc_pages_nodemask+0x111/0x940
[<ffffffff814ef146>] rwsem_down_read_failed+0x26/0x30
[<ffffffff81276e04>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff814ee644>] ? down_read+0x24/0x30
[<ffffffffa00191f2>] kcopy_get_pages+0xf2/0x190 [kcopy]
[<ffffffffa001963b>] kcopy_write+0x3ab/0x5d0 [kcopy]
[<ffffffff81052600>] ? __dequeue_entity+0x30/0x50
[<ffffffff811765d8>] vfs_write+0xb8/0x1a0
[<ffffffff810d46e2>] ? audit_syscall_entry+0x272/0x2a0
[<ffffffff81176fe1>] sys_write+0x51/0x90
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task cl5936_main_lin:21735 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cl5936_main_l D 0000000000000003 0 21735 21730 0x00000080
ffff880868e7b870 0000000000000082 0000000000000000 ffffea000180a538
ffffea000180a500 ffffea000180a4c8 ffffea000180a490 ffff880868e7b7e8
ffff88085de230f8 ffff880868e7bfd8 000000000000f4e8 ffff88085de230f8
Call Trace:
[<ffffffff8112a862>] ? shrink_inactive_list+0x4f2/0x740
[<ffffffff814eefb5>] rwsem_down_failed_common+0x95/0x1d0
[<ffffffff814ef146>] rwsem_down_read_failed+0x26/0x30
[<ffffffff81276e04>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff814ee644>] ? down_read+0x24/0x30
[<ffffffff81042b87>] __do_page_fault+0x187/0x480
[<ffffffff8116e0ed>] ? follow_trans_huge_pmd+0xed/0xf0
[<ffffffff8116e0c9>] ? follow_trans_huge_pmd+0xc9/0xf0
[<ffffffff8100bcee>] ? invalidate_interrupt5+0xe/0x20
[<ffffffff814f248e>] do_page_fault+0x3e/0xa0
[<ffffffff814ef845>] page_fault+0x25/0x30
[<ffffffff812759ed>] ? copy_user_generic_string+0x2d/0x40
[<ffffffffa015604b>] ? qib_user_sdma_writev+0x24b/0x1380 [ib_qib]
[<ffffffffa012efb0>] ? qib_aio_write+0x0/0x50 [ib_qib]
[<ffffffffa012eff3>] qib_aio_write+0x43/0x50 [ib_qib]
[<ffffffff8117619b>] do_sync_readv_writev+0xfb/0x140
[<ffffffff81090bf0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81055c83>] ? perf_event_task_sched_out+0x33/0x80
[<ffffffff8120c3c6>] ? security_file_permission+0x16/0x20
[<ffffffff8117722f>] do_readv_writev+0xcf/0x1f0
[<ffffffff814eca40>] ? thread_return+0x4e/0x77e
[<ffffffff81177396>] vfs_writev+0x46/0x60
[<ffffffff811774c1>] sys_writev+0x51/0xb0
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
- The displayed stack traces may contain one or more function names starting with qib, for example:
qib_aio_write+0x43/0x50 [ib_qib]
? qib_user_sdma_writev+0x24b/0x1380 [ib_qib]
- Processes that read the ‘/proc/*/cmdline’ files, such as the ps command, may also hang.
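The symptoms listed above can be checked for mechanically. The following is a minimal sketch, not a Red Hat tool: it scans log text for hung task reports and flags the log when a report blocks in rwsem_down_failed_common and a qib function appears in a trace. The function names find_hung_tasks and log_matches_issue are hypothetical, and the match is only a heuristic based on the messages shown in this article.

```python
import re

# Header line printed by the kernel hung task detector.
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds\."
)
# One call trace frame, e.g. "[<ffffffffa012eff3>] qib_aio_write+0x43/0x50 [ib_qib]".
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\? )?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def find_hung_tasks(log_text):
    """Return one dict per hung task report with its call trace frames."""
    reports = []
    current = None
    for line in log_text.splitlines():
        header = HUNG_RE.search(line)
        if header:
            current = {
                "comm": header.group("comm"),
                "pid": int(header.group("pid")),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            current["frames"].append((frame.group("func"), frame.group("module")))
    return reports


def log_matches_issue(reports):
    """Heuristic (assumption): a task blocked on an rwsem plus a qib frame."""
    funcs = {func for report in reports for func, _ in report["frames"]}
    blocked_on_rwsem = "rwsem_down_failed_common" in funcs
    has_qib_frame = any(func.startswith("qib_") for func in funcs)
    return blocked_on_rwsem and has_qib_frame
```

A possible use is to read the system log file into a string, pass it to find_hung_tasks, and then call log_matches_issue on the result. A positive match only suggests that the log resembles the traces in this article; it does not confirm the root cause.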
Environment
- Red Hat Enterprise Linux 6
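
To see whether a running system falls in the range named in the title, the kernel release string can be compared with kernel-2.6.32-462.el6. The following is a rough sketch under stated assumptions: the helper name is_in_affected_range is hypothetical, only the first build number after "2.6.32-" is compared, and a z-stream kernel with a lower build number might already carry a backported fix that this check cannot detect.

```python
import platform
import re

# Build number taken from the article title (kernel-2.6.32-462.el6).
REFERENCE_BUILD = 462

# Matches RHEL 6 release strings such as "2.6.32-431.el6.x86_64"
# or "2.6.32-431.20.3.el6.x86_64".
RELEASE_RE = re.compile(r"2\.6\.32-(\d+)(?:\.[\d.]+)?\.el6")


def is_in_affected_range(release):
    """Return True/False for RHEL 6 kernels, or None for anything else."""
    match = RELEASE_RE.match(release)
    if match is None:
        return None  # not a RHEL 6 2.6.32 kernel; this article does not apply
    return int(match.group(1)) < REFERENCE_BUILD


if __name__ == "__main__":
    release = platform.release()  # same string that `uname -r` prints
    print(release, is_in_affected_range(release))
```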
