InfiniBand driver causes hung processes in rwsem_down_failed_common on RHEL 6 kernels earlier than kernel-2.6.32-462.el6

Solution In Progress - Updated -

Issue

  • The InfiniBand driver can deadlock against itself if a page fault occurs while it holds the mm_struct.mmap_sem semaphore.

  • Messages like the following appear in the system log file.

INFO: task cl5936_main_lin:21734 blocked for more than 120 seconds. 
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
cl5936_main_l D 000000000000000a     0 21734  21730 0x00000080 
 ffff8804307cdca0 0000000000000086 0000000000000000 ffff880880018e18 
 00000037ffffffc8 ffff880880021b40 0000000000000000 ffff880800000041 
 ffff880736fd5078 ffff8804307cdfd8 000000000000f4e8 ffff880736fd5078 
Call Trace: 
 [<ffffffff814eefb5>] rwsem_down_failed_common+0x95/0x1d0 
 [<ffffffff811238a1>] ? __alloc_pages_nodemask+0x111/0x940 
 [<ffffffff814ef146>] rwsem_down_read_failed+0x26/0x30 
 [<ffffffff81276e04>] call_rwsem_down_read_failed+0x14/0x30 
 [<ffffffff814ee644>] ? down_read+0x24/0x30 
 [<ffffffffa00191f2>] kcopy_get_pages+0xf2/0x190 [kcopy] 
 [<ffffffffa001963b>] kcopy_write+0x3ab/0x5d0 [kcopy] 
 [<ffffffff81052600>] ? __dequeue_entity+0x30/0x50 
 [<ffffffff811765d8>] vfs_write+0xb8/0x1a0 
 [<ffffffff810d46e2>] ? audit_syscall_entry+0x272/0x2a0 
 [<ffffffff81176fe1>] sys_write+0x51/0x90 
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b 
INFO: task cl5936_main_lin:21735 blocked for more than 120 seconds. 
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
cl5936_main_l D 0000000000000003     0 21735  21730 0x00000080 
 ffff880868e7b870 0000000000000082 0000000000000000 ffffea000180a538 
 ffffea000180a500 ffffea000180a4c8 ffffea000180a490 ffff880868e7b7e8 
 ffff88085de230f8 ffff880868e7bfd8 000000000000f4e8 ffff88085de230f8 
Call Trace: 
 [<ffffffff8112a862>] ? shrink_inactive_list+0x4f2/0x740 
 [<ffffffff814eefb5>] rwsem_down_failed_common+0x95/0x1d0 
 [<ffffffff814ef146>] rwsem_down_read_failed+0x26/0x30 
 [<ffffffff81276e04>] call_rwsem_down_read_failed+0x14/0x30 
 [<ffffffff814ee644>] ? down_read+0x24/0x30 
 [<ffffffff81042b87>] __do_page_fault+0x187/0x480 
 [<ffffffff8116e0ed>] ? follow_trans_huge_pmd+0xed/0xf0 
 [<ffffffff8116e0c9>] ? follow_trans_huge_pmd+0xc9/0xf0 
 [<ffffffff8100bcee>] ? invalidate_interrupt5+0xe/0x20 
 [<ffffffff814f248e>] do_page_fault+0x3e/0xa0 
 [<ffffffff814ef845>] page_fault+0x25/0x30 
 [<ffffffff812759ed>] ? copy_user_generic_string+0x2d/0x40 
 [<ffffffffa015604b>] ? qib_user_sdma_writev+0x24b/0x1380 [ib_qib] 
 [<ffffffffa012efb0>] ? qib_aio_write+0x0/0x50 [ib_qib] 
 [<ffffffffa012eff3>] qib_aio_write+0x43/0x50 [ib_qib] 
 [<ffffffff8117619b>] do_sync_readv_writev+0xfb/0x140 
 [<ffffffff81090bf0>] ? autoremove_wake_function+0x0/0x40 
 [<ffffffff81055c83>] ? perf_event_task_sched_out+0x33/0x80 
 [<ffffffff8120c3c6>] ? security_file_permission+0x16/0x20 
 [<ffffffff8117722f>] do_readv_writev+0xcf/0x1f0 
 [<ffffffff814eca40>] ? thread_return+0x4e/0x77e 
 [<ffffffff81177396>] vfs_writev+0x46/0x60 
 [<ffffffff811774c1>] sys_writev+0x51/0xb0 
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b 
  • The stack traces may contain one or more function names starting with qib.
 Examples :     qib_aio_write+0x43/0x50 [ib_qib] 
                ? qib_user_sdma_writev+0x24b/0x1380 [ib_qib]
  • Processes that read the files ‘/proc/*/cmdline’, such as the ps command, may also hang.
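
The symptoms listed above can be checked for mechanically. The following is a minimal sketch, not a supported diagnostic tool: it scans log text for hung-task reports and flags those whose call trace contains a qib function or an [ib_qib] frame. The function names find_hung_tasks and involves_qib are hypothetical, and the regular expressions assume the log format shown in the excerpt above.

```python
import re

# Header line printed by the kernel's hung-task detector for each blocked task.
HUNG_RE = re.compile(
    r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"
)

# A call-trace frame, e.g.
#   [<ffffffffa012eff3>] qib_aio_write+0x43/0x50 [ib_qib]
# group 1: optional "? " marker, group 2: symbol, group 3: module (if any).
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?"
)


def find_hung_tasks(log_text):
    """Return one dict per hung-task report: task name, pid, and frames."""
    reports = []
    current = None
    for line in log_text.splitlines():
        header = HUNG_RE.search(line)
        if header:
            current = {
                "task": header.group(1),
                "pid": int(header.group(2)),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            # Store (symbol, module); module is None for core kernel frames.
            current["frames"].append((frame.group(2), frame.group(3)))
    return reports


def involves_qib(report):
    """True if any frame is a qib_* symbol or belongs to the ib_qib module."""
    return any(
        symbol.startswith("qib_") or module == "ib_qib"
        for symbol, module in report["frames"]
    )
```

A caller would pass the contents of the system log to find_hung_tasks and then filter the result with involves_qib. A match only indicates that the trace resembles the one in this article; it does not by itself confirm the deadlock.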

Environment

  • Red Hat Enterprise Linux 6
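
To judge whether a given system falls in the affected range, the running kernel release can be compared with kernel-2.6.32-462.el6, the version named in the title. The sketch below is illustrative only: the helper names parse_el6_release and is_affected are hypothetical, and it assumes that a plain numeric comparison of the release field is sufficient, which may not hold for errata kernels that carry backported fixes.

```python
import re

# Release field of kernel-2.6.32-462.el6, the version named in this article.
FIXED_RELEASE = (462,)


def parse_el6_release(uname_r):
    """Split '2.6.32-431.29.2.el6.x86_64' into ('2.6.32', (431, 29, 2)).

    Returns None if the string is not a RHEL 6 (el6) kernel release.
    """
    match = re.match(r"^(\d+\.\d+\.\d+)-(\d+(?:\.\d+)*)\.el6", uname_r)
    if not match:
        return None
    release = tuple(int(part) for part in match.group(2).split("."))
    return match.group(1), release


def is_affected(uname_r):
    """True if the release is lower than 462, False if not, None if not el6.

    Assumption: numeric tuple comparison of the release field; errata
    kernels with backported fixes are not accounted for.
    """
    parsed = parse_el6_release(uname_r)
    if parsed is None or parsed[0] != "2.6.32":
        return None
    return parsed[1] < FIXED_RELEASE
```

On a live system the argument could come from platform.release() in the Python standard library, which returns the same string as uname -r.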
