Transparent huge pages (khugepaged) contention
Issue
- MPI jobs typically run 256 processes (16 per host) and do not consume all available RAM.
- The jobs run for several days, but after the first 24 hours users notice that latency degrades.
- The host log files show that processes are blocked during memory access for a considerable period of time.
- Sometimes "khugepaged" itself is also reported as blocked:
2013-03-28 22:32:25 INFO: task khugepaged:357 blocked for more than 120 seconds.
2013-03-28 22:32:25 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2013-03-28 22:32:25 khugepaged D 000000000000000a 0 357 2 0x00000000
2013-03-28 22:32:25 ffff88083169fc90 0000000000000046 0000000000000003 ffff880840021b40
2013-03-28 22:32:25 ffff880831696080 00000000004352da ffff88083169fd40 ffffffff8112c02b
2013-03-28 22:32:25 ffff880831696638 ffff88083169ffd8 000000000000fb88 ffff880831696638
2013-03-28 22:32:25 Call Trace:
2013-03-28 22:32:25 [<ffffffff8112c02b>] ? __alloc_pages_nodemask+0x57b/0x8d0
2013-03-28 22:32:25 [<ffffffff8150fd25>] rwsem_down_failed_common+0x95/0x1d0
2013-03-28 22:32:25 [<ffffffff8150fe83>] rwsem_down_write_failed+0x23/0x30
2013-03-28 22:32:25 [<ffffffff812834c3>] call_rwsem_down_write_failed+0x13/0x20
2013-03-28 22:32:25 [<ffffffff8150f382>] ? down_write+0x32/0x40
2013-03-28 22:32:25 [<ffffffff81179f96>] khugepaged+0x7f6/0x1310
2013-03-28 22:32:25 [<ffffffff81096ca0>] ? autoremove_wake_function+0x0/0x40
2013-03-28 22:32:25 [<ffffffff811797a0>] ? khugepaged+0x0/0x1310
2013-03-28 22:32:25 [<ffffffff81096936>] kthread+0x96/0xa0
2013-03-28 22:32:25 [<ffffffff8100c0ca>] child_rip+0xa/0x20
2013-03-28 22:32:25 [<ffffffff810968a0>] ? kthread+0x0/0xa0
2013-03-28 22:32:25 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
2013-03-28 22:32:25 Kernel panic - not syncing: hung_task: blocked tasks
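To spot reports like the one above across many hosts, the hung-task lines can be pulled out of the logs with a short script. The following is a minimal Python sketch, not part of the original report: the function name `find_hung_tasks` and the regular expression are illustrative assumptions based only on the message format shown in the excerpt.

```python
import re

# Matches the hung-task report line seen in the excerpt, e.g.
# "INFO: task khugepaged:357 blocked for more than 120 seconds."
# The pattern is an assumption derived from that single message format.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return (task name, pid, threshold in seconds) for each hung-task report."""
    reports = []
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            reports.append(
                (match.group("comm"), int(match.group("pid")), int(match.group("secs")))
            )
    return reports


if __name__ == "__main__":
    sample = [
        "2013-03-28 22:32:25 INFO: task khugepaged:357 blocked for more than 120 seconds.",
        "2013-03-28 22:32:25 Call Trace:",
    ]
    for comm, pid, secs in find_hung_tasks(sample):
        print(f"{comm} (pid {pid}) blocked for more than {secs}s")
```

Counting how often "khugepaged" appears in the result, compared with the MPI processes, gives a rough view of how frequently the daemon itself is the blocked task.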
Environment
- Red Hat Enterprise Linux 6
- Dell M620 systems (~3000 cores)
- Dual Intel Xeon E5-2600 (Sandy Bridge) CPUs
- 64 GB RAM
- Mellanox QDR InfiniBand network
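When checking whether a host in such an environment has transparent huge pages active, the current mode can be read from sysfs. The sketch below is a hedged Python example and not taken from the article: it assumes the setting is exposed at `/sys/kernel/mm/redhat_transparent_hugepage/enabled` on Red Hat Enterprise Linux 6 kernels (and at `/sys/kernel/mm/transparent_hugepage/enabled` on upstream kernels), and that the active value is the one shown in square brackets. The helper names are hypothetical.

```python
from pathlib import Path

# Assumed sysfs locations: RHEL 6 kernels use the "redhat_" prefixed
# directory, upstream kernels use the plain one.
THP_ENABLED_PATHS = (
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
    "/sys/kernel/mm/transparent_hugepage/enabled",
)


def active_thp_setting(text):
    """Return the bracketed (active) value from text like '[always] madvise never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_thp_setting(paths=THP_ENABLED_PATHS):
    """Return the active THP mode from the first readable path, or None."""
    for path in paths:
        try:
            return active_thp_setting(Path(path).read_text())
        except OSError:
            # Path missing or unreadable on this kernel; try the next one.
            continue
    return None


if __name__ == "__main__":
    print("THP mode:", read_thp_setting())
```

Running this on each compute node only reports the current mode; it does not change any setting.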