A node is fenced after the logs show oom-killer killing a process in a RHEL 5, 6, or 7 High Availability cluster

Solution In Progress - Updated -

Issue

  • A cluster node got fenced, and the vmcore-dmesg.txt file in /var/crash shows that the oom-killer was invoked right before the node was fenced. However, it didn't kill any cluster-related processes, so why did the node stop responding and get fenced?
<4>sh invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0, oom_score_adj=0
<6>sh cpuset=/ mems_allowed=0-1
<4>Pid: 21255, comm: sh Not tainted 2.6.32-358.6.2.el6.x86_64 #1
<4>Call Trace:
<4> [<ffffffff810cb5f1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
<4> [<ffffffff8111cdf0>] ? dump_header+0x90/0x1b0
<4> [<ffffffff8111cf5e>] ? check_panic_on_oom+0x4e/0x80
<4> [<ffffffff8111d64b>] ? out_of_memory+0x1bb/0x3c0
<4> [<ffffffff8112b8d0>] ? drain_local_pages+0x0/0x20
<4> [<ffffffff8112c35c>] ? __alloc_pages_nodemask+0x8ac/0x8d0
<4> [<ffffffff8116095a>] ? alloc_pages_current+0xaa/0x110
<4> [<ffffffff81129d3e>] ? __get_free_pages+0xe/0x50
<4> [<ffffffff8106bef4>] ? copy_process+0xe4/0x1450
<4> [<ffffffff8104757c>] ? __do_page_fault+0x1ec/0x480
<4> [<ffffffff8106d2f4>] ? do_fork+0x94/0x460
<4> [<ffffffff81009598>] ? sys_clone+0x28/0x30
<4> [<ffffffff8100b393>] ? stub_clone+0x13/0x20
<4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
  • A node got fenced after a "panic_on_oom" message was seen in the logs
<0>Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
<0>
<4>Pid: 21255, comm: sh Not tainted 2.6.32-358.6.2.el6.x86_64 #1
<4>Call Trace:
<4> [<ffffffff8150d478>] ? panic+0xa7/0x16f
<4> [<ffffffff8111cef1>] ? dump_header+0x191/0x1b0
<4> [<ffffffff8111cf8c>] ? check_panic_on_oom+0x7c/0x80
<4> [<ffffffff8111d64b>] ? out_of_memory+0x1bb/0x3c0
<4> [<ffffffff8112b8d0>] ? drain_local_pages+0x0/0x20
<4> [<ffffffff8112c35c>] ? __alloc_pages_nodemask+0x8ac/0x8d0
<4> [<ffffffff8116095a>] ? alloc_pages_current+0xaa/0x110
<4> [<ffffffff81129d3e>] ? __get_free_pages+0xe/0x50
<4> [<ffffffff8106bef4>] ? copy_process+0xe4/0x1450
<4> [<ffffffff8104757c>] ? __do_page_fault+0x1ec/0x480
<4> [<ffffffff8106d2f4>] ? do_fork+0x94/0x460
<4> [<ffffffff81009598>] ? sys_clone+0x28/0x30
<4> [<ffffffff8100b393>] ? stub_clone+0x13/0x20
<4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
  • A node was powered off by the cluster via fencing, and in the sar data we see memory consumption climbing leading up to the event, until eventually there was an oom-kill
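
The two log signatures shown above (an "invoked oom-killer" line and the "panic_on_oom is enabled" kernel panic) can be searched for in a saved vmcore-dmesg.txt. The following is a minimal Python sketch and is not part of the original solution. The function name summarize_oom_events and both regular expressions are illustrative assumptions, and the sketch assumes lines formatted like the excerpts above (an optional <N> log-level prefix and no timestamps).

```python
import re

# Assumed patterns, modeled only on the log excerpts in this article.
OOM_RE = re.compile(r"^(?:<\d+>)?(?P<comm>.+?) invoked oom-killer:")
PANIC_RE = re.compile(
    r"Kernel panic - not syncing: Out of memory: .*panic_on_oom is enabled"
)


def summarize_oom_events(dmesg_text):
    """Report which commands invoked the oom-killer and whether the
    kernel panicked because panic_on_oom was enabled."""
    invokers = []
    panicked = False
    for line in dmesg_text.splitlines():
        match = OOM_RE.match(line)
        if match:
            invokers.append(match.group("comm"))
        if PANIC_RE.search(line):
            panicked = True
    return {"invokers": invokers, "panic_on_oom_panic": panicked}


# Example use (the path is hypothetical; pass the real file location):
#   with open("/path/to/vmcore-dmesg.txt") as f:
#       print(summarize_oom_events(f.read()))
```

If the result reports a panic, the node went down because the kernel panicked on the out-of-memory condition, not because the oom-killer terminated a cluster process.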

Environment

  • Red Hat Enterprise Linux (RHEL) 5, 6, or 7 with the High Availability Add-On
  • sysctl parameter vm.panic_on_oom is set to 1
    • See Diagnostic Steps below for steps to check this
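
As a quick illustration of the vm.panic_on_oom condition, the current value can be read from /proc/sys/vm/panic_on_oom on a running Linux node. The sketch below is an assumption-labeled example, not the article's own Diagnostic Steps: the function names read_panic_on_oom and describe_panic_on_oom are hypothetical, and the value descriptions paraphrase the upstream kernel sysctl documentation, so behavior on a specific RHEL kernel should be confirmed separately.

```python
def read_panic_on_oom(path="/proc/sys/vm/panic_on_oom"):
    """Return the integer value of vm.panic_on_oom read from procfs."""
    with open(path) as f:
        return int(f.read().strip())


def describe_panic_on_oom(value):
    """Describe the setting, paraphrasing the upstream kernel docs."""
    if value == 0:
        return "oom-killer selects and kills a process; no panic"
    if value == 1:
        # A cpuset- or mempolicy-constrained OOM may still kill a
        # process instead of panicking.
        return "kernel panics on a system-wide out-of-memory condition"
    if value == 2:
        return "kernel panics on any out-of-memory condition"
    return "unrecognized value"
```

Calling describe_panic_on_oom(read_panic_on_oom()) on a node that matches this Environment section would be expected to report the panic behavior. That matches the "system-wide panic_on_oom is enabled" panic shown in the Issue section, after which the unresponsive node is fenced.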
