A node is fenced after the logs show oom-killer killing a process in a RHEL 5, 6, or 7 High Availability cluster
Issue
- A cluster node got fenced, and I can see in the vmcore-dmesg.txt file in /var/crash that oom-killer was invoked right before it got fenced. However, it didn't kill any cluster-related processes, so why did the node stop responding and get fenced?
<4>sh invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0, oom_score_adj=0
<6>sh cpuset=/ mems_allowed=0-1
<4>Pid: 21255, comm: sh Not tainted 2.6.32-358.6.2.el6.x86_64 #1
<4>Call Trace:
<4> [<ffffffff810cb5f1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
<4> [<ffffffff8111cdf0>] ? dump_header+0x90/0x1b0
<4> [<ffffffff8111cf5e>] ? check_panic_on_oom+0x4e/0x80
<4> [<ffffffff8111d64b>] ? out_of_memory+0x1bb/0x3c0
<4> [<ffffffff8112b8d0>] ? drain_local_pages+0x0/0x20
<4> [<ffffffff8112c35c>] ? __alloc_pages_nodemask+0x8ac/0x8d0
<4> [<ffffffff8116095a>] ? alloc_pages_current+0xaa/0x110
<4> [<ffffffff81129d3e>] ? __get_free_pages+0xe/0x50
<4> [<ffffffff8106bef4>] ? copy_process+0xe4/0x1450
<4> [<ffffffff8104757c>] ? __do_page_fault+0x1ec/0x480
<4> [<ffffffff8106d2f4>] ? do_fork+0x94/0x460
<4> [<ffffffff81009598>] ? sys_clone+0x28/0x30
<4> [<ffffffff8100b393>] ? stub_clone+0x13/0x20
<4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
- A node got fenced after a "panic_on_oom" message was seen:
<0>Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
<0>
<4>Pid: 21255, comm: sh Not tainted 2.6.32-358.6.2.el6.x86_64 #1
<4>Call Trace:
<4> [<ffffffff8150d478>] ? panic+0xa7/0x16f
<4> [<ffffffff8111cef1>] ? dump_header+0x191/0x1b0
<4> [<ffffffff8111cf8c>] ? check_panic_on_oom+0x7c/0x80
<4> [<ffffffff8111d64b>] ? out_of_memory+0x1bb/0x3c0
<4> [<ffffffff8112b8d0>] ? drain_local_pages+0x0/0x20
<4> [<ffffffff8112c35c>] ? __alloc_pages_nodemask+0x8ac/0x8d0
<4> [<ffffffff8116095a>] ? alloc_pages_current+0xaa/0x110
<4> [<ffffffff81129d3e>] ? __get_free_pages+0xe/0x50
<4> [<ffffffff8106bef4>] ? copy_process+0xe4/0x1450
<4> [<ffffffff8104757c>] ? __do_page_fault+0x1ec/0x480
<4> [<ffffffff8106d2f4>] ? do_fork+0x94/0x460
<4> [<ffffffff81009598>] ? sys_clone+0x28/0x30
<4> [<ffffffff8100b393>] ? stub_clone+0x13/0x20
<4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
- A node was powered off by the cluster via fencing, and the sar data shows memory consumption climbing in the lead-up to the event until oom-killer was eventually invoked.
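The two log signatures shown above (an "invoked oom-killer" line followed by an out-of-memory kernel panic) can be searched for mechanically. The following is a minimal Python sketch, not part of the original solution: the function name scan_dmesg and the exact match strings are assumptions based only on the excerpts in this article, and other kernel versions may word these messages differently.

```python
import re

# Patterns taken from the log excerpts in this article (assumed, not exhaustive).
OOM_INVOKED = re.compile(r"(?P<comm>\S+) invoked oom-killer")
OOM_PANIC = "Kernel panic - not syncing: Out of memory"


def scan_dmesg(lines):
    """Return (list of commands that invoked oom-killer, True if an OOM panic was logged)."""
    invokers = []
    panicked = False
    for line in lines:
        # Drop the leading "<N>" log-level prefix, e.g. "<4>" or "<0>".
        text = re.sub(r"^<\d+>", "", line.strip())
        match = OOM_INVOKED.search(text)
        if match:
            invokers.append(match.group("comm"))
        if OOM_PANIC in text:
            panicked = True
    return invokers, panicked
```

To use it, pass the lines of the relevant vmcore-dmesg.txt file, for example scan_dmesg(open(path)) where path is the file under /var/crash for the crash in question. If the second value is True, the node panicked on the out-of-memory condition rather than surviving an oom-kill, which would explain why it stopped responding and was fenced.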
Environment
- Red Hat Enterprise Linux (RHEL) 5, 6, or 7 with the High Availability Add On
- sysctl parameter vm.panic_on_oom is set to 1
  - See Diagnostic Steps below for steps to check this
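
As an illustration of what this setting means, the current value can be read from /proc/sys/vm/panic_on_oom on a running node. The sketch below is an assumption-laden helper, not the article's own diagnostic procedure: the function names are hypothetical, and the value descriptions paraphrase the upstream kernel sysctl documentation (value 2 may not be recognized by older kernels).

```python
from pathlib import Path

# Meanings paraphrased from the upstream kernel vm sysctl documentation (assumed
# to apply to the RHEL kernels discussed here; older kernels may not support 2).
PANIC_ON_OOM_MEANING = {
    0: "oom-killer kills a process and the system keeps running",
    1: "kernel panics on a system-wide OOM (not on cpuset/mempolicy-limited OOM)",
    2: "kernel always panics on OOM",
}


def read_panic_on_oom(path="/proc/sys/vm/panic_on_oom"):
    """Read the current vm.panic_on_oom value as an integer."""
    return int(Path(path).read_text().strip())


def describe_panic_on_oom(value):
    """Return a short description of a vm.panic_on_oom value."""
    return PANIC_ON_OOM_MEANING.get(value, "unrecognized value")
```

On a node matching this environment, describe_panic_on_oom(read_panic_on_oom()) would be expected to report the panic behavior for value 1, which is consistent with the "system-wide panic_on_oom is enabled" message in the panic output.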
