A node is fenced after the logs show oom-killer killing a process in a RHEL 5, 6, or 7 High Availability cluster
Issue
- A cluster node got fenced, and I can see in the vmcore-dmesg.txt file in /var/crash that the oom-killer was invoked right before it got fenced. However, it didn't kill any cluster-related processes, so why did the node stop responding and get fenced?
<4>sh invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0, oom_score_adj=0
<6>sh cpuset=/ mems_allowed=0-1
<4>Pid: 21255, comm: sh Not tainted 2.6.32-358.6.2.el6.x86_64 #1
<4>Call Trace:
<4> [<ffffffff810cb5f1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
<4> [<ffffffff8111cdf0>] ? dump_header+0x90/0x1b0
<4> [<ffffffff8111cf5e>] ? check_panic_on_oom+0x4e/0x80
<4> [<ffffffff8111d64b>] ? out_of_memory+0x1bb/0x3c0
<4> [<ffffffff8112b8d0>] ? drain_local_pages+0x0/0x20
<4> [<ffffffff8112c35c>] ? __alloc_pages_nodemask+0x8ac/0x8d0
<4> [<ffffffff8116095a>] ? alloc_pages_current+0xaa/0x110
<4> [<ffffffff81129d3e>] ? __get_free_pages+0xe/0x50
<4> [<ffffffff8106bef4>] ? copy_process+0xe4/0x1450
<4> [<ffffffff8104757c>] ? __do_page_fault+0x1ec/0x480
<4> [<ffffffff8106d2f4>] ? do_fork+0x94/0x460
<4> [<ffffffff81009598>] ? sys_clone+0x28/0x30
<4> [<ffffffff8100b393>] ? stub_clone+0x13/0x20
<4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
- A node got fenced after a "panic_on_oom" message was seen in the logs
<0>Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
<0>
<4>Pid: 21255, comm: sh Not tainted 2.6.32-358.6.2.el6.x86_64 #1
<4>Call Trace:
<4> [<ffffffff8150d478>] ? panic+0xa7/0x16f
<4> [<ffffffff8111cef1>] ? dump_header+0x191/0x1b0
<4> [<ffffffff8111cf8c>] ? check_panic_on_oom+0x7c/0x80
<4> [<ffffffff8111d64b>] ? out_of_memory+0x1bb/0x3c0
<4> [<ffffffff8112b8d0>] ? drain_local_pages+0x0/0x20
<4> [<ffffffff8112c35c>] ? __alloc_pages_nodemask+0x8ac/0x8d0
<4> [<ffffffff8116095a>] ? alloc_pages_current+0xaa/0x110
<4> [<ffffffff81129d3e>] ? __get_free_pages+0xe/0x50
<4> [<ffffffff8106bef4>] ? copy_process+0xe4/0x1450
<4> [<ffffffff8104757c>] ? __do_page_fault+0x1ec/0x480
<4> [<ffffffff8106d2f4>] ? do_fork+0x94/0x460
<4> [<ffffffff81009598>] ? sys_clone+0x28/0x30
<4> [<ffffffff8100b393>] ? stub_clone+0x13/0x20
<4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
- A node was powered off by the cluster via fencing, and the sar data shows memory consumption climbing in the lead-up to the event, until eventually there was an oom-kill.
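The log excerpts above share two signatures: a line where a process invoked the oom-killer, and a kernel panic line stating that panic_on_oom is enabled. As a rough sketch only, the following Python snippet scans saved kernel log lines (for example, the contents of vmcore-dmesg.txt) for those two patterns. The function name scan_dmesg and the regular expressions are illustrative assumptions based on the messages quoted in this article, not part of any Red Hat tool.

```python
import re
import sys

# Patterns taken from the log excerpts quoted in this article (assumed to be
# representative; other kernel versions may word these messages differently).
OOM_INVOKED = re.compile(r"(?P<comm>\S+) invoked oom-killer")
OOM_PANIC = re.compile(r"Kernel panic - not syncing: Out of memory")


def scan_dmesg(lines):
    """Return (commands that invoked the oom-killer, whether an OOM panic was logged)."""
    invokers = []
    panicked = False
    for line in lines:
        # Drop the "<N>" log-priority prefix if present, e.g. "<4>".
        text = re.sub(r"^<\d+>", "", line.strip())
        match = OOM_INVOKED.search(text)
        if match:
            invokers.append(match.group("comm"))
        if OOM_PANIC.search(text):
            panicked = True
    return invokers, panicked


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python scan_oom.py <path to vmcore-dmesg.txt>
    with open(sys.argv[1], errors="replace") as handle:
        found, panic_seen = scan_dmesg(handle)
    print("oom-killer invoked by:", found)
    print("OOM panic logged:", panic_seen)
```

If both the oom-killer invocation and the panic line are present, the node most likely panicked on out-of-memory rather than killing a process, which would be consistent with the node becoming unresponsive and being fenced.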
Environment
- Red Hat Enterprise Linux (RHEL) 5, 6, or 7 with the High Availability Add On
- The sysctl parameter vm.panic_on_oom is set to 1
  - See Diagnostic Steps below for steps to check this
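As a minimal sketch, the current value of vm.panic_on_oom can be read from the proc filesystem, where Linux exposes sysctl parameters under /proc/sys. The helper name read_panic_on_oom is hypothetical, and the path parameter exists only so the function can be pointed at a different file. This is an illustration and does not replace the article's own Diagnostic Steps.

```python
from pathlib import Path

# Standard procfs location for the vm.panic_on_oom sysctl on Linux.
PANIC_ON_OOM_PATH = "/proc/sys/vm/panic_on_oom"


def read_panic_on_oom(path=PANIC_ON_OOM_PATH):
    """Return the integer value of vm.panic_on_oom read from the given file."""
    return int(Path(path).read_text().strip())


def panics_on_oom(value):
    """True if the kernel is configured to panic on OOM (any non-zero value)."""
    return value != 0


if __name__ == "__main__" and Path(PANIC_ON_OOM_PATH).exists():
    current = read_panic_on_oom()
    print("vm.panic_on_oom =", current)
    print("kernel panics on out-of-memory:", panics_on_oom(current))
```

A value of 1 matches the environment described in this article. A value of 0 means the kernel lets the oom-killer kill a process instead of panicking.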