Server partly unresponsive, caused VCS to fail

Solution Verified - Updated -

Environment

  • Red Hat Enterprise Linux 5
  • Veritas Cluster Server (VCS)

Issue

  • The server became almost entirely unresponsive; even a simple ls or cat would not work.
  • Veritas Cluster Server had crashed. The only option was to force a crash dump in order to recover the server.

Resolution

  • A third-party application, netprobe, was consuming a very large amount of memory, which caused the system to hang.
  • Check with the vendor of this application to determine why it is consuming so much memory.

Root Cause

  • The system ran out of memory when the issue occurred:
                       PAGES           TOTAL           PERCENTAGE
TOTAL MEM              66034077        251.9 GB        ----
FREE                   82252           321.3 MB        0% of TOTAL MEM
USED                   65951825        251.6 GB        99% of TOTAL MEM
SHARED                 13349           52.1 MB         0% of TOTAL MEM
BUFFERS                523             2 MB            0% of TOTAL MEM
CACHED                 9255            36.2 MB         0% of TOTAL MEM

Swap was 100% utilized:

TOTAL SWAP             8387928         32 GB           ----
SWAP USED              8387928         32 GB           100% of TOTAL SWAP
SWAP FREE              0               0               0% of TOTAL SWAP
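
The figures in the memory and swap tables can be cross-checked with simple arithmetic. The following is a minimal sketch, assuming a 4 KiB page size (typical for x86_64, but not stated in the output) and that the GB/MB columns use binary units. The helper names are illustrative only. Integer division is used for the percentages because it reproduces the values shown in the tables.

```python
# Cross-check of the page counts reported above.
# Assumption: 4 KiB pages; the output itself does not state the page size.
PAGE_SIZE = 4096  # bytes


def pages_to_gib(pages, page_size=PAGE_SIZE):
    """Convert a page count to GiB (2**30 bytes)."""
    return pages * page_size / 2**30


def pages_to_mib(pages, page_size=PAGE_SIZE):
    """Convert a page count to MiB (2**20 bytes)."""
    return pages * page_size / 2**20


def percent_of(part, total):
    """Integer percentage of total, truncated toward zero."""
    return part * 100 // total


# Page counts copied from the tables above.
total_mem, free_mem, used_mem = 66034077, 82252, 65951825
total_swap, used_swap = 8387928, 8387928

print("TOTAL MEM  %.1f GB" % pages_to_gib(total_mem))        # 251.9 GB
print("FREE       %.1f MB  %d%%" % (pages_to_mib(free_mem),
                                    percent_of(free_mem, total_mem)))
print("USED       %.1f GB  %d%%" % (pages_to_gib(used_mem),
                                    percent_of(used_mem, total_mem)))
print("SWAP USED  %.0f GB  %d%%" % (pages_to_gib(used_swap),
                                    percent_of(used_swap, total_swap)))
```

With both free memory and free swap effectively at zero, the kernel has nothing left to reclaim, which is consistent with even trivial commands failing to run.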

The following process was found to be consuming most of the memory:

PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
14054      1  17  ffff812f6fc82100  RU  92.2 268409640 249506188  netprobe_lin

crash> ps -a 14054
PID: 14054  TASK: ffff812f6fc82100  CPU: 17  COMMAND: "netprobe_lin"
ARG: /export/trdwatch/netprobe/newprod/bin/netprobe_lin 1 -port 7021
ENV: MANPATH=/usr/lib/perl5/man:/usr/local/man:/usr/dt/man:/usr/openwin/man:/usr/man:/usr/share/man:/usr/X11R6/man

The netprobe application is not provided by Red Hat, so please check with its vendor to determine why it is consuming so much memory.
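
On a system that is still responsive, the same kind of per-process ranking can be obtained without a crash dump by reading /proc. The following is a rough sketch, not a Red Hat tool: it assumes a Python 3 interpreter is available (which is not the default on Red Hat Enterprise Linux 5), and the function names are hypothetical. It relies only on the `Name:` and `VmRSS:` fields of /proc/<pid>/status.

```python
import os
import re


def parse_vmrss_kb(status_text):
    """Return the VmRSS value (kB) from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def parse_name(status_text):
    """Return the command name from /proc/<pid>/status text."""
    match = re.search(r"^Name:\s+(.*)$", status_text, re.MULTILINE)
    return match.group(1) if match else "?"


def top_rss(limit=5, proc_root="/proc"):
    """Return up to `limit` (rss_kb, pid, name) tuples, largest RSS first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                text = fh.read()
        except (IOError, OSError):
            continue  # process exited or is not readable
        rss = parse_vmrss_kb(text)
        if rss is not None:  # kernel threads have no VmRSS line
            rows.append((rss, int(entry), parse_name(text)))
    rows.sort(reverse=True)
    return rows[:limit]


if __name__ == "__main__" and os.path.isdir("/proc"):
    for rss_kb, pid, name in top_rss():
        print("%8d  %12d kB  %s" % (pid, rss_kb, name))
```

Running something like this periodically (for example from cron) and logging the output would show a process such as netprobe_lin growing well before memory and swap are exhausted.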

  • Because the system ran out of memory due to the netprobe application, Veritas Cluster Server also failed and reported the following errors in /var/log/messages:
Apr 25 07:14:07 host01 Had[26025]: VCS CRITICAL V-16-1-50086 Swap usage on foptr-ext01a is 93%
Apr 25 07:16:26 host01 Had[26025]: VCS CRITICAL V-16-1-50086 CPU usage on foptr-ext01a is 98%
Apr 25 07:16:33 host01 AgentFramework[26213]: VCS ERROR V-16-1-13027 Thread(4152359824) Resource(fop-euulfixtrdp02_dg) - monitor procedure did not complete within the expected time.
Apr 25 07:17:07 host01 auditd[15846]: Audit daemon rotating log files
Apr 25 07:17:32 host01 Had[26025]: VCS ERROR V-16-1-13027 (foptr-ext01a) Resource(fop-euulfixtrdp02_dg) - monitor procedure did not complete within the expected time.
Apr 25 07:20:37 host01 Had[26025]: VCS CRITICAL V-16-1-50086 CPU usage on foptr-ext01a is 100%
Apr 25 07:20:40 host01 AgentFramework[26213]: VCS ERROR V-16-1-13027 Thread(4154907536) Resource(fop-euulfixcffx01_dg) - monitor procedure did not complete within the expected time.
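
The resource-usage warnings in this excerpt follow a regular pattern, so the escalation leading up to the hang can be extracted from /var/log/messages programmatically. The following is a minimal sketch: the regular expression is derived only from the three `V-16-1-50086` lines shown above, so other VCS message formats are not covered, and the function name is hypothetical.

```python
import re

# Pattern inferred from the "usage on <node> is <N>%" lines in the excerpt.
USAGE_RE = re.compile(
    r"VCS (?P<severity>\w+) (?P<code>V-\d+-\d+-\d+) "
    r"(?P<metric>\w+) usage on (?P<node>\S+) is (?P<percent>\d+)%"
)


def usage_warnings(lines):
    """Return (timestamp, metric, node, percent) for each usage warning."""
    found = []
    for line in lines:
        match = USAGE_RE.search(line)
        if match:
            timestamp = line[:15]  # syslog prefix, e.g. "Apr 25 07:14:07"
            found.append((timestamp, match.group("metric"),
                          match.group("node"), int(match.group("percent"))))
    return found


# Lines copied from the log excerpt above.
sample = [
    "Apr 25 07:14:07 host01 Had[26025]: VCS CRITICAL V-16-1-50086 Swap usage on foptr-ext01a is 93%",
    "Apr 25 07:16:26 host01 Had[26025]: VCS CRITICAL V-16-1-50086 CPU usage on foptr-ext01a is 98%",
    "Apr 25 07:17:07 host01 auditd[15846]: Audit daemon rotating log files",
    "Apr 25 07:20:37 host01 Had[26025]: VCS CRITICAL V-16-1-50086 CPU usage on foptr-ext01a is 100%",
]

for timestamp, metric, node, percent in usage_warnings(sample):
    print("%s  %-4s %3d%%  %s" % (timestamp, metric, percent, node))
```

In this excerpt, the swap warning at 07:14:07 comes roughly two minutes before the first monitor timeout at 07:16:33, so alerting on these messages could give a short window to intervene before VCS agents start failing.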

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
