Server partly unresponsive, caused VCS to fail
Environment
- Red Hat Enterprise Linux 5
- Veritas Cluster Server
Issue
- The server became almost entirely unresponsive - even a simple ls or cat would not work.
- Veritas Cluster Server (VCS) had crashed. The only option was to force a crash dump in order to regain control of the server.
Resolution
- A third-party application, netprobe, was consuming a very large amount of memory, which caused the system to hang.
- Check with the vendor of this application to determine why it is consuming so much memory.
Root Cause
- The system ran out of memory when the issue occurred:
              PAGES        TOTAL      PERCENTAGE
 TOTAL MEM  66034077     251.9 GB         ----
      FREE     82252     321.3 MB    0% of TOTAL MEM
      USED  65951825     251.6 GB   99% of TOTAL MEM
    SHARED     13349      52.1 MB    0% of TOTAL MEM
   BUFFERS       523         2 MB    0% of TOTAL MEM
    CACHED      9255      36.2 MB    0% of TOTAL MEM
Swap was 100% utilized:
TOTAL SWAP   8387928        32 GB         ----
 SWAP USED   8387928        32 GB  100% of TOTAL SWAP
 SWAP FREE         0            0    0% of TOTAL SWAP
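The sizes and percentages in the memory and swap figures above follow directly from the page counts. The following is a minimal sketch of that arithmetic. It assumes the 4 KiB page size that is the default on x86_64 (the page size is not stated in the output itself) and assumes the percentages are truncated to whole numbers, which matches the values shown. The function names are illustrative only.

```python
PAGE_SIZE = 4096  # bytes; assumed x86_64 default, not shown in the output


def pages_to_gib(pages, page_size=PAGE_SIZE):
    """Convert a page count to GiB."""
    return pages * page_size / 2**30


def percent_of(part, total):
    """Whole-number percentage, truncated (assumed to match the output)."""
    return part * 100 // total


# Page counts copied from the figures above.
total_mem, free, used = 66034077, 82252, 65951825
total_swap, swap_used = 8387928, 8387928

print(f"TOTAL MEM {pages_to_gib(total_mem):.1f} GB")
print(f"USED      {pages_to_gib(used):.1f} GB ({percent_of(used, total_mem)}%)")
print(f"FREE      {pages_to_gib(free) * 1024:.1f} MB ({percent_of(free, total_mem)}%)")
print(f"SWAP USED {pages_to_gib(swap_used):.0f} GB ({percent_of(swap_used, total_swap)}%)")
```

With both RAM and swap exhausted, the kernel has essentially no memory left to hand out, which is consistent with even simple commands failing to run.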
The following process was found to be consuming most of the memory:
  PID   PPID  CPU       TASK         ST  %MEM     VSZ        RSS      COMM
14054      1   17  ffff812f6fc82100  RU  92.2  268409640  249506188  netprobe_lin
crash> ps -a 14054
PID: 14054 TASK: ffff812f6fc82100 CPU: 17 COMMAND: "netprobe_lin"
ARG: /export/trdwatch/netprobe/newprod/bin/netprobe_lin 1 -port 7021
ENV: MANPATH=/usr/lib/perl5/man:/usr/local/man:/usr/dt/man:/usr/openwin/man:/usr/man:/usr/share/man:/usr/X11R6/man
The netprobe application is not provided by Red Hat, so check with its vendor to determine why it is consuming so much memory.
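While waiting for the vendor's analysis, the resident memory of a suspect process can be watched on a live system so that growth is noticed before memory is exhausted. The following is a minimal sketch, not a Red Hat or vendor tool. It reads the `VmRSS` field from `/proc/<pid>/status`; the function names, the sample text, and the 200 GiB threshold are hypothetical.

```python
import re


def rss_kib(status_text):
    """Return the VmRSS value (in kB) from /proc/<pid>/status text.

    Returns None when the field is absent (for example, kernel threads).
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Read VmRSS for a live process; Linux only."""
    with open(f"/proc/{pid}/status") as f:
        return rss_kib(f.read())


def over_limit(rss, limit_kib):
    """True when a known RSS value exceeds the given limit."""
    return rss is not None and rss > limit_kib


# Hypothetical status text using the RSS value reported for netprobe_lin.
sample = "Name:\tnetprobe_lin\nVmRSS:\t249506188 kB\n"
limit = 200 * 1024 * 1024  # hypothetical threshold: 200 GiB expressed in kB
print(over_limit(rss_kib(sample), limit))
```

On a running system, `read_rss_kib(14054)` would return the current value for that PID, assuming the process still exists.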
- Because the system ran out of memory due to the netprobe application, Veritas Cluster Server also failed and reported the following errors in /var/log/messages:
Apr 25 07:14:07 host01 Had[26025]: VCS CRITICAL V-16-1-50086 Swap usage on foptr-ext01a is 93%
Apr 25 07:16:26 host01 Had[26025]: VCS CRITICAL V-16-1-50086 CPU usage on foptr-ext01a is 98%
Apr 25 07:16:33 host01 AgentFramework[26213]: VCS ERROR V-16-1-13027 Thread(4152359824) Resource(fop-euulfixtrdp02_dg) - monitor procedure did not complete within the expected time.
Apr 25 07:17:07 host01 auditd[15846]: Audit daemon rotating log files
Apr 25 07:17:32 host01 Had[26025]: VCS ERROR V-16-1-13027 (foptr-ext01a) Resource(fop-euulfixtrdp02_dg) - monitor procedure did not complete within the expected time.
Apr 25 07:20:37 host01 Had[26025]: VCS CRITICAL V-16-1-50086 CPU usage on foptr-ext01a is 100%
Apr 25 07:20:40 host01 AgentFramework[26213]: VCS ERROR V-16-1-13027 Thread(4154907536) Resource(fop-euulfixcffx01_dg) - monitor procedure did not complete within the expected time.
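When reviewing a longer /var/log/messages file, it can help to pull out only the VCS entries and count them by severity and message ID. The following is a minimal sketch, assuming the syslog line layout seen in the excerpt above. The regular expression and the function name are illustrative and not part of VCS.

```python
import re
from collections import Counter

# Matches lines such as:
#   Apr 25 07:14:07 host01 Had[26025]: VCS CRITICAL V-16-1-50086 Swap usage ...
VCS_LINE = re.compile(
    r"^(?P<time>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ (?P<proc>\w+)\[\d+\]: "
    r"VCS (?P<severity>\w+) (?P<msgid>V-\d+-\d+-\d+) (?P<text>.*)$"
)


def summarize_vcs(lines):
    """Count VCS messages by (severity, message ID); other lines are skipped."""
    counts = Counter()
    for line in lines:
        match = VCS_LINE.match(line.strip())
        if match:
            counts[(match.group("severity"), match.group("msgid"))] += 1
    return counts


# Sample lines copied from the excerpt above.
sample = [
    "Apr 25 07:14:07 host01 Had[26025]: VCS CRITICAL V-16-1-50086 Swap usage on foptr-ext01a is 93%",
    "Apr 25 07:16:26 host01 Had[26025]: VCS CRITICAL V-16-1-50086 CPU usage on foptr-ext01a is 98%",
    "Apr 25 07:17:07 host01 auditd[15846]: Audit daemon rotating log files",
    "Apr 25 07:20:37 host01 Had[26025]: VCS CRITICAL V-16-1-50086 CPU usage on foptr-ext01a is 100%",
    "Apr 25 07:20:40 host01 AgentFramework[26213]: VCS ERROR V-16-1-13027 Thread(4154907536) "
    "Resource(fop-euulfixcffx01_dg) - monitor procedure did not complete within the expected time.",
]

for (severity, msgid), count in sorted(summarize_vcs(sample).items()):
    print(severity, msgid, count)
```

In this excerpt the resource usage warnings appear first and the monitor timeouts follow, which fits the root cause described above: the agents could not complete their monitor procedures once the system had run out of memory.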
