How do I troubleshoot Minor Long GC Pauses ?

Solution In Progress - Updated -

Environment

  • Red Hat JBoss Enterprise Application Platform (EAP) 6.1.1
  • VMware
  • Rhel Virtual box

Issue

2015-05-18T02:30:34.235+1000: 317248.797: [GC2015-05-18T02:30:34.236+1000: 317248.797: [ParNew: 224625K->2047K(306688K), 7.7793540 secs] 2975927K->2896668K(4160256K), 7.7803710 secs] [Times: user=23.07 sys=0.01, real=7.78 secs] 
2015-05-18T02:30:42.677+1000: 317257.238: [GC2015-05-18T02:30:42.678+1000: 317257.239: [ParNew: 254844K->2224K(306688K), 9.7654650 secs] 3149464K->3074110K(4160256K), 9.7663640 secs] [Times: user=28.32 sys=0.18, real=9.76 secs] 
..........
2015-05-15T04:00:37.728+1000: 63452.289: [GC2015-05-15T04:00:37.728+1000: 63452.289: [ParNew: 272498K->31856K(306688K), 7.8617690 secs] 3553329K->3479932K(4160256K), 7.8626360 secs] [Times: user=21.13 sys=0.08, real=7.87 secs] 
2015-05-15T04:00:45.614+1000: 63460.175: [GC [1 CMS-initial-mark: 3448076K(3853568K)] 3541769K(4160256K), 0.1105370 secs] [Times: user=0.02 sys=0.00, real=0.11 secs] 
2015-05-15T04:00:45.726+1000: 63460.287: [CMS-concurrent-mark-start]
2015-05-15T04:00:45.991+1000: 63460.553: [GC2015-05-15T04:00:45.992+1000: 63460.553: [ParNew (promotion failed): 304496K->306688K(306688K), 0.6682420 secs]2015-05-15T04:00:46.660+1000: 63461.221: [CMS2015-05-15T04:00:50.761+1000: 63465.322: [CMS-concurrent-mark: 4.347/5.035 secs] [Times: user=1.17 sys=0.09, real=5.04 secs] 
 (concurrent mode failure): 3494551K->619035K(3853568K), 14.1227260 secs] 3752572K->619035K(4160256K), [CMS Perm : 213966K->213923K(356576K)], 14.7918010 secs] [Times: user=2.49 sys=0.17, real=14.79 secs] 
2015-05-15T04:01:02.063+1000: 63476.624: [GC2015-05-15T04:01:02.063+1000: 63476.624: [ParNew: 272640K->34048K(306688K), 0.0398180 secs] 891675K->716467K(4160256K), 0.0406330 secs] [Times: user=0.14 sys=0.00, real=0.04 secs] 

Resolution

  • Increase amount of memory available to VMWare VM

  • Consider using pre-allocated HugePages for Java memory allocations, as HugePages cannot be swapped

This requires memory to be pre-allocated and the Linux system rebooted to ensure HugePage memory is allocated as one contiguous block:

How do I enable large page support on Linux?

We recommend leaving at least 1 GiB overhead for the operating system to function. eg: If you're giving 4 GiB to the JVM, then the system needs at least 5 GiB memory. Depending on workload, more may be required.

Then add "-XX:+UseLargePages" to the EAP JVM options. This is done in "bin/standalone.conf" if in standalone mode, or in "domain.xml/host.xml" if domain mode.

  • Install "tuned" and run the "virtual-guest" performance profile.

Diagnostic Steps

JVM Configuration Options

Check for JVM options

-server
-Xms4096m
-Xmx4096m 
-XX:MaxPermSize=512m 
-XX:+UseConcMarkSweepGC 
-XX:MaxGCPauseMillis=50 
-XX:GCTimeRatio=10 
-XX:+CMSClassUnloadingEnabled 
-XX:+UseTLAB 
-XX:TLABSize=256K 
-XX:-UseGCOverheadLimit
-XX:-TraceClassUnloading
-XX:+HeapDumpOnOutOfMemoryError
-verbose:gc 
-Xloggc:/app/jboss6/gc.log 
-XX:+PrintGCDetails 
-XX:+PrintGCDateStamps
-Djava.net.preferIPv4Stack=true 
-Dorg.jboss.resolver.warning=true 
-Dsun.rmi.dgc.client.gcInterval=1200000 
-Dsun.rmi.dgc.server.gcInterval=1200000 
-Dsun.lang.ClassLoader.allowArraySyntax=true"
-Djboss.modules.system.pkgs=$JBOSS_MODULES_SYSTEM_PKGS 
-Djava.awt.headless=true
-Duser.timezone=Australia/Melbourne 
-Dorg.apache.jasper.compiler.Parser.STRICT_QUOTE_ESCAPING=false
-Djboss.server.default.config=standalone-full-ha.xml
  • "-XX:CMSInitiatingOccupancyFraction" is not specified ( where this is actually used to determine when to start CMS collections , which defaults to 92% if not specified )
  • -XX:MaxGCPauseMillis=50 is quite low . Setting its value too small will cause the system to spend an excessive amount of time doing garbage collection as well .
    Test below options

    • CMSInitiatingOccupancyFraction (option to a value lower than the 92% default value. This is the tenured generation threshold that triggers a concurrent collection)
  • UseCMSInitiatingOccupancyOnly
    eg.
    -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=85

So when CMS collector run is based on the promotion rate ,this will help calculate an estimate of how long it will take to fill the Old generation , ie the objects promoted from the young generation.
Setting a lower value would mean more work and low throughput ,and setting it to higher value would mean (setting late) old gen might be exhausted.

Recommend ruling any issues around JVM configurations. If the issue still persist investigate Kernel side of things.

Kernel

Things to check

  • At the slow time, capture the top CPU-using processs is Java.
  • Check if CPU core is pinned on iowait.
  • Is I/O being done to the swap LVM Logical Volume.
  • Check if Java is causing the system to swap fairly heavily.
  • Collect sosreport , check memory usage

For example below Memory usage in the "sosreport" (below usecase )indicates almost all memory is allocated to Java:

 $ xsos -m .
MEMORY
  Stats graphed as percent of MemTotal:
    MemUsed    ▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊..  96.8%
    Buffers    ..................................................   0.1%
    Cached     ▊.................................................   2.8%
    HugePages  ..................................................   0.0%
  RAM:
    4.7 GiB total ram
    4.6 GiB (97%) used
    4.4 GiB (94%) used excluding Buffers/Cached
  HugePages:
    No ram pre-allocated to HugePages

 $ xsos -p .
PS CHECK
  Top users of CPU & MEM: 
    USER   %CPU  %MEM   RSS 
    jboss  5.5%  86.7%  4.10 GiB
    root   4.2%  3.8%   0.26 GiB
  Top MEM-using processes: 
    USER     PID    %CPU  %MEM  VSZ-MiB  RSS-MiB  TTY    STAT  START  TIME    COMMAND  
    jboss    22643  5.5   86.7  8512     4200     ?      Sl    May18  61:52   /app/j2sdk/bin/java -D[Standalone] -XX:+UseCompressedOops 
    root     16836  0.1   1.8   2380     91       ?      Sl    Apr13  71:31   /app/j2sdk/bin/java -Djava.endorsed.dirs=/app/rhq-agent/lib/endorsed -Djava.library.path=/app/rhq-agent/li

and evidently that isn't enough.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

Close

Welcome! Check out the Getting Started with Red Hat page for quick tours and guides for common tasks.