Chapter 4. Troubleshooting and optimizing mod_jk

Optimizing the configuration of Apache HTTP Server, mod_jk, mod_proxy, mod_cluster, and JBoss Enterprise Application Platform typically resolves most load balancing problems and errors, although there are exceptions (such as long-running servlets, which require additional tuning).
In most cases, a correctly tuned configuration is the remedy for mod_jk issues. This section discusses common problems and how the configuration can be improved to avoid them.

Optimization Considerations

  • Ensure you are on the latest supported component versions.
  • Ensure the relevant configurations are tuned correctly. The Red Hat Global Support Services staff can use interactive tools to assist you with tailored configuration settings. Find the appropriate contact details at https://access.redhat.com/support/.
If optimizing the configuration does not resolve the issue, the problem is most likely on the JBoss/JVM side. Refer to Procedure 4.5, “JBoss/JVM Problems” for advice about these issues.

4.1. Common Problems

The list below outlines some common configuration problems. Ensuring your implementation is not subject to one of these problems may help resolve your issue.
Specific errors and general performance issues are discussed later in this section.

Common Configuration Issues

JkShmFile on an NFS share
Placing the JkShmFile on an NFS share can cause unexplained pauses and other odd behavior in mod_jk. It is strongly recommended that the JkShmFile always be placed on local storage.
A firewall between Apache HTTP Server and JBoss Enterprise Application Platform
If there is a firewall between Apache HTTP Server and JBoss Enterprise Application Platform and the socket_keepalive parameter is not set, the firewall can close idle connections unexpectedly.
MaxClients higher than maxThreads
Setting the MaxClients parameter in Apache HTTP Server higher than the maxThreads setting in JBoss will, under high load, allow Apache HTTP Server to open more connections than the JBoss instance has threads to service, causing hung and/or dropped connections.
No connectionTimeout parameter set
The connectionTimeout parameter set on the JBoss AJP connector is required for the proper cleanup of old, idle connections.
No CPing/CPong set
The CPing/CPong properties are the most important mod_jk worker settings, because they allow mod_jk to test connections and detect faulty ones. Without them, bad connections are not detected as quickly, which can leave web requests appearing to hang.
Running an old version of mod_jk.
There are known issues with sticky sessions in versions prior to mod_jk 1.2.27.
Running an older version of EAP.
There is a bug in EAP 4.2 base and EAP 4.2 CP01 that causes sockets to be left in the CLOSE_WAIT state, again giving the appearance of hung requests. This issue has been reported and fixed; see https://jira.jboss.org/jira/browse/JBPAPP-366
Unresponsive back end server
java.lang.OutOfMemoryError errors or high pause times can cause the back end server to become unresponsive.
All of the problems listed above are typically resolved by optimizing the configuration of Apache HTTP Server, mod_jk, and JBoss. A configuration sketch illustrating several of these settings follows.
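The following is a minimal sketch only: the worker name node1, the host address, and all numeric values are illustrative assumptions that must be tuned for your environment (Red Hat Global Support Services can assist with exact values).
  # workers.properties (example values)
  worker.list=node1
  worker.node1.type=ajp13
  worker.node1.host=192.168.0.101
  worker.node1.port=8009
  # Keep idle AJP connections alive through any firewall between httpd and JBoss
  worker.node1.socket_keepalive=true
  # Close pooled connections idle for more than 60 seconds
  worker.node1.connection_pool_timeout=60
  # CPing/CPong probes on connect, before each request, and at intervals (mod_jk 1.2.27 or later)
  worker.node1.ping_mode=A
  worker.node1.ping_timeout=10000

  # httpd.conf (prefork MPM): MaxClients should not exceed the connector's maxThreads,
  # and the JkShmFile should live on local storage
  MaxClients 200
  JkShmFile /var/run/mod_jk/jk-runtime-status

  <!-- JBoss Web server.xml AJP connector: maxThreads at least MaxClients;
       connectionTimeout (milliseconds) paired with connection_pool_timeout (seconds) -->
  <Connector protocol="AJP/1.3" port="8009" maxThreads="200" connectionTimeout="60000" redirectPort="8443"/>
Pairing connection_pool_timeout with the connector's connectionTimeout helps both sides drop idle connections consistently instead of leaving half-closed sockets behind.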

Common Errors

"CPing/CPong" Errors
Presents with errors like the following:
[info] ajp_handle_cping_cpong::jk_ajp_common.c (865): timeout in reply cpong
...
[error] ajp_connect_to_endpoint::jk_ajp_common.c (957): (nodeA) cping/cpong after connecting to the backend server failed (errno=110)
[error] ajp_send_request::jk_ajp_common.c (1507): (nodeA) connecting to backend failed. Tomcat is probably not started or is listening on the  wrong port (errno=110)
These CPing/CPong messages do not indicate a problem with mod_jk itself; they indicate that JBoss did not respond within the configured CPing/CPong time.
This is often seen when the JVM running JBoss is under high load, causing long garbage collection pauses or thread contention. It can also mean that the JBoss instance is overloaded, that a firewall is blocking the connection, or that there are network issues.
The following workflow may assist in correcting these types of issues:

Procedure 4.1. Resolving "CPing/CPong" Errors

  1. Optimize your Apache HTTP Server and JBoss Enterprise Application Platform configuration. You can contact Red Hat's Global Support Services for assistance with this.
    If this does not resolve the issue, proceed to Step 2.
  2. Confirm that there is no firewall blocking or dropping the AJP connections; a quick connectivity check is sketched below.
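    As a quick sanity check for Step 2 (a sketch only; jboss-host and port 8009, the default AJP port, are example values), confirm from the Apache HTTP Server machine that the AJP port on the JBoss machine is reachable:
      # Run from the Apache HTTP Server host; adjust host and port for your environment
      telnet jboss-host 8009
      # or, where netcat is available:
      nc -vz jboss-host 8009
    If the connection is refused or times out while JBoss is running, inspect any firewall rules between the two hosts.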
"Tomcat is down" Errors
Presents with errors like the following:
  1. [error] ajp_get_reply::jk_ajp_common.c (2020): (node1) Timeout with waiting reply from tomcat. Tomcat is down, stopped or network problems (errno=110)
    The above error means that JBoss did not respond within the configured reply_timeout. The solution can be one (or both) of the following:
    1. Increase the reply_timeout (see the sketch after Procedure 4.2).
    2. Verify that there are no garbage collection issues or long pause times in JBoss that could prevent the request from being answered in time.
  2. [Fri May 25 11:53:37 2012][11159:3086420192] [debug] init_ws_service::mod_jk.c (977): Service protocol=HTTP/1.1 method=POST ssl=false host=(null) addr=127.0.0.1 name=localhost port=80 auth=(null) user=(null) laddr=127.0.0.1 raddr=127.0.0.1 uri=/foo/bar
    ...
    [Fri May 25 11:58:39 2012][11159:3086420192] [debug] jk_shutdown_socket::jk_connect.c (681): About to shutdown socket 17
    [Fri May 25 11:58:39 2012][11159:3086420192] [debug] jk_shutdown_socket::jk_connect.c (689): Failed sending SHUT_WR for socket 17
    [Fri May 25 11:58:39 2012][11159:3086420192] [info] ajp_connection_tcp_get_message::jk_ajp_common.c (1150): (node1) can not receive the response header message from tomcat, network problems or tomcat (127.0.0.1:8009) is down (errno=104)
    [Fri May 25 11:58:39 2012][11159:3086420192] [error] ajp_get_reply::jk_ajp_common.c (1962): (node1) Tomcat is down or refused connection. No response has been sent to the client (yet)
    The above error likely means that JBoss Enterprise Application Platform did not respond within the configured core Apache HTTP Server timeout period.
    Note that in these messages the [11159:3086420192] portion serves as an identifier for the connection/request in question. Tracing back from the point of the error in the logs can therefore help clarify the activity around the connection/request that led to the error.
    In this case, that helps clarify that the error occurred five minutes after the request was sent to JBoss, which likely points to a five minute timeout (this is the default of Apache HTTP Server's Timeout directive if not specified). If the Timeout is interrupting mod_jk requests, it should be increased from the current value to allow for the maximum acceptable response time.

    Procedure 4.2. Resolving "Tomcat is down" Errors

    1. Optimize your Apache HTTP Server and JBoss Enterprise Application Platform configuration. You can contact Red Hat's Global Support Services for assistance with this.
      If this does not resolve the issue, proceed to Step 2.
    2. Confirm that there is no firewall blocking or dropping the AJP connections.
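    The following sketch assumes, purely as an example, that the longest acceptable request takes up to ten minutes; the worker name node1 and both values are illustrative and should be derived from your own maximum acceptable response time:
      # workers.properties: allow JBoss up to 600000 ms (10 minutes) to reply
      worker.node1.reply_timeout=600000

      # httpd.conf: keep the core timeout above the longest expected mod_jk request
      Timeout 600
    Both values should comfortably exceed the longest request you are prepared to wait for; setting one lower than the other simply moves the timeout from one layer to the other.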
General Performance Issues
Presents with errors like the following:
ERROR [org.apache.coyote.ajp.AjpMessage] (ajp-192.168.0.101-8001-13) Invalid message received with signature 12336
The above exception, seen when using mod_jk with JBoss Web, typically indicates that a non-AJP request (for example, a plain HTTP request sent directly to the AJP port) was received by the AJP connector.
The following workflow may assist in resolving these kinds of issues:

Procedure 4.3. General Performance Problems

  1. Optimize your Apache HTTP Server and JBoss Enterprise Application Platform configuration. You can contact Red Hat's Global Support Services for assistance with this.
    If this does not resolve the issue, proceed to Step 2.
  2. Gather garbage collection logs for analysis.
    If the logs show long garbage collection pause times then you should optimize the Java Virtual Machine to reduce the garbage collection pauses and gather/recheck updated logs. Refer to https://access.redhat.com/knowledge/solutions/19932 (Red Hat account required) for more information.
    If this is not the case, or did not resolve the issue, try Step 3, Step 4 and/or Step 5 until your issue is resolved.
  3. Determine how long the longest request should take. Factor in transaction times. You may need to increase the reply_timeout to resolve the problem.
    If this does not resolve the issue, continue to Step 4.
  4. Determine whether your current environment can handle the given load; the mod_jk status worker sketched after this procedure can help gauge how busy the workers are. If the environment cannot handle the load, you may need to upgrade or add more machines.
    If this does not resolve the issue, continue to Step 5.
  5. Confirm that there is no firewall blocking or dropping the AJP connections.
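One way to gauge whether the environment is keeping up with the load (Step 4) is the mod_jk status worker. The snippet below is a sketch only; the worker names and the /jkstatus path are examples, and access to the page should be restricted in production:
  # workers.properties: add a status worker alongside the existing workers
  worker.list=node1,jkstatus
  worker.jkstatus.type=status

  # httpd.conf: expose the status page (restrict access appropriately)
  JkMount /jkstatus jkstatus
The status page shows each worker's state, busy connections, and error counts, which helps determine whether the back end is saturated.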

Procedure 4.4. 503 Errors

  1. Optimize your Apache HTTP Server and JBoss Enterprise Application Platform configuration. You can contact Red Hat's Global Support Services for assistance with this.
    If this does not resolve the issue, proceed to Step 2.
  2. Gather garbage collection logs for analysis.
    If the logs show long garbage collection pause times then you should optimize the Java Virtual Machine to reduce the garbage collection pauses and gather/recheck updated logs. Refer to https://access.redhat.com/knowledge/solutions/19932 (Red Hat account required) for more information.
    If this is not the case, or does not resolve the issue, continue to Step 3.
  3. Determine how long the longest request should take. Factor in transaction times. You may need to increase the reply_timeout to resolve the issue.
    If this does not resolve the issue, move on to Step 4.
  4. Determine if your current environment can handle the given load. If not, you may need to upgrade or add more machines.
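If the cause of the 503 responses is still unclear after these steps, raising the mod_jk log level can show how each request is dispatched and why a worker is put into the error state. This is a sketch only; the log path is an example, and debug logging is verbose and should be reverted after diagnosis:
  # httpd.conf
  JkLogFile /var/log/httpd/mod_jk.log
  JkLogLevel debug
The [debug] entries quoted earlier in this section (for example, in the "Tomcat is down" discussion) are produced at this log level.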
JBoss/JVM-related Issues
May present with errors like:
[error] service::jk_lb_worker.c (1473): All tomcat instances failed, no more workers left
If Apache HTTP Server and JBoss Enterprise Application Platform are optimized and you still receive "no more workers left" errors, this typically indicates an issue on the JBoss/JVM side. A number of JVM-related problems could prevent mod_jk from getting a connection to JBoss within the configured timeouts, causing the worker to go into the error state and producing this message.

Procedure 4.5. JBoss/JVM Problems

  1. Enable garbage collection logging.
    1. For UNIX-based systems, the options should be placed in run.conf, not run.sh. The run.conf in the server configuration directory (e.g. <JBOSS_HOME>/server/<PROFILE>/run.conf) takes precedence over the run.conf in the <JBOSS_HOME>/bin directory (except in JBoss EAP 5.0.0, due to a regression fixed in version 5.0.1).
    2. For Windows, the options need to be added to run.bat, as it does not read run.conf.
    3. Check boot.log to see the value of the user.dir system property (e.g. <JBOSS_HOME>/bin), which is the default location for garbage collection logging when no path is provided.
    4. If you are running multiple instances of JBoss against the same directory, like so:
      ./run.sh -c node1 -b 127.0.0.1 -Djboss.messaging.ServerPeerID=1
      ./run.sh -c node2 -b 127.0.0.1 -Djboss.messaging.ServerPeerID=2 -Djboss.service.binding.set=ports-01
      then, for the gc.log files to be properly separated, make sure each <PROFILE> has a unique run.conf with the JAVA_OPTS specific to that <PROFILE>.
      For example, node1 will contain a <JBOSS_HOME>/server/node1/run.conf with contents:
      JAVA_OPTS="$JAVA_OPTS -verbose:gc -Xloggc:gc_node1.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
      and node2 will contain a <JBOSS_HOME>/server/node2/run.conf with contents:
      JAVA_OPTS="$JAVA_OPTS -verbose:gc -Xloggc:gc_node2.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps"

      Important

      gc.log is recreated every time JBoss starts.
      Be sure to back up gc.log if you are restarting the server. Alternatively you may be able to add a timestamp to the file name depending on the OS and/or shell. For example, with OpenJDK or Oracle/Sun JDK on Linux: -Xloggc:gc.log.`date +%Y%m%d%H%M%S`.
    5. On Windows, you can use the following in run.bat:
      for /f "tokens=2-4 delims=/ " %%a in ('date /t') do (set mydate=%%c-%%a-%%b)
      for /f "tokens=1-2 delims=/:" %%a in ("%TIME%") do (set mytime=%%a%%b)
      set "JAVA_OPTS=%JAVA_OPTS% -Xloggc:C:/log/gc.log.%mydate%_%mytime%"
  2. For the time period when there are slowdowns, hangs, or errors, gather the following data: per-thread CPU usage for the JBoss process, Java thread dumps taken at the same time, and the garbage collection logs enabled in Step 1.
  3. Determine if the CPU utilization is caused by the JVM (Java application). Here, you want to validate that a Java process is indeed using an unexpected amount of CPU.
    The Java thread data gathered in the previous step should help identify this.
  4. Assuming a Java process is identified as the cause of the high CPU, the most common cause is Java garbage collection. Determine whether the high CPU is caused by garbage collection by analyzing the garbage collection logs for long pause times and/or low overall throughput at the time of the issue.
    To find the garbage collection logging related to the issue, determine the number of seconds after JVM startup at which the issue occurred (seconds since startup is the typical format of garbage collection log timestamps). To determine the elapsed time, compare the first timestamp in the high CPU data gathered with the first timestamp in the console log, boot.log (JBoss), server.log (JBoss), or catalina.out (Tomcat). A worked example follows this procedure.
    If you see long pause times and/or low overall throughput, refer to the following Knowledge Base article (Red Hat subscription required): https://access.redhat.com/knowledge/node/19932.
  5. If garbage collection is not responsible for the high CPU, use the thread dump information gathered when validating the CPU information to identify the threads involved.
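As a worked example (the times shown are hypothetical): if boot.log shows that the JVM started at 09:00:00 and the high CPU data begins at 09:32:00, the issue began roughly 1920 seconds after startup, so look for garbage collection entries with timestamps around the 1920 second mark. With -XX:+PrintGCDateStamps enabled, as in the run.conf examples above, you can instead match wall-clock times directly.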
JVM and garbage collection problems are not a direct consequence of an unoptimized mod_jk configuration, but they can still cause issues with mod_jk. When pause times are high and the JVM is not tuned for the application server, those pauses can cause mod_jk errors even when mod_jk itself is correctly configured.