Chapter 5. Monitoring and Troubleshooting Global Data Grid Clusters

Data Grid provides statistics for cross-site replication operations via JMX or the /metrics endpoint for Data Grid server.

Cross-site replication statistics are available at cache level, so you must explicitly enable statistics for your caches. Likewise, if you want to collect statistics via JMX, you must configure Data Grid to register MBeans.

Data Grid also includes an org.infinispan.XSITE logging category so you can monitor and troubleshoot common issues with networking and state transfer operations.

5.1. Enabling Data Grid Statistics

Data Grid lets you enable statistics for Cache Managers and caches. However, enabling statistics for a Cache Manager does not enable statistics for the caches that it controls. You must explicitly enable statistics for your caches.

Note

Data Grid server enables statistics for Cache Managers by default.

Procedure

  • Enable statistics declaratively or programmatically.

Declaratively

<cache-container statistics="true"> 1
  <local-cache name="mycache" statistics="true"/> 2
</cache-container>

1
Enables statistics for the Cache Manager.
2
Enables statistics for the named cache.

Programmatically

GlobalConfiguration globalConfig = new GlobalConfigurationBuilder()
  .cacheContainer().statistics(true) 1
  .build();

 ...

Configuration config = new ConfigurationBuilder()
  .statistics().enable() 2
  .build();

1
Enables statistics for the Cache Manager.
2
Enables statistics for the named cache.

5.2. Enabling Data Grid Metrics

Configure Data Grid to export gauges and histograms.

Procedure

  • Configure metrics declaratively or programmatically.

Declaratively

<cache-container statistics="true"> 1
  <metrics gauges="true" histograms="true" /> 2
</cache-container>

1
Computes and collects statistics about the Cache Manager.
2
Exports collected statistics as gauge and histogram metrics.

Programmatically

GlobalConfiguration globalConfig = new GlobalConfigurationBuilder()
  .statistics().enable() 1
  .metrics().gauges(true).histograms(true) 2
  .build();

1
Computes and collects statistics about the Cache Manager.
2
Exports collected statistics as gauge and histogram metrics.

5.2.1. Collecting Data Grid Metrics

Collect Data Grid metrics with monitoring tools such as Prometheus.

Prerequisites

  • Enable statistics. If you do not enable statistics, Data Grid provides 0 and -1 values for metrics.
  • Optionally enable histograms. By default Data Grid generates gauges but not histograms.

Procedure

  • Get metrics in Prometheus (OpenMetrics) format:

    $ curl -v http://localhost:11222/metrics
  • Get metrics in MicroProfile JSON format:

    $ curl --header "Accept: application/json" http://localhost:11222/metrics
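
    When no monitoring tool is in place yet, you can inspect a response from the /metrics endpoint directly. The following is a minimal sketch, not part of Data Grid, that parses Prometheus text output with only the JDK and lists samples that report -1, which can hint that statistics are not enabled. The class name MetricsText and the metric names in the sample are hypothetical; real metric names depend on your Cache Manager and cache names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetricsText {
    // Sample line: name, optional {labels}, value, optional timestamp.
    private static final Pattern SAMPLE =
        Pattern.compile("^([^\\s{]+(?:\\{.*\\})?)\\s+(\\S+)(?:\\s+\\S+)?$");

    /** Parses Prometheus text output into sample name (with labels) -> value. */
    static Map<String, Double> parse(String body) {
        Map<String, Double> samples = new LinkedHashMap<>();
        for (String raw : body.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and HELP/TYPE comments
            }
            Matcher m = SAMPLE.matcher(line);
            if (m.matches()) {
                samples.put(m.group(1), toDouble(m.group(2)));
            }
        }
        return samples;
    }

    // The text format writes infinities as +Inf and -Inf.
    private static double toDouble(String text) {
        switch (text) {
            case "+Inf": return Double.POSITIVE_INFINITY;
            case "-Inf": return Double.NEGATIVE_INFINITY;
            default: return Double.parseDouble(text);
        }
    }

    /** Returns the names of samples whose value is exactly -1. */
    static List<String> reportingMinusOne(Map<String, Double> samples) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Double> e : samples.entrySet()) {
            if (e.getValue() == -1.0) {
                names.add(e.getKey());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical metric names, for illustration only.
        String body = String.join("\n",
            "# TYPE example_cache_hits gauge",
            "example_cache_hits 12",
            "example_cache_misses{cache=\"customers\"} -1");
        Map<String, Double> samples = parse(body);
        System.out.println("Samples: " + samples);
        System.out.println("Reporting -1: " + reportingMinusOne(samples));
    }
}
```

    To use the sketch against a running server, save the output of the curl command to a file and pass its contents to parse(). A value of 0 can be legitimate, so the sketch flags only -1.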

Next steps

Configure monitoring applications to collect Data Grid metrics. For example, add the following to prometheus.yml:

static_configs:
    - targets: ['localhost:11222']

5.3. Configuring Data Grid to Register JMX MBeans

Data Grid can register JMX MBeans that you can use to collect statistics and perform administrative operations. However, registering MBeans does not enable statistics. You must enable statistics separately; otherwise, Data Grid provides 0 values for all statistic attributes.

Procedure

  • Enable JMX declaratively or programmatically.

Declaratively

<cache-container>
  <jmx enabled="true" /> 1
</cache-container>

1
Registers Data Grid JMX MBeans.

Programmatically

GlobalConfiguration globalConfig = new GlobalConfigurationBuilder()
  .jmx().enable() 1
  .build();

1
Registers Data Grid JMX MBeans.

5.3.1. JMX MBeans for Cross-Site Replication

Data Grid provides JMX MBeans for cross-site replication that let you gather statistics and perform remote operations.

The org.infinispan:type=Cache component provides the following JMX MBeans:

  • XSiteAdmin exposes cross-site operations that apply to specific cache instances.
  • StateTransferManager provides statistics for state transfer operations.
  • InboundInvocationHandler provides statistics and operations for asynchronous and synchronous cross-site requests.

The org.infinispan:type=CacheManager component includes the following JMX MBean:

  • GlobalXSiteAdminOperations exposes cross-site operations that apply to all caches in a cache container.

For details about JMX MBeans along with descriptions of available operations and statistics, see the Data Grid JMX Components documentation.
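
As an illustration of how these MBeans can be located, the following minimal sketch queries an MBean server with object name patterns built from the domain and type values above. It uses only the standard javax.management API. The class name XSiteMBeans is hypothetical, and the sketch assumes that each MBean name carries a component key property; confirm the exact object names in a JMX console before relying on them. The platform MBean server applies only when Data Grid runs embedded in the same JVM. For Data Grid server, you would pass a remote MBeanServerConnection instead.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class XSiteMBeans {
    /**
     * Builds a pattern such as org.infinispan:type=Cache,component=XSiteAdmin,*
     * The "component" key is an assumption about how the MBeans are named.
     */
    static ObjectName pattern(String type, String component) {
        try {
            return new ObjectName(
                "org.infinispan:type=" + type + ",component=" + component + ",*");
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Returns the names of all registered MBeans that match the pattern. */
    static Set<ObjectName> find(MBeanServer server, String type, String component) {
        return server.queryNames(pattern(type, component), null);
    }

    public static void main(String[] args) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        String[] cacheComponents = {
            "XSiteAdmin", "StateTransferManager", "InboundInvocationHandler"
        };
        for (String component : cacheComponents) {
            System.out.println(component + " -> " + find(server, "Cache", component));
        }
        System.out.println("GlobalXSiteAdminOperations -> "
            + find(server, "CacheManager", "GlobalXSiteAdminOperations"));
    }
}
```

If no Data Grid MBeans are registered in the JVM, each query returns an empty set, which is a quick way to check whether JMX registration is enabled.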

5.4. Collecting Logs and Troubleshooting Cross-Site Replication

Diagnose and resolve issues related to Data Grid cross-site replication. Use the Data Grid Command Line Interface (CLI) to adjust log levels at run-time and perform cross-site troubleshooting.

Procedure

  1. Open a terminal in $RHDG_HOME.
  2. Create a Data Grid CLI connection.
  3. Adjust run-time logging levels to capture DEBUG messages if necessary.

    For example, the following command enables DEBUG log messages for the org.infinispan.XSITE category:

    [//containers/default]> logging set --level=DEBUG org.infinispan.XSITE

    You can then check the Data Grid log files for cross-site messages in the ${rhdg.server.root}/log directory.

  4. Use the site command to view status for backup locations and perform troubleshooting.

For example, check the status of the "customers" cache that uses "LON" as a backup location:

[//containers/default]> site status --cache=customers
{
  "LON" : "online"
}
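
If you capture the output of the site status command in a script, you can check it for backup locations that are not online. The following is a minimal sketch that assumes the flat "site" : "status" layout shown above; it does not handle nested output. The class name SiteStatus is hypothetical, and the second site in the sample is added for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SiteStatus {
    // Matches one "SITE" : "status" pair in flat CLI output.
    private static final Pattern ENTRY =
        Pattern.compile("\"([^\"]+)\"\\s*:\\s*\"([^\"]+)\"");

    /** Extracts site name -> status pairs in the order they appear. */
    static Map<String, String> parse(String cliOutput) {
        Map<String, String> statuses = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(cliOutput);
        while (m.find()) {
            statuses.put(m.group(1), m.group(2));
        }
        return statuses;
    }

    /** Returns the sites whose status is anything other than "online". */
    static List<String> notOnline(Map<String, String> statuses) {
        List<String> sites = new ArrayList<>();
        for (Map.Entry<String, String> e : statuses.entrySet()) {
            if (!"online".equals(e.getValue())) {
                sites.add(e.getKey());
            }
        }
        return sites;
    }

    public static void main(String[] args) {
        // Sample output; the NYC entry is illustrative.
        String output = "{\n  \"LON\" : \"online\",\n  \"NYC\" : \"offline\"\n}";
        Map<String, String> statuses = parse(output);
        System.out.println("Statuses: " + statuses);
        System.out.println("Needs attention: " + notOnline(statuses));
    }
}
```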

Another scenario for using the Data Grid CLI to troubleshoot is when the network connection between backup locations is broken during a state transfer operation.

If this occurs, Data Grid clusters that receive state transfer wait indefinitely for the operation to complete. In this case, cancel the state transfer to the receiving site to return it to a normal operational state.

For example, cancel state transfer for "NYC" as follows:

[//containers/default]> site cancel-receive-state --cache=mycache --site=NYC

5.4.1. Cross-Site Log Messages

Find user actions for log messages related to cross-site replication.

Log level | Identifier | Message | Description
DEBUG | ISPN000400 | Node null was suspected | Data Grid prints this message when it cannot reach backup locations. Ensure that sites are online and check network status.
INFO | ISPN000439 | Received new x-site view: ${site.name} | Data Grid prints this message when sites join and leave the global cluster.
INFO | ISPN100005 | Site ${site.name} is online. | Data Grid prints this message when a site comes online.
INFO | ISPN100006 | Site ${site.name} is offline. | Data Grid prints this message when a site goes offline. If you did not take the site offline manually, this message could indicate a failure has occurred. Check network status and try to bring the site back online.
WARN | ISPN000202 | Problems backing up data for cache ${cache.name} to site ${site.name}: | Data Grid prints this message when issues occur with state transfer operations along with the exception. If necessary adjust Data Grid logging to get more fine-grained logging messages.
WARN | ISPN000289 | Unable to send X-Site state chunk to ${site.name}. | Indicates that Data Grid cannot transfer a batch of cache entries during a state transfer operation. Ensure that sites are online and check network status.
WARN | ISPN000291 | Unable to apply X-Site state chunk. | Indicates that Data Grid cannot apply a batch of cache entries during a state transfer operation. Ensure that sites are online and check network status.
WARN | ISPN000322 | Unable to re-start x-site state transfer to site ${site.name} | Indicates that Data Grid cannot resume a state transfer operation to a backup location. Ensure that sites are online and check network status.
ERROR | ISPN000477 | Unable to perform operation ${operation.name} for site ${site.name} | Indicates that Data Grid cannot successfully complete an operation on a backup location. If necessary adjust Data Grid logging to get more fine-grained logging messages.
FATAL | ISPN000449 | XSite state transfer timeout must be higher or equals than 1 (one). | Results when the value of the timeout attribute is 0 or a negative number. Specify a value of at least 1 for the timeout attribute in the state transfer configuration for your cache definition.
FATAL | ISPN000450 | XSite state transfer waiting time between retries must be higher or equals than 1 (one). | Results when the value of the wait-time attribute is 0 or a negative number. Specify a value of at least 1 for the wait-time attribute in the state transfer configuration for your cache definition.
FATAL | ISPN000576 | Cross-site Replication not available for local cache. | Cross-site replication does not work with the local cache mode. Either remove the backup configuration from the local cache definition or use a distributed or replicated cache mode.
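
To see which of the identifiers listed above appear in a log file, and how often, you can scan the file for them. The following is a minimal sketch that counts occurrences of these identifiers in a list of log lines. The class name XSiteLogScan and the layout of the sample log lines are hypothetical; the sketch relies only on the identifier appearing somewhere in each line.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XSiteLogScan {
    // Identifiers of the cross-site log messages listed in this section.
    static final Set<String> XSITE_IDS = Set.of(
        "ISPN000400", "ISPN000439", "ISPN100005", "ISPN100006",
        "ISPN000202", "ISPN000289", "ISPN000291", "ISPN000322",
        "ISPN000477", "ISPN000449", "ISPN000450", "ISPN000576");

    private static final Pattern ID = Pattern.compile("ISPN\\d{6}");

    /** Counts how often each listed identifier appears in the given lines. */
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = ID.matcher(line);
            while (m.find()) {
                if (XSITE_IDS.contains(m.group())) {
                    counts.merge(m.group(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical log lines; real layouts depend on your logging configuration.
        List<String> lines = List.of(
            "WARN  [org.infinispan.XSITE] ISPN000289: Unable to send X-Site state chunk to LON.",
            "INFO  [org.infinispan.XSITE] ISPN100006: Site LON is offline.",
            "WARN  [org.infinispan.XSITE] ISPN000289: Unable to send X-Site state chunk to LON.");
        System.out.println(count(lines));
    }
}
```

To scan a real log file, read its lines with java.nio.file.Files.readAllLines and pass them to count().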