Chapter 4. Enabling and configuring Data Grid statistics and JMX monitoring

Data Grid can provide Cache Manager and cache statistics as well as export JMX MBeans.

4.1. Enabling statistics in embedded caches

Configure Data Grid to export statistics for the Cache Manager and embedded caches.

Procedure

  1. Open your Data Grid configuration for editing.
  2. Add the statistics="true" attribute or the .statistics(true) method.
  3. Save and close your Data Grid configuration.

Embedded cache statistics

XML

<infinispan>
  <cache-container statistics="true">
    <distributed-cache statistics="true"/>
    <replicated-cache statistics="true"/>
  </cache-container>
</infinispan>

GlobalConfigurationBuilder

GlobalConfigurationBuilder global = GlobalConfigurationBuilder.defaultClusteredBuilder().cacheContainer().statistics(true);
DefaultCacheManager cacheManager = new DefaultCacheManager(global.build());

Configuration builder = new ConfigurationBuilder();
builder.statistics().enable();

4.2. Configuring Data Grid metrics

Data Grid generates metrics that are compatible with any monitoring system.

  • Gauges provide values such as the average number of nanoseconds for write operations or JVM uptime.
  • Histograms provide details about operation execution times such as read, write, and remove times.

By default, Data Grid generates gauges when you enable statistics but you can also configure it to generate histograms.

Note

Data Grid metrics are provided at the vendor scope. Metrics related to the JVM are provided in the base scope.

Prerequisites

  • You must add Micrometer Core and Micrometer Registry Prometheus JARs to your classpath to export Data Grid metrics for embedded caches.

Procedure

  1. Open your Data Grid configuration for editing.
  2. Add the metrics element or object to the cache container.
  3. Enable or disable gauges with the gauges attribute or field.
  4. Enable or disable histograms with the histograms attribute or field.
  5. Save and close your client configuration.

Metrics configuration

XML

<infinispan>
  <cache-container statistics="true">
    <metrics gauges="true"
             histograms="true" />
  </cache-container>
</infinispan>

JSON

{
  "infinispan" : {
    "cache-container" : {
      "statistics" : "true",
      "metrics" : {
        "gauges" : "true",
        "histograms" : "true"
      }
    }
  }
}

YAML

infinispan:
  cacheContainer:
    statistics: "true"
    metrics:
      gauges: "true"
      histograms: "true"

GlobalConfigurationBuilder

GlobalConfiguration globalConfig = new GlobalConfigurationBuilder()
  //Computes and collects statistics for the Cache Manager.
  .statistics().enable()
  //Exports collected statistics as gauge and histogram metrics.
  .metrics().gauges(true).histograms(true)
  .build();

Additional resources

4.3. Registering JMX MBeans

Data Grid can register JMX MBeans that you can use to collect statistics and perform administrative operations. You must also enable statistics otherwise Data Grid provides 0 values for all statistic attributes in JMX MBeans.

Procedure

  1. Open your Data Grid configuration for editing.
  2. Add the jmx element or object to the cache container and specify true as the value for the enabled attribute or field.
  3. Add the domain attribute or field and specify the domain where JMX MBeans are exposed, if required.
  4. Save and close your client configuration.

JMX configuration

XML

<infinispan>
  <cache-container statistics="true">
    <jmx enabled="true"
         domain="example.com"/>
  </cache-container>
</infinispan>

JSON

{
  "infinispan" : {
    "cache-container" : {
      "statistics" : "true",
      "jmx" : {
        "enabled" : "true",
        "domain" : "example.com"
      }
    }
  }
}

YAML

infinispan:
  cacheContainer:
    statistics: "true"
    jmx:
      enabled: "true"
      domain: "example.com"

GlobalConfigurationBuilder

GlobalConfiguration global = GlobalConfigurationBuilder.defaultClusteredBuilder()
   .jmx().enable()
   .domain("org.mydomain");

4.3.1. Enabling JMX remote ports

Provide unique remote JMX ports to expose Data Grid MBeans through connections in JMXServiceURL format.

You can enable remote JMX ports using one of the following approaches:

  • Enable remote JMX ports that require authentication to one of the Data Grid Server security realms.
  • Enable remote JMX ports manually using the standard Java management configuration options.

Prerequisites

  • For remote JMX with authentication, define JMX specific user roles using the default security realm. Users must have controlRole with read/write access or the monitorRole with read-only access to access any JMX resources.

Procedure

Start Data Grid Server with a remote JMX port enabled using one of the following ways:

  • Enable remote JMX through port 9999.

    bin/server.sh --jmx 9999
    Warning

    Using remote JMX with SSL disabled is not intended for production environments.

  • Pass the following system properties to Data Grid Server at startup.

    bin/server.sh -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
    Warning

    Enabling remote JMX with no authentication or SSL is not secure and not recommended in any environment. Disabling authentication and SSL allows unauthorized users to connect to your server and access the data hosted there.

Additional resources

4.3.2. Data Grid MBeans

Data Grid exposes JMX MBeans that represent manageable resources.

org.infinispan:type=Cache
Attributes and operations available for cache instances.
org.infinispan:type=CacheManager
Attributes and operations available for Cache Managers, including Data Grid cache and cluster health statistics.

For a complete list of available JMX MBeans along with descriptions and available operations and attributes, see the Data Grid JMX Components documentation.

Additional resources

4.3.3. Registering MBeans in custom MBean servers

Data Grid includes an MBeanServerLookup interface that you can use to register MBeans in custom MBeanServer instances.

Prerequisites

  • Create an implementation of MBeanServerLookup so that the getMBeanServer() method returns the custom MBeanServer instance.
  • Configure Data Grid to register JMX MBeans.

Procedure

  1. Open your Data Grid configuration for editing.
  2. Add the mbean-server-lookup attribute or field to the JMX configuration for the Cache Manager.
  3. Specify fully qualified name (FQN) of your MBeanServerLookup implementation.
  4. Save and close your client configuration.
JMX MBean server lookup configuration

XML

<infinispan>
  <cache-container statistics="true">
    <jmx enabled="true"
         domain="example.com"
         mbean-server-lookup="com.example.MyMBeanServerLookup"/>
  </cache-container>
</infinispan>

JSON

{
  "infinispan" : {
    "cache-container" : {
      "statistics" : "true",
      "jmx" : {
        "enabled" : "true",
        "domain" : "example.com",
        "mbean-server-lookup" : "com.example.MyMBeanServerLookup"
      }
    }
  }
}

YAML

infinispan:
  cacheContainer:
    statistics: "true"
    jmx:
      enabled: "true"
      domain: "example.com"
      mbeanServerLookup: "com.example.MyMBeanServerLookup"

GlobalConfigurationBuilder

GlobalConfiguration global = GlobalConfigurationBuilder.defaultClusteredBuilder()
   .jmx().enable()
   .domain("org.mydomain")
   .mBeanServerLookup(new com.acme.MyMBeanServerLookup());

4.4. Exporting metrics during a state transfer operation

You can export time metrics for clustered caches that Data Grid redistributes across nodes.

A state transfer operation occurs when a clustered cache topology changes, such as a node joining or leaving a cluster. During a state transfer operation, Data Grid exports metrics from each cache, so that you can determine a cache’s status. A state transfer exposes attributes as properties, so that Data Grid can export metrics from each cache.

Note

You cannot perform a state transfer operation in invalidation mode.

Data Grid generates time metrics that are compatible with the REST API and the JMX API.

Prerequisites

  • Configure Data Grid metrics.
  • Enable metrics for your cache type, such as embedded cache or remote cache.
  • Initiate a state transfer operation by changing your clustered cache topology.

Procedure

  • Choose one of the following methods:

    • Configure Data Grid to use the REST API to collect metrics.
    • Configure Data Grid to use the JMX API to collect metrics.

4.5. Monitoring the status of cross-site replication

Monitor the site status of your backup locations to detect interruptions in the communication between the sites. When a remote site status changes to offline, Data Grid stops replicating your data to the backup location. Your data become out of sync and you must fix the inconsistencies before bringing the clusters back online.

Monitoring cross-site events is necessary for early problem detection. Use one of the following monitoring strategies:

Monitoring cross-site replication with the REST API

Monitor the status of cross-site replication for all caches using the REST endpoint. You can implement a custom script to poll the REST endpoint or use the following example.

Prerequisites

  • Enable cross-site replication.

Procedure

  1. Implement a script to poll the REST endpoint.

    The following example demonstrates how you can use a Python script to poll the site status every five seconds.

#!/usr/bin/python3
import time
import requests
from requests.auth import HTTPDigestAuth


class InfinispanConnection:

    def __init__(self, server: str = 'http://localhost:11222', cache_manager: str = 'default',
                 auth: tuple = ('admin', 'change_me')) -> None:
        super().__init__()
        self.__url = f'{server}/rest/v2/cache-managers/{cache_manager}/x-site/backups/'
        self.__auth = auth
        self.__headers = {
            'accept': 'application/json'
        }

    def get_sites_status(self):
        try:
            rsp = requests.get(self.__url, headers=self.__headers, auth=HTTPDigestAuth(self.__auth[0], self.__auth[1]))
            if rsp.status_code != 200:
                return None
            return rsp.json()
        except:
            return None


# Specify credentials for Data Grid user with permission to access the REST endpoint
USERNAME = 'admin'
PASSWORD = 'change_me'
# Set an interval between cross-site status checks
POLL_INTERVAL_SEC = 5
# Provide a list of servers
SERVERS = [
    InfinispanConnection('http://127.0.0.1:11222', auth=(USERNAME, PASSWORD)),
    InfinispanConnection('http://127.0.0.1:12222', auth=(USERNAME, PASSWORD))
]
#Specify the names of remote sites
REMOTE_SITES = [
    'nyc'
]
#Provide a list of caches to monitor
CACHES = [
    'work',
    'sessions'
]


def on_event(site: str, cache: str, old_status: str, new_status: str):
    # TODO implement your handling code here
    print(f'site={site} cache={cache} Status changed {old_status} -> {new_status}')


def __handle_mixed_state(state: dict, site: str, site_status: dict):
    if site not in state:
        state[site] = {c: 'online' if c in site_status['online'] else 'offline' for c in CACHES}
        return

    for cache in CACHES:
        __update_cache_state(state, site, cache, 'online' if cache in site_status['online'] else 'offline')


def __handle_online_or_offline_state(state: dict, site: str, new_status: str):
    if site not in state:
        state[site] = {c: new_status for c in CACHES}
        return

    for cache in CACHES:
        __update_cache_state(state, site, cache, new_status)


def __update_cache_state(state: dict, site: str, cache: str, new_status: str):
    old_status = state[site].get(cache)
    if old_status != new_status:
        on_event(site, cache, old_status, new_status)
        state[site][cache] = new_status


def update_state(state: dict):
    rsp = None
    for conn in SERVERS:
        rsp = conn.get_sites_status()
        if rsp:
            break
    if rsp is None:
        print('Unable to fetch site status from any server')
        return

    for site in REMOTE_SITES:
        site_status = rsp.get(site, {})
        new_status = site_status.get('status')
        if new_status == 'mixed':
            __handle_mixed_state(state, site, site_status)
        else:
            __handle_online_or_offline_state(state, site, new_status)


if __name__ == '__main__':
    _state = {}
    while True:
        update_state(_state)
        time.sleep(POLL_INTERVAL_SEC)

When a site status changes from online to offline or vice-versa, the function on_event is invoked.

If you want to use this script, you must specify the following variables:

  • USERNAME and PASSWORD: The username and password of Data Grid user with permission to access the REST endpoint.
  • POLL_INTERVAL_SEC: The number of seconds between polls.
  • SERVERS: The list of Data Grid Servers at this site. The script only requires a single valid response but the list is provided to allow fail over.
  • REMOTE_SITES: The list of remote sites to monitor on these servers.
  • CACHES: The list of cache names to monitor.

Monitoring cross-site replication with the Prometheus metrics

Prometheus, and other monitoring systems, let you configure alerts to detect when a site status changes to offline.

Tip

Monitoring cross-site latency metrics can help you to discover potential issues.

Prerequisites

  • Enable cross-site replication.

Procedure

  1. Configure Data Grid metrics.
  2. Configure alerting rules using the Prometheus metrics format.

    • For the site status, use 1 for online and 0 for offline.
    • For the expr filed, use the following format:
      vendor_cache_manager_default_cache_<cache name>_x_site_admin_<site name>_status.

      In the following example, Prometheus alerts you when the NYC site gets offline for cache named work or sessions.

      groups:
      - name: Cross Site Rules
        rules:
        - alert: Cache Work and Site NYC
          expr: vendor_cache_manager_default_cache_work_x_site_admin_nyc_status == 0
        - alert: Cache Sessions and Site NYC
          expr: vendor_cache_manager_default_cache_sessions_x_site_admin_nyc_status == 0

      The following image shows an alert that the NYC site is offline for cache work.

      Figure 4.1. Prometheus Alert

      prometheus xsite alert