Chapter 19. Running on Cloud Services

In order to turn on Cloud support for Red Hat Data Grid library mode, one needs to add a new dependency to the classpath:

Cloud support in library mode


Replace ${version.infinispan} with the appropriate version of Red Hat Data Grid.

The above dependency adds infinispan-core to the classpath as well as some default configurations.

19.1. Generic Discovery protocols

The main difference between running Red Hat Data Grid in a private environment and a cloud provider is that in the latter node discovery becomes a bit trickier because things like multicast don’t work. To circumvent this you can use alternate JGroups PING protocols. Before delving into the cloud-specific, lets look at some generic discovery protocols.

19.1.1. TCPPing

The TCPPing approach contains a static list of the IP address of each member of the cluster in the JGroups configuration file. While this works it doesn’t really help when cluster nodes are dynamically added to the cluster.

Sample TCPPing configuration

      <TCP bind_port="7800" />
      <TCPPING timeout="3000"

See JGroups TCPPING for more information about TCPPing.

19.1.2. GossipRouter

Another approach is to have a central server (Gossip, which each node will be configured to contact. This central server will tell each node in the cluster about each other node.

The address (ip:port) that the Gossip router is listening on can be injected into the JGroups configuration used by Red Hat Data Grid. To do this pass the gossip routers address as a system property to the JVM e.g. -DGossipRouterAddress="[12001]" and reference this property in the JGroups configuration that Red Hat Data Grid is using e.g.

Sample TCPGOSSIP configuration

    <TCP bind_port="7800" />
    <TCPGOSSIP timeout="3000" initial_hosts="${GossipRouterAddress}" num_initial_members="3" />

More on Gossip Router @

19.2. Amazon Web Services

When running on Amazon Web Service (AWS) platform and similar cloud based environment you can use the S3_PING protocol for discovery.


You can configure your JGroups instances to use a shared storage to exchange the details of the cluster nodes. NATIVE_S3_PING allows Amazon S3 to be used as the shared storage. Be sure that you have signed up for Amazon S3 as well as EC2 to use this method.

Sample NATIVE_S3_PING configuration

    <TCP bind_port="7800" />
            region_name="replace this with your region (e.g. eu-west-1)"
            bucket_name="replace this with your bucket name"
            bucket_prefix="replace this with a prefix to use for entries in the bucket (optional)" />

19.3.1. JDBC_PING

A similar approach to S3_PING, but using a JDBC connection to a shared database. On EC2 that is quite easy using Amazon RDS. See the JDBC_PING Wiki page for details.

19.4. Microsoft Azure

Red Hat Data Grid can be used on the Azure platform. Aside from using TCP_PING or GossipRouter, there is an Azure-specific discovery protocol:

19.4.1. AZURE_PING

AZURE_PING uses a shared Azure Blob Storage to store discovery information. Configuration is as follows:

	storage_account_name="replace this with your account name"
	storage_access_key="replace this with your access key"
	container="replace this with your container name"

19.5. Google Compute Engine

Red Hat Data Grid can be used on the Google Compute Engine (GCE) platform. Aside from using TCP_PING or GossipRouter, there is a GCE-specific discovery protocol:


GOOGLE_PING uses Google Cloud Storage (GCS) to store information about the cluster members.

<protocol type="GOOGLE_PING">
  <property name="location">The name of the bucket</property>
  <property name="access_key">The access key</property>
  <property name="secret_access_key">The secret access key</property>

19.6. Kubernetes

Red Hat Data Grid in Kubernetes environments, such as OKD or OpenShift, can use Kube_PING or [DNS_PING] for cluster discovery.

19.6.1. Kube_PING

The JGroups Kube_PING protocol uses the following configuration:

Example KUBE_PING configuration

    <TCP bind_addr="${match-interface:eth.*}" />
    <kubernetes.KUBE_PING />

The most important thing is to bind JGroups to eth0 interface, which is used by Docker containers for network communication.

KUBE_PING protocol is configured by environmental variables (which should be available inside a container). The most important thing is to set KUBERNETES_NAMESPACE to proper namespace. It might be either hardcoded or populated via Kubernetes' Downward API.

Since KUBE_PING uses Kubernetes API for obtaining available Pods, OpenShift requires adding additional privileges. Assuming that oc project -q returns current namespace and default is the service account name, one needs to run:

Adding additional OpenShift privileges

oc policy add-role-to-user view system:serviceaccount:$(oc project -q):default -n $(oc project -q)

After performing all above steps, the clustering should be enabled and all Pods should automatically form a cluster within a single namespace.

19.6.2. DNS_PING

The JGroups DNS_PING protocol uses the following configuration:

Example DNS_PING configuration

<stack name="dns-ping">
      dns_query="myservice.myproject.svc.cluster.local" />

DNS_PING runs the specified query against the DNS server to get the list of cluster members.

For information about creating DNS entries for nodes in a cluster, see DNS for Services and Pods.

19.6.3. Using Kubernetes and OpenShift Rolling Updates

Since Pods in Kubernetes and OpenShift are immutable, the only way to alter the configuration is to roll out a new deployment. There are several different strategies to do that but we suggest using Rolling Updates.

An example Deployment Configuration (Kubernetes uses very similar concept called Deployment) looks like the following:

DeploymentConfiguration for Rolling Updates

- apiVersion: v1
  kind: DeploymentConfig
    name: infinispan-cluster
    replicas: 3
      type: Rolling
        updatePeriodSeconds: 10
        intervalSeconds: 20
        timeoutSeconds: 600
        maxUnavailable: 1
        maxSurge: 1
        - args:
          - -Djboss.default.jgroups.stack=kubernetes
          image: jboss/infinispan-server:latest
          name: infinispan-server
          - containerPort: 8181
            protocol: TCP
          - containerPort: 9990
            protocol: TCP
          - containerPort: 11211
            protocol: TCP
          - containerPort: 11222
            protocol: TCP
          - containerPort: 57600
            protocol: TCP
          - containerPort: 7600
            protocol: TCP
          - containerPort: 8080
            protocol: TCP
          - name: KUBERNETES_NAMESPACE
            valueFrom: {fieldRef: {apiVersion: v1, fieldPath: metadata.namespace}}
          terminationMessagePath: /dev/termination-log
          terminationGracePeriodSeconds: 90
              - /usr/local/bin/
            initialDelaySeconds: 10
            timeoutSeconds: 80
            periodSeconds: 60
            successThreshold: 1
            failureThreshold: 5
                - /usr/local/bin/
             initialDelaySeconds: 10
             timeoutSeconds: 40
             periodSeconds: 30
             successThreshold: 2
             failureThreshold: 5

It is also highly recommended to adjust the JGroups stack to discover new nodes (or leaves) more quickly. One should at least adjust the value of FD_ALL timeout and adjust it to the longest GC Pause.

Other hints for tuning configuration parameters are:

  • OpenShift should replace running nodes one by one. This can be achieved by adjusting rollingParams (maxUnavailable: 1 and maxSurge: 1).
  • Depending on the cluster size, one needs to adjust updatePeriodSeconds and intervalSeconds. The bigger cluster size is, the bigger those values should be used.
  • When using Initial State Transfer, the initialDelaySeconds value for both probes should be set to higher value.
  • During Initial State Transfer nodes might not respond to probes. The best results are achieved with higher values of failureThreshold and successThreshold values.

19.6.4. Rolling upgrades with Kubernetes and OpenShift

Even though Rolling Upgrades and Rolling Update may sound similarly, they mean different things. The Rolling Update is a process of replacing old Pods with new ones. In other words it is a process of rolling out new version of an application. A typical example is a configuration change. Since Pods are immutable, Kubernetes/OpenShift needs to replace them one by one in order to use the updated configuration bits. On the other hand the Rolling Upgrade is a process of migrating data from one Red Hat Data Grid cluster to another one. A typical example is migrating from one version to another.

For both Kubernetes and OpenShift, the Rolling Upgrade procedure is almost the same. It is based on a standard Rolling Upgrade procedure with small changes.

Key differences when upgrading using OpenShift/Kubernetes are:

  • Depending on configuration, it is a good practice to use OpenShift Routes or Kubernetes Ingress API to expose services to the clients. During the upgrade the Route (or Ingress) used by the clients can be altered to point to the new cluster.
  • Invoking CLI commands can be done by using Kubernetes (kubectl exec) or OpenShift clients (oc exec). Here is an example: oc exec <POD_NAME> — '/opt/datagrid/bin/' '-c' '--controller=$(hostname -i):9990' '/subsystem=datagrid-infinispan/cache-container=clustered/distributed-cache=default:disconnect-source(migrator-name=hotrod)'

Key differences when upgrading using the library mode:

  • Client application needs to expose JMX. It usually depends on application and environment type but the easiest way to do it is to add the following switches into the Java boostrap script<PORT>.
  • Connecting to the JMX can be done by forwarding ports. With OpenShift this might be achieved by using oc port-forward command whereas in Kubernetes by kubectl port-forward.

The last step in the Rolling Upgrade (removing a Remote Cache Store) needs to be performed differently. We need to use Kubernetes/OpenShift Rolling update command and replace Pods configuration with the one which does not contain Remote Cache Store.

A detailed instruction might be found in ISPN-6673 ticket.