Automatically Taking Sites Offline with Asynchronous Cross-Site Replication in a JDG Cluster

Solution Verified - Updated -

Environment

  • Red Hat JBoss Data Grid (JDG)
    • 7.3.2+

Issue

  • How can I tell whether a site is taken offline automatically when the after-failures parameter is configured?
  • How can I tell whether a site is taken offline automatically when the min-wait parameter is configured?

Resolution

Data Grid applies the take-offline configuration when cross-site replication is in use.
The following configuration is an example that takes a site offline automatically 20 seconds after the first failed backup operation:

<backups>
  <backup site="site01" strategy="ASYNC">
    <take-offline after-failures="-1" min-wait="20000"/>
  </backup>
</backups>
  • after-failures - the number of failed backup operations after which this site is taken offline. Defaults to 0 (never). With a negative value, the site is taken offline once min-wait (logged as minTimeToWait) has elapsed.

  • min-wait - the minimum time, in milliseconds, during which a site is not marked offline even if it has been unreachable after-failures times. If less than or equal to 0, only after-failures is considered.
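
The two parameters can also be combined. The following fragment is a minimal sketch that reuses the site name from the example above; the values 5 and 20000 are illustrative assumptions, not recommendations. Going by the parameter descriptions above, the site would be taken offline once at least 5 backup operations have failed, but not before 20 seconds have elapsed:

```xml
<backups>
  <backup site="site01" strategy="ASYNC">
    <!-- Illustrative values: at least 5 failed backup operations,
         and at least 20000 ms of waiting, before going offline -->
    <take-offline after-failures="5" min-wait="20000"/>
  </backup>
</backups>
```

Setting min-wait to 0 in this fragment would leave after-failures as the only condition.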

NOTE: Automatically taking sites offline with strategy="ASYNC" is available only in JDG 7.3.2 and later; earlier releases apply the take-offline configuration only with strategy="SYNC".

Diagnostic Steps

Enable TRACE level log messages for the class org.infinispan.xsite.OfflineStatus.
When using the after-failures parameter, search the server.log file for "min failures reached", as follows:
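
One way to enable this, assuming JDG runs in server mode with the standard logging subsystem, is to add a logger entry to the server configuration file (for example clustered.xml; the file name depends on the deployment and is an assumption here):

```xml
<!-- Place inside the existing logging <subsystem> element of the server configuration.
     Assumes the file handler that writes server.log does not filter out TRACE messages. -->
<logger category="org.infinispan.xsite.OfflineStatus">
    <level name="TRACE"/>
</logger>
```
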

2019-07-12 16:08:09,654 TRACE [org.infinispan.xsite.OfflineStatus] (jgroups-45,jdg-d-cachesrv-01) Site is failed: min failures reached.
2019-07-12 16:08:09,654 INFO  [org.infinispan.CLUSTER] (jgroups-45,jdg-d-cachesrv-01) [Context=api-general-filestore][Context=jdg-d-cachesrv-01]ISPN100006: Site 'site02' is offline.

When using the min-wait parameter, search the server.log file for "The minTimeToWait has passed", as follows:

2019-07-15 15:11:36,371 TRACE [org.infinispan.xsite.OfflineStatus] (HotRod-ServerHandler-7-56) The minTimeToWait has passed: minTime=20000, timeSinceFirstFailure=38378

Infinispan updates the site status only when it needs to replicate data to the backup site (for example, on a put operation), so these messages appear only while the cache is receiving writes.
