State-transfer issue during startup of a new instance joining the cluster in RHDG

Solution Verified - Updated -

Issue

  • A large number of caches are configured, and sometimes the startup of a new instance fails: the instance never becomes ready, and the management, HTTP, and Hot Rod ports are not listening.
DEBUG [org.infinispan.statetransfer.StateConsumerImpl] Adding inbound state transfer for segments [...] of cache demo
DEBUG [org.infinispan.statetransfer.StateConsumerImpl] Removing no longer owned entries for cache demo
DEBUG [org.infinispan.statetransfer.InboundTransferTask] Finished receiving state for segments [...] of cache demo
DEBUG [org.infinispan.statetransfer.StateConsumerImpl] Finished receiving of segments for cache demo for topology 123.
DEBUG [org.infinispan.statetransfer.StateConsumerImpl] Removing no longer owned entries for cache demo
INFO  [org.jboss.as.clustering.infinispan] DGISPN0001: Started demo cache from clustered container
  • Startup of a new instance fails even though every cache either does not wait for the initial state transfer or is configured with an appropriate timeout. What is the reason?
  • Startup fails during state transfer after 60 seconds for a cache we do not use; that cache is empty and should not cause timeout issues.
ERROR [org.jboss.msc.service.fail] (MSC service thread 1-5) MSC000001: Failed to start service jboss.datagrid-infinispan.clustered.memcachedCache: org.jboss.msc.service.StartException in service jboss.datagrid-infinispan.clustered.memcachedCache: Failed to start service
    at org.jboss.msc.service.ServiceControllerImpl$StartTask.run(ServiceControllerImpl.java:1904) [jboss-msc-1.2.6.Final-redhat-1.jar:1.2.6.Final-redhat-1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_121]
    at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_121]
Caused by: org.infinispan.commons.CacheException: Unable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.start() throws java.lang.Exception on object of type StateTransferManagerImpl
    ...
Caused by: org.infinispan.util.concurrent.TimeoutException: Replication timeout for 192.168.10.2 (flags=0), site-id=s1, rack-id=r2, machine-id=m2)
    ...
  • Startup fails intermittently with timeouts, but a retry immediately afterwards succeeds without any issue. What is the reason, and how can it be avoided?

Environment

  • Red Hat Data Grid (RHDG)
