State transfer issue during startup of a new instance joining the cluster in RHDG

Solution Verified - Updated -

Issue

  • A huge number of caches are configured, and the startup of a new instance sometimes fails: the instance never becomes ready, and the management, HTTP, and Hot Rod ports are not listening.
DEBUG [org.infinispan.statetransfer.StateConsumerImpl] Adding inbound state transfer for segments [...] of cache demo
DEBUG [org.infinispan.statetransfer.StateConsumerImpl] Removing no longer owned entries for cache demo
DEBUG [org.infinispan.statetransfer.InboundTransferTask] Finished receiving state for segments [...] of cache demo
DEBUG [org.infinispan.statetransfer.StateConsumerImpl] Finished receiving of segments for cache demo for topology 123.
DEBUG [org.infinispan.statetransfer.StateConsumerImpl] Removing no longer owned entries for cache demo
INFO  [org.jboss.as.clustering.infinispan] DGISPN0001: Started demo cache from clustered container
  • Startup of a new instance fails even though none of our caches wait for the initial state transfer, or they are configured with a proper timeout setting. What is the reason?
  • Startup fails during state transfer after 60 seconds for a cache we do not use. That cache is empty, so it should not cause timeout issues.
ERROR [org.jboss.msc.service.fail] (MSC service thread 1-5) MSC000001: Failed to start service jboss.datagrid-infinispan.clustered.memcachedCache: org.jboss.msc.service.StartException in service jboss.datagrid-infinispan.clustered.memcachedCache: Failed to start service
    at org.jboss.msc.service.ServiceControllerImpl$StartTask.run(ServiceControllerImpl.java:1904) [jboss-msc-1.2.6.Final-redhat-1.jar:1.2.6.Final-redhat-1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_121]
    at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_121]
Caused by: org.infinispan.commons.CacheException: Unable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.start() throws java.lang.Exception on object of type StateTransferManagerImpl
    ...
Caused by: org.infinispan.util.concurrent.TimeoutException: Replication timeout for 192.168.10.2 (flags=0), site-id=s1, rack-id=r2, machine-id=m2)
    ...
  • Startup fails randomly with timeouts, but if it is retried right afterwards the instance starts without any issue. What is the reason for this, and how can it be avoided?

Environment

  • Red Hat Data Grid (RHDG)
