State transfer issue during startup of a new instance joining the cluster in RHDG
Issue
- A large number of caches are configured, and the startup of a new instance sometimes fails: the instance never becomes ready, and the management, HTTP, and Hot Rod ports are not listening.
DEBUG [org.infinispan.statetransfer.StateConsumerImpl] Adding inbound state transfer for segments [...] of cache demo
DEBUG [org.infinispan.statetransfer.StateConsumerImpl] Removing no longer owned entries for cache demo
DEBUG [org.infinispan.statetransfer.InboundTransferTask] Finished receiving state for segments [...] of cache demo
DEBUG [org.infinispan.statetransfer.StateConsumerImpl] Finished receiving of segments for cache demo for topology 123.
DEBUG [org.infinispan.statetransfer.StateConsumerImpl] Removing no longer owned entries for cache demo
INFO [org.jboss.as.clustering.infinispan] DGISPN0001: Started demo cache from clustered container
- Startup of a new instance fails even though every cache either does not wait for the initial state transfer or is configured with an appropriate timeout. What is the reason?
- Startup fails during state transfer after 60 seconds for a cache we do not use; it is empty and should not cause timeout issues.
ERROR [org.jboss.msc.service.fail] (MSC service thread 1-5) MSC000001: Failed to start service jboss.datagrid-infinispan.clustered.memcachedCache: org.jboss.msc.service.StartException in service jboss.datagrid-infinispan.clustered.memcachedCache: Failed to start service
at org.jboss.msc.service.ServiceControllerImpl$StartTask.run(ServiceControllerImpl.java:1904) [jboss-msc-1.2.6.Final-redhat-1.jar:1.2.6.Final-redhat-1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_121]
Caused by: org.infinispan.commons.CacheException: Unable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.start() throws java.lang.Exception on object of type StateTransferManagerImpl
...
Caused by: org.infinispan.util.concurrent.TimeoutException: Replication timeout for 192.168.10.2 (flags=0), site-id=s1, rack-id=r2, machine-id=m2)
...
- Startup fails randomly with timeouts, but a retry immediately afterwards succeeds without any issue. What is the reason for this, and how can it be avoided?
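When many caches are configured, it can be hard to see from the DEBUG output which cache was still receiving state when the timeout hit. The following is a minimal sketch, not part of RHDG, that scans log lines for the "Adding inbound state transfer" and "Finished receiving of segments" messages shown in the first excerpt and reports caches that started a transfer but never logged its completion. It assumes DEBUG logging is enabled for org.infinispan.statetransfer, that the message wording matches the excerpt exactly, and that cache names contain no whitespace. The function name is hypothetical.

```python
import re

# Patterns copied from the DEBUG messages in the log excerpt; the exact
# wording is an assumption and may differ between RHDG versions.
STARTED = re.compile(r"Adding inbound state transfer for segments .* of cache (\S+)")
FINISHED = re.compile(r"Finished receiving of segments for cache (\S+) for topology")


def caches_with_unfinished_transfer(lines):
    """Return the sorted names of caches that logged a transfer start
    without a later 'Finished receiving of segments' message."""
    pending = set()
    for line in lines:
        started = STARTED.search(line)
        if started:
            pending.add(started.group(1))
            continue
        finished = FINISHED.search(line)
        if finished:
            pending.discard(finished.group(1))
    return sorted(pending)
```

To use it, pass the lines of the server log of the instance that failed to start, for example the result of `open(path).readlines()` with `path` pointing at that log file. An empty result means every transfer that was logged as started was also logged as finished.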
Environment
- Red Hat Data Grid (RHDG)