Our environment runs Solr 4.7. We recently hit a core recovery failure, after which Solr retried recovery from the tlog.
We noticed that after the "Recovery failed" message at 20:05:22, the Solr server waited a long time before it started the tlog replay. During that time, about 32 cores were going through the same tlog replay. It took over 40 minutes for the whole service to come back.

Our questions:

1. Is tlog replay a single-threaded activity? Can we configure it to use multiple threads, since in our deployment we have 64 cores on each Solr server?
2. What might cause the tlog replay thread to wait for over 15 minutes before the actual replay starts? The actual replay seems very quick (about 3 seconds in the log below).
3. The last message, "Log replay finished", does not say which core it finished for. With 32 cores recovering, we cannot tell which core the message refers to.
4. We know 4.7 is pretty old. Is this a known issue that has been fixed in a later release? Is there a related JIRA?

Log excerpt:

Line 4120: ERROR - 2015-09-16 20:05:22.396; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (0) core=collection3_shard11_replica2
WARN - 2015-09-16 20:22:50.343; org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay tlog{file=/mnt/solrdata1/solr/home/collection3_shard11_replica2/data/tlog/tlog.0000000000000120498 refcount=2} active=true starting pos=25981
WARN - 2015-09-16 20:22:53.301; org.apache.solr.update.UpdateLog$LogReplayer; Log replay finished. recoveryInfo=RecoveryInfo{adds=914 deletes=215 deleteByQuery=0 errors=0 positionOfStart=25981}

Thank you all~