Our environment runs Solr 4.7. We recently hit a core recovery failure,
after which Solr tried to recover from the tlog.

We noticed that after the "Recovery failed" message at 20:05:22, the Solr
server waited a long time before it started the tlog replay. During that
time, about 32 cores were going through the same tlog replay. It took over
40 minutes for the whole service to come back.

Some questions:
1. Is tlog replay a single-threaded activity? Can we configure it to use
multiple threads, given that each Solr server in our deployment has 64
cores?

2. What might cause the tlog replay thread to wait for over 15 minutes
before the actual tlog replay starts? The replay itself seems very quick.

3. The last message, "Log replay finished", does not say which core it
refers to. With 32 cores recovering, we cannot tell which core the message
is reporting on.

4. We know 4.7 is pretty old. Is this a known issue that has been fixed in
a later release? Is there a related JIRA?

Line 4120: ERROR - 2015-09-16 20:05:22.396; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (0) core=collection3_shard11_replica2
WARN  - 2015-09-16 20:22:50.343; org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay tlog{file=/mnt/solrdata1/solr/home/collection3_shard11_replica2/data/tlog/tlog.0000000000000120498 refcount=2} active=true starting pos=25981
WARN  - 2015-09-16 20:22:53.301; org.apache.solr.update.UpdateLog$LogReplayer; Log replay finished. recoveryInfo=RecoveryInfo{adds=914 deletes=215 deleteByQuery=0 errors=0 positionOfStart=25981}
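
As a stopgap for question 3, we could post-process the log ourselves. Below is a rough Python sketch, not something Solr provides. It rests on two assumptions of ours: that the directory name in the tlog path equals the core name, and that a "Log replay finished" entry can be paired with its "Starting log replay" entry by matching positionOfStart against starting pos. The second is only a heuristic and could mispair entries if two cores happen to start at the same position. The function and variable names are made up for illustration, and each log entry is assumed to sit on a single line, as it does in the real log file.

```python
import re
from datetime import datetime

# The three entries from the excerpt above, one entry per line.
SAMPLE_LOG = """\
ERROR - 2015-09-16 20:05:22.396; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (0) core=collection3_shard11_replica2
WARN  - 2015-09-16 20:22:50.343; org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay tlog{file=/mnt/solrdata1/solr/home/collection3_shard11_replica2/data/tlog/tlog.0000000000000120498 refcount=2} active=true starting pos=25981
WARN  - 2015-09-16 20:22:53.301; org.apache.solr.update.UpdateLog$LogReplayer; Log replay finished. recoveryInfo=RecoveryInfo{adds=914 deletes=215 deleteByQuery=0 errors=0 positionOfStart=25981}
"""

TS = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3});"
FAILED = re.compile(TS + r".*Recovery failed.*core=(\S+)")
# Assumption: the directory just above data/tlog/ is named after the core.
START = re.compile(
    TS + r".*Starting log replay tlog\{file=\S*/([^/\s]+)/data/tlog/\S+"
    r".*starting pos=(\d+)"
)
FINISH = re.compile(TS + r".*Log replay finished\..*positionOfStart=(\d+)")


def parse_ts(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def summarize(log_text):
    """Return (core, seconds waited before replay, replay seconds) tuples."""
    last_failure = {}  # core -> time of its latest "Recovery failed"
    open_replays = {}  # starting pos -> (core, replay start time)
    results = []
    for line in log_text.splitlines():
        m = FAILED.search(line)
        if m:
            last_failure[m.group(2)] = parse_ts(m.group(1))
            continue
        m = START.search(line)
        if m:
            open_replays[m.group(3)] = (m.group(2), parse_ts(m.group(1)))
            continue
        m = FINISH.search(line)
        if m and m.group(2) in open_replays:
            # Heuristic pairing: positionOfStart equals the starting pos.
            core, started = open_replays.pop(m.group(2))
            finished = parse_ts(m.group(1))
            failed = last_failure.get(core)
            wait = (started - failed).total_seconds() if failed else None
            results.append((core, wait, (finished - started).total_seconds()))
    return results


if __name__ == "__main__":
    for core, wait, replay in summarize(SAMPLE_LOG):
        print(core, wait, replay)
```

Run over the full log, this would list for each core how long it sat between the failed recovery and the start of replay, and how long the replay itself took, which is the gap we are asking about in question 2.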

Thank you all~
