Hi Jeff, Comments inline:
On Mon, Sep 21, 2015 at 6:06 PM, Jeff Wu <wuhai...@gmail.com> wrote:
> Our environment ran in Solr4.7. Recently hit a core recovery failure and
> then it retries to recover from tlog.
>
> We noticed after 20:05:22 said Recovery failed, Solr server waited a long
> time before it started tlog replay. During that time, we have about 32
> cores doing such tlog relay. The service took over 40 minutes to make whole
> service back.
>
> Some questions we want to know:
> 1. Is tlog replay a single thread activity? Can we configure to have
> multiple threads since in our deployment we have 64 cores for each solr
> server.

Each core gets a separate recovery thread but each individual log replay is
single-threaded.

>
> 2. What might cause the tlog replay thread to wait for over 15 minutes
> before actual tlog replay? The actual replay seems very quick.

Before tlog replay, the replica will replicate any missing index files from
the leader. I think that is what is causing the time between the two log
messages. You have INFO logging turned off so there are no messages from the
replication handler about it.

>
> 3. The last message "Log replay finished" does not tell which core it is
> finished. Given 32 cores to recover, we can not know which core the log is
> reporting.

Yeah, many such issues were fixed in recent 5.x releases where we use MDC to
log collection, shard, core etc for each message. Furthermore, tlog replay
progress/status is also logged since 5.0.

>
> 4. We know 4.7 is pretty old, we'd like to know is this known issue and
> fixed in late release, any related JIRA?
>
> Line 4120: ERROR - 2015-09-16 20:05:22.396;
> org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again...
> (0) core=collection3_shard11_replica2
> WARN - 2015-09-16 20:22:50.343;
> org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay
> tlog{file=/mnt/solrdata1/solr/home/collection3_shard11_replica2/data/tlog/tlog.0000000000000120498
> refcount=2} active=true starting pos=25981
> WARN - 2015-09-16 20:22:53.301;
> org.apache.solr.update.UpdateLog$LogReplayer; Log replay finished.
> recoveryInfo=RecoveryInfo{adds=914 deletes=215 deleteByQuery=0 errors=0
> positionOfStart=25981}
>
> Thank you all~

--
Regards,
Shalin Shekhar Mangar.