Hi Jeff,

Comments inline:

On Mon, Sep 21, 2015 at 6:06 PM, Jeff Wu <wuhai...@gmail.com> wrote:
> Our environment ran in Solr4.7. Recently hit a core recovery failure and
> then it retries to recover from tlog.
>
> We noticed after  20:05:22 said Recovery failed, Solr server waited a long
> time before it started tlog replay. During that time, we have about 32
> cores doing such tlog replay. The service took over 40 minutes to make whole
> service back.
>
> Some questions we want to know:
> 1. Is tlog replay a single thread activity? Can we configure to have
> multiple threads since in our deployment we have 64 cores for each solr
> server.

Each core gets a separate recovery thread, but each individual log
replay is single-threaded.

>
> 2. What might cause the tlog replay thread to wait for over 15 minutes
> before actual tlog replay?  The actual replay seems very quick.

Before tlog replay, the replica will replicate any missing index files
from the leader. I think that is what is causing the time between the
two log messages. You have INFO logging turned off so there are no
messages from the replication handler about it.
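If you want to measure that gap for each of the 32 cores, a small script
over the log is enough. Below is a rough Python sketch, not anything that
ships with Solr. It assumes the log line layout matches the excerpt quoted
further down in this mail (with wrapped lines rejoined), and that the core's
directory name in the tlog path equals the core name. The function and
variable names are made up for illustration.

```python
import re
from datetime import datetime

# Two lines copied from the log excerpt in this thread, rejoined onto single lines.
LOG = """\
ERROR - 2015-09-16 20:05:22.396; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (0) core=collection3_shard11_replica2
WARN  - 2015-09-16 20:22:50.343; org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay tlog{file=/mnt/solrdata1/solr/home/collection3_shard11_replica2/data/tlog/tlog.0000000000000120498 refcount=2} active=true starting pos=25981
"""

TS = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})"
# "Recovery failed" lines carry the core name explicitly.
FAILED = re.compile(TS + r";.*Recovery failed.*core=(\S+)")
# "Starting log replay" lines only carry the tlog path; take the directory
# just above data/tlog/ as the core name (an assumption about the layout).
REPLAY = re.compile(TS + r";.*Starting log replay tlog\{file=.*/([^/]+)/data/tlog/")


def parse_ts(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def replay_gaps(log_text):
    """Map core name -> time from the last 'Recovery failed' to 'Starting log replay'."""
    failed_at = {}
    gaps = {}
    for line in log_text.splitlines():
        m = FAILED.search(line)
        if m:
            failed_at[m.group(2)] = parse_ts(m.group(1))
            continue
        m = REPLAY.search(line)
        if m and m.group(2) in failed_at:
            gaps[m.group(2)] = parse_ts(m.group(1)) - failed_at[m.group(2)]
    return gaps


for core, gap in replay_gaps(LOG).items():
    print(core, gap)
```

For the two lines above this reports a gap of about 17.5 minutes for
collection3_shard11_replica2, which is the window I would expect the index
replication to fall into.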

>
> 3. The last message "Log replay finished" does not tell which core it is
> finished. Given 32 cores to recover, we can not know which core the log is
> reporting.

Yeah, many such issues were fixed in recent 5.x releases, where we use
MDC to log the collection, shard, core, etc. for each message. Furthermore,
tlog replay progress/status has also been logged since 5.0.

>
> 4. We know 4.7 is pretty old, we'd like to know is this known issue and
> fixed in late release, any related JIRA?
>
> Line 4120: ERROR - 2015-09-16 20:05:22.396;
> org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again...
> (0) core=collection3_shard11_replica2
> WARN  - 2015-09-16 20:22:50.343;
> org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay
> tlog{file=/mnt/solrdata1/solr/home/collection3_shard11_replica2/data/tlog/tlog.0000000000000120498
> refcount=2} active=true starting pos=25981
> WARN  - 2015-09-16 20:22:53.301;
> org.apache.solr.update.UpdateLog$LogReplayer; Log replay finished.
> recoveryInfo=RecoveryInfo{adds=914 deletes=215 deleteByQuery=0 errors=0
> positionOfStart=25981}
>
> Thank you all~



-- 
Regards,
Shalin Shekhar Mangar.
