I just repeated the procedure, and the effect was the same: I'm an hour in and it's still recovering. I looked at the autoscaling API, but it's configured not to do anything, which makes sense given the previous output.
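(For reference, the check was along these lines, using the read endpoint of the autoscaling API; the hostname is from our environment:

  # dump the live autoscaling configuration; aside from the default
  # .auto_add_replicas trigger we'd expect nothing to be defined
  curl -s "http://solr.node2.metaonly01.eca.local:8983/solr/admin/autoscaling"

)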
One thing I did see, just now:

solr | 2018-12-05 20:02:37.922 INFO  (qtp2131952342-5208) [   x:igls_shard3_replica_n8] o.a.s.h.a.CoreAdminOperation It has been requested that we recover: core=igls_shard3_replica_n8
solr | 2018-12-05 20:02:37.923 ERROR (qtp2131952342-5208) [   x:igls_shard3_replica_n8] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Unable to locate core igls_shard3_replica_n8
solr |   at org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$5(CoreAdminOperation.java:167)
solr |   at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:360)
solr |   at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:395)
solr |   at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:180)
solr |   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
solr |   at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:734)
solr |   at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:715)
solr |   at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:496)
solr |   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)
solr |   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)
solr |   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)
solr |   at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
solr |   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
solr |   at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
solr |   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
solr |   at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
solr |   at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
solr |   at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
solr |   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
solr |   at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
solr |   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
solr |   at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
solr |   at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
solr |   at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
solr |   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
solr |   at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
solr |   at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
solr |   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
solr |   at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
solr |   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
solr |   at org.eclipse.jetty.server.Server.handle(Server.java:531)
solr |   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
solr |   at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
solr |   at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
solr |   at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
solr |   at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
solr |   at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
solr |   at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
solr |   at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
solr |   at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
solr |   at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
solr |   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
solr |   at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
solr |   at java.lang.Thread.run(Thread.java:748)

But I'm not sure why it can't locate the core, because it's there in ${SOLR_HOME} and also exists on hdfs...
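(In case it helps, this is roughly how I'm confirming the core exists in both places; CoreAdmin STATUS is the standard check, and the hdfs path below is only illustrative since the real one depends on our solr.hdfs.home setting:

  # ask the node whether the core is registered and loaded
  curl -s "http://localhost:8983/solr/admin/cores?action=STATUS&core=igls_shard3_replica_n8&wt=json"

  # the instance directory is present under SOLR_HOME on local disk...
  ls "${SOLR_HOME}/igls_shard3_replica_n8"

  # ...and the index data is present on hdfs (example path)
  hdfs dfs -ls /solr/igls/core_node*/data/index

)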
Thanks!

Kyle

On Wed, 5 Dec 2018 at 13:12, lstusr 5u93n4 <lstusr...@gmail.com> wrote:

> Hi Kevin,
>
> We do have logs. Grepping for peersync, I can see:
>
> solr | 2018-12-05 03:31:41.301 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[http://solr.node3.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n8/] nUpdates=100
> solr | 2018-12-05 03:31:41.307 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no versions. sync failed.
> solr | 2018-12-05 03:31:44.007 INFO (coreZkRegister-1-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:iglshistory_shard2_replica_n4 c:iglshistory s:shard2 r:core_node7) [c:iglshistory s:shard2 r:core_node7 x:iglshistory_shard2_replica_n4] o.a.s.u.PeerSync PeerSync: core=iglshistory_shard2_replica_n4 url=http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[http://solr.node4.metaonly01.eca.local:8983/solr/iglshistory_shard2_replica_n6/] nUpdates=100
> solr | 2018-12-05 03:31:44.008 INFO (coreZkRegister-1-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:iglshistory_shard2_replica_n4 c:iglshistory s:shard2 r:core_node7) [c:iglshistory s:shard2 r:core_node7 x:iglshistory_shard2_replica_n4] o.a.s.u.PeerSync PeerSync: core=iglshistory_shard2_replica_n4 url=http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no versions. sync failed.
> solr | 2018-12-05 05:48:44.202 INFO (coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1 r:core_node3 x:igls_shard1_replica_n1] o.a.s.u.PeerSync PeerSync: core=igls_shard1_replica_n1 url=http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[http://solr.node4.metaonly01.eca.local:8983/solr/igls_shard1_replica_n2/] nUpdates=100
> solr | 2018-12-05 05:48:44.646 WARN (coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1 r:core_node3 x:igls_shard1_replica_n1] o.a.s.u.PeerSync PeerSync: core=igls_shard1_replica_n1 url=http://solr.node2.metaonly01.eca.local:8983/solr got a 503 from http://solr.node4.metaonly01.eca.local:8983/solr/igls_shard1_replica_n2/, counting as success
> solr | 2018-12-05 05:48:44.682 INFO (coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1 r:core_node3 x:igls_shard1_replica_n1] o.a.s.u.PeerSync PeerSync: core=igls_shard1_replica_n1 url=http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync succeeded
> solr | 2018-12-05 05:51:18.369 INFO (qtp2131952342-10232) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync PeerSync: core=igls_shard2_replica_n6 url=http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/] nUpdates=100
> solr | 2018-12-05 06:01:18.496 WARN (qtp2131952342-10232) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync Replica did not return a fingerprint - possibly an older Solr version or exception
> solr | 2018-12-05 06:01:31.126 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy Attempting to PeerSync from [http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/] - recoveringAfterStartup=[true]
> solr | 2018-12-05 06:01:31.129 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader PeerSync: core=igls_shard2_replica_n6 url=http://solr.node2.metaonly01.eca.local:8983/solr START leader=http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/ nUpdates=100
> solr | 2018-12-05 06:11:18.535 WARN (qtp2131952342-10232) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync PeerSync: core=igls_shard2_replica_n6 url=http://solr.node2.metaonly01.eca.local:8983/solr exception talking to http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/, failed
> solr | 2018-12-05 06:11:18.583 INFO (qtp2131952342-10232) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync PeerSync: core=igls_shard2_replica_n6 url=http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync failed
> solr |   at org.apache.solr.update.PeerSyncWithLeader.request(PeerSyncWithLeader.java:321)
> solr |   at org.apache.solr.update.PeerSyncWithLeader.alreadyInSync(PeerSyncWithLeader.java:341)
> solr |   at org.apache.solr.update.PeerSyncWithLeader.sync(PeerSyncWithLeader.java:135)
> solr | 2018-12-05 09:21:31.831 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy Attempting to PeerSync from [http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/] - recoveringAfterStartup=[true]
> solr | 2018-12-05 09:21:31.831 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader PeerSync: core=igls_shard2_replica_n6 url=http://solr.node2.metaonly01.eca.local:8983/solr START leader=http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/ nUpdates=100
> solr | 2018-12-05 12:28:26.698 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Fingerprint comparison result: 1
> solr | 2018-12-05 12:28:26.698 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Leader fingerprint: {maxVersionSpecified=9223372036854775807, maxVersionEncountered=1619003108526915584, maxInHash=1619003108526915584, versionsHash=-2647181038600325382, numVersions=701868706, numDocs=701868706, maxDoc=13478974}, Our fingerprint: {maxVersionSpecified=9223372036854775807, maxVersionEncountered=1618994939968880640, maxInHash=1618994939968880640, versionsHash=-2861577316323569427, numVersions=701868706, numDocs=701868706, maxDoc=13478974}
> solr | 2018-12-05 12:30:38.183 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Leader fingerprint {maxVersionSpecified=9223372036854775807, maxVersionEncountered=1619014810634027008, maxInHash=1619014810634027008, versionsHash=-5367247708654169186, numVersions=701868706, numDocs=701868706, maxDoc=13478974}
> solr | 2018-12-05 12:30:38.183 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader PeerSync: core=igls_shard2_replica_n6 url=http://solr.node2.metaonly01.eca.local:8983/solr Received 13 versions from http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/
> solr | 2018-12-05 12:30:38.184 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync PeerSync: core=igls_shard2_replica_n6 url=http://solr.node2.metaonly01.eca.local:8983/solr Our versions are too old. ourHighThreshold=1618976334275411968 otherLowThreshold=1619012608476577792 ourHighest=1619014810634027008 otherHighest=1619014936356192256
> solr | 2018-12-05 12:30:38.185 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader PeerSync: core=igls_shard2_replica_n6 url=http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync failed
> solr | 2018-12-05 12:30:38.185 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy PeerSync Recovery was not successful - trying replication.
>
> Expanding one of them, we have:
>
> solr | 2018-12-05 03:31:39.421 INFO (AutoscalingActionExecutor-8-thread-1) [ ] o.a.s.c.a.ExecutePlanAction No operations to execute for event: {
> solr |   "id":"11402ae38efc8T1dr1vcl6ut11qh9fq6ncle94t",
> solr |   "source":".auto_add_replicas",
> solr |   "eventTime":303476722167752,
> solr |   "eventType":"NODELOST",
> solr |   "properties":{
> solr |     "eventTimes":[303476722167752],
> solr |     "preferredOperation":"movereplica",
> solr |     "_enqueue_time_":303959110445696,
> solr |     "nodeNames":["solr.node4.metaonly01.eca.local:8983_solr"]}}
> solr | 2018-12-05 03:31:41.290 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.c.ShardLeaderElectionContext Enough replicas found to continue.
> solr | 2018-12-05 03:31:41.291 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.c.ShardLeaderElectionContext I may be the new leader - try and sync
> solr | 2018-12-05 03:31:41.291 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.c.SyncStrategy Sync replicas to http://solr.node2.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n10/
> solr | 2018-12-05 03:31:41.301 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[http://solr.node3.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n8/] nUpdates=100
> solr | 2018-12-05 03:31:41.307 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no versions. sync failed.
>
> Not quite sure why it would be executing the auto-add-replicas trigger, since it's definitely set to false on the collection (I checked). Also, this log comes from very early in the startup procedure, probably before Solr had been launched on the other servers.
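>
> (For what it's worth, the check was along these lines; CLUSTERSTATUS is the standard Collections API read and lists autoAddReplicas among the collection's properties:
>
>   curl -s "http://solr.node2.metaonly01.eca.local:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=igls&wt=json"
>
> and that reports autoAddReplicas as "false" for igls.)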
>
> Kyle
>
> On Wed, 5 Dec 2018 at 12:58, Kevin Risden <kris...@apache.org> wrote:
>
>> Do you have logs right before the following?
>>
>> "we notice that the nodes go into "Recovering" state for about 10-12 hours before finally coming alive."
>>
>> Is there a peersync failure or something else in the logs indicating why there is a full recovery?
>>
>> Kevin Risden
>>
>> On Wed, Dec 5, 2018 at 12:53 PM lstusr 5u93n4 <lstusr...@gmail.com> wrote:
>>
>> > Hi All,
>> >
>> > We have a collection:
>> > - solr 7.5
>> > - 3 shards, replication factor 2 for a total of 6 NRT replicas
>> > - 3 servers, 16GB ram each
>> > - 2 billion documents
>> > - autoAddReplicas: false
>> > - 2.1 TB on-disk index size
>> > - index stored on hdfs on separate servers.
>> >
>> > If we (gracefully) shut down solr on all 3 servers, when we re-launch solr we notice that the nodes go into "Recovering" state for about 10-12 hours before finally coming alive.
>> >
>> > During this recovery time, we notice high network traffic outbound from our HDFS servers to our solr servers, the sum total of which is roughly equivalent to the index size on disk.
>> >
>> > So it seems to us that on startup, solr has to re-read the entire index before coming back alive.
>> >
>> > 1. is this assumption correct?
>> > 2. is there any way to mitigate this, so that solr can launch faster?
>> >
>> > Thanks!
>> >
>> > Kyle