Hi Kevin, We do have logs. Grepping for peersync, I can see
solr | 2018-12-05 03:31:41.301 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url= http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[ http://solr.node3.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n8/] nUpdates=100 solr | 2018-12-05 03:31:41.307 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url= http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no versions. sync failed. solr | 2018-12-05 03:31:44.007 INFO (coreZkRegister-1-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:iglshistory_shard2_replica_n4 c:iglshistory s:shard2 r:core_node7) [c:iglshistory s:shard2 r:core_node7 x:iglshistory_shard2_replica_n4] o.a.s.u.PeerSync PeerSync: core=iglshistory_shard2_replica_n4 url= http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[ http://solr.node4.metaonly01.eca.local:8983/solr/iglshistory_shard2_replica_n6/] nUpdates=100 solr | 2018-12-05 03:31:44.008 INFO (coreZkRegister-1-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:iglshistory_shard2_replica_n4 c:iglshistory s:shard2 r:core_node7) [c:iglshistory s:shard2 r:core_node7 x:iglshistory_shard2_replica_n4] o.a.s.u.PeerSync PeerSync: core=iglshistory_shard2_replica_n4 url= http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no versions. sync failed. solr | 2018-12-05 05:48:44.202 INFO (coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1 r:core_node3 x:igls_shard1_replica_n1] o.a.s.u.PeerSync PeerSync: core=igls_shard1_replica_n1 url= http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[ http://solr.node4.metaonly01.eca.local:8983/solr/igls_shard1_replica_n2/] nUpdates=100 solr | 2018-12-05 05:48:44.646 WARN (coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1 r:core_node3 x:igls_shard1_replica_n1] o.a.s.u.PeerSync PeerSync: core=igls_shard1_replica_n1 url= http://solr.node2.metaonly01.eca.local:8983/solr got a 503 from http://solr.node4.metaonly01.eca.local:8983/solr/igls_shard1_replica_n2/, counting as success solr | 2018-12-05 05:48:44.682 INFO (coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1 r:core_node3 x:igls_shard1_replica_n1] o.a.s.u.PeerSync PeerSync: core=igls_shard1_replica_n1 url= http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync succeeded solr | 2018-12-05 05:51:18.369 INFO (qtp2131952342-10232) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync PeerSync: core=igls_shard2_replica_n6 url= http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[ http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/] nUpdates=100 solr | 2018-12-05 06:01:18.496 WARN (qtp2131952342-10232) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync Replica did not return a fingerprint - possibly an older Solr version or exception solr | 2018-12-05 06:01:31.126 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy Attempting to PeerSync from [ http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/] - recoveringAfterStartup=[true] solr | 2018-12-05 06:01:31.129 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader PeerSync: core=igls_shard2_replica_n6 url= http://solr.node2.metaonly01.eca.local:8983/solr START leader= http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/ nUpdates=100 solr | 2018-12-05 06:11:18.535 WARN (qtp2131952342-10232) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync PeerSync: core=igls_shard2_replica_n6 url= http://solr.node2.metaonly01.eca.local:8983/solr exception talking to http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/, failed solr | 2018-12-05 06:11:18.583 INFO (qtp2131952342-10232) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync PeerSync: core=igls_shard2_replica_n6 url= http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync failed solr | at org.apache.solr.update.PeerSyncWithLeader.request(PeerSyncWithLeader.java:321) solr | at org.apache.solr.update.PeerSyncWithLeader.alreadyInSync(PeerSyncWithLeader.java:341) solr | at org.apache.solr.update.PeerSyncWithLeader.sync(PeerSyncWithLeader.java:135) solr | 2018-12-05 09:21:31.831 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy Attempting to PeerSync from [ http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/] - recoveringAfterStartup=[true] solr | 2018-12-05 09:21:31.831 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader PeerSync: core=igls_shard2_replica_n6 url= http://solr.node2.metaonly01.eca.local:8983/solr START leader= http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/ nUpdates=100 solr | 2018-12-05 12:28:26.698 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Fingerprint comparison result: 1 solr | 2018-12-05 12:28:26.698 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Leader fingerprint: {maxVersionSpecified=9223372036854775807, maxVersionEncountered=1619003108526915584, maxInHash=1619003108526915584, versionsHash=-2647181038600325382, numVersions=701868706, numDocs=701868706, maxDoc=13478974}, Our fingerprint: {maxVersionSpecified=9223372036854775807, maxVersionEncountered=1618994939968880640, maxInHash=1618994939968880640, versionsHash=-2861577316323569427, numVersions=701868706, numDocs=701868706, maxDoc=13478974} solr | 2018-12-05 12:30:38.183 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Leader fingerprint {maxVersionSpecified=9223372036854775807, maxVersionEncountered=1619014810634027008, maxInHash=1619014810634027008, versionsHash=-5367247708654169186, numVersions=701868706, numDocs=701868706, maxDoc=13478974} solr | 2018-12-05 12:30:38.183 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader PeerSync: core=igls_shard2_replica_n6 url= http://solr.node2.metaonly01.eca.local:8983/solr Received 13 versions from http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/ solr | 2018-12-05 12:30:38.184 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync PeerSync: core=igls_shard2_replica_n6 url= http://solr.node2.metaonly01.eca.local:8983/solr Our versions are too old. ourHighThreshold=1618976334275411968 otherLowThreshold=1619012608476577792 ourHighest=1619014810634027008 otherHighest=1619014936356192256 solr | 2018-12-05 12:30:38.185 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader PeerSync: core=igls_shard2_replica_n6 url= http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync failed solr | 2018-12-05 12:30:38.185 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy PeerSync Recovery was not successful - trying replication. Expanding one of them, we have: solr | 2018-12-05 03:31:39.421 INFO (AutoscalingActionExecutor-8-thread-1) [ ] o.a.s.c.a.ExecutePlanAction No operations to execute for event: { solr | "id":"11402ae38efc8T1dr1vcl6ut11qh9fq6ncle94t", solr | "source":".auto_add_replicas", solr | "eventTime":303476722167752, solr | "eventType":"NODELOST", solr | "properties":{ solr | "eventTimes":[303476722167752], solr | "preferredOperation":"movereplica", solr | "_enqueue_time_":303959110445696, solr | "nodeNames":["solr.node4.metaonly01.eca.local:8983_solr"]}} solr | 2018-12-05 03:31:41.290 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.c.ShardLeaderElectionContext Enough replicas found to continu e. solr | 2018-12-05 03:31:41.291 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.c.ShardLeaderElectionContext I may be the new leader - try an d sync solr | 2018-12-05 03:31:41.291 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.c.SyncStrategy Sync replicas to http://solr.node2.metaonly01. eca.local:8983/solr/iglshistory_shard3_replica_n10/ solr | 2018-12-05 03:31:41.301 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url= http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[ http://solr.node3.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n8/] nUpdates=100 solr | 2018-12-05 03:31:41.307 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url= http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no versions. sync failed. Not quite sure why it would be executing the auto-add-replicas trigger since it's definitely set to false on the collection. (I checked). Also, this log comes from very early in the startup procedure, probably before solr has been launched on the other servers. Kyle On Wed, 5 Dec 2018 at 12:58, Kevin Risden <kris...@apache.org> wrote: > Do you have logs right before the following? > > "we notice that the nodes go into "Recovering" state for about 10-12 hours > before finally coming alive." > > Is there a peersync failure or something else in the logs indicating why > there is a full recovery? > > Kevin Risden > > > On Wed, Dec 5, 2018 at 12:53 PM lstusr 5u93n4 <lstusr...@gmail.com> wrote: > > > Hi All, > > > > We have a collection: > > - solr 7.5 > > - 3 shards, replication factor 2 for a total of 6 NRT replicas > > - 3 servers, 16GB ram each > > - 2 billion documents > > - autoAddReplicas: false > > - 2.1 TB on-disk index size > > - index stored on hdfs on separate servers. > > > > If we (gracefully) shut down solr on all 3 servers, when we re-launch > solr > > we notice that the nodes go into "Recovering" state for about 10-12 hours > > before finally coming alive. > > > > During this recovery time, we notice high network traffic outbound from > our > > HDFS servers to our solr servers. The sum total of which is roughly > > equivalent to the index size on disk. > > > > So it seems to us that on startup, solr has to re-read the entire index > > before coming back alive. > > > > 1. is this assumption correct? > > 2. is there any way to mitigate this, so that solr can launch faster? > > > > Thanks! > > > > Kyle > > >