Hey Erick,

Some thoughts:
> Solr should _not_ have to replicate the index or go into peer sync on
> startup.

Okay, that's good to know! It tells us that this is a problem that can be solved.

> are you stopping indexing before you shut your servers down?

By indexing, you mean adding new entries to Solr, right? We haven't explicitly stopped it, but it's very quiet (a rate of about 1 doc per minute).

> Be very sure you have passed your autocommit interval after you've
> stopped indexing and before you stop Solr.

This we have not done. Autocommit is set fairly long on our system (5 minutes), so I'll do another test where I completely quiesce first; I've sketched the sequence I plan to use below.

> How are you shutting down? bin/solr stop is the proper way, and you
> should NOT be seeing any kind of message about force killing the Solr
> process. There are timeouts in solr.in.sh (unix) that you can lengthen if
> you are seeing any message about force killing.

Nothing about force killing shows up in the logs, but I'll extend the timeouts anyways and monitor the next shutdown (yes, with bin/solr stop). One thought: could it be that we're waiting TOO LONG between stopping the Solr processes on the different servers? That is, by stopping Solr on server 1, are we pushing servers 2 and 3 into a non-stable state, so that they're not perfectly "green" by the time their own shutdown happens?

> Bring all the nodes back up before you start indexing again.

If an entry were to trickle in before the last node came online, could that cause this behaviour? I'll block indexing anyways and see.

> With the above, you should not see peer sync on startup. Of course you
> have to have all your Solr nodes in the "active" state before shutting
> down.
>
> That said, some observations:
>
> your indexes are very large. Very very large given your hardware. You
> have roughly 700M docs per replica, I'm not quite sure whether your 2.1T
> is aggregate across all 6 replicas or not. Even if so (and that's the best
> case), each replica is roughly 350G in size. You have, at best, 700G per
> server trying to run with 16G of RAM. This is extremely undersized in my
> experience.

Your math is good. :D The replicas range between 337G and 355G in size. That said, I'm curious why you say this is undersized. We're optimized for indexing speed, not query speed. Given the amount of data we want to hold, we have to accept that the full index will never fit in RAM (we're not going to provision 2-3 TB of RAM on these servers), and our query patterns are unpredictable, so we don't benefit much from caching: some data will have to be read from disk on nearly every query. Queries on this system DO take a while (around 10 seconds typically, up to 100 seconds at the worst), but that's acceptable for our use case. Resource-wise, each server has 6 GB allocated to Solr's JVM, and we're not pushing that beyond about 50% utilization (JVM memory usage hovers around 3 GB, typically). So, do we really gain anything by adding more hardware?

> I suspect you're hitting a lot of GC. What kind of GC observations have
> you made?

None, yet. Given that the JVM isn't pushing its limits, we haven't profiled the GC. We certainly will one day, to squeeze more performance out of the system, but we haven't gone down that road so far.

> I would turn off autoscaling, your index sizes and hardware are too
> small to even consider automatically moving that much data around.

Yeah, absolutely. We explicitly set autoAddReplicas to false on the collection, and we have no other triggers set up.
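For the record, here's the shutdown sequence I'm planning to script for the next test (the "sketched below" bit from above). It's only a rough sketch against our own setup: igls and iglshistory are our collections, it runs on one of the Solr nodes against localhost, and the grep is just so I can eyeball the replica states; please shout if I've misread anything.

  #!/bin/bash
  # Pre-shutdown checklist for the next restart test.
  # Assumes indexing has already been blocked upstream.
  SOLR=http://localhost:8983/solr

  for COLLECTION in igls iglshistory; do
    # 1. Hard commit now instead of waiting out the 5-minute autoCommit.
    curl -s "$SOLR/$COLLECTION/update?commit=true&waitSearcher=true"

    # 2. Every replica should report "active" before we touch any node.
    curl -s "$SOLR/admin/collections?action=CLUSTERSTATUS&collection=$COLLECTION&wt=json" \
      | grep -o '"state":"[a-z]*"' | sort | uniq -c
  done

  # 3. Only then, on each server in turn:  bin/solr stop -p 8983
  #    (with SOLR_STOP_WAIT in solr.in.sh raised above its 180 s default,
  #    so there's no chance of a force kill).

If anything other than "active" shows up in step 2, I'll hold off and figure out why before stopping anything.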
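And on the autoscaling side, this is roughly the check I'll re-run before the next test to make sure nothing is left enabled. The .auto_add_replicas trigger name is the one that showed up in our logs; the suspend-trigger call is just my reading of the autoscaling API docs, so treat it as belt-and-braces rather than gospel.

  SOLR=http://localhost:8983/solr

  # autoAddReplicas should come back as "false" on both collections.
  curl -s "$SOLR/admin/collections?action=CLUSTERSTATUS&collection=igls&wt=json" \
    | grep -o '"autoAddReplicas":"[^"]*"'

  # Dump the current autoscaling config to see which triggers/listeners exist.
  curl -s "$SOLR/admin/autoscaling?wt=json"

  # Explicitly suspend the built-in node-lost trigger as well.
  curl -s -X POST -H 'Content-Type: application/json' \
    -d '{"suspend-trigger": {"name": ".auto_add_replicas"}}' \
    "$SOLR/admin/autoscaling"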
Anyways, I'll report back with some findings after the system comes back online and I can kill it again. :D Thanks for all your help!

> In summary, this seems on the surface to be an extremely undersized
> system, and unless and until you properly size it you'll have problems.
>
> Best,
> Erick
>
> On Wed, Dec 5, 2018 at 10:12 AM lstusr 5u93n4 <lstusr...@gmail.com> wrote:
> >
> > Hi Kevin,
> >
> > We do have logs. Grepping for peersync, I can see:
> >
> > solr | 2018-12-05 03:31:41.301 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[http://solr.node3.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n8/] nUpdates=100
> > solr | 2018-12-05 03:31:41.307 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no versions. sync failed.
> > solr | 2018-12-05 03:31:44.007 INFO (coreZkRegister-1-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:iglshistory_shard2_replica_n4 c:iglshistory s:shard2 r:core_node7) [c:iglshistory s:shard2 r:core_node7 x:iglshistory_shard2_replica_n4] o.a.s.u.PeerSync PeerSync: core=iglshistory_shard2_replica_n4 url=http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[http://solr.node4.metaonly01.eca.local:8983/solr/iglshistory_shard2_replica_n6/] nUpdates=100
> > solr | 2018-12-05 03:31:44.008 INFO (coreZkRegister-1-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:iglshistory_shard2_replica_n4 c:iglshistory s:shard2 r:core_node7) [c:iglshistory s:shard2 r:core_node7 x:iglshistory_shard2_replica_n4] o.a.s.u.PeerSync PeerSync: core=iglshistory_shard2_replica_n4 url=http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no versions. sync failed.
> > solr | 2018-12-05 05:48:44.202 INFO (coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1 r:core_node3 x:igls_shard1_replica_n1] o.a.s.u.PeerSync PeerSync: core=igls_shard1_replica_n1 url=http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[http://solr.node4.metaonly01.eca.local:8983/solr/igls_shard1_replica_n2/] nUpdates=100
> > solr | 2018-12-05 05:48:44.646 WARN (coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1 r:core_node3 x:igls_shard1_replica_n1] o.a.s.u.PeerSync PeerSync: core=igls_shard1_replica_n1 url=http://solr.node2.metaonly01.eca.local:8983/solr got a 503 from http://solr.node4.metaonly01.eca.local:8983/solr/igls_shard1_replica_n2/, counting as success
> > solr | 2018-12-05 05:48:44.682 INFO (coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1 r:core_node3 x:igls_shard1_replica_n1] o.a.s.u.PeerSync PeerSync: core=igls_shard1_replica_n1 url=http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync succeeded
> > solr | 2018-12-05 05:51:18.369 INFO (qtp2131952342-10232) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync PeerSync: core=igls_shard2_replica_n6 url=http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/] nUpdates=100
> > solr | 2018-12-05 06:01:18.496 WARN (qtp2131952342-10232) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync Replica did not return a fingerprint - possibly an older Solr version or exception
> > solr | 2018-12-05 06:01:31.126 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy Attempting to PeerSync from [http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/] - recoveringAfterStartup=[true]
> > solr | 2018-12-05 06:01:31.129 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader PeerSync: core=igls_shard2_replica_n6 url=http://solr.node2.metaonly01.eca.local:8983/solr START leader=http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/ nUpdates=100
> > solr | 2018-12-05 06:11:18.535 WARN (qtp2131952342-10232) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync PeerSync: core=igls_shard2_replica_n6 url=http://solr.node2.metaonly01.eca.local:8983/solr exception talking to http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/, failed
> > solr | 2018-12-05 06:11:18.583 INFO (qtp2131952342-10232) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync PeerSync: core=igls_shard2_replica_n6 url=http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync failed
> > solr |         at org.apache.solr.update.PeerSyncWithLeader.request(PeerSyncWithLeader.java:321)
> > solr |         at org.apache.solr.update.PeerSyncWithLeader.alreadyInSync(PeerSyncWithLeader.java:341)
> > solr |         at org.apache.solr.update.PeerSyncWithLeader.sync(PeerSyncWithLeader.java:135)
> > solr | 2018-12-05 09:21:31.831 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy Attempting to PeerSync from [http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/] - recoveringAfterStartup=[true]
> > solr | 2018-12-05 09:21:31.831 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader PeerSync: core=igls_shard2_replica_n6 url=http://solr.node2.metaonly01.eca.local:8983/solr START leader=http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/ nUpdates=100
> > solr | 2018-12-05 12:28:26.698 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Fingerprint comparison result: 1
> > solr | 2018-12-05 12:28:26.698 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Leader fingerprint: {maxVersionSpecified=9223372036854775807, maxVersionEncountered=1619003108526915584, maxInHash=1619003108526915584, versionsHash=-2647181038600325382, numVersions=701868706, numDocs=701868706, maxDoc=13478974}, Our fingerprint: {maxVersionSpecified=9223372036854775807, maxVersionEncountered=1618994939968880640, maxInHash=1618994939968880640, versionsHash=-2861577316323569427, numVersions=701868706, numDocs=701868706, maxDoc=13478974}
> > solr | 2018-12-05 12:30:38.183 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Leader fingerprint {maxVersionSpecified=9223372036854775807, maxVersionEncountered=1619014810634027008, maxInHash=1619014810634027008, versionsHash=-5367247708654169186, numVersions=701868706, numDocs=701868706, maxDoc=13478974}
> > solr | 2018-12-05 12:30:38.183 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader PeerSync: core=igls_shard2_replica_n6 url=http://solr.node2.metaonly01.eca.local:8983/solr Received 13 versions from http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/
> > solr | 2018-12-05 12:30:38.184 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync PeerSync: core=igls_shard2_replica_n6 url=http://solr.node2.metaonly01.eca.local:8983/solr Our versions are too old. ourHighThreshold=1618976334275411968 otherLowThreshold=1619012608476577792 ourHighest=1619014810634027008 otherHighest=1619014936356192256
> > solr | 2018-12-05 12:30:38.185 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader PeerSync: core=igls_shard2_replica_n6 url=http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync failed
> > solr | 2018-12-05 12:30:38.185 INFO (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy PeerSync Recovery was not successful - trying replication.
> >
> > Expanding one of them, we have:
> >
> > solr | 2018-12-05 03:31:39.421 INFO (AutoscalingActionExecutor-8-thread-1) [ ] o.a.s.c.a.ExecutePlanAction No operations to execute for event: {
> > solr |   "id":"11402ae38efc8T1dr1vcl6ut11qh9fq6ncle94t",
> > solr |   "source":".auto_add_replicas",
> > solr |   "eventTime":303476722167752,
> > solr |   "eventType":"NODELOST",
> > solr |   "properties":{
> > solr |     "eventTimes":[303476722167752],
> > solr |     "preferredOperation":"movereplica",
> > solr |     "_enqueue_time_":303959110445696,
> > solr |     "nodeNames":["solr.node4.metaonly01.eca.local:8983_solr"]}}
> > solr | 2018-12-05 03:31:41.290 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.c.ShardLeaderElectionContext Enough replicas found to continue.
> > solr | 2018-12-05 03:31:41.291 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.c.ShardLeaderElectionContext I may be the new leader - try and sync
> > solr | 2018-12-05 03:31:41.291 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.c.SyncStrategy Sync replicas to http://solr.node2.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n10/
> > solr | 2018-12-05 03:31:41.301 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[http://solr.node3.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n8/] nUpdates=100
> > solr | 2018-12-05 03:31:41.307 INFO (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr) [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10] o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no versions. sync failed.
> >
> > Not quite sure why it would be executing the auto-add-replicas trigger, since it's definitely set to false on the collection. (I checked.)
> > Also, this log comes from very early in the startup procedure, probably before Solr has been launched on the other servers.
> >
> > Kyle
> >
> > On Wed, 5 Dec 2018 at 12:58, Kevin Risden <kris...@apache.org> wrote:
> >
> > > Do you have logs right before the following?
> > >
> > > "we notice that the nodes go into "Recovering" state for about 10-12 hours
> > > before finally coming alive."
> > >
> > > Is there a peersync failure or something else in the logs indicating why
> > > there is a full recovery?
> > >
> > > Kevin Risden
> > >
> > > On Wed, Dec 5, 2018 at 12:53 PM lstusr 5u93n4 <lstusr...@gmail.com> wrote:
> > >
> > > > Hi All,
> > > >
> > > > We have a collection:
> > > > - solr 7.5
> > > > - 3 shards, replication factor 2 for a total of 6 NRT replicas
> > > > - 3 servers, 16GB ram each
> > > > - 2 billion documents
> > > > - autoAddReplicas: false
> > > > - 2.1 TB on-disk index size
> > > > - index stored on hdfs on separate servers
> > > >
> > > > If we (gracefully) shut down solr on all 3 servers, when we re-launch solr
> > > > we notice that the nodes go into "Recovering" state for about 10-12 hours
> > > > before finally coming alive.
> > > >
> > > > During this recovery time, we notice high network traffic outbound from our
> > > > HDFS servers to our solr servers. The sum total of which is roughly
> > > > equivalent to the index size on disk.
> > > >
> > > > So it seems to us that on startup, solr has to re-read the entire index
> > > > before coming back alive.
> > > >
> > > > 1. is this assumption correct?
> > > > 2. is there any way to mitigate this, so that solr can launch faster?
> > > >
> > > > Thanks!
> > > >
> > > > Kyle