Hi Erick,
I don't think there is a bug with searcher reopening - this is a scenario with a new slave:
"But when I add a *new* slave pointing to the master…"
So it is expected to have zero results until replication finishes.

Regards,
Emir

> On 23 Sep 2017, at 19:21, Erick Erickson <erickerick...@gmail.com> wrote:
>
> First I'd like to say that I wish more people would take the time like you have to fully describe the problem and your observations; it makes it soooo much nicer than having half-a-dozen back and forths! Thanks!
>
> Just so it doesn't get buried in the rest of the response, I do tend to go on.... I suspect you have a suggester configured. The index-based suggesters read through your _entire_ index, all the stored fields from all the documents, and process them into an FST or "sidecar" index. See: https://lucidworks.com/2015/03/04/solr-suggester/. If this is true, they might be being built on the slaves whenever a replication happens. Hmmm, if this is true, let us know. You can tell by removing the suggester from the config and timing again. It seems like in the master/slave config we should copy these down, but I don't know if it's been tested.
>
> If they are being built on the slaves, you might try commenting out all of the buildOn.... bits in the slave configurations. Frankly, I don't know if building the suggester structures on the master would propagate them to the slave correctly if the slave doesn't build them, but it would certainly be a fat clue if it changed the load time on the slaves, and we could look some more at options.
>
> Observation 1: Allocating 40G of memory for an index that's only 12G seems like overkill. This isn't the root of your problem, but a 12G index shouldn't need anywhere near 40G of JVM heap. In fact, due to MMapDirectory being used (see Uwe Schindler's blog here: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html) I'd guess you can get away with MUCH less memory, maybe as low as 8G or so. The wildcard here would be the size of your caches, especially your filterCache configured in solrconfig.xml. Like I mentioned, this isn't the root of your replication issue, just sayin'.
>
> Observation 2: A hard commit (the <autoCommit> setting) is not a very expensive operation with openSearcher=false. Again, this isn't the root of your problem, but consider removing the number-of-docs limitation and just making it time-based, say every minute. Long blog on the topic here: https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/. You might be accumulating pretty large transaction logs (assuming you haven't disabled them) to no good purpose. Given your observation that the actual transmission of the index takes 2 minutes, this is probably not something to worry about much, but it is worth checking.
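(For illustration, a purely time-based hard commit along the lines Erick suggests might look like the sketch below in solrconfig.xml. The one-minute interval is just an example value, not something specified in this thread.)

    <!-- Hedged sketch: hard commit on a timer only, never opening a searcher.
         60000 ms (one minute) is an illustrative value taken from the
         "say every minute" suggestion above, not a tested recommendation. -->
    <autoCommit>
      <maxTime>60000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>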
> Question 1:
>
> Solr should be doing nothing other than opening a new searcher, which should be roughly the "autowarm" time on the master plus (perhaps) the suggester build. Your observation that autowarming takes quite a bit of time (evidenced by much shorter times when you set the counts to zero) is a smoking gun that you're probably doing far too much autowarming. HOWEVER, during this interval the replica should be serving queries from the old searcher, so something else is going on here. Autowarming is actually pretty simple; perhaps this will help you keep it in mind while tuning:
>
> The queryResultCache and filterCache are essentially maps where the key is just the text of the clause (simplifying here). So for the queryResultCache the key is the entire search request. For the filterCache, the key is just the "fq" clause. The autowarm count in each just means the number of keys that are replayed when a new searcher is opened. I usually start with a pretty small number, on the order of 10-20. Their purpose is just to keep from experiencing a delay when the first few searches are performed after a searcher is opened.
>
> My bet: you won't notice a measurable difference in query response when dropping the autowarm counts drastically, but you will save the startup time. I also suspect you can reduce the size of the caches drastically, but I don't know what you have them set to; it's a guess.
>
> As to what's happening such that you serve queries with zero counts, my best guess at this point is that you are rebuilding autosuggesters..... We shouldn't be serving queries from the new searcher during this interval; if confirmed, we need to raise a JIRA.
>
> Question 2: see above, autosuggester?
>
> Question 3a: documents should become searchable on the slave when 1> all the segments are copied, 2> autowarm is completed. As above, the fact that you get 0-hit responses isn't what _should_ be happening.
>
> Autocommit settings are pretty irrelevant on the slave.
>
> Question 3b: a soft commit on the master shouldn't affect the slave at all.
>
> The fact that you have 500 fields shouldn't matter that much in this scenario. Again, the fact that removing your autowarm settings makes such a difference indicates the counts are excessive, and I have a secondary assumption that you probably have your cache settings far higher than you need, but you'll have to test if you try to reduce them.... BTW, I often find the 512 default setting more than ample; monitor via admin UI>>core>>plugins/stats to see the hit ratio...
>
> As I told you, I do go on....
>
> Best,
> Erick
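(Again purely for illustration: cache definitions following Erick's suggestions above, i.e. the 512 default size and an autowarm count in the 10-20 range, might look roughly like this in solrconfig.xml. The cache classes and the value 16 are plausible defaults for Solr 6.x, not settings taken from this thread.)

    <!-- Hedged sketch: modest caches with small autowarm counts so that
         opening a new searcher replays only a handful of entries. -->
    <filterCache class="solr.FastLRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="16"/>

    <queryResultCache class="solr.LRUCache"
                      size="512"
                      initialSize="512"
                      autowarmCount="16"/>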
> On Sat, Sep 23, 2017 at 6:40 AM, yasoobhaider <yasoobhaid...@gmail.com> wrote:
>>
>> Hi
>>
>> We have set up a master-slave architecture for our Solr instance.
>>
>> Number of docs: 2 million
>> Collection size: ~12GB when optimized
>> Heap size: 40G
>> Machine specs: 60G, 8 cores
>>
>> We are using Solr 6.2.1.
>>
>> Autocommit configuration:
>>
>> <autoCommit>
>>   <maxDocs>40000</maxDocs>
>>   <maxTime>900000</maxTime>
>>   <openSearcher>false</openSearcher>
>> </autoCommit>
>>
>> <autoSoftCommit>
>>   <maxTime>${solr.autoSoftCommit.maxTime:3600000}</maxTime>
>> </autoSoftCommit>
>>
>> I have set maxDocs to 40k because we do a heavy weekly indexing, and I didn't want a lot of commits happening too fast.
>>
>> Indexing runs smoothly on the master. But when I add a new slave pointing to the master, it takes about 20 minutes for the slave to become queryable.
>>
>> There are two parts to this latency. First, it takes approximately 13 minutes for the generation of the slave to be the same as the master's. Then it takes another 7 minutes for the instance to become queryable (it returns 0 hits in these 7 minutes).
>>
>> I checked the logs and the collection is downloaded within two minutes. After that, there is nothing in the logs for the next few minutes, even with LoggingInfoStream set to 'ALL'.
>>
>> Question 1. What happens after all the files have been downloaded on the slave from the master? What is Solr doing internally that the generation sync-up with the master takes so long? Whatever it is doing, should it take that long? (~5 minutes)
>>
>> After the generation sync-up happens, it takes another 7 minutes to start giving results. I set the autowarm count in all caches to 0, which brought it down to 3 minutes.
>>
>> Question 2. What is happening here in the 3 minutes? Can this also be optimized?
>>
>> And I wanted to ask another unrelated question regarding when a slave becomes searchable. I understand that documents on the master become searchable if a hard commit happens with openSearcher set to true, or when a soft commit happens. But when do documents become searchable on a slave?
>>
>> Question 3a. When do documents become searchable on a slave? As soon as a segment is copied over from the master? Does a soft commit make any sense on a slave, as we are not indexing anything? Does autocommit with openSearcher true affect the slave in any way?
>>
>> Question 3b. Does a soft commit on the master affect the slave in any way? (I only have the commit and startup options in my replicateAfter field in solrconfig)
>>
>> Would appreciate any help.
>>
>> PS: One of my colleagues said that the latency may be because our schema.xml is huge (~500 fields). Question 4. Could that be a reason?
>>
>> Thanks
>> Yasoob Haider
>>
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
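(For readers following the thread: the master/slave setup and the replicateAfter options being discussed are configured through the replication request handler. A rough, hedged sketch of what such a configuration typically looks like in solrconfig.xml for Solr 6.x is below; the host name, core name and poll interval are made-up placeholders, not values from this thread.)

    <!-- On the master: publish the index after commits and on startup,
         matching the replicateAfter options mentioned in the question. -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="replicateAfter">startup</str>
      </lst>
    </requestHandler>

    <!-- On the slave: poll the master periodically. masterUrl and
         pollInterval are illustrative placeholders only. -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master-host:8983/solr/collection1/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>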