Emir:

OK, thanks for pointing that out, that's a relief!

Erick

On Mon, Sep 25, 2017 at 1:03 AM, Emir Arnautović
<emir.arnauto...@sematext.com> wrote:
> Hi Eric,
> I don’t think there are any bugs with searcher reopening - this is a 
> scenario with a new slave:
>
> “But when I add a *new* slave pointing to the master…”
>
> So it is expected to return zero results until replication finishes.
>
> Regards,
> Emir
>
>> On 23 Sep 2017, at 19:21, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>> First, I'd like to say that I wish more people would take the time, as
>> you have, to fully describe the problem and their observations; it makes
>> things soooo much nicer than half-a-dozen back-and-forths! Thanks!
>>
>> Just so it doesn't get buried in the rest of the response (I do tend
>> to go on....): I suspect you have a suggester configured. The
>> index-based suggesters read through your _entire_ index - all the
>> stored fields from all the documents - and process them into an FST or
>> "sidecar" index. See:
>> https://lucidworks.com/2015/03/04/solr-suggester/. If that's the case,
>> the suggesters might be getting rebuilt on the slaves whenever a
>> replication happens. You can tell by removing the suggester from the
>> config and timing again; if that changes things, let us know. It seems
>> like in a master/slave config we should copy these structures down, but
>> I don't know whether that's been tested.
>>
>> If they are being built on the slaves, you might try commenting out
>> all of the buildOn.... bits in the slave configurations. Frankly, I
>> don't know whether building the suggester structures on the master
>> would propagate them to the slaves correctly if the slaves don't build
>> them themselves, but if commenting them out changed the load time on
>> the slaves it would certainly be a fat clue, and we could look at more
>> options from there.
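>>
>> For reference, the bits I mean look something like this in
>> solrconfig.xml (the suggester name, lookup/dictionary implementations
>> and field below are just illustrative, not necessarily what you have):
>>
>> <searchComponent name="suggest" class="solr.SuggestComponent">
>>   <lst name="suggester">
>>     <str name="name">mySuggester</str>
>>     <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
>>     <str name="dictionaryImpl">DocumentDictionaryFactory</str>
>>     <str name="field">title</str>
>>     <str name="suggestAnalyzerFieldType">text_general</str>
>>     <!-- these are the "buildOn" bits to try commenting out on slaves -->
>>     <str name="buildOnCommit">false</str>
>>     <str name="buildOnStartup">false</str>
>>   </lst>
>> </searchComponent>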
>>
>> Observation 1: Allocating 40G of heap for an index that is only 12G
>> seems like overkill. This isn't the root of your problem, but a 12G
>> index shouldn't need anywhere near 40G of JVM. In fact, due to
>> MMapDirectory being
>> used (see Uwe Schindler's blog here:
>> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html)
>> I'd guess you can get away with MUCH less memory, maybe as low as 8G
>> or so. The wildcard here would be the size of your caches, especially
>> your filterCache configured in solrconfig.xml. Like I mentioned, this
>> isn't the root of your replication issue, just sayin'.
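>>
>> (If you do experiment with a smaller heap, that's the SOLR_HEAP
>> setting in solr.in.sh in recent Solr installs, e.g. SOLR_HEAP="8g" -
>> adjust the path and value for your own setup.)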
>>
>> Observation 2: A hard commit (the <autoCommit> setting) is not a very
>> expensive operation with openSearcher=false. Again, this isn't the root
>> of your problem but consider removing the number of docs limitation
>> and just making it time-based, say every minute. Long blog on the
>> topic here: 
>> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/.
>> You might be accumulating pretty large transaction logs (assuming you
>> haven't disabled them) to no good purpose. Given your observation that
>> the actual transmission of the index takes 2 minutes, this is probably
>> not something to worry about much, but is worth checking.
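>>
>> Something like this is the kind of thing I mean (the one-minute
>> interval is just an example value, not a recommendation):
>>
>> <autoCommit>
>>   <maxTime>60000</maxTime>
>>   <openSearcher>false</openSearcher>
>> </autoCommit>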
>>
>> Question 1:
>>
>> Solr should be doing nothing other than opening a new searcher, which
>> should take roughly the "autowarm" time on master plus (perhaps) the
>> suggester build. Your observation that autowarming takes quite a bit
>> of time (evidenced by much shorter times when you set the counts to
>> zero) is a smoking gun that you're probably doing far too much
>> autowarming. HOWEVER, during this interval the replica should be
>> serving queries from the old searcher so something else is going on
>> here. Autowarming is actually pretty simple; perhaps this summary will
>> help you keep it in mind while tuning:
>>
>> The queryResultCache and filterCache are essentially maps where the
>> key is just the text of the clause (simplifying here). So for the
>> queryResultCache the key is the entire search request. For the
>> filterCache, the key is just the "fq" clause. The autowarm count on
>> each cache is simply the number of keys that are replayed when a new
>> searcher is opened. I usually start with a pretty small number, on the
>> order of 10-20. Their purpose is just to keep the first few searches
>> after a new searcher is opened from hitting a delay.
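>>
>> Concretely, that's the autowarmCount attribute on the caches in
>> solrconfig.xml, something along these lines (the sizes shown are just
>> placeholders):
>>
>> <filterCache class="solr.FastLRUCache" size="512"
>>              initialSize="512" autowarmCount="16"/>
>> <queryResultCache class="solr.LRUCache" size="512"
>>                   initialSize="512" autowarmCount="16"/>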
>>
>> My bet: dropping the autowarm counts drastically won't make a
>> measurable difference in query response, but it will save you the
>> startup time. I also suspect you can reduce the size of the caches
>> drastically, but I don't know what you have them set to, so that's a
>> guess.
>>
>> As to why you're serving queries that return zero hits, my best guess
>> at this point is that you are rebuilding autosuggesters..... We
>> shouldn't be serving queries from the new searcher during that
>> interval; if that's confirmed, we need to raise a JIRA.
>>
>> Question 2: see above, autosuggester?
>>
>> Question 3a: documents should become searchable on the slave when 1>
>> all the segments are copied, 2> autowarm is completed. As above, the
>> fact that you get 0-hit responses isn't what _should_ be happening.
>>
>> Autocommit settings are pretty irrelevant on the slave.
>>
>> Question 3b: soft commit on the master shouldn't affect the slave at all.
>>
>> The fact that you have 500 fields shouldn't matter that much in this
>> scenario. Again, the fact that removing your autowarm settings makes
>> such a difference indicates the counts are excessive, and I have a
>> secondary suspicion that you probably have your cache sizes set far
>> higher than you need, but you'll have to test that if you reduce
>> them.... BTW, I often find the default size of 512 more than ample;
>> monitor via admin UI>>core>>plugins/stats to see the hit ratio...
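>>
>> (If you'd rather script that check than click through the UI, the same
>> numbers should be available over HTTP at something like
>> /solr/<core>/admin/mbeans?stats=true&cat=CACHE&wt=json.)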
>>
>> As I told you, I do go on....
>>
>> Best,
>> Erick
>>
>> On Sat, Sep 23, 2017 at 6:40 AM, yasoobhaider <yasoobhaid...@gmail.com> 
>> wrote:
>>> Hi
>>>
>>> We have setup a master-slave architecture for our Solr instance.
>>>
>>> Number of docs: 2 million
>>> Collection size: ~12GB when optimized
>>> Heap size: 40G
>>> Machine specs: 60G RAM, 8 cores
>>>
>>> We are using Solr 6.2.1.
>>>
>>> Autocommit Configuration:
>>>
>>> <autoCommit>
>>>      <maxDocs>40000</maxDocs>
>>>      <maxTime>900000</maxTime>
>>>      <openSearcher>false</openSearcher>
>>> </autoCommit>
>>>
>>> <autoSoftCommit>
>>>      <maxTime>${solr.autoSoftCommit.maxTime:3600000}</maxTime>
>>> </autoSoftCommit>
>>>
>>> I have set maxDocs to 40k because we do heavy weekly indexing, and I
>>> didn't want a lot of commits happening too quickly.
>>>
>>> Indexing runs smoothly on master. But when I add a new slave pointing to the
>>> master, it takes about 20 minutes for the slave to become queryable.
>>>
>>> There are two parts to this latency. First, it takes approximately 13
>>> minutes for the slave's generation to become the same as the master's. Then it takes
>>> another 7 minutes for the instance to become queryable (it returns 0 hits in
>>> these 7 minutes).
>>>
>>> I checked the logs and the collection is downloaded within two minutes.
>>> After that, there is nothing in the logs for the next few minutes, even with
>>> LoggingInfoStream set to 'ALL'.
>>>
>>> Question 1. What happens after all the files have been downloaded on the slave
>>> from the master? What is Solr doing internally such that the generation sync-up
>>> with the master takes so long? Whatever it is doing, should it take that long
>>> (~5 minutes)?
>>>
>>> After the generation sync up happens, it takes another 7 minutes to start
>>> giving results. I set the autowarm count in all caches to 0, which brought
>>> it down to 3 minutes.
>>>
>>> Question 2. What is happening here in the 3 minutes? Can this also be
>>> optimized?
>>>
>>> And I wanted to ask another unrelated question regarding when a slave becomes
>>> searchable. I understand that documents on master become searchable if a
>>> hard commit happens with openSearcher set to true, or when a soft commit
>>> happens. But when do documents become searchable on a slave?
>>>
>>> Question 3a. When do documents become searchable on a slave? As soon as a
>>> segment is copied over from master? Does softcommit make any sense on a
>>> slave, as we are not indexing anything? Does autocommit with openSearcher
>>> set to true affect the slave in any way?
>>>
>>> Question 3b. Does a soft commit on the master affect the slave in any way? (I
>>> only have the commit and startup options in my replicateAfter setting in
>>> solrconfig.)
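>>>
>>> (That is, the master-side replication handler looks roughly like this:
>>>
>>> <requestHandler name="/replication" class="solr.ReplicationHandler">
>>>   <lst name="master">
>>>     <str name="replicateAfter">commit</str>
>>>     <str name="replicateAfter">startup</str>
>>>   </lst>
>>> </requestHandler>)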
>>>
>>> Would appreciate any help.
>>>
>>> PS: One of my colleagues said that the latency may be because our schema.xml
>>> is huge (~500 fields). Question 4. Could that be a reason?
>>>
>>> Thanks
>>> Yasoob Haider
>>>
>>>
>>>
>>> --
>>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
