First I'd like to say that I wish more people would take the time like you have to fully describe the problem and your observations; it makes it soooo much nicer than having half-a-dozen back and forths! Thanks!
Just so it doesn't get buried in the rest of the response (I do tend to go on....): I suspect you have a suggester configured. The index-based suggesters read through your _entire_ index, all the stored fields from all the documents, and process them into an FST or "sidecar" index. See: https://lucidworks.com/2015/03/04/solr-suggester/. If so, they may be getting rebuilt on the slaves whenever a replication happens. Hmmm, if this is true, let us know. You can tell by removing the suggester from the config and timing again. It seems like in the master/slave config we should copy these down, but I don't know whether that's been tested. If they are being built on the slaves, you might try commenting out all of the buildOn.... bits in the slave configurations (see the config sketch below). Frankly I don't know if building the suggester structures on the master would propagate them to the slave correctly if the slave doesn't build them, but it would certainly be a fat clue if it changed the load time on the slaves, and we could look some more at options.

Observation 1: Allocating 40G of memory for an index that's only 12G seems like overkill. This isn't the root of your problem, but a 12G index shouldn't need anywhere near a 40G JVM. In fact, because MMapDirectory is used (see Uwe Schindler's blog here: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html), I'd guess you can get away with MUCH less memory, maybe as low as 8G or so. The wildcard here is the size of your caches, especially the filterCache configured in solrconfig.xml. Like I mentioned, this isn't the root of your replication issue, just sayin'.

Observation 2: Hard commits (the <autoCommit> setting) are not very expensive with openSearcher=false. Again, this isn't the root of your problem, but consider removing the number-of-docs limitation and just making it time-based, say every minute (sketch below). Long blog on the topic here: https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/. You might be accumulating pretty large transaction logs (assuming you haven't disabled them) to no good purpose. Given your observation that the actual transmission of the index takes 2 minutes, this is probably not something to worry about much, but it is worth checking.

Question 1: Solr should be doing nothing other than opening a new searcher, which should take roughly the autowarm time plus (perhaps) the suggester build. Your observation that autowarming takes quite a bit of time (evidenced by much shorter times when you set the counts to zero) is a smoking gun that you're probably doing far too much autowarming. HOWEVER, during this interval the replica should be serving queries from the old searcher, so something else is going on here.

Autowarming is actually pretty simple; perhaps this will help you keep it in mind while tuning. The queryResultCache and filterCache are essentially maps where the key is just the text of the clause (simplifying here). For the queryResultCache the key is the entire search request; for the filterCache, the key is just the "fq" clause. The autowarm count in each is just the number of keys that are replayed when a new searcher is opened. I usually start with a pretty small number, on the order of 10-20 (example below). Their only purpose is to keep the first few searches after a new searcher is opened from experiencing a delay. My bet: you won't notice a measurable difference in query response when you drop the autowarm counts drastically, but you will save the startup time.
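To make the suggester point concrete, here's a rough sketch of what an index-based suggester section in solrconfig.xml often looks like; the component name, field, and lookup/dictionary implementations are placeholders, not something I know about your setup. The buildOn.... lines are the ones to try commenting out on the slaves:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <!-- names, field, and impls below are placeholders; use whatever your config actually has -->
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
    <!-- these are the expensive bits on a slave: each one triggers a full rebuild -->
    <!-- <str name="buildOnCommit">true</str> -->
    <!-- <str name="buildOnStartup">true</str> -->
    <!-- <str name="buildOnOptimize">true</str> -->
  </lst>
</searchComponent>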
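For the time-based hard commit in Observation 2, a minimal sketch; the one-minute interval is just an example, not a number tuned to your load:

<autoCommit>
  <!-- hard commit every minute; openSearcher=false keeps it cheap -->
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>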
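And for the autowarm counts, something like this is where I'd start; the sizes shown are just the shipped defaults and 16 is only an illustrative count in that 10-20 range:

<!-- sizes and autowarmCount are illustrative starting points, not recommendations -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="16"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="16"/>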
I also suspect you can reduce the size of the caches drastically, but I don't know what you have them set to, so it's a guess. As to what's happening such that you serve queries with zero counts, my best guess at this point is that you are rebuilding autosuggesters..... We shouldn't be serving queries from the new searcher during this interval; if that's confirmed we need to raise a JIRA.

Question 2: see above, autosuggester?

Question 3a: documents should become searchable on the slave when 1> all the segments are copied, 2> autowarm is completed. As above, the fact that you get 0-hit responses isn't what _should_ be happening. Autocommit settings are pretty irrelevant on the slave.

Question 3b: a soft commit on the master shouldn't affect the slave at all.

The fact that you have 500 fields shouldn't matter that much in this scenario.

Again, the fact that removing your autowarm settings makes such a difference indicates the counts are excessive, and I have a secondary assumption that you probably have your cache settings far higher than you need, but you'll have to test that if you try to reduce them.... BTW, I often find the 512 default setting more than ample; monitor via admin UI>>core>>plugins/stats to see the hit ratio...

As I told you, I do go on....

Best,
Erick

On Sat, Sep 23, 2017 at 6:40 AM, yasoobhaider <yasoobhaid...@gmail.com> wrote:
> Hi
>
> We have setup a master-slave architecture for our Solr instance.
>
> Number of docs: 2 million
> Collection size: ~12GB when optimized
> Heap size: 40G
> Machine specs: 60G, 8 cores
>
> We are using Solr 6.2.1.
>
> Autocommit Configuration:
>
> <autoCommit>
>   <maxDocs>40000</maxDocs>
>   <maxTime>900000</maxTime>
>   <openSearcher>false</openSearcher>
> </autoCommit>
>
> <autoSoftCommit>
>   <maxTime>${solr.autoSoftCommit.maxTime:3600000}</maxTime>
> </autoSoftCommit>
>
> I have setup the maxDocs at 40k because we do a heavy weekly indexing, and I didn't want a lot of commits happening too fast.
>
> Indexing runs smoothly on master. But when I add a new slave pointing to the master, it takes about 20 minutes for the slave to become queryable.
>
> There are two parts to this latency. First, it takes approximately 13 minutes for the generation of the slave to be same as master. Then it takes another 7 minutes for the instance to become queryable (it returns 0 hits in these 7 minutes).
>
> I checked the logs and the collection is downloaded within two minutes. After that, there is nothing in the logs for next few minutes, even with LoggingInfoSteam set to 'ALL'.
>
> Question 1. What happens after all the files have been downloaded on slave from master? What is Solr doing internally that the generation sync up with master takes so long? Whatever it is doing, should it take that long? (~5 minutes).
>
> After the generation sync up happens, it takes another 7 minutes to start giving results. I set the autowarm count in all caches to 0, which brought it down to 3 minutes.
>
> Question 2. What is happening here in the 3 minutes? Can this also be optimized?
>
> And I wanted to ask another unrelated question regarding when a slave become searchable. I understand that documents on master become searchable if a hard commit happens with openSearcher set to true, or when a soft commit happens. But when do documents become searchable on a slave?
>
> Question 3a. When do documents become searchable on a slave? As soon as a segment is copied over from master? Does softcommit make any sense on a slave, as we are not indexing anything?
> Does autocommit with opensearcher true affect slave in any way?
>
> Question 3b. Does a softcommit on master affect slave in any way? (I only have commit and startup options in my replicateAfter field in solrconfig)
>
> Would appreciate any help.
>
> PS: One of my colleague said that the latency may be because our schema.xml is huge (~500 fields). Question 4. Could that be a reason?
>
> Thanks
> Yasoob Haider
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html