Check the leader and follower logs for anything like "leader initiated recovery" (LIR). One thing I have seen push followers into recovery is when, for some reason, the time it takes the follower to respond to an update exceeds the timeout. The scenario is this:
> leader sends an update
> follower fails to respond for _any_ reason within the timeout
> leader says "sick follower, make it recover"

In the particular case I'm thinking of, indexing the packet took minutes. I strongly doubt that your documents are pathological enough to hit this, but there's at least a chance that the updates are queueing up on the follower and the updates are timing out.
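For illustration, here is a minimal sketch of scanning a node's solr.log for that kind of activity. The log path and the exact phrases are assumptions (the wording of LIR messages varies by Solr version), so treat the patterns as placeholders:

    # scan_lir.py -- look for leader-initiated recovery activity in a Solr log.
    # The default path and the phrases below are assumptions; adjust for your install.
    import re
    import sys

    LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "server/logs/solr.log"

    # Phrases that tend to show up when a leader puts a replica into recovery.
    PATTERNS = re.compile(
        r"leader.initiated recovery"
        r"|LeaderInitiatedRecovery"
        r"|requestrecovery",
        re.IGNORECASE,
    )

    with open(LOG_PATH, errors="replace") as log:
        for line in log:
            if PATTERNS.search(line):
                print(line.rstrip())

Run it against both the leader's and the followers' logs around the time a node drops out; matching lines (and their timestamps) tell you whether the leader is the one forcing the recovery.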
Best,
Erick

On Sun, Nov 5, 2017 at 7:14 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 11/3/2017 10:15 PM, Rick Dig wrote:
>>
>> we are trying to run solrcloud 6.6 in a production setting.
>> here's our config and issue
>> 1) 3 nodes, 1 shard, replication factor 3
>> 2) all nodes are 16GB RAM, 4 core
>> 3) Our production load is about 2000 requests per minute
>> 4) index is fairly small, index size is around 400 MB with 300k documents
>> 5) autocommit is currently set to 5 minutes (even though ideally we would like a smaller interval).
>> 6) the jvm runs with 8 gb Xms and Xmx with CMS gc.
>> 7) all of this runs perfectly ok when indexing isn't happening. as soon as we start "nrt" indexing one of the follower nodes goes down within 10 to 20 minutes. from this point on the nodes never recover unless we stop indexing. the master usually is the last one to fall.
>> 8) there are maybe 5 to 7 processes indexing at the same time with document batch sizes of 500.
>> 9) maxRambuffersizeMB is 100, autowarmingsearchers is 5,
>> 10) no cpu and / or oom issues that we can see.
>> 11) cpu load does go fairly high 15 to 20 at times.
>
> My two cents to add to what you've already seen:
>
> With 300K documents and 400MB of index size, an 8GB heap seems very excessive, even with complex queries. What evidence do you have that you need a heap that size? Are you just following a best practice recommendation you saw somewhere to give half your memory to Java?
>
> This is a *tiny* index by both document count and size. Each document cannot be very big.
>
> Your GC log doesn't show any issues that concern me. There are a few slow GCs, but when you index, that's probably to be expected, especially with an 8GB heap.
>
> What exactly do you mean by "one of the follower nodes goes down"? When this happens, are there error messages at the time of the event? What symptoms are there pertaining to that specific node?
>
> A query load of 2000 per minute is about 33 per second. Are these queries steady for the full minute, or is it bursty? 33 qps is high, but not insane, and with such a tiny index, is probably well within Solr's capabilities.
>
> There should be no reason to *ever* increase maxWarmingSearchers. If you see the warning about this, the fix is to reduce your commit frequency, not increase the value. Increasing the value can lead to memory and performance problems. The fact that this value is even being discussed, and that the value has been changed on your setup, has me thinking that there may be more commits happening than the every-five-minute autocommit.
>
> For automatic commits, I have some recommendations for everyone to start with, and then adjust if necessary: autoCommit: maxTime of 60000, openSearcher false. autoSoftCommit: maxTime of 120000. Neither one should have maxDocs configured.
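As a minimal sketch of what that looks like from the indexing side, the batch below is sent with no commit from the client at all, leaving visibility to the server-side autoSoftCommit interval. The collection name, host, and field names are made up for the example:

    # index_batch.py -- add one batch of documents over HTTP without an explicit
    # commit, leaving durability and visibility to autoCommit / autoSoftCommit.
    # The URL, collection name, and fields are illustrative only.
    import json
    import requests

    SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update"

    # A batch of 500 small documents; real documents would come from your source.
    docs = [{"id": str(i), "title_s": "example doc %d" % i} for i in range(500)]

    resp = requests.post(
        SOLR_UPDATE_URL,
        params={"wt": "json"},           # ask for a JSON response
        data=json.dumps(docs),           # a bare JSON array of documents is an "add"
        headers={"Content-Type": "application/json"},
        timeout=60,
    )
    resp.raise_for_status()
    qtime = resp.json().get("responseHeader", {}).get("QTime")
    print("added %d docs, QTime=%s ms" % (len(docs), qtime))

The only point of the sketch is that the client never sends commit=true; if several indexing processes each commit on every batch, searcher churn like the warning above is the expected result.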
> It should take far less than 20 seconds to index a 500 document batch, especially when they are small enough for 300K of them to produce a 400MB index. There are only a few problems I can imagine right now that could cause such slow indexing, having no real information to go on:
> 1) The analysis chains in your schema are exceptionally heavy and take a long time to run.
> 2) There is a performance issue happening that we have not yet figured out.
> 3) Your indexing request includes a commit, and the commit is happening very slowly.
>
> Here is a log entry on one of my indexes showing 1000 documents being added in 777 milliseconds. The index that this is happening on is about 40GB in size, with about 30 million documents. I have redacted part of the uniqueKey values in this log, to hide the sources of our data:
>
> 2017-11-04 09:30:14.325 INFO (qtp1394336709-42397) [ x:spark6live] o.a.s.u.p.LogUpdateProcessorFactory [spark6live] webapp=/solr path=/update params={wt=javabin&version=2}{add=[REDACTEDsix557224 (1583127266377859072), REDACTEDsix557228 (1583127266381004800), REDACTEDtwo979483 (1583127266381004801), REDACTEDtwo979488 (1583127266382053376), REDACTEDtwo979490 (1583127266383101952), REDACTEDsix557260 (1583127266383101953), REDACTEDsix557242 (1583127266384150528), REDACTEDsix557258 (1583127266385199104), REDACTEDsix557247 (1583127266385199105), REDACTEDsix557276 (1583127266394636288), ... (1000 adds)]} 0 777
>
> The rate I'm getting here of 1000 docs in 777 milliseconds is a rate that I consider to be pretty slow, especially because my indexing is single-threaded. But it works for us. What you're seeing, where 500 documents take 20 seconds, is slower than I've EVER seen, except in situations where there's a serious problem. On a system in good health, with multiple threads indexing, Solr should be able to index several thousand documents every second.
>
> Is the indexing program running on the same machine as Solr, or on another machine? For best results, it should be on a different machine, accessing Solr via HTTP. This is so that whatever load the indexing program creates does not take CPU, memory, and I/O resources away from Solr.
>
> What OS is Solr running on? If more information is needed, it will be a good idea to know precisely how to gather that information.
>
> Overall, based on the information currently available, you should not be having the problems you are. So there must be something about your setup that's not configured correctly, beyond the information we've already got. It could be directly Solr-related, or something else indirectly causing problems. I do not yet know exactly what information we might need to help.
>
> Can you share an entire solr.log file that covers enough time so that there is both indexing and querying happening? If it also covers that node going down, that would be even better. You'll probably need to use a file-sharing website to share the log -- I'm surprised your GC log made it to the list.
>
> Thanks,
> Shawn