OK, thanks for wrapping this up!

On Mon, Aug 31, 2015 at 10:08 AM, Rallavagu <rallav...@gmail.com> wrote:
> Erick,
>
> Apologies for not following up on the status of the indexing
> (replication) issues, as I originally started this thread. After
> implementing CloudSolrServer instead of ConcurrentUpdateSolrServer,
> things were much better. I simply wanted to follow up to understand
> the memory behavior better, though we tuned both heap and physical
> memory a while ago.
>
> Thanks
>
> On 8/24/15 9:09 AM, Erick Erickson wrote:
>>
>> bq: As a follow up, the default is set to "NRTCachingDirectoryFactory"
>> for DirectoryFactory but not MMapDirectory. It is mentioned that
>> NRTCachingDirectoryFactory "caches small files in memory for better
>> NRT performance".
>>
>> NRTCachingDirectoryFactory also uses MMapDirectory under the covers,
>> in addition to "caching small files in memory....", so you really
>> can't separate the two.
>>
>> I didn't mention this explicitly, but your original problem should
>> _not_ be happening in a well-tuned system. Why your nodes go into a
>> down state needs to be understood. The connection timeout is the only
>> clue so far, and the usual reason here is that very long GC pauses
>> are happening. If this keeps happening, you might try turning on GC
>> reporting options.
>>
>> Best,
>> Erick
>>
>> On Mon, Aug 24, 2015 at 2:47 AM, Rallavagu <rallav...@gmail.com> wrote:
>>>
>>> As a follow up, the default is set to "NRTCachingDirectoryFactory"
>>> for DirectoryFactory but not MMapDirectory. It is mentioned that
>>> NRTCachingDirectoryFactory "caches small files in memory for better
>>> NRT performance".
>>>
>>> Wondering whether this would also consume physical memory to the
>>> same extent as the MMap directory. Thoughts?
>>>
>>> On 8/18/15 9:29 AM, Erick Erickson wrote:
>>>>
>>>> Couple of things:
>>>>
>>>> 1> Here's an excellent backgrounder for MMapDirectory, which is
>>>> what makes it appear that Solr is consuming all the physical
>>>> memory:
>>>>
>>>> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>>
>>>> 2> It's possible that your transaction log was huge. Perhaps not
>>>> likely, but possible. If Solr terminates abnormally (kill -9 is a
>>>> prime way to do this), then upon restart the transaction log is
>>>> replayed. This log is rolled over upon every hard commit
>>>> (openSearcher true or false doesn't matter). So, in the scenario
>>>> where you are indexing a whole lot of stuff without committing, it
>>>> can take a very long time to replay the log. Not only that, but as
>>>> you replay the log, any incoming updates are written to the end of
>>>> the tlog. That said, nothing in your e-mails indicates this could
>>>> be a problem, and it's frankly not consistent with the errors you
>>>> _do_ report, but I thought I'd mention it. See:
>>>>
>>>> https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>>>>
>>>> You can avoid the possibility of this by configuring your
>>>> autoCommit interval to be relatively short (say 60 seconds) with
>>>> openSearcher=false.
>>>>
>>>> 3> ConcurrentUpdateSolrServer isn't the best thing for bulk loading
>>>> SolrCloud; CloudSolrServer (renamed CloudSolrClient in 5.x) is
>>>> better. CUSS sends all the docs to some node, and from there that
>>>> node figures out which shard each doc belongs on and forwards the
>>>> doc (actually in batches) to the appropriate leader. So doing what
>>>> you're doing creates a lot of cross chatter amongst nodes.
>>>> CloudSolrServer/Client figures that out on the client side and only
>>>> sends each leader packets consisting of the docs that belong on
>>>> that shard. You can get nearly linear throughput with increasing
>>>> numbers of shards this way.
>>>>
>>>> Best,
>>>> Erick
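
To make 2> above concrete, the autoCommit settings would look something
like this in solrconfig.xml (60 seconds, expressed in milliseconds):

    <autoCommit>
      <maxTime>60000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>

And a rough sketch of the CloudSolrClient approach from 3>, assuming
SolrJ 5.x; the ZooKeeper addresses, collection name, field names, and
batch size are placeholders for illustration, not recommendations:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkLoader {
      public static void main(String[] args) throws Exception {
        // Point the client at the ZK ensemble, not at any one Solr node.
        try (CloudSolrClient client =
                 new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
          client.setDefaultCollection("collection1");

          List<SolrInputDocument> batch = new ArrayList<>();
          for (int i = 0; i < 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title_txt", "document " + i);
            batch.add(doc);

            // Send reasonably large batches; the client splits each
            // batch by shard and sends each piece straight to that
            // shard's leader.
            if (batch.size() >= 1000) {
              client.add(batch);
              batch.clear();
            }
          }
          if (!batch.isEmpty()) {
            client.add(batch);
          }

          // Let autoCommit handle durability during the run; a single
          // explicit commit at the end makes everything searchable.
          client.commit();
        }
      }
    }

The client-side routing is the whole point here: CloudSolrClient reads
the cluster state from ZooKeeper, so no Solr node has to forward docs
to the other leaders.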
>>>> On Tue, Aug 18, 2015 at 9:03 AM, Rallavagu <rallav...@gmail.com> wrote:
>>>>>
>>>>> Thanks Shawn.
>>>>>
>>>>> All participating cloud nodes are running Tomcat and, as you
>>>>> suggested, I will review the number of threads and increase them
>>>>> as needed.
>>>>>
>>>>> Essentially, what I noticed was that two of the four nodes caught
>>>>> up with "bulk" updates instantly, while the other two nodes took
>>>>> almost 3 hours to get completely in sync with the "leader". I
>>>>> "tickled" the other nodes by sending an update, thinking that it
>>>>> would initiate the replication, but I am not sure whether that is
>>>>> what caused the other two nodes to eventually catch up.
>>>>>
>>>>> On a similar note, I was using "ConcurrentUpdateSolrServer"
>>>>> pointing directly at the leader to bulk load the Solr cloud. I
>>>>> have configured the chunk size and thread count for it. Is this
>>>>> the right practice to bulk load SolrCloud?
>>>>>
>>>>> Also, the maximum number of connections per host parameter for
>>>>> "HttpShardHandler" is in solrconfig.xml, I suppose?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On 8/18/15 8:28 AM, Shawn Heisey wrote:
>>>>>>
>>>>>> On 8/18/2015 8:18 AM, Rallavagu wrote:
>>>>>>>
>>>>>>> Thanks for the response. Does this cache behavior influence the
>>>>>>> delay in catching up with the cloud? How can we explain
>>>>>>> SolrCloud replication, and what are the options to monitor and
>>>>>>> take proactive action (such as initializing, pausing, etc.) if
>>>>>>> needed?
>>>>>>
>>>>>> I don't know enough about your setup to speculate.
>>>>>>
>>>>>> I did notice this exception in a previous reply:
>>>>>>
>>>>>> org.apache.http.conn.ConnectionPoolTimeoutException: Timeout
>>>>>> waiting for connection from pool
>>>>>>
>>>>>> I can think of two things that would cause this.
>>>>>>
>>>>>> One cause is that your servlet container is limiting the number
>>>>>> of available threads. A typical jetty or tomcat default for
>>>>>> maxThreads is 200, which can easily be exceeded by a small Solr
>>>>>> install, especially if it's SolrCloud. The jetty included with
>>>>>> Solr sets maxThreads to 10000, which is effectively unlimited
>>>>>> except for extremely large installs. If you are providing your
>>>>>> own container, this will almost certainly need to be raised.
>>>>>>
>>>>>> The other cause is that your install is extremely busy and you
>>>>>> have run out of available HttpClient connections. The solution in
>>>>>> this case is to increase the maximum number of connections per
>>>>>> host in the HttpShardHandler config, which defaults to 20.
>>>>>>
>>>>>> https://wiki.apache.org/solr/SolrConfigXml#Configuration_of_Shard_Handlers_for_Distributed_searches
>>>>>>
>>>>>> There might be other causes for that exception, but I think those
>>>>>> are the most common. Depending on how things are set up, you
>>>>>> might have problems with both.
>>>>>>
>>>>>> Thanks,
>>>>>> Shawn
>>>>>
>>>
>
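
To illustrate Shawn's second point: the per-host connection limit is
part of the shard handler configuration inside a request handler in
solrconfig.xml, per the wiki page he linked. A sketch along those
lines; the value of 100 is illustrative, not a recommendation:

    <requestHandler name="/select" class="solr.SearchHandler">
      <shardHandlerFactory class="HttpShardHandlerFactory">
        <!-- defaults to 20; raise it if the pool is being exhausted -->
        <int name="maxConnectionsPerHost">100</int>
      </shardHandlerFactory>
    </requestHandler>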