Re: Problem XY - X = SolrCloud 4.8 replicas down, Y = SolrCloud upgrade to a new version

Erick Erickson Thu, 02 Jul 2015 10:07:05 -0700

bq: and we do a full update of all documents during the night.

How fast are you sending documents? Prior to Solr 5.2 the replicas
would do twice the amount of indexing work that the leader
did (odd, but...). See:


http://lucidworks.com/blog/indexing-performance-solr-5-2-now-twice-fast/

Still, focusing on the GC pauses is probably the most fruitful. You just
shouldn't be getting pauses that long with 16G heaps. How long does it
take you to re-index? I've seen situations where indexing at an
_extremely_ high rate will force replicas into recovery. It took 150 threads
all firing queries as fast as possible to trigger that, but I thought I'd mention it.
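
To line the long pauses up against the ZooKeeper timeout, something like the
following rough sketch could help. It assumes the JVM runs with
-XX:+PrintGCApplicationStoppedTime, so that the GC log contains "Total time for
which application threads were stopped" lines; if your log does not have them,
the pattern needs adapting. The function name and the threshold are only
illustrative, nothing here comes from Solr itself.

```python
import re

# HotSpot writes this line when -XX:+PrintGCApplicationStoppedTime is set
# (assumed here; it is not guaranteed to be in every GC log).
STOPPED = re.compile(
    r"Total time for which application threads were stopped: "
    r"([0-9]+[.,][0-9]+) seconds"
)


def long_pauses(lines, threshold_seconds):
    """Return (line_number, pause_seconds) for each stop above the threshold."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = STOPPED.search(line)
        if match is None:
            continue
        # Some locales print a decimal comma; normalise it before parsing.
        seconds = float(match.group(1).replace(",", "."))
        if seconds > threshold_seconds:
            hits.append((number, seconds))
    return hits
```

Calling long_pauses(open("gc.log"), 15.0) (the file name is a placeholder)
lists every stop longer than a 15 second timeout. Any hit is a candidate for
a dropped ZooKeeper session, and with it a replica going into recovery.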

Best,
Erick

On Thu, Jul 2, 2015 at 12:56 PM, Vincenzo D'Amore <v.dam...@gmail.com> wrote:
> Hi Erick,
>
> thanks for your answer.
>
> We use Java 8 and allow the heap to grow up to 16 GB:
>
>  -Xms2g -Xmx16g
>
> There are 1.5M docs and about 16 GB index size on disk.
>
> Let me also say that during the day we have a lot of small updates, from 1k to
> 50k docs each time, and we do a full update of all documents during the
> night. The 20-second GC pause happened during this full update.
>
> I haven't read Uwe's post completely, simply because it was too long; all I
> got from it was that I have to use MMapDirectory.
> But I have not yet been able to restart production with this new component.
> It is not clear whether, after the change, we only need to restart the
> core/node or whether a full reindex must be done.
>
> Thanks for your time, I'll read Uwe's post very carefully.
>
>
> On Thu, Jul 2, 2015 at 5:39 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> Vincenzo:
>>
>> First and foremost, figure out why you're having 20 second GC pauses. For
>> indexes like you're describing, this is unusual. How big is the heap
>> you allocate to the JVM?
>>
>> Check your ZooKeeper timeout. In earlier versions of SolrCloud it
>> defaulted to 15 seconds. Going into leader election would happen for no
>> obvious reason, and lengthening it to 30-60 seconds seemed to help a lot
>> of people.
>>
>> The disks should be largely irrelevant to the origin or cure for this
>> problem...
>>
>> Here's a good article on why you want to allocate "just enough" heap
>> for your app. Of course, "just enough" can be interesting to actually
>> define:
>>
>> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>
>> Best,
>> Erick
>>
>> On Thu, Jul 2, 2015 at 5:45 AM, Vincenzo D'Amore <v.dam...@gmail.com>
>> wrote:
>> > Hi All,
>> >
>> > In recent months my SolrCloud clusters have occasionally (once or twice a
>> > week) had a few replicas go down.
>> > Usually all the replicas that go down are on the same node.
>> > I'm unable to understand why a 3-node cluster with 8 cores/32 GB and high
>> > performance disks has this problem. The main index is small, about 1.5M
>> > documents, each containing very little text.
>> > I don't know if having 3 shards with 3 replicas is too much; to me it seems
>> > a fair level of high availability, but anyway this should not compromise
>> > cluster stability.
>> > All the queries take under a second, so it is responsive.
>> >
>> > A few months ago I began to think the problem was related to an old and
>> > buggy version of SolrCloud that we have to upgrade.
>> > But after reading on this list about the classic XY problem I changed my
>> > mind; maybe there is a much better solution.
>> >
>> > Last night I had, again, a couple of replicas down around 1:07 AM. This is
>> > the SolrCloud log file:
>> >
>> > http://pastebin.com/raw.php?i=bCHnqnXD
>> >
>> > At the end of the list of exceptions there are a few "cancelElection did
>> > not find election node to remove" errors, and this morning I found the
>> > replicas down.
>> >
>> > Looking at the GC log file, I found that at the same moment there was a GC
>> > that took about 20 seconds. I'm currently using the CMS (ConcurrentMarkSweep)
>> > collector, with settings taken from Shawn Heisey's suggestions:
>> >
>> > https://wiki.apache.org/solr/ShawnHeisey#CMS_.28ConcurrentMarkSweep.29_Collector
>> >
>> >
>> > http://pastebin.com/raw.php?i=VuSrg4uz
>> >
>> > Finally, looking around over the last few months I found this bug, which
>> > seems to me to be related to these problems.
>> > So I began to think that I need an upgrade. Am I right? What do you think?
>> >
>> > https://issues.apache.org/jira/browse/SOLR-6159
>> >
>> > Any help is much appreciated.
>> >
>> > Thanks,
>> > Vincenzo
>>
>
>
>
> --
> Vincenzo D'Amore
> email: v.dam...@gmail.com
> skype: free.dev
> mobile: +39 349 8513251
