1.5M docs in an hour isn't anywhere near the rates I saw trigger the LIR (leader-initiated recovery) problem, so I strongly doubt that's the issue, never mind ;)
On Thu, Jul 2, 2015 at 1:47 PM, Vincenzo D'Amore <v.dam...@gmail.com> wrote:
> We are trying to send documents as fast as we can; we wrote a multi-threaded
> SolrJ application that reads from file, Solr, or an RDBMS and updates a
> collection.
> But if we use too many threads during the day, the servers become unresponsive.
> Now, at night, with a low number of searches, we reindex the entire
> collection (1.5M docs) with 2 threads in about 1 h.
>
> As I wrote, I'm now using CMS (ConcurrentMarkSweep), and I assumed that
> using Shawn's suggestions about GC was enough to have the right
> configuration.
>
> On Thu, Jul 2, 2015 at 7:05 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> bq: and we do a full update of all documents during the night.
>>
>> How fast are you sending documents? Prior to Solr 5.2 the replicas
>> would do twice the amount of work for indexing that the leader
>> did (odd, but...) See:
>>
>> http://lucidworks.com/blog/indexing-performance-solr-5-2-now-twice-fast/
>>
>> Still, focusing on the GC pauses is probably the most fruitful. You just
>> shouldn't be getting pauses that long with 16G heaps. How long does it
>> take you to re-index? I've seen situations where indexing at an
>> _extremely_ high rate will force replicas into recovery. This took 150
>> threads all firing queries as fast as possible to hit, but I thought I'd
>> mention it.
>>
>> Best,
>> Erick
>>
>> On Thu, Jul 2, 2015 at 12:56 PM, Vincenzo D'Amore <v.dam...@gmail.com>
>> wrote:
>> > Hi Erick,
>> >
>> > thanks for your answer.
>> >
>> > We use Java 8 and allocate a 16GB heap size:
>> >
>> > -Xms2g -Xmx16g
>> >
>> > There are 1.5M docs and about 16 GB index size on disk.
>> >
>> > Let me also say, during the day we have a lot of little updates, from 1k
>> > to 50k docs every time, and we do a full update of all documents during
>> > the night. And during this full update the 20-second GC happened.
>> >
>> > I haven't read Uwe's post completely, just because it was too long; all I
>> > got was that I have to use MMapDirectory.
>> > But I was still unable to restart production with this new component.
>> > After the change it is not clear if we only need to restart the core/node
>> > or if a full reindex must be done.
>> >
>> > Thanks for your time, I'll read Uwe's post very carefully.
>> >
>> > On Thu, Jul 2, 2015 at 5:39 PM, Erick Erickson <erickerick...@gmail.com>
>> > wrote:
>> >
>> >> Vincenzo:
>> >>
>> >> First and foremost, figure out why you're having 20-second GC pauses.
>> >> For indexes like you're describing, this is unusual. How big is the heap
>> >> you allocate to the JVM?
>> >>
>> >> Check your Zookeeper timeout. In earlier versions of SolrCloud it
>> >> defaulted to 15 seconds. Going into leader election would happen for no
>> >> obvious reason, and lengthening it to 30-60 seconds seemed to help a lot
>> >> of people.
>> >>
>> >> The disks should be largely irrelevant to the origin or cure for this
>> >> problem...
>> >>
>> >> Here's a good article on why you want to allocate "just enough" heap
>> >> for your app. Of course, "just enough" can be interesting to actually
>> >> define:
>> >>
>> >> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Thu, Jul 2, 2015 at 5:45 AM, Vincenzo D'Amore <v.dam...@gmail.com>
>> >> wrote:
>> >> > Hi All,
>> >> >
>> >> > In the last few months my SolrCloud clusters have sometimes (once or
>> >> > twice a week) had a few replicas down.
>> >> > Usually all the replicas go down on the same node.
>> >> > I'm unable to understand why a 3-node cluster with 8 cores/32 GB and
>> >> > high-performance disks has this problem. The main index is small, about
>> >> > 1.5M documents with very little text inside.
>> >> > I don't know if having 3 shards with 3 replicas is too much; to me it
>> >> > seems fairly high availability, but anyway this should not compromise
>> >> > the cluster's stability.
>> >> > All the queries complete in under a second, so it is responsive.
>> >> >
>> >> > A few months ago I began to think the problem was related to an old and
>> >> > buggy version of SolrCloud that we have to upgrade.
>> >> > But reading in this list about the classic XY problem I changed my
>> >> > mind; maybe there is a much better solution.
>> >> >
>> >> > Last night I again had a couple of replicas down, around 1:07 AM; this
>> >> > is the SolrCloud log file:
>> >> >
>> >> > http://pastebin.com/raw.php?i=bCHnqnXD
>> >> >
>> >> > At the end of the exception list there are a few "cancelElection did
>> >> > not find election node to remove" errors, and this morning I found the
>> >> > replicas down.
>> >> >
>> >> > Looking at the GC log file I found that at the same moment there is a
>> >> > GC that takes about 20 seconds. Now I'm using the CMS
>> >> > (ConcurrentMarkSweep) collector taken from Shawn Heisey's suggestions:
>> >> >
>> >> > https://wiki.apache.org/solr/ShawnHeisey#CMS_.28ConcurrentMarkSweep.29_Collector
>> >> >
>> >> > http://pastebin.com/raw.php?i=VuSrg4uz
>> >> >
>> >> > Finally, looking around over the last few months I found this bug,
>> >> > which seems to me to be related to this problem.
>> >> > So I began to think that I need an upgrade. Am I right? What do you
>> >> > think about it?
>> >> >
>> >> > https://issues.apache.org/jira/browse/SOLR-6159
>> >> >
>> >> > Any help is very much appreciated.
>> >> >
>> >> > Thanks,
>> >> > Vincenzo
>> >>
>> >
>> >
>> > --
>> > Vincenzo D'Amore
>> > email: v.dam...@gmail.com
>> > skype: free.dev
>> > mobile: +39 349 8513251
>>
>
>
> --
> Vincenzo D'Amore
> email: v.dam...@gmail.com
> skype: free.dev
> mobile: +39 349 8513251
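For reference, below is a minimal sketch of the kind of batched, multi-threaded SolrJ indexer Vincenzo describes (2 threads, ~1.5M small documents, one commit at the end), assuming SolrJ 5.x. The ZooKeeper hosts, collection name, field names, and batch size are placeholders for illustration, not values taken from the thread; the actual application reads from files, Solr, or an RDBMS rather than generating documents.

```java
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BatchIndexer {

    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble and collection name.
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
        client.setDefaultCollection("collection1");

        // Keep the indexing thread count low: the thread reports that 2
        // threads reindex 1.5M docs in about an hour, while too many
        // concurrent threads made the servers unresponsive during the day.
        ExecutorService pool = Executors.newFixedThreadPool(2);

        for (int t = 0; t < 2; t++) {
            final int offset = t;
            pool.submit(() -> {
                List<SolrInputDocument> batch = new ArrayList<>();
                for (int i = offset; i < 1_500_000; i += 2) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", Integer.toString(i));
                    doc.addField("text", "small document body " + i);
                    batch.add(doc);

                    // Send documents in batches instead of one at a time.
                    if (batch.size() >= 1000) {
                        client.add(batch);
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    client.add(batch);
                }
                return null;
            });
        }

        pool.shutdown();
        pool.awaitTermination(2, TimeUnit.HOURS);

        // A single explicit commit at the end of the full reindex, rather
        // than committing per batch, keeps searcher churn (and GC pressure)
        // on the nodes lower while the bulk load is running.
        client.commit();
        client.close();
    }
}
```

The design choice worth noting is the combination of batched adds and a single final commit: it spreads load without the per-document request overhead, and it avoids the frequent commits that can amplify GC activity on already-busy nodes.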