bq: and we do a full update of all documents during the night.

How fast are you sending documents? Prior to Solr 5.2 the replicas
would do twice the amount of work for indexing that the leader did
(odd, but...) See:
http://lucidworks.com/blog/indexing-performance-solr-5-2-now-twice-fast/

Still, focusing on the GC pauses is probably the most fruitful. You
just shouldn't be getting pauses that long with 16G heaps.

How long does it take you to re-index?

I've seen situations where indexing at an _extremely_ high rate will
force replicas into recovery. This took 150 threads all firing queries
as fast as possible to hit, but I thought I'd mention it.

Best,
Erick

On Thu, Jul 2, 2015 at 12:56 PM, Vincenzo D'Amore <v.dam...@gmail.com> wrote:
> Hi Erick,
>
> thanks for your answer.
>
> We use Java 8 and allocate a 16 GB heap:
>
> -Xms2g -Xmx16g
>
> There are 1.5M docs and about 16 GB index size on disk.
>
> Let me also say, during the day we have a lot of small updates, from 1k to
> 50k docs each time, and we do a full update of all documents during the
> night. The 20-second GC happened during this full update.
>
> I haven't read Uwe's post completely, just because it was too long; all I
> got was that I have to use MMapDirectory.
> But I have not yet been able to restart production with this new component.
> After the change it is not clear whether we only need to restart the
> core/node or whether a full reindex must be done.
>
> Thanks for your time, I'll read Uwe's post very carefully.
>
>
> On Thu, Jul 2, 2015 at 5:39 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> Vincenzo:
>>
>> First and foremost, figure out why you're having 20 second GC pauses. For
>> indexes like you're describing, this is unusual. How big is the heap
>> you allocate to the JVM?
>>
>> Check your Zookeeper timeout. In earlier versions of SolrCloud it
>> defaulted to
>> 15 seconds. Going into leader election would happen for no obvious reason,
>> and lengthening it to 30-60 seconds seemed to help a lot of people.
>>
>> The disks should be largely irrelevant to the origin or cure for this
>> problem...
>>
>> Here's a good article on why you want to allocate "just enough" heap
>> for your app.
>> Of course, "just enough" can be interesting to actually
>> define:
>>
>> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>
>> Best,
>> Erick
>>
>> On Thu, Jul 2, 2015 at 5:45 AM, Vincenzo D'Amore <v.dam...@gmail.com>
>> wrote:
>> > Hi All,
>> >
>> > In recent months my SolrCloud clusters have sometimes (once or twice a
>> > week) had a few replicas down.
>> > Usually all the replicas go down on the same node.
>> > I'm unable to understand why a 3-node cluster with 8 core/32 GB and high
>> > performance disks has this problem. The main index is small, about 1.5 M
>> > documents with very small text inside.
>> > I don't know if having 3 shards with 3 replicas is too much; to me it
>> > seems a fairly high availability, but anyway this should not compromise
>> > the cluster stability.
>> > All the queries take under a second, so it is responsive.
>> >
>> > A few months ago I began to think the problem was related to an old and
>> > buggy version of SolrCloud that we have to upgrade.
>> > But after reading on this list about the classic XY problem I changed my
>> > mind; maybe there is a much better solution.
>> >
>> > Last night I had, again, a couple of replicas down around 1:07 AM. This
>> > is the SolrCloud log file:
>> >
>> > http://pastebin.com/raw.php?i=bCHnqnXD
>> >
>> > At the end of the exception list there are a few "cancelElection did not
>> > find election node to remove" errors, and this morning I found the
>> > replicas down.
>> >
>> > Looking at the GC log file I found that at the same moment there is a GC
>> > that takes about 20 seconds. I'm now using the CMS (ConcurrentMarkSweep)
>> > Collector, taken from Shawn Heisey's suggestions:
>> >
>> https://wiki.apache.org/solr/ShawnHeisey#CMS_.28ConcurrentMarkSweep.29_Collector
>> >
>> >
>> > http://pastebin.com/raw.php?i=VuSrg4uz
>> >
>> > Lastly, looking around in recent months I found this bug, which seems
>> > to me to be related to these problems.
>> > So I began to think that I need an upgrade, am I right? What do you
>> > think?
>> >
>> > https://issues.apache.org/jira/browse/SOLR-6159
>> >
>> > Any help is much appreciated.
>> >
>> > Thanks,
>> > Vincenzo
>> >
>
>
> --
> Vincenzo D'Amore
> email: v.dam...@gmail.com
> skype: free.dev
> mobile: +39 349 8513251
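
The thread above sets a GC pause of about 20 seconds against a ZooKeeper timeout that defaulted to 15 seconds in earlier SolrCloud versions. A stop-the-world pause that outlasts the session timeout can let the ZooKeeper session expire, which would be consistent with the leader election and the replicas going down. The Python sketch below only illustrates that comparison. It assumes the pause durations have already been pulled out of the GC log (the log format is not shown in the thread, so nothing is parsed here). The function name `pauses_exceeding_timeout` and the sample values other than the 20-second pause are hypothetical.

```python
def pauses_exceeding_timeout(pause_seconds, zk_timeout_seconds):
    """Return the pauses that last longer than the ZooKeeper timeout."""
    return [p for p in pause_seconds if p > zk_timeout_seconds]


# Hypothetical pause durations in seconds; only the ~20 s pause is from the thread.
pauses = [0.4, 1.2, 20.0]

# 15 s is the older default mentioned in the thread.
print(pauses_exceeding_timeout(pauses, 15))  # [20.0]

# 30 s is the low end of the 30-60 s range suggested in the thread.
print(pauses_exceeding_timeout(pauses, 30))  # []
```

With a 30-second timeout the same pause would no longer outlast the session, but that only hides the symptom. The advice in the thread still stands: finding out why the collector pauses for 20 seconds is probably the most fruitful line of investigation.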