Hi Erick, thanks for your answer.
We use Java 8 and allocate a 16 GB heap: -Xms2g -Xmx16g. There are 1.5M docs and about 16 GB of index size on disk. I should also say that during the day we have a lot of small updates, from 1k to 50k docs each time, and we do a full update of all documents during the night. It is during this full update that the 20-second GC pauses happen. For completeness, I've pasted our current JVM settings below.

I haven't read Uwe's post completely, just because it was too long; all I got from it was that I have to use MMapDirectory. But I was still unable to restart production with this new component: after the change, it is not clear to me whether we only need to restart the core/node or whether a full reindex must be done. I've also written down my understanding of that below, please correct me if it's wrong.
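These are the GC-related flags we currently pass to the JVM in our start script, roughly following Shawn's wiki page (I'm quoting from memory, so the exact values may differ slightly from what is on our nodes, and the log path is just an example):

    # heap sizing
    -Xms2g -Xmx16g
    # CMS collector settings, as suggested on Shawn's page
    -XX:+UseConcMarkSweepGC
    -XX:+UseParNewGC
    -XX:CMSInitiatingOccupancyFraction=70
    -XX:+UseCMSInitiatingOccupancyOnly
    -XX:+CMSParallelRemarkEnabled
    -XX:+ParallelRefProcEnabled
    # GC logging, so we can see the long pauses (log path is an example)
    -verbose:gc
    -Xloggc:/var/log/solr/gc.log
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime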
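About the Zookeeper timeout you mentioned: if I read the docs correctly, on a recent-style solr.xml it can be raised with something like the following (the exact syntax depends on the Solr version, and 30000 ms is just the kind of value you suggested, not necessarily what our nodes use today):

    <solrcloud>
      <!-- Zookeeper session timeout in milliseconds; example value -->
      <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
    </solrcloud>

I still have to check what our cluster is actually configured with.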
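Regarding MMapDirectory, my understanding so far (please correct me if I'm wrong) is that it is just a matter of changing the directoryFactory in solrconfig.xml, something like:

    <!-- memory-mapped index access; requires a 64-bit JVM and OS -->
    <directoryFactory name="DirectoryFactory"
                      class="${solr.directoryFactory:solr.MMapDirectoryFactory}"/>

Since this only changes how the existing index files are opened, not their on-disk format, a core reload/restart should be enough, with no full reindex needed. I also understood that on a 64-bit JVM the default NRTCachingDirectoryFactory already uses MMapDirectory under the hood, so maybe we are already covered.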
Thanks for your time, I'll read Uwe's post very carefully.

On Thu, Jul 2, 2015 at 5:39 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> Vincenzo:
>
> First and foremost, figure out why you're having 20-second GC pauses. For
> indexes like the one you're describing, this is unusual. How big is the
> heap you allocate to the JVM?
>
> Check your Zookeeper timeout. In earlier versions of SolrCloud it
> defaulted to 15 seconds. Going into leader election would happen for no
> obvious reason, and lengthening it to 30-60 seconds seemed to help a lot
> of people.
>
> The disks should be largely irrelevant to the origin or cure for this
> problem...
>
> Here's a good article on why you want to allocate "just enough" heap
> for your app. Of course, "just enough" can be interesting to actually
> define:
>
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> Best,
> Erick
>
> On Thu, Jul 2, 2015 at 5:45 AM, Vincenzo D'Amore <v.dam...@gmail.com>
> wrote:
> > Hi All,
> >
> > In recent months my SolrCloud clusters have sometimes (once or twice a
> > week) had a few replicas down. Usually all the replicas go down on the
> > same node. I'm unable to understand why a 3-node cluster with 8 cores
> > and 32 GB of RAM per node, on high-performance disks, has this problem.
> > The main index is small, about 1.5M documents with very little text
> > inside.
> > I don't know if having 3 shards with 3 replicas is too much; to me it
> > seems a fairly high level of availability, but in any case this should
> > not compromise the cluster's stability.
> > All the queries run in under a second, so it is responsive.
> >
> > A few months ago I began to think the problem was related to an old and
> > buggy version of SolrCloud that we have to upgrade. But after reading
> > in this list about the classic XY problem, I changed my mind; maybe
> > there is a much better solution.
> >
> > Last night I had, again, a couple of replicas down around 1:07 AM. This
> > is the SolrCloud log file:
> >
> > http://pastebin.com/raw.php?i=bCHnqnXD
> >
> > At the end of the exception list there are a few "cancelElection did
> > not find election node to remove" errors, and this morning I found the
> > replicas down.
> >
> > Looking at the GC log file, I found that at the same moment there is a
> > GC that takes about 20 seconds. I'm currently using the CMS
> > (ConcurrentMarkSweep) collector, following Shawn Heisey's suggestions:
> > https://wiki.apache.org/solr/ShawnHeisey#CMS_.28ConcurrentMarkSweep.29_Collector
> >
> > http://pastebin.com/raw.php?i=VuSrg4uz
> >
> > Finally, looking around over the last few months, I found this bug,
> > which seems to me to be related to these problems. So I began to think
> > that I need an upgrade, am I right? What do you think about it?
> >
> > https://issues.apache.org/jira/browse/SOLR-6159
> >
> > Any help is very appreciated.
> >
> > Thanks,
> > Vincenzo

--
Vincenzo D'Amore
email: v.dam...@gmail.com
skype: free.dev
mobile: +39 349 8513251