Vincenzo: First and foremost, figure out why you're having 20-second GC pauses. For indexes like the one you're describing, this is unusual. How big is the heap you allocate to the JVM?
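As a starting point for tracking down the pauses, here is a minimal sketch that lists the long stop-the-world entries in a GC log. It assumes the JVM was started with -XX:+PrintGCApplicationStoppedTime, so the log contains "Total time for which application threads were stopped" lines. The function name long_pauses and the 15-second default threshold are illustrative choices, not anything Solr provides.

```python
import re

# Line format written by HotSpot when -XX:+PrintGCApplicationStoppedTime is set
# (assumption: the GC log in question was produced with that flag).
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped: "
    r"(\d+(?:[.,]\d+)?) seconds"
)


def long_pauses(lines, threshold_seconds=15.0):
    """Return (line_number, pause_seconds) for each stop longer than the threshold."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = STOPPED_RE.search(line)
        if match is None:
            continue  # not a stopped-time line
        # Some locales print a decimal comma, so normalize before parsing.
        seconds = float(match.group(1).replace(",", "."))
        if seconds > threshold_seconds:
            hits.append((number, seconds))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python long_pauses.py /path/to/gc.log
    with open(sys.argv[1], errors="replace") as log:
        for number, seconds in long_pauses(log):
            print(f"line {number}: {seconds:.1f} s pause")
```

Comparing the reported line numbers against the timestamps in the Solr log shows whether the replicas dropped out at the same moment as a long pause.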
Check your ZooKeeper timeout. In earlier versions of SolrCloud it defaulted to 15 seconds. Nodes would go into leader election for no obvious reason, and lengthening the timeout to 30-60 seconds seemed to help a lot of people. A GC pause longer than the timeout makes the node miss its ZooKeeper heartbeats, so it looks dead to the rest of the cluster.

The disks should be largely irrelevant to the origin or cure for this problem...

Here's a good article on why you want to allocate "just enough" heap for your app. Of course, "just enough" can be interesting to actually define:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Best,
Erick

On Thu, Jul 2, 2015 at 5:45 AM, Vincenzo D'Amore <v.dam...@gmail.com> wrote:
> Hi All,
>
> In the last few months my SolrCloud clusters have sometimes (once or
> twice a week) had a few replicas go down.
> Usually all the replicas go down on the same node.
> I'm unable to understand why a 3-node cluster with 8 cores/32 GB and
> high performance disks has this problem. The main index is small, about
> 1.5 M documents with very small text inside.
> I don't know if having 3 shards with 3 replicas is too much; to me it
> seems fair for high availability, but anyway this should not compromise
> the cluster stability.
> All the queries are under a second, so it is responsive.
>
> A few months ago I began to think the problem was related to an old and
> buggy version of SolrCloud that we have to upgrade.
> But reading in this list about the classic XY problem I changed my mind;
> maybe there is a much better solution.
>
> Last night I had, again, a couple of replicas down around 1:07 AM. This
> is the SolrCloud log file:
>
> http://pastebin.com/raw.php?i=bCHnqnXD
>
> At the end of the exception list there are a few "cancelElection did not
> find election node to remove" errors, and this morning I found the
> replicas down.
>
> Looking at the GC log file I found that at the same moment there is a GC
> that takes about 20 seconds. I'm now using the CMS (ConcurrentMarkSweep)
> Collector, taken from Shawn Heisey's suggestions:
> https://wiki.apache.org/solr/ShawnHeisey#CMS_.28ConcurrentMarkSweep.29_Collector
>
> http://pastebin.com/raw.php?i=VuSrg4uz
>
> Finally, looking around over the last few months I found this bug, which
> seems to me to be related to these problems.
> So I began to think that I need an upgrade, am I right? What do you
> think?
>
> https://issues.apache.org/jira/browse/SOLR-6159
>
> Any help is much appreciated.
>
> Thanks,
> Vincenzo