Hi all,

Over the last few months my SolrCloud clusters have occasionally (once or twice a week) had a few replicas go down. Usually all the affected replicas are on the same node. I can't understand why a 3-node cluster with 8 cores/32 GB of RAM and high-performance disks has this problem. The main index is small: about 1.5 million documents, each containing very little text. I don't know whether 3 shards with 3 replicas is too many; to me it seems a fair level of high availability, and in any case it should not compromise cluster stability. All queries complete in under a second, so the cluster is responsive.
A few months ago I began to think the problem was caused by the old, buggy version of SolrCloud we are running, which we have to upgrade. But after reading on this list about the classic XY problem I changed my mind: maybe there is a much better solution.

Last night I had, again, a couple of replicas go down around 1:07 AM. This is the SolrCloud log file:

http://pastebin.com/raw.php?i=bCHnqnXD

At the end of the list of exceptions there are a few "cancelElection did not find election node to remove" errors, and this morning I found the replicas down. Looking at the GC log file, I found that at the same moment there was a GC that took about 20 seconds. I'm currently using the CMS (ConcurrentMarkSweep) collector, with settings taken from Shawn Heisey's suggestions:

https://wiki.apache.org/solr/ShawnHeisey#CMS_.28ConcurrentMarkSweep.29_Collector
http://pastebin.com/raw.php?i=VuSrg4uz

Finally, while looking around over the last few months I found this bug, which seems to me to be related to this problem:

https://issues.apache.org/jira/browse/SOLR-6159

So I have begun to think that I do need an upgrade. Am I right? What do you think?

Any help is much appreciated.

Thanks,
Vincenzo