Hi All,

In recent months my SolrCloud clusters have occasionally (once or twice a
week) had a few replicas go down.
Usually all the replicas that go down are on the same node.
I can't understand why a 3-node cluster with 8 cores/32 GB and high
performance disks has this problem. The main index is small, about 1.5 M
documents with very little text in each.
I don't know if having 3 shards with 3 replicas is too much; to me it seems
a fair level of high availability, but in any case it should not compromise
cluster stability.
All queries complete in under a second, so the cluster is responsive.

A few months ago I began to think the problem was related to the old and
buggy version of SolrCloud that we have to upgrade.
But after reading on this list about the classic XY problem I changed my
mind: maybe there is a much better solution.

Last night I again had a couple of replicas go down, around 1:07 AM. This is
the SolrCloud log file:

http://pastebin.com/raw.php?i=bCHnqnXD

At the end of the list of exceptions there are a few "cancelElection did not
find election node to remove" errors, and this morning I found the replicas
down.

Looking at the GC log file, I found that at the same moment there is a GC
that takes about 20 seconds. I'm currently using the CMS
(ConcurrentMarkSweep) collector settings taken from Shawn Heisey's
suggestions:
https://wiki.apache.org/solr/ShawnHeisey#CMS_.28ConcurrentMarkSweep.29_Collector

This is the GC log file:

http://pastebin.com/raw.php?i=VuSrg4uz
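
In case it helps anyone check their own GC logs for pauses like the 20-second
one, here is a minimal sketch of how such a log could be scanned. It assumes
the JVM writes "Total time for which application threads were stopped" lines
(HotSpot does this when -XX:+PrintGCApplicationStoppedTime is set). The
function name long_pauses and the 10-second default threshold are my own
arbitrary choices, not anything taken from Solr.

```python
import re
import sys

# Assumed line format (HotSpot with -XX:+PrintGCApplicationStoppedTime):
#   "... Total time for which application threads were stopped: 20.1234567 seconds"
STOPPED = re.compile(
    r"Total time for which application threads were stopped: ([0-9.]+) seconds"
)


def long_pauses(lines, threshold_s=10.0):
    """Return (line_number, pause_seconds) for each pause longer than threshold_s.

    threshold_s is an arbitrary, hypothetical default; tune it to your setup.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        match = STOPPED.search(line)
        if match is None:
            continue
        seconds = float(match.group(1))
        if seconds > threshold_s:
            hits.append((number, seconds))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python script.py <path to GC log>
    with open(sys.argv[1]) as handle:
        for number, seconds in long_pauses(handle):
            print(f"line {number}: application stopped for {seconds:.1f} s")
```

Run against a GC log path, it prints the line number and duration of each long
pause, which can then be compared with the timestamps of the exceptions in the
SolrCloud log.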

Finally, looking around over the past months I found this bug, which seems
to me to be related to these problems.
So I began to think that I need an upgrade; am I right? What do you think?

https://issues.apache.org/jira/browse/SOLR-6159

Any help is much appreciated.

Thanks,
Vincenzo
