Without more data, I'd guess one of two things:

1> You're seeing stop-the-world GC pauses that cause ZooKeeper to
think the node is unresponsive, which puts the node into recovery and
things go bad from there.

2> Somewhere in your Solr logs you'll see OutOfMemoryError entries,
which can also cascade into a bunch of problems. See the sketch below
for a quick way to check for both.
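
A rough sketch of how I'd check, assuming the stock bin/solr start
scripts, a solr.in.sh and a log directory like /var/solr/logs (adjust
names and paths to your install; the flags themselves are standard
Java 7/8 GC-logging options):

    # 1> make long stop-the-world pauses visible; skip this if your
    #    start script already writes a GC log (recent ones do)
    SOLR_OPTS="$SOLR_OPTS -verbose:gc -XX:+PrintGCDetails \
      -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime \
      -Xloggc:/var/solr/logs/solr_gc.log"

    # 2> look for OOMs in the Solr logs on every node
    grep -i "OutOfMemoryError" /var/solr/logs/solr.log*

Any single pause longer than your ZooKeeper session timeout
(zkClientTimeout) will expire the session and kick off exactly the
recovery cascade you're describing.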

In general it's an anti-pattern to allocate such a large portion of
your physical memory to the JVM, since Lucene relies on the OS page
cache (via MMapDirectory) to keep the index files fast; see:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
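
With your numbers (6GB heap on an 8GB box for a ~5.3GB index) I'd try
handing a chunk of that heap back to the OS. A sketch with the stock
scripts (the 4g figure is only an illustration; the right value
depends on your index and query load):

    # one-off, on the command line
    bin/solr restart -m 4g

    # or permanently, in solr.in.sh
    SOLR_JAVA_MEM="-Xms4g -Xmx4g"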



Best,
Erick



On Mon, Nov 2, 2015 at 1:21 PM, Björn Häuser <bjoernhaeu...@gmail.com> wrote:
> Hey there,
>
> We are running a SolrCloud cluster with 4 nodes, all with the same
> config. Each node has 8GB of memory, 6GB of which is assigned to the
> JVM. This is maybe too much, but it worked for a long time.
>
> We currently run with 2 shards, 2 replicas and 11 collections. The
> complete data-dir is about 5.3 GB.
> I think we should move some JVM heap back to the OS.
>
> We are running Solr 5.2.1. As I could not see any SolrCloud-related
> bugs in the release notes for 5.3.0 and 5.3.1, we did not bother to
> upgrade first.
>
> One of our nodes (node A) reports these errors:
>
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> Error from server at http://10.41.199.201:9004/solr/catalogue: Invalid
> version (expected 2, but 101) or the data in not in 'javabin' format
>
> Stacktrace: https://gist.github.com/bjoernhaeuser/46ac851586a51f8ec171
>
> And shortly after (4 seconds later) this happens on a *different* node (Node B):
>
> Stopping recovery for core=suggestion coreNodeName=core_node2
>
> No stacktrace for this, but it happens for all 11 collections.
>
> 6 seconds after that Node C reports these errors:
>
> org.apache.solr.common.SolrException:
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /configs/customers/params.json
>
> Stacktrace: https://gist.github.com/bjoernhaeuser/45a244dc32d74ac989f8
>
> This also happens for all 11 collections.
>
> And then different errors happen:
>
> OverseerAutoReplicaFailoverThread had an error in its thread work
> loop.:org.apache.solr.common.SolrException: Error reading cluster
> properties
>
> cancelElection did not find election node to remove
> /overseer_elect/election/6507903311068798704-10.41.199.192:9004_solr-n_0000000112
>
> At that point the cluster is broken and stops responding to most
> queries. At the same time, ZooKeeper looks okay.
>
> The cluster cannot heal itself from that situation, and we are forced
> to take manual action: restarting node after node and hoping that
> SolrCloud eventually recovers. This sometimes takes several minutes
> and several restarts of various nodes.
>
> We can provide more log data if needed.
>
> Is there anything we can start digging into to find the underlying
> cause of this problem?
>
> Thanks in advance
> Björn
