Without more data, I'd guess one of two things:

1> you're seeing stop-the-world GC pauses that cause ZooKeeper to think the node is unresponsive, which puts a node into recovery and things go bad from there.
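One quick way to confirm that is GC logging. A minimal sketch, assuming a Java 7/8 HotSpot JVM (the flag names changed in Java 9+); the log path is only a placeholder, and the flags go wherever you set Solr's JVM options (e.g. solr.in.sh):

  -verbose:gc
  -Xloggc:/var/solr/logs/solr_gc.log
  -XX:+PrintGCDetails
  -XX:+PrintGCDateStamps
  -XX:+PrintGCApplicationStoppedTime

If that log shows pauses longer than your ZooKeeper session timeout (zkClientTimeout), this is almost certainly the cause.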
2> Somewhere in your Solr logs you'll see OutOfMemory errors, which can also cascade into a bunch of problems.

In general it's an anti-pattern to allocate such a large portion of your physical memory to the JVM, see:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
(A rough sizing sketch follows below the quoted message.)

Best,
Erick

On Mon, Nov 2, 2015 at 1:21 PM, Björn Häuser <bjoernhaeu...@gmail.com> wrote:
> Hey there,
>
> we are running a SolrCloud which has 4 nodes, all with the same config.
> Each node has 8 GB of memory, 6 GB assigned to the JVM. This is maybe
> too much, but it worked for a long time.
>
> We currently run with 2 shards, 2 replicas and 11 collections. The
> complete data-dir is about 5.3 GB.
> I think we should move some JVM heap back to the OS.
>
> We are running Solr 5.2.1. As I could not see any SolrCloud-related bugs
> in the release notes for 5.3.0 and 5.3.1, we did not bother to upgrade
> first.
>
> One of our nodes (node A) reports these errors:
>
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> Error from server at http://10.41.199.201:9004/solr/catalogue: Invalid
> version (expected 2, but 101) or the data in not in 'javabin' format
>
> Stacktrace: https://gist.github.com/bjoernhaeuser/46ac851586a51f8ec171
>
> And shortly after (4 seconds) this happens on a *different* node (node B):
>
> Stopping recovery for core=suggestion coreNodeName=core_node2
>
> No stacktrace for this, but it happens for all 11 collections.
>
> 6 seconds after that, node C reports these errors:
>
> org.apache.solr.common.SolrException:
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /configs/customers/params.json
>
> Stacktrace: https://gist.github.com/bjoernhaeuser/45a244dc32d74ac989f8
>
> This also happens for all 11 collections.
>
> And then different errors happen:
>
> OverseerAutoReplicaFailoverThread had an error in its thread work
> loop.:org.apache.solr.common.SolrException: Error reading cluster
> properties
>
> cancelElection did not find election node to remove
> /overseer_elect/election/6507903311068798704-10.41.199.192:9004_solr-n_0000000112
>
> At that point the cluster is broken and stops responding to most
> queries. At the same time, ZooKeeper looks okay.
>
> The cluster cannot self-heal from that situation, and we are forced to
> take manual action: restart node after node and hope that SolrCloud
> eventually recovers, which sometimes takes several minutes and several
> restarts of various nodes.
>
> We can provide more log data if needed.
>
> Is there anything we can start digging into to find the underlying
> cause of that problem?
>
> Thanks in advance
> Björn
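To make the heap-vs-OS-cache trade-off concrete, here is a rough sketch with assumed numbers, not a recommendation: with an index in the 5 GB range and 8 GB of RAM per node, a 2-3 GB heap leaves the remainder to the OS page cache that MMapDirectory depends on. With the bin/solr script that could look like this (the ZooKeeper host string is a placeholder):

  bin/solr start -c -z zk1:2181,zk2:2181,zk3:2181 -m 2g

Here -m sets both -Xms and -Xmx. The value that actually fits depends on your Solr caches, sorting/faceting and query load, so lower the heap in steps and watch GC behavior as you go.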