Hi! Thank you for your super fast answer.
I can provide more data, the question is which data :-)

These are the config parameters Solr runs with (taken from the admin UI):
https://gist.github.com/bjoernhaeuser/24e7080b9ff2a8785740

These are the log files:
https://gist.github.com/bjoernhaeuser/a60c2319d71eb35e9f1b

I think your first observation is correct: SolrCloud loses the connection
to ZooKeeper because the connection times out. But why isn't SolrCloud
able to recover itself?

Thanks
Björn

2015-11-02 22:32 GMT+01:00 Erick Erickson <erickerick...@gmail.com>:
> Without more data, I'd guess one of two things:
>
> 1> you're seeing stop-the-world GC pauses that cause ZooKeeper to
> think the node is unresponsive, which puts a node into recovery and
> things go bad from there.
>
> 2> somewhere in your Solr logs you'll see OutOfMemory errors, which can
> also cascade into a bunch of problems.
>
> In general it's an anti-pattern to allocate such a large portion of
> your physical memory to the JVM, see:
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> Best,
> Erick
>
> On Mon, Nov 2, 2015 at 1:21 PM, Björn Häuser <bjoernhaeu...@gmail.com> wrote:
>> Hey there,
>>
>> we are running a SolrCloud cluster with 4 nodes, all with the same
>> config. Each node has 8 GB of memory, 6 GB of which are assigned to
>> the JVM. This is maybe too much, but it worked for a long time.
>>
>> We currently run with 2 shards, 2 replicas and 11 collections. The
>> complete data dir is about 5.3 GB. I think we should move some JVM
>> heap back to the OS.
>>
>> We are running Solr 5.2.1. As I could not see any SolrCloud-related
>> bugs in the release notes for 5.3.0 and 5.3.1, we did not bother to
>> upgrade first.
>>
>> One of our nodes (node A) reports these errors:
>>
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at http://10.41.199.201:9004/solr/catalogue: Invalid
>> version (expected 2, but 101) or the data in not in 'javabin' format
>>
>> Stacktrace: https://gist.github.com/bjoernhaeuser/46ac851586a51f8ec171
>>
>> Shortly after (4 seconds), this happens on a *different* node (node B):
>>
>> Stopping recovery for core=suggestion coreNodeName=core_node2
>>
>> No stacktrace for this, but it happens for all 11 collections.
>>
>> 6 seconds after that, node C reports these errors:
>>
>> org.apache.solr.common.SolrException:
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> KeeperErrorCode = Session expired for /configs/customers/params.json
>>
>> Stacktrace: https://gist.github.com/bjoernhaeuser/45a244dc32d74ac989f8
>>
>> This also happens for all 11 collections.
>>
>> And then different errors appear:
>>
>> OverseerAutoReplicaFailoverThread had an error in its thread work
>> loop.:org.apache.solr.common.SolrException: Error reading cluster
>> properties
>>
>> cancelElection did not find election node to remove
>> /overseer_elect/election/6507903311068798704-10.41.199.192:9004_solr-n_0000000112
>>
>> At that point the cluster is broken and stops responding to most
>> queries. At the same time, ZooKeeper looks okay.
>>
>> The cluster cannot self-heal from that situation, and we are forced to
>> take manual action: restart node after node and hope that SolrCloud
>> eventually recovers, which sometimes takes several minutes and several
>> restarts of various nodes.
>>
>> We can provide more log data if needed.
>>
>> Is there anywhere we can start digging to find the underlying cause of
>> this problem?
>>
>> Thanks in advance
>> Björn
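
For what it's worth, a minimal sketch of how Erick's two suggestions could
be acted on in solr.in.sh, assuming a standard bin/solr install (the 4 GB
value and the log path are only illustrative, not taken from this thread):

    # Hand heap back to the OS page cache: shrink the JVM from 6 GB to e.g. 4 GB
    SOLR_JAVA_MEM="-Xms4g -Xmx4g"

    # Standard HotSpot GC logging, to confirm or rule out long stop-the-world pauses
    # (log path is illustrative)
    SOLR_OPTS="$SOLR_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/solr/logs/solr_gc.log"

Depending on the install, bin/solr may already write a GC log into the logs
directory, in which case only the heap line is needed. If the GC log shows
application-stopped times longer than the ZooKeeper session timeout right
before the "Session expired" messages, that would confirm Erick's guess 1>.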
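
Relatedly, the window after which ZooKeeper declares the session dead is
zkClientTimeout. This isn't something suggested in the thread, but assuming
the stock solr.xml (which reads the value from the ${zkClientTimeout:...}
system property), it can be set from solr.in.sh; the 30-second value below
is just the usual default, not a recommendation:

    # Picked up by the stock solr.xml via ${zkClientTimeout:30000}
    SOLR_OPTS="$SOLR_OPTS -DzkClientTimeout=30000"

A longer timeout only buys headroom for short pauses, though; it does not
explain why the cluster fails to recover once the session has actually
expired, which is the open question above.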