Hi!

Thank you for your super fast answer.

I can provide more data; the question is which data :-)

These are the config parameters Solr runs with:
https://gist.github.com/bjoernhaeuser/24e7080b9ff2a8785740 (taken from
the admin UI)

These are the log files:

https://gist.github.com/bjoernhaeuser/a60c2319d71eb35e9f1b

I think your first observation is correct: SolrCloud loses the
connection to ZooKeeper because the connection times out.

But why isn't SolrCloud able to recover by itself?
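
Would raising the ZooKeeper client timeout at least buy us some headroom
while we look for the root cause? I am thinking of something like this in
solr.in.sh (just a sketch; 30000 ms is an example value, and on our install
the timeout may be set via zkClientTimeout in solr.xml instead):

    ZK_CLIENT_TIMEOUT="30000"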

Thanks
Björn


2015-11-02 22:32 GMT+01:00 Erick Erickson <erickerick...@gmail.com>:
> Without more data, I'd guess one of two things:
>
> 1> You're seeing stop-the-world GC pauses that cause ZooKeeper to
> think the node is unresponsive, which puts the node into recovery, and
> things go bad from there.
>
> 2> Somewhere in your Solr logs you'll see OutOfMemory errors, which
> can also cascade into a bunch of problems.
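
To check both of these, I will look at the GC log and grep the Solr logs
for OOM errors, roughly like this (a sketch: the paths are from a default
install, and the first grep assumes the GC log contains application
stopped times, e.g. via -XX:+PrintGCApplicationStoppedTime):

    grep "Total time for which application threads were stopped" \
        /var/solr/logs/solr_gc.log | tail -n 50

    grep -ril "OutOfMemoryError" /var/solr/logs/

I will report back what turns up.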
>
> In general it's an anti-pattern to allocate such a large portion of
> your physical memory to the JVM, see:
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
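
That makes sense. We will shrink the heap and leave the rest of the 8 GB
to the OS page cache, e.g. in solr.in.sh (a sketch only, the exact value
still needs testing against our workload):

    SOLR_JAVA_MEM="-Xms2g -Xmx2g"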
>
>
>
> Best,
> Erick
>
>
>
> On Mon, Nov 2, 2015 at 1:21 PM, Björn Häuser <bjoernhaeu...@gmail.com> wrote:
>> Hey there,
>>
>> we are running a SolrCloud cluster with 4 nodes, all with the same
>> config. Each node has 8 GB of memory, 6 GB of which are assigned to
>> the JVM. This is maybe too much, but it worked for a long time.
>>
>> We currently run with 2 shards, 2 replicas and 11 collections. The
>> complete data-dir is about 5.3 GB.
>> I think we should move some JVM heap back to the OS.
>>
>> We are running Solr 5.2.1. As I could not see any SolrCloud-related
>> bugs in the release notes for 5.3.0 and 5.3.1, we did not bother to
>> upgrade first.
>>
>> One of our nodes (Node A) reports these errors:
>>
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at http://10.41.199.201:9004/solr/catalogue: Invalid
>> version (expected 2, but 101) or the data in not in 'javabin' format
>>
>> Stacktrace: https://gist.github.com/bjoernhaeuser/46ac851586a51f8ec171
>>
>> And shortly after (4 seconds) this happens on a *different* node (Node B):
>>
>> Stopping recovery for core=suggestion coreNodeName=core_node2
>>
>> No stacktrace for this, but it happens for all 11 collections.
>>
>> 6 seconds after that, Node C reports these errors:
>>
>> org.apache.solr.common.SolrException:
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> KeeperErrorCode = Session expired for /configs/customers/params.json
>>
>> Stacktrace: https://gist.github.com/bjoernhaeuser/45a244dc32d74ac989f8
>>
>> This also happens for all 11 collections.
>>
>> And then different errors happen:
>>
>> OverseerAutoReplicaFailoverThread had an error in its thread work
>> loop.:org.apache.solr.common.SolrException: Error reading cluster
>> properties
>>
>> cancelElection did not find election node to remove
>> /overseer_elect/election/6507903311068798704-10.41.199.192:9004_solr-n_0000000112
>>
>> At that point the cluster is broken and stops responding to most
>> queries. At the same time, ZooKeeper looks okay.
>>
>> The cluster cannot self-heal from that situation, so we are forced
>> to take manual action and restart node after node, hoping that
>> SolrCloud eventually recovers. This sometimes takes several minutes
>> and several restarts of various nodes.
>>
>> We can provide more log data if needed.
>>
>> Is there anywhere we can start digging to find the underlying cause
>> of this problem?
>>
>> Thanks in advance
>> Björn
