SOLR-5243 and SOLR-5240 should improve the situation. Both fixes are in 
4.5 - the first RC for 4.5 will likely come tomorrow.

Thanks to yonik for sussing these out.

- Mark

On Sep 17, 2013, at 2:43 PM, Mark Miller <markrmil...@gmail.com> wrote:

> 
> On Sep 17, 2013, at 12:00 PM, Vladimir Veljkovic 
> <vladimir.veljko...@boxalino.com> wrote:
> 
>> Hello there,
>> 
>> we have the following setup:
>> 
>> SolrCloud 4.4.0 (3 nodes, physical machines)
>> Zookeeper 3.4.5 (3 nodes, physical machines)
>> 
>> We have a number of rather small collections (~10K or ~100K documents) 
>> that we would like to load onto all Solr instances (numShards=1, 
>> replicationFactor=3) and access through the local network interface, as 
>> load balancing is done in layers above.
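>> 
>> For reference, we create each collection via the Collections API, roughly 
>> like this (the host and collection name here are placeholders):
>> 
>>   curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=products_v2&numShards=1&replicationFactor=3'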
>> 
>> We can live with rebuilding entire collections whenever we need to (and 
>> in the test phase we actually do), switching the collection aliases over, 
>> and removing the old collections.
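>> 
>> The alias swap itself is roughly the following (alias and collection 
>> names are placeholders):
>> 
>>   # point the serving alias at the freshly loaded collection
>>   curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_v2'
>>   # then drop the superseded collection
>>   curl 'http://localhost:8983/solr/admin/collections?action=DELETE&name=products_v1'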
>> 
>> We stumbled across the following problem: as soon as each of the three 
>> Solr nodes is the leader of at least one collection, restarting any node 
>> makes it completely unresponsive (timeouts), both through the admin 
>> interface and for replication. If we restart all Solr nodes, the cluster 
>> ends up in some kind of deadlock, and the only remedy we have found is a 
>> clean Solr installation, removing the ZooKeeper data and re-posting the 
>> collections.
>> 
>> Apparently, the leader is waiting for the replicas to come up, and when 
>> they try to synchronize their HTTP requests time out, so everything ends 
>> up in some kind of deadlock, maybe related to:
>> 
>> https://issues.apache.org/jira/browse/SOLR-5240
> 
> Yup, that sounds exactly like what you would expect with SOLR-5240. A fix for 
> that is coming in 4.5, which is probably a week or so away.
> 
>> 
>> Eventually (after a few minutes), the leader takes over and marks the 
>> collections "active", but it remains blocked on the HTTP interface, so the 
>> other nodes cannot synchronize.
>> 
>> In further tests, we loaded 4 collections with numShards=1 and 
>> replicationFactor=2. By chance, one node became the leader of all 4 
>> collections. Restarting a node that was not the leader worked without 
>> problems, but when we restarted the leader the following happened:
>> - the leader shut down, and the other nodes became leaders of 2 collections each
>> - the leader started up, 3 collections on it became "active", but one 
>> collection remained "down" and the node became unresponsive and timed out 
>> on HTTP requests.
> 
> Hard to say - I'll experiment with 4.5 and see if I can duplicate this.
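> 
> As a side note: when the admin UI is timing out, you can still see which 
> node is the leader of each collection by reading the cluster state straight 
> from ZooKeeper, e.g. with the ZooKeeper CLI (the host/port is a placeholder):
> 
>   zkCli.sh -server zk1:2181 get /clusterstate.json
> 
> In 4.x the leader of each shard is flagged on its replica entry in there.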
> 
> - Mark
> 
>> 
>> As this behavior is completely unexpected for a cluster solution, I wonder 
>> whether somebody else has experienced the same problems or whether we are 
>> doing something entirely wrong.
>> 
>> Best regards
>> 
>> -- 
>> 
>> Vladimir Veljkovic
>> Senior Java Developer
>> 
>> Boxalino AG
>> 
>> vladimir.veljko...@boxalino.com 
>> www.boxalino.com 
>> 
> 
