SOLR-5243 and SOLR-5240 should improve the situation. Both fixes are in 4.5; the first 4.5 RC will likely come tomorrow.
Thanks to yonik for sussing these out.

- Mark

On Sep 17, 2013, at 2:43 PM, Mark Miller <markrmil...@gmail.com> wrote:

> 
> On Sep 17, 2013, at 12:00 PM, Vladimir Veljkovic <vladimir.veljko...@boxalino.com> wrote:
> 
>> Hello there,
>> 
>> we have the following setup:
>> 
>> SolrCloud 4.4.0 (3 nodes, physical machines)
>> ZooKeeper 3.4.5 (3 nodes, physical machines)
>> 
>> We have a number of rather small collections (~10K or ~100K documents) that we would like to load onto all Solr instances (numShards=1, replication_factor=3) and access through the local network interface, as load balancing is done in the layers above.
>> 
>> We can live (and in the test phase we actually do) with updating entire collections whenever we need to, switching collection aliases, and removing the old collections.
>> 
>> We stumbled across the following problem: as soon as all three Solr nodes become the leader of at least one collection, restarting any node makes it completely unresponsive (timeout), both through the admin interface and for replication. If we restart all Solr nodes, the cluster ends up in some kind of deadlock, and the only remedy we have found is a clean Solr installation, removing the ZooKeeper data and re-posting the collections.
>> 
>> Apparently, the leader waits for the replicas to come up, and they try to synchronize but time out on HTTP requests, so everything ends up in some kind of deadlock, maybe related to:
>> 
>> https://issues.apache.org/jira/browse/SOLR-5240
> 
> Yup, that sounds exactly like what you would expect with SOLR-5240. A fix for that is coming in 4.5, which is probably a week or so away.
> 
>> 
>> Eventually (after a few minutes), the leader takes over and marks the collections "active", but it remains blocked on the HTTP interface, so the other nodes cannot synchronize.
>> 
>> In further tests, we loaded 4 collections with numShards=1 and replication_factor=2. By chance, one node became the leader of all 4 collections. Restarting a node that was not the leader worked without problems, but when we restarted the leader:
>> - the leader shut down, and the other nodes became leaders of 2 collections each
>> - the leader started up, 3 collections on it became "active", one collection remained "down", and the node became unresponsive and timed out on HTTP requests.
> 
> Hard to say - I'll experiment with 4.5 and see if I can duplicate this.
> 
> - Mark
> 
>> 
>> As this behavior is completely unexpected for a cluster solution, I wonder whether somebody else has experienced the same problems or we are doing something entirely wrong.
>> 
>> Best regards
>> 
>> -- 
>> 
>> Vladimir Veljkovic
>> Senior Java Developer
>> 
>> Boxalino AG
>> 
>> vladimir.veljko...@boxalino.com
>> www.boxalino.com
>> 
>> Tuning Kit for your Online Shop
>> 
>> Product Search - Recommendations - Landing Pages - Data Intelligence - Mobile Commerce
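
For reference, the "build a new collection, switch the alias, drop the old one" workflow described in the thread comes down to three Collections API calls (CREATE, CREATEALIAS, DELETE, all present since 4.x). A rough sketch in Java against a single node - the host, the collection names, and the "myconf" config set are illustrative assumptions, not taken from the thread:

// Rough sketch: rotate a small collection behind an alias via the
// Collections API. Host, collection names, and the "myconf" config
// set are illustrative assumptions only.
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class AliasRotation {

    // Issue a Collections API call and fail loudly on a non-200 response.
    static void collectionsApi(String params) throws IOException {
        URL url = new URL("http://localhost:8983/solr/admin/collections?" + params);
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        int status = con.getResponseCode();
        con.disconnect();
        if (status != 200) {
            throw new IOException("Call failed (" + status + "): " + params);
        }
    }

    public static void main(String[] args) throws IOException {
        // Build the next generation alongside the live one.
        collectionsApi("action=CREATE&name=products_v2"
                + "&numShards=1&replicationFactor=3"
                + "&collection.configName=myconf");

        // ... index the fresh documents into products_v2 here ...

        // Point the alias the application queries at the new generation.
        collectionsApi("action=CREATEALIAS&name=products&collections=products_v2");

        // Drop the previous generation once nothing reads it anymore.
        collectionsApi("action=DELETE&name=products_v1");
    }
}

Since the application only ever queries the alias, the switch is atomic from its point of view; the restart deadlock in the thread is orthogonal to this workflow and is what SOLR-5240/SOLR-5243 address.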