With 6 ZooKeeper instances you need at least 4 of them running at the same time to keep a quorum. How could you stop 4 instances and leave only 2 running? ZooKeeper cannot work under those conditions.
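
To make the arithmetic explicit: an ensemble of N members needs a strict
majority, floor(N/2) + 1, to stay alive. A minimal sketch (the helper name is
only illustrative):

    # Minimum number of live ZooKeeper nodes needed to keep a quorum.
    def quorum_size(ensemble_size):
        return ensemble_size // 2 + 1

    print(quorum_size(6))  # 4: losing 4 of 6 nodes leaves 2, below quorum
    print(quorum_size(5))  # 3: a 5-node ensemble tolerates 2 failures, same as 6

This is also why even-sized ensembles are usually avoided: 6 nodes tolerate
no more failures than 5.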
Dominique

On 25 Jul 2013, at 00:16, "Joshi, Shital" <shital.jo...@gs.com> wrote:

> We have a SolrCloud cluster (5 shards and 2 replicas) on 10 dynamic compute
> boxes (cloud), where the 5 leaders are in datacenter1 and the replicas are
> in datacenter2. We have 6 ZooKeeper instances - 4 in datacenter1 and 2 in
> datacenter2. The ZooKeeper instances are on the same hosts as the Solr
> nodes. We're using local disk (/local/data) to store the Solr index files.
>
> The infrastructure team wanted to rebuild the dynamic compute boxes in
> datacenter1, so we handed over all the leader hosts to them. By doing so,
> we lost 4 ZooKeeper instances. We were expecting to see all replicas acting
> as leaders. To confirm that, I went to the admin console -> cloud page, but
> the page never returned (it kept hanging). I checked the log and saw
> constant ZooKeeper host connection exceptions (the zkHost system property
> had all 6 ZooKeeper instances). I restarted the cloud on all replicas but
> got the same error again. I think this exception is due to the ZooKeeper
> bug https://issues.apache.org/jira/browse/SOLR-4899. I guess ZooKeeper
> never registered the replicas as leaders.
>
> After the dynamic compute machines were rebuilt (all local data was lost),
> I restarted the entire cloud (6 ZooKeeper instances and 10 nodes). The
> original leaders were still the leaders (I think the ZooKeeper config never
> got updated with the replicas becoming leaders, even though 2 ZooKeeper
> instances were still up). Since every leader's /local/data/solr_data was
> empty, the empty index got replicated to all replicas and we lost all the
> data on the replicas, 26 million documents in total. This was very awful.
>
> In our startup script (which brings up Solr on all nodes one by one), the
> leaders are listed first.
>
> Is there any solution to this until the Solr 4.4 release?
>
> Many Thanks!
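
As an aside on the zkHost setting mentioned above: when the admin page hangs
with constant connection exceptions, it can help to check how many of the
listed ZooKeeper hosts are actually reachable, for example with a sketch like
this (the hostnames below are placeholders, not the actual cluster's zkHost
value):

    import socket

    # Placeholder host:port list standing in for the 6 instances in zkHost.
    ZK_HOST = "zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181,zk6:2181"

    def reachable(hostport, timeout=2.0):
        """Return True if a TCP connection to host:port succeeds."""
        host, port = hostport.split(":")
        try:
            with socket.create_connection((host, int(port)), timeout=timeout):
                return True
        except OSError:
            return False

    hosts = ZK_HOST.split(",")
    up = [hp for hp in hosts if reachable(hp)]
    needed = len(hosts) // 2 + 1
    print("reachable: %d of %d (quorum needs %d)" % (len(up), len(hosts), needed))

If fewer hosts than the quorum size are reachable, Solr cannot talk to
ZooKeeper no matter how the nodes are restarted.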