On 5/15/2013 3:56 AM, pankaj.pand...@wipro.com wrote:
> Thanks Shawn for explaining everything in such detail, it was really helpful.
>
> Have few more queries on the same. Can you please explain the purpose of the
> 3rd box in minimal configuration, with the standalone zookeeper?

A ZooKeeper ensemble works best with an odd number of hosts. If you want
redundancy, it requires a minimum instance count of three. To ensure
that it can survive hardware failure, all of the instances must be on
different physical machines. If you've only got two Solr servers, then
you need a third host to complete the ZooKeeper ensemble.

> On separate note, if we go with ahead with 4 box(8 shard with replication
> factor 2 for each):
> 1. Would it be ok to maintain the replica on the same box or we would
> need separate box for that?

You do not want both replicas of a shard on the same host. Failures are
inevitable, and if a host with both replicas of a shard were to fail,
that data would be either temporarily inaccessible or just gone.

With 8 shards and a replication factor of two, you'll have 16 total
replicas, with four replicas on each server. The replicas for shard 1
might be on server 1 and server 2, the replicas for shard 2 might be on
server 3 and server 4, etc.

> 2. Is the above configuration sufficient enough to guarantee failover
> and high availability?

If you use the collections API to create your collection, it will
automatically place the replicas so that everything will fail over
correctly.

> 3. How can I configure my application to query always against the
> replica and let the master be used only for ingestion. Replica will be synced
> with master after working hours(overnight).

You can't. SolrCloud's basic operational model doesn't work this way.

When you index, the document is forwarded to the replica that is the
current elected leader for the proper shard. The leader will index the
document and forward it to all other replicas for that shard, which will
also index the document. Normal SolrCloud operation does not use
replication; each copy does its own indexing. You just have to index the
data to any machine in the cluster and SolrCloud takes care of the rest.

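The sizing arithmetic above can be checked with a short Python sketch. Treat it as an illustration only: the round-robin layout is a stand-in, not the algorithm SolrCloud actually uses to assign replicas, and the host names, port, base URL, and collection name are made-up assumptions. The CREATE parameters are the ones the Collections API documents, but whether you need to set maxShardsPerNode depends on the Solr version you run.

```python
from urllib.parse import urlencode


def zk_failures_tolerated(ensemble_size):
    """ZooKeeper stays available only while a strict majority is up."""
    quorum = ensemble_size // 2 + 1
    return ensemble_size - quorum


def place_replicas(num_shards, replication_factor, hosts):
    """Illustrative round-robin layout (NOT SolrCloud's real algorithm).

    Returns {host: [shard numbers]} with the copies of each shard
    spread over different hosts.
    """
    if replication_factor > len(hosts):
        raise ValueError("not enough hosts to keep copies of a shard apart")
    layout = {host: [] for host in hosts}
    slot = 0
    for shard in range(1, num_shards + 1):
        for _ in range(replication_factor):
            layout[hosts[slot % len(hosts)]].append(shard)
            slot += 1
    return layout


def create_collection_url(base_url, name, num_shards, replication_factor,
                          max_shards_per_node):
    """Build (but do not send) a Collections API CREATE request URL."""
    params = urlencode({
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "maxShardsPerNode": max_shards_per_node,
    })
    return f"{base_url}/admin/collections?{params}"


if __name__ == "__main__":
    # A fourth ZooKeeper host buys no extra fault tolerance over three.
    for size in (1, 2, 3, 4, 5):
        print(size, "hosts tolerate", zk_failures_tolerated(size), "failures")

    # Hypothetical host names for the 4-box, 8-shard, 2-replica case.
    hosts = ["server1", "server2", "server3", "server4"]
    for host, shards in place_replicas(8, 2, hosts).items():
        print(host, shards)

    # Hypothetical base URL and collection name.
    print(create_collection_url("http://server1:8983/solr", "mycollection",
                                8, 2, 4))
```

With these inputs each server ends up with four replicas and no server holds both copies of any shard, which is the property that matters for failover. The quorum function also shows why an even-sized ensemble is no help: two hosts tolerate no failures, and four tolerate only one, the same as three.
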
When you query, the machine that receives the request will automatically
farm out different requests to itself and other machines, giving you
some aspects of load balancing for free.

You may be confused by the replication comment above, because SolrCloud
actually does require that you enable replication in your config. The
reason that it requires this config is that replication may be required
when recovering the index on a replica that goes down and then comes
back up. It is ONLY used for index recovery, not normal operation.

Thanks,
Shawn