Re: Billion document index

Shawn Heisey Wed, 15 May 2013 08:40:59 -0700

On 5/15/2013 3:56 AM, pankaj.pand...@wipro.com wrote:
> Thanks Shawn for explaining everything in such detail, it was really helpful.
> 
> I have a few more questions on the same topic. Can you please explain the purpose of the 
> 3rd box in the minimal configuration, with the standalone zookeeper?


A zookeeper ensemble works best with an odd number of hosts.  Zookeeper
only keeps running while a majority of the ensemble is up, so two
instances can't survive the loss of either one.  If you want redundancy,
it requires a minimum instance count of three.  To ensure that it can
survive hardware failure, they must all be on different physical
machines.  If you've only got two Solr servers, then you need a third
host to complete the zookeeper ensemble.
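
To make the arithmetic concrete, here is a small sketch of the majority
rule.  The function names are mine, not anything from zookeeper itself;
it only shows how many hosts can be lost for a given ensemble size.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many hosts can fail while a majority is still running."""
    return ensemble_size - quorum_size(ensemble_size)


# Print ensemble size, quorum size, and failures tolerated.
for n in range(1, 6):
    print(n, quorum_size(n), tolerated_failures(n))
```

Going from three hosts to four raises the quorum size without letting
you lose any additional hosts, which is why odd sizes are the usual
recommendation.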

> On a separate note, if we go ahead with 4 boxes (8 shards with a replication 
> factor of 2 for each):
>       1. Would it be ok to maintain the replica on the same box or we would 
> need separate box for that?

You do not want replicas on the same host.  Failures are inevitable, and
if a host with both replicas of a shard were to fail, that data would be
either temporarily inaccessible or just gone.

With 8 shards and a replication factor of two, you'll have 16 total
replicas, with four replicas on each server.  The replicas for shard 1
might be on server 1 and server 2, the replicas for shard 2 might be on
server 3 and server 4, etc.
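
A rough sketch of one layout that satisfies this follows.  It is only an
illustration of the counting, not how Solr actually assigns replicas;
the function name and the server names are made up.

```python
from collections import Counter


def place_replicas(num_shards, replication_factor, servers):
    """Assign replicas round-robin so no shard has two copies on one server."""
    if replication_factor > len(servers):
        raise ValueError("need at least as many servers as replicas per shard")
    placement = {}
    slot = 0
    for shard in range(1, num_shards + 1):
        hosts = []
        for _ in range(replication_factor):
            # Consecutive slots always land on different servers.
            hosts.append(servers[slot % len(servers)])
            slot += 1
        placement[f"shard{shard}"] = hosts
    return placement


servers = ["server1", "server2", "server3", "server4"]
placement = place_replicas(8, 2, servers)
for shard, hosts in placement.items():
    print(shard, hosts)

# Count how many replicas each server ends up hosting.
per_server = Counter(host for hosts in placement.values() for host in hosts)
print(dict(per_server))
```

With these inputs every server hosts four replicas, and losing any one
server still leaves one copy of every shard.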

>       2. Is the above configuration sufficient enough to guarantee failover 
> and high availability?

If you use the collections API to create your collection, it will
automatically place the replicas of each shard on different servers, so
that everything will fail over correctly when a single server goes down.
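
As a sketch, the CREATE request for this layout could be built like
this.  The host name and the collection name "bigindex" are
placeholders.  I am assuming a Solr version whose collections API
accepts maxShardsPerNode, which you would need to raise to 4 to fit 16
replicas on four servers; check the documentation for your version.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards,
                          replication_factor, max_shards_per_node):
    """Build a Collections API CREATE URL (does not send the request)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        # Assumption: needed because each server will host 4 replicas.
        "maxShardsPerNode": max_shards_per_node,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Placeholder host and collection name.
url = create_collection_url("http://server1:8983/solr", "bigindex", 8, 2, 4)
print(url)
```

You would then request that URL from any one node with a browser or
curl.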

>       3. How can I configure my application to always query against the 
> replica and let the master be used only for ingestion? The replica will be synced 
> with the master after working hours (overnight).

You can't.  SolrCloud's basic operational model doesn't work this way.
When you index, the document is forwarded to the replica that is the
current elected leader for the proper shard.  The leader will index the
document and forward it to all other replicas for that shard, which will
also index the document.  Normal SolrCloud operation does not use
replication; each copy does its own indexing.  You just have to index
the data to any machine in the cluster and SolrCloud takes care of the rest.
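
Here is a toy model of that flow, just to show the shape of it.  None of
these classes exist in Solr, and the CRC32-modulo routing is a stand-in
I picked for brevity; SolrCloud's real document router works
differently.

```python
import zlib


class Replica:
    """One copy of a shard; it indexes every document itself."""

    def __init__(self, name):
        self.name = name
        self.index = {}

    def index_doc(self, doc):
        self.index[doc["id"]] = doc


class Shard:
    def __init__(self, leader, others):
        self.leader = leader
        self.others = others

    def add(self, doc):
        # The leader indexes first, then forwards to the other replicas.
        self.leader.index_doc(doc)
        for replica in self.others:
            replica.index_doc(doc)


class Cluster:
    def __init__(self, shards):
        self.shards = shards

    def add(self, doc):
        # Stand-in routing: hash the id to pick a shard (not Solr's router).
        n = zlib.crc32(doc["id"].encode("utf-8")) % len(self.shards)
        self.shards[n].add(doc)


cluster = Cluster([
    Shard(Replica("s1-a"), [Replica("s1-b")]),
    Shard(Replica("s2-a"), [Replica("s2-b")]),
])
docs = [{"id": f"doc{i}", "title": f"Title {i}"} for i in range(10)]
for doc in docs:
    cluster.add(doc)  # the client never picks a shard or a leader
```

After the loop, both copies of each shard hold the same documents, and
no index files were copied between them.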

When you query, the machine that receives the request will automatically
farm out sub-requests for each shard to itself and other machines,
giving you some aspects of load balancing for free.

You may be confused by the replication comment above, because SolrCloud
actually does require that you enable replication in your config.  It
requires this config because replication may be needed when recovering
the index on a replica that goes down and then comes back up.  It is
ONLY used for index recovery, not normal operation.

Thanks,
Shawn
