1-2) Your aim in using Hadoop is probably Map/Reduce jobs: you split your
workload, process the pieces in parallel, and then a reduce step combines
the results. Let me explain how the new SolrCloud architecture maps onto
that idea. You start your SolrCloud with a numShards parameter. Let's
assume that you have 5 shards. You will then have 5 leaders in your
SolrCloud, and these leaders are responsible for indexing your data. This
means your indexing workload is divided across the 5 shards, so your
indexing is parallelized much like a Map/Reduce job.
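For example, starting the first SolrCloud node with 5 shards looks roughly
like this in the Solr 4.x example setup (the paths, config name, and
embedded ZooKeeper are from the stock example directory; adjust them for
your own installation):

```shell
# Start the first node. -DnumShards fixes the shard count for the
# collection, -DzkRun starts an embedded ZooKeeper, and bootstrap_confdir
# uploads the collection's configuration to ZooKeeper on first start.
cd example
java -DzkRun \
     -Dbootstrap_confdir=./solr/collection1/conf \
     -Dcollection.configName=myconf \
     -DnumShards=5 \
     -jar start.jar
```

Once this node is up, documents you index are routed across the 5 shards,
which is where the parallelism comes from.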

Now let's assume that you add 10 new Solr nodes to your SolrCloud. They
will be added as replicas of the existing shards, so you will have 5
shards, each with a leader and 2 replicas. When you send a query to
SolrCloud, every replica helps serve the search, so adding more replicas
improves your search performance.
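Adding those nodes is simple: each new node only needs to point at the
same ZooKeeper, and SolrCloud assigns it as a replica automatically. A
sketch, assuming the first node's embedded ZooKeeper is listening on
localhost:9983 (the default in the 4.x example):

```shell
# Start a second node on another port; it registers with ZooKeeper and is
# automatically assigned as a replica of one of the existing shards.
cd example2
java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar
```

Repeat with further nodes (and ports) until every shard has the number of
replicas you want.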


2013/5/6 David Parks <davidpark...@yahoo.com>

> I've had trouble figuring out what options exist if I want to perform all
> indexing off of the production servers (I'd like to keep them only for user
> queries).
>
>
>
> We index data in batches roughly daily, ideally I'd index all solr cloud
> shards offline, then move the final index files to the solr cloud instance
> that needs it and flip a switch and have it use the new index.
>
>
>
> Is this possible via either:
>
> 1.       Doing the indexing in Hadoop?? (this would be ideal as we have a
> significant investment in a hadoop cluster already), or
>
> 2.       Maintaining a separate "master" server that handles indexing and
> the nodes that receive user queries update their index from there (I seem
> to
> recall reading about this configuration in 3.x, but now we're using solr
> cloud)
>
>
>
> Is there some ideal solution I can use to "protect" the production solr
> instances from degraded performance during large index processing periods?
>
>
>
> Thanks!
>
> David
>
>
