Hi Erick,

I think that even if you use Map/Reduce you will not parallelize your
indexing any further, because indexing parallelizes only as far as the
number of leaders you have in your SolrCloud, doesn't it?
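
Roughly what I mean, as a SolrJ sketch (assuming Solr 4.x; the ZooKeeper
address, collection and field names are just placeholders): however many
client threads or Hadoop mappers push documents, each document ends up
being indexed by the leader of exactly one shard, so indexing concurrency
is bounded by the number of leaders.

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexThroughLeaders {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble for this sketch.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("title_t", "document number " + i);  // placeholder field
            // The id is hashed onto one shard's range and the update is
            // performed by that shard's leader, so with 5 shards at most
            // 5 leaders share the indexing work, no matter who sends it.
            server.add(doc);
        }
        server.commit();
        server.shutdown();
    }
}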

2013/5/6 Erick Erickson <erickerick...@gmail.com>

> The only problem with using Hadoop (or whatever) is that you
> need to be sure that each document ends up on the shard that
> SolrCloud itself would route it to, which means that you have to
> use the same routing mechanism that SolrCloud uses. The custom
> doc routing may help here....
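
A rough SolrJ illustration of that custom routing (assuming the collection
uses the default compositeId router; the shard key and field names below
are made up): the client controls which shard a document lands on by
prefixing its id, and an external indexer (Hadoop or otherwise) would have
to reproduce exactly the same scheme.

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CompositeIdRouting {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181"); // placeholder ZK host
        server.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        // With the compositeId router, the part before '!' is hashed to
        // choose the shard, so every "customerA!..." document ends up on
        // the same shard.
        doc.addField("id", "customerA!order-42");
        doc.addField("customer_s", "customerA");  // illustrative field
        server.add(doc);
        server.commit();
        server.shutdown();
    }
}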
>
> My very first question, though, would be whether this is necessary.
> It might be sufficient to just throttle the rate of indexing, or to
> do the indexing during off hours, or.... Have you actually measured a
> performance degradation during your heavy indexing? Indexing has costs,
> no question, but it's worth asking whether the costs are heavy enough
> to be worth the bother.
>
> Best
> Erick
>
> On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI <furkankam...@gmail.com>
> wrote:
> > 1-2) Your aim in using Hadoop is probably Map/Reduce jobs. When you use
> > Map/Reduce jobs you split your workload, process it, and then the
> > reduce step takes over. Let me explain the new SolrCloud architecture
> > to you. You start your SolrCloud with a numShards parameter. Let's
> > assume that you have 5 shards. Then you will have 5 leaders in your
> > SolrCloud. These leaders will be responsible for indexing your data.
> > It means that your indexing workload will be divided by 5, so you have
> > parallelized your indexing much as Map/Reduce jobs do.
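
For example, the setup step might look roughly like this (Solr 4.x
Collections API; the host name, collection name and shard count are
placeholders, and it assumes enough nodes are running to hold 5 shards):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class CreateShardedCollection {
    public static void main(String[] args) throws Exception {
        // Create a collection with 5 shards, i.e. 5 leaders that will
        // share the indexing workload.
        URL url = new URL("http://solr-node1:8983/solr/admin/collections"
                + "?action=CREATE&name=collection1&numShards=5&replicationFactor=1");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (InputStream in = conn.getInputStream()) {
            while (in.read() != -1) {
                // Drain the response; an HTTP error status would have made
                // getInputStream() throw an IOException instead.
            }
        }
        System.out.println("CREATE returned HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}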
> >
> > Let's assume that you have added 10 new Solr nodes to your SolrCloud.
> > They will be added as replicas of the existing shards. Then you will
> > have 5 shards, each with its leader and 2 replicas. When you send a
> > query to SolrCloud, every replica helps with searching, and if you add
> > more replicas to your SolrCloud your search performance will improve.
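
On the query side, a minimal SolrJ sketch (again with placeholder ZooKeeper
and collection names): the client reads the cluster state from ZooKeeper
and spreads requests over whichever replicas are live, which is why adding
replicas adds query capacity.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class QueryAcrossReplicas {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181"); // placeholder
        server.setDefaultCollection("collection1");

        // Each request is load-balanced across live replicas of every shard.
        SolrQuery query = new SolrQuery("*:*");
        query.setRows(10);
        QueryResponse rsp = server.query(query);
        System.out.println("numFound=" + rsp.getResults().getNumFound());

        server.shutdown();
    }
}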
> >
> >
> > 2013/5/6 David Parks <davidpark...@yahoo.com>
> >
> >> I've had trouble figuring out what options exist if I want to perform
> >> all indexing off of the production servers (I'd like to keep them only
> >> for user queries).
> >>
> >>
> >>
> >> We index data in batches roughly daily. Ideally I'd index all SolrCloud
> >> shards offline, then move the final index files to the SolrCloud
> >> instance that needs them, flip a switch, and have it use the new index.
> >>
> >>
> >>
> >> Is this possible via either:
> >>
> >> 1. Doing the indexing in Hadoop? (this would be ideal as we have a
> >> significant investment in a Hadoop cluster already), or
> >>
> >> 2. Maintaining a separate "master" server that handles indexing and
> >> the nodes that receive user queries update their index from there
> >> (I seem to recall reading about this configuration in 3.x, sketched
> >> roughly below, but now we're using SolrCloud)
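
For what it's worth, the 3.x-style pull that option 2 describes can be
triggered over HTTP against the replication handler, if I remember its
parameters right. This is only a rough sketch of the old (non-cloud) model,
with placeholder host and core names; under SolrCloud the replication
handler is managed automatically, so this is not a drop-in answer for the
cloud setup.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class PullIndexFromMaster {
    public static void main(String[] args) throws Exception {
        // Ask a query-serving node to fetch the latest index from the
        // indexing "master" (placeholder host/core names).
        URL url = new URL("http://query-node:8983/solr/collection1/replication"
                + "?command=fetchindex"
                + "&masterUrl=http://index-master:8983/solr/collection1/replication");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (InputStream in = conn.getInputStream()) {
            while (in.read() != -1) {
                // Drain the response body.
            }
        }
        System.out.println("fetchindex returned HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}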
> >>
> >>
> >>
> >> Is there some ideal solution I can use to "protect" the production Solr
> >> instances from degraded performance during large index processing
> >> periods?
> >>
> >>
> >>
> >> Thanks!
> >>
> >> David
> >>
> >>
>
