In non-SolrCloud mode, you can index to another core, and then swap
cores. You could index on another box, ship the index files to your
production server, create a core pointing at these files, then swap this
core with the original one.
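For the swap step, Solr's CoreAdmin API has a SWAP action. A minimal sketch of building that request, assuming the default port 8983 and made-up core names ("live", "rebuild" — nothing from this thread):

```python
# Sketch of the CoreAdmin SWAP call described above. The core names
# "live"/"rebuild" and the localhost URL are assumptions for illustration.
from urllib.parse import urlencode

def core_swap_url(base_url, core, other):
    """Build the URL for CoreAdmin's SWAP action, which atomically
    exchanges the names of the two cores."""
    params = urlencode({"action": "SWAP", "core": core, "other": other})
    return f"{base_url}/admin/cores?{params}"

url = core_swap_url("http://localhost:8983/solr", "live", "rebuild")
print(url)
# An HTTP GET to this URL (e.g. with urllib.request.urlopen) performs the swap.
```

After the swap, queries hitting the "live" core name are served by the freshly built index, and the old index is still available under the other name if you need to roll back.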

If you can tell your search app to switch to using a different
collection, you could achieve what you want with SolrCloud.

You index to a different collection, running on a different set of
SolrCloud nodes from your production search. Once indexing is complete,
you create cores on your production boxes for this new collection. Once
the indexes have synced, you can switch your app to use this new
collection, thus publishing your new index. You can then delete the
cores on the boxes you were using for indexing.
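If reconfiguring the app itself is awkward, a collection alias can make the switch server-side instead: the app always queries a fixed alias, and you repoint the alias at the new collection with the Collections API's CREATEALIAS action. A sketch, with invented alias/collection names:

```python
# Sketch of repointing a collection alias at a newly built collection via
# the Collections API's CREATEALIAS action. The alias name "search", the
# collection name, and the localhost URL are invented for illustration.
from urllib.parse import urlencode

def create_alias_url(base_url, alias, *collections):
    """Build the CREATEALIAS request; issuing it again with a different
    collection repoints the alias, which 'publishes' the new index."""
    params = urlencode({
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    })
    return f"{base_url}/admin/collections?{params}"

print(create_alias_url("http://localhost:8983/solr", "search", "products_v2"))
```

Because the alias swap is a single metadata change, the cutover is effectively atomic from the application's point of view.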

Now, that's not transparent, but it would be doable.

Upayavira

On Mon, May 6, 2013, at 01:37 PM, David Parks wrote:
> I'm less concerned with fully utilizing a Hadoop cluster (due to having
> fewer shards than I have Hadoop reduce slots) than I am with just
> off-loading the whole indexing process. We may just want to re-index
> the whole thing to add some index-time boosts or whatever else we
> conjure up to make queries faster and better quality. We're doing a
> lot of work on optimization right now.
> 
> Re-indexing the whole thing is a 5-10 hour process for us, so when we
> move an update to production that requires full re-indexing (every
> week or so), right now we just re-build new instances of Solr to
> handle the re-indexing and then copy the final VMs to the production
> environment (a slow process). I'm leery of letting a heavy-duty full
> re-index process loose for 10 hours on production on a regular basis.
> 
> It doesn't sound like there are any pre-built processes for doing this
> now, though. I thought I had heard of a master/slave hierarchy in 3.x
> that would allow us to designate a master to do indexing and let the
> slaves pull finished indexes from the master, so I thought maybe
> something like that carried over into SolrCloud. Erick might be right
> that it's not worth the effort if there isn't some existing strategy.
> 
> Dave
> 
> 
> -----Original Message-----
> From: Furkan KAMACI [mailto:furkankam...@gmail.com] 
> Sent: Monday, May 06, 2013 7:06 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing off of the production servers
> 
> Hi Erick;
> 
> I think that even if you use Map/Reduce you will not parallelize your
> indexing any further, because indexing parallelizes only as much as the
> number of leaders you have in your SolrCloud, doesn't it?
> 
> 2013/5/6 Erick Erickson <erickerick...@gmail.com>
> 
> > The only problem with using Hadoop (or whatever) is that you need to 
> > be sure that documents end up on the same shard, which means that you 
> > have to use the same routing mechanism that SolrCloud uses. The custom 
> > doc routing may help here....
> >
> > My very first question, though, would be whether this is necessary.
> > It might be sufficient to just throttle the rate of indexing, or just
> > do the indexing during off hours, or.... Have you measured an indexing
> > degradation during your heavy indexing? Indexing has costs, no
> > question, but it's worth asking whether the costs are heavy enough to
> > be worth the bother.
> >
> > Best
> > Erick
> >
> > On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI <furkankam...@gmail.com>
> > wrote:
> > > 1-2) Your aim for using Hadoop is probably Map/Reduce jobs. When you
> > > use Map/Reduce jobs you split your workload, process it, and then
> > > the reduce step takes over. Let me explain the new SolrCloud
> > > architecture. You start your SolrCloud with a numShards parameter.
> > > Let's assume that you have 5 shards. Then you will have 5 leaders in
> > > your SolrCloud. These leaders will be responsible for indexing your
> > > data. It means that your indexing workload will be divided into 5,
> > > so you have parallelized your indexing much like Map/Reduce jobs.
> > >
> > > Let's assume that you have added 10 new Solr nodes to your SolrCloud.
> > > They will be added as replicas for each shard. Then you will have 5
> > > shards, 5 leaders of them, and every shard has 2 replicas. When you
> > > send a query to a SolrCloud, every replica will help with searching,
> > > and if you add more replicas to your SolrCloud your search
> > > performance will improve.
> > >
> > >
> > > 2013/5/6 David Parks <davidpark...@yahoo.com>
> > >
> > >> I've had trouble figuring out what options exist if I want to
> > >> perform all indexing off of the production servers (I'd like to
> > >> keep them only for user queries).
> > >>
> > >> We index data in batches roughly daily. Ideally I'd index all
> > >> SolrCloud shards offline, then move the final index files to the
> > >> SolrCloud instance that needs them, flip a switch, and have it use
> > >> the new index.
> > >>
> > >> Is this possible via either:
> > >>
> > >> 1. Doing the indexing in Hadoop? (this would be ideal as we have a
> > >> significant investment in a Hadoop cluster already), or
> > >>
> > >> 2. Maintaining a separate "master" server that handles indexing,
> > >> with the nodes that receive user queries updating their index from
> > >> there (I seem to recall reading about this configuration in 3.x,
> > >> but now we're using SolrCloud)
> > >>
> > >> Is there some ideal solution I can use to "protect" the production
> > >> Solr instances from degraded performance during large index
> > >> processing periods?
> > >>
> > >> Thanks!
> > >>
> > >> David
> > >>
> > >>
> >
> 
