Excellent idea!
And it is possible to use collection aliasing with CREATEALIAS to
make this transparent on the query side.

e.g. with 2 collections named:
collection_1
collection_2

/admin/collections?action=CREATEALIAS&name=collectionalias&collections=collection_1
"collectionalias" is now a virtual collection pointing to collection_1.

Index on collection_2, then:
/admin/collections?action=CREATEALIAS&name=collectionalias&collections=collection_2
"collectionalias" is now an alias to collection_2.

http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_Collections_API
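A minimal sketch of the alias flip, assuming Solr runs at localhost:8983 (the host, alias, and collection names are just the ones from the example above). Repointing an existing alias is atomic, so queries against the alias move to the new collection as soon as the call succeeds:

```python
from urllib.parse import urlencode
# from urllib.request import urlopen  # only needed to actually send the request

def create_alias_url(solr_base, alias, *collections):
    """Build a Collections API CREATEALIAS URL.

    Issuing it again with a different collection atomically repoints
    the alias, which is what makes the reindex-then-flip trick work.
    """
    params = urlencode({
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    })
    return f"{solr_base}/admin/collections?{params}"

# After reindexing into collection_2, flip the alias over to it:
url = create_alias_url("http://localhost:8983/solr", "collectionalias", "collection_2")
print(url)
# Against a running cluster you would then do: urlopen(url)
```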


André

On 05/06/2013 03:05 PM, Upayavira wrote:
In non-SolrCloud mode, you can index to another core, and then swap
cores. You could index on another box, ship the index files to your
production server, create a core pointing at these files, then swap this
core with the original one.
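The core-swap workflow above maps onto Solr's CoreAdmin API; here is a hedged sketch of the calls involved (the core names and dataDir are hypothetical, adjust to your deployment):

```python
# CoreAdmin calls for the "index elsewhere, ship files, swap" workflow.
# Core names and dataDir are illustrative placeholders.
base = "http://localhost:8983/solr/admin/cores"
steps = [
    # 1. Register a core over the index files shipped from the build box:
    base + "?action=CREATE&name=products_new"
           "&instanceDir=products_new&dataDir=/data/shipped_index",
    # 2. Atomically exchange it with the live core serving queries:
    base + "?action=SWAP&core=products&other=products_new",
    # 3. Drop the now-stale core once the new index looks good:
    base + "?action=UNLOAD&core=products_new",
]
for url in steps:
    print(url)
```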

If you can tell your search app to switch to using a different
collection, you could achieve what you want with solrcloud.

You index to a different collection, which is running on different set
of SolrCloud nodes from your production search. Once indexing is
complete, you create cores on your production boxes for this new
collection. Once indexes have synced, you can switch your app to use
this new collection, thus publishing your new index. You can then delete
the cores on the boxes you were using for indexing.

Now, that's not transparent, but would be do-able.

Upayavira

On Mon, May 6, 2013, at 01:37 PM, David Parks wrote:
I'm less concerned with fully utilizing a hadoop cluster (due to having
fewer shards than I have hadoop reduce slots) as I am with just
off-loading the whole indexing process. We may just want to re-index the
whole thing to add some index time boosts or whatever else we conjure up
to make queries faster and better quality. We're doing a lot of work on
optimization right now.

To re-index the whole thing is a 5-10 hour process for us, so when we
move some update to production that requires full re-indexing (every
week or so), right now we're just re-building new instances of solr to
handle the re-indexing and then copying the final VMs to the production
environment (slow process). I'm leery of letting a heavy-duty full
re-index process loose for 10 hours on production on a regular basis.

It doesn't sound like there are any pre-built processes for doing this
now, though. I thought I had heard of a master/slave hierarchy in 3.x
that would allow us to designate a master to do indexing and let the
slaves pull finished indexes from the master, so I thought maybe
something like that followed into solr cloud. Erick might be right in
that it's not worth the effort if there isn't some existing strategy.

Dave


-----Original Message-----
From: Furkan KAMACI [mailto:furkankam...@gmail.com]
Sent: Monday, May 06, 2013 7:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing off of the production servers

Hi Erick;

I think that even if you use Map/Reduce you will not parallelize your
indexing any further, because indexing only parallelizes as much as the
number of leaders you have in your SolrCloud, doesn't it?

2013/5/6 Erick Erickson<erickerick...@gmail.com>

The only problem with using Hadoop (or whatever) is that you need to
be sure that documents end up on the same shard, which means that you
have to use the same routing mechanism that SolrCloud uses. The custom
doc routing may help here....
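To make the routing concern above concrete, here is an illustrative sketch only, NOT Solr's actual algorithm: SolrCloud's default compositeId router hashes the uniqueKey with MurmurHash3 onto per-shard hash ranges, so an external Hadoop indexer would have to reproduce that exact mapping. The MD5-modulo version below just shows the shape of the problem:

```python
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    """Illustrative stand-in for a document router.

    SolrCloud's real compositeId router uses MurmurHash3 over the
    uniqueKey and per-shard hash ranges; this MD5-modulo version only
    shows why every indexer must agree on one deterministic mapping,
    or updates to the same id will land on different shards.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Every reducer in a Hadoop job must compute the same shard for a given id:
print(shard_for("doc-42", 5))
```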

My very first question, though, would be whether this is necessary.
It might be sufficient to just throttle the rate of indexing, or just
do the indexing during off hours or.... Have you measured any
degradation during your heavy indexing? Indexing has costs, no
question, but it's worth asking whether the costs are heavy enough to
be worth the bother.

Best
Erick

On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI<furkankam...@gmail.com>
wrote:
1-2) Your aim for using Hadoop is probably Map/Reduce jobs. When you
use Map/Reduce jobs you split your workload, process it, and then the
reduce step takes over. Let me explain the new SolrCloud
architecture. You start your SolrCloud with a numShards parameter.
Let's assume that you have 5 shards. Then you will have 5 leaders in
your SolrCloud. These leaders will be responsible for indexing your
data. It means that your indexing workload will be divided by 5, so
your indexing is parallelized much like Map/Reduce jobs.

Let's assume that you have added 10 new Solr nodes into your SolrCloud.
They will be added as replicas for each shard. Then you will have 5
shards, 5 leaders of them, and every shard has 2 replicas. When you
send a query to a SolrCloud, every replica will help you with
searching, and if you add more replicas to your SolrCloud your search
performance will improve.

2013/5/6 David Parks<davidpark...@yahoo.com>

I've had trouble figuring out what options exist if I want to perform
all indexing off of the production servers (I'd like to keep them only
for user queries).

We index data in batches roughly daily. Ideally I'd index all solr
cloud shards offline, then move the final index files to the solr
cloud instance that needs it, flip a switch, and have it use the new
index.


Is this possible via either:

1. Doing the indexing in Hadoop? (this would be ideal as we have a
significant investment in a hadoop cluster already), or

2. Maintaining a separate "master" server that handles indexing, with
the nodes that receive user queries updating their index from there (I
seem to recall reading about this configuration in 3.x, but now we're
using solr cloud)



Is there some ideal solution I can use to "protect" the production
solr instances from degraded performance during large index processing
periods?


Thanks!

David



--
André Bois-Crettez

Search technology, Kelkoo
http://www.kelkoo.com/

Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended
exclusively for their addressees. If you are not the intended recipient
of this message, please delete it and notify the sender.
