I'm wondering if you need map reduce at all ;)...

The Achilles' heel of M/R vis-a-vis Solr is all the index copying
that's done at the end of the cycle. For really large bulk indexing
jobs, that's a reasonable price to pay.

How many docs are we talking about, and how would you characterize
them in terms of size, fields, etc.? And what are your time
requirements?

I'm thinking this may be an "XY Problem". You're asking about
a specific solution before explaining the problem.

Why do you say that Solr is not really optimized for bulk loading?
I took a quick look at (2) and the approach is sound. It batches
up the docs in groups of 1,000 and uses CloudSolrServer as it should.
Have you tried it? At the end of the day, MapReduceIndexerTool does
the same work to index a doc as a regular Solr server would via
EmbeddedSolrServer so if the number of tasks you have running is
roughly equal to the number of shards, it _should_ be roughly
comparable.
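
For reference, the batching that blog post does boils down to something
like the sketch below. The class and method names are mine, and the actual
SolrJ calls are stubbed out in comments; in real code each batch would go
to CloudSolrServer.add(batch), with a single commit at the end:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchIndexSketch {
    static final int BATCH_SIZE = 1000;

    // Split docs into batches of up to BATCH_SIZE, mirroring the blog's
    // approach. In a real SolrJ client each element would be a
    // SolrInputDocument and each batch would go to CloudSolrServer.add(batch).
    static List<List<String>> toBatches(List<String> docs) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += BATCH_SIZE) {
            batches.add(docs.subList(i, Math.min(i + BATCH_SIZE, docs.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) docs.add("doc" + i);
        List<List<String>> batches = toBatches(docs);
        System.out.println(batches.size());        // prints 3
        System.out.println(batches.get(2).size()); // prints 500 (the remainder)
    }
}
```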

Still, though, I have to repeat my question about how many docs you're
talking about here. Using M/R inevitably adds complexity; what are you
trying to gain that you can't get with several threads in a SolrJ client?
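
By "several threads" I mean something like the sketch below: a fixed pool
where each task pushes one batch. All names are mine and the actual send
is stubbed with a counter; in a real client each task would build a batch
of SolrInputDocuments and call add() on a shared CloudSolrServer:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ParallelIndexSketch {

    // Run `batches` batch-send tasks on a fixed pool and return the total
    // number of docs "indexed". The AtomicLong stands in for the real work:
    // each task would call add(batch) on a shared Solr client, with one
    // commit issued after everything finishes.
    static long indexInParallel(int threads, int batches, int batchSize)
            throws InterruptedException {
        AtomicLong indexed = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < batches; i++) {
            pool.submit(() -> indexed.addAndGet(batchSize));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return indexed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // e.g. 4 threads (roughly one per shard is a reasonable start)
        // pushing 60 batches of 1,000 docs each
        System.out.println(indexInParallel(4, 60, 1000)); // prints 60000
    }
}
```

The thread count is the knob to tune; starting with roughly one sender per
shard and increasing until the Solr nodes stop keeping up is a common approach.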

Best,
Erick

On Mon, Mar 7, 2016 at 12:28 PM, Bin Wang <binwang...@gmail.com> wrote:
> Hi there,
>
> I have a fairly big data set that I need to quickly index into SolrCloud.
>
> I have done some research and none of them looked really good to me.
>
> (1) Kite Morphline: I managed to get it working; the MapReduce job finished
> in a few minutes, which is good. However, the go-live part, merging the
> indexes into SolrCloud, took a really long time, like one hour for 60
> million docs.
>
> (2) Mapreduce Using Solrcloud Server:
> <http://techuserhadoop.blogspot.com/2014/09/mapreduce-job-for-indexing-documents-to.html>
> This approach is pretty straightforward; however, every document has to
> funnel through the Solr server, which is really not optimized for bulk
> loading.
>
> Here is what I am thinking: is it possible to use MapReduce to create a few
> Lucene indexes first, for example using 3 reducers to write three indexes,
> and then create a Solr collection with three shards pointing to the
> generated indexes? Can Solr easily pick up generated indexes?
>
> I am really new to Solr and wondering if this is feasible, and whether any
> work has already been done on this. I am not really interested in being on
> the cutting edge, and any pointers to existing work would be appreciated!
>
> Best regards,
>
> Bin
