Indexing rates scale pretty linearly with the number of shards, so one
way to increase throughput is simply to create a collection with more
shards. For the initial bulk-indexing pass, you can go with one replica
per shard, then use the Collections API ADDREPLICA command to build
things out if you need to.
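
Driven from SolrJ, that's roughly the following (a sketch only; the
collection name, configset, shard count, and ZooKeeper address are
placeholders for yours):

    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class ShardSetup {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
            Collections.singletonList("zkhost:2181"), Optional.empty()).build()) {
          // Bulk-load friendly: lots of shards, one replica each.
          CollectionAdminRequest.createCollection("mycollection", "_default", 8, 1)
              .process(client);
          // Later, once the bulk load is done, build out redundancy:
          CollectionAdminRequest.addReplicaToShard("mycollection", "shard1")
              .process(client);
        }
      }
    }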

However… that may leave you with more shards than you really want, but
that’s usually not an impediment.

The MapReduceIndexerTool uses something called the EmbeddedSolrServer,
so it’s really using Solr under the covers.
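
For the curious, embedding Solr looks roughly like this (a sketch; the
Solr home path and core name are placeholders, and you need solr-core on
the classpath):

    import java.nio.file.Paths;
    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.core.CoreContainer;

    public class EmbeddedExample {
      public static void main(String[] args) throws Exception {
        // The Solr home must contain the core's config and schema.
        CoreContainer container =
            CoreContainer.createAndLoad(Paths.get("/path/to/solr/home"));
        try (EmbeddedSolrServer server = new EmbeddedSolrServer(container, "mycore")) {
          // Same SolrClient API as usual: server.add(...), server.commit(), etc.
        }
      }
    }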

All that said, I’m not yet convinced you need to go there. How do you
know you’re really driving Solr hard? Are you pegging all the CPUs on
all your Solr nodes while indexing? Very often “slow indexing” turns out
to be the collection process not being able to feed Solr docs fast
enough. So here are a couple of things to look at:

1> Are your CPUs on the Solr nodes running flat out? If not, you need to
work on your ingestion process. Perhaps parallelize it on the client side
so you have multiple threads throwing docs at Solr (see the sketch after
point 2).

2> Comment out the bit in your SolrJ program where you call
CloudSolrClient.add(doclist). If that doesn’t change the rate at which
you can process your docs, then you’re spending all your time on the
client side. (The SKIP_ADD flag in the sketch below does exactly this.)
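
Here’s a rough sketch of both ideas (untested; the ZooKeeper address,
collection name, thread count, and readBatches() are placeholders for
your setup). CloudSolrClient is thread-safe, so sharing one instance
across threads is fine:

    import java.util.Collections;
    import java.util.List;
    import java.util.Optional;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelIndexer {
      // Point 2 above: flip to true and everything runs except the actual
      // add() call, which tells you how fast the client side alone is.
      static final boolean SKIP_ADD = false;

      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
            Collections.singletonList("zkhost:2181"), Optional.empty()).build()) {
          client.setDefaultCollection("mycollection");

          ExecutorService pool = Executors.newFixedThreadPool(8); // tune to your CPUs
          for (List<SolrInputDocument> batch : readBatches()) {
            pool.submit(() -> {
              try {
                if (!SKIP_ADD) {
                  client.add(batch); // no explicit commit; autoCommit handles that
                }
              } catch (Exception e) {
                e.printStackTrace(); // real code should retry/record failures
              }
            });
          }
          pool.shutdown();
          pool.awaitTermination(1, TimeUnit.DAYS);
        }
      }

      // Placeholder: replace with code that parses your JSON lines into
      // SolrInputDocuments, batching ~1000 docs per add. For 3TB of input
      // you’d also want a bounded queue rather than submitting everything
      // up front.
      static Iterable<List<SolrInputDocument>> readBatches() {
        return Collections.emptyList();
      }
    }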

Also, a sanity check: you’re not committing after every batch or anything
like that, right? Speaking of commits, I’d set autoCommit in solrconfig.xml
to fire every, say, 60 seconds with openSearcher=true, and leave it at
that until it’s proven you need something different.
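
In solrconfig.xml that would look something like this (60 seconds
expressed as maxTime in milliseconds):

    <autoCommit>
      <maxTime>60000</maxTime>
      <openSearcher>true</openSearcher>
    </autoCommit>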

You also haven’t told us about your topology. How many shards? How many
machines? I pretty much guarantee you won’t be able to fit all that data on a
single shard...

Best,
Erick

> On Feb 13, 2020, at 8:17 PM, vivek chaurasiya <vivek....@gmail.com> wrote:
> 
> Hi there,
> 
> We are using AWS EMR as our big data processing cluster. We have about 3TB
> of text files where each line is a JSON record that I want indexed
> into Solr.
> 
> I have tried batching them and pushing them to the Solr index using the
> SolrJ client, but I feel that’s really slow.
> 
> My question is two-fold:
> 
> 1. Is there a ready-to-use tool that can create a Solr index offline and
> store it somewhere like S3?
> 2. If (1) is possible, how can I push that offline index to a live Solr
> cluster?
> 
> 
> I found this tool:
> https://docs.cloudera.com/documentation/enterprise/latest/topics/search_mapreduceindexertool.html
> 
> but it’s really cumbersome to use, and it looks like you need to supply
> shard/schema information at the time you create the offline index.
> 
> Some suggestions would be greatly appreciated.
> 
> -Vivek
