Indexing rates scale roughly linearly with the number of shards, so one way to increase throughput is simply to create a collection with more shards. For the initial bulk-indexing run you can go with one replica per shard (i.e., just the leader), then use the Collections API ADDREPLICA command to build things out once the bulk load is done.
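A minimal sketch of that pattern using SolrJ's Collections API helpers; the collection name, config set, ZooKeeper address, and shard count below are made-up placeholders, not recommendations:

    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class BuildOut {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
            Collections.singletonList("zk1:2181"), Optional.empty()).build()) {

          // Bulk-load against leaders only: lots of shards, one replica each.
          CollectionAdminRequest
              .createCollection("bigindex", "_default", 8, 1)
              .process(client);

          // ... run the bulk indexing here ...

          // Then build out redundancy, one ADDREPLICA per shard as needed.
          CollectionAdminRequest
              .addReplicaToShard("bigindex", "shard1")
              .process(client);
        }
      }
    }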
However… that may leave you with more shards than you really want, but that's usually not an impediment.

The MapReduceIndexerTool uses something called the EmbeddedSolrServer, so it really is running Solr under the covers. All that said, I'm not yet convinced you need to go there. How do you know you're really driving Solr hard? Are you pegging all the CPUs on all your Solr nodes while indexing? Very often I see "slow indexing" turn out to be the result of the collection process not being able to feed Solr docs fast enough. So here are a couple of things to look at:

1> Are the CPUs on your Solr nodes running flat out? If not, you need to work on your ingestion process. Perhaps parallelize it on the client side so you have multiple threads throwing docs at Solr (see the sketch below the quoted message).

2> Comment out the bit in your SolrJ program where you call CloudSolrClient.add(doclist). If that doesn't change the rate at which you can process your docs, then you're spending all your time on the client side.

Also, a sanity check: you're not committing after every batch or anything like that, right? Speaking of commits, I'd set autoCommit in solrconfig.xml to fire every, say, 60 seconds with openSearcher=true (snippet below) and leave it at that until it's proven you need something different.

You also haven't told us about your topology. How many shards? How many machines? I can pretty much guarantee you won't be able to fit all that data on a single shard...

Best,
Erick

> On Feb 13, 2020, at 8:17 PM, vivek chaurasiya <vivek....@gmail.com> wrote:
> 
> Hi there,
> 
> We are using AWS EMR as our big data processing cluster. We have like 3TB
> of text files where each line denotes a json record which I want to be
> indexed into Solr.
> 
> I have tried this by batching them and pushing to Solr index using
> SolrJClient. But I feel thats really slow.
> 
> My doubt is 2 fold:
> 
> 1. Is there a ready-to-use tool which can be used to create a Solr index
> offline and store in say S3 or somewhere.
> 2. That offline solr index file if possible in (1), how can i push it to a
> live Solr cluster?
> 
> 
> I found this tool:
> https://docs.cloudera.com/documentation/enterprise/latest/topics/search_mapreduceindexertool.html
> 
> but its really cumbersome to use and looks like at the time of creating
> offline index you need to put in shard/schema information.
> 
> Some suggestions would be greatly appreciated.
> 
> -Vivek
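For 1> above, a minimal sketch of client-side parallelization. It assumes a hypothetical readBatch() method that pulls parsed JSON records from your EMR pipeline; the pool size, batch size, collection name, and ZooKeeper address are placeholders. CloudSolrClient is thread-safe, so one shared instance can be fed from a pool of workers:

    import java.util.Collections;
    import java.util.List;
    import java.util.Optional;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelIndexer {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
            Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
          client.setDefaultCollection("bigindex");

          ExecutorService pool = Executors.newFixedThreadPool(8);
          for (int i = 0; i < 8; i++) {
            pool.submit(() -> {
              List<SolrInputDocument> batch;
              while ((batch = readBatch(1000)) != null) {
                try {
                  // For the test in 2>, comment out this add() call; if the
                  // run takes just as long, the client is the bottleneck.
                  client.add(batch);
                } catch (Exception e) {
                  e.printStackTrace();
                }
              }
            });
          }
          pool.shutdown();
          pool.awaitTermination(1, TimeUnit.DAYS);
          client.commit(); // one explicit commit at the end, never per batch
        }
      }

      // Hypothetical, and must be thread-safe: returns the next batch of
      // documents built from your JSON lines, or null when input is exhausted.
      private static List<SolrInputDocument> readBatch(int size) {
        return null; // stand-in; wire this up to your S3/EMR reader
      }
    }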
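And the autoCommit settings mentioned above would look something like this inside the <updateHandler> section of solrconfig.xml (maxTime is in milliseconds):

    <autoCommit>
      <maxTime>60000</maxTime>
      <openSearcher>true</openSearcher>
    </autoCommit>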