Hi Erick,

Thanks for the detailed answer.

The producer can sustain that rate; it's not spikes. So can I run more clients
writing to Solr, even though I already hit maximum CPU utilization with a
single client? Do you think that would increase throughput? And do you advise
adding more shards on the same two nodes until I get the best throughput?

autoCommit is 15000 and softCommit is 60000.
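For what it's worth, here is a minimal sketch of what running several
concurrent indexer clients could look like with SolrJ. The ZooKeeper hosts,
collection name, thread count, and produceNextDocument() are placeholders I
made up; the 500-doc batch size is the one mentioned further down in this
thread. The client never calls commit(), since autoCommit (15000 ms) and
softCommit (60000 ms) are configured server-side in solrconfig.xml.

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelIndexer {

    private static final String ZK_HOST = "zk1:2181,zk2:2181,zk3:2181"; // placeholder
    private static final String COLLECTION = "my_collection";           // placeholder
    private static final int BATCH_SIZE = 500;                          // batch size used in this thread
    private static final int CLIENT_THREADS = 4;                        // placeholder: number of indexer threads

    public static void main(String[] args) throws Exception {
        // One CloudSolrClient is thread-safe and can be shared by all indexer threads.
        // (The Builder is available in recent SolrJ 6.x; older releases use the
        // CloudSolrClient(String zkHost) constructor instead.)
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost(ZK_HOST)
                .build()) {
            client.setDefaultCollection(COLLECTION);

            ExecutorService pool = Executors.newFixedThreadPool(CLIENT_THREADS);
            for (int t = 0; t < CLIENT_THREADS; t++) {
                pool.submit(() -> {
                    List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
                    SolrInputDocument doc;
                    // produceNextDocument() is a hypothetical stand-in for the real producer.
                    while ((doc = produceNextDocument()) != null) {
                        batch.add(doc);
                        if (batch.size() >= BATCH_SIZE) {
                            client.add(batch);   // no explicit commit; autoCommit handles it
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) {
                        client.add(batch);       // flush the final partial batch
                    }
                    return null;
                });
            }
            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
        }
    }

    private static SolrInputDocument produceNextDocument() {
        // Placeholder: pull the next document from the real producer; null when exhausted.
        return null;
    }
}

Whether extra indexer threads actually help once the Solr CPUs are already
pegged is exactly the question above; this only shows the mechanics.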
Thanks,
Mahmoud

> On Mar 13, 2017, at 9:28 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> OK, so you can get a 360% speedup by commenting out the solr.add. That
> indicates that, indeed, you're pretty much running Solr flat out, not
> surprising. You _might_ squeeze a little more out of Solr by adding
> more client indexers, but that's not going to drive you to the numbers
> you need. I do have one observation though. You say "...can produce
> 13,000+ docs per second...". Is this sustained or occasional spikes?
> If the latter, can you let Solr fall behind and pick up the extra
> files when producer slows down?
>
> Second, you'll have to have at least three clients running to even do
> the upstream processing without Solr in the picture at all. IOW, you
> can't gather and generate the Solr documents fast enough with one
> client, much less index them too.
>
> bq: Regarding more shards, you mean use 2 nodes with 8 shards per node so we
> got 16 shards on the same 2 nodes or spread shards over more nodes?
>
> Yes ;). Once you have enough shards/replicas on a box that you're
> running all the CPUs flat out, adding more shards won't do you any
> good. And we're just skipping over what that'll do to your ability to
> run queries. Plus, 13,000 docs/second will mount up pretty quickly, so
> you have to do your capacity planning for the projected maximum number
> of docs you'll host on this collection. My bet: If you size your
> cluster appropriately for the eventual total size, your indexing
> throughput will hit your numbers. Unless you have a very short
> retention.
>
> See:
> https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> Best,
> Erick
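To make the "comment out the SolrClient.add(doclist)" test discussed in this
thread concrete, here is a rough sketch of that dry run written as a toggle
instead of a commented-out line; produceNextDocument() is again a hypothetical
stand-in for the real producer. Run it once with sendToSolr=true and once with
sendToSolr=false: the gap between the two rates (about 1,500 vs. 5,500+ docs
per second in this thread) is what the Solr indexing itself costs.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.util.ArrayList;
import java.util.List;

public class IndexingDryRun {

    private static final int BATCH_SIZE = 500;

    // Runs the normal indexing loop and returns docs/second. With sendToSolr=false
    // everything runs except client.add(), which isolates the producer-side rate.
    static double measure(SolrClient client, boolean sendToSolr) throws Exception {
        List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
        long count = 0;
        long start = System.nanoTime();
        SolrInputDocument doc;
        while ((doc = produceNextDocument()) != null) {   // hypothetical producer call
            batch.add(doc);
            count++;
            if (batch.size() == BATCH_SIZE) {
                if (sendToSolr) {
                    client.add(batch);                    // the only line that differs between runs
                }
                batch.clear();
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return count / seconds;
    }

    private static SolrInputDocument produceNextDocument() {
        // Placeholder: pull the next document from the real producer; null when exhausted.
        return null;
    }
}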
> On Mon, Mar 13, 2017 at 12:01 PM, Mahmoud Almokadem <prog.mahm...@gmail.com> wrote:
>> Thanks Erick,
>>
>> I've commented out the line SolrClient.add(doclist) and get 5500+ docs per
>> second from a single producer.
>>
>> Regarding more shards, you mean use 2 nodes with 8 shards per node so we
>> got 16 shards on the same 2 nodes or spread shards over more nodes?
>>
>> I'm using Solr 6.4.1 with ZooKeeper on the same nodes.
>>
>> Here's what I got from the Sematext profiler:
>>
>> 51%  Thread.java:745  java.lang.Thread#run
>> 42%  QueuedThreadPool.java:589  org.eclipse.jetty.util.thread.QueuedThreadPool$2#run  (29 calls collapsed)
>> 43%  UpdateRequestHandler.java:97  org.apache.solr.handler.UpdateRequestHandler$1#load
>> 30%  JsonLoader.java:78  org.apache.solr.handler.loader.JsonLoader#load
>> 30%  JsonLoader.java:115  org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader#load
>> 13%  JavabinLoader.java:54  org.apache.solr.handler.loader.JavabinLoader#load
>>  9%  ThreadPoolExecutor.java:617  java.util.concurrent.ThreadPoolExecutor$Worker#run
>>  9%  ThreadPoolExecutor.java:1142  java.util.concurrent.ThreadPoolExecutor#runWorker
>> 33%  ConcurrentMergeScheduler.java:626  org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread#run
>> 33%  ConcurrentMergeScheduler.java:588  org.apache.lucene.index.ConcurrentMergeScheduler#doMerge
>> 33%  SolrIndexWriter.java:233  org.apache.solr.update.SolrIndexWriter#merge
>> 33%  IndexWriter.java:3920  org.apache.lucene.index.IndexWriter#merge
>> 33%  IndexWriter.java:4343  org.apache.lucene.index.IndexWriter#mergeMiddle
>> 20%  SegmentMerger.java:101  org.apache.lucene.index.SegmentMerger#merge
>> 11%  SegmentMerger.java:89  org.apache.lucene.index.SegmentMerger#merge
>>  2%  SegmentMerger.java:144  org.apache.lucene.index.SegmentMerger#merge
>>
>> On Mon, Mar 13, 2017 at 5:12 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>>> Note that 70,000 docs/second pretty much guarantees that there are
>>> multiple shards. Lots of shards.
>>>
>>> But since you're using SolrJ, the very first thing I'd try would be
>>> to comment out the SolrClient.add(doclist) call so you're doing
>>> everything _except_ send the docs to Solr. That'll tell you whether
>>> there's any bottleneck on getting the docs from the system of record.
>>> The fact that you're pegging the CPUs argues that you are feeding Solr
>>> as fast as Solr can go, so this is just a sanity check. But it's
>>> simple/fast.
>>>
>>> As far as what on Solr could be the bottleneck, no real way to know
>>> without profiling. But 300+ fields per doc probably just means you're
>>> doing a lot of processing; I'm not particularly hopeful you'll be able
>>> to speed things up without either more shards or simplifying your
>>> schema.
>>>
>>> Best,
>>> Erick
>>>
>>> On Mon, Mar 13, 2017 at 6:58 AM, Mahmoud Almokadem <prog.mahm...@gmail.com> wrote:
>>>> Hi great community,
>>>>
>>>> I have a SolrCloud with the following configuration:
>>>>
>>>> - 2 nodes (r3.2xlarge, 61 GB RAM)
>>>> - 4 shards
>>>> - The producer can produce 13,000+ docs per second
>>>> - The schema contains 300+ fields and the document size is about 3KB
>>>> - Using SolrJ and CloudSolrClient; each batch to Solr contains 500 docs
>>>>
>>>> When I start my bulk indexer program the CPU utilization is 100% on each
>>>> server, but the rate of the indexer is about 1,500 docs per second.
>>>>
>>>> I know that some Solr benchmarks reached 70,000+ docs per second.
>>>>
>>>> The question: what is the best way to determine the bottleneck on Solr
>>>> indexing rate?
>>>>
>>>> Thanks,
>>>> Mahmoud
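As a footnote to the "more shards" advice in this thread: creating a wider
collection is a Collections API call, for example via SolrJ as sketched below.
The collection name, config set, and the 8-shard / 1-replica counts are
placeholder values; the right numbers depend on the capacity planning Erick
links to, and on whether the shards stay on the same two nodes or spread over
more.

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateShardedCollection {

    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble; the Builder is available in recent SolrJ 6.x.
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181,zk2:2181,zk3:2181")
                .build()) {
            // Placeholder collection/config names and shard/replica counts.
            CollectionAdminRequest.createCollection("my_collection", "my_configset", 8, 1)
                    .process(client);
        }
    }
}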