Hi Erick,

Thanks for the detailed answer.

The producer can sustain that rate; it's not occasional spikes.

So I can run more clients that write to Solr even though I already hit maximum
CPU utilization with a single client? Do you think that will increase throughput?
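If so, my rough plan is something like the sketch below (SolrJ 6.x). The ZooKeeper
hosts, the collection name, and the nextDocument() producer hook are placeholders,
not our real code:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {

    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble and collection name.
        CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181,zk2:2181,zk3:2181")
                .build();
        client.setDefaultCollection("mycollection");

        int numIndexers = 3;  // start small and watch CPU on the Solr nodes
        ExecutorService pool = Executors.newFixedThreadPool(numIndexers);
        for (int i = 0; i < numIndexers; i++) {
            pool.submit(() -> {
                List<SolrInputDocument> batch = new ArrayList<>();
                try {
                    SolrInputDocument doc;
                    while ((doc = nextDocument()) != null) {  // hypothetical producer hook
                        batch.add(doc);
                        if (batch.size() == 500) {  // same batch size we use today
                            client.add(batch);
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) {
                        client.add(batch);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        client.close();
    }

    // Stand-in for the real producer; returns null when nothing is left.
    private static SolrInputDocument nextDocument() {
        return null;
    }
}

(And if one machine can't generate the documents fast enough, as you said, these
could just as well be separate processes on separate machines.)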

And do you advise adding more shards on the same two nodes until I find the best
throughput?
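If that's the way to go, I'd probably spin up a separate test collection with more
shards per node, roughly like this (collection and configset names are placeholders,
not what we actually use):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateTestCollection {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble, collection and configset names.
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181,zk2:2181,zk3:2181")
                .build()) {
            // 8 shards, 1 replica each, allowing 4 shards per node on the 2 nodes.
            CollectionAdminRequest.Create create =
                    CollectionAdminRequest.createCollection("test_8shards", "myconfig", 8, 1);
            create.setMaxShardsPerNode(4);
            create.process(client);
        }
    }
}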

autoCommit is 15000 and softCommit is 60000.
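That is, roughly this in solrconfig.xml (I'm assuming those are the maxTime values
in ms; openSearcher=false is also just my assumption, not from our current config):

<!-- sketch of the corresponding solrconfig.xml section -->
<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>60000</maxTime>
</autoSoftCommit>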

Thanks,
Mahmoud

> On Mar 13, 2017, at 9:28 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> OK, so you can get a 360% speedup by commenting out the solr.add. That
> indicates that, indeed, you're pretty much running Solr flat out, not
> surprising. You _might_ squeeze a little more out of Solr by adding
> more client indexers, but that's not going to drive you to the numbers
> you need. I do have one observation though. You say "...can produce
> 13,000+ docs per second...". Is this sustained or occasional spikes?
> If the latter, can you let Solr fall behind and pick up the extra
> files when the producer slows down?
> 
> Second, you'll have to have at least three clients running to even do
> the upstream processing without Solr in the picture at all. IOW, you
> can't gather and generate the Solr documents fast enough with one
> client, much less index them too.
> 
> bq: Regarding more shards, you mean use 2 nodes with 8 shards per node so we
> got 16 shards on the same 2 nodes or spread shards over more nodes?
> 
> Yes ;). Once you have enough shards/replicas on a box that you're
> running all the CPUs flat out, adding more shards won't do you any
> good. And we're just skipping over what that'll do to your ability to
> run queries. Plus, 13,000 docs/second will mount up pretty quickly, so
> you have to do your capacity planning for the projected maximum number
> of docs you'll host on this collection. My bet: If you size your
> cluster appropriately for the eventual total size, your indexing
> throughput will hit your numbers. Unless you have a very short
> retention.
> 
> See: 
> https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> 
> Best,
> Erick
> 
> On Mon, Mar 13, 2017 at 12:01 PM, Mahmoud Almokadem
> <prog.mahm...@gmail.com> wrote:
>> Thanks Erick,
>> 
>> I've commented out the line SolrClient.add(doclist) and got 5,500+ docs per
>> second from a single producer.
>> 
>> Regarding more shards, you mean use 2 nodes with 8 shards per node so we
>> got 16 shards on the same 2 nodes or spread shards over more nodes?
>> 
>> I'm using Solr 6.4.1 with ZooKeeper on the same nodes.
>> 
>> Here's what I got from the Sematext profiler:
>> 
>> 51%  Thread.java:745  java.lang.Thread#run
>> 42%  QueuedThreadPool.java:589  org.eclipse.jetty.util.thread.QueuedThreadPool$2#run  (29 calls collapsed)
>> 43%  UpdateRequestHandler.java:97  org.apache.solr.handler.UpdateRequestHandler$1#load
>> 30%  JsonLoader.java:78  org.apache.solr.handler.loader.JsonLoader#load
>> 30%  JsonLoader.java:115  org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader#load
>> 13%  JavabinLoader.java:54  org.apache.solr.handler.loader.JavabinLoader#load
>>  9%  ThreadPoolExecutor.java:617  java.util.concurrent.ThreadPoolExecutor$Worker#run
>>  9%  ThreadPoolExecutor.java:1142  java.util.concurrent.ThreadPoolExecutor#runWorker
>> 33%  ConcurrentMergeScheduler.java:626  org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread#run
>> 33%  ConcurrentMergeScheduler.java:588  org.apache.lucene.index.ConcurrentMergeScheduler#doMerge
>> 33%  SolrIndexWriter.java:233  org.apache.solr.update.SolrIndexWriter#merge
>> 33%  IndexWriter.java:3920  org.apache.lucene.index.IndexWriter#merge
>> 33%  IndexWriter.java:4343  org.apache.lucene.index.IndexWriter#mergeMiddle
>> 20%  SegmentMerger.java:101  org.apache.lucene.index.SegmentMerger#merge
>> 11%  SegmentMerger.java:89  org.apache.lucene.index.SegmentMerger#merge
>>  2%  SegmentMerger.java:144  org.apache.lucene.index.SegmentMerger#merge
>> 
>> 
>> On Mon, Mar 13, 2017 at 5:12 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>> 
>>> Note that 70,000 docs/second pretty much guarantees that there are
>>> multiple shards. Lots of shards.
>>> 
>>> But since you're using SolrJ, the  very first thing I'd try would be
>>> to comment out the SolrClient.add(doclist) call so you're doing
>>> everything _except_ send the docs to Solr. That'll tell you whether
>>> there's any bottleneck on getting the docs from the system of record.
>>> The fact that you're pegging the CPUs argues that you are feeding Solr
>>> as fast as Solr can go so this is just a sanity check. But it's
>>> simple/fast.
>>> 
>>> As far as what on Solr could be the bottleneck, no real way to know
>>> without profiling. But 300+ fields per doc probably just means you're
>>> doing a lot of processing; I'm not particularly hopeful you'll be able
>>> to speed things up without either more shards or simplifying your
>>> schema.
>>> 
>>> Best,
>>> Erick
>>> 
>>> On Mon, Mar 13, 2017 at 6:58 AM, Mahmoud Almokadem
>>> <prog.mahm...@gmail.com> wrote:
>>>> Hi great community,
>>>> 
>>>> I have a SolrCloud with the following configuration:
>>>> 
>>>>   - 2 nodes (r3.2xlarge 61GB RAM)
>>>>   - 4 shards.
>>>>   - The producer can produce 13,000+ docs per second
>>>>   - The schema contains about 300+ fields and the document size is about
>>>>   3KB.
>>>>   - Using SolrJ and CloudSolrClient; each batch sent to Solr contains 500 docs.
>>>> 
>>>> When I start my bulk indexer program, the CPU utilization is 100% on each
>>>> server, but the indexing rate is about 1,500 docs per second.
>>>> 
>>>> I know that some Solr benchmarks have reached 70,000+ docs per second.
>>>> 
>>>> The question: what is the best way to determine the bottleneck in the Solr
>>>> indexing rate?
>>>> 
>>>> Thanks,
>>>> Mahmoud
>>> 
