OK, so you can get roughly a 3.6x speedup (1,500 -> 5,500 docs per second) by commenting out the solr.add. That indicates that, indeed, you're pretty much running Solr flat out, which isn't surprising. You _might_ squeeze a little more out of Solr by adding more client indexers, but that's not going to get you to the numbers you need. I do have one observation, though. You say "...can produce 13,000+ docs per second...". Is that sustained, or occasional spikes? If the latter, can you let Solr fall behind and pick up the extra files when the producer slows down?
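Just to make the "more client indexers" idea concrete, here's a rough, untested SolrJ sketch of several indexer threads sharing one CloudSolrClient and each sending 500-doc batches. The ZooKeeper hosts, collection name and the nextDoc() producer hook are placeholders, not details from your setup:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    private static final String ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"; // placeholder
    private static final String COLLECTION = "collection1";             // placeholder
    private static final int BATCH_SIZE = 500;
    private static final int INDEXER_THREADS = 3; // "at least three clients"

    public static void main(String[] args) throws Exception {
        // A single CloudSolrClient is thread-safe and can be shared by all indexer threads.
        // (The String constructor works on SolrJ 6.x; newer versions prefer CloudSolrClient.Builder.)
        CloudSolrClient client = new CloudSolrClient(ZK_HOSTS);
        client.setDefaultCollection(COLLECTION);

        ExecutorService pool = Executors.newFixedThreadPool(INDEXER_THREADS);
        for (int t = 0; t < INDEXER_THREADS; t++) {
            pool.submit(() -> {
                try {
                    List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
                    SolrInputDocument doc;
                    while ((doc = nextDoc()) != null) { // pull from your producer
                        batch.add(doc);
                        if (batch.size() >= BATCH_SIZE) {
                            client.add(batch); // comment this out to test the producer alone
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) {
                        client.add(batch);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        client.commit(); // for the test only; in production rely on autoCommit
        client.close();
    }

    // Placeholder for the upstream pipeline; return null when the producer is drained.
    private static SolrInputDocument nextDoc() {
        return null; // hook up the real document generator here
    }
}

The point of the shared client plus a small thread pool is just to spread the upstream work (fetching, transforming, building SolrInputDocuments) across cores; the batches stay at 500 docs as in your current indexer.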
Second, you'll have to have at least three clients running to even do the upstream processing without Solr in the picture at all. IOW, you can't gather and generate the Solr documents fast enough with one client, much less index them too.

bq: Regarding more shards, you mean use 2 nodes with 8 shards per node so we got 16 shards on the same 2 nodes or spread shards over more nodes?

Yes ;). Once you have enough shards/replicas on a box that you're running all the CPUs flat out, adding more shards won't do you any good. And we're just skipping over what that'll do to your ability to run queries.

Plus, 13,000 docs/second will mount up pretty quickly, so you have to do your capacity planning for the projected maximum number of docs you'll host on this collection. My bet: if you size your cluster appropriately for the eventual total size, your indexing throughput will hit your numbers. Unless you have a very short retention period.

See: https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick

On Mon, Mar 13, 2017 at 12:01 PM, Mahmoud Almokadem <prog.mahm...@gmail.com> wrote:
> Thanks Erick,
>
> I've commented out the line SolrClient.add(doclist) and got 5,500+ docs per
> second from a single producer.
>
> Regarding more shards, you mean use 2 nodes with 8 shards per node so we
> got 16 shards on the same 2 nodes or spread shards over more nodes?
>
> I'm using Solr 6.4.1 with ZooKeeper on the same nodes.
>
> Here's what I got from the Sematext profiler:
>
> 51%  Thread.java:745  java.lang.Thread#run
> 42%  QueuedThreadPool.java:589  org.eclipse.jetty.util.thread.QueuedThreadPool$2#run  (29 calls collapsed)
> 43%  UpdateRequestHandler.java:97  org.apache.solr.handler.UpdateRequestHandler$1#load
> 30%  JsonLoader.java:78  org.apache.solr.handler.loader.JsonLoader#load
> 30%  JsonLoader.java:115  org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader#load
> 13%  JavabinLoader.java:54  org.apache.solr.handler.loader.JavabinLoader#load
> 9%   ThreadPoolExecutor.java:617  java.util.concurrent.ThreadPoolExecutor$Worker#run
> 9%   ThreadPoolExecutor.java:1142  java.util.concurrent.ThreadPoolExecutor#runWorker
> 33%  ConcurrentMergeScheduler.java:626  org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread#run
> 33%  ConcurrentMergeScheduler.java:588  org.apache.lucene.index.ConcurrentMergeScheduler#doMerge
> 33%  SolrIndexWriter.java:233  org.apache.solr.update.SolrIndexWriter#merge
> 33%  IndexWriter.java:3920  org.apache.lucene.index.IndexWriter#merge
> 33%  IndexWriter.java:4343  org.apache.lucene.index.IndexWriter#mergeMiddle
> 20%  SegmentMerger.java:101  org.apache.lucene.index.SegmentMerger#merge
> 11%  SegmentMerger.java:89  org.apache.lucene.index.SegmentMerger#merge
> 2%   SegmentMerger.java:144  org.apache.lucene.index.SegmentMerger#merge
>
> On Mon, Mar 13, 2017 at 5:12 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> Note that 70,000 docs/second pretty much guarantees that there are
>> multiple shards. Lots of shards.
>>
>> But since you're using SolrJ, the very first thing I'd try would be
>> to comment out the SolrClient.add(doclist) call so you're doing
>> everything _except_ send the docs to Solr. That'll tell you whether
>> there's any bottleneck on getting the docs from the system of record.
>> The fact that you're pegging the CPUs argues that you are feeding Solr
>> as fast as Solr can go, so this is just a sanity check. But it's
>> simple/fast.
>>
>> As far as what on Solr could be the bottleneck, there's no real way to know
>> without profiling. But 300+ fields per doc probably just means you're
>> doing a lot of processing; I'm not particularly hopeful you'll be able
>> to speed things up without either more shards or simplifying your
>> schema.
>>
>> Best,
>> Erick
>>
>> On Mon, Mar 13, 2017 at 6:58 AM, Mahmoud Almokadem
>> <prog.mahm...@gmail.com> wrote:
>> > Hi great community,
>> >
>> > I have a SolrCloud cluster with the following configuration:
>> >
>> > - 2 nodes (r3.2xlarge, 61GB RAM)
>> > - 4 shards
>> > - The producer can produce 13,000+ docs per second
>> > - The schema contains 300+ fields and the document size is about 3KB
>> > - Using SolrJ and CloudSolrClient, each batch sent to Solr contains 500 docs
>> >
>> > When I start my bulk indexer program, the CPU utilization is 100% on each
>> > server, but the rate of the indexer is only about 1,500 docs per second.
>> >
>> > I know that some Solr benchmarks have reached 70,000+ docs per second.
>> >
>> > The question: what is the best way to determine the bottleneck on Solr
>> > indexing rate?
>> >
>> > Thanks,
>> > Mahmoud
>>
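For reference, the "comment out the add" sanity check described above might look something like this rough sketch: run the full document-generation pipeline and time it, with the actual send to Solr commented out. The ZooKeeper hosts, collection name, document count and buildDoc() helper are illustrative placeholders, not details from the original setup:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ProducerOnlyBenchmark {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181"); // placeholder ZK hosts
        client.setDefaultCollection("collection1"); // placeholder collection name

        final int batchSize = 500;
        final int totalDocs = 1_000_000; // illustrative
        List<SolrInputDocument> batch = new ArrayList<>(batchSize);

        long start = System.nanoTime();
        for (int i = 0; i < totalDocs; i++) {
            batch.add(buildDoc(i)); // the full upstream processing still runs
            if (batch.size() >= batchSize) {
                // client.add(batch); // <-- commented out: everything except the send to Solr
                batch.clear();
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("Generated %,d docs in %.1fs (%.0f docs/sec)%n",
                totalDocs, seconds, totalDocs / seconds);
        client.close();
    }

    // Placeholder for building one ~300-field document from the system of record.
    private static SolrInputDocument buildDoc(int i) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", Integer.toString(i));
        // ... populate the remaining fields here ...
        return doc;
    }
}

If the rate printed here is well above what Solr sustains with the add() call enabled, the bottleneck is on the Solr side; if it's close to 1,500 docs/second, the producer itself is part of the problem.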