Here is the profiler screenshot from VisualVM after upgrading:
https://drive.google.com/open?id=0BwLcshoSCVcddldVRTExaDR2dzg

Jetty is taking the most CPU time. Does this mean Jetty is the bottleneck
for indexing?

Thanks,
Mahmoud

On Tue, Mar 14, 2017 at 11:41 AM, Mahmoud Almokadem <prog.mahm...@gmail.com> wrote:
> Thanks Shalin,
>
> I'm posting data to Solr as SolrInputDocuments using SolrJ.
>
> According to the profiler, com.codahale.metrics.Meter.mark takes much more
> processing time than anything else, as described in this issue:
> https://issues.apache.org/jira/browse/SOLR-10130.
>
> And I think the Sematext profiler works differently from VisualVM.
>
> Thanks for the help,
> Mahmoud
>
> On Tue, Mar 14, 2017 at 11:08 AM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:
>> According to the profiler output, a significant amount of CPU is being
>> spent in JSON parsing, but your previous email said that you use SolrJ.
>> SolrJ uses the javabin binary format to send documents to Solr and it
>> never uses JSON, so there is definitely some other indexing process
>> that you have not accounted for.
>>
>> On Tue, Mar 14, 2017 at 12:31 AM, Mahmoud Almokadem <prog.mahm...@gmail.com> wrote:
>>> Thanks Erick,
>>>
>>> I've commented out the line SolrClient.add(doclist) and got 5,500+ docs
>>> per second from a single producer.
>>>
>>> Regarding more shards, do you mean using 2 nodes with 8 shards per node,
>>> so we get 16 shards on the same 2 nodes, or spreading the shards over
>>> more nodes?
>>>
>>> I'm using Solr 6.4.1 with ZooKeeper on the same nodes.
>>>
>>> Here's what I got from the Sematext profiler:
>>>
>>> 51% Thread.java:745 java.lang.Thread#run
>>>   42% QueuedThreadPool.java:589 org.eclipse.jetty.util.thread.QueuedThreadPool$2#run
>>>     (29 collapsed calls)
>>>     43% UpdateRequestHandler.java:97 org.apache.solr.handler.UpdateRequestHandler$1#load
>>>       30% JsonLoader.java:78 org.apache.solr.handler.loader.JsonLoader#load
>>>         30% JsonLoader.java:115 org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader#load
>>>       13% JavabinLoader.java:54 org.apache.solr.handler.loader.JavabinLoader#load
>>>   9% ThreadPoolExecutor.java:617 java.util.concurrent.ThreadPoolExecutor$Worker#run
>>>     9% ThreadPoolExecutor.java:1142 java.util.concurrent.ThreadPoolExecutor#runWorker
>>> 33% ConcurrentMergeScheduler.java:626 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread#run
>>>   33% ConcurrentMergeScheduler.java:588 org.apache.lucene.index.ConcurrentMergeScheduler#doMerge
>>>     33% SolrIndexWriter.java:233 org.apache.solr.update.SolrIndexWriter#merge
>>>       33% IndexWriter.java:3920 org.apache.lucene.index.IndexWriter#merge
>>>         33% IndexWriter.java:4343 org.apache.lucene.index.IndexWriter#mergeMiddle
>>>           20% SegmentMerger.java:101 org.apache.lucene.index.SegmentMerger#merge
>>>           11% SegmentMerger.java:89 org.apache.lucene.index.SegmentMerger#merge
>>>           2% SegmentMerger.java:144 org.apache.lucene.index.SegmentMerger#merge
>>>
>>> On Mon, Mar 13, 2017 at 5:12 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>> Note that 70,000 docs/second pretty much guarantees that there are
>>>> multiple shards. Lots of shards.
>>>>
>>>> But since you're using SolrJ, the very first thing I'd try would be
>>>> to comment out the SolrClient.add(doclist) call so you're doing
>>>> everything _except_ send the docs to Solr.
>>>> That'll tell you whether
>>>> there's any bottleneck in getting the docs from the system of record.
>>>> The fact that you're pegging the CPUs argues that you are feeding Solr
>>>> as fast as Solr can go, so this is just a sanity check. But it's
>>>> simple/fast.
>>>>
>>>> As far as what on Solr could be the bottleneck, there's no real way to
>>>> know without profiling. But 300+ fields per doc probably just means
>>>> you're doing a lot of processing; I'm not particularly hopeful you'll
>>>> be able to speed things up without either more shards or a simpler
>>>> schema.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Mon, Mar 13, 2017 at 6:58 AM, Mahmoud Almokadem <prog.mahm...@gmail.com> wrote:
>>>>> Hi great community,
>>>>>
>>>>> I have a SolrCloud cluster with the following configuration:
>>>>>
>>>>> - 2 nodes (r3.2xlarge, 61GB RAM)
>>>>> - 4 shards
>>>>> - The producer can produce 13,000+ docs per second
>>>>> - The schema contains 300+ fields and the document size is about 3KB
>>>>> - Using SolrJ and CloudSolrClient; each batch sent to Solr contains
>>>>>   500 docs
>>>>>
>>>>> When I start my bulk indexer program, the CPU utilization is 100% on
>>>>> each server, but the rate of the indexer is only about 1,500 docs per
>>>>> second.
>>>>>
>>>>> I know that some Solr benchmarks have reached 70,000+ docs per second.
>>>>>
>>>>> The question: what is the best way to determine the bottleneck in the
>>>>> Solr indexing rate?
>>>>>
>>>>> Thanks,
>>>>> Mahmoud
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
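
For reference, a bulk indexer along the lines Mahmoud describes (SolrJ 6.4,
CloudSolrClient, 500-document batches) might look roughly like the minimal
sketch below. The ZooKeeper addresses, collection name, field names, and the
fetchBatch() stand-in for the system of record are all hypothetical, not taken
from the thread:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndexer {

        private static int produced = 0;

        public static void main(String[] args) throws Exception {
            // Placeholder ZooKeeper ensemble and collection name.
            try (CloudSolrClient client = new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181,zk2:2181,zk3:2181")
                    .build()) {
                client.setDefaultCollection("mycollection");

                List<SolrInputDocument> batch;
                while (!(batch = fetchBatch(500)).isEmpty()) {
                    client.add(batch); // comment this out for the sanity check below
                }
                client.commit();
            }
        }

        // Stand-in for reading documents from the system of record.
        private static List<SolrInputDocument> fetchBatch(int size) {
            List<SolrInputDocument> docs = new ArrayList<>();
            while (docs.size() < size && produced < 100_000) { // demo limit
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + produced++);
                doc.addField("title_s", "example title " + produced); // placeholder field
                docs.add(doc);
            }
            return docs;
        }
    }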
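
Erick's sanity check, everything except the send, can also be timed to put a
number on the producer-only rate. A sketch, reusing the hypothetical
fetchBatch() from above:

    // Run the full pipeline *except* the send to Solr, and time it.
    private static void measureProducerOnly() {
        long start = System.nanoTime();
        long count = 0;
        List<SolrInputDocument> batch;
        while (!(batch = fetchBatch(500)).isEmpty()) {
            count += batch.size();
            // client.add(batch);  // deliberately commented out
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("Producer alone: %.0f docs/sec%n", count / seconds);
    }

If the producer-only rate is far above the end-to-end rate, as Mahmoud's
5,500+ vs. 1,500 docs per second suggests, the bottleneck is on the Solr side.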
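
On the shards question: whichever way the extra shards are placed, a wider
collection can be created with the SolrJ Collections API and then reindexed
into. A sketch, with made-up collection and configset names:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class CreateWiderCollection {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181,zk2:2181,zk3:2181") // placeholder
                    .build()) {
                // 16 shards, 1 replica each, using an existing configset.
                CollectionAdminRequest.createCollection("mycollection_16", "myconfig", 16, 1)
                        .setMaxShardsPerNode(8) // 8 shards per node on a 2-node cluster
                        .process(client);
            }
        }
    }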