I'm suggesting that worrying about your indexing rate is premature. 13,000 docs/second is over 1B docs per day. As a straw-man number, each Solr replica (think shard) can hold 64M documents. You need 16 shards at that size to hold a single day's input. Let's say you want to keep these docs around for 30 days. Now you're up to 480 shards. Even at one follower each, that's 960 replicas.
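To make that arithmetic explicit, here's a minimal sketch using the straw-man numbers above. The round 1B-docs/day figure, the 64M-docs-per-replica capacity, and the leader-plus-one-follower layout are assumptions for illustration, not measurements:

```java
// Straw-man cluster sizing, mirroring the arithmetic above.
public class StrawManSizing {
    public static void main(String[] args) {
        // 13,000 docs/sec * 86,400 sec/day ~= 1.12B; use the round 1B figure from the mail.
        long docsPerDay = 1_000_000_000L;
        long docsPerReplica = 64_000_000L;   // straw-man guess -- measure your own!
        int retentionDays = 30;
        int copiesPerShard = 2;              // leader + one follower

        long shardsPerDay = (docsPerDay + docsPerReplica - 1) / docsPerReplica; // ceil -> 16
        long totalShards = shardsPerDay * retentionDays;                        // 480
        long totalReplicas = totalShards * copiesPerShard;                      // 960

        System.out.printf("shards/day=%d, shards for %d days=%d, replicas=%d%n",
                shardsPerDay, retentionDays, totalShards, totalReplicas);
    }
}
```

Swap in whatever per-replica capacity your own testing gives you and every number downstream changes with it.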
Again, just for the straw-man, let's say you can host 4 Solr JVMs on each physical machine and 4 replicas in each JVM. That's 16 Solr replicas per physical machine, which is still 60 physical machines. YMMV greatly, by the way: I've seen a single Solr replica hold anywhere between 10M and 300M documents. So until you figure out at least a straw-man size for your entire cluster, worrying about throughput at index time is pointless. The link I provided will give you a way to figure out how many documents you can host on a single machine, and a fairly solid way to extrapolate to your requirements. Until you do that, however, you're just guessing.

Best,
Erick

On Mon, Mar 13, 2017 at 12:53 PM, Mahmoud Almokadem <prog.mahm...@gmail.com> wrote:
> Hi Erick,
>
> Thanks for the detailed answer.
>
> The producer can sustain producing at that rate; it's not spikes.
>
> So, can I run more clients that write to Solr although I got that maximum
> utilization with a single client? Do you think it will increase throughput?
>
> And you advise me to add more shards on the same two nodes until I get the
> best throughput?
>
> autocommit is 15000 and softcommit is 60000
>
> Thanks,
> Mahmoud
>
>> On Mar 13, 2017, at 9:28 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>> OK, so you can get a 360% speedup by commenting out the solr.add. That
>> indicates that, indeed, you're pretty much running Solr flat out, not
>> surprising. You _might_ squeeze a little more out of Solr by adding
>> more client indexers, but that's not going to drive you to the numbers
>> you need. I do have one observation, though. You say "...can produce
>> 13,000+ docs per second...". Is this sustained or occasional spikes?
>> If the latter, can you let Solr fall behind and pick up the extra
>> files when the producer slows down?
>>
>> Second, you'll have to have at least three clients running to even do
>> the upstream processing without Solr in the picture at all. IOW, you
>> can't gather and generate the Solr documents fast enough with one
>> client, much less index them too.
>>
>> bq: Regarding more shards, you mean use 2 nodes with 8 shards per node so we
>> got 16 shards on the same 2 nodes or spread shards over more nodes?
>>
>> Yes ;). Once you have enough shards/replicas on a box that you're
>> running all the CPUs flat out, adding more shards won't do you any
>> good. And we're just skipping over what that'll do to your ability to
>> run queries. Plus, 13,000 docs/second will mount up pretty quickly, so
>> you have to do your capacity planning for the projected maximum number
>> of docs you'll host in this collection. My bet: if you size your
>> cluster appropriately for the eventual total size, your indexing
>> throughput will hit your numbers. Unless you have a very short
>> retention.
>>
>> See:
>> https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>
>> Best,
>> Erick
>>
>> On Mon, Mar 13, 2017 at 12:01 PM, Mahmoud Almokadem
>> <prog.mahm...@gmail.com> wrote:
>>> Thanks Erick,
>>>
>>> I've commented out the line SolrClient.add(doclist) and get 5500+ docs per
>>> second from a single producer.
>>>
>>> Regarding more shards, you mean use 2 nodes with 8 shards per node so we
>>> got 16 shards on the same 2 nodes or spread shards over more nodes?
>>>
>>> I'm using Solr 6.4.1 with ZooKeeper on the same nodes.
>>>
>>> Here's what I got from sematext profiler
>>>
>>> 51%
>>> Thread.java:745 java.lang.Thread#run
>>>
>>> 42%
>>> QueuedThreadPool.java:589
>>> org.eclipse.jetty.util.thread.QueuedThreadPool$2#run
>>> Collapsed 29 calls (Expand)
>>>
>>> 43%
>>> UpdateRequestHandler.java:97
>>> org.apache.solr.handler.UpdateRequestHandler$1#load
>>>
>>> 30%
>>> JsonLoader.java:78 org.apache.solr.handler.loader.JsonLoader#load
>>>
>>> 30%
>>> JsonLoader.java:115
>>> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader#load
>>>
>>> 13%
>>> JavabinLoader.java:54 org.apache.solr.handler.loader.JavabinLoader#load
>>>
>>> 9%
>>> ThreadPoolExecutor.java:617
>>> java.util.concurrent.ThreadPoolExecutor$Worker#run
>>>
>>> 9%
>>> ThreadPoolExecutor.java:1142
>>> java.util.concurrent.ThreadPoolExecutor#runWorker
>>>
>>> 33%
>>> ConcurrentMergeScheduler.java:626
>>> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread#run
>>>
>>> 33%
>>> ConcurrentMergeScheduler.java:588
>>> org.apache.lucene.index.ConcurrentMergeScheduler#doMerge
>>>
>>> 33%
>>> SolrIndexWriter.java:233 org.apache.solr.update.SolrIndexWriter#merge
>>>
>>> 33%
>>> IndexWriter.java:3920 org.apache.lucene.index.IndexWriter#merge
>>>
>>> 33%
>>> IndexWriter.java:4343 org.apache.lucene.index.IndexWriter#mergeMiddle
>>>
>>> 20%
>>> SegmentMerger.java:101 org.apache.lucene.index.SegmentMerger#merge
>>>
>>> 11%
>>> SegmentMerger.java:89 org.apache.lucene.index.SegmentMerger#merge
>>>
>>> 2%
>>> SegmentMerger.java:144 org.apache.lucene.index.SegmentMerger#merge
>>>
>>>
>>> On Mon, Mar 13, 2017 at 5:12 PM, Erick Erickson <erickerick...@gmail.com>
>>> wrote:
>>>
>>>> Note that 70,000 docs/second pretty much guarantees that there are
>>>> multiple shards. Lots of shards.
>>>>
>>>> But since you're using SolrJ, the very first thing I'd try would be
>>>> to comment out the SolrClient.add(doclist) call so you're doing
>>>> everything _except_ send the docs to Solr. That'll tell you whether
>>>> there's any bottleneck on getting the docs from the system of record.
>>>> The fact that you're pegging the CPUs argues that you are feeding Solr
>>>> as fast as Solr can go so this is just a sanity check. But it's
>>>> simple/fast.
>>>>
>>>> As far as what on Solr could be the bottleneck, no real way to know
>>>> without profiling. But 300+ fields per doc probably just means you're
>>>> doing a lot of processing, I'm not particularly hopeful you'll be able
>>>> to speed things up without either more shards or simplifying your
>>>> schema.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Mon, Mar 13, 2017 at 6:58 AM, Mahmoud Almokadem
>>>> <prog.mahm...@gmail.com> wrote:
>>>>> Hi great community,
>>>>>
>>>>> I have a SolrCloud with the following configuration:
>>>>>
>>>>> - 2 nodes (r3.2xlarge 61GB RAM)
>>>>> - 4 shards.
>>>>> - The producer can produce 13,000+ docs per second
>>>>> - The schema contains about 300+ fields and the document size is about
>>>>> 3KB.
>>>>> - Using SolrJ and SolrCloudClient, each batch to solr contains 500 docs.
>>>>>
>>>>> When I start my bulk indexer program the CPU utilization is 100% on each
>>>>> server but the rate of the indexer is about 1500 docs per second.
>>>>>
>>>>> I know that some solr benchmarks reached 70,000+ doc. per second.
>>>>>
>>>>> The question: What is the best way to determine the bottleneck on solr
>>>>> indexing rate?
>>>>>
>>>>> Thanks,
>>>>> Mahmoud
>>>>
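For anyone wanting to reproduce the measurement discussed in this thread, below is a minimal SolrJ sketch of the same sanity check: batch documents 500 at a time and flip a flag to skip the add() call, so you can time document production with and without Solr in the picture. The ZooKeeper address, collection name, and field names are placeholders, the document builder is a stand-in for the real producer, and the client construction assumes SolrJ 6.x (the class the thread calls "SolrCloudClient" is CloudSolrClient in SolrJ). Commits are left to the server-side autoCommit/softCommit settings mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexThroughputCheck {

    // Placeholders -- point these at your own ZooKeeper ensemble and collection.
    private static final String ZK_HOST = "zk1:2181,zk2:2181,zk3:2181/solr";
    private static final String COLLECTION = "mycollection";

    private static final int BATCH_SIZE = 500;
    private static final int TOTAL_DOCS = 100_000;

    // Set to false to measure document production only (the "comment out
    // SolrClient.add(doclist)" sanity check from the thread).
    private static final boolean SEND_TO_SOLR = true;

    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost(ZK_HOST)
                .build();
        client.setDefaultCollection(COLLECTION);

        List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
        long start = System.nanoTime();

        for (int i = 0; i < TOTAL_DOCS; i++) {
            batch.add(buildDoc(i));
            if (batch.size() == BATCH_SIZE) {
                if (SEND_TO_SOLR) {
                    client.add(batch);   // commits handled by autoCommit/softCommit on the server
                }
                batch.clear();
            }
        }
        if (SEND_TO_SOLR && !batch.isEmpty()) {
            client.add(batch);
        }

        double seconds = (System.nanoTime() - start) / 1_000_000_000.0;
        System.out.printf("Processed %d docs in %.1fs (%.0f docs/sec, sendToSolr=%b)%n",
                TOTAL_DOCS, seconds, TOTAL_DOCS / seconds, SEND_TO_SOLR);

        client.close();
    }

    // Stand-in for the real producer; the real documents have 300+ fields and ~3KB each.
    private static SolrInputDocument buildDoc(int i) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", UUID.randomUUID().toString());
        doc.addField("seq_i", i);
        doc.addField("body_t", "synthetic payload for throughput testing " + i);
        return doc;
    }
}
```

Comparing the docs/sec printed with SEND_TO_SOLR true versus false gives the same producer-versus-Solr split discussed above (1,500 vs. 5,500 docs/sec in this thread).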