Well, for comparison, I routinely get 20K docs/second on my Mac Pro indexing Wikipedia docs. I _think_ I have 4 shards when I do this, all in the same JVM. I'd be surprised if you couldn't reach your 5K docs/sec, but you may indeed need more shards.
All that said, 4G for the JVM is kind of constrained; you already mentioned GC. There are two pitfalls here:
1> allocating too little memory and spending lots of cycles doing very small GCs. At 4G this is likelier than:
2> having very large heaps and seeing "stop the world" GC pauses.
So I think you're on the right track looking at memory; at least that's what I'd be looking at first.

Note: your indexing (assuming you're sending complete docs, not atomic updates) will scale better if you:
1> use CloudSolrClient from Java, since it routes docs to the right leader directly and avoids an extra hop.
2> batch updates. Sending one doc at a time makes things very slow, see:
https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/

Best,
Erick

On Thu, Mar 24, 2016 at 10:57 AM, tedsolr <tsm...@sciquest.com> wrote:
> Hi Erick,
>
> My post was scant on details. The numbers I gave for collection sizes are
> projections for the future. I am in the midst of an upgrade that will be
> completed within a few weeks. My concern is that I may not be able to
> produce the throughput necessary to index an entire collection quickly
> enough (3 to 4 hours) for a large customer (100M docs).
>
> Currently:
> - single Solr instance on one host that is sharing memory and CPU with
>   other applications
> - 4GB dedicated to Solr
> - ~20M docs
> - ~10GB index size
> - using HttpSolrClient for all queries and updates
>
> Very soon:
> - two VMs dedicated to Solr (2 nodes)
> - up to 16GB available memory
> - running in cloud mode, and can now scale horizontally
> - all collections are single-sharded with 2 replicas
>
> All fields are stored. The scenario I gave uses atomic updates. The
> updates are done in large batches of 5000-10000 docs. My use case is
> perhaps different from most Solr setups: indexing throughput is more
> important than QPS. We have very few concurrent users, who do massive
> amounts of doc updates.
> I am seeing lousy (production) performance currently
> (not a surprise - long GC pauses), and have just begun the process of
> tuning in a test environment.
>
> After some more weeks of testing and tweaking I hope to get to 5000
> updates/sec, but even that may not be enough. So my main concern is that
> this business model (of updating entire collections about once a day)
> cannot be supported by Solr.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Performance-potential-for-updating-reindexing-documents-tp4265861p4265922.html
> Sent from the Solr - User mailing list archive at Nabble.com.
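For what it's worth, the batching pattern from Erick's reply above can be sketched in plain Java. This is only an illustrative sketch: `BatchingIndexer` and its `sendBatch` hook are made-up names, and the hook stands in for a real SolrJ call such as `CloudSolrClient.add(collection, batch)` against your cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchingIndexerDemo {

    // Buffers documents and flushes every `batchSize` docs, so each round
    // trip to Solr carries a full batch instead of a single document.
    // `sendBatch` is a stand-in for CloudSolrClient.add(collection, batch).
    static class BatchingIndexer<D> {
        private final int batchSize;
        private final Consumer<List<D>> sendBatch;
        private final List<D> buffer = new ArrayList<>();
        int batchesSent = 0;

        BatchingIndexer(int batchSize, Consumer<List<D>> sendBatch) {
            this.batchSize = batchSize;
            this.sendBatch = sendBatch;
        }

        void add(D doc) {
            buffer.add(doc);
            if (buffer.size() >= batchSize) flush();
        }

        void flush() {
            if (buffer.isEmpty()) return;
            sendBatch.accept(new ArrayList<>(buffer)); // one round trip per batch
            buffer.clear();
            batchesSent++;
        }
    }

    public static void main(String[] args) {
        // Hypothetical batch size; the quoted email uses 5000-10000 per batch.
        BatchingIndexer<String> indexer = new BatchingIndexer<>(1000, batch ->
                System.out.println("sending batch of " + batch.size() + " docs"));
        for (int i = 0; i < 2500; i++) indexer.add("doc" + i);
        indexer.flush(); // don't forget the final partial batch
        System.out.println("batches sent: " + indexer.batchesSent);
    }
}
```

With 2500 docs and a batch size of 1000, this sends two full batches plus one final partial batch of 500 on the explicit `flush()`, i.e. three round trips instead of 2500.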