Well, for comparison, I routinely get 20K docs/second on my Mac Pro indexing Wikipedia docs. I _think_ I have 4 shards when I do this, all in the same JVM. I'd be surprised if you couldn't reach your 5K docs/sec, but you may indeed need more shards.
All that said, 4G for the JVM is kind of constrained; you already mentioned GC. There are two pitfalls here:
1> allocating too little memory and spending lots of cycles doing very small GCs. At 4G this is likelier than:
2> having very large heaps and seeing "stop the world" GC pauses.
So I think you're on the right track looking at memory; at least that's what I'd be looking at first.

Note: your indexing (assuming you're sending complete docs, not atomic updates) will scale better if you:
1> use CloudSolrClient from Java, since it routes docs to the right leader directly and avoids an extra hop.
2> batch updates. Sending one doc at a time makes things very slow, see:
https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/

Best,
Erick

On Thu, Mar 24, 2016 at 10:57 AM, tedsolr <tsm...@sciquest.com> wrote:
> Hi Erick,
>
> My post was scant on details. The numbers I gave for collection sizes are
> projections for the future. I am in the midst of an upgrade that will be
> completed within a few weeks. My concern is that I may not be able to
> produce the throughput necessary to index an entire collection quickly
> enough (3 to 4 hours) for a large customer (100M docs).
>
> Currently:
> - single Solr instance on one host that is sharing memory and CPU with
>   other applications
> - 4GB dedicated to Solr
> - ~20M docs
> - ~10GB index size
> - using HttpSolrClient for all queries and updates
>
> Very soon:
> - two VMs dedicated to Solr (2 nodes)
> - up to 16GB available memory
> - running in cloud mode, and can now scale horizontally
> - all collections are single-sharded with 2 replicas
>
> All fields are stored. The scenario I gave uses atomic updates. The
> updates are done in large batches of 5000-10000 docs. My use case is
> perhaps different from most Solr setups: indexing throughput is more
> important than QPS. We have very few concurrent users, who do massive
> amounts of doc updates.
> I am seeing lousy (production) performance currently
> (not a surprise - long GC pauses), and have just begun the process of
> tuning in a test environment.
>
> After some more weeks of testing and tweaking I hope to get to 5000
> updates/sec, but even that may not be enough. So my main concern is that
> this business model (of updating entire collections about once a day)
> cannot be supported by Solr.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Performance-potential-for-updating-reindexing-documents-tp4265861p4265922.html
> Sent from the Solr - User mailing list archive at Nabble.com.
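For what it's worth, the batching pattern from Erick's reply above can be sketched in plain Java. This is only an illustrative sketch: `BatchingIndexer` and its `sendBatch` hook are made-up names, and the hook stands in for a real SolrJ call such as `CloudSolrClient.add(collection, batch)` against your cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchingIndexerDemo {

    // Buffers documents and flushes every `batchSize` docs, so each round
    // trip to Solr carries a full batch instead of a single document.
    // `sendBatch` is a stand-in for CloudSolrClient.add(collection, batch).
    static class BatchingIndexer<D> {
        private final int batchSize;
        private final Consumer<List<D>> sendBatch;
        private final List<D> buffer = new ArrayList<>();
        int batchesSent = 0;

        BatchingIndexer(int batchSize, Consumer<List<D>> sendBatch) {
            this.batchSize = batchSize;
            this.sendBatch = sendBatch;
        }

        void add(D doc) {
            buffer.add(doc);
            if (buffer.size() >= batchSize) flush();
        }

        void flush() {
            if (buffer.isEmpty()) return;
            sendBatch.accept(new ArrayList<>(buffer)); // one round trip per batch
            buffer.clear();
            batchesSent++;
        }
    }

    public static void main(String[] args) {
        // Hypothetical batch size; the quoted email uses 5000-10000 per batch.
        BatchingIndexer<String> indexer = new BatchingIndexer<>(1000, batch ->
                System.out.println("sending batch of " + batch.size() + " docs"));
        for (int i = 0; i < 2500; i++) indexer.add("doc" + i);
        indexer.flush(); // don't forget the final partial batch
        System.out.println("batches sent: " + indexer.batchesSent);
    }
}
```

With 2500 docs and a batch size of 1000, this sends two full batches plus one final partial batch of 500 on the explicit `flush()`, i.e. three round trips instead of 2500.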