Hi Erick,

My post was scant on details. The numbers I gave for collection sizes are
projections for the future. I am in the midst of an upgrade that will be
completed within a few weeks. My concern is that I may not be able to
sustain the indexing throughput needed to reindex an entire collection
quickly enough (3 to 4 hours) for a large customer (100M docs).

Currently:
- a single Solr instance on one host, sharing memory and CPU with other
applications
- 4GB dedicated to Solr
- ~20M docs
- ~10GB index size
- using HttpSolrClient for all queries and updates (see the sketch below)
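
For reference, the update path currently looks roughly like this (a
minimal SolrJ sketch; the URL, collection name, and the docs iterable are
placeholders, and I'm assuming the 6.x-style Builder API):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // Sends docs to Solr in batches of 5000 over plain HTTP.
    static void indexAll(Iterable<SolrInputDocument> docs) throws Exception {
      // Base URL and collection name are placeholders.
      SolrClient client = new HttpSolrClient.Builder(
          "http://solrhost:8983/solr/mycollection").build();
      List<SolrInputDocument> batch = new ArrayList<>(5000);
      for (SolrInputDocument doc : docs) {
        batch.add(doc);
        if (batch.size() >= 5000) {
          client.add(batch);   // one round trip per batch
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        client.add(batch);     // flush the remainder
      }
      client.commit();         // in practice we lean on autoCommit settings
      client.close();
    }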

Very soon:
- two VMs dedicated to Solr (2 nodes)
- up to 16GB of available memory
- running in SolrCloud mode, so we can now scale horizontally
- all collections single-sharded with 2 replicas (see the sketch below)
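
In cloud mode I plan to switch to CloudSolrClient so updates are routed
through ZooKeeper. Something like the following; the ZK ensemble,
collection, and config names are placeholders, again assuming the
6.x-style API:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    // ZK-aware client; the ensemble address is a placeholder.
    CloudSolrClient cloud = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181/solr")
        .build();
    cloud.setDefaultCollection("mycollection");

    // One shard, two replicas, matching the layout above.
    CollectionAdminRequest
        .createCollection("mycollection", "myconfig", 1, 2)
        .process(cloud);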

All fields are stored. The scenario I gave uses atomic updates, sent in
large batches of 5000-10000 docs (see the sketch below). My use case is
perhaps different from most Solr setups: indexing throughput matters more
than QPS. We have very few concurrent users, but they perform massive
amounts of doc updates. Production performance is currently lousy (no
surprise: long GC pauses), and I have just begun tuning in a test
environment.
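
Concretely, each doc in a batch carries only the id plus per-field
modifier maps, roughly like this (field names and values are made up):

    import java.util.Collections;
    import org.apache.solr.common.SolrInputDocument;

    // An atomic update: id plus {"set": ...} / {"inc": ...} modifiers.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-12345");                               // placeholder id
    doc.addField("price", Collections.singletonMap("set", 42.0));  // overwrite a field
    doc.addField("stock", Collections.singletonMap("inc", -1));    // adjust a counter
    // ...accumulate 5000-10000 of these, then client.add(batch) as above.

Since all fields are stored, Solr can reconstruct the unchanged fields of
each doc internally, which is what makes atomic updates possible here.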

After a few more weeks of testing and tweaking I hope to reach 5000
updates/sec, but even that may not be enough. So my main concern is that
Solr cannot support this business model of updating entire collections
about once a day.
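
On the client side, one option I'm looking at is ConcurrentUpdateSolrClient,
which buffers updates in a queue and streams them over several background
threads. A sketch, assuming the 6.x-style Builder API and made-up
queue/thread sizes:

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;

    // Queues adds and streams them to Solr in the background.
    ConcurrentUpdateSolrClient bulk = new ConcurrentUpdateSolrClient.Builder(
        "http://solrhost:8983/solr/mycollection")  // placeholder URL
        .withQueueSize(20000)   // made-up sizing; needs testing
        .withThreadCount(4)
        .build();
    // bulk.add(doc) in a loop ... then:
    bulk.blockUntilFinished();  // drain the queue
    bulk.commit();
    bulk.close();

The caveat I've read about is that it logs and drops indexing errors
unless handleError is overridden, so error handling needs care.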


