On 10/31/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
Bigger batches before a commit will be more efficient in general... the only state that Solr keeps around before a commit is a HashTable<String,Integer> entry per unique id deleted or overwritten. You might be able to do your entire collection.
Note that _some_ care should be taken here as well. I recently tried to commit 3.9M documents in one go to an index that already contained every document (thus needing to delete them all) and ended up in a strange situation where the CPU spun for over a day with the Java heap maxed out (1.1GB). If you attempt less insane feats it will go better. DUH2.doDeletions() would also benefit greatly from sorting the id terms before looking them up in cases like this, as it would trigger optimizations in Lucene as well as being kinder to the OS's read-ahead buffers. A rough sketch of what I mean is below.
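Something along these lines, roughly (a sketch only, not the actual DirectUpdateHandler2 code; the shape of the pending-deletes map and the "uniqueKeyField" name are assumptions on my part):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    public class SortedDeletions {
      /** Delete the pending ids in term-sorted order rather than hash order. */
      public static void deleteSorted(IndexReader reader,
                                      Map<String,Integer> deleted,
                                      String uniqueKeyField) throws IOException {
        List<String> ids = new ArrayList<String>(deleted.keySet());
        Collections.sort(ids);  // walk the term dictionary in order
        for (String id : ids) {
          reader.deleteDocuments(new Term(uniqueKeyField, id));
        }
      }
    }

Since the ids come out of a hash map in effectively random order, sorting them first means the term lookups move through the term dictionary sequentially instead of seeking all over the index.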
If you have a multi-CPU server, you could increase indexing performance by using a multithreaded client to keep all the CPUs on the server busy.
I thought so, too, but it turns out there isn't much concurrent updating that can actually occur, if I'm reading the code correctly. DUH2.addDoc() calls exactly one of addConditionally, overwriteBoth, or allowDups, each of which adds the document inside a synchronized(this) block. This shouldn't be too hard to fix; I'm going to take a look at doing so.
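For anyone reading along, the pattern looks roughly like this (a simplified sketch, not the real DUH2 code; the class and method names here are mine):

    public class UpdateHandlerSketch {
      public void addDoc(Object cmd) throws Exception {
        // addConditionally / overwriteBoth / allowDups each boil down to:
        synchronized (this) {
          // the delete of any older version and the Lucene write both
          // happen inside the monitor, so only one client thread can be
          // adding a document through this handler at a time
          addToIndex(cmd);
        }
      }

      private void addToIndex(Object cmd) {
        // placeholder for the actual IndexWriter.addDocument() call
      }
    }

So even with many client threads, the adds end up serialized at the update handler.

-Mike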