On 2-Nov-07, at 11:41 AM, Jae Joo wrote:
Hi,
I have 6 million articles to be indexed by Solr and need your
recommendation.
I need to parse them and generate Solr-based XML files to post.
How about using Lucene directly instead?
In a short test, it looks like Solr-based indexing is faster than
direct indexing through Lucene.
I wouldn't recommend that. If you use persistent connections,
multiple threads, and more than one doc per update, you should achieve
comparable performance (about 10 docs/request is the right balance for
web-sized docs).
If you want to index directly, use embedded Solr, not Lucene directly
(see the wiki).
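
As a rough illustration of the "more than one doc per update" advice, here is a minimal Python sketch that groups documents into batches of 10 and renders each batch as a single Solr XML <add> message. The helper names (batches, build_add_xml) and the field names in the usage note are made up for this example; only the <add>/<doc>/<field> message shape comes from Solr's XML update format.

```python
from xml.sax.saxutils import escape, quoteattr


def batches(docs, size=10):
    """Yield successive lists of at most `size` documents."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch


def build_add_xml(batch):
    """Render one <add> message containing every document in the batch."""
    parts = ["<add>"]
    for doc in batch:
        parts.append("<doc>")
        for name, value in doc.items():
            # quoteattr/escape keep &, < and > from breaking the XML
            parts.append("<field name=%s>%s</field>"
                         % (quoteattr(name), escape(str(value))))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)
```

For example, build_add_xml([{"id": 1, "title": "A & B"}]) produces one <add> element with a single <doc>. Each message would then be POSTed to the update handler, and a <commit/> sent once at the end rather than after every batch.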
Did I do something wrong, and/or does Solr use multithreading or
something else to get good indexing performance?
It does use multiple threads if you connect to Solr using multiple
threads. But it doesn't do it behind the scenes if you aren't using
multiple threads.
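
To make the threading point concrete, here is a hedged Python sketch of a client that posts update messages from several threads, each thread reusing its own persistent HTTP connection. The host, port, and path defaults (localhost, 8983, /solr/update) are assumptions about a typical example setup, and make_http_sender and index_in_parallel are hypothetical helper names, not part of Solr.

```python
import http.client
import threading
from concurrent.futures import ThreadPoolExecutor


def make_http_sender(host="localhost", port=8983, path="/solr/update"):
    """Return a send(xml) function that reuses one connection per thread.

    The default host, port, and path are assumptions; adjust for your setup.
    """
    local = threading.local()

    def send(xml):
        conn = getattr(local, "conn", None)
        if conn is None:
            # Created lazily, once per worker thread, then kept open
            conn = local.conn = http.client.HTTPConnection(host, port)
        conn.request("POST", path, body=xml.encode("utf-8"),
                     headers={"Content-Type": "text/xml; charset=utf-8"})
        response = conn.getresponse()
        response.read()  # drain the body so the connection can be reused
        return response.status

    return send


def index_in_parallel(messages, send, workers=4):
    """Send each message through `send` from a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, messages))
```

Usage would look like index_in_parallel(xml_messages, make_http_sender(), workers=4), where xml_messages is an iterable of XML update strings. The worker count is a guess to tune by measurement; the parallelism comes entirely from the client side, as described above.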
Some possible differences:
1. Solr has more aggressive default buffering settings
(maxBufferedDocs, mergeFactor)
2. Solr trunk (if that is what you are using) is using a more recent
version of Lucene than the released 2.2
-Mike