Hi.
I have a question about how I can get Solr to index faster than it does
at the moment.
I have to index (and re-index) some 3-5 million documents. These
documents are preprocessed by a Java application that effectively
joins multiple database tables to build each SolrInputDocument.
What I'm seeing, however, is that the queue of documents that are ready to
be sent to the Solr server exceeds my preset limit, which tells me that
Solr somehow can't process the documents fast enough.
(I have created my own queue in front of SolrJ's StreamingUpdateSolrServer,
because it did not process the documents fast enough and the large number
of documents building up in its internal queue caused OutOfMemoryErrors.)
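
To make that setup concrete, here is a minimal sketch of the kind of bounded queue I mean. It is an illustration rather than my actual code: the names BoundedIndexQueue and DocumentSink are hypothetical, and DocumentSink only stands in for whatever call hands a document to SolrJ. It uses nothing but java.util.concurrent, so the producer blocks when the queue is full instead of running out of memory.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the code that sends one document to Solr. */
interface DocumentSink<T> {
    void send(T doc) throws Exception;
}

/** Bounded hand-off queue: submit() blocks when full, so memory use stays capped. */
class BoundedIndexQueue<T> {
    private final BlockingQueue<T> queue;
    private final Thread[] workers;
    private volatile boolean closed = false;

    BoundedIndexQueue(int capacity, int workerCount, final DocumentSink<T> sink) {
        queue = new ArrayBlockingQueue<T>(capacity);
        workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    try {
                        // Keep draining until shutdown was requested and the queue is empty.
                        while (!closed || !queue.isEmpty()) {
                            T doc = queue.poll(100, TimeUnit.MILLISECONDS);
                            if (doc != null) {
                                sink.send(doc);
                            }
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } catch (Exception e) {
                        // A real indexer would log and retry; this sketch just stops the worker.
                        e.printStackTrace();
                    }
                }
            });
            workers[i].start();
        }
    }

    /** Blocks while the queue is full instead of letting documents pile up. */
    void submit(T doc) throws InterruptedException {
        queue.put(doc);
    }

    /** Lets the workers finish the remaining documents, then waits for them. */
    void shutdown() throws InterruptedException {
        closed = true;
        for (Thread w : workers) {
            w.join();
        }
    }
}
```

The worker count is the knob I would expect to matter here: with one worker only one request is in flight at a time, which would match seeing a single busy core.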
I have an index that consists for 95% of IDs (Long). We don't do any
analysis on the fields that are being indexed. The schema is rather
straightforward.
Most fields look like:
<fieldType name="long" class="solr.LongField" omitNorms="true"/>
<field name="objectId" type="long" stored="true" indexed="true" required="true" />
<field name="listId" type="long" stored="false" indexed="true" multiValued="true"/>
The relevant part of solrconfig.xml:
<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>100</mergeFactor>
  <RAMBufferSizeMB>256</RAMBufferSizeMB>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>10000</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>10000</commitLockTimeout>
  <lockType>single</lockType>
</indexDefaults>
The machines I'm testing on have an
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
with 4GB of RAM.
They run Linux with Java 1.6.0_17, Tomcat 6 and Solr 1.4.
What I'm seeing is that the network almost never reaches more than 10%
of the 1GB/s connection.
CPU utilization is always below 25% (one core is used, the others are
not).
I don't see heavy disk I/O.
Also, while indexing, the memory consumption is:
Free memory: 212.15 MB Total memory: 509.12 MB Max memory: 2730.68 MB
In the beginning (with an empty index) I get 2ms per insert, but
this slows to 18-19ms per insert as the index grows. At 18ms per insert,
a full run of 5 million documents would take roughly 25 hours.
Are there any tips or tricks I can use to speed up my indexing? I
have a feeling that my machine is capable of doing more (using more
CPU cores). I just can't figure out how.
Thijs