One approach we have taken is decreasing the Solr logging level for
the posting session, described here (implemented for 1.4, but it should
be easy to port to 3.x):

http://dmitrykan.blogspot.com/2011/01/solr-speed-up-batch-posting.html
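
The gist, when the posting code runs in the same JVM as Solr (e.g. with
EmbeddedSolrServer) or the change is made in Solr's own logging setup, is
simply to raise the log threshold for the duration of the batch and restore
it afterwards. A rough sketch against the stock SLF4J-over-JUL binding (an
illustration of the idea, not the exact steps from the post):

    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class QuietPostingSession {
        public static void main(String[] args) {
            // Raise the threshold on Solr's loggers so the per-document
            // INFO messages are skipped while the batch is posted.
            Logger solrLogger = Logger.getLogger("org.apache.solr");
            Level previous = solrLogger.getLevel();
            solrLogger.setLevel(Level.WARNING);
            try {
                // ... run the posting session here ...
            } finally {
                // Restore the original level once posting is finished.
                solrLogger.setLevel(previous);
            }
        }
    }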

On 3/11/12, Yandong Yao <yydz...@gmail.com> wrote:
> I have similar issues when using DIH,
> and org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
> consumes most of the time when indexing 10K rows (each row is about 70K):
>     -  DIH nextRow takes about 10 seconds in total
>     -  If the index uses a whitespace tokenizer and lowercase filter, then
> addDoc() takes about 80 seconds
>     -  If the index uses a whitespace tokenizer, lowercase filter and WDF,
> then addDoc() takes about 112 seconds
>     -  If the index uses a whitespace tokenizer, lowercase filter, WDF and
> a Porter stemmer, then addDoc() takes about 145 seconds
>
> We have more than a million rows in total, and I am wondering whether I am
> doing something wrong, or whether there is any way to improve the
> performance of addDoc()?
>
> Thanks very much in advance!
>
>
> Following is the configure:
> 1) JVM:  -Xms256M -Xmx1048M -XX:MaxPermSize=512m
> 2) Solr version 3.5
> 3) solrconfig.xml (almost copied from Solr's example/solr directory):
>
>   <indexDefaults>
>
>     <useCompoundFile>false</useCompoundFile>
>
>     <mergeFactor>10</mergeFactor>
>     <!-- Sets the amount of RAM that may be used by Lucene indexing
>          for buffering added documents and deletions before they are
>          flushed to the Directory.  -->
>     <ramBufferSizeMB>64</ramBufferSizeMB>
>     <!-- If both ramBufferSizeMB and maxBufferedDocs is set, then
>          Lucene will flush based on whichever limit is hit first.
>       -->
>     <!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
>
>     <maxFieldLength>2147483647</maxFieldLength>
>     <writeLockTimeout>1000</writeLockTimeout>
>     <commitLockTimeout>10000</commitLockTimeout>
>
>     <lockType>native</lockType>
>   </indexDefaults>
>
> 2012/3/11 Peyman Faratin <pey...@robustlinks.com>
>
>> Hi
>>
>> I am trying to index 12MM docs faster than is currently happening in Solr
>> (using solrj). We have identified Solr's add method as the bottleneck (and
>> not commit, which is tuned OK through mergeFactor, maxRamBufferSize and
>> JVM RAM).
>>
>> Adding 1000 docs takes approximately 25 seconds. We are making sure we
>> add and commit in batches, and we've tried both CommonsHttpSolrServer and
>> EmbeddedSolrServer (assuming that removing the HTTP overhead would speed
>> things up with embedding), but the difference is marginal.
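
On the client side, batching the adds and committing only once at the end
usually helps, and StreamingUpdateSolrServer (several background threads
feeding Solr over HTTP) is worth trying in place of CommonsHttpSolrServer.
A rough sketch (the URL, field names and batch size below are placeholders
only):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            // Queue size 20, 4 background threads streaming adds to Solr.
            SolrServer server =
                new StreamingUpdateSolrServer("http://localhost:8983/solr", 20, 4);

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 100000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("content", "placeholder body text for doc " + i);
                batch.add(doc);

                // Send in chunks of 1000 instead of one request per document.
                if (batch.size() == 1000) {
                    server.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                server.add(batch);
            }
            // Commit once at the end rather than after every batch.
            server.commit();
        }
    }
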
>>
>> The docs being indexed have on average 20 fields, mostly indexed but
>> none stored. The major size contributors are two fields:
>>
>>        - content, and
>>        - shingledContent (populated using copyField of content).
>>
>> The length of the content field is (likely) Gaussian distributed (a few
>> large docs of 50-80K tokens, but the majority around 2K tokens). We use
>> shingledContent to support phrase queries and content for unigram queries
>> (following the advice in the Solr Enterprise Search Server book, p. 305,
>> section "The Solution: Shingling").
>>
>> Clearly the size of the docs contributes to the slow adds (confirmed by
>> removing these two fields, which halved the indexing time). We've also
>> tried compressed=true, but that is not working.
>>
>> Any guidance on how to support our application logic (without having to
>> change the schema too much) and speed up indexing (from the current 212
>> days for 12MM docs) would be much appreciated.
>>
>> thank you
>>
>> Peyman
>>
>>
>


-- 
Regards,

Dmitry Kan
