One approach we have taken is decreasing the Solr logging level for the
posting session, described here (implemented for 1.4, but should be easy to
port to 3.x):
http://dmitrykan.blogspot.com/2011/01/solr-speed-up-batch-posting.html
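A rough sketch of the idea (not the code from the post): raise the threshold
of the org.apache.solr loggers around the batch and restore it afterwards.
This assumes the default java.util.logging setup that Solr 1.4/3.x ships
with, and it only affects loggers in the same JVM, so it applies when you
post via EmbeddedSolrServer or from code inside the Solr webapp; for a
standalone server, make the equivalent change in the container's logging
configuration or from Solr's admin logging page. The class and method names
here are illustrative:

    import java.util.logging.Level;
    import java.util.logging.Logger;

    // Silence Solr's per-request INFO logging for the duration of the
    // batch, then restore whatever level was configured before.
    public final class QuietPosting {
        public static void postQuietly(Runnable postBatch) {
            Logger solrLog = Logger.getLogger("org.apache.solr");
            Level previous = solrLog.getLevel();   // may be null (inherited)
            solrLog.setLevel(Level.WARNING);
            try {
                postBatch.run();                   // your posting code
            } finally {
                solrLog.setLevel(previous);        // null restores inheritance
            }
        }
    }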
On 3/11/12, Yandong Yao <yydz...@gmail.com> wrote:
> I have similar issues using DIH, and
> org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
> consumes most of the time when indexing 10K rows (each row is about 70K):
>
> - DIH nextRow takes about 10 seconds in total
> - If the index uses a whitespace tokenizer and a lower case filter, the
>   addDoc() method takes about 80 seconds
> - If the index uses a whitespace tokenizer, lower case filter and WDF,
>   addDoc takes about 112 seconds
> - If the index uses a whitespace tokenizer, lower case filter, WDF and a
>   Porter stemmer, addDoc takes about 145 seconds
>
> We have more than a million rows in total, and I am wondering whether I am
> doing something wrong or whether there is any way to improve the
> performance of addDoc()?
>
> Thanks very much in advance!
>
> Following is the configuration:
>
> 1) JVM: -Xms256M -Xmx1048M -XX:MaxPermSize=512m
> 2) Solr version 3.5
> 3) solrconfig.xml (almost copied from Solr's example/solr directory):
>
> <indexDefaults>
>   <useCompoundFile>false</useCompoundFile>
>   <mergeFactor>10</mergeFactor>
>   <!-- Sets the amount of RAM that may be used by Lucene indexing
>        for buffering added documents and deletions before they are
>        flushed to the Directory. -->
>   <ramBufferSizeMB>64</ramBufferSizeMB>
>   <!-- If both ramBufferSizeMB and maxBufferedDocs is set, then
>        Lucene will flush based on whichever limit is hit first. -->
>   <!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
>   <maxFieldLength>2147483647</maxFieldLength>
>   <writeLockTimeout>1000</writeLockTimeout>
>   <commitLockTimeout>10000</commitLockTimeout>
>   <lockType>native</lockType>
> </indexDefaults>
>
> 2012/3/11 Peyman Faratin <pey...@robustlinks.com>
>
>> Hi
>>
>> I am trying to index 12MM docs faster than is currently happening in
>> Solr (using SolrJ). We have identified Solr's add method as the
>> bottleneck (and not commit, which is tuned OK through mergeFactor,
>> maxRamBufferSize and JVM RAM).
>>
>> Adding 1000 docs takes approximately 25 seconds. We make sure we add and
>> commit in batches, and we have tried both CommonsHttpSolrServer and
>> EmbeddedSolrServer (assuming that removing the HTTP overhead would speed
>> things up with embedding), but the difference is marginal.
>>
>> The docs being indexed have on average 20 fields, mostly indexed but
>> none stored. The major size contributors are two fields:
>>
>> - content, and
>> - shingledContent (populated using a copyField of content).
>>
>> The length of the content field is (likely) Gaussian distributed (a few
>> large docs of 50-80K tokens, but the majority around 2K tokens). We use
>> shingledContent to support phrase queries and content for unigram
>> queries (following the advice of the Solr Enterprise Search Server book,
>> p. 305, section "The Solution: Shingling").
>>
>> Clearly the size of the docs contributes to the slow adds (confirmed by
>> removing these two fields, which halved the indexing time). We have also
>> tried compressed=true, but that is not working.
>>
>> Any guidance on how to support our application logic (without having to
>> change the schema too much) and improve the indexing speed (from the
>> current 212 days for 12MM docs) would be much appreciated.
>>
>> thank you
>>
>> Peyman

--
Regards,

Dmitry Kan
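For the SolrJ batching Peyman describes, one option in the 3.x line is
StreamingUpdateSolrServer instead of CommonsHttpSolrServer: it queues added
documents and drains the queue with several background threads, so add()
returns quickly rather than blocking on one HTTP request per call. A minimal
sketch; the URL, queue size, thread count and the id field are illustrative
assumptions, not values from this thread:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndexer {
        public static void main(String[] args) throws Exception {
            // Buffers up to 20000 docs and posts them with 4 background
            // threads; add() just enqueues instead of blocking per request.
            SolrServer server = new StreamingUpdateSolrServer(
                    "http://localhost:8983/solr", 20000, 4);

            for (int i = 0; i < 12000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);          // assumed unique key
                doc.addField("content", "body text " + i);
                server.add(doc);        // queued, sent asynchronously
            }
            server.commit();            // single commit at the end of the batch
        }
    }

One caveat: in 3.x, errors from the background threads are only logged
unless you override handleError(), so check the Solr log for failed batches.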