Hey guys,

Thanks for your responses, but I still haven't seen much improvement.
*Are you posting through HTTP/SOLRJ?*

I am using the RSolr gem, which internally uses Ruby's HTTP library to POST
documents to Solr.

*Your script time 'T' includes time between sending POST request -to- the
response fetched after successful response ....right??*

Correct. It also includes the time taken to convert all those documents from
Ruby Hashes to XML.

*generate the ready-for-indexing XML documents on a file system*

Alain, I have around 6M documents to index. Do you mean I should convert all
of them into one XML file and then index that?

*are you calling commit after your batches or do an optimize by any chance?*

I am not optimizing, but I do autocommit every 100,000 docs. A sketch of my
batching flow is below.
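In case it helps, here is a stripped-down sketch of that flow -- not my
actual script; the connection URL and field names are placeholders:

    require 'rsolr'

    solr = RSolr.connect :url => 'http://localhost:8983/solr'  # placeholder

    # Stand-in for the real source of ~6M documents
    documents = (1..10_000).map { |i| { :id => i, :title => "doc #{i}" } }

    # Batches of at most 2,000 docs; RSolr serializes each batch of Hashes
    # to <add> XML and POSTs it over HTTP, so both steps fall inside T
    documents.each_slice(2_000) do |batch|
      solr.add batch
    end

    # No per-batch commit; Solr's autoCommit (maxDocs 100000) handles it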
*Pranav Prakash*

"temet nosce"

Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
Google <http://www.google.com/profiles/pranny>


On Fri, Oct 21, 2011 at 16:32, Simon Willnauer <simon.willna...@googlemail.com> wrote:

> On Wed, Oct 19, 2011 at 3:58 PM, Pranav Prakash <pra...@gmail.com> wrote:
> > Hi guys,
> >
> > I have set up a Solr instance, and upon attempting to index documents the
> > whole process is painfully slow. I will try to put as much info as I can
> > in this mail. Please feel free to ask for anything else that might be
> > required.
> >
> > I am sending documents in batches not exceeding 2,000. The size of each
> > batch varies but is usually around 10-15 MiB. My indexing script tells me
> > that Solr took T seconds to add N documents of size S. For the same data,
> > the add QTime in the Solr log is QT. Some sample data:
> >
> >       N     |        S         |   T   | QT (ms)
> > -----------+------------------+-------+--------
> >   390 docs |  3,478,804 Bytes | 14.5s |  2297
> >   852 docs |  6,039,535 Bytes | 25.3s |  4237
> >  1345 docs | 11,147,512 Bytes |   47s |  8543
> >  1147 docs |  9,457,717 Bytes |   44s |  2297
> >  1096 docs | 13,058,204 Bytes | 54.3s |  8782
> >
> > The time T includes the time to convert an array of Hash objects into
> > XML, POST it to Solr, and receive the response from Solr. Clearly, there
> > is a huge difference between T and QT. After a lot of effort, I have no
> > clue why these times do not match.
> >
> > The server has 16 cores and 48 GiB RAM. JVM options are -Xms5000M
> > -Xmx5000M -XX:+UseParNewGC
> >
> > I believe my indexing is slow. The relevant portions of my solrconfig.xml
> > are below. On a related note, every document has one dynamic field. At
> > this rate, it takes me ~30 hrs to do a full index of my database. I would
> > really appreciate the community's help in getting this indexing faster.
> >
> > <indexDefaults>
> >   <useCompoundFile>false</useCompoundFile>
> >   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
> >     <int name="maxMergeCount">10</int>
> >     <int name="maxThreadCount">10</int>
> >   </mergeScheduler>
> >   <ramBufferSizeMB>2048</ramBufferSizeMB>
> >   <maxMergeDocs>2147483647</maxMergeDocs>
> >   <maxFieldLength>3000000</maxFieldLength>
> >   <writeLockTimeout>1000</writeLockTimeout>
> >   <maxBufferedDocs>50000</maxBufferedDocs>
> >   <termIndexInterval>256</termIndexInterval>
> >   <mergeFactor>10</mergeFactor>
> >   <useCompoundFile>false</useCompoundFile>
> >   <!-- <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
> >     <int name="maxMergeAtOnceExplicit">19</int>
> >     <int name="segmentsPerTier">9</int>
> >   </mergePolicy> -->
> > </indexDefaults>
> >
> > <mainIndex>
> >   <unlockOnStartup>true</unlockOnStartup>
> >   <reopenReaders>true</reopenReaders>
> >   <deletionPolicy class="solr.SolrDeletionPolicy">
> >     <str name="maxCommitsToKeep">1</str>
> >     <str name="maxOptimizedCommitsToKeep">0</str>
> >   </deletionPolicy>
> >   <infoStream file="INFOSTREAM.txt">false</infoStream>
> > </mainIndex>
> >
> > <updateHandler class="solr.DirectUpdateHandler2">
> >   <autoCommit>
> >     <maxDocs>100000</maxDocs>
> >   </autoCommit>
> > </updateHandler>
> >
> >
> > *Pranav Prakash*
> >
> > "temet nosce"
> >
> > Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
> > Google <http://www.google.com/profiles/pranny>
>
> hey,
>
> are you calling commit after your batches or doing an optimize by any
> chance?
>
> I would suggest you stream your documents to Solr and commit only if you
> really need to. Set your RAM buffer to something between 256 and 320 MB
> and remove the maxBufferedDocs setting completely. You can also experiment
> with your merge settings a little; 10 merging threads seem like a lot. I
> know you have lots of CPU, but IO will be the bottleneck here.
>
> simon
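Simon, just to check that I read your suggestion right: the <indexDefaults>
section would become something like the following? The merge-thread counts
below are my guesses, since you only said ten threads seemed like a lot:

    <indexDefaults>
      <useCompoundFile>false</useCompoundFile>
      <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
        <!-- fewer merge threads; IO, not CPU, is the expected bottleneck -->
        <int name="maxMergeCount">4</int>
        <int name="maxThreadCount">3</int>
      </mergeScheduler>
      <!-- 256-320 MB instead of 2048, as suggested -->
      <ramBufferSizeMB>320</ramBufferSizeMB>
      <!-- maxBufferedDocs removed completely -->
      <mergeFactor>10</mergeFactor>
    </indexDefaults>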