Hey guys,

Thanks for your responses, but I still haven't seen much improvement.
*Are you posting through HTTP/SOLRJ?*

I am using the RSolr gem, which internally uses Ruby's HTTP library to POST
documents to Solr.

*Your script time 'T' includes time between sending POST request -to- the
response fetched after successful response ....right??*

Correct. It also includes the time taken to convert all those documents from
Ruby Hashes to XML.

*generate the ready-for-indexing XML documents on a file system*

Alain, I have around 6M documents to index. Do you mean I should convert all
of them into one XML file and then index that?

*are you calling commit after your batches or do an optimize by any chance?*

I am not optimizing, but I do autocommit every 100,000 docs. A sketch of my
batching flow is below.
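In case it helps, here is a stripped-down sketch of that flow -- not my
actual script; the connection URL and field names are placeholders:

    require 'rsolr'

    solr = RSolr.connect :url => 'http://localhost:8983/solr'  # placeholder

    # Stand-in for the real source of ~6M documents
    documents = (1..10_000).map { |i| { :id => i, :title => "doc #{i}" } }

    # Batches of at most 2,000 docs; RSolr serializes each batch of Hashes
    # to <add> XML and POSTs it over HTTP, so both steps fall inside T
    documents.each_slice(2_000) do |batch|
      solr.add batch
    end

    # No per-batch commit; Solr's autoCommit (maxDocs 100000) handles it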
*Pranav Prakash*

"temet nosce"

Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
Google <http://www.google.com/profiles/pranny>


On Fri, Oct 21, 2011 at 16:32, Simon Willnauer <simon.willna...@googlemail.com> wrote:

> On Wed, Oct 19, 2011 at 3:58 PM, Pranav Prakash <pra...@gmail.com> wrote:
> > Hi guys,
> >
> > I have set up a Solr instance, and upon attempting to index documents the
> > whole process is painfully slow. I will try to put as much info as I can
> > in this mail. Please feel free to ask for anything else that might be
> > required.
> >
> > I am sending documents in batches not exceeding 2,000. The size of each
> > batch varies but is usually around 10-15 MiB. My indexing script tells me
> > that Solr took T seconds to add N documents of size S. For the same data,
> > the add QTime in the Solr log is QT. Some sample data:
> >
> >       N     |        S         |   T   | QT (ms)
> > -----------+------------------+-------+--------
> >   390 docs |  3,478,804 Bytes | 14.5s |  2297
> >   852 docs |  6,039,535 Bytes | 25.3s |  4237
> >  1345 docs | 11,147,512 Bytes |   47s |  8543
> >  1147 docs |  9,457,717 Bytes |   44s |  2297
> >  1096 docs | 13,058,204 Bytes | 54.3s |  8782
> >
> > The time T includes the time to convert an array of Hash objects into
> > XML, POST it to Solr, and receive the response from Solr. Clearly, there
> > is a huge difference between T and QT. After a lot of effort, I have no
> > clue why these times do not match.
> >
> > The server has 16 cores and 48 GiB RAM. JVM options are -Xms5000M
> > -Xmx5000M -XX:+UseParNewGC
> >
> > I believe my indexing is slow. The relevant portions of my solrconfig.xml
> > are below. On a related note, every document has one dynamic field. At
> > this rate, it takes me ~30 hrs to do a full index of my database. I would
> > really appreciate the community's help in getting this indexing faster.
> >
> > <indexDefaults>
> >   <useCompoundFile>false</useCompoundFile>
> >   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
> >     <int name="maxMergeCount">10</int>
> >     <int name="maxThreadCount">10</int>
> >   </mergeScheduler>
> >   <ramBufferSizeMB>2048</ramBufferSizeMB>
> >   <maxMergeDocs>2147483647</maxMergeDocs>
> >   <maxFieldLength>3000000</maxFieldLength>
> >   <writeLockTimeout>1000</writeLockTimeout>
> >   <maxBufferedDocs>50000</maxBufferedDocs>
> >   <termIndexInterval>256</termIndexInterval>
> >   <mergeFactor>10</mergeFactor>
> >   <useCompoundFile>false</useCompoundFile>
> >   <!-- <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
> >     <int name="maxMergeAtOnceExplicit">19</int>
> >     <int name="segmentsPerTier">9</int>
> >   </mergePolicy> -->
> > </indexDefaults>
> >
> > <mainIndex>
> >   <unlockOnStartup>true</unlockOnStartup>
> >   <reopenReaders>true</reopenReaders>
> >   <deletionPolicy class="solr.SolrDeletionPolicy">
> >     <str name="maxCommitsToKeep">1</str>
> >     <str name="maxOptimizedCommitsToKeep">0</str>
> >   </deletionPolicy>
> >   <infoStream file="INFOSTREAM.txt">false</infoStream>
> > </mainIndex>
> >
> > <updateHandler class="solr.DirectUpdateHandler2">
> >   <autoCommit>
> >     <maxDocs>100000</maxDocs>
> >   </autoCommit>
> > </updateHandler>
> >
> >
> > *Pranav Prakash*
> >
> > "temet nosce"
> >
> > Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
> > Google <http://www.google.com/profiles/pranny>
>
> hey,
>
> are you calling commit after your batches or doing an optimize by any
> chance?
>
> I would suggest you stream your documents to Solr and commit only if you
> really need to. Set your RAM buffer to something between 256 and 320 MB
> and remove the maxBufferedDocs setting completely. You can also experiment
> with your merge settings a little; 10 merging threads seem like a lot. I
> know you have lots of CPU, but IO will be the bottleneck here.
>
> simon
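Simon, just to check that I read your suggestion right: the <indexDefaults>
section would become something like the following? The merge-thread counts
below are my guesses, since you only said ten threads seemed like a lot:

    <indexDefaults>
      <useCompoundFile>false</useCompoundFile>
      <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
        <!-- fewer merge threads; IO, not CPU, is the expected bottleneck -->
        <int name="maxMergeCount">4</int>
        <int name="maxThreadCount">3</int>
      </mergeScheduler>
      <!-- 256-320 MB instead of 2048, as suggested -->
      <ramBufferSizeMB>320</ramBufferSizeMB>
      <!-- maxBufferedDocs removed completely -->
      <mergeFactor>10</mergeFactor>
    </indexDefaults>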