Hello,
 
       I'm trying to execute a parallel DIH process and am running into heap-related 
issues; hoping somebody has experienced this and can recommend some 
options.
 
       Using Solr 3.5 on CentOS.
       Currently have JVM heap 4GB min, 8GB max.
 
     When executing the entities sequentially (the default, with entities executing 
in sequence), my heap never exceeds 3GB. When executing the parallel 
process, everything runs fine for roughly an hour, then I reach the 8GB max 
heap size and the process stalls/fails.
 
     More specifically, here's how I'm executing the parallel import process: I 
target a logical range (i.e., WHERE some field BETWEEN 'SOME VALUE' AND 'SOME 
VALUE') within each entity query, and within solrconfig.xml, I've created 
corresponding data import handlers, one for each of these entities.
 
My total fetched row count is 9M records.
 
And when I initiate the import, I call each handler, similar to the below 
(obviously I've stripped out my server & naming conventions).
 
http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting1]&clean=true
 
http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting2]
 
 
I assume that when doing this, only the first import request needs to contain 
the clean=true param. 
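For what it's worth, here's a minimal sketch of how I kick the imports off so they overlap (host, core, handler, and entity names below are placeholders standing in for the real ones I stripped out):

```shell
#!/bin/sh
# Sketch of the parallel kickoff. All names here are placeholders.
SOLR="http://localhost:8983/solr/mycore"

# Only the first request cleans the index; the second explicitly does not,
# so it won't wipe the documents the first import is still adding.
URL1="$SOLR/dataimport1?command=full-import&entity=entity1&clean=true"
URL2="$SOLR/dataimport2?command=full-import&entity=entity2&clean=false"

echo "$URL1"
echo "$URL2"
# In practice each request is fired in the background so they run in parallel:
#   curl "$URL1" &
#   curl "$URL2" &
#   wait
```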
 
I've divided each import query to target roughly the same amount of data, and 
in solrconfig.xml, I've tried various things in hopes of reducing heap usage.
 
Here's my current config: 
 
 <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>15</mergeFactor>    <!-- I've experimented with 10, 15, 25 and 
haven't seen much difference -->
    <ramBufferSizeMB>100</ramBufferSizeMB> 
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>10000</commitLockTimeout>
    <lockType>single</lockType>
  </indexDefaults>
  <mainIndex>
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>100</ramBufferSizeMB>  <!-- I've bumped this up from 32 
--> 
    <mergeFactor>15</mergeFactor>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>
    <unlockOnStartup>false</unlockOnStartup>
  </mainIndex>

 
<updateHandler class="solr.DirectUpdateHandler2">
   <autoCommit>
      <maxTime>60000</maxTime> <!-- I've experimented with various times here 
as well --> 
      <maxDocs>25000</maxDocs> <!-- I've experimented with 25k, 500k, 100k --> 
    </autoCommit>
    <maxPendingDeletes>100000</maxPendingDeletes>
 </updateHandler>

 
What gets tricky is finding the sweet spot with these parameters, so I'm wondering 
if anybody has recommendations for an optimal config. Also, regarding 
autoCommit, I've even turned that feature off, but then my heap reaches its max 
sooner. I'm also wondering, though, what the difference would be between using 
autoCommit and passing the commit=true param on each import query.
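In case it helps frame that last question: my understanding (happy to be corrected) is that autoCommit commits periodically while the import runs, which bounds how many uncommitted documents accumulate, whereas commit=true on the request itself only commits once when the whole import finishes. The request form I'd use for the latter (placeholder names again):

```shell
#!/bin/sh
# Sketch: an import request with an explicit commit=true, which (as I
# understand it) produces a single commit at the end of the import,
# versus autoCommit, which commits periodically during the run.
# Host, core, handler, and entity names are placeholders.
SOLR="http://localhost:8983/solr/mycore"

URL="$SOLR/dataimport1?command=full-import&entity=entity1&commit=true"
echo "$URL"
# curl "$URL"
```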
 
Thanks in advance!
Mike
