Thanks for the response.
 
Here's the scrubbed version of my DIH: http://apaste.info/6uGH 
 
It contains more or less everything I'm doing; it's pretty straightforward. One 
thing to note, and I don't know whether it's a bug or not: the batchSize="-1" 
streaming feature doesn't seem to work, at least with the Informix JDBC driver. 
I currently have batchSize set to "500", but I've tested various other values, 
including 5000 and 10000. I'm aware that behind the scenes this should just be 
setting the JDBC fetch size, but it's a bit puzzling that I see no difference 
regardless of what value I use. One of our DBAs told me that the fetch size is 
set as a global DB parameter and can't be modified (which I haven't looked into 
further).
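 
For reference, the dataSource declaration in my data-config.xml is along these 
lines; the driver class, URL, and credentials below are placeholders rather 
than my actual (scrubbed) values, and other attributes are omitted:
 
  <dataSource type="JdbcDataSource"
              driver="com.informix.jdbc.IfxDriver"
              url="jdbc:informix-sqli://dbhost:1526/mydb:INFORMIXSERVER=myserver"
              user="someuser"
              password="somepass"
              batchSize="500"/>
  <!-- batchSize="-1" (streaming) made no visible difference with the Informix driver -->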
 
As far as heap patterns go, I watch the process via Wily and see a GC every 15 
minutes or so, but each collection becomes less frequent and reclaims less than 
the previous one. It's almost as if some memory is never released, until usage 
eventually catches up to the max heap size.
 
I did wonder whether there might be some locking issues, which is why I made 
the following modifications to the dataSource:
 
readOnly="true" transactionIsolation="TRANSACTION_READ_UNCOMMITTED"
 
What do you recommend for the mergeFactor, ramBufferSizeMB and autoCommit 
options? My general understanding is that the higher the mergeFactor, the less 
frequent the merges, which should improve indexing time but slow down query 
response time. I also read somewhere that increasing ramBufferSizeMB should 
help prevent frequent merges, but I'm confused as to why I didn't really see an 
improvement; perhaps my combination of these values wasn't right in relation to 
my total fetch size.
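 
For illustration, these are the knobs I mean in solrconfig.xml; the numbers 
below are only placeholders, not a recommendation or the exact values I've 
tried:
 
  <indexDefaults>
    <ramBufferSizeMB>256</ramBufferSizeMB>
    <mergeFactor>20</mergeFactor>
  </indexDefaults>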
 
Also, my impression is that the lower the autoCommit maxDocs/maxTime numbers 
(i.e. the defaults), the better for memory management, but at a cost to 
indexing time, since you pay the overhead of committing. That is a number I've 
been experimenting with as well, and I have seen some variation in heap trends, 
but unfortunately I haven't completed the job with any configuration yet, 
although I did get very close. I'd hate to throw additional memory at the 
problem if there is something else I can tweak.
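 
For reference, the autoCommit section I've been adjusting looks roughly like 
this (again, the values here are only illustrative):
 
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>25000</maxDocs>
      <maxTime>300000</maxTime> <!-- milliseconds -->
    </autoCommit>
  </updateHandler>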
 
Thanks!
Mike
 

From: Shawn Heisey <s...@elyograg.org>
To: solr-user@lucene.apache.org 
Sent: Wednesday, June 26, 2013 12:13 PM
Subject: Re: Parallal Import Process on same core. Solr 3.5


On 6/26/2013 10:58 AM, Mike L. wrote:
>  
> Hello,
>  
>        I'm trying to execute a parallel DIH process and running into heap 
>related issues, hoping somebody has experienced this and can recommend some 
>options..
>  
>        Using Solr 3.5 on CentOS.
>        Currently have JVM heap 4GB min , 8GB max
>  
>      When executing the entities in a sequential process (entities executing 
>in sequence by default), my heap never exceeds 3GB. When executing the 
>parallel process, everything runs fine for roughly an hour, then I reach the 
>8GB max heap size and the process stalls/fails.
>  
>      More specifically, here's how I'm executing the parallel import process: 
>I target a logical range (i.e WHERE some field BETWEEN 'SOME VALUE' AND 'SOME 
>VALUE') within my entity queries. And within Solrconfig.xml, I've created 
>corresponding data import handlers, one for each of these entities.
>  
> My total rows fetch/count is 9M records.
>  
> And when I initiate the import, I call each one, similar to the below 
> (obviously I've stripped out my server & naming conventions).
>  
> http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting1]&clean=true
>  
> http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting2]
>  
>  
> I assume that when doing this, only the first import request needs to contain 
> the clean=true param. 
>  
> I've divided each import query to target roughly the same amount of data, and 
> in solrconfig, I've tried various things in hopes to reduce heap size.

Thanks for including some solrconfig snippets, but I think what we
really need is your DIH configuration(s).  Use a pastebin site and
choose the proper document type.  http://apaste.info/ is available and
the proper type there would be (X)HTML.  If you need to sanitize these
to remove host/user/pass, please replace the values with something else
rather than deleting them entirely.

With full-import, clean defaults to true, so including it doesn't change
anything.  What I would actually do is have clean=true on the first
import you run, then after waiting a few seconds to be sure it is
running, start the others with clean=false so that they don't do ANOTHER
clean.
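
Using the URLs from your earlier message, that would look something like this:

http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting1]&clean=true

(wait a few seconds, then)

http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting2]&clean=false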

I suspect that you might be running into JDBC driver behavior where the
entire result set is being buffered into RAM.

Thanks,
Shawn
