On 6/26/2013 1:36 PM, Mike L. wrote:
> Here's the scrubbed version of my DIH: http://apaste.info/6uGH
> It contains everything I'm doing, more or less... pretty straightforward. One thing
> to note, and I don't know if this is a bug or not: the batchSize="-1" streaming
> feature doesn't seem to work, at least with the Informix JDBC drivers. I set the
> batchSize to "500", but have also tested it with various numbers, including 5000
> and 10000. I'm aware that behind the scenes this should just be setting the
> fetchSize, but it's a bit puzzling that I don't see a difference regardless of
> what value I actually use. I was told by one of our DBAs that our value is set
> as a global DB param and can't be modified (which I haven't looked into since).
Setting the batchSize to -1 causes DIH to set fetchSize to
Integer.MIN_VALUE (around negative two billion), which seems to be a
MySQL-specific hack to enable result streaming. I've never heard of it
working on any other JDBC driver.
Assuming that the Informix JDBC driver is actually honoring the
fetchSize, setting batchSize in the DIH config should be enough. If
it's not, then it's a bug in the JDBC driver or possibly a server
misconfiguration.
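As a minimal sketch (this is not the actual DIH source; the class and method names are mine), the batchSize-to-fetchSize mapping DIH performs works roughly like this:

```java
// Sketch of how DataImportHandler's JdbcDataSource interprets the
// batchSize attribute from the dataSource element in the DIH config.
public class FetchSizeSketch {

    static int resolveFetchSize(int batchSize) {
        if (batchSize == -1) {
            // MySQL Connector/J treats fetchSize == Integer.MIN_VALUE as
            // "stream rows one at a time" -- other drivers (Informix
            // included) may simply reject or ignore a negative fetch size.
            return Integer.MIN_VALUE;
        }
        // Any other value is passed through to Statement.setFetchSize(),
        // which the driver is free to treat as a hint only.
        return batchSize;
    }

    public static void main(String[] args) {
        System.out.println(resolveFetchSize(-1));   // -2147483648
        System.out.println(resolveFetchSize(500));  // 500
    }
}
```

Since JDBC defines fetchSize as a hint, a driver that ignores it entirely is still technically conformant, which would explain seeing no difference between 500, 5000, and 10000.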
> As far as HEAP patterns go, I watch the process via WILY and notice that GC
> occurs every 15 minutes or so, but each collection becomes less frequent and
> reclaims less than the previous one. It's almost as if some memory is never
> released until it eventually catches up to the max heap size.
> I did assume that perhaps there could have been some locking issues, which is
> why I made the following modifications:
> readOnly="true" transactionIsolation="TRANSACTION_READ_UNCOMMITTED"
I can't really comment here. It does appear that the Informix JDBC
driver is not something you can download from IBM's website without
paying them money. I would suggest going to IBM (or an Informix-related
support avenue) for some help, ESPECIALLY if you've paid money for it.
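For reference, those two attributes belong on the dataSource element in the DIH config. A sketch (the driver class is the real Informix one, but the host, port, database, and server names here are placeholders):

```xml
<dataSource type="JdbcDataSource"
            driver="com.informix.jdbc.IfxDriver"
            url="jdbc:informix-sqli://dbhost:1526/mydb:INFORMIXSERVER=myserver"
            readOnly="true"
            transactionIsolation="TRANSACTION_READ_UNCOMMITTED"
            batchSize="500"/>
```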
> What do you recommend for the mergeFactor, ramBufferSize and autoCommit options?
> My general understanding is that the higher the mergeFactor, the less frequent
> the merges, which should improve index time but slow down query response time.
> I also read somewhere that an increase in the ramBufferSize should help prevent
> frequent merges... but I'm confused why I didn't really see an improvement...
> perhaps my combination of these values wasn't right in relation to my total fetch size.
Of these, ramBufferSizeMB is the only one that should have a
*significant* effect on RAM usage, and at a value of 100, I would not
expect there to be a major issue unless you are doing a lot of imports
at the same time.
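For reference, on Solr 3.5 those two settings live in solrconfig.xml and would look roughly like this (the values shown are examples, not recommendations):

```xml
<indexDefaults>
  <!-- Flush buffered documents to a new segment after ~100MB of RAM used -->
  <ramBufferSizeMB>100</ramBufferSizeMB>
  <!-- Higher values mean fewer, larger merges at index time,
       but more segments on disk to search across -->
  <mergeFactor>10</mergeFactor>
</indexDefaults>
```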
Because you are using Solr 3.5, if you do not need your import results
to be visible until the end, I wouldn't worry about using autoCommit.
If you were using Solr 4.x, I would recommend that you turn autoCommit
on, but with openSearcher set to false.
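On Solr 4.x, that combination in solrconfig.xml would look roughly like this (the maxTime value is an example, not a recommendation):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- Hard-commit at most every 60 seconds to flush the transaction log -->
    <maxTime>60000</maxTime>
    <!-- Don't open a new searcher on these commits, so autoCommit stays
         cheap and import results only become visible on the final commit -->
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```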
> Also, my impression is that the lower the autoCommit maxDocs/maxTime numbers
> (i.e. the defaults), the better for memory management, but at a cost in index
> time, as you pay for the overhead of committing. That is a number I've been
> experimenting with as well, and I have seen some variation in heap trends, but
> unfortunately I have not completed the job with any config... I did get very close...
> I'd hate to throw additional memory at the problem if there is something else I
> can tweak.
General impressions: Unless the amount of data involved in each Solr
document is absolutely enormous, this is very likely due to bugs (memory
leaks or fetchSize problems) in the Informix JDBC driver. I did find the
following page, but it's REALLY REALLY old, which hopefully means that
it doesn't apply:
http://www-01.ibm.com/support/docview.wss?uid=swg21260832
If your documents ARE huge, then you probably need to give more memory
to the java heap ... but you might still have memory leak bugs in the
JDBC driver.
When it comes to Java and Lucene/Solr, IBM has a *terrible* track
record, especially for people using the IBM Java VM. I would not be
surprised if their JDBC driver is plagued by similar problems. If you
do find a support resource and they tell you that you should change your
JDBC code to work differently, then you need to tell them that you can't
change the JDBC code and that they need to give you a configuration URL
workaround.
Here's another possible bug that could cause memory leaks:
http://www-01.ibm.com/support/docview.wss?uid=swg1IC58469
You might ask whether the problem could be a memory leak in Solr. It's
always possible, but I've had a lot of experience with DIH from MySQL on
Solr 1.4.0, 1.4.1, 3.2.0, 3.5.0, and 4.2.1. I've never seen any signs
of a leak.
Thanks,
Shawn