On 6/26/2013 1:36 PM, Mike L. wrote:
> Here's the scrubbed version of my DIH: http://apaste.info/6uGH

> It contains everything I'm more or less doing... pretty straightforward. One thing to note, and I
> don't know if this is a bug or not, but the batchSize="-1" streaming feature doesn't seem
> to work, at least with the Informix JDBC driver. I set the batchSize to "500", but have
> tested it with various numbers, including 5000 and 10000. I'm aware that behind the scenes this should
> just be setting the fetchSize, but it's a bit puzzling why I don't see a difference regardless of
> what value I actually use. I was told by one of our DBAs that our value is set as a global DB
> parameter and can't be modified (which I haven't looked into since).

Setting the batchSize to -1 causes DIH to set fetchSize to Integer.MIN_VALUE (around negative two billion), which seems to be a MySQL-specific hack to enable result streaming. I've never heard of it working on any other JDBC driver.
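For illustration, here's a minimal standalone sketch of the MySQL Connector/J streaming idiom that batchSize="-1" maps to. This is not DIH's actual code, and the URL, credentials, and query are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class MySqlStreamingSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder URL, credentials, and query -- illustration only.
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/test", "user", "password");
            // MySQL Connector/J only streams rows when the statement is
            // forward-only and read-only AND fetchSize == Integer.MIN_VALUE.
            Statement stmt = conn.createStatement(
                    ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
            stmt.setFetchSize(Integer.MIN_VALUE);
            ResultSet rs = stmt.executeQuery("SELECT id, data FROM some_table");
            while (rs.next()) {
                // Process one row at a time instead of buffering the whole result.
            }
            rs.close();
            stmt.close();
            conn.close();
        }
    }

Other drivers generally reject or simply ignore a negative fetch size, which would explain why -1 does nothing useful with Informix.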

Assuming that the Informix JDBC driver is actually honoring the fetchSize, setting batchSize in the DIH config should be enough. If it's not, then it's a bug in the JDBC driver or possibly a server misconfiguration.
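For reference, a bare-bones sketch of the relevant dataSource element in a DIH config; the driver class, URL, and credentials below are placeholders, not taken from your paste:

    <dataSource type="JdbcDataSource"
                driver="com.informix.jdbc.IfxDriver"
                url="jdbc:informix-sqli://dbhost:1526/mydb:INFORMIXSERVER=myserver"
                user="solr"
                password="secret"
                batchSize="500"/>

With a cooperative driver, batchSize is passed straight through to the JDBC fetch size, so 500 versus 5000 should at least change the memory/round-trip tradeoff on the database side.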

> As far as heap patterns go, I watch the process via Wily and notice that GC occurs
> every 15 minutes or so, but each collection becomes less frequent and reclaims less than the
> previous one. It's almost as if some memory is never released until it
> eventually catches up to the max heap size.

> I assumed there might have been some locking issues, which is
> why I made the following modifications:
>
> readOnly="true" transactionIsolation="TRANSACTION_READ_UNCOMMITTED"

I can't really comment here. It does appear that the Informix JDBC driver is not something you can download from IBM's website without paying them money. I would suggest going to IBM (or an Informix-related support avenue) for some help, ESPECIALLY if you've paid money for it.

> What do you recommend for the mergeFactor, ramBufferSizeMB, and autoCommit options?
> My general understanding is that the higher the mergeFactor, the less frequent the
> merges, which should improve indexing time but slow down query response time. I
> also read somewhere that increasing ramBufferSizeMB should help prevent
> frequent merges... but I'm confused about why I didn't really see an improvement... perhaps
> my combination of these values wasn't right in relation to my total fetch size.

Of these, ramBufferSizeMB is the only one that should have a *significant* effect on RAM usage, and at a value of 100, I would not expect there to be a major issue unless you are doing a lot of imports at the same time.
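For reference, here is roughly where those two settings live in a 3.x solrconfig.xml; the numbers are only illustrative, not recommendations:

    <indexDefaults>
      <ramBufferSizeMB>100</ramBufferSizeMB>
      <mergeFactor>10</mergeFactor>
    </indexDefaults>

ramBufferSizeMB caps how much indexed data is buffered in memory before a segment is flushed, so on its own a value of 100 only accounts for roughly 100MB of heap per active index writer.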

Because you are using Solr 3.5, if you do not need your import results to be visible until the end, I wouldn't worry about using autoCommit. If you were using Solr 4.x, I would recommend that you turn autoCommit on, but with openSearcher set to false.
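On 4.x that would look something like this in solrconfig.xml (the interval is just an example):

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxTime>15000</maxTime>
        <openSearcher>false</openSearcher>
      </autoCommit>
    </updateHandler>

With openSearcher set to false, the hard commits flush segments and the transaction log to disk without opening a new searcher, so you keep memory in check without paying for cache warming on every commit.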

> Also, my impression is that the lower the autoCommit maxDocs/maxTime numbers (i.e.
> the defaults), the better for memory management, but at a cost in indexing time as you
> pay for the overhead of committing. That is a number I've been experimenting
> with as well, and I have seen some variations in heap trends, but unfortunately I
> have not completed the job quite yet with any config... I did get very close.
> I'd hate to throw additional memory at the problem if there is something else I
> can tweak.

General impressions: Unless the amount of data involved in each Solr document is absolutely enormous, this is very likely a bug (a memory leak or a fetchSize problem) in the Informix JDBC driver. I did find the following page, but it's REALLY REALLY old, which hopefully means that it doesn't apply.

http://www-01.ibm.com/support/docview.wss?uid=swg21260832

If your documents ARE huge, then you probably need to give more memory to the Java heap ... but you might still have memory leak bugs in the JDBC driver.
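If it does come to raising the heap, that's a JVM startup option rather than anything in Solr's config; with the example Jetty that ships with Solr it is just something like this (the size is illustrative):

    java -Xmx4g -jar start.jar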

When it comes to Java and Lucene/Solr, IBM has a *terrible* track record, especially for people using the IBM Java VM. I would not be surprised if their JDBC driver is plagued by similar problems. If you do find a support resource and they tell you that you should change your JDBC code to work differently, then you need to tell them that you can't change the JDBC code and that they need to give you a configuration URL workaround.

Here's another possibility of a bug that causes memory leaks:

http://www-01.ibm.com/support/docview.wss?uid=swg1IC58469

You might ask whether the problem could be a memory leak in Solr. It's always possible, but I've had a lot of experience with DIH from MySQL on Solr 1.4.0, 1.4.1, 3.2.0, 3.5.0, and 4.2.1. I've never seen any signs of a leak.

Thanks,
Shawn
