On 6/26/2013 1:36 PM, Mike L. wrote:
> Here's the scrubbed version of my DIH: http://apaste.info/6uGH
> It contains everything I'm doing, more or less... pretty straightforward. One thing
> to note, and I don't know if this is a bug or not: the batchSize="-1" streaming
> feature doesn't seem to work, at least with the Informix JDBC drivers. I set the
> batchSize to "500", but have also tested it with various numbers, including 5000
> and 10000. I'm aware that behind the scenes this should just be setting the
> fetchSize, but it's a bit puzzling that I don't see a difference regardless of
> what value I actually use. I was told by one of our DBAs that our value is set
> as a global DB param and can't be modified (which I haven't looked into since).
Setting the batchSize to -1 causes DIH to set fetchSize to
Integer.MIN_VALUE (around negative two billion), which seems to be a
MySQL-specific hack to enable result streaming. I've never heard of it
working on any other JDBC driver.
Assuming that the Informix JDBC driver is actually honoring the
fetchSize, setting batchSize in the DIH config should be enough. If
it's not, then it's a bug in the JDBC driver or possibly a server
misconfiguration.
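As a minimal sketch (this is not the actual DIH source; the class and method names are mine), the batchSize-to-fetchSize mapping DIH performs works roughly like this:

```java
// Sketch of how DataImportHandler's JdbcDataSource interprets the
// batchSize attribute from the dataSource element in the DIH config.
public class FetchSizeSketch {

    static int resolveFetchSize(int batchSize) {
        if (batchSize == -1) {
            // MySQL Connector/J treats fetchSize == Integer.MIN_VALUE as
            // "stream rows one at a time" -- other drivers (Informix
            // included) may simply reject or ignore a negative fetch size.
            return Integer.MIN_VALUE;
        }
        // Any other value is passed through to Statement.setFetchSize(),
        // which the driver is free to treat as a hint only.
        return batchSize;
    }

    public static void main(String[] args) {
        System.out.println(resolveFetchSize(-1));   // -2147483648
        System.out.println(resolveFetchSize(500));  // 500
    }
}
```

Since JDBC defines fetchSize as a hint, a driver that ignores it entirely is still technically conformant, which would explain seeing no difference between 500, 5000, and 10000.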
> As far as HEAP patterns go, I watch the process via WILY and notice that GC
> occurs every 15 minutes or so, but each collection becomes less frequent and
> reclaims less than the previous one. It's almost as if some memory is never
> released until it eventually catches up to the max heap size.
> I did assume that perhaps there could have been some locking issues, which is
> why I made the following modifications:
> readOnly="true" transactionIsolation="TRANSACTION_READ_UNCOMMITTED"
I can't really comment here. It does appear that the Informix JDBC
driver is not something you can download from IBM's website without
paying them money. I would suggest going to IBM (or an Informix-related
support avenue) for some help, ESPECIALLY if you've paid money for it.
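For reference, those two attributes belong on the dataSource element in the DIH config. A sketch (the driver class is the real Informix one, but the host, port, database, and server names here are placeholders):

```xml
<dataSource type="JdbcDataSource"
            driver="com.informix.jdbc.IfxDriver"
            url="jdbc:informix-sqli://dbhost:1526/mydb:INFORMIXSERVER=myserver"
            readOnly="true"
            transactionIsolation="TRANSACTION_READ_UNCOMMITTED"
            batchSize="500"/>
```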
> What do you recommend for the mergeFactor, ramBufferSize and autoCommit options?
> My general understanding is that the higher the mergeFactor, the less frequent
> the merges, which should improve index time but slow down query response time.
> I also read somewhere that an increase in the ramBufferSize should help prevent
> frequent merges... but I'm confused why I didn't really see an improvement...
> perhaps my combination of these values wasn't right in relation to my total fetch size.
Of these, ramBufferSizeMB is the only one that should have a
*significant* effect on RAM usage, and at a value of 100, I would not
expect there to be a major issue unless you are doing a lot of imports
at the same time.
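For reference, on Solr 3.5 those two settings live in solrconfig.xml and would look roughly like this (the values shown are examples, not recommendations):

```xml
<indexDefaults>
  <!-- Flush buffered documents to a new segment after ~100MB of RAM used -->
  <ramBufferSizeMB>100</ramBufferSizeMB>
  <!-- Higher values mean fewer, larger merges at index time,
       but more segments on disk to search across -->
  <mergeFactor>10</mergeFactor>
</indexDefaults>
```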
Because you are using Solr 3.5, if you do not need your import results
to be visible until the end, I wouldn't worry about using autoCommit.
If you were using Solr 4.x, I would recommend that you turn autoCommit
on, but with openSearcher set to false.
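On Solr 4.x, that combination in solrconfig.xml would look roughly like this (the maxTime value is an example, not a recommendation):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- Hard-commit at most every 60 seconds to flush the transaction log -->
    <maxTime>60000</maxTime>
    <!-- Don't open a new searcher on these commits, so autoCommit stays
         cheap and import results only become visible on the final commit -->
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```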
> Also, my impression is that the lower the autoCommit maxDocs/maxTime numbers
> (i.e. the defaults), the better for memory management, but at a cost in index
> time, as you pay for the overhead of committing. That is a number I've been
> experimenting with as well, and I have seen some variation in heap trends, but
> unfortunately I have not completed the job with any config... I did get very close...
> I'd hate to throw additional memory at the problem if there is something else I
> can tweak.
General impressions: Unless the amount of data involved in each Solr
document is absolutely enormous, this is very likely due to bugs (memory
leaks or fetchSize problems) in the Informix JDBC driver. I did find the
following page, but it's REALLY REALLY old, which hopefully means that
it doesn't apply:
http://www-01.ibm.com/support/docview.wss?uid=swg21260832
If your documents ARE huge, then you probably need to give more memory
to the java heap ... but you might still have memory leak bugs in the
JDBC driver.
When it comes to Java and Lucene/Solr, IBM has a *terrible* track
record, especially for people using the IBM Java VM. I would not be
surprised if their JDBC driver is plagued by similar problems. If you
do find a support resource and they tell you that you should change your
JDBC code to work differently, then you need to tell them that you can't
change the JDBC code and that they need to give you a configuration URL
workaround.
Here's another possible bug that could cause memory leaks:
http://www-01.ibm.com/support/docview.wss?uid=swg1IC58469
You might ask whether the problem could be a memory leak in Solr. It's
always possible, but I've had a lot of experience with DIH from MySQL on
Solr 1.4.0, 1.4.1, 3.2.0, 3.5.0, and 4.2.1. I've never seen any signs
of a leak.
Thanks,
Shawn