On 4/25/2013 9:00 AM, xiaoqi wrote:
> i using DIH to build index is slow , when it fetch 2 million rows , it will
> spend 20 minutes , very slow. 

If it takes 20 minutes for two million records, I'd say it's working
very well.  I do six simultaneous MySQL imports of 13 million records
each.  It takes a little over three hours on Solr 3.5.0, a little over
four hours on Solr 4.2.1 (due to compression and the transaction log).  If I
do them one at a time instead of all at once, it will go *slightly*
faster for each one, but the overall process would take a whole day.
For comparison purposes, that works out to about 20 minutes per million
rows in each import.  Yours is going twice as fast as mine.

Looking at your config file, I don't see a batchSize parameter.  This
setting is specific to MySQL.  You can greatly reduce the memory usage
by including this attribute in the dataSource tag along with the user
and password:

batchSize="-1"
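
For illustration, a complete dataSource tag with that attribute might
look something like the sketch below.  The driver class is the usual
MySQL Connector/J one; the URL, database name, user and password are
placeholders I made up, not values from your config:

```xml
<!-- Hypothetical example: host, database, user and password are placeholders. -->
<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/mydb"
            batchSize="-1"
            user="solr_user"
            password="secret"/>
```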

With two million records and no batchSize parameter, I'm surprised you
aren't hitting an Out Of Memory error.  By default the MySQL JDBC driver
will pull down the entire result set and store it in memory before DIH
can begin indexing.  A batchSize of -1 makes DIH tell the MySQL JDBC
driver to stream the results instead of storing them.  Reducing the
memory usage in this way might make it go faster.

Thanks,
Shawn
