On 15/10/2015 09:57, nabil Kouici wrote:
Hi All,
I'm using DIH to index more than 15M records from SQL Server to Solr. This takes
more than 2 hours. A large part of this time is spent fetching data from the
database. I'm thinking about a solution that runs parallel loads (threads) in
the same DIH, with each thread loading a part of the data.
Do you have any experience with this kind of situation?
Regards, Nabil.
Hi Nabil,
Although very convenient for database imports, DIH is single-threaded
and difficult to optimise for performance. There is a batchSize
parameter on the JDBC data source that you may try adjusting to see if
that helps.
However, we generally avoid the DIH and roll our own indexers using
Python or Java, reading the database using SQL (easy in either language)
and then posting directly to Solr. This gives us a lot more flexibility
in terms of conditioning the data, multi-threading and batching Solr
updates. There are lots of great examples of high-performance indexing
code available, e.g.:
http://bryanbende.com/development/2014/08/16/indexing-wikipedia-with-apache-solr/
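To give a rough idea of the shape such an indexer can take, here is a
minimal Python sketch, not our production code. It assumes the table has
a numeric key that can be split into ranges. fetch_rows is a hypothetical
callable you would write yourself: it should open its own database
connection and yield one dict per row for an inclusive key range. The
Solr URL and core name (mycore) are placeholders, and the worker count
and batch size are starting points to tune, not recommendations.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def id_ranges(min_id, max_id, parts):
    """Split the inclusive range [min_id, max_id] into at most `parts` chunks."""
    if max_id < min_id:
        return []
    total = max_id - min_id + 1
    step = -(-total // parts)  # ceiling division
    return [(lo, min(lo + step - 1, max_id))
            for lo in range(min_id, max_id + 1, step)]


def batches(rows, size):
    """Group an iterable of rows into lists of at most `size` items."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch


def post_to_solr(docs, url="http://localhost:8983/solr/mycore/update"):
    """POST one batch of documents to Solr as a JSON array (placeholder URL)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()


def index_range(fetch_rows, post, lo, hi, batch_size):
    """Fetch one key range and post it in batches; return the row count."""
    count = 0
    for batch in batches(fetch_rows(lo, hi), batch_size):
        post(batch)
        count += len(batch)
    return count


def parallel_index(fetch_rows, post, min_id, max_id, workers=4, batch_size=1000):
    """Index [min_id, max_id] using one thread per key range."""
    ranges = id_ranges(min_id, max_id, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(index_range, fetch_rows, post, lo, hi, batch_size)
                   for lo, hi in ranges]
        # result() re-raises any exception from a worker thread
        return sum(f.result() for f in futures)
```

You would call it as parallel_index(fetch_rows, post_to_solr, min_id,
max_id) and issue a single commit once all the threads have finished,
rather than committing per batch. Splitting by key range means each
thread runs its own query, so the database fetch (the slow part in your
case) happens in parallel.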
Best
Charlie
--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk