On 15/10/2015 09:57, nabil Kouici wrote:
Hi All,
I'm using DIH to index more than 15M records from SQL Server into Solr. This
takes more than 2 hours. A large part of this time is spent fetching data from
the database. I'm thinking about a solution that runs parallel loads (threads)
in the same DIH, with each thread loading a part of the data.
Do you have any experience with this kind of situation?
Regards,
Nabil.

Hi Nabil,

Although very convenient for database imports, DIH is single-threaded and difficult to optimise for performance. There is a batchSize parameter that you may try adjusting to see if that helps.

However, we generally avoid the DIH and roll our own indexers using Python or Java, reading the database using SQL (easy in either language) and then posting directly to Solr. This gives us a lot more flexibility in terms of conditioning the data, multi-threading and batching Solr updates. There are lots of great examples of high-performance indexing code available, e.g.:
http://bryanbende.com/development/2014/08/16/indexing-wikipedia-with-apache-solr/
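
As a rough illustration of the approach described above (not our production code), here is a minimal Python sketch that splits a numeric primary-key range into slices, lets each thread read one slice, and posts the rows to Solr in batches. It assumes the table has a roughly evenly distributed integer key. The names id_ranges, batches, post_to_solr and index_parallel, the core name "mycore" and the localhost URL are all made up for the example, and fetch_rows is a function you would write yourself to run something like "SELECT ... WHERE id BETWEEN ? AND ?" through your database driver of choice.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def id_ranges(min_id, max_id, parts):
    """Split [min_id, max_id] into at most `parts` contiguous inclusive ranges."""
    step = max(1, -(-(max_id - min_id + 1) // parts))  # ceiling division
    return [(lo, min(lo + step - 1, max_id))
            for lo in range(min_id, max_id + 1, step)]


def batches(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def post_to_solr(docs, url="http://localhost:8983/solr/mycore/update"):
    """POST one batch as a JSON array to Solr's update handler.

    The URL and core name are placeholders. No commit is sent here;
    commit once at the end or rely on autoCommit.
    """
    req = urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()


def index_parallel(fetch_rows, min_id, max_id,
                   post=post_to_solr, threads=4, batch_size=1000):
    """Index rows min_id..max_id using one thread per key slice.

    fetch_rows(lo, hi) must return an iterable of dicts (one per document)
    and should open its own database connection, since connections are
    generally not safe to share between threads.
    Returns the total number of documents posted.
    """
    def work(bounds):
        lo, hi = bounds
        sent = 0
        for batch in batches(fetch_rows(lo, hi), batch_size):
            post(batch)
            sent += len(batch)
        return sent

    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(work, id_ranges(min_id, max_id, threads)))
```

The thread count and batch size are things to tune by experiment: too many threads may simply move the bottleneck to the database or to Solr. If the key has large gaps, the slices will be uneven and a different partitioning scheme may work better.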

Best

Charlie
--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
