Re: DIH parallel processing

Charlie Hull Thu, 15 Oct 2015 02:43:37 -0700

On 15/10/2015 09:57, nabil Kouici wrote:

Hi All,
I'm using DIH to index more than 15M records from SQL Server into Solr. This
takes more than 2 hours. A large part of this time is spent fetching data from
the database. I'm thinking about a solution that runs parallel loads (threads)
in the same DIH, with each thread loading a part of the data.
Do you have any experience with this kind of situation?
Regards,
Nabil.

Hi Nabil,

Although very convenient for database imports, DIH is single-threaded and difficult to optimise for performance. There is a batchSize parameter that you may try adjusting to see if that helps.

However, we generally avoid the DIH and roll our own indexers using Python or Java, reading the database using SQL (easy in either language) and then posting directly to Solr. This gives us a lot more flexibility in terms of conditioning the data, multi-threading and batching Solr updates. There are lots of great examples of high-performance indexing code available e.g.:

http://bryanbende.com/development/2014/08/16/indexing-wikipedia-with-apache-solr/
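
As a rough illustration of the roll-your-own approach, here is a minimal Python sketch, not production code. It assumes the table has a numeric primary key, so the key range can be split into slices with one thread per slice, and that each slice is posted to Solr in batches. The names id_ranges, batches, index_range and parallel_index are hypothetical, as are the core name "mycore" and the localhost URL. fetch_rows is a placeholder for your own function that runs something like SELECT ... WHERE id BETWEEN lo AND hi on its own database connection and yields one dict per row.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def id_ranges(min_id, max_id, parts):
    """Split the inclusive range [min_id, max_id] into up to `parts` slices.

    Assumes min_id <= max_id and parts >= 1.
    """
    total = max_id - min_id + 1
    step = -(-total // parts)  # ceiling division
    return [(lo, min(lo + step - 1, max_id))
            for lo in range(min_id, max_id + 1, step)]


def batches(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def post_to_solr(docs, url="http://localhost:8983/solr/mycore/update"):
    """POST one batch of documents (a list of dicts) to Solr as JSON.

    The URL and core name are assumptions; adjust them to your setup.
    """
    req = urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()


def index_range(fetch_rows, post, lo, hi, batch_size):
    """Worker: fetch one ID slice, post it in batches, return the row count."""
    count = 0
    for batch in batches(fetch_rows(lo, hi), batch_size):
        post(batch)
        count += len(batch)
    return count


def parallel_index(fetch_rows, post, min_id, max_id, workers=4, batch_size=1000):
    """Index the whole ID range using one thread per slice; return total rows."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(index_range, fetch_rows, post, lo, hi, batch_size)
            for lo, hi in id_ranges(min_id, max_id, workers)
        ]
        return sum(f.result() for f in futures)
```

You would call it as parallel_index(my_fetch_rows, post_to_solr, 1, max_id), where my_fetch_rows is your own database function. The sketch does not issue a commit; that would need to be sent separately or left to Solr's autoCommit settings. Sensible values for workers and batch_size depend on your database and Solr hardware, so they are worth measuring rather than guessing.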

Best

Charlie
--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
