I'd actually recommend you move to a SolrJ solution
or similar. Currently you're putting extra load on the Solr
servers (especially if you're also using Tika) on top of
all the indexing work itself.

Here's a sample:
https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
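A minimal sketch of that approach, assuming SolrJ 4.x (to match your Solr 4.10.4). The URL, field names, queue size, and thread count are placeholders to tune for your box; ConcurrentUpdateSolrServer batches documents client-side and streams them, so the Solr server only has to index:

```java
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // URL and sizing are assumptions -- adjust for your setup.
        ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
                "http://localhost:8983/solr/collection1",
                10000,  // client-side queue size
                8);     // sender threads

        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("title", "Title " + i);
            server.add(doc);  // queued, sent to Solr in batches
        }

        server.blockUntilFinished();  // drain the queue
        server.commit();              // single commit at the end
        server.shutdown();
    }
}
```

The win is that you parse/prepare documents on a separate client machine, so the Solr box spends its CPUs on indexing and merging only.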

Dodging the question I know, but DIH sometimes isn't
the best solution.
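If you do stay with DIH, one thing worth checking (the values below are
placeholders, not your config): with the updateLog enabled and no hard
commits at all, the transaction logs grow without bound, which can look
exactly like indexing "falling asleep". A hard autoCommit with
openSearcher=false truncates the tlogs without publishing a new searcher,
so it shouldn't slow the bulk load:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit to keep transaction logs bounded; openSearcher=false
       means no new searcher is opened, so queries aren't affected and
       bulk indexing keeps its speed. Interval is a placeholder. -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```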

Best,
Erick

On Wed, Jul 27, 2016 at 6:59 AM, Bernd Fehling
<bernd.fehl...@uni-bielefeld.de> wrote:
> After enhancing the server with SSDs I'm trying to speed up indexing.
>
> The server has 16 CPUs and more than 100G RAM.
> JAVA (1.8.0_92) has 24G.
> SOLR is 4.10.4.
> Plain XML data to load is 218G with about 96M records.
> This will result in a single index of 299G.
>
> I tried with 4, 8, 12 and 16 concurrent DIHs.
> 16 and 12 were too much for 16 CPUs, so my test continued with 8
> concurrent DIHs.
> Then I tried different <indexConfig> and <updateHandler> settings, but
> now I'm stuck.
> I can't figure out what is the best setting for bulk indexing.
> What I see is that indexing "falls asleep" after some time.
> It is only producing del-files, like _11_1.del, _w_2.del, _h_3.del,...
>
> <indexConfig>
>     <maxIndexingThreads>8</maxIndexingThreads>
>     <ramBufferSizeMB>1024</ramBufferSizeMB>
>     <maxBufferedDocs>-1</maxBufferedDocs>
>     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>       <int name="maxMergeAtOnce">8</int>
>       <int name="segmentsPerTier">100</int>
>       <int name="maxMergedSegmentMB">512</int>
>     </mergePolicy>
>     <mergeFactor>8</mergeFactor>
>     <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>     <lockType>${solr.lock.type:native}</lockType>
>     ...
> </indexConfig>
>
> <updateHandler class="solr.DirectUpdateHandler2">
>      <!-- no autocommit at all -->
>      <autoSoftCommit>
>        <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
>      </autoSoftCommit>
> </updateHandler>
>
>
> command=full-import&optimize=false&clean=false&commit=false&waitSearcher=false
> After indexing finishes there is a final optimize.
>
> My idea is, if 8 DIHs use 8 CPUs then I have 8 CPUs left for merging
> (maxIndexingThreads/maxMergeAtOnce/mergeFactor).
> It should do no commit, no optimize.
> ramBufferSizeMB is high because I have plenty of RAM and I want to make
> use of the speed of RAM.
> segmentsPerTier is high to reduce merging.
>
> But somewhere is a misconfiguration because indexing gets stalled.
>
> Any idea what's going wrong?
>
>
> Bernd
