Hi Mikhail,

There are no deletes at all from my point of view.
All records have unique IDs.
There is no sharding at all; it is a single index, and it is ensured
that all DIHs get different data to load and that no record is
sent twice to any DIH participating in the concurrent loading.

My only assumption so far: DIH sends the records as "update"
(and not as pure "add") to the indexer, which generates delete
files during merge. If the number of segments is high, it takes
quite a long time to merge and check all records of all segments.
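
For illustration, this is the effect I suspect (an untested SolrJ 4.x
sketch; the core URL and field names are made up):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class UpdateVsAdd {
      public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/core1");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "rec-1");              // uniqueKey field
        doc.addField("title", "first version");
        server.add(doc);
        doc.setField("title", "second version");
        server.add(doc);  // same uniqueKey: the first copy is only marked
                          // deleted, which is what the *.del files record
        server.commit();
        server.shutdown();
      }
    }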

I'm currently setting up Solr 5.5.3, but that will take a while.
I also located an "overwrite" parameter somewhere in DIH which
should force an "add" instead of an "update" to the index, but I
couldn't figure out how to set that parameter with the command.
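
In SolrJ the same thing should be settable per update request,
something like this (untested, based on my reading of the 4.x
javadocs; "doc" and "server" as in the snippet above):

    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.params.UpdateParams;

    UpdateRequest req = new UpdateRequest();
    req.add(doc);
    req.setParam(UpdateParams.OVERWRITE, "false"); // plain "add": skip the
                                                   // duplicate check/delete
    req.process(server);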

Bernd


On 02.08.2016 at 15:15, Mikhail Khludnev wrote:
> Bernd,
> But why do you have so many deletes? Is it expected?
> When you run DIHs concurrently, do you shard the input data by uniqueKey?
> 
> On Wed, Jul 27, 2016 at 6:20 PM, Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de> wrote:
> 
>> If there is a problem in a single index then it might also exist in SolrCloud.
>> As far as I could figure out from the INFOSTREAM, documents are added to
>> segments and terms are "collected". Duplicate terms are "deleted" (or
>> whatever). These deletes (or whatever) are not concurrent.
>> I have lines like:
>> BD 0 [Wed Jul 27 13:28:48 GMT+01:00 2016; Thread-27879]: applyDeletes:
>> infos=...
>> BD 0 [Wed Jul 27 13:31:48 GMT+01:00 2016; Thread-27879]: applyDeletes took
>> 180028 msec
>> ...
>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes:
>> infos=...
>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took
>> 3411845 msec
>>
>> 3411845 msec is about 57 minutes during which the system is doing what???
>> At least not indexing, because only one JAVA process is busy and there
>> is no I/O at all!
>>
>> How can SolrJ help me now with this problem?
>>
>> Best
>> Bernd
>>
>>
>> On 27.07.2016 at 16:41, Erick Erickson wrote:
>>> Well, at least it'll be easier to debug, in my experience. Simple example:
>>> at some point you'll call CloudSolrClient.add(doc list). Comment just that
>>> out and you'll be able to isolate whether the issue is querying the DB or
>>> sending to Solr.
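>>>
>>> A rough sketch of that isolation test (untested; fetchBatchFromDb() is
>>> a made-up stand-in for the DB query code, and 4.x names the client
>>> CloudSolrServer):
>>>
>>>     import java.util.List;
>>>     import org.apache.solr.client.solrj.impl.CloudSolrServer;
>>>     import org.apache.solr.common.SolrInputDocument;
>>>
>>>     public class IsolateBottleneck {
>>>       public static void main(String[] args) throws Exception {
>>>         CloudSolrServer client = new CloudSolrServer("zkhost:2181");
>>>         client.setDefaultCollection("collection1");
>>>         List<SolrInputDocument> docs;
>>>         while (!(docs = fetchBatchFromDb()).isEmpty()) {
>>>           client.add(docs);  // comment out just this line; if the run is
>>>                              // still slow, the time goes into the DB side
>>>         }
>>>         client.commit();
>>>         client.shutdown();
>>>       }
>>>       // stand-in for your real DB reads
>>>       static List<SolrInputDocument> fetchBatchFromDb() {
>>>         return java.util.Collections.emptyList();
>>>       }
>>>     }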
>>>
>>> Then CloudSolrClient (assuming SolrCloud) has efficiencies in terms of
>>> routing...
>>>
>>> Best
>>> Erick
>>>
>>> On Jul 27, 2016 7:24 AM, "Bernd Fehling" <bernd.fehl...@uni-bielefeld.de>
>>> wrote:
>>>
>>>> So writing some SolrJ code that does the same job as the DIH script,
>>>> and running it concurrently, will solve my problem?
>>>> I'm not using Tika.
>>>>
>>>> I don't think that DIH is my problem, even if it is not the best
>>>> solution right now.
>>>> Nevertheless, you are right that SolrJ has higher performance, but what
>>>> if I have the same problems with SolrJ as with DIH?
>>>>
>>>> If it runs with DIH it should run with SolrJ, with an additional
>>>> performance boost.
>>>>
>>>> Bernd
>>>>
>>>>
>>>> On 27.07.2016 at 16:03, Erick Erickson wrote:
>>>>> I'd actually recommend you move to a SolrJ solution
>>>>> or similar. Currently you're putting extra load on the Solr
>>>>> servers (especially if you're also using Tika) on top of all
>>>>> the indexing work itself.
>>>>>
>>>>> Here's a sample:
>>>>> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
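>>>>>
>>>>> Roughly the pattern from that post (untested sketch against the 4.x
>>>>> API; readXmlRecords() is a made-up stand-in for your data source):
>>>>>
>>>>>     import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
>>>>>     import org.apache.solr.common.SolrInputDocument;
>>>>>
>>>>>     public class BulkIndexer {
>>>>>       public static void main(String[] args) throws Exception {
>>>>>         // queue 10000 docs, 8 background sender threads; tune to taste
>>>>>         ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
>>>>>             "http://localhost:8983/solr/core1", 10000, 8);
>>>>>         for (SolrInputDocument doc : readXmlRecords()) {
>>>>>           server.add(doc);
>>>>>         }
>>>>>         server.blockUntilFinished();  // wait for the queue to drain
>>>>>         server.commit();
>>>>>         server.shutdown();
>>>>>       }
>>>>>       // stand-in for parsing your XML records
>>>>>       static Iterable<SolrInputDocument> readXmlRecords() {
>>>>>         return java.util.Collections.emptyList();
>>>>>       }
>>>>>     }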
>>>>>
>>>>> Dodging the question, I know, but DIH sometimes isn't
>>>>> the best solution.
>>>>>
>>>>> Best,
>>>>> Erick
>>>>>
>>>>> On Wed, Jul 27, 2016 at 6:59 AM, Bernd Fehling
>>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>>> After enhancing the server with SSDs I'm trying to speed up indexing.
>>>>>>
>>>>>> The server has 16 CPUs and more than 100G RAM.
>>>>>> Java (1.8.0_92) has a 24G heap.
>>>>>> Solr is 4.10.4.
>>>>>> Plain XML data to load is 218G with about 96M records.
>>>>>> This will result in a single index of 299G.
>>>>>>
>>>>>> I tried with 4, 8, 12 and 16 concurrent DIHs.
>>>>>> 16 and 12 were too much for 16 CPUs, so my tests continued with 8
>>>>>> concurrent DIHs.
>>>>>> Then I tried different <indexConfig> and <updateHandler> settings,
>>>>>> but now I'm stuck.
>>>>>> I can't figure out the best settings for bulk indexing.
>>>>>> What I see is that the indexing is "falling asleep" after some time of
>>>>>> indexing.
>>>>>> It is only producing del-files, like _11_1.del, _w_2.del, _h_3.del, ...
>>>>>>
>>>>>> <indexConfig>
>>>>>>     <maxIndexingThreads>8</maxIndexingThreads>
>>>>>>     <ramBufferSizeMB>1024</ramBufferSizeMB>
>>>>>>     <maxBufferedDocs>-1</maxBufferedDocs>
>>>>>>     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>>>>>>       <int name="maxMergeAtOnce">8</int>
>>>>>>       <int name="segmentsPerTier">100</int>
>>>>>>       <int name="maxMergedSegmentMB">512</int>
>>>>>>     </mergePolicy>
>>>>>>     <mergeFactor>8</mergeFactor>
>>>>>>     <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>>>>>>     <lockType>${solr.lock.type:native}</lockType>
>>>>>>     ...
>>>>>> </indexConfig>
>>>>>>
>>>>>> <updateHandler class="solr.DirectUpdateHandler2">
>>>>>>      <!-- no autocommit at all -->
>>>>>>      <autoSoftCommit>
>>>>>>        <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
>>>>>>      </autoSoftCommit>
>>>>>> </updateHandler>
>>>>>>
>>>>>> command=full-import&optimize=false&clean=false&commit=false&waitSearcher=false
>>>>>> After indexing finishes there is a final optimize.
>>>>>>
>>>>>> My idea is: if 8 DIHs use 8 CPUs, then I have 8 CPUs left for merging
>>>>>> (maxIndexingThreads/maxMergeAtOnce/mergeFactor).
>>>>>> It should do no commit and no optimize.
>>>>>> ramBufferSizeMB is high because I have plenty of RAM and I want to
>>>>>> make use of the speed of RAM.
>>>>>> segmentsPerTier is high to reduce merging.
>>>>>>
>>>>>> But there is a misconfiguration somewhere, because indexing stalls.
>>>>>>
>>>>>> Any idea what's going wrong?
>>>>>>
>>>>>>
>>>>>> Bernd
>>>>>>
>>>>
>>>
>>

-- 
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************
