Re: problems with bulk indexing with concurrent DIH

Bernd Fehling Thu, 04 Aug 2016 00:32:28 -0700

After updating to version 5.5.3 it looks good now.
I think LUCENE-6161 has fixed my problem.
Nevertheless, after updating my development system and recompiling
my plugins I will have a look at how DIH handles the "update" and also
at your advice about the uniqueKey.


Best regards
Bernd

On 02.08.2016 at 16:16, Mikhail Khludnev wrote:
> These deletes seem really puzzling to me. Can you experiment with
> commenting out uniqueKey in schema.xml? My expectation is that the deletes
> should go away after that.
> 
> On Tue, Aug 2, 2016 at 4:50 PM, Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de> wrote:
> 
>> Hi Mikhail,
>>
>> there are no deletes at all from my point of view.
>> All records have unique ids.
>> No sharding at all, it is a single index and it is guaranteed
>> that all DIHs get different data to load and no record is
>> sent twice to any DIH participating in the concurrent load.
>>
>> My only assumption so far: DIH is sending the records as "update"
>> (and not pure "add") to the indexer, which will generate delete
>> files during merge. If the number of segments is high it will
>> take quite long to merge and check all records of all segments.
>>
>> I'm currently setting up SOLR 5.5.3 but that takes a while.
>> I also located an "overwrite" parameter somewhere in DIH which
>> will force an "add" and not an "update" to the index, but
>> couldn't figure out how to set the parameter in the command.
>>
>> Bernd
>>
>>
>> On 02.08.2016 at 15:15, Mikhail Khludnev wrote:
>>> Bernd,
>>> But why do you have so many deletes? Is it expected?
>>> When you run DIHs concurrently, do you shard input data by uniqueKey?
>>>
>>> On Wed, Jul 27, 2016 at 6:20 PM, Bernd Fehling <
>>> bernd.fehl...@uni-bielefeld.de> wrote:
>>>
>>>> If there is a problem in a single index then it might also be in CloudSolr.
>>>> As far as I could figure out from INFOSTREAM, documents are added to
>>>> segments and terms are "collected". Duplicate terms are "deleted" (or whatever).
>>>> These deletes (or whatever) are not concurrent.
>>>> I have lines like:
>>>> BD 0 [Wed Jul 27 13:28:48 GMT+01:00 2016; Thread-27879]: applyDeletes: infos=...
>>>> BD 0 [Wed Jul 27 13:31:48 GMT+01:00 2016; Thread-27879]: applyDeletes took 180028 msec
>>>> ...
>>>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
>>>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec
>>>>
>>>> 3411845 msec is about 57 minutes during which the system is doing what???
>>>> At least not indexing, because there is only one Java process and no I/O at all!
>>>>
>>>> How can SolrJ help me now with this problem?
>>>>
>>>> Best
>>>> Bernd
>>>>
>>>>
>>>> On 27.07.2016 at 16:41, Erick Erickson wrote:
>>>>> Well, at least it'll be easier to debug in my experience. Simple example:
>>>>> at some point you'll call CloudSolrClient.add(doc list). Comment just that
>>>>> out and you'll be able to isolate whether the issue is querying the backend
>>>>> or sending to Solr.
>>>>>
>>>>> Then CloudSolrClient (assuming SolrCloud) has efficiencies in terms of
>>>>> routing...
>>>>>
>>>>> Best
>>>>> Erick
>>>>>
>>>>> On Jul 27, 2016 7:24 AM, "Bernd Fehling" <bernd.fehl...@uni-bielefeld.de>
>>>>> wrote:
>>>>>
>>>>>> So writing some SolrJ code doing the same job as the DIH script
>>>>>> and running that concurrently will solve my problem?
>>>>>> I'm not using Tika.
>>>>>>
>>>>>> I don't think that DIH is my problem, even if it is not the best solution
>>>>>> right now.
>>>>>> Nevertheless, you are right that SolrJ has higher performance, but what
>>>>>> if I have the same problems with SolrJ as with DIH?
>>>>>>
>>>>>> If it runs with DIH it should run with SolrJ with an additional performance
>>>>>> boost.
>>>>>>
>>>>>> Bernd
>>>>>>
>>>>>>
>>>>>> On 27.07.2016 at 16:03, Erick Erickson wrote:
>>>>>>> I'd actually recommend you move to a SolrJ solution
>>>>>>> or similar. Currently, you're putting a load on the Solr
>>>>>>> servers (especially if you're also using Tika) in addition
>>>>>>> to all indexing etc.
>>>>>>>
>>>>>>> Here's a sample:
>>>>>>> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>>>>>>>
>>>>>>> Dodging the question I know, but DIH sometimes isn't
>>>>>>> the best solution.
>>>>>>>
>>>>>>> Best,
>>>>>>> Erick
>>>>>>>
>>>>>>> On Wed, Jul 27, 2016 at 6:59 AM, Bernd Fehling
>>>>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>>>>> After enhancing the server with SSDs I'm trying to speed up indexing.
>>>>>>>>
>>>>>>>> The server has 16 CPUs and more than 100G RAM.
>>>>>>>> JAVA (1.8.0_92) has 24G.
>>>>>>>> SOLR is 4.10.4.
>>>>>>>> Plain XML data to load is 218G with about 96M records.
>>>>>>>> This will result in a single index of 299G.
>>>>>>>>
>>>>>>>> I tried with 4, 8, 12 and 16 concurrent DIHs.
>>>>>>>> 16 and 12 were too much for 16 CPUs, so my test continued with 8
>>>>>>>> concurrent DIHs.
>>>>>>>> Then I tried different <indexConfig> and <updateHandler> settings
>>>>>>>> but now I'm stuck.
>>>>>>>> I can't figure out what the best setting for bulk indexing is.
>>>>>>>> What I see is that the indexing is "falling asleep" after some time of
>>>>>>>> indexing.
>>>>>>>> It is only producing del-files, like _11_1.del, _w_2.del, _h_3.del,...
>>>>>>>>
>>>>>>>> <indexConfig>
>>>>>>>>     <maxIndexingThreads>8</maxIndexingThreads>
>>>>>>>>     <ramBufferSizeMB>1024</ramBufferSizeMB>
>>>>>>>>     <maxBufferedDocs>-1</maxBufferedDocs>
>>>>>>>>     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>>>>>>>>       <int name="maxMergeAtOnce">8</int>
>>>>>>>>       <int name="segmentsPerTier">100</int>
>>>>>>>>       <int name="maxMergedSegmentMB">512</int>
>>>>>>>>     </mergePolicy>
>>>>>>>>     <mergeFactor>8</mergeFactor>
>>>>>>>>     <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>>>>>>>>     <lockType>${solr.lock.type:native}</lockType>
>>>>>>>>     ...
>>>>>>>> </indexConfig>
>>>>>>>>
>>>>>>>> <updateHandler class="solr.DirectUpdateHandler2">
>>>>>>>>      <!-- no autocommit at all -->
>>>>>>>>      <autoSoftCommit>
>>>>>>>>        <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
>>>>>>>>      </autoSoftCommit>
>>>>>>>> </updateHandler>
>>>>>>>>
>>>>>>>>
>>>>>>>> command=full-import&optimize=false&clean=false&commit=false&waitSearcher=false
>>>>>>>> After indexing finishes there is a final optimize.
>>>>>>>>
>>>>>>>> My idea is, if 8 DIHs use 8 CPUs then I have 8 CPUs left for merging
>>>>>>>> (maxIndexingThreads/maxMergeAtOnce/mergeFactor).
>>>>>>>> It should do no commit, no optimize.
>>>>>>>> ramBufferSizeMB is high because I have plenty of RAM and I want to make
>>>>>>>> use of the speed of RAM.
>>>>>>>> segmentsPerTier is high to reduce merging.
>>>>>>>>
>>>>>>>> But there is a misconfiguration somewhere, because indexing gets stalled.
>>>>>>>>
>>>>>>>> Any idea what's going wrong?
>>>>>>>>
>>>>>>>>
>>>>>>>> Bernd
>>>>>>>>
>>>>>>
>>>>>
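
The applyDeletes timings quoted in this thread were read off the infoStream log by hand. A minimal sketch of how such lines might be pulled out and converted to minutes is below. It assumes the log lines look exactly like the ones quoted in the thread; the function name apply_deletes_minutes and the regular expression are illustrative choices, not part of Solr or Lucene.

```python
import re

# Matches infoStream lines such as:
#   BD 0 [Wed Jul 27 13:31:48 GMT+01:00 2016; Thread-27879]: applyDeletes took 180028 msec
# The pattern is not anchored to the start of the line, so a leading
# prefix (for example mail quote markers) does not prevent a match.
APPLY_DELETES = re.compile(r"Thread-(\d+)\]:\s*applyDeletes\s+took\s+(\d+)\s+msec")


def apply_deletes_minutes(log_text):
    """Return (thread_id, minutes) for every 'applyDeletes took' entry."""
    return [
        (int(thread), int(msec) / 60000.0)
        for thread, msec in APPLY_DELETES.findall(log_text)
    ]


# Sample input copied from the log lines quoted in the thread.
SAMPLE = """\
BD 0 [Wed Jul 27 13:28:48 GMT+01:00 2016; Thread-27879]: applyDeletes: infos=...
BD 0 [Wed Jul 27 13:31:48 GMT+01:00 2016; Thread-27879]: applyDeletes took 180028 msec
BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec
"""

if __name__ == "__main__":
    for thread, minutes in apply_deletes_minutes(SAMPLE):
        print("Thread-%d: applyDeletes took %.1f min" % (thread, minutes))
```

Running something like this over a full infoStream log would show whether the long applyDeletes pauses grow with the number of segments, which is the assumption raised earlier in the thread.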
