Re: problems with bulk indexing with concurrent DIH

Mikhail Khludnev Tue, 02 Aug 2016 06:16:31 -0700

Bernd,
But why do you have so many deletes? Is it expected?
When you run DIHs concurrently, do you shard the input data by uniqueKey?
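
A minimal sketch of what sharding the input by uniqueKey could look like, assuming the records are split in a preprocessing step before the concurrent imports start. The function names, the `id` field and the sample records are hypothetical and not taken from Bernd's setup. The idea is that every record with the same uniqueKey goes to the same importer, so two concurrent imports never overwrite each other's documents.

```python
import zlib


def shard_for(unique_key: str, num_shards: int) -> int:
    """Map a uniqueKey value to a shard index in [0, num_shards).

    crc32 is used instead of the built-in hash() because hash() of a
    string is randomized per process and would not give stable shards.
    """
    return zlib.crc32(unique_key.encode("utf-8")) % num_shards


def partition(records, num_shards, key_field="id"):
    """Split records into num_shards lists, keyed by their uniqueKey.

    Records sharing a uniqueKey always end up in the same list and keep
    their original relative order.
    """
    shards = [[] for _ in range(num_shards)]
    for record in records:
        shards[shard_for(record[key_field], num_shards)].append(record)
    return shards


# Hypothetical sample input: "doc-1" occurs twice (an update of the same doc).
records = [
    {"id": "doc-1", "title": "first version"},
    {"id": "doc-2", "title": "another record"},
    {"id": "doc-1", "title": "second version"},
]

# One shard per concurrent import (8 is the number used in the test run below).
shards = partition(records, num_shards=8)
```

Each of the 8 lists would then be fed to exactly one import, so both versions of `doc-1` are handled by the same importer, in their original order.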


On Wed, Jul 27, 2016 at 6:20 PM, Bernd Fehling <
[email protected]> wrote:

> If there is a problem with a single index, then it might also occur in
> SolrCloud.
> As far as I could figure out from INFOSTREAM, documents are added to
> segments and terms are "collected". Duplicate terms are "deleted" (or
> whatever). These deletes (or whatever) are not concurrent.
> I have lines like:
> BD 0 [Wed Jul 27 13:28:48 GMT+01:00 2016; Thread-27879]: applyDeletes:
> infos=...
> BD 0 [Wed Jul 27 13:31:48 GMT+01:00 2016; Thread-27879]: applyDeletes took
> 180028 msec
> ...
> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes:
> infos=...
> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took
> 3411845 msec
>
> 3411845 msec is almost 57 minutes in which the system is doing what???
> At least it is not indexing, because there is only one Java process and
> no I/O at all!
>
> How can SolrJ help me now with this problem?
>
> Best
> Bernd
>
>
> On 27.07.2016 at 16:41, Erick Erickson wrote:
> > Well, at least it'll be easier to debug, in my experience. Simple example:
> > at some point you'll call CloudSolrClient.add(doc list). Comment just that
> > out and you'll be able to isolate whether the issue is querying the backend
> > or sending to Solr.
> >
> > Then CloudSolrClient (assuming SolrCloud) has efficiencies in terms of
> > routing...
> >
> > Best
> > Erick
> >
> > On Jul 27, 2016 7:24 AM, "Bernd Fehling" <[email protected]> wrote:
> >
> >> So writing some SolrJ code that does the same job as the DIH script
> >> and running it concurrently will solve my problem?
> >> I'm not using Tika.
> >>
> >> I don't think that DIH is my problem, even if it is not the best
> >> solution right now.
> >> Nevertheless, you are right that SolrJ has higher performance, but what
> >> if I have the same problems with SolrJ as with DIH?
> >>
> >> If it runs with DIH, it should run with SolrJ with an additional
> >> performance boost.
> >>
> >> Bernd
> >>
> >>
> >> On 27.07.2016 at 16:03, Erick Erickson wrote:
> >>> I'd actually recommend you move to a SolrJ solution
> >>> or similar. Currently, you're putting a load on the Solr
> >>> servers (especially if you're also using Tika) in addition
> >>> to all indexing etc.
> >>>
> >>> Here's a sample:
> >>> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
> >>>
> >>> Dodging the question I know, but DIH sometimes isn't
> >>> the best solution.
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Wed, Jul 27, 2016 at 6:59 AM, Bernd Fehling
> >>> <[email protected]> wrote:
> >>>> After enhancing the server with SSDs I'm trying to speed up indexing.
> >>>>
> >>>> The server has 16 CPUs and more than 100G RAM.
> >>>> JAVA (1.8.0_92) has 24G.
> >>>> SOLR is 4.10.4.
> >>>> Plain XML data to load is 218G with about 96M records.
> >>>> This will result in a single index of 299G.
> >>>>
> >>>> I tried with 4, 8, 12 and 16 concurrent DIHs.
> >>>> 16 and 12 were too much for 16 CPUs, so my test continued with 8 concurrent DIHs.
> >>>> Then I tried different <indexConfig> and <updateHandler> settings, but now I'm stuck.
> >>>> I can't figure out what the best setting for bulk indexing is.
> >>>> What I see is that the indexing is "falling asleep" after some time of indexing.
> >>>> It is only producing del-files, like _11_1.del, _w_2.del, _h_3.del,...
> >>>>
> >>>> <indexConfig>
> >>>>     <maxIndexingThreads>8</maxIndexingThreads>
> >>>>     <ramBufferSizeMB>1024</ramBufferSizeMB>
> >>>>     <maxBufferedDocs>-1</maxBufferedDocs>
> >>>>     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
> >>>>       <int name="maxMergeAtOnce">8</int>
> >>>>       <int name="segmentsPerTier">100</int>
> >>>>       <int name="maxMergedSegmentMB">512</int>
> >>>>     </mergePolicy>
> >>>>     <mergeFactor>8</mergeFactor>
> >>>>     <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
> >>>>     <lockType>${solr.lock.type:native}</lockType>
> >>>>     ...
> >>>> </indexConfig>
> >>>>
> >>>> <updateHandler class="solr.DirectUpdateHandler2">
> >>>>      <!-- no autocommit at all -->
> >>>>      <autoSoftCommit>
> >>>>        <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> >>>>      </autoSoftCommit>
> >>>> </updateHandler>
> >>>>
> >>>>
> >>>> command=full-import&optimize=false&clean=false&commit=false&waitSearcher=false
> >>>> After indexing finishes there is a final optimize.
> >>>>
> >>>> My idea is, if 8 DIHs use 8 CPUs then I have 8 CPUs left for merging
> >>>> (maxIndexingThreads/maxMergeAtOnce/mergeFactor).
> >>>> It should do no commit, no optimize.
> >>>> ramBufferSizeMB is high because I have plenty of RAM and I want to make use of the speed of RAM.
> >>>> segmentsPerTier is high to reduce merging.
> >>>>
> >>>> But there is a misconfiguration somewhere, because indexing stalls.
> >>>>
> >>>> Any idea what's going wrong?
> >>>>
> >>>>
> >>>> Bernd
> >>>>
> >>
> >
>
> --
> *************************************************************
> Bernd Fehling                    Bielefeld University Library
> Dipl.-Inform. (FH)                LibTec - Library Technology
> Universitätsstr. 25                  and Knowledge Management
> 33615 Bielefeld
> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
>
> BASE - Bielefeld Academic Search Engine - www.base-search.net
> *************************************************************
>



-- 
Sincerely yours
Mikhail Khludnev
