My experience with DIH was that we couldn't scale to the level we wanted. SolrJ with multi-threading and batch updates (parallel threads pushing data into Solr) worked, and we were able to ingest 5K-10K docs per second.
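A minimal sketch of the kind of multi-threaded, batched SolrJ loader described above, assuming a standalone Solr core at http://localhost:8983/solr/mycore, SolrJ 6.x or later, and placeholder field names, thread count, batch size, and document count (none of these values come from this thread):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    private static final String SOLR_URL = "http://localhost:8983/solr/mycore"; // assumed
    private static final int THREADS = 8;       // assumed
    private static final int BATCH_SIZE = 1000; // assumed

    public static void main(String[] args) throws Exception {
        // Parallel worker threads, each pushing its own slice of the data.
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (int t = 0; t < THREADS; t++) {
            final int shard = t;
            pool.submit(() -> indexSlice(shard));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);

        // One explicit commit after all workers finish, instead of per-batch commits.
        try (HttpSolrClient client = new HttpSolrClient.Builder(SOLR_URL).build()) {
            client.commit();
        }
    }

    // Each worker owns its own client and sends documents in batches, no commits.
    private static void indexSlice(int shard) {
        try (HttpSolrClient client = new HttpSolrClient.Builder(SOLR_URL).build()) {
            List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
            for (long i = shard; i < 1_000_000; i += THREADS) { // placeholder data source
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("title_s", "document " + i);
                batch.add(doc);
                if (batch.size() >= BATCH_SIZE) {
                    client.add(batch);   // batched update
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Autocommit stays off and a single explicit commit is issued at the end, which matches the bulk-loading setup discussed further down in this thread.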
Thanks,
Susheel

On Tue, Aug 2, 2016 at 9:15 AM, Mikhail Khludnev <m...@apache.org> wrote:

> Bernd,
> But why do you have so many deletes? Is it expected?
> When you run DIHs concurrently, do you shard the input data by uniqueKey?
>
> On Wed, Jul 27, 2016 at 6:20 PM, Bernd Fehling
> <bernd.fehl...@uni-bielefeld.de> wrote:
>
> > If there is a problem in a single index then it might also be in SolrCloud.
> > As far as I could figure out from INFOSTREAM, documents are added to
> > segments and terms are "collected". Duplicate terms are "deleted" (or
> > whatever). These deletes (or whatever) are not concurrent.
> > I have lines like:
> > BD 0 [Wed Jul 27 13:28:48 GMT+01:00 2016; Thread-27879]: applyDeletes: infos=...
> > BD 0 [Wed Jul 27 13:31:48 GMT+01:00 2016; Thread-27879]: applyDeletes took 180028 msec
> > ...
> > BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
> > BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec
> >
> > 3411845 msec is about 56 minutes during which the system is doing what???
> > At least it's not indexing, because there is only one Java process and no I/O at all!
> >
> > How can SolrJ help me now with this problem?
> >
> > Best
> > Bernd
> >
> > On 27.07.2016 at 16:41, Erick Erickson wrote:
> > > Well, at least it'll be easier to debug, in my experience. Simple example:
> > > at some point you'll call CloudSolrClient.add(doc list). Comment just that
> > > out and you'll be able to isolate whether the issue is querying the back
> > > end or sending to Solr.
> > >
> > > Then CloudSolrClient (assuming SolrCloud) has efficiencies in terms of
> > > routing...
> > >
> > > Best
> > > Erick
> > >
> > > On Jul 27, 2016 7:24 AM, "Bernd Fehling" <bernd.fehl...@uni-bielefeld.de>
> > > wrote:
> > >
> > > > So writing some SolrJ doing the same job as the DIH script
> > > > and using that concurrently will solve my problem?
> > > > I'm not using Tika.
> > > >
> > > > I don't think that DIH is my problem, even if it is not the best
> > > > solution right now.
> > > > Nevertheless, you are right that SolrJ has higher performance, but what
> > > > if I have the same problems with SolrJ as with DIH?
> > > >
> > > > If it runs with DIH it should run with SolrJ, with an additional
> > > > performance boost.
> > > >
> > > > Bernd
> > > >
> > > > On 27.07.2016 at 16:03, Erick Erickson wrote:
> > > > > I'd actually recommend you move to a SolrJ solution
> > > > > or similar. Currently, you're putting a load on the Solr
> > > > > servers (especially if you're also using Tika) in addition
> > > > > to all the indexing etc.
> > > > >
> > > > > Here's a sample:
> > > > > https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
> > > > >
> > > > > Dodging the question, I know, but DIH sometimes isn't
> > > > > the best solution.
> > > > >
> > > > > Best,
> > > > > Erick
> > > > >
> > > > > On Wed, Jul 27, 2016 at 6:59 AM, Bernd Fehling
> > > > > <bernd.fehl...@uni-bielefeld.de> wrote:
> > > > > > After enhancing the server with SSDs I'm trying to speed up indexing.
> > > > > >
> > > > > > The server has 16 CPUs and more than 100G RAM.
> > > > > > Java (1.8.0_92) has 24G.
> > > > > > Solr is 4.10.4.
> > > > > > The plain XML data to load is 218G with about 96M records.
> > > > > > This will result in a single index of 299G.
> > > > > >
> > > > > > I tried with 4, 8, 12 and 16 concurrent DIHs.
> > > > > > 16 and 12 were too much for 16 CPUs, so my test continued with 8
> > > > > > concurrent DIHs.
> > > > > > Then I was trying different <indexConfig> and <updateHandler>
> > > > > > settings but now I'm stuck.
> > > > > > I can't figure out what the best settings for bulk indexing are.
> > > > > > What I see is that the indexing is "falling asleep" after some time
> > > > > > of indexing.
> > > > > > It is only producing del-files, like _11_1.del, _w_2.del, _h_3.del, ...
> > > > > >
> > > > > > <indexConfig>
> > > > > >   <maxIndexingThreads>8</maxIndexingThreads>
> > > > > >   <ramBufferSizeMB>1024</ramBufferSizeMB>
> > > > > >   <maxBufferedDocs>-1</maxBufferedDocs>
> > > > > >   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
> > > > > >     <int name="maxMergeAtOnce">8</int>
> > > > > >     <int name="segmentsPerTier">100</int>
> > > > > >     <int name="maxMergedSegmentMB">512</int>
> > > > > >   </mergePolicy>
> > > > > >   <mergeFactor>8</mergeFactor>
> > > > > >   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
> > > > > >   <lockType>${solr.lock.type:native}</lockType>
> > > > > >   ...
> > > > > > </indexConfig>
> > > > > >
> > > > > > <updateHandler class="solr.DirectUpdateHandler2">
> > > > > >   <!-- no autocommit at all -->
> > > > > >   <autoSoftCommit>
> > > > > >     <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> > > > > >   </autoSoftCommit>
> > > > > > </updateHandler>
> > > > > >
> > > > > > command=full-import&optimize=false&clean=false&commit=false&waitSearcher=false
> > > > > >
> > > > > > After indexing finishes there is a final optimize.
> > > > > >
> > > > > > My idea is: if 8 DIHs use 8 CPUs then I have 8 CPUs left for merging
> > > > > > (maxIndexingThreads/maxMergeAtOnce/mergeFactor).
> > > > > > It should do no commit and no optimize.
> > > > > > ramBufferSizeMB is high because I have plenty of RAM and I want to
> > > > > > make use of the speed of RAM.
> > > > > > segmentsPerTier is high to reduce merging.
> > > > > >
> > > > > > But somewhere there is a misconfiguration because indexing gets stalled.
> > > > > >
> > > > > > Any idea what's going wrong?
> > > > > >
> > > > > > Bernd
> >
> > --
> > *************************************************************
> > Bernd Fehling                    Bielefeld University Library
> > Dipl.-Inform. (FH)                LibTec - Library Technology
> > Universitätsstr. 25                  and Knowledge Management
> > 33615 Bielefeld
> > Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
> >
> > BASE - Bielefeld Academic Search Engine - www.base-search.net
> > *************************************************************
>
>
> --
> Sincerely yours
> Mikhail Khludnev
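For reference, a minimal sketch of the batched CloudSolrClient.add(doc list) usage Erick mentions above. The ZooKeeper address, collection name, field names, and batch/document counts are assumptions, and the Builder signature shown assumes SolrJ 7.x or 8.x:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CloudBatchIndexer {
    public static void main(String[] args) throws Exception {
        // ZooKeeper ensemble and collection name are assumed placeholders.
        List<String> zkHosts = Collections.singletonList("localhost:2181");
        try (CloudSolrClient client =
                 new CloudSolrClient.Builder(zkHosts, Optional.empty()).build()) {
            client.setDefaultCollection("collection1");

            List<SolrInputDocument> batch = new ArrayList<>();
            for (int i = 0; i < 10_000; i++) {          // placeholder data source
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("title_s", "document " + i);
                batch.add(doc);
                if (batch.size() == 1000) {
                    client.add(batch);  // CloudSolrClient routes docs to their shard leaders
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch);
            }
            client.commit();            // one explicit commit at the end
        }
    }
}

Because CloudSolrClient is ZooKeeper-aware, it sends each document in a batch directly to the correct shard leader, which is the routing efficiency referred to in the thread.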