These deletes seem really puzzling to me. Can you experiment with
commenting out the uniqueKey in schema.xml? My expectation is that the
deletes should go away after that.

On Tue, Aug 2, 2016 at 4:50 PM, Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:

> Hi Mikhail,
>
> there are no deletes at all from my point of view.
> All records have unique ids.
> There is no sharding at all; it is a single index, and it is ensured
> that all DIHs get different data to load and that no record is
> sent twice to any DIH participating in the concurrent loading.
>
> My only assumption so far is that DIH sends the records as an "update"
> (and not as a pure "add") to the indexer, which then generates delete
> files during merging. If the number of segments is high, it takes
> quite long to merge and check all records of all segments.
>
> I'm currently setting up SOLR 5.5.3, but that takes a while.
> I also located an "overwrite" parameter somewhere in DIH which
> should force an "add" instead of an "update" to the index, but I
> couldn't figure out how to set that parameter on the command.
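> 
> If I end up bypassing DIH, something like the following SolrJ sketch is
> what I have in mind. This is only a sketch: the class names are SolrJ 5.x
> (HttpSolrServer in 4.x), the URL, core and field names are placeholders,
> and I am assuming that the plain /update handler honors an overwrite=false
> request parameter so that documents go in as pure adds.
> 
> import org.apache.solr.client.solrj.SolrClient;
> import org.apache.solr.client.solrj.impl.HttpSolrClient;
> import org.apache.solr.client.solrj.request.UpdateRequest;
> import org.apache.solr.common.SolrInputDocument;
> 
> public class PureAddSketch {
>   public static void main(String[] args) throws Exception {
>     // Placeholder URL and core name.
>     SolrClient client = new HttpSolrClient("http://localhost:8983/solr/base");
> 
>     UpdateRequest req = new UpdateRequest();
>     // Assumption: overwrite=false means a pure add, i.e. no delete-by-id
>     // for a possibly existing document with the same uniqueKey.
>     req.setParam("overwrite", "false");
> 
>     SolrInputDocument doc = new SolrInputDocument();
>     doc.addField("id", "rec-0001");
>     doc.addField("title", "example record");
>     req.add(doc);
> 
>     req.process(client);   // no commit here; commit once after the whole load
>     client.close();
>   }
> }
> 
> Whether DIH itself exposes the same switch is exactly what I could not
> figure out.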
>
> Bernd
>
>
> On 02.08.2016 at 15:15, Mikhail Khludnev wrote:
> > Bernd,
> > But why do you have so many deletes? Is that expected?
> > When you run the DIHs concurrently, do you shard the input data by uniqueKey?
> >
> > On Wed, Jul 27, 2016 at 6:20 PM, Bernd Fehling <
> > bernd.fehl...@uni-bielefeld.de> wrote:
> >
> >> If there is a problem with a single index then it might also exist in SolrCloud.
> >> As far as I could figure out from the INFOSTREAM, documents are added to
> >> segments and terms are "collected". Duplicate terms are "deleted" (or whatever).
> >> These deletes (or whatever) are not concurrent.
> >> I have lines like:
> >> BD 0 [Wed Jul 27 13:28:48 GMT+01:00 2016; Thread-27879]: applyDeletes: infos=...
> >> BD 0 [Wed Jul 27 13:31:48 GMT+01:00 2016; Thread-27879]: applyDeletes took 180028 msec
> >> ...
> >> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
> >> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec
> >>
> >> 3411845 msec is almost 57 minutes during which the system is doing what???
> >> At least it is not indexing, because there is only one JAVA process and no I/O at all!
> >>
> >> How can SolrJ help me now with this problem?
> >>
> >> Best
> >> Bernd
> >>
> >>
> >> On 27.07.2016 at 16:41, Erick Erickson wrote:
> >>> Well, at least it'll be easier to debug, in my experience. Simple example:
> >>> at some point you'll call CloudSolrClient.add(doc list). Comment just that
> >>> out and you'll be able to isolate whether the issue is querying the back
> >>> end or sending to Solr.
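> >>>
> >>> Roughly this shape, with the line to comment out marked. Just a sketch:
> >>> SolrJ 5.x names (CloudSolrServer in 4.x), and the zkHost, collection,
> >>> ids and batch size are made up for illustration.
> >>>
> >>> import java.util.ArrayList;
> >>> import java.util.List;
> >>> import org.apache.solr.client.solrj.impl.CloudSolrClient;
> >>> import org.apache.solr.common.SolrInputDocument;
> >>>
> >>> public class IsolationTest {
> >>>   public static void main(String[] args) throws Exception {
> >>>     CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181");
> >>>     client.setDefaultCollection("collection1");
> >>>
> >>>     List<SolrInputDocument> docs = new ArrayList<>();
> >>>     for (int i = 0; i < 100000; i++) {       // stand-in for reading your XML source
> >>>       SolrInputDocument doc = new SolrInputDocument();
> >>>       doc.addField("id", "doc-" + i);
> >>>       docs.add(doc);
> >>>       if (docs.size() == 1000) {
> >>>         client.add(docs);                    // <-- comment out just this line
> >>>         docs.clear();
> >>>       }
> >>>     }
> >>>     if (!docs.isEmpty()) {
> >>>       client.add(docs);                      // <-- and this one
> >>>     }
> >>>     client.close();
> >>>   }
> >>> }
> >>>
> >>> If the run is still slow with the add() calls commented out, the
> >>> bottleneck is on the acquisition side; if it is fast, it is Solr.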
> >>>
> >>> Then CloudSolrClient (assuming SolrCloud) has efficiencies in terms of
> >>> routing...
> >>>
> >>> Best
> >>> Erick
> >>>
> >>> On Jul 27, 2016 7:24 AM, "Bernd Fehling" <bernd.fehl...@uni-bielefeld.de> wrote:
> >>>
> >>>> So writing some SolrJ code that does the same job as the DIH script,
> >>>> and running it concurrently, will solve my problem?
> >>>> I'm not using Tika.
> >>>>
> >>>> I don't think that DIH is my problem, even if it is not the best
> >>>> solution right now.
> >>>> Nevertheless, you are right that SolrJ has higher performance, but what
> >>>> if I have the same problems with SolrJ as with DIH?
> >>>>
> >>>> If it runs with DIH, it should run with SolrJ with an additional
> >>>> performance boost.
> >>>>
> >>>> Bernd
> >>>>
> >>>>
> >>>> On 27.07.2016 at 16:03, Erick Erickson wrote:
> >>>>> I'd actually recommend you move to a SolrJ solution
> >>>>> or something similar. Currently, you're putting extra load on the Solr
> >>>>> servers (especially if you're also using Tika) on top of all
> >>>>> the indexing work itself.
> >>>>>
> >>>>> Here's a sample:
> >>>>> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
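> >>>>>
> >>>>> Not taken from that post, but for the general shape, a bulk loader with
> >>>>> client-side queueing and threads might look roughly like this. The class
> >>>>> is ConcurrentUpdateSolrClient in SolrJ 5.x (ConcurrentUpdateSolrServer in
> >>>>> 4.x); URL, queue size, thread count and document count are placeholders.
> >>>>>
> >>>>> import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
> >>>>> import org.apache.solr.common.SolrInputDocument;
> >>>>>
> >>>>> public class BulkLoaderSketch {
> >>>>>   public static void main(String[] args) throws Exception {
> >>>>>     // Client-side queue drained by 8 background threads, so the
> >>>>>     // parsing/transform work stays on the client machine.
> >>>>>     ConcurrentUpdateSolrClient client =
> >>>>>         new ConcurrentUpdateSolrClient("http://localhost:8983/solr/base", 10000, 8);
> >>>>>     try {
> >>>>>       for (int i = 0; i < 1000000; i++) {   // stand-in for parsing the source XML
> >>>>>         SolrInputDocument doc = new SolrInputDocument();
> >>>>>         doc.addField("id", "rec-" + i);
> >>>>>         client.add(doc);                    // returns quickly; sending happens in the pool
> >>>>>       }
> >>>>>       client.blockUntilFinished();          // wait for the queue to drain
> >>>>>       client.commit();                      // single commit at the end of the load
> >>>>>     } finally {
> >>>>>       client.close();
> >>>>>     }
> >>>>>   }
> >>>>> }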
> >>>>>
> >>>>> Dodging the question, I know, but DIH sometimes isn't
> >>>>> the best solution.
> >>>>>
> >>>>> Best,
> >>>>> Erick
> >>>>>
> >>>>> On Wed, Jul 27, 2016 at 6:59 AM, Bernd Fehling
> >>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
> >>>>>> After enhancing the server with SSDs, I'm trying to speed up indexing.
> >>>>>>
> >>>>>> The server has 16 CPUs and more than 100G RAM.
> >>>>>> JAVA (1.8.0_92) has 24G.
> >>>>>> SOLR is 4.10.4.
> >>>>>> Plain XML data to load is 218G with about 96M records.
> >>>>>> This will result in a single index of 299G.
> >>>>>>
> >>>>>> I tried with 4, 8, 12 and 16 concurrent DIHs.
> >>>>>> 16 and 12 were too much for the 16 CPUs, so my test continued
> >>>>>> with 8 concurrent DIHs.
> >>>>>> Then I tried different <indexConfig> and <updateHandler> settings,
> >>>>>> but now I'm stuck.
> >>>>>> I can't figure out what the best settings for bulk indexing are.
> >>>>>> What I see is that the indexing is "falling asleep" after some time.
> >>>>>> It is then only producing del files, like _11_1.del, _w_2.del, _h_3.del, ...
> >>>>>>
> >>>>>> <indexConfig>
> >>>>>>     <maxIndexingThreads>8</maxIndexingThreads>
> >>>>>>     <ramBufferSizeMB>1024</ramBufferSizeMB>
> >>>>>>     <maxBufferedDocs>-1</maxBufferedDocs>
> >>>>>>     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
> >>>>>>       <int name="maxMergeAtOnce">8</int>
> >>>>>>       <int name="segmentsPerTier">100</int>
> >>>>>>       <int name="maxMergedSegmentMB">512</int>
> >>>>>>     </mergePolicy>
> >>>>>>     <mergeFactor>8</mergeFactor>
> >>>>>>     <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
> >>>>>>     <lockType>${solr.lock.type:native}</lockType>
> >>>>>>     ...
> >>>>>> </indexConfig>
> >>>>>>
> >>>>>> <updateHandler class="solr.DirectUpdateHandler2">
> >>>>>>      <!-- no autocommit at all -->
> >>>>>>      <autoSoftCommit>
> >>>>>>        <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> >>>>>>      </autoSoftCommit>
> >>>>>> </updateHandler>
> >>>>>>
> >>>>>> command=full-import&optimize=false&clean=false&commit=false&waitSearcher=false
> >>>>>> After indexing finishes there is a final optimize.
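> >>>>>>
> >>>>>> For reference, from SolrJ the same full-import call would look roughly
> >>>>>> like this (only a sketch: SolrJ 5.x names, and the core URL and the
> >>>>>> /dataimport handler path are placeholders for my setup):
> >>>>>>
> >>>>>> import org.apache.solr.client.solrj.SolrClient;
> >>>>>> import org.apache.solr.client.solrj.impl.HttpSolrClient;
> >>>>>> import org.apache.solr.client.solrj.request.QueryRequest;
> >>>>>> import org.apache.solr.common.params.ModifiableSolrParams;
> >>>>>>
> >>>>>> public class TriggerDih {
> >>>>>>   public static void main(String[] args) throws Exception {
> >>>>>>     SolrClient client = new HttpSolrClient("http://localhost:8983/solr/base");
> >>>>>>
> >>>>>>     ModifiableSolrParams params = new ModifiableSolrParams();
> >>>>>>     params.set("command", "full-import");
> >>>>>>     params.set("optimize", "false");
> >>>>>>     params.set("clean", "false");
> >>>>>>     params.set("commit", "false");
> >>>>>>     params.set("waitSearcher", "false");
> >>>>>>
> >>>>>>     QueryRequest req = new QueryRequest(params);
> >>>>>>     req.setPath("/dataimport");   // wherever the DIH handler is registered
> >>>>>>     client.request(req);
> >>>>>>
> >>>>>>     client.close();
> >>>>>>   }
> >>>>>> }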
> >>>>>>
> >>>>>> My idea is: if 8 DIHs use 8 CPUs, then I have 8 CPUs left for merging
> >>>>>> (maxIndexingThreads/maxMergeAtOnce/mergeFactor).
> >>>>>> It should do no commit and no optimize.
> >>>>>> ramBufferSizeMB is high because I have plenty of RAM and I want to make
> >>>>>> use of the speed of RAM.
> >>>>>> segmentsPerTier is high to reduce merging.
> >>>>>>
> >>>>>> But somewhere there is a misconfiguration, because the indexing stalls.
> >>>>>>
> >>>>>> Any idea what's going wrong?
> >>>>>>
> >>>>>>
> >>>>>> Bernd
> >>>>>>
> >>>>
> >>>
> >>
> >>
> >
> >
> >
>
> --
> *************************************************************
> Bernd Fehling                    Bielefeld University Library
> Dipl.-Inform. (FH)                LibTec - Library Technology
> Universitätsstr. 25                  and Knowledge Management
> 33615 Bielefeld
> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
>
> BASE - Bielefeld Academic Search Engine - www.base-search.net
> *************************************************************
>



-- 
Sincerely yours
Mikhail Khludnev
