These deletes seem really puzzling to me. Can you experiment with commenting out uniqueKey in schema.xml? My expectation is that the deletes should go away after that.
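For reference, a minimal sketch of that experiment, assuming the uniqueKey field is called "id" (adjust to whatever the schema actually declares):

<!-- schema.xml: temporarily disable the uniqueKey declaration for this test -->
<!-- <uniqueKey>id</uniqueKey> -->

Without a uniqueKey, Solr no longer looks up and deletes older versions of each incoming document, so the long applyDeletes phases visible in the INFOSTREAM below should disappear. Re-running an import would then create duplicates, so this is meant only as a diagnostic.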
On Tue, Aug 2, 2016 at 4:50 PM, Bernd Fehling <bernd.fehl...@uni-bielefeld.de> wrote:

> Hi Mikhail,
>
> there are no deletes at all from my point of view.
> All records have unique ids.
> No sharding at all, it is a single index, and it is certified
> that all DIHs get different data to load and no record is
> sent twice to any DIH participating in the concurrent loading.
>
> The only assumption so far: DIH is sending the records as "update"
> (and not as a pure "add") to the indexer, which will generate delete
> files during merge. If the number of segments is high it will
> take quite long to merge and check all records of all segments.
>
> I'm currently setting up SOLR 5.5.3, but that takes a while.
> I also located an "overwrite" parameter somewhere in DIH which
> would force an "add" instead of an "update" to the index, but
> I couldn't figure out how to set that parameter with the command.
>
> Bernd
>
>
> On 02.08.2016 at 15:15, Mikhail Khludnev wrote:
> > Bernd,
> > But why do you have so many deletes? Is it expected?
> > When you run DIHs concurrently, do you shard the input data by uniqueKey?
> >
> > On Wed, Jul 27, 2016 at 6:20 PM, Bernd Fehling <
> > bernd.fehl...@uni-bielefeld.de> wrote:
> >
> >> If there is a problem in a single index then it might also be in CloudSolr.
> >> As far as I could figure out from the INFOSTREAM, documents are added to
> >> segments and terms are "collected". Duplicate terms are "deleted" (or whatever).
> >> These deletes (or whatever) are not concurrent.
> >> I have lines like:
> >> BD 0 [Wed Jul 27 13:28:48 GMT+01:00 2016; Thread-27879]: applyDeletes: infos=...
> >> BD 0 [Wed Jul 27 13:31:48 GMT+01:00 2016; Thread-27879]: applyDeletes took 180028 msec
> >> ...
> >> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
> >> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec
> >>
> >> 3411845 msec is about 56 minutes during which the system is doing what???
> >> At least not indexing, because there is only one JAVA process and no I/O at all!
> >>
> >> How can SolrJ help me now with this problem?
> >>
> >> Best
> >> Bernd
> >>
> >>
> >> On 27.07.2016 at 16:41, Erick Erickson wrote:
> >>> Well, at least it'll be easier to debug, in my experience. Simple example:
> >>> at some point you'll call CloudSolrClient.add(doc list). Comment just that
> >>> out and you'll be able to isolate whether the issue is querying the back
> >>> end or sending to Solr.
> >>>
> >>> Then CloudSolrClient (assuming SolrCloud) has efficiencies in terms of
> >>> routing...
> >>>
> >>> Best
> >>> Erick
> >>>
> >>> On Jul 27, 2016 7:24 AM, "Bernd Fehling" <bernd.fehl...@uni-bielefeld.de>
> >>> wrote:
> >>>
> >>>> So writing some SolrJ code doing the same job as the DIH script
> >>>> and running it concurrently will solve my problem?
> >>>> I'm not using Tika.
> >>>>
> >>>> I don't think that DIH is my problem, even if it is not the best solution
> >>>> right now.
> >>>> Nevertheless, you are right that SolrJ has higher performance, but what
> >>>> if I have the same problems with SolrJ as with DIH?
> >>>>
> >>>> If it runs with DIH it should run with SolrJ, with an additional
> >>>> performance boost.
> >>>>
> >>>> Bernd
> >>>>
> >>>>
> >>>> On 27.07.2016 at 16:03, Erick Erickson wrote:
> >>>>> I'd actually recommend you move to a SolrJ solution
> >>>>> or similar. Currently, you're putting a load on the Solr
> >>>>> servers (especially if you're also using Tika) in addition
> >>>>> to all the indexing etc.
> >>>>>
> >>>>> Here's a sample:
> >>>>> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
> >>>>>
> >>>>> Dodging the question I know, but DIH sometimes isn't
> >>>>> the best solution.
> >>>>>
> >>>>> Best,
> >>>>> Erick
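To make that pointer concrete, here is a minimal, hedged SolrJ bulk-loading sketch (not the code from the linked post). It assumes SolrJ 5.x class names; in 4.10 the equivalent class is ConcurrentUpdateSolrServer. The URL, core name, field names, queue size and thread count are placeholders, and the loop stands in for parsing the real XML source:

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder Solr URL and core name; a queue of 10000 docs and 8 sender
        // threads, loosely mirroring the 8 concurrent DIHs described below.
        ConcurrentUpdateSolrClient client =
                new ConcurrentUpdateSolrClient("http://localhost:8983/solr/mycore", 10000, 8);
        try {
            // This loop stands in for reading and parsing the real XML source.
            for (int i = 0; i < 1_000_000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "rec-" + i);         // uniqueKey (hypothetical name)
                doc.addField("title_t", "record " + i); // hypothetical field
                client.add(doc);   // comment this out to see whether the bottleneck
                                   // is data acquisition or Solr itself (Erick's tip)
            }
            client.blockUntilFinished(); // wait until all queued batches are sent
            client.commit();             // single commit at the end of the bulk load
        } finally {
            client.close();
        }
    }
}

Per the tip above, commenting out the client.add(doc) call is a quick way to isolate whether the slow part is reading the data or indexing it.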
> >>>>> On Wed, Jul 27, 2016 at 6:59 AM, Bernd Fehling
> >>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
> >>>>>> After enhancing the server with SSDs I'm trying to speed up indexing.
> >>>>>>
> >>>>>> The server has 16 CPUs and more than 100G RAM.
> >>>>>> JAVA (1.8.0_92) has 24G.
> >>>>>> SOLR is 4.10.4.
> >>>>>> The plain XML data to load is 218G with about 96M records.
> >>>>>> This will result in a single index of 299G.
> >>>>>>
> >>>>>> I tried with 4, 8, 12 and 16 concurrent DIHs.
> >>>>>> 16 and 12 were too much for 16 CPUs, so my tests continued with 8
> >>>>>> concurrent DIHs.
> >>>>>> Then I tried different <indexConfig> and <updateHandler> settings,
> >>>>>> but now I'm stuck.
> >>>>>> I can't figure out the best settings for bulk indexing.
> >>>>>> What I see is that the indexing is "falling asleep" after some time
> >>>>>> of indexing.
> >>>>>> It is only producing del-files, like _11_1.del, _w_2.del, _h_3.del, ...
> >>>>>>
> >>>>>> <indexConfig>
> >>>>>>   <maxIndexingThreads>8</maxIndexingThreads>
> >>>>>>   <ramBufferSizeMB>1024</ramBufferSizeMB>
> >>>>>>   <maxBufferedDocs>-1</maxBufferedDocs>
> >>>>>>   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
> >>>>>>     <int name="maxMergeAtOnce">8</int>
> >>>>>>     <int name="segmentsPerTier">100</int>
> >>>>>>     <int name="maxMergedSegmentMB">512</int>
> >>>>>>   </mergePolicy>
> >>>>>>   <mergeFactor>8</mergeFactor>
> >>>>>>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
> >>>>>>   <lockType>${solr.lock.type:native}</lockType>
> >>>>>>   ...
> >>>>>> </indexConfig>
> >>>>>>
> >>>>>> <updateHandler class="solr.DirectUpdateHandler2">
> >>>>>>   <!-- no autoCommit at all -->
> >>>>>>   <autoSoftCommit>
> >>>>>>     <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> >>>>>>   </autoSoftCommit>
> >>>>>> </updateHandler>
> >>>>>>
> >>>>>> command=full-import&optimize=false&clean=false&commit=false&waitSearcher=false
> >>>>>>
> >>>>>> After indexing finishes there is a final optimize.
> >>>>>>
> >>>>>> My idea is: if 8 DIHs use 8 CPUs then I have 8 CPUs left for merging
> >>>>>> (maxIndexingThreads/maxMergeAtOnce/mergeFactor).
> >>>>>> It should do no commit and no optimize.
> >>>>>> ramBufferSizeMB is high because I have plenty of RAM and I want to
> >>>>>> make use of the speed of RAM.
> >>>>>> segmentsPerTier is high to reduce merging.
> >>>>>>
> >>>>>> But somewhere there is a misconfiguration because indexing gets stalled.
> >>>>>>
> >>>>>> Any idea what's going wrong?
> >>>>>>
> >>>>>>
> >>>>>> Bernd
> >>>>>>
> >>>>
> >>>
> >>
> >> --
> >> *************************************************************
> >> Bernd Fehling              Bielefeld University Library
> >> Dipl.-Inform. (FH)         LibTec - Library Technology
> >> Universitätsstr. 25        and Knowledge Management
> >> 33615 Bielefeld
> >> Tel. +49 521 106-4060      bernd.fehling(at)uni-bielefeld.de
> >>
> >> BASE - Bielefeld Academic Search Engine - www.base-search.net
> >> *************************************************************
> >>
> >
> >
> >
>
> --
> *************************************************************
> Bernd Fehling              Bielefeld University Library
> Dipl.-Inform. (FH)         LibTec - Library Technology
> Universitätsstr. 25        and Knowledge Management
> 33615 Bielefeld
> Tel. +49 521 106-4060      bernd.fehling(at)uni-bielefeld.de
>
> BASE - Bielefeld Academic Search Engine - www.base-search.net
> *************************************************************
>
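One note on the "overwrite" parameter mentioned above: whether DIH exposes it as a request parameter is left open in this thread, but for reference the plain XML update format (posting directly to /update, outside DIH) does accept overwrite as an attribute on <add>. A minimal sketch with hypothetical field values:

<!-- Posted directly to /update; overwrite="false" skips the uniqueKey lookup,
     so no deletes are generated, but re-added ids become real duplicates. -->
<add overwrite="false">
  <doc>
    <field name="id">rec-1</field>
    <field name="title_t">record 1</field>
  </doc>
</add>

That trades the applyDeletes cost for the risk of duplicate documents, which is only safe for a one-off bulk load of records that are guaranteed unique, as described above.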
--
Sincerely yours
Mikhail Khludnev