My experience with DIH was that we couldn't scale to the level we wanted. SolrJ with multi-threading and batch updates (parallel threads pushing data into Solr) worked, and we were able to ingest 5K-10K docs per second.
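A minimal sketch of the kind of multi-threaded, batched SolrJ loader described above, assuming a standalone Solr core at http://localhost:8983/solr/mycore, SolrJ 6.x or later, and placeholder field names, thread count, batch size, and document count (none of these values come from this thread):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    private static final String SOLR_URL = "http://localhost:8983/solr/mycore"; // assumed
    private static final int THREADS = 8;       // assumed
    private static final int BATCH_SIZE = 1000; // assumed

    public static void main(String[] args) throws Exception {
        // Parallel worker threads, each pushing its own slice of the data.
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (int t = 0; t < THREADS; t++) {
            final int shard = t;
            pool.submit(() -> indexSlice(shard));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);

        // One explicit commit after all workers finish, instead of per-batch commits.
        try (HttpSolrClient client = new HttpSolrClient.Builder(SOLR_URL).build()) {
            client.commit();
        }
    }

    // Each worker owns its own client and sends documents in batches, no commits.
    private static void indexSlice(int shard) {
        try (HttpSolrClient client = new HttpSolrClient.Builder(SOLR_URL).build()) {
            List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
            for (long i = shard; i < 1_000_000; i += THREADS) { // placeholder data source
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("title_s", "document " + i);
                batch.add(doc);
                if (batch.size() >= BATCH_SIZE) {
                    client.add(batch);   // batched update
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Autocommit stays off and a single explicit commit is issued at the end, which matches the bulk-loading setup discussed further down in this thread.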
Thanks,
Susheel

On Tue, Aug 2, 2016 at 9:15 AM, Mikhail Khludnev <m...@apache.org> wrote:

> Bernd,
> But why do you have so many deletes? Is it expected?
> When you run DIHs concurrently, do you shard the input data by uniqueKey?
>
> On Wed, Jul 27, 2016 at 6:20 PM, Bernd Fehling
> <bernd.fehl...@uni-bielefeld.de> wrote:
>
> > If there is a problem in a single index then it might also be in SolrCloud.
> > As far as I could figure out from INFOSTREAM, documents are added to
> > segments and terms are "collected". Duplicate terms are "deleted" (or
> > whatever). These deletes (or whatever) are not concurrent.
> > I have lines like:
> > BD 0 [Wed Jul 27 13:28:48 GMT+01:00 2016; Thread-27879]: applyDeletes: infos=...
> > BD 0 [Wed Jul 27 13:31:48 GMT+01:00 2016; Thread-27879]: applyDeletes took 180028 msec
> > ...
> > BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
> > BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec
> >
> > 3411845 msec is about 56 minutes during which the system is doing what???
> > At least it's not indexing, because there is only one Java process and no I/O at all!
> >
> > How can SolrJ help me now with this problem?
> >
> > Best
> > Bernd
> >
> > On 27.07.2016 at 16:41, Erick Erickson wrote:
> > > Well, at least it'll be easier to debug, in my experience. Simple example:
> > > at some point you'll call CloudSolrClient.add(doc list). Comment just that
> > > out and you'll be able to isolate whether the issue is querying the back
> > > end or sending to Solr.
> > >
> > > Then CloudSolrClient (assuming SolrCloud) has efficiencies in terms of
> > > routing...
> > >
> > > Best
> > > Erick
> > >
> > > On Jul 27, 2016 7:24 AM, "Bernd Fehling" <bernd.fehl...@uni-bielefeld.de>
> > > wrote:
> > >
> > > > So writing some SolrJ doing the same job as the DIH script
> > > > and using that concurrently will solve my problem?
> > > > I'm not using Tika.
> > > >
> > > > I don't think that DIH is my problem, even if it is not the best
> > > > solution right now.
> > > > Nevertheless, you are right that SolrJ has higher performance, but what
> > > > if I have the same problems with SolrJ as with DIH?
> > > >
> > > > If it runs with DIH it should run with SolrJ, with an additional
> > > > performance boost.
> > > >
> > > > Bernd
> > > >
> > > > On 27.07.2016 at 16:03, Erick Erickson wrote:
> > > > > I'd actually recommend you move to a SolrJ solution
> > > > > or similar. Currently, you're putting a load on the Solr
> > > > > servers (especially if you're also using Tika) in addition
> > > > > to all the indexing etc.
> > > > >
> > > > > Here's a sample:
> > > > > https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
> > > > >
> > > > > Dodging the question, I know, but DIH sometimes isn't
> > > > > the best solution.
> > > > >
> > > > > Best,
> > > > > Erick
> > > > >
> > > > > On Wed, Jul 27, 2016 at 6:59 AM, Bernd Fehling
> > > > > <bernd.fehl...@uni-bielefeld.de> wrote:
> > > > > > After enhancing the server with SSDs I'm trying to speed up indexing.
> > > > > >
> > > > > > The server has 16 CPUs and more than 100G RAM.
> > > > > > Java (1.8.0_92) has 24G.
> > > > > > Solr is 4.10.4.
> > > > > > The plain XML data to load is 218G with about 96M records.
> > > > > > This will result in a single index of 299G.
> > > > > >
> > > > > > I tried with 4, 8, 12 and 16 concurrent DIHs.
> > > > > > 16 and 12 were too much for 16 CPUs, so my test continued with 8
> > > > > > concurrent DIHs.
> > > > > > Then I was trying different <indexConfig> and <updateHandler>
> > > > > > settings but now I'm stuck.
> > > > > > I can't figure out what the best settings for bulk indexing are.
> > > > > > What I see is that the indexing is "falling asleep" after some time
> > > > > > of indexing.
> > > > > > It is only producing del-files, like _11_1.del, _w_2.del, _h_3.del, ...
> > > > > >
> > > > > > <indexConfig>
> > > > > >   <maxIndexingThreads>8</maxIndexingThreads>
> > > > > >   <ramBufferSizeMB>1024</ramBufferSizeMB>
> > > > > >   <maxBufferedDocs>-1</maxBufferedDocs>
> > > > > >   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
> > > > > >     <int name="maxMergeAtOnce">8</int>
> > > > > >     <int name="segmentsPerTier">100</int>
> > > > > >     <int name="maxMergedSegmentMB">512</int>
> > > > > >   </mergePolicy>
> > > > > >   <mergeFactor>8</mergeFactor>
> > > > > >   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
> > > > > >   <lockType>${solr.lock.type:native}</lockType>
> > > > > >   ...
> > > > > > </indexConfig>
> > > > > >
> > > > > > <updateHandler class="solr.DirectUpdateHandler2">
> > > > > >   <!-- no autocommit at all -->
> > > > > >   <autoSoftCommit>
> > > > > >     <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> > > > > >   </autoSoftCommit>
> > > > > > </updateHandler>
> > > > > >
> > > > > > command=full-import&optimize=false&clean=false&commit=false&waitSearcher=false
> > > > > >
> > > > > > After indexing finishes there is a final optimize.
> > > > > >
> > > > > > My idea is: if 8 DIHs use 8 CPUs then I have 8 CPUs left for merging
> > > > > > (maxIndexingThreads/maxMergeAtOnce/mergeFactor).
> > > > > > It should do no commit and no optimize.
> > > > > > ramBufferSizeMB is high because I have plenty of RAM and I want to
> > > > > > make use of the speed of RAM.
> > > > > > segmentsPerTier is high to reduce merging.
> > > > > >
> > > > > > But somewhere there is a misconfiguration because indexing gets stalled.
> > > > > >
> > > > > > Any idea what's going wrong?
> > > > > >
> > > > > > Bernd
> >
> > --
> > *************************************************************
> > Bernd Fehling                    Bielefeld University Library
> > Dipl.-Inform. (FH)                LibTec - Library Technology
> > Universitätsstr. 25                  and Knowledge Management
> > 33615 Bielefeld
> > Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
> >
> > BASE - Bielefeld Academic Search Engine - www.base-search.net
> > *************************************************************
>
>
> --
> Sincerely yours
> Mikhail Khludnev
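For reference, a minimal sketch of the batched CloudSolrClient.add(doc list) usage Erick mentions above. The ZooKeeper address, collection name, field names, and batch/document counts are assumptions, and the Builder signature shown assumes SolrJ 7.x or 8.x:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CloudBatchIndexer {
    public static void main(String[] args) throws Exception {
        // ZooKeeper ensemble and collection name are assumed placeholders.
        List<String> zkHosts = Collections.singletonList("localhost:2181");
        try (CloudSolrClient client =
                 new CloudSolrClient.Builder(zkHosts, Optional.empty()).build()) {
            client.setDefaultCollection("collection1");

            List<SolrInputDocument> batch = new ArrayList<>();
            for (int i = 0; i < 10_000; i++) {          // placeholder data source
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("title_s", "document " + i);
                batch.add(doc);
                if (batch.size() == 1000) {
                    client.add(batch);  // CloudSolrClient routes docs to their shard leaders
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch);
            }
            client.commit();            // one explicit commit at the end
        }
    }
}

Because CloudSolrClient is ZooKeeper-aware, it sends each document in a batch directly to the correct shard leader, which is the routing efficiency referred to in the thread.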