Re: problems with bulk indexing with concurrent DIH

2016-08-08 Thread Shawn Heisey
On 8/2/2016 7:50 AM, Bernd Fehling wrote: > Only assumption so far, DIH is sending the records as "update" (and > not pure "add") to the indexer which will generate delete files during > merge. If the number of segments is high it will take quite long to > merge and check all records of all segment

Re: problems with bulk indexing with concurrent DIH

2016-08-04 Thread Bernd Fehling
After updating to version 5.5.3 it looks good now. I think LUCENE-6161 has fixed my problem. Nevertheless, after updating my development system and recompiling my plugins I will have a look at DIH regarding the "update" and also your advice about the uniqueKey. Best regards Bernd On 02.08.2016 at 16:

Re: problems with bulk indexing with concurrent DIH

2016-08-02 Thread Bernd Fehling
Hi Shalin, yes I'm going to set up 5.5.3 to see how that behaves. Michael McCandless gave me the hint about LUCENE-6161. We will see... :-) On 02.08.2016 at 16:31, Shalin Shekhar Mangar wrote: > Hi Bernd, > > I think you are running into > https://issues.apache.org/jira/browse/LUCENE-6161. Can

Re: problems with bulk indexing with concurrent DIH

2016-08-02 Thread Shalin Shekhar Mangar
Hi Bernd, I think you are running into https://issues.apache.org/jira/browse/LUCENE-6161. Can you upgrade to 5.1 or newer? On Wed, Jul 27, 2016 at 7:29 PM, Bernd Fehling < bernd.fehl...@uni-bielefeld.de> wrote: > After enhancing the server with SSDs I'm trying to speed up indexing. > > The serve

Re: problems with bulk indexing with concurrent DIH

2016-08-02 Thread Mikhail Khludnev
These deletes seem really puzzling to me. Can you experiment with commenting out uniqueKey in schema.xml? My expectation is that the deletes should go away after that. On Tue, Aug 2, 2016 at 4:50 PM, Bernd Fehling < bernd.fehl...@uni-bielefeld.de> wrote: > Hi Mikhail, > > there are no deletes at all from my

Re: problems with bulk indexing with concurrent DIH

2016-08-02 Thread Bernd Fehling
Well, concurrent DIH is very simple, just one little shell script :-) 5K-10K docs per second says nothing on its own. Is the data just pushed plain into the index, or does it have a complex schema with many analyzers? Bernd On 02.08.2016 at 15:44, Susheel Kumar wrote: > My experience with DIH was we couldn't scale t
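Bernd's "one little shell script" is not shown in the thread, but a minimal launcher for concurrent DIH full-imports might look like the sketch below. The URL, core name, and handler names (dih1..dihN, which would have to be defined as separate DataImportHandler request handlers in solrconfig.xml) are assumptions, not details from the thread.

```shell
#!/bin/sh
# Hypothetical sketch: kick off N DIH full-imports in parallel, one per
# dataimport handler. Handler names and URL are assumptions.
SOLR_CORE="http://localhost:8983/solr/mycore"
N=4

dih_url() {
  # Full-import URL for DIH handler number $1; clean=false so each
  # import adds to the index instead of wiping it first.
  echo "${SOLR_CORE}/dih$1?command=full-import&clean=false&commit=false"
}

i=1
while [ "$i" -le "$N" ]; do
  curl -s "$(dih_url "$i")" >/dev/null &   # start each import in background
  i=$((i + 1))
done
wait  # DIH returns immediately; a real script would poll command=status
```

Each handler would also need its own data source configuration so the four imports read disjoint slices of the input, which is exactly the point Bernd makes above about no record being sent twice.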

Re: problems with bulk indexing with concurrent DIH

2016-08-02 Thread Bernd Fehling
Hi Mikhail, there are no deletes at all from my point of view. All records have unique IDs. No sharding at all, it is a single index, and it is ensured that all DIHs get different data to load and no record is sent twice to any DIH participating in the concurrent loading. Only assumption so far, D

Re: problems with bulk indexing with concurrent DIH

2016-08-02 Thread Susheel Kumar
My experience with DIH was that we couldn't scale to the level we wanted. SolrJ with multi-threading & batch updates (parallel threads pushing data into Solr) worked, and we were able to ingest 5K-10K docs per second. Thanks, Susheel On Tue, Aug 2, 2016 at 9:15 AM, Mikhail Khludnev wrote: > Bernd, > Bu
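Susheel's approach (parallel workers pushing batched updates via SolrJ) can be approximated from the shell with `xargs -P` and curl, which is a rough stand-in, not his actual code. The URL, file names, and batch size below are assumptions; the batch-*.xml files are assumed to each be a complete Solr XML `<add>...</add>` block.

```shell
#!/bin/sh
# Sketch only: parallel batched posts to the /update handler with curl
# instead of SolrJ. All names and sizes here are assumptions.
SOLR_UPDATE="http://localhost:8983/solr/mycore/update"

batches_needed() {
  # ceil(total_docs / batch_size): how many batch files a dump yields
  echo $(( ($1 + $2 - 1) / $2 ))
}
# e.g. a 96M-record dump at 1000 docs per batch -> 96000 batch files

# Post all batches with up to 8 parallel workers; commitWithin lets
# Solr schedule commits instead of committing on every request.
ls batch-*.xml 2>/dev/null | xargs -P 8 -I{} \
  curl -s -H 'Content-Type: text/xml' \
       --data-binary @{} "${SOLR_UPDATE}?commitWithin=60000"
```

The real SolrJ equivalent would, as Erick mentions later in the thread, batch documents into a list and hand them to the client's add(doc list) call from several threads.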

Re: problems with bulk indexing with concurrent DIH

2016-08-02 Thread Mikhail Khludnev
Bernd, But why do you have so many deletes? Is it expected? When you run DIHs concurrently, do you shard input data by uniqueKey? On Wed, Jul 27, 2016 at 6:20 PM, Bernd Fehling < bernd.fehl...@uni-bielefeld.de> wrote: > If there is a problem in single index then it might also be in CloudSolr. >
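Sharding input data by uniqueKey, as Mikhail asks about, just means routing each record to exactly one importer by a deterministic function of its id, so no two concurrent loaders can ever touch the same key. A sketch of one way to do that (using `cksum` as the hash is my choice, not something from the thread):

```shell
#!/bin/sh
# Sketch: deterministic shard assignment by uniqueKey, so each record
# goes to exactly one of N concurrent importers.
shard_of() {
  # shard number in 0..N-1 for key $1 with $2 shards
  sum=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo $(( sum % $2 ))
}

# Hypothetical usage: split a tab-separated "id<TAB>rest" dump into one
# input file per importer.
# while IFS="$(printf '\t')" read -r id rest; do
#   printf '%s\t%s\n' "$id" "$rest" >> "input-shard-$(shard_of "$id" 4)"
# done < all-records.tsv
```

Because the assignment is a pure function of the key, re-running the split always produces the same partitioning, which makes "no record is sent twice to any DIH" verifiable rather than just asserted.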

Re: problems with bulk indexing with concurrent DIH

2016-07-27 Thread Bernd Fehling
If there is a problem in a single index then it might also be in CloudSolr. As far as I could figure out from the INFOSTREAM, documents are added to segments and terms are "collected". Duplicate terms are "deleted" (or whatever). These deletes (or whatever) are not concurrent. I have lines like: BD 0 [W

Re: problems with bulk indexing with concurrent DIH

2016-07-27 Thread Erick Erickson
Well, at least it'll be easier to debug, in my experience. Simple example: at some point you'll call CloudSolrClient.add(doc list). Comment just that out and you'll be able to isolate whether the issue is querying the DB or sending to Solr. Then CloudSolrClient (assuming SolrCloud) has efficiencies

Re: problems with bulk indexing with concurrent DIH

2016-07-27 Thread Bernd Fehling
So writing some SolrJ doing the same job as the DIH script and using that concurrently will solve my problem? I'm not using Tika. I don't think that DIH is my problem, even if it is not the best solution right now. Nevertheless, you are right that SolrJ has higher performance, but what if I have the sam

Re: problems with bulk indexing with concurrent DIH

2016-07-27 Thread Erick Erickson
I'd actually recommend you move to a SolrJ solution or similar. Currently, you're putting a load on the Solr servers (especially if you're also using Tika) in addition to all the indexing etc. Here's a sample: https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/ Dodging the question I know, bu

problems with bulk indexing with concurrent DIH

2016-07-27 Thread Bernd Fehling
After enhancing the server with SSDs I'm trying to speed up indexing. The server has 16 CPUs and more than 100G RAM. Java (1.8.0_92) has 24G. Solr is 4.10.4. The plain XML data to load is 218G with about 96M records. This will result in a single index of 299G. I tried with 4, 8, 12 and 16 concurrent