On 8/2/2016 7:50 AM, Bernd Fehling wrote:
> My only assumption so far is that DIH is sending the records as "update" (and
> not pure "add") to the indexer, which will generate delete files during
> merge. If the number of segments is high it will take quite a long time to
> merge and check all records of all segments.
After updating to version 5.5.3 it looks good now.
I think LUCENE-6161 has fixed my problem.
Nevertheless, after updating my development system and recompiling
my plugins I will have a look at DIH regarding the "update" and also
at your advice about the uniqueKey.
Best regards
Bernd
On 02.08.2016 at 16:
Hi Shalin,
yes, I'm going to set up 5.5.3 to see how it behaves.
Michael McCandless gave me the hint about LUCENE-6161.
We will see... :-)
On 02.08.2016 at 16:31, Shalin Shekhar Mangar wrote:
> Hi Bernd,
>
> I think you are running into
> https://issues.apache.org/jira/browse/LUCENE-6161. Can
Hi Bernd,
I think you are running into
https://issues.apache.org/jira/browse/LUCENE-6161. Can you upgrade to 5.1
or newer?
On Wed, Jul 27, 2016 at 7:29 PM, Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:
> After enhancing the server with SSDs I'm trying to speed up indexing.
>
> The serve
These deletes seem really puzzling to me. Can you experiment with
commenting out the uniqueKey in schema.xml? My expectation is that the
deletes should go away after that.
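For example, assuming the uniqueKey field in your schema.xml is named "id"
(adjust to whatever your schema actually uses), the experiment is just
commenting out that one line on a test index:

    <!-- <uniqueKey>id</uniqueKey> -->

Note that this disables update-by-id semantics, so re-imported records will
no longer replace earlier versions of themselves.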
On Tue, Aug 2, 2016 at 4:50 PM, Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:
> Hi Mikhail,
>
> there are no deletes at all from my
Well, concurrent DIH is very simple, just one little shell script :-)
5K-10K docs per second says nothing on its own. Is the data just pushed plainly
into the index, or does it have a complex schema with many analyzers?
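Roughly along these lines (a sketch only, not the actual script; host, core
name, handler names and parameters are placeholders, and each DIH handler's
data-config must point at a different slice of the input):

    #!/bin/sh
    # start several DIH full-imports in parallel, one per request handler,
    # each handler configured with its own slice of the input data
    URL=http://localhost:8983/solr/mycore
    for H in dataimport1 dataimport2 dataimport3 dataimport4; do
      curl "$URL/$H?command=full-import&clean=false&commit=false" &
    done
    wait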
Bernd
On 02.08.2016 at 15:44, Susheel Kumar wrote:
> My experience with DIH was we couldn't scale t
Hi Mikhail,
there are no deletes at all from my point of view.
All records have unique ids.
No sharding at all; it is a single index, and it is guaranteed
that all DIHs get different data to load and that no record is
sent twice to any DIH participating in the concurrent loading.
Only assumption so far, D
My experience with DIH was that we couldn't scale to the level we wanted. SolrJ
with multi-threading & batch updates (parallel threads pushing data into
Solr) worked, and we were able to ingest 5K-10K docs per second.
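Roughly, the pattern looks like this (a sketch only, not the actual code:
URL, core name, field names, thread count and batch size are all
placeholders; SolrJ 5.x HttpSolrClient API assumed):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelIndexer {
        static final int THREADS = 8;        // parallel pusher threads
        static final int BATCH_SIZE = 1000;  // docs per update request

        public static void main(String[] args) throws Exception {
            // HttpSolrClient is thread-safe, so one instance is shared by all threads
            SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");
            ExecutorService pool = Executors.newFixedThreadPool(THREADS);
            for (int t = 0; t < THREADS; t++) {
                final int threadId = t;
                pool.submit(() -> {
                    List<SolrInputDocument> batch = new ArrayList<>();
                    // each thread would read its own slice of the source data; faked here
                    for (int i = 0; i < 100000; i++) {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", threadId + "-" + i);   // unique id per document
                        doc.addField("title_txt", "document " + i);
                        batch.add(doc);
                        if (batch.size() == BATCH_SIZE) {         // send batches, not single docs
                            client.add(batch);
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) {
                        client.add(batch);                        // flush the last partial batch
                    }
                    return null;
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            client.commit();   // one commit at the end instead of per batch
            client.close();
        }
    }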
Thanks,
Susheel
On Tue, Aug 2, 2016 at 9:15 AM, Mikhail Khludnev wrote:
> Bernd,
> Bu
Bernd,
But why do you have so many deletes? Is it expected?
When you run DIHs concurrently, do you shard input data by uniqueKey?
On Wed, Jul 27, 2016 at 6:20 PM, Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:
> If there is a problem in single index then it might also be in CloudSolr.
>
If there is a problem in a single index then it might also be in SolrCloud.
As far as I could figure out from INFOSTREAM, documents are added to segments
and terms are "collected". Duplicate terms are "deleted" (or whatever).
These deletes (or whatever) are not concurrent.
I have lines like:
BD 0 [W
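(The INFOSTREAM output above comes from enabling infoStream in the
indexConfig section of solrconfig.xml, roughly as below; check the stock
config of your Solr version for the exact form.)

    <indexConfig>
      <infoStream>true</infoStream>
    </indexConfig>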
Well, at least it'll be easier to debug in my experience. Simple example:
at some point you'll call CloudSolrClient.add(doc list). Comment just that
out and you'll be able to isolate whether the issue is querying the back end or
sending to Solr.
Then CloudSolrClient (assuming SolrCloud) has efficiencies
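To make that concrete, a minimal sketch of the isolation test (hypothetical
names throughout; the data-acquisition side is faked with generated
documents, zkHost and collection are placeholders, SolrJ 5.x CloudSolrClient
API assumed):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IsolateBottleneck {
        // Flip to true for a second run; comparing the two timings shows whether
        // the time goes into building the documents or into sending them to Solr.
        static final boolean SEND_TO_SOLR = false;

        public static void main(String[] args) throws Exception {
            CloudSolrClient client = new CloudSolrClient("localhost:9983"); // placeholder zkHost
            client.setDefaultCollection("collection1");                     // placeholder collection
            long start = System.currentTimeMillis();
            List<SolrInputDocument> batch = new ArrayList<>();
            for (int i = 0; i < 1000000; i++) {
                batch.add(buildDoc(i));                 // stands in for the real data acquisition
                if (batch.size() == 1000) {
                    if (SEND_TO_SOLR) {
                        client.add(batch);              // the call to comment out / disable
                    }
                    batch.clear();
                }
            }
            System.out.println("took " + (System.currentTimeMillis() - start) + " ms");
            client.close();
        }

        static SolrInputDocument buildDoc(int i) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            return doc;
        }
    }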
So writing some SolrJ doing the same job as the DIH script
and using that concurrently will solve my problem?
I'm not using Tika.
I don't think that DIH is my problem, even if it is not the best solution right
now.
Nevertheless, you are right that SolrJ has higher performance, but what
if I have the sam
I'd actually recommend you move to a SolrJ solution
or similar. Currently, you're putting extra load on the Solr
servers (especially if you're also using Tika) in addition
to all the indexing work itself.
Here's a sample:
https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
Dodging the question I know, bu
After enhancing the server with SSDs I'm trying to speed up indexing.
The server has 16 CPUs and more than 100G RAM.
Java (1.8.0_92) has a 24G heap.
Solr is 4.10.4.
The plain XML data to load is 218G with about 96M records.
This will result in a single index of 299G.
I tried with 4, 8, 12 and 16 concurrent