Re: deduplication of suggester results are not enough

2020-03-26 Thread Michal Hlavac
the suggester related discussions quite a while > ago. Everybody agrees that it is not the expected behaviour from a > Suggester where the terms are the entities and not the documents to return > the same string representation several times. > > One suggestion was to make dedupli

deduplication of suggester results are not enough

2020-03-26 Thread Szűcs Roland
make deduplication on client side of Solr. It is very easy in most client solutions, as any set-based data structure solves this. *But one important problem is not solved by the deduplication: suggest.count*. If I have 15 matches by the suggester and the suggest.count=10 and the first 9 matches are
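
The suggest.count problem raised in this snippet can be illustrated with a small client-side sketch. It assumes the client over-fetches (asks the suggester for more entries than it needs) and then deduplicates while keeping rank order. The function name `dedupe_suggestions` and the sample terms are hypothetical, and nothing here is part of Solr itself.

```python
def dedupe_suggestions(terms, count):
    """Return the first `count` distinct terms, keeping the original rank order."""
    seen = set()
    unique = []
    for term in terms:
        if term not in seen:
            seen.add(term)
            unique.append(term)
            if len(unique) == count:
                break
    return unique


# Hypothetical suggester output: 15 matches, the first 9 identical.
raw = ["solr"] * 9 + ["solr cloud", "solr cell", "solrj", "solr dedup", "solr mlt", "solr dih"]

# With suggest.count=10 the server only returns raw[:10]; deduplication leaves 2 terms.
print(dedupe_suggestions(raw[:10], 10))
# Over-fetching all 15 matches and deduplicating on the client leaves 7 distinct terms.
print(dedupe_suggestions(raw, 10))
```

The trade-off is that the client has to guess how much to over-fetch, which is the gap the thread is pointing at.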

Atomic update deletes deduplication signature

2018-08-09 Thread Thomas Eckart
Hello, I am having trouble when doing atomic updates in combination with SignatureUpdateProcessorFactory (on Solr 7.2). Normal commits of new documents work as expected and generate a valid signature: curl "$URL/update?commit=true" -H 'Content-type:application/json' -d '{"add":{"doc":{"id":

RE: Solr Cloud: query elevation + deduplication?

2018-03-06 Thread Markus Jelsma
SOLR-3473, nobody seems to be working on that. Regards, Markus -Original message- > From:Ronja Koistinen > Sent: Monday 5th March 2018 15:32 > To: solr-user@lucene.apache.org > Subject: Solr Cloud: query elevation + deduplication? > > Hello, > > I am running Solr C

Solr Cloud: query elevation + deduplication?

2018-03-05 Thread Ronja Koistinen
Hello, I am running Solr Cloud 6.6.2 and trying to get query elevation and deduplication (with SignatureUpdateProcessor) working at the same time. The documentation for deduplication (https://lucene.apache.org/solr/guide/6_6/de-duplication.html) does not specify if the signatureField needs to be

Re: Deduplication

2015-05-20 Thread Shalin Shekhar Mangar
On Wed, May 20, 2015 at 12:59 PM, Bram Van Dam wrote: > >> Write a custom update processor and include it in your update chain. > >> You will then have the ability to do anything you want with the entire > >> input document before it hits the code to actually do the indexing. > > This sounded lik

Re: Deduplication

2015-05-20 Thread Alessandro Benedetti
What the Solr de-duplication offers you is to calculate, for each input document, a hash (based on a set of fields). You can then select two options: - Index everything, documents with the same signature will be equal - avoid the overwriting of duplicates. How the similarity hash is calculated is
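
As a rough illustration of the idea in this snippet (a hash computed from a chosen set of fields), the sketch below uses Python's `hashlib`. It is not the code of Solr's SignatureUpdateProcessorFactory; the function `field_signature`, the field names, and the separator byte are assumptions made for illustration only.

```python
import hashlib


def field_signature(doc, fields):
    """Hash the values of the selected fields; other fields do not affect the result."""
    digest = hashlib.md5()
    for name in fields:
        digest.update(str(doc.get(name, "")).encode("utf-8"))
        # A separator keeps ("ab", "c") distinct from ("a", "bc").
        digest.update(b"\x00")
    return digest.hexdigest()


doc_a = {"id": "1", "title": "Hello", "body": "same text"}
doc_b = {"id": "2", "title": "Hello", "body": "same text"}

# The ids differ, but the signature only looks at title and body.
print(field_signature(doc_a, ["title", "body"]) == field_signature(doc_b, ["title", "body"]))
```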

Re: Deduplication

2015-05-20 Thread Bram Van Dam
On 19/05/15 14:47, Alessandro Benedetti wrote: > Hi Bram, > what do you mean with : > " I > would like it to provide the unique value myself, without having the > deduplicator create a hash of field values " . > > This is not de-duplication, but simple document filtering based on a > constraint. >

Re: Deduplication

2015-05-20 Thread Bram Van Dam
>> Write a custom update processor and include it in your update chain. >> You will then have the ability to do anything you want with the entire >> input document before it hits the code to actually do the indexing. This sounded like the perfect option ... until I read Jack's comment: > > My und

Re: Deduplication

2015-05-19 Thread Jack Krupansky
Shawn, I was going to say the same thing, but... then I was thinking about SolrCloud and the fact that update processors are invoked before the document is sent to its target node, so there wouldn't be a reliable way to tell if the input document field value exists on the target rather than current

Re: Deduplication

2015-05-19 Thread Shawn Heisey
On 5/19/2015 3:02 AM, Bram Van Dam wrote: > I'm looking for a way to have Solr reject documents if a certain field > value is duplicated (reject, not overwrite). There doesn't seem to be > any kind of unique option in schema fields. > > The de-duplication feature seems to make this (somewhat) poss

Re: Deduplication

2015-05-19 Thread Alessandro Benedetti
Hi Bram, what do you mean with : " I would like it to provide the unique value myself, without having the deduplicator create a hash of field values " . This is not de-duplication, but simple document filtering based on a constraint. In the case you want de-duplication ( which seemed from your ver

Deduplication

2015-05-19 Thread Bram Van Dam
Hi folks, I'm looking for a way to have Solr reject documents if a certain field value is duplicated (reject, not overwrite). There doesn't seem to be any kind of unique option in schema fields. The de-duplication feature seems to make this (somewhat) possible, but I would like it to provide the

having Solr deduplication and partial update

2014-10-14 Thread Ali Nazemian
Hi, I was wondering how can I have both solr deduplication and partial update. I found out that due to some reasons you can not rely on solr deduplication when you try to update a document partially! It seems that when you do partial update on some field- even if that field does not consider as

Re: any project for record linkage, fuzzy grouping, and deduplication based on Solr/Lucene?

2014-03-17 Thread Jack Krupansky
See: https://cwiki.apache.org/confluence/display/solr/De-Duplication -- Jack Krupansky -Original Message- From: Mobius ReX Sent: Monday, March 17, 2014 1:59 PM To: solr-user@lucene.apache.org Subject: any project for record linkage, fuzzy grouping, and deduplication based on Solr

any project for record linkage, fuzzy grouping, and deduplication based on Solr/Lucene?

2014-03-17 Thread Mobius ReX
o put weights for different matching rules. Any tips to handle such runtime fast deduplication tasks for big data (about 100 million records)? Any open-source project working on this?

Re: Newbie question on Deduplication overWriteDupes flag

2014-02-06 Thread Alexandre Rafalovitch
A follow up question on this (as it is kind of new functionality). What happens if several documents are submitted and one of them fails due to that? Do they get rolled back or only one? Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexand

Re: Newbie question on Deduplication overWriteDupes flag

2014-02-06 Thread Chris Hostetter
: How do I achieve, add if not there, fail if duplicate is found. I though You can use the optimistic concurrency features to do this, by including a _version_=-1 field value in the document. this will instruct solr that the update should only be processed if the document does not already exis
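
The add-if-absent behaviour described in this reply relies on Solr's optimistic concurrency: a negative `_version_` value asks Solr to apply the update only when no document with that id exists. A minimal sketch of building such a JSON update body follows. The helper name `add_if_absent_payload` and the sample document are hypothetical, and sending the request (and handling the conflict response for a duplicate) is left out.

```python
import json


def add_if_absent_payload(doc):
    """Return a JSON update body that asks Solr to add `doc` only if its id is new."""
    guarded = dict(doc)          # copy so the caller's dict is not modified
    guarded["_version_"] = -1    # negative version: the document must not already exist
    return json.dumps([guarded])


print(add_if_absent_payload({"id": "doc1", "title": "first copy"}))
```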

Solr Deduplication use of overWriteDupes flag

2014-02-04 Thread Amit Agrawal
Hello, I had a configuration where I had "overwriteDupes"=false. I added few duplicate documents. Result: I got duplicate documents in the index. When I changed to "overwriteDupes"=true, the duplicate documents started overwriting the older documents. Question 1: How do I achieve, [add if not th

Newbie question on Deduplication overWriteDupes flag

2014-02-04 Thread aagrawal75
found. I thought that "overwriteDupes"=false would do that. -- View this message in context: http://lucene.472066.n3.nabble.com/Newbie-question-on-Deduplication-overWriteDupes-flag-tp4115212.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Custom update handler with deduplication

2013-12-15 Thread Shalin Shekhar Mangar
Firstly, I see that you have overwriteDupes=false in your configuration. This means that a signature will be generated but the similar documents will still be added to the index. Now to your main question about counting duplicate attempts, one simple way is to have another UpdateRequestProcessor af

Custom update handler with deduplication

2013-12-15 Thread Jorge Luis Betancourt González
Currently I've the following Update Request Processor chain to prevent indexing very similar text items into a core dedicated to store queries that our users put into the web interface of our system. true false signature textsuggest,textng org.apache.solr.upd

RE: Pros and Cons of Using Deduplication of Solr at Huge Data Indexing

2013-05-02 Thread Markus Jelsma
Distributed deduplication does not work right now: https://issues.apache.org/jira/browse/SOLR-3473 We've chosen not to use update processors for deduplication anymore and rely on several custom mapreduce jobs in Nutch and some custom collectors in Solr to do some on-demand online deduplic

Pros and Cons of Using Deduplication of Solr at Huge Data Indexing

2013-05-02 Thread Furkan KAMACI
I use Solr 4.2.1 as SolrCloud. I crawl huge data with Nutch and index them with SolrCloud. I wonder about Solr's deduplication mechanism. What exactly does it do, and does it result in slow indexing, or is it beneficial for my situation?

Re: Deduplication in SolrCloud

2012-07-27 Thread Lance Norskog
distributed deduplication: > https://issues.apache.org/jira/browse/SOLR-3473 > > > -Original message- >> From:Daniel Brügge >> Sent: Fri 27-Jul-2012 17:38 >> To: solr-user@lucene.apache.org >> Subject: Deduplication in SolrCloud >> >> Hi, >>

RE: Deduplication in SolrCloud

2012-07-27 Thread Markus Jelsma
This issue doesn't really describe your problem but a more general problem of distributed deduplication: https://issues.apache.org/jira/browse/SOLR-3473 -Original message- > From:Daniel Brügge > Sent: Fri 27-Jul-2012 17:38 > To: solr-user@lucene.apache.org > Subject:

Deduplication in SolrCloud

2012-07-27 Thread Daniel Brügge
Hi, in my old Solr Setup I have used the deduplication feature in the update chain with couple of fields. true signature false uuid,type,url,content_hash org.apache.solr.update.processor.Lookup3Signature This worked fine. When I now use this in my 2 shards SolrCloud setup when

Deduplication in MLT

2012-06-12 Thread Pranav Prakash
I have an implementation of Deduplication as mentioned at http://wiki.apache.org/solr/Deduplication. It is helpful in grouping search results. I would like to achieve the same functionality in my MLT queries, where the result set should include grouped documents. What is a good way to do the same

Re: SolrCloud deduplication

2012-05-21 Thread Mark Miller
On May 21, 2012, at 12:10 PM, Mark Miller wrote: > I think the reason that you see a multiple values error when you try the > other order is because of the lack of a document clone (the other issue I > mentioned a few emails back). Addressing that won't solve your issue though I take that back

RE: SolrCloud deduplication

2012-05-21 Thread Markus Jelsma
https://issues.apache.org/jira/browse/SOLR-3473 -Original message- > From:Mark Miller > Sent: Mon 21-May-2012 18:11 > To: solr-user@lucene.apache.org > Subject: Re: SolrCloud deduplication > > Looking again at the SignatureUpdateProcessor code, I think that in

Re: SolrCloud deduplication

2012-05-21 Thread Mark Miller
r-user@lucene.apache.org; Mark Miller >> Subject: RE: SolrCloud deduplication >> >> Hi, >> >> SOLR-2822 seems to work just fine as long as the SignatureProcessor precedes >> the DistributedProcessor in the update chain. >> >> Thanks, >> Ma

RE: SolrCloud deduplication

2012-05-21 Thread Markus Jelsma
to try? Thanks Markus -Original message- > From:Markus Jelsma > Sent: Mon 21-May-2012 15:58 > To: solr-user@lucene.apache.org; Mark Miller > Subject: RE: SolrCloud deduplication > > Hi, > > SOLR-2822 seems to work just fine as long as the SignatureP

RE: SolrCloud deduplication

2012-05-21 Thread Markus Jelsma
Subject: Re: SolrCloud deduplication > > Hey Markus - > > When I ran into a similar issue with another update proc, I created > https://issues.apache.org/jira/browse/SOLR-3215 so that I could order things > to avoid this. I have not committed this yet though, in favor of waiting fo

RE: SolrCloud deduplication

2012-05-18 Thread Markus Jelsma
you're right. I'll test the patch as soon as possible. Thanks! -Original message- > From:Chris Hostetter > Sent: Fri 18-May-2012 18:20 > To: solr-user@lucene.apache.org > Subject: RE: SolrCloud deduplication > > > : Interesting! I'm watching the

RE: SolrCloud deduplication

2012-05-18 Thread Chris Hostetter
: Interesting! I'm watching the issues and will test as soon as they are committed. FWIW: it's a chicken and egg problem -- if you could test out the patch in SOLR-2822 with your real world use case / configs, and comment on its effectiveness, that would go a long way towards my confidence in

RE: SolrCloud deduplication

2012-05-18 Thread Markus Jelsma
Hi, Interesting! I'm watching the issues and will test as soon as they are committed. Thanks! -Original message- > From:Mark Miller > Sent: Fri 18-May-2012 16:05 > To: solr-user@lucene.apache.org; Markus Jelsma > Subject: Re: SolrCloud deduplication > > H

Re: SolrCloud deduplication

2012-05-18 Thread Mark Miller
? :) On May 18, 2012, at 7:49 AM, Markus Jelsma wrote: > Hi, > > Deduplication on SolrCloud through the SignatureUpdateRequestProcessor is not > functional anymore. The problem is that documents are passed multiple times > through the URP and the digest field is added as if it is

SolrCloud deduplication

2012-05-18 Thread Markus Jelsma
Hi, Deduplication on SolrCloud through the SignatureUpdateRequestProcessor is not functional anymore. The problem is that documents are passed multiple times through the URP and the digest field is added as if it is a multi-valued field. If the field is not multi-valued you'll get

Re: null pointer error with solr deduplication

2012-04-23 Thread Peter Markey
report the error since in this scenario (i guess the components for deduplication are pretty new), it would probably help the devs to make the behavior more deterministic towards duplicate documents. On Sat, Apr 21, 2012 at 2:21 AM, Alexander Aristov < alexander.aris...@gmail.com> wrote: >

Re: null pointer error with solr deduplication

2012-04-23 Thread Mark Miller
he behavior may be non-deterministic. > > So solr behaves as it should :) _unexpectedly_ > > But I agree in that sense that there must be no error especially such as > NPE. > > Best Regards > Alexander Aristov > > > On 21 April 2012 03:42, Peter Markey wrote: > >

Re: null pointer error with solr deduplication

2012-04-21 Thread Alexander Aristov
e in that sense that there must be no error especially such as NPE. Best Regards Alexander Aristov On 21 April 2012 03:42, Peter Markey wrote: > Hello, > > I have been trying out deduplication in solr by following: > http://wiki.apache.org/solr/Deduplication. I have defined a sign

null pointer error with solr deduplication

2012-04-20 Thread Peter Markey
Hello, I have been trying out deduplication in solr by following: http://wiki.apache.org/solr/Deduplication. I have defined a signature field to hold the values of the signature created based on few other fields in a document and the idea seems to work like a charm in a single solr instance. But

Re: Similar documents and advantages / disadvantages of MLT / Deduplication

2011-11-16 Thread Chris Hostetter
: I index 1000 docs, 5 of them are 95% the same (for example: copy pasted : blog articles from different sources, with slight changes (author name, : etc..)). : But they have differences. : *Now i like to see 1 doc in my result set and the other 4 should be marked : as similar.* Do you actually w

Similar documents and advantages / disadvantages of MLT / Deduplication

2011-11-07 Thread Vadim Kisselmann
Hello folks, i have questions about MLT and Deduplication and what would be the best choice in my case. Case: I index 1000 docs, 5 of them are 95% the same (for example: copy pasted blog articles from different sources, with slight changes (author name, etc..)). But they have differences. *Now

Re: A good signature class for deduplication

2011-09-01 Thread Chris Hostetter
/Deduplication ...which one you should choose, and which fields you feed it depend entirely on your goal -- if you want to deduplicate anytime both the "user_fname" and "user_lname" fields are exactly the same, then use those fields with either the MD5Signature or the Lookup3Signa

Re: Solr 3.3. Grouping vs DeDuplication and Deduplication Use Case

2011-08-30 Thread Marc Sturlese
Deduplication uses lucene indexWriter.updateDocument using the signature term. I don't think it's possible as a default feature to choose which document to index, the "original" should be always the last to be indexed. /IndexWriter.updateDocument Updates a document by first del

Solr 3.3. Grouping vs DeDuplication and Deduplication Use Case

2011-08-29 Thread Pranav Prakash
Solr 3.3. has a feature "Grouping". Is it practically same as deduplication? Here is my use case for duplicates removal - We have many documents with similar (upto 99%) content. Upon some search queries, almost all of them come up on first page results. Of all these documents, essentia

Re: How to combine Deduplication and Elevation

2011-05-02 Thread Chris Hostetter
: Hi I have a question. How to combine the Deduplication and Elevation : implementations in Solr. Currently , I managed to implement either one only. can you elaborate a bit more on what exactly you've tried and what problem you are facing? the SignatureUpdateProcessorFactory (which is

How to combine Deduplication and Elevation

2011-04-15 Thread shamex
Hi I have a question. How to combine the Deduplication and Elevation implementations in Solr. Currently , I managed to implement either one only. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-combine-Deduplication-and-Elevation-tp2819621p2819621.html Sent from the

Re: Deduplication questions

2011-04-11 Thread Chris Hostetter
: Q1. Is it possible to pass *analyzed* content to the : : public abstract class Signature { No, analysis happens as the documents are being written to the lucene index, well after the UpdateProcessors have had a chance to interact with the values. : Q2. Method calculate() is using concatenat

Re: Question about http://wiki.apache.org/solr/Deduplication

2011-04-04 Thread eks dev
g in async fashion, and finally solr slaves are chasing master index with standard solr replication. Overnight we run simple map reduce jobs to consolidate, normalize and sort update stream and reindex at the end. Deduplication and collection sorting is for us only an optimization, if done reasona

Re: Question about http://wiki.apache.org/solr/Deduplication

2011-04-02 Thread Chris Hostetter
: Is it possible in solr to have multivalued "id"? Or I need to make my : own "mv_ID" for this? Any ideas how to achieve this efficiently? This isn't something the SignatureUpdateProcessor is going to be able to help you with -- it does the deduplication be changi

Deduplication questions

2011-03-25 Thread eks dev
Q1. Is it possible to pass *analyzed* content to the public abstract class Signature { public void init(SolrParams nl) { } public abstract String calculate(String content); } Q2. Method calculate() is using concatenated fields from name,features,cat Is there any mechanism I could build "fi

Question about http://wiki.apache.org/solr/Deduplication

2011-03-24 Thread eks dev
Hi, Use case I am trying to figure out is about preserving IDs without re-indexing on duplicate, rather adding this new ID under list of document id "aliases". Example: Input collection: "id":1, "text":"dummy text 1", "signature":"A" "id":2, "text":"dummy text 1", "signature":"A" I add the first

Re: SOLR deduplication

2011-01-26 Thread Markus Jelsma
Not right now: https://issues.apache.org/jira/browse/SOLR-1909 > Hi - I have the SOLR deduplication configured and working well. > > Is there any way I can tell which documents have been not added to the > index as a result of the deduplication rejecting subsequent identical

SOLR deduplication

2011-01-26 Thread Jason Brown
Hi - I have the SOLR deduplication configured and working well. Is there any way I can tell which documents have been not added to the index as a result of the deduplication rejecting subsequent identical documents? Many Thanks Jason Brown. If you wish to view the St. James's Place

Re: Is deduplication possible during Tika extract?

2011-01-17 Thread Markus Jelsma
> > > and > > > class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory"> > true > signature > false > text > name="signatureClass">org.apache.solr.update.processor.TextProfileSignature > > > > > > deduplication works when I use only "/update" but not when solr does an > extract with Tika! > Is deduplication possible during Tika extract? > > Thanks in advance, > Arno

Is deduplication possible during Tika extract?

2011-01-14 Thread arnaud gaudinat
"> true signature false text name="signatureClass">org.apache.solr.update.processor.TextProfileSignature deduplication works when I use only "/update" but not when solr does an extract with Tika! Is deduplication possible during Tika extract? Thanks in advance, Arno

RE: Re: Solr Deduplication and Field Collpasing

2010-09-28 Thread Markus Jelsma
 Correction, Java heap size should be RAM buffer size if i'm not too mistaken.   -Original message- From: Markus Jelsma Sent: Wed 29-09-2010 01:17 To: solr-user@lucene.apache.org; Subject: RE: Re: Solr Deduplication and Field Collpasing If you can set the digest field for your

RE: Re: Solr Deduplication and Field Collpasing

2010-09-28 Thread Markus Jelsma
00:57 To: solr-user@lucene.apache.org; Subject: Re: Solr Deduplication and Field Collpasing I have the digest field already in the schema because the index is shared between nutch docs and others.  I do not know if the second approach is the quickest in my case. I can set the digest value to something

Re: Solr Deduplication and Field Collpasing

2010-09-28 Thread Nemani, Raj
date the digest field with the value from the corresponding ID field using solr? Thanks Raj - Original Message - From: Markus Jelsma To: solr-user@lucene.apache.org Sent: Tue Sep 28 18:19:17 2010 Subject: RE: Solr Deduplication and Field Collpasing You could create a custom update p

RE: Solr Deduplication and Field Collpasing

2010-09-28 Thread Markus Jelsma
You could create a custom update processor that adds a digest field for newly added documents that do not have the digest field themselves. This way, the documents that are not added by Nutch get a proper non-empty digest field so the deduplication processor won't create the same empty has

Solr Deduplication and Field Collpasing

2010-09-28 Thread Nemani, Raj
the nutch document) http://mysite.mydomain.com/index.html and http://mysite/index.html (the difference is only in the alias and for an internal site both are valid) are different documents depending on how the link is setup. This is reason for me to try deduplication. I cannot submit SolrDedup c

Re: Deduplication

2010-05-19 Thread Ahmet Arslan
> TermsComponent maybe? > > or faceting? > q=*:*&facet=true&facet.field=signatureField&defType=lucene&rows=0&start=0 > > if you append &facet.mincount=1 to above url you can > see your duplications > After re-reading your message: sometimes you want to show duplicates, sometimes you don't wan
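
The faceting approach quoted in this snippet can be sketched as a small query builder. The parameter names (`q`, `rows`, `facet`, `facet.field`, `facet.mincount`) are standard Solr request parameters; the function name `duplicate_signature_query` and the choice of `facet.mincount=2` as the default (so that only signature values shared by at least two documents are listed) are assumptions for illustration, not something the thread prescribes.

```python
from urllib.parse import urlencode, parse_qs


def duplicate_signature_query(signature_field, mincount=2):
    """Build a query string that facets on the signature field and returns no rows."""
    params = {
        "q": "*:*",                     # match every document
        "rows": 0,                      # only the facet counts are of interest
        "facet": "true",
        "facet.field": signature_field,
        "facet.mincount": mincount,     # hide signature values below this count
    }
    return urlencode(params)


print(duplicate_signature_query("signatureField"))
```

Each facet count in the response is then the number of documents that share that signature value.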

Re: Deduplication

2010-05-19 Thread Ahmet Arslan
> Basically for some uses cases I would like to show > duplicates for other I > wanted them ignored. > > If I have overwriteDupes=false and I just create the dedup > hash how can I > query for only unique hash values... ie something like a > SQL group by. TermsComponent maybe? or faceting? q

Deduplication

2010-05-18 Thread Blargy
://lucene.472066.n3.nabble.com/Deduplication-tp828016p828016.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: [resolved] Config issue for deduplication

2010-05-13 Thread Markus Fischer
Markus Jelsma schrieb: What's your solrconfig? No deduplication is overwritesDedupes = false and signature field is other than doc ID field (unique) -Original message- From: Markus Fischer Sent: Thu 13-05-2010 17:01 To: solr-user@lucene.apache.org; Subject: Config issu

Re: Config issue for deduplication

2010-05-13 Thread Markus Fischer
's your solrconfig? No deduplication is overwritesDedupes = false and signature field is other than doc ID field (unique) -Original message- From: Markus Fischer Sent: Thu 13-05-2010 17:01 To: solr-user@lucene.apache.org; Subject: Config issue for deduplication I am trying to conf

RE: Config issue for deduplication

2010-05-13 Thread Markus Jelsma
What's your solrconfig? No deduplication is overwritesDedupes = false and signature field is other than doc ID field (unique)   -Original message- From: Markus Fischer Sent: Thu 13-05-2010 17:01 To: solr-user@lucene.apache.org; Subject: Config issue for deduplication I am tryi

Re: Config issue for deduplication

2010-05-13 Thread Markus Fischer
nd of having twice this line. Markus Ahmet Arslan schrieb: I am trying to configure automatic deduplication for SOLR 1.4 in Vufind. I followed: http://wiki.apache.org/solr/Deduplication Actually nothing happens. All records are being imported without any deduplication. Does "being impor

Re: Config issue for deduplication

2010-05-13 Thread Ahmet Arslan
> I am trying to configure automatic > deduplication for SOLR 1.4 in Vufind. I followed: > > http://wiki.apache.org/solr/Deduplication > > Actually nothing happens. All records are being imported > without any deduplication. Does "being imported" means you are

Config issue for deduplication

2010-05-13 Thread Markus Fischer
I am trying to configure automatic deduplication for SOLR 1.4 in Vufind. I followed: http://wiki.apache.org/solr/Deduplication Actually nothing happens. All records are being imported without any deduplication. What am I missing? Thanks Markus I did: - create a duplicated set of records

Re: Solr Cell and Deduplication - Get ID of doc

2010-03-02 Thread Bill Engle
Thanks for the responses. This is exactly what I had to resort to. I will definitely put in a feature request to get the generated ID back from the extract request. I am doing this with PHP cURL for extraction and pecl php solr for querying. I am then saving the unique id and dupe hash in a MyS

Re: Solr Cell and Deduplication - Get ID of doc

2010-03-01 Thread Chris Hostetter
: To quote from the wiki, ... That's all true ... but Bill explicitly said he wanted to use SignatureUpdateProcessorFactory to generate a uniqueKey from the content field post-extraction so he could dedup documents with the same content ... his question was how to get that key after ad

Re: Solr Cell and Deduplication - Get ID of doc

2010-03-01 Thread Lance Norskog
To quote from the wiki, http://wiki.apache.org/solr/ExtractingRequestHandler curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfi...@tutorial.html" This runs the extractor on your input file (in this case an HTML file). It then stores the generated document with t

Re: Solr Cell and Deduplication - Get ID of doc

2010-03-01 Thread Chris Hostetter
: You could create your own unique ID and pass it in with the : literal.field=value feature. By which Lance means you could specify a unique value in a different field from your uniqueKey field, and then query on that field:value pair to get the doc after it's been added -- but that query will

Re: Solr Cell and Deduplication - Get ID of doc

2010-02-26 Thread Lance Norskog
ce that a new file will have >> duplicate content but not necessarily the same file name.  To avoid this I >> am using the deduplication feature of Solr. >> >>   >>     > class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">

Re: Solr Cell and Deduplication - Get ID of doc

2010-02-26 Thread Bill Engle
re is a good chance that a new file will have > duplicate content but not necessarily the same file name. To avoid this I > am using the deduplication feature of Solr. > > > class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory"> > true >

Solr Cell and Deduplication - Get ID of doc

2010-02-24 Thread Bill Engle
Hi - New Solr user here. I am using Solr Cell to index files (PDF, doc, docx, txt, htm, etc.) and there is a good chance that a new file will have duplicate content but not necessarily the same file name. To avoid this I am using the deduplication feature of Solr. true id

Re: Deduplication in 1.4

2009-11-26 Thread Martijn v Groningen
Message > >> From: Martijn v Groningen >> To: solr-user@lucene.apache.org >> Sent: Thu, November 26, 2009 3:19:40 AM >> Subject: Re: Deduplication in 1.4 >> >> Field collapsing has been used by many in their production >> environment. > > Got any po

Re: Deduplication in 1.4

2009-11-26 Thread Otis Gospodnetic
Hi Martijn, - Original Message > From: Martijn v Groningen > To: solr-user@lucene.apache.org > Sent: Thu, November 26, 2009 3:19:40 AM > Subject: Re: Deduplication in 1.4 > > Field collapsing has been used by many in their production > environment. Got any poi

Re: Deduplication in 1.4

2009-11-26 Thread Martijn v Groningen
much. > Any idea on how close it is to being production-ready? > > Thanks, > -Chak > > Otis Gospodnetic wrote: >> >> Hi, >> >> As far as I know, the point of deduplication in Solr ( >> http://wiki.apache.org/solr/Deduplication ) is to detect a dupl

Re: Deduplication in 1.4

2009-11-25 Thread KaktuChakarabati
the point of deduplication in Solr ( > http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate > document before indexing it in order to avoid duplicates in the index in > the first place. > > What you are describing is closer to field collapsing patch in SOLR-236. >

Re: Deduplication in 1.4

2009-11-24 Thread Otis Gospodnetic
Hi, As far as I know, the point of deduplication in Solr ( http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate document before indexing it in order to avoid duplicates in the index in the first place. What you are describing is closer to field collapsing patch in SOLR-236

Deduplication in 1.4

2009-11-24 Thread KaktuChakarabati
Hey, I've been trying to find some documentation on using this feature in 1.4 but Wiki page is a little sparse.. In specific, here's what i'm trying to do: I have a field, say 'duplicate_group_id' that i'll populate based on some offline documents deduplication proc

Re: Conditional deduplication

2009-09-30 Thread Mauricio Scheffer
t To: fields in the corpus, I get > back 10 email documents? > > I'm aware of http://wiki.apache.org/solr/Deduplication but I want to > retain > the ability to search across all of my email documents most of the time, > and > only occasionally search for the distinct ones. >

Conditional deduplication

2009-09-30 Thread Michael
If I index a bunch of email documents, is there a way to say "show me all email documents, but only one per To: email address" so that if there are a total of 10 distinct To: fields in the corpus, I get back 10 email documents? I'm aware of http://wiki.apache.org/solr/Deduplicatio

Re: stress tests to DIH and deduplication patch

2009-04-30 Thread Marc Sturlese
doing some stress tests indexing with DIH. >> I am indexing a mysql DB with 140 rows aprox. I am using also the >> DeDuplication patch. >> I am using tomcat with JVM limit of -Xms2000M -Xmx2000M >> I have indexed 3 times using full-import command without restarting >> t

Re: stress tests to DIH and deduplication patch

2009-04-29 Thread Shalin Shekhar Mangar
On Wed, Apr 29, 2009 at 7:44 PM, Marc Sturlese wrote: > > Hey there, I am doing some stress tests indexing with DIH. > I am indexing a mysql DB with 140 rows aprox. I am using also the > DeDuplication patch. > I am using tomcat with JVM limit of -Xms2000M -Xmx2000M > I ha

stress tests to DIH and deduplication patch

2009-04-29 Thread Marc Sturlese
Hey there, I am doing some stress tests indexing with DIH. I am indexing a mysql DB with 140 rows aprox. I am using also the DeDuplication patch. I am using tomcat with JVM limit of -Xms2000M -Xmx2000M I have indexed 3 times using full-import command without restarting tomcat or reloading the

Re: Deduplication patch not working in nightly build

2009-01-10 Thread Grant Ingersoll
, and somehow avoiding a longer merge I think. Also, likely, deduplication is probably adding enough extra data to your index to hit a sweet spot where a merge is too long. Or something to that effect - MySql is especially sensitive to timeouts when doing a select * on a huge db in my testing. I

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese
"Last packet sent to the server was 202481 ms ago." >> >> Something took very very long to complete and the connection got closed >> by >> the time the next row was fetched from the opened resultset. >> >> Just curious, what was the previous val

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Mark Miller
Your basically writing segments more often now, and somehow avoiding a longer merge I think. Also, likely, deduplication is probably adding enough extra data to your index to hit a sweet spot where a merge is too long. Or something to that effect - MySql is especially sensitive to timeouts

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese
got closed by > the time the next row was fetched from the opened resultset. > > Just curious, what was the previous value of maxBufferedDocs and what did > you change it to? > > >> >> -- >> View this message in context: >> http://ww

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Shalin Shekhar Mangar
was fetched from the opened resultset. Just curious, what was the previous value of maxBufferedDocs and what did you change it to? > > -- > View this message in context: > http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21374908.html > S

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese
nks Marc Sturlese wrote: > > Hey there, > I was using the Deduplication patch with Solr 1.3 release and everything > was working perfectly. Now I upgraded to a nightly build (20th december) > to be able to use new facet algorithm and other stuff and DeDuplication is > not worki

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Mark Miller
lty from a week ago, mysql and this driver and url: driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/my_db" I can use deduplication patch with indexes of 200.000 docs and no problem. When I try a full-import with a db of 1.500.000 it stops indexing at doc number 15.

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese
Hey there, I have been stuck on this problem for 3 days and have no idea how to sort it out. I am using the nightly from a week ago, mysql and this driver and url: driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/my_db" I can use deduplication patch with indexes of 200.000

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Marc Sturlese
> > >> > I am going to try to set up the last nightly build... let's see if I >> have >> > better luck. >> > >> > The thing is it stop indexing at the doc num 150.000 aprox... and give >> me >> > that mysql exception error... Without DeD

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Shalin Shekhar Mangar
> > Doing this fix I get the same error :( > > > > I am going to try to set up the last nightly build... let's see if I have > > better luck. > > > > The thing is it stop indexing at the doc num 150.000 aprox... and give me > > that mysql exception error... Witho
