If you set overwriteDupes = false, exact or near-duplicate documents will
not be deleted. The signature field is still set, however, so you can query
for duplicates later from an external program and do whatever you want with
them.
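
As a rough illustration of such an external program, here is a minimal
Python sketch. It assumes the signature is written to a field called
"signature" and that documents have an "id" field; both names depend on
your own schema and deduplication configuration. The first helper only
builds a facet query URL that lists signature values shared by two or more
documents, using the standard facet parameters. The second groups documents
you have already fetched by their signature. Fetching the documents
themselves is left out.

```python
from collections import defaultdict
from urllib.parse import urlencode


def duplicate_facet_query(base_url, signature_field="signature"):
    """Build a select URL listing signature values that occur in 2+ documents.

    The field name is an assumption; use whatever your signatureField is.
    """
    params = {
        "q": "*:*",
        "rows": 0,                      # only the facet counts are needed
        "facet": "true",
        "facet.field": signature_field,
        "facet.mincount": 2,            # skip signatures seen only once
        "facet.limit": -1,              # no cap on the number of facet values
        "wt": "json",
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def group_duplicates(docs, signature_field="signature", id_field="id"):
    """Map each signature shared by several documents to their ids."""
    groups = defaultdict(list)
    for doc in docs:
        sig = doc.get(signature_field)
        if sig is not None:             # ignore documents without a signature
            groups[sig].append(doc[id_field])
    return {sig: ids for sig, ids in groups.items() if len(ids) > 1}


if __name__ == "__main__":
    # Hypothetical documents, as they might come back from a query.
    sample = [
        {"id": "a", "signature": "s1"},
        {"id": "b", "signature": "s2"},
        {"id": "c", "signature": "s1"},
    ]
    print(group_duplicates(sample))     # {'s1': ['a', 'c']}
```

What you do with each group (delete all but one, log them, skip storing the
original document elsewhere) is then up to your own program.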


On Tuesday 11 May 2010 15:41:33 Matthieu Labour wrote:
> Hi Markus
> 
> Thank you for your answer
> 
> Here is a use case where I think it would be nice to know there is a dup
>  before I insert it.
> 
> Let's say I create a summary out of the document and I only index the
>  summary and store the document itself on a separate device (S3, Cassandra
>  etc ...). Then I would need addDocument on the summary to fail because
>  it detected a duplicate, so that I don't need to store the document.
> When you write:
> "On the other hand, you can also have a manual process that finds
> duplicates based on that signature and gather that information yourself
> as long as such a feature isn't there."
> 
> Can you explain more what you have in mind ?
> 
> Thank you for your help!
> 
> matt
> 
> --- On Mon, 5/10/10, Markus Jelsma <markus.jel...@buyways.nl> wrote:
> 
> From: Markus Jelsma <markus.jel...@buyways.nl>
> Subject: RE: How to query for similar documents before indexing
> To: solr-user@lucene.apache.org
> Date: Monday, May 10, 2010, 5:07 PM
> 
> Hi Matthieu,
> 
>  
> 
>  
> 
> On the top of the wiki page you can see it's in 1.4 already. As far as I
>  know the API doesn't return information on found duplicates in its
>  response header; the wiki isn't clear on that subject. I, at least, never
>  saw any other response than an error or the usual status code and QTime.
> 
>  
> 
> Perhaps it would be a nice feature. On the other hand, you can also have a
>  manual process that finds duplicates based on that signature and gather
>  that information yourself as long as such a feature isn't there.
> 
>  
> 
>  
> 
> Cheers,
> 
> 
>  
> -----Original message-----
> From: Matthieu Labour <matthieu_lab...@yahoo.com>
> Sent: Mon 10-05-2010 23:30
> To: solr-user@lucene.apache.org;
> Subject: RE: How to query for similar documents before indexing
> 
> Markus
> Thank you for your response
> It would be great if the index had the option to prevent duplicates from
>  entering the index. But is it going to be a silent action ? Or will the
>  add method return that it failed indexing because it detected a duplicate
>  ? Is it committed to 1.4 already ?
> Cheers
> matt
> 
> 
> --- On Mon, 5/10/10, Markus Jelsma <markus.jel...@buyways.nl> wrote:
> 
> From: Markus Jelsma <markus.jel...@buyways.nl>
> Subject: RE: How to query for similar documents before indexing
> To: solr-user@lucene.apache.org
> Date: Monday, May 10, 2010, 4:11 PM
> 
> Hi,
> 
>  
> 
>  
> 
> Deduplication [1] is what you're looking for. It can utilize different
>  analyzers that will add one or more signatures or hashes to your
>  document depending on exact or partial matches for configurable fields.
>  Based on that, it should be able to prevent new documents from entering
>  the index.
> 
>  
> 
> The first part works very well but I have some issues with removing those
>  documents, which I also need to check with the community tomorrow back
>  at work ;-)
> 
>  
> 
>  
> 
> [1]: http://wiki.apache.org/solr/Deduplication
> 
> 
> 
>  
> 
> Cheers,
> 
> 
>  
> -----Original message-----
> From: Matthieu Labour <matthieu_lab...@yahoo.com>
> Sent: Mon 10-05-2010 22:41
> To: solr-user@lucene.apache.org;
> Subject: How to query for similar documents before indexing
> 
> Hi
> 
> I want to implement the following logic:
> 
> Before I index a new document into the index, I want to check if there are
>  already documents in the index with similar content to the content of the
>  document about to be inserted. If the request returns 1 or more documents,
>  then I don't want to insert the document.
> 
> What is the best way to achieve the above functionality ?
> 
> I read about Fuzzy searches in logic. But can I really build a request such
>  as mydoc.title:wordexample~ AND mydoc.content:( all the content words)~0.9
>  ?
> 
> Thank you for your help
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
