Re: Near Duplicate Documents

Otis Gospodnetic Tue, 20 Nov 2007 23:12:00 -0800

To whoever started this thread: look at Nutch.  I believe it already has 
something for near-duplicate detection.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Mike Klaas <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Sunday, November 18, 2007 11:08:38 PM
Subject: Re: Near Duplicate Documents

On 18-Nov-07, at 8:17 AM, Eswar K wrote:

> Is there any plan to implement that feature in the upcoming releases?

Not currently.  Feel free to contribute something if you find a good  
solution <g>.

-Mike


> On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
>
>> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
>>> We have a scenario where we want to find documents which are similar
>>> in content. To elaborate a little on what we mean here, let's take an
>>> example.
>>>
>>> This email chain is a good illustration of the concept of near dupes
>>> (we are not confusing them with threads; they are two different
>>> things). Each email in this thread is treated as a document by the
>>> system. A reply to the original mail also includes the original mail,
>>> in which case it becomes a near duplicate of the original mail
>>> (depending on the percentage of similarity), and so on down the
>>> thread. The near dupes need not be limited to emails.
>>
>> I think this is what's known as "shingling."  See
>> http://en.wikipedia.org/wiki/W-shingling
>> Lucene (and therefore Solr) does not implement shingling.  The
>> "MoreLikeThis" query might be close enough, however.
>>
>> -Stuart
>>
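To make the shingling idea above concrete, here is a minimal sketch in
Python. It is not taken from Lucene, Solr, or Nutch. The function names
(shingles, resemblance), the shingle width w=4, and the regex tokenizer
are all assumptions chosen for illustration. It splits each document into
overlapping w-word shingles and scores a pair of documents by the Jaccard
overlap of their shingle sets, which is the "percentage of similarity"
idea from the original question.

```python
import re


def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word n-grams) in text."""
    # Assumption: lowercase and split on word characters; a real system
    # would likely use its own analyzer/tokenizer here.
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard similarity of the two documents' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are trivially identical
    return len(sa & sb) / len(sa | sb)


# A reply that quotes the original in full scores high but below 1.0.
original = "We want to find documents which are similar in content."
reply = "Not currently. " + original
print(resemblance(original, reply))
```

Comparing every pair of documents this way is quadratic in the number of
documents, so at index scale the shingle sets are usually reduced to
small fingerprints (for example by min-hashing) before comparison. That
step is left out of this sketch.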
