Re: Near Duplicate Documents

Stuart Sierra Sun, 18 Nov 2007 08:06:05 -0800

On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> We have a scenario where we want to find documents that are similar in
> content. To elaborate a little on what we mean, let's take an
> example.
>
> The email chain in which we are interacting best illustrates the concept
> of near dupes (we are not confusing them with threads; they are two
> different things). Each email in this thread is
> treated as a document by the system. A reply to the original mail also
> includes the original mail, in which case it becomes a near duplicate of the
> original mail (depending on the percentage of similarity). Subsequent
> replies continue the pattern. Near dupes need not be limited to emails.


I think this is what's known as "shingling."  See
http://en.wikipedia.org/wiki/W-shingling
Lucene (and therefore Solr) does not implement shingling.  The
"MoreLikeThis" query might be close enough, however.
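
To make the idea concrete, here is a minimal Python sketch of w-shingling as the Wikipedia article describes it: each document becomes the set of its contiguous w-word sequences, and two documents are compared by the Jaccard ratio of those sets. The function names, the default w=4, and the whitespace tokenization are illustrative assumptions, not anything provided by Lucene or Solr.

```python
def shingles(text, w=4):
    """Return the set of contiguous w-word sequences (shingles) in text.

    Tokenization is a plain lowercase whitespace split, which is an
    assumption made for this sketch; a real system would normalize more.
    """
    tokens = text.lower().split()
    if len(tokens) < w:
        # Text shorter than w words: treat the whole text as one shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


# A reply that quotes the original in full shares all of its shingles.
original = "we want to find documents that are similar in content"
reply = original + " i think this is known as shingling"
score = resemblance(original, reply)
```

A reply that quotes the original scores well above an unrelated message, so a threshold on the score (whatever "percentage of similarity" is wanted) would decide what counts as a near dupe. Comparing every pair of documents this way does not scale to a large index; this only illustrates the measure.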

-Stuart
