Re: Near Duplicate Documents

Eswar K Sun, 18 Nov 2007 08:17:58 -0800

Is there any plan to implement that feature in the upcoming releases?

Regards,
Eswar
On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:

> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> > We have a scenario where we want to find documents which are similar in
> > content. To elaborate a little more on what we mean here, let's take an
> > example.
> >
> > The email chain in which we are interacting can best be used to
> > illustrate the concept of near dupes (we are not confusing them with
> > threads; they are two different things). Each email in this thread is
> > treated as a document by the system. A reply to the original mail also
> > includes the original mail, in which case it becomes a near duplicate of
> > the original mail (depending on the percentage of similarity). Similarly
> > it goes on. The near dupes need not be limited to emails.
>
> I think this is what's known as "shingling."  See
> http://en.wikipedia.org/wiki/W-shingling
> Lucene (and therefore Solr) does not implement shingling.  The
> "MoreLikeThis" query might be close enough, however.
>
> -Stuart
>
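
For readers of the archive, here is a rough sketch of the w-shingling idea
mentioned above, written in plain Python rather than against Lucene or Solr.
The function names (shingles, resemblance), the whitespace tokenization, and
the shingle width w=4 are assumptions made for illustration; they are not part
of any Lucene or Solr API. Each document is reduced to its set of w-word
sequences, and similarity is measured as the Jaccard ratio of the two sets.

```python
def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word tuples) in text."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # A document shorter than w words yields a single short shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard similarity of the shingle sets of two documents (0.0 to 1.0)."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        # Two empty documents are treated as identical.
        return 1.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    original = "we want to find out documents which are similar in content"
    # A reply that quotes the original in full and adds a few words.
    reply = original + " I think this is known as shingling"
    print(resemblance(original, reply))
```

A reply that quotes the whole original shares every shingle of the original,
so its score stays high and drops only as more new text is added. A threshold
on that score (the "percentage of similarity" in the question) would then
decide what counts as a near dupe.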
