Re: Near Duplicate Documents

Mike Klaas Sun, 18 Nov 2007 20:09:17 -0800

On 18-Nov-07, at 8:17 AM, Eswar K wrote:

Is there any plan to implement that feature in the upcoming releases?

Not currently. Feel free to contribute something if you find a good solution <g>.


-Mike

On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
We have a scenario where we want to find documents which are similar in
content. To elaborate a little on what we mean, let's take an example.
This email chain in which we are interacting is a good illustration of
near dupes (we are not confusing them with threads; they are two
different things). Each email in this thread is treated as a document by
the system. A reply to the original mail also includes the original mail,
in which case it becomes a near duplicate of the original mail (depending
on the percentage of similarity), and so on down the thread. Near dupes
need not be limited to emails.

I think this is what's known as "shingling."  See
http://en.wikipedia.org/wiki/W-shingling
Lucene (and therefore Solr) does not implement shingling.  The
"MoreLikeThis" query might be close enough, however.

-Stuart
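
For readers unfamiliar with the term, here is a minimal sketch of the
w-shingling idea mentioned above. It is not Lucene or Solr code, and the
function names (shingles, resemblance, containment), the word tokenizer,
and the default window size w=4 are all assumptions made for illustration.
Each document is reduced to its set of w-word windows; resemblance is the
Jaccard overlap of two such sets, and containment measures how much of one
document reappears inside another, which is the case of a reply quoting
the original mail.

```python
import re


def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word windows) in text."""
    # Hypothetical tokenizer: lowercase, split on non-word characters.
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < w:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard similarity of the shingle sets of a and b (0.0 to 1.0)."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


def containment(a, b, w=4):
    """Fraction of a's shingles that also appear in b (0.0 to 1.0)."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 0.0


if __name__ == "__main__":
    original = "We want to find documents which are similar in content"
    reply = "Sounds like shingling. " + original
    print("resemblance:", resemblance(original, reply))
    print("containment:", containment(original, reply))
```

A near-duplicate check would then compare one of these scores against a
threshold chosen for the collection. For large collections the shingle
sets are usually hashed and sampled rather than compared pairwise in full;
that step is left out of this sketch.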
