On 18-Nov-07, at 8:17 AM, Eswar K wrote:

Is there any plan to implement that feature in the upcoming releases?

Not currently. Feel free to contribute something if you find a good solution <g>.

-Mike


On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:

On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
We have a scenario where we want to find documents that are similar in
content. To elaborate a little on what we mean, let's take an example.

The email chain in which we are interacting is itself a good
illustration of the concept of near dupes (not to be confused with
threads; they are two different things). Each email in this thread is
treated as a document by the system. A reply to the original mail also
includes the original mail, in which case it becomes a near duplicate
of the original mail (depending on the percentage of similarity), and
so on for each further reply. Near dupes need not be limited to emails.

I think this is what's known as "shingling."  See
http://en.wikipedia.org/wiki/W-shingling
Lucene (and therefore Solr) does not implement shingling.  The
"MoreLikeThis" query might be close enough, however.

-Stuart
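
For readers unfamiliar with the term, here is a minimal sketch of
w-shingling in plain Python. It is not Lucene or Solr code and is not
taken from this thread: the function names, the word-level tokenizer,
and the default w=4 are all assumptions made for illustration. Each
document becomes the set of its contiguous w-word sequences, and two
documents are compared by the Jaccard similarity of those sets.

```python
import re

def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word sequences)."""
    # Assumed tokenizer: lowercase, split on non-word characters.
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Text shorter than w words: treat it as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are considered identical
    return len(sa & sb) / len(sa | sb)

original = "We want to find documents which are similar in content"
reply = "I think this is shingling. " + original

# A reply that quotes the original shares most of its shingles with it.
score = resemblance(original, reply)
```

A reply that quotes the original scores well above an unrelated
message, and a threshold on that score would stand in for the
"percentage of similarity" mentioned above. Comparing every pair of
documents this way does not scale to a large index; this only shows
the idea.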

