On 18-Nov-07, at 8:17 AM, Eswar K wrote:
Is there any plan to implement that feature in the upcoming releases?
Not currently. Feel free to contribute something if you find a good
solution <g>.
-Mike
On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
We have a scenario where we want to find documents that are similar in
content. To elaborate a little on what we mean, let's take an example.
This email chain illustrates the concept of near dupes well (we are not
confusing them with threads; they are two different things). Each email
in this thread is treated as a document by the system. A reply to the
original mail also includes the original mail, in which case it becomes
a near duplicate of the original mail (depending on the percentage of
similarity), and so on down the thread. Near dupes need not be limited
to emails.
I think this is what's known as "shingling." See
http://en.wikipedia.org/wiki/W-shingling
Lucene (and therefore Solr) does not implement shingling. The
"MoreLikeThis" query might be close enough, however.
-Stuart
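
For anyone who wants to try the shingling idea outside of Lucene/Solr, here is a rough sketch of w-shingling as described on the Wikipedia page linked above: each document becomes the set of its contiguous w-word sequences, and the similarity of two documents is the Jaccard ratio of those sets. The function names `shingles` and `resemblance`, the choice of w=4, and the sample strings are assumptions made for illustration; none of this is a Lucene or Solr API.

```python
import re


def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word runs) in text."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < w:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample documents: a mail, a reply quoting it, and an
# unrelated text.
original = "We want to find documents which are similar in content."
reply = "I think this is shingling. " + original
unrelated = "Lucene is a search library written in Java."

print(resemblance(original, reply))      # high, but below 1.0
print(resemblance(original, unrelated))  # no shared shingles
```

A reply that quotes the original mail shares most of its shingles with it, so it scores well above an unrelated document; a threshold on this score would play the role of the "percentage of similarity" mentioned earlier. Comparing every pair of documents this way gets expensive for a large index, so this is only meant to show the idea.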