On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> We have a scenario where we want to find documents which are similar in
> content. To elaborate a little more on what we mean here, let's take an
> example.
>
> The email chain in which we are interacting can best be used to
> illustrate the concept of near dupes (we are not confusing them with
> threads; they are two different things). Each email in this thread is
> treated as a document by the system. A reply to the original mail also
> includes the original mail, in which case it becomes a near duplicate of
> the original mail (depending on the percentage of similarity), and so
> on down the thread. Near dupes need not be limited to emails.

I think this is what's known as "shingling." See
http://en.wikipedia.org/wiki/W-shingling

Lucene (and therefore Solr) does not implement shingling. The
"MoreLikeThis" query might be close enough, however.

-Stuart
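
For illustration only, here is a minimal sketch of w-shingling outside of Lucene/Solr: each document becomes the set of its overlapping w-word sequences, and two documents are compared by the Jaccard resemblance of those sets. The function names (`shingles`, `resemblance`), the whitespace tokenizer, and the default w=4 are assumptions made for this example, not anything Lucene or Solr provides.

```python
def shingles(text, w=4):
    """Return the set of w-word shingles (as tuples) for a text.

    Tokenization is a plain lowercase whitespace split, which is an
    assumption of this sketch; a real system would use a proper analyzer.
    """
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard resemblance of the shingle sets of two texts, in [0, 1]."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are considered identical
    return len(sa & sb) / len(sa | sb)


# Hypothetical example: a reply that quotes the original mail in full.
original = "we want to find documents which are similar in content"
reply = "I think this is shingling " + original

score = resemblance(original, reply)
```

A reply that quotes the original in full shares every shingle of the original, so its score stays well above that of an unrelated message; a threshold on the score (chosen per application) would then decide what counts as a near dupe.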