Are there any plans to implement that feature in upcoming releases?

Regards,
Eswar

On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> > We have a scenario where we want to find documents which are similar in
> > content. To elaborate a little more on what we mean here, let's take an
> > example.
> >
> > This email chain can best illustrate the concept of near dupes (we are
> > not confusing them with threads; they are two different things). Each
> > email in this thread is treated as a document by the system. A reply to
> > the original mail also includes the original mail, in which case it
> > becomes a near duplicate of the original mail (depending on the
> > percentage of similarity), and so on for each further reply. The near
> > dupes need not be limited to emails.
>
> I think this is what's known as "shingling." See
> http://en.wikipedia.org/wiki/W-shingling
> Lucene (and therefore Solr) does not implement shingling. The
> "MoreLikeThis" query might be close enough, however.
>
> -Stuart
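
For reference, the w-shingling idea mentioned above can be sketched outside
of Lucene/Solr. The following is only a minimal Python illustration, not
anything Lucene or Solr provides: the function names `shingles` and
`resemblance`, the shingle width w=4, and the sample strings are all
assumptions chosen for the example. It compares two texts by the Jaccard
similarity of their sets of w-word shingles.

```python
import re


def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word n-grams) in text.

    Hypothetical helper: lowercases the text and splits on non-word characters.
    """
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard similarity of the two shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty texts are considered identical
    return len(sa & sb) / len(sa | sb)


# Made-up sample documents: a reply that quotes the original in full,
# and an unrelated text.
original = "We want to find documents which are similar in content."
reply = "I think this is shingling. " + original
unrelated = "Lucene is a search library written in Java."

print(resemblance(original, reply))
print(resemblance(original, unrelated))
```

A reply that quotes the original shares most of its shingles and scores
well above an unrelated text; where to put the "near dupe" threshold is a
tuning decision that depends on the data.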