To whomever started this thread: look at Nutch. I believe something related to this already exists in Nutch for near-duplicate detection.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Mike Klaas <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Sunday, November 18, 2007 11:08:38 PM
Subject: Re: Near Duplicate Documents

On 18-Nov-07, at 8:17 AM, Eswar K wrote:

> Is there any plan to implement that feature in the upcoming releases?

Not currently. Feel free to contribute something if you find a good
solution <g>.

-Mike

> On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
>
>> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
>>> We have a scenario where we want to find documents which are
>>> similar in content. To elaborate a little more on what we mean
>>> here, let's take an example.
>>>
>>> This email chain in which we are interacting can best be used to
>>> illustrate the concept of near dupes (we are not confusing this
>>> with threads; they are two different things). Each email in this
>>> thread is treated as a document by the system. A reply to the
>>> original mail also includes the original mail, in which case it
>>> becomes a near duplicate of the original mail (depending on the
>>> percentage of similarity). Similarly it goes on. The near dupes
>>> need not be limited to emails.
>>
>> I think this is what's known as "shingling." See
>> http://en.wikipedia.org/wiki/W-shingling
>> Lucene (and therefore Solr) does not implement shingling. The
>> "MoreLikeThis" query might be close enough, however.
>>
>> -Stuart
>>
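
For anyone following up on the shingling suggestion quoted above, here is a minimal sketch of the idea in plain Python. It is not what Nutch, Lucene, or Solr actually do; the function names (shingles, resemblance), the window size w=4, and the sample strings are all assumptions made up for illustration. Each document is broken into overlapping w-word windows, and two documents are scored by the Jaccard overlap of their window sets.

```python
import re


def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard similarity of the two documents' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are trivially identical
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample documents: a mail, a reply quoting it, and an unrelated text.
original = "We want to find documents which are similar in content."
reply = "Not currently. > We want to find documents which are similar in content."
unrelated = "Lucene is a full-text search library written in Java."

print(resemblance(original, reply))      # high: the reply contains the original
print(resemblance(original, unrelated))  # low: no shared 4-word windows
```

A reply that quotes the original in full scores high because every window of the original also appears in the reply, which matches the "percentage of similarity" notion in the question. Comparing full shingle sets pairwise gets expensive on a large index, so a real deployment would probably need some form of fingerprinting or sketching on top of this.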