Thanks for the info, Cuong! For reference, I have appended a few rough sketches of the signature, minhash, and shingling approaches discussed below at the end of this mail.

Regards,
Rishabh
On Nov 21, 2007 1:59 PM, climbingrose <[EMAIL PROTECTED]> wrote:
> The duplicate detection mechanism in Nutch is quite primitive. I think
> it uses an MD5 signature generated from the content of a field. The
> generation algorithm is described here:
>
> http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html
>
> The problem with this approach is that an MD5 hash is very sensitive: a
> one-letter difference will generate a completely different hash. You
> will probably have to roll your own near-duplicate detection algorithm.
> My advice is to have a look at the existing literature on near-duplicate
> detection techniques and then implement one of them. I know Google has
> some papers that describe a technique called minhash. I read the paper
> and found it very interesting. I'm not sure whether you can implement
> the algorithm, because they have patented it. That said, there is plenty
> of literature on near-dup detection, so you should be able to get one
> for free!
>
> On Nov 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote:
> > Otis,
> >
> > Thanks for your response.
> >
> > I just had a quick look at the Nutch forum and found that there is an
> > implementation for detecting duplicate documents/pages, but none for
> > near-duplicate documents. Can you guide me a little further as to
> > where exactly under Nutch I should be concentrating, regarding
> > near-duplicate documents?
> >
> > Regards,
> > Rishabh
> >
> > On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> > >
> > > To whomever started this thread: look at Nutch. I believe something
> > > related to this already exists in Nutch for near-duplicate
> > > detection.
> > >
> > > Otis
> > > --
> > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > >
> > > ----- Original Message ----
> > > From: Mike Klaas <[EMAIL PROTECTED]>
> > > To: solr-user@lucene.apache.org
> > > Sent: Sunday, November 18, 2007 11:08:38 PM
> > > Subject: Re: Near Duplicate Documents
> > >
> > > On 18-Nov-07, at 8:17 AM, Eswar K wrote:
> > >
> > > > Is there any plan to implement that feature in the upcoming
> > > > releases?
> > >
> > > Not currently. Feel free to contribute something if you find a good
> > > solution <g>.
> > >
> > > -Mike
> > >
> > > > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> > > >
> > > >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> > > >>> We have a scenario where we want to find documents which are
> > > >>> similar in content. To elaborate a little more on what we mean
> > > >>> here, let's take an example.
> > > >>>
> > > >>> This email chain we are interacting in is itself a good
> > > >>> illustration of the concept of near dupes (not to be confused
> > > >>> with threads; they are two different things). Each email in this
> > > >>> thread is treated as a document by the system. A reply to the
> > > >>> original mail also includes the original mail, in which case it
> > > >>> becomes a near duplicate of the original mail (depending on the
> > > >>> percentage of similarity). And so it goes on. Near dupes need
> > > >>> not be limited to emails.
> > > >>
> > > >> I think this is what's known as "shingling." See
> > > >> http://en.wikipedia.org/wiki/W-shingling
> > > >> Lucene (and therefore Solr) does not implement shingling. The
> > > >> "MoreLikeThis" query might be close enough, however.
> > > >>
> > > >> -Stuart
>
> --
> Regards,
> Cuong Hoang