The duplicate detection mechanism in Nutch is quite primitive. I
think it uses an MD5 signature generated from the content of a field.
The generation algorithm is described here:
http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html.

The problem with this approach is that an MD5 hash is very sensitive: a
one-letter difference produces a completely different hash.
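
For example (just an illustrative snippet in plain Java, not the actual
Nutch code), hashing two strings that differ by a single character gives
completely unrelated digests:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustration only -- not Nutch code. Shows that a one-character change
// completely changes the MD5 digest, so an exact content hash only catches
// byte-identical duplicates.
public class Md5Sensitivity {
    static String md5Hex(String text) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(text.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(md5Hex("Near duplicate documents"));
        System.out.println(md5Hex("Near duplicate document")); // one letter removed
    }
}
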
You will probably have to roll your own near-duplicate detection
algorithm. My advice is to have a look at the existing literature on
near-duplicate detection techniques and then implement one of them. I
know Google has some papers that describe a technique called minhash; I
read the paper and found it very interesting. I'm not sure whether you
can use that exact algorithm, because they have patented it. That said,
there is plenty of literature on near-duplicate detection, so you should
be able to find a technique you can use for free!
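
To give you an idea of what minhash looks like in practice, here is a
rough sketch in Java (my own illustrative code, not from Nutch, Solr or
the Google paper -- the class name, the choice of hash functions and the
shingle size are all just assumptions for the example). It builds a
signature of per-hash minima over a document's word shingles; the
fraction of matching positions between two signatures estimates the
Jaccard similarity of the shingle sets, which you can compare against a
threshold to flag near dupes:

import java.util.*;

// Minimal, illustrative MinHash sketch for near-duplicate detection.
public class MinHashSketch {
    private static final int PRIME = 2147483647; // 2^31 - 1
    private final int numHashes;
    private final int[] seedsA;
    private final int[] seedsB;

    public MinHashSketch(int numHashes, long randomSeed) {
        this.numHashes = numHashes;
        Random rnd = new Random(randomSeed);
        seedsA = new int[numHashes];
        seedsB = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            seedsA[i] = 1 + rnd.nextInt(PRIME - 1); // random linear hash a*x + b mod p
            seedsB[i] = rnd.nextInt(PRIME);
        }
    }

    // Signature = for each hash function, the minimum hashed value over all shingles.
    public int[] signature(Set<String> shingles) {
        int[] sig = new int[numHashes];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String s : shingles) {
            int h = s.hashCode() & 0x7fffffff;
            for (int i = 0; i < numHashes; i++) {
                int hashed = (int) (((long) seedsA[i] * h + seedsB[i]) % PRIME);
                if (hashed < sig[i]) sig[i] = hashed;
            }
        }
        return sig;
    }

    // Fraction of matching positions estimates Jaccard similarity of the shingle sets.
    public static double estimatedSimilarity(int[] sig1, int[] sig2) {
        int matches = 0;
        for (int i = 0; i < sig1.length; i++) {
            if (sig1[i] == sig2[i]) matches++;
        }
        return (double) matches / sig1.length;
    }

    // Break a document into contiguous w-word shingles.
    public static Set<String> shingles(String text, int w) {
        String[] words = text.toLowerCase().split("\\s+");
        Set<String> out = new HashSet<>();
        for (int i = 0; i + w <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + w)));
        }
        return out;
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(128, 42L);
        int[] a = mh.signature(shingles("the quick brown fox jumps over the lazy dog", 3));
        int[] b = mh.signature(shingles("the quick brown fox jumped over the lazy dog", 3));
        System.out.println("Estimated Jaccard similarity: " + estimatedSimilarity(a, b));
    }
}

In practice you would add some form of locality-sensitive hashing
(banding the signature) on top of this so you don't have to compare
every pair of documents, but the sketch should be enough to get the
idea. The w-shingling that Stuart mentions below is essentially the
shingles() step above.
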

On Nov 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote:
> Otis,
>
> Thanks for your response.
>
> I just took a quick look at the Nutch forum and found that there is an
> implementation for de-duplicating documents/pages, but none for near-duplicate
> documents. Can you guide me a little further as to where exactly under
> Nutch I should be concentrating, regarding near-duplicate documents?
>
> Regards,
> Rishabh
>
> On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
> wrote:
>
>
> > To whoever started this thread: look at Nutch.  I believe something
> > related to this already exists in Nutch for near-duplicate detection.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > ----- Original Message ----
> > From: Mike Klaas <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Sent: Sunday, November 18, 2007 11:08:38 PM
> > Subject: Re: Near Duplicate Documents
> >
> > On 18-Nov-07, at 8:17 AM, Eswar K wrote:
> >
> > > Is there any plan to implement that feature in the upcoming
> > > releases?
> >
> > Not currently.  Feel free to contribute something if you find a good
> > solution <g>.
> >
> > -Mike
> >
> >
> > > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> > >
> > >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> > >>> We have a scenario where we want to find documents which are
> > >>> similar in content. To elaborate a little more on what we mean
> > >>> here, let's take an example.
> > >>>
> > >>> This email chain in which we are interacting can best be used to
> > >>> illustrate the concept of near dupes (we are not confusing these
> > >>> with threads; they are two different things). Each email in this
> > >>> thread is treated as a document by the system. A reply to the
> > >>> original mail also includes the original mail, in which case it
> > >>> becomes a near duplicate of the original mail (depending on the
> > >>> percentage of similarity). Similarly it goes on. The near dupes
> > >>> need not be limited to emails.
> > >>
> > >> I think this is what's known as "shingling."  See
> > >> http://en.wikipedia.org/wiki/W-shingling
> > >> Lucene (and therefore Solr) does not implement shingling.  The
> > >> "MoreLikeThis" query might be close enough, however.
> > >>
> > >> -Stuart
> > >>
> >
> >
> >
> >
> >
>



-- 
Regards,

Cuong Hoang
