Thanks for the info, Cuong!

Regards,
Rishabh

On Nov 21, 2007 1:59 PM, climbingrose <[EMAIL PROTECTED]> wrote:

> The duplicate detection mechanism in Nutch is quite primitive. I
> think it uses an MD5 signature generated from the content of a field.
> The generation algorithm is described here:
>
> http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html
>
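> As a rough illustration of the idea (this is not TextProfileSignature
> itself, which, as the linked Javadoc describes, builds a profile of the
> more frequent terms before hashing), a content-based MD5 signature in
> plain Java looks something like this:
>
>     import java.math.BigInteger;
>     import java.security.MessageDigest;
>
>     public class ContentSignature {
>         // Hash the raw field content. TextProfileSignature hashes a
>         // normalized "profile" of the text instead, but the principle
>         // is the same: identical input, identical signature.
>         static String md5Hex(String content) throws Exception {
>             MessageDigest md = MessageDigest.getInstance("MD5");
>             byte[] digest = md.digest(content.getBytes("UTF-8"));
>             return String.format("%032x", new BigInteger(1, digest));
>         }
>
>         public static void main(String[] args) throws Exception {
>             System.out.println(md5Hex("Solr is an enterprise search server"));
>         }
>     }
>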
> The problem with this approach is that the MD5 hash is very
> sensitive: a one-letter difference produces a completely different
> hash. You will probably have to roll your own near-duplicate
> detection algorithm. My advice is to have a look at the existing
> literature on near-duplicate detection techniques and then implement
> one of them. I know Google has some papers that describe a technique
> called minhash. I read the paper and found it very interesting. I'm
> not sure whether you can implement that exact algorithm, because they
> have patented it. That said, there is plenty of literature on
> near-dup detection, so you should be able to get one for free!
>
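> For anyone who wants to experiment, here is a minimal sketch of the
> minhash idea (hash each word shingle under k random hash functions and
> keep the minimum for each; the fraction of matching minimums estimates
> the Jaccard similarity of the two shingle sets). It is only an
> illustration of the technique, not code from Nutch or Solr:
>
>     import java.util.*;
>
>     public class MinHashSketch {
>         private static final int PRIME = 2147483647; // 2^31 - 1
>         private final int numHashes;
>         private final int[] a, b;                    // per-function hash parameters
>
>         public MinHashSketch(int numHashes, long seed) {
>             this.numHashes = numHashes;
>             Random rnd = new Random(seed);
>             a = new int[numHashes];
>             b = new int[numHashes];
>             for (int i = 0; i < numHashes; i++) {
>                 a[i] = 1 + rnd.nextInt(PRIME - 1);
>                 b[i] = rnd.nextInt(PRIME);
>             }
>         }
>
>         // Overlapping runs of w consecutive words ("shingles").
>         static Set<String> shingles(String text, int w) {
>             String[] words = text.toLowerCase().split("\\W+");
>             Set<String> set = new HashSet<String>();
>             for (int i = 0; i + w <= words.length; i++) {
>                 set.add(String.join(" ", Arrays.copyOfRange(words, i, i + w)));
>             }
>             return set;
>         }
>
>         // One value per hash function: the minimum hashed shingle.
>         public int[] signature(Set<String> shingleSet) {
>             int[] sig = new int[numHashes];
>             Arrays.fill(sig, Integer.MAX_VALUE);
>             for (String s : shingleSet) {
>                 int h = s.hashCode() & 0x7fffffff;
>                 for (int i = 0; i < numHashes; i++) {
>                     int v = (int) (((long) a[i] * h + b[i]) % PRIME);
>                     if (v < sig[i]) sig[i] = v;
>                 }
>             }
>             return sig;
>         }
>
>         // Fraction of agreeing positions approximates Jaccard similarity.
>         public static double similarity(int[] s1, int[] s2) {
>             int same = 0;
>             for (int i = 0; i < s1.length; i++) if (s1[i] == s2[i]) same++;
>             return (double) same / s1.length;
>         }
>
>         public static void main(String[] args) {
>             MinHashSketch mh = new MinHashSketch(128, 42L);
>             int[] s1 = mh.signature(shingles("the quick brown fox jumps over the lazy dog today", 3));
>             int[] s2 = mh.signature(shingles("the quick brown fox jumped over the lazy dog today", 3));
>             System.out.println("estimated similarity: " + similarity(s1, s2));
>         }
>     }
>
> The fixed-size signatures can also be compared far more cheaply than
> the full texts, which is what makes this kind of approach attractive at
> crawl scale.
>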
> On Nov 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote:
> > Otis,
> >
> > Thanks for your response.
> >
> > I just took a quick look at the Nutch forum and found that there is an
> > implementation for de-duplicating documents/pages, but none for
> > near-duplicate documents. Can you guide me a little further as to
> > where exactly in Nutch I should be concentrating with regard to
> > near-duplicate documents?
> >
> > Regards,
> > Rishabh
> >
> > On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
> > wrote:
> >
> >
> > > To whoever started this thread: look at Nutch.  I believe something
> > > related to this already exists in Nutch for near-duplicate detection.
> > >
> > > Otis
> > > --
> > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > >
> > > ----- Original Message ----
> > > From: Mike Klaas <[EMAIL PROTECTED]>
> > > To: solr-user@lucene.apache.org
> > > Sent: Sunday, November 18, 2007 11:08:38 PM
> > > Subject: Re: Near Duplicate Documents
> > >
> > > On 18-Nov-07, at 8:17 AM, Eswar K wrote:
> > >
> > > > Is there any plan to implement that feature in the upcoming
> > > > releases?
> > >
> > > Not currently.  Feel free to contribute something if you find a good
> > > solution <g>.
> > >
> > > -Mike
> > >
> > >
> > > > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> > > >
> > > >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> > > >>> We have a scenario where we want to find documents which are
> > > >>> similar in content. To elaborate a little more on what we mean
> > > >>> here, let's take an example.
> > > >>>
> > > >>> This email chain we are interacting in can best be used to
> > > >>> illustrate the concept of near dupes (we are not confusing this
> > > >>> with threads; they are two different things). Each email in this
> > > >>> thread is treated as a document by the system. A reply to the
> > > >>> original mail also includes the original mail, in which case it
> > > >>> becomes a near duplicate of the original mail (depending on the
> > > >>> percentage of similarity). And so on. The near dupes need not be
> > > >>> limited to emails.
> > > >>
> > > >> I think this is what's known as "shingling."  See
> > > >> http://en.wikipedia.org/wiki/W-shingling
> > > >> Lucene (and therefore Solr) does not implement shingling.  The
> > > >> "MoreLikeThis" query might be close enough, however.
> > > >>
> > > >> -Stuart
> > > >>
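> > > >> A tiny sketch of the resemblance measure that article describes
> > > >> (word 4-grams plus their Jaccard overlap; purely illustrative,
> > > >> not anything that Lucene or Solr ships):
> > > >>
> > > >>     import java.util.*;
> > > >>
> > > >>     public class Shingling {
> > > >>         // All overlapping runs of w consecutive words in the text.
> > > >>         static Set<String> shingles(String text, int w) {
> > > >>             String[] words = text.toLowerCase().split("\\W+");
> > > >>             Set<String> set = new HashSet<String>();
> > > >>             for (int i = 0; i + w <= words.length; i++) {
> > > >>                 set.add(String.join(" ", Arrays.copyOfRange(words, i, i + w)));
> > > >>             }
> > > >>             return set;
> > > >>         }
> > > >>
> > > >>         // Resemblance: |intersection| / |union| of the shingle sets.
> > > >>         static double resemblance(Set<String> x, Set<String> y) {
> > > >>             Set<String> inter = new HashSet<String>(x);
> > > >>             inter.retainAll(y);
> > > >>             Set<String> union = new HashSet<String>(x);
> > > >>             union.addAll(y);
> > > >>             return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
> > > >>         }
> > > >>
> > > >>         public static void main(String[] args) {
> > > >>             String original = "Please review the attached design document before Friday";
> > > >>             String reply = "Thanks! Please review the attached design document before Friday";
> > > >>             // Prints roughly 0.83: a near duplicate, not an exact one.
> > > >>             System.out.println(resemblance(shingles(original, 4), shingles(reply, 4)));
> > > >>         }
> > > >>     }
> > > >>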
> > >
> > >
> > >
> > >
> > >
> >
>
>
>
> --
> Regards,
>
> Cuong Hoang
>
