It's correct that uncommon words are most likely not showing up in the
signature. However, I was trying to say that if two documents have 99%
of their tokens in common and differ in one token whose frequency is
above the quantization cutoff, the two resulting hashes are completely
different.

I think the change in the token frequency would have to make it jump between quantized levels in order to change the MD5 hash. That can obviously happen, but the point of quantizing the frequencies is to make it much less likely.
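
For illustration, here is a rough sketch of that quantize-then-hash idea. It is not the actual TextProfileSignature code; the fixed QUANT step, the tokenization, and the profile layout are simplifications (the real class, per its javadoc, derives its quantum from the maximum token frequency):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

public class QuantizedSignature {

    // Illustrative quantization step: round each token count down to a
    // multiple of QUANT, so small count changes usually don't alter the
    // profile. (A simplification of the real quantum computation.)
    private static final int QUANT = 2;

    public static String signature(String text) throws NoSuchAlgorithmException {
        // Count token frequencies; TreeMap gives a deterministic order.
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.length() == 0) continue;
            Integer c = counts.get(token);
            counts.put(token, c == null ? 1 : c + 1);
        }

        // Build the profile from quantized counts; tokens whose quantized
        // count rounds to zero (the uncommon words) drop out entirely.
        StringBuilder profile = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int quantized = (e.getValue() / QUANT) * QUANT;
            if (quantized == 0) continue;
            profile.append(e.getKey()).append(':').append(quantized).append(' ');
        }

        // MD5 of the quantized profile: only a change that crosses a
        // quantization boundary produces a different hash.
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(profile.toString().getBytes());
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}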

Also, we're using this for our tech page crawl, and it seems to be working pretty well, though we haven't done any in-depth analysis.

If you
want true near dup detection, what you would like to have is two
hashes that differ in only 1-2 bytes. That way, the signatures will
truly reflect the content of the documents they represent. However,
with this approach you need a bit more work to cluster near dup
documents. Basically, once you have a hash function like the one I
describe above, finding similar documents comes down to a Hamming
distance problem: two docs are near dups if their hashes differ in at
most k positions (with k small, maybe < 3).
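
As a concrete illustration, here is a minimal sketch of that check, treating positions as bits of a 64-bit fingerprint (the bit width, the threshold, and the use of Java's Long.bitCount are my own choices, not anything prescribed above):

public class HammingNearDup {

    // Two documents are near duplicates if their bit-level signatures
    // differ in at most k positions: XOR the fingerprints and count the
    // bits that survive.
    public static boolean isNearDup(long sigA, long sigB, int k) {
        return Long.bitCount(sigA ^ sigB) <= k;
    }

    public static void main(String[] args) {
        long a = 0xDEADBEEFCAFEBABEL;
        long b = a ^ 0x3L; // flip the two low bits
        System.out.println(isNearDup(a, b, 3)); // true: Hamming distance is 2
    }
}

The check itself is cheap; the extra work is in the clustering, since naively comparing every pair of documents is quadratic in the size of the crawl.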

I've seen a few different proximity hash functions - which one or ones are you referring to above?

Thanks,

-- Ken

On Nov 22, 2007 2:35 AM, Ken Krugler <[EMAIL PROTECTED]> wrote:
 >The duplication detection mechanism in Nutch is quite primitive. I
 >think it uses an MD5 signature generated from the content of a field.
 >The generation algorithm is described here:
 >http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html
 >
 >The problem with this approach is that the MD5 hash is very sensitive:
 >a one-letter difference will generate a completely different hash.

 I'm confused by your answer, assuming it's based on the page
 referenced by the URL you provided.

 The approach taken by TextProfileSignature would only generate a different
 MD5 hash with a single letter change if that change resulted in a
 change in the quantized frequency for that word. And if it's an
 uncommon word, then it wouldn't even show up in the signature.

 -- Ken


 >You
 >probably have to roll your own near duplication detection algorithm.
 >My advice is to have a look at the existing literature on near
 >duplication detection techniques and then implement one of them. I
 >know Google has some papers that describe a technique called minhash.
 >I read the paper and found it very interesting. I'm not sure if you
 >can implement the algorithm because they have patented it. That said,
 >there is plenty of literature on near dup detection, so you should be
 >able to get one for free!
 >
 >On Nov 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote:
 >>  Otis,
 >>
 >>  Thanks for your response.
 >>
 >>  I just gave a quick look at the Nutch forum and found that there is
 >>  an implementation to de-duplicate documents/pages, but none for near
 >>  duplicate documents. Can you guide me a little further as to where
 >>  exactly under Nutch I should be concentrating, regarding near
 >>  duplicate documents?
 >>
 >>  Regards,
 >>  Rishabh
 >>
 >>  On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
 >>  wrote:
 >>
 >>
 >>  > To whomever started this thread: look at Nutch.  I believe something
 >>  > related to this already exists in Nutch for near-duplicate detection.
 >>  >
 >>  > Otis
 >>  > --
 >>  > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 >>  >
 >>  > ----- Original Message ----
 >>  > From: Mike Klaas <[EMAIL PROTECTED]>
 >>  > To: solr-user@lucene.apache.org
 >>  > Sent: Sunday, November 18, 2007 11:08:38 PM
 >>  > Subject: Re: Near Duplicate Documents
 >>  >
 >>  > On 18-Nov-07, at 8:17 AM, Eswar K wrote:
 >>  >
 >>  > > Is there any plan to implement that feature in the upcoming
 >>  > > releases?
 >>  >
 >>  > Not currently.  Feel free to contribute something if you find a good
 >>  > solution <g>.
 >>  >
 >>  > -Mike
 >>  >
 >>  >
 >>  > > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
 >>  > >
 >>  > >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
 >>  > >>> We have a scenario where we want to find documents which are
 >>  > >>> similar in content. To elaborate a little more on what we mean
 >>  > >>> here, let's take an example.
 >>  > >>>
 >>  > >>> This email chain in which we are interacting can best be used
 >>  > >>> for illustrating the concept of near dupes (we are not getting
 >>  > >>> confused with threads; they are two different things). Each
 >>  > >>> email in this thread is treated as a document by the system. A
 >>  > >>> reply to the original mail also includes the original mail, in
 >>  > >>> which case it becomes a near duplicate of the original mail
 >>  > >>> (depending on the percentage of similarity). Similarly it goes
 >>  > >>> on. The near dupes need not be limited to emails.
 >>  > >>
 >>  > >> I think this is what's known as "shingling."  See
 >>  > >> http://en.wikipedia.org/wiki/W-shingling
 >>  > >> Lucene (and therefore Solr) does not implement shingling.  The
 >>  > >> "MoreLikeThis" query might be close enough, however.
 >>  > >>
 >>  > >> -Stuart
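
For completeness, w-shingling itself is easy to sketch. This is the textbook formulation (the window size w is a tunable parameter and the tokenization is my own simplification), not an existing Lucene/Solr feature:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Shingles {

    // Build the set of w-token shingles (contiguous token windows) for a
    // document; near duplicate documents share most of their shingles.
    public static Set<String> shingles(String text, int w) {
        // Tokenize; drop the empty token split() can produce at the start.
        List<String> tokens = new ArrayList<String>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (t.length() > 0) tokens.add(t);
        }
        Set<String> result = new HashSet<String>();
        for (int i = 0; i + w <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + w)));
        }
        return result;
    }
}

Resemblance between two documents is then the Jaccard overlap of their shingle sets, which is exactly the quantity the minhash sketch above estimates.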

 --
 Ken Krugler
 Krugle, Inc.
 +1 530-210-6378
 "If you can't find it, you can't fix it"




--
Regards,

Cuong Hoang


--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
