On 21-Nov-07, at 12:29 AM, climbingrose wrote:

The problem with this approach is that an MD5 hash is very sensitive: a
one-letter difference will generate a completely different hash. You
will probably have to roll your own near-duplicate detection algorithm.
My advice is to have a look at the existing literature on near-duplicate
detection techniques and then implement one of them. I know Google has
some papers that describe a technique called minhash. I read the paper
and found it very interesting. I'm not sure whether you can implement the
algorithm, because they have patented it. That said, there is plenty of
literature on near-duplicate detection, so you should be able to find one
for free!
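
To illustrate the sensitivity point, here is a minimal sketch using Python's standard hashlib. The two sample sentences and the helper name md5_hex are made up for illustration; they are not from the original discussion.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 digest of text as a 32-character hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Two illustrative inputs that differ by a single letter (dog vs. cog).
a = md5_hex("The quick brown fox jumps over the lazy dog")
b = md5_hex("The quick brown fox jumps over the lazy cog")

# Count how many hex digits differ between the two digests.
differing = sum(x != y for x, y in zip(a, b))

print(a)
print(b)
print(differing, "of", len(a), "hex digits differ")
```

Because the digests share almost nothing, comparing MD5 values can only tell you whether two documents are byte-for-byte identical, not whether they are similar.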

To help your googling: the main algorithm used for this is called 'shingling' or 'shingle printing'.
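
For anyone who wants a starting point, here is a rough sketch of word-level shingling compared with Jaccard similarity. This is only one simple way to do it, not a definitive implementation: the function names, the shingle size k=3, and the sample sentences are all my own assumptions.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of overlapping k-word shingles in text.

    Assumption: tokens are lowercased, whitespace-separated words.
    A text shorter than k words yields one shingle with all its words.
    """
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0  # treat two empty documents as identical
    return len(a & b) / len(a | b)


doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over the lazy cog"

similarity = jaccard(shingles(doc1), shingles(doc2))
print(round(similarity, 2))
```

Unlike the MD5 comparison, a one-word change only affects the shingles that contain that word, so the score stays high for near duplicates. You would then pick a threshold (an arbitrary choice, say 0.7 or so) above which two documents count as duplicates. For large collections, comparing full shingle sets pairwise gets expensive, which is where sketching techniques like minhash come in.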

-Mike
