On 21-Nov-07, at 12:29 AM, climbingrose wrote:
The problem with this approach is that an MD5 hash is very sensitive: a one-letter difference will generate a completely different hash. You will probably have to roll your own near-duplicate detection algorithm. My advice is to have a look at the existing literature on near-duplicate detection techniques and then implement one of them. I know Google has some papers that describe a technique called minhash. I read the paper and found it very interesting. I'm not sure whether you can implement the algorithm, because they have patented it. That said, there is plenty of literature on near-duplicate detection, so you should be able to find a technique you can use for free!
To help your googling: the main algorithm used for this is called 'shingling' or 'shingle printing'.
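
For a rough idea of what shingling looks like in practice, here is a minimal sketch in Python. It is not taken from any particular paper or library: the function names (`shingles`, `jaccard`, `md5_hex`), the choice of word-level shingles, and the shingle size `k=3` are all assumptions made for illustration. It splits each document into overlapping runs of k words and compares the resulting sets with the Jaccard coefficient, and it also shows the MD5 digests so you can see why an exact hash is no help with near duplicates.

```python
import hashlib


def md5_hex(text):
    """Exact-match fingerprint: any change to the text changes the digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Return the set of overlapping k-word shingles in text.

    Word-level shingles and k=3 are illustrative choices, not a standard.
    """
    words = text.lower().split()
    if len(words) < k:
        # Too short to form a full shingle: treat the whole text as one.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy cat near the river bank"

# The MD5 digests differ completely even though only one word changed.
print(md5_hex(doc1))
print(md5_hex(doc2))

# The shingle sets still overlap heavily, so the similarity stays high.
print(jaccard(shingles(doc1), shingles(doc2)))
```

You would then treat two documents as near duplicates when the similarity is above some threshold that you tune on your own data. For large collections, comparing full shingle sets pairwise gets expensive, which is the problem that sketching techniques such as minhash are meant to address.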
-Mike