Mark Miller wrote:
Thanks for sharing, Marc, that's very nice to know. I'll take your
experience as a starting point for some wiki recommendations.
Sounds like we should add a switch to order alphabetically as well.
On the general topic of near-duplicate detection: I found this paper
in the proceedings of SIGIR-08, which presents an interesting and
relatively simple algorithm that yields excellent results. Who has some
spare CPU cycles to implement this? ;)
http://ilpubs.stanford.edu:8090/860/
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com