Mark Miller wrote:
Thanks for sharing, Marc, that's very nice to know. I'll take your experience as a starting point for some wiki recommendations.

Sounds like we should add a switch to order alphabetically as well.

On the general topic of near-duplicate detection: I found this paper in the proceedings of SIGIR-08. It presents an interesting and relatively simple algorithm that yields excellent results. Who has some spare CPU cycles to implement this? ;)

http://ilpubs.stanford.edu:8090/860/

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
