Lance Norskog wrote:
Thanks!  We made variants of this and a couple of other files.

As to why we have the same document in different shards with different
contents: once you hit a certain index size and ingest rate, it is easiest
to create a series of indexes and leave the older ones alone. In the future,
please consider this as a legitimate use case instead of simply a mistake.

You may be interested in implementing something like this:

"Compact Features for Detection of Near-Duplicates in Distributed Retrieval", Yaniv Bernstein, Milad Shokouhi, and Justin Zobel

It sounds straightforward, and it relieves you of the need to de-duplicate your collection.
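
To make the idea concrete, here is a rough sketch of de-duplicating at merge time rather than in the collection. It is not the algorithm from the paper above, just a generic illustration under my own assumptions: each shard returns a small fingerprint with every hit (here a handful of hashed word 3-grams), and the merging node drops any hit whose fingerprint overlaps heavily with a higher-scored hit. The names fingerprint, resemblance, merge_shard_results, the hit fields "id", "score", "fp", and the 0.8 threshold are all hypothetical.

```python
import hashlib


def fingerprint(text, k=3, size=16):
    """Compact feature (assumed design): the `size` smallest hashes of word k-grams."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}
    hashes = sorted(
        int(hashlib.md5(s.encode("utf-8")).hexdigest()[:8], 16) for s in shingles
    )
    return frozenset(hashes[:size])


def resemblance(a, b):
    """Overlap of two fingerprints: shared hashes divided by all hashes seen."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def merge_shard_results(shard_results, threshold=0.8):
    """Merge per-shard hit lists by score, skipping near-duplicates of better hits."""
    merged = sorted(
        (hit for hits in shard_results for hit in hits),
        key=lambda h: h["score"],
        reverse=True,
    )
    kept = []
    for hit in merged:
        # Keep the hit only if it does not resemble anything already kept.
        if all(resemblance(hit["fp"], other["fp"]) < threshold for other in kept):
            kept.append(hit)
    return kept
```

The fingerprints would be computed once at index time and stored with each document, so the merging node only compares a few integers per hit and never needs the document text. Note that this keeps the higher-scored copy; if the newer shard should always win, the merge would need a timestamp or shard-order rule instead.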

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com