Lance Norskog wrote:
> Thanks! We made variants of this and a couple of other files.
> As to why we have the same document in different shards with different
> contents: once you hit a certain index size and ingest rate, it is easiest
> to create a series of indexes and leave the older ones alone. In the future,
> please consider this as a legitimate use case instead of simply a mistake.

You may be interested in implementing something like this:
"Compact Features for Detection of Near-Duplicates in Distributed
Retrieval", Yaniv Bernstein, Milad Shokouhi, and Justin Zobel
It sounds straightforward, and relieves you of the need to
de-duplicate your collection.
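
To illustrate the general idea only (this is a simplified sketch, not the
scheme from the paper): each shard could return a small fingerprint with
every hit, and the node that merges the results could drop hits whose
fingerprints overlap heavily with a better-scoring hit. The function names,
the (score, doc_id, fingerprint) hit layout, the shingle size and the 0.5
threshold are all assumptions made up for this example.

```python
import hashlib


def fingerprint(text, shingle_size=3, keep=8):
    """Compact feature: the `keep` smallest hashes of the word shingles."""
    words = text.lower().split()
    shingles = {
        " ".join(words[i:i + shingle_size])
        for i in range(max(1, len(words) - shingle_size + 1))
    }
    # md5 is used only as a stable hash; the built-in hash() varies per run.
    hashes = sorted(
        int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:4], "big")
        for s in shingles
    )
    return frozenset(hashes[:keep])


def similarity(a, b):
    """Jaccard overlap of two fingerprints, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def merge_results(shard_results, threshold=0.5):
    """Merge per-shard hit lists, dropping near-duplicates of better hits.

    Each hit is a hypothetical (score, doc_id, fingerprint) tuple.
    """
    all_hits = sorted(
        (hit for hits in shard_results for hit in hits),
        key=lambda hit: hit[0],
        reverse=True,
    )
    merged = []
    for hit in all_hits:
        # Keep the hit only if it is not too similar to anything kept so far.
        if all(similarity(hit[2], kept[2]) < threshold for kept in merged):
            merged.append(hit)
    return merged
```

With something like this, the same document living in two shards shows up
once in the merged list, and the shards themselves can be left alone.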
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com