On Fri, 2012-08-31 at 02:25 +0200, Lance Norskog wrote:
> The math for "confidence values" in probability theory shows that
> distributed DF does not matter after not very many documents. If you
> have 10s of thousands of documents in each shard, don't worry.

The old advice of distributing the documents by hashing id or a similar
deterministic method is sound enough. However, it is my experience that
sharding is often done by source or material: When building a workflow,
it is the logical thing to do. This might be more of an educational than
a technical problem.
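
As a rough illustration of the hashing approach, here is a minimal Python sketch. The function name shard_for, the use of md5 and the "newspapers/" id prefix are my own stand-ins for the example, not what Solr does internally. The point is only that ids from a single source end up spread evenly over all shards:

```python
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard index deterministically (hypothetical helper)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards

# All ids below come from one made-up source, yet they spread over every shard.
num_shards = 4
counts = [0] * num_shards
for i in range(10000):
    counts[shard_for(f"newspapers/{i}", num_shards)] += 1

print(counts)
```

Sharding by source would instead put all 10000 of these documents in the same shard, which is what skews the per-shard term statistics.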

For setups with a large unchanging set of data and a smaller set with
high update frequency, the standard advice is to have a large unchanging
shard and a smaller NRT one. For that case, I would expect that the
unchanging data is often quite different from the changing data.
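
To make the concern concrete, here is a small Python sketch with made-up numbers. It assumes the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)); the shard sizes and document frequencies are purely hypothetical:

```python
import math

def idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene-style inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical term that is rare in the big static shard
# but common in the small, frequently updated one.
static_docs, static_df = 1_000_000, 100
nrt_docs, nrt_df = 10_000, 5_000

idf_static = idf(static_df, static_docs)
idf_nrt = idf(nrt_df, nrt_docs)
idf_global = idf(static_df + nrt_df, static_docs + nrt_docs)

print(f"static shard: {idf_static:.2f}")
print(f"NRT shard:    {idf_nrt:.2f}")
print(f"global:       {idf_global:.2f}")
```

With local statistics, the same term is weighted very differently in the two shards, so scores from them are not directly comparable when the results are merged.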

Third case: Distributed search where the separate indexes are controlled
by different parties, where the parties do want to collaborate on the
distribution part but do not want to have their data indexed by the
other parties. We currently have this challenge.

Regards,
Toke Eskildsen
