On Fri, 2012-08-31 at 02:25 +0200, Lance Norskog wrote:
> The math for "confidence values" in probability theory shows that
> distributed DF does not matter after not very many documents. If you
> have 10s of thousands of documents in each shard, don't worry.

The old advice of distributing the documents by hashing the id or a similar deterministic method is sound enough. However, it is my experience that sharding is often done by source or material: When building a workflow, it is the logical thing to do. This might be more of an educational than a technical problem.

For setups with a large unchanging set of data and a smaller set with high update frequency, the standard advice is to have a large unchanging shard and a smaller NRT one. For that case, I would expect that the unchanging data is often quite different from the changing data.

Third case: Distributed search where the separate indexes are controlled by different parties, where the parties do want to collaborate on the distribution part but do not want to have their data indexed by the other parties. We currently have this challenge.

Regards,
Toke Eskildsen