Hi, I'd like some feedback on how I plan to solve the following sharding problem.
I have a collection that will eventually become big:
- average document size is 1.5 KB
- 30 million documents will be indexed every year

Data comes from different document producers (a producer is a person who owns his documents), and queries are almost always performed by a document producer, who can only query his own documents. So sharding by document producer seems a good choice.

There are 3 types of document producer:
- type A: cardinality 105 (there are 105 producers of this type), producing 17M docs/year (the aggregated production of all type A producers)
- type B: cardinality ~10k, producing 4M docs/year
- type C: cardinality ~10M, producing 9M docs/year

I'm thinking about using compositeId routing (solrDocId = producerId!docId) to send all docs of the same producer to the same shard. When a shard becomes too large I can use shard splitting.

Problems:
- documents from type A producers could be unevenly distributed among shards, because hashing does not spread a small number of keys (105) evenly (see Appendix)

As a solution I could do this when a new type A producer (producerA1) arrives:
1) client app: generate a producer code
2) client app: simulate murmurhashing and shard assignment
3) client app: check that the shard assignment is optimal (the producer code is assigned to the shard with the fewest type A producers); otherwise go to 1) and try another code

When I add documents or perform searches for producerA1, I use its producer code in the compositeId or in the route parameter, respectively.

What do you think?

-----------Appendix: murmurhash shard assignment simulation-----------------------

import mmh3

# Hash 105 producer keys ("0" .. "104") and keep the upper 16 bits of each hash
hashes = [mmh3.hash(str(i)) >> 16 for i in range(105)]

num_shards = 16
shards = [0] * num_shards

# Count how many producer keys land on each shard
for h in hashes:
    idx = h % num_shards
    shards[idx] += 1

print(shards)
print(sum(shards))

-------------

result:
[4, 10, 6, 7, 8, 6, 7, 8, 11, 1, 8, 5, 6, 5, 5, 8]

So with 16 shards and 105 shard keys I can have shards with 1 key and shards with 11 keys.
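Steps 1) to 3) of the proposed procedure could be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not a tested Solr client: it reuses the appendix's simplified model (upper 16 bits of the hash, modulo the number of shards), and the names `murmur3_32`, `simulated_shard`, `pick_producer_code`, `assign_all`, and the `name-attempt` code format are all hypothetical. The hash is a pure-Python MurmurHash3 (x86, 32-bit) that returns an unsigned value, so the example runs without the mmh3 package.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python MurmurHash3 (x86, 32-bit); returns an unsigned 32-bit int."""
    mask = 0xFFFFFFFF
    c1, c2 = 0xCC9E2D51, 0x1B873593

    def rotl(x: int, r: int) -> int:
        return ((x << r) | (x >> (32 - r))) & mask

    h = seed & mask
    nblocks = len(data) // 4
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (rotl((k * c1) & mask, 15) * c2) & mask
        h = (rotl(h ^ k, 13) * 5 + 0xE6546B64) & mask
    tail = data[4 * nblocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (rotl((k * c1) & mask, 15) * c2) & mask
        h ^= k
    # Finalization mix
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def simulated_shard(code: str, num_shards: int) -> int:
    """Assumed model from the appendix: upper 16 bits of the hash, modulo shard count."""
    return (murmur3_32(code.encode("utf-8")) >> 16) % num_shards


def pick_producer_code(producer_name: str, type_a_per_shard: list, max_attempts: int = 1000):
    """Try candidate codes until one lands on a shard with the fewest type A producers."""
    num_shards = len(type_a_per_shard)
    fewest = min(type_a_per_shard)
    for attempt in range(max_attempts):
        code = f"{producer_name}-{attempt}"         # step 1: generate a candidate code
        shard = simulated_shard(code, num_shards)   # step 2: simulate the shard assignment
        if type_a_per_shard[shard] == fewest:       # step 3: accept only a least-loaded shard
            return code, shard
    raise RuntimeError(f"no suitable code found for {producer_name}")


def assign_all(num_producers: int = 105, num_shards: int = 16):
    """Assign producers one by one, tracking type A producers per shard."""
    counts = [0] * num_shards
    codes = {}
    for n in range(num_producers):
        name = f"producerA{n + 1}"
        code, shard = pick_producer_code(name, counts)
        codes[name] = code
        counts[shard] += 1
    return codes, counts
```

Calling `assign_all()` returns the chosen code per producer and the per-shard counts; because each producer is only accepted on a least-loaded shard, the counts should differ by at most one across shards. The client app would need to store the producer-to-code mapping and use the code as the compositeId prefix. If the real router assigns shards by hash range rather than by modulo, `simulated_shard` would have to be adjusted to match.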