Hi,

I'd like some feedback on my plan for solving the following sharding problem.


I have a collection that will eventually become big.

Average document size is 1.5 KB.
Every year 30 million documents will be indexed.

Data comes from different document producers (a person, the owner of his 
documents), and queries are almost always performed by a document producer, 
who can only query his own documents. So sharding by document producer seems 
a good choice.

There are 3 types of document producer:

type A
  cardinality: 105 (there are 105 producers of this type)
  produces 17M docs/year (the aggregated production of all type A producers)
type B
  cardinality: ~10k
  produces 4M docs/year
type C
  cardinality: ~10M
  produces 9M docs/year
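A quick back-of-the-envelope calculation on the figures above shows why type 
A is the case to worry about (numbers are the ones listed, nothing else 
assumed):

```python
# Rough yearly load per producer, per type, from the figures above.
types = {
    "A": {"cardinality": 105, "docs_per_year": 17_000_000},
    "B": {"cardinality": 10_000, "docs_per_year": 4_000_000},
    "C": {"cardinality": 10_000_000, "docs_per_year": 9_000_000},
}

for name, t in types.items():
    per_producer = t["docs_per_year"] / t["cardinality"]
    print(f"type {name}: ~{per_producer:,.0f} docs/year per producer")
```

So a single type A producer contributes roughly 160k docs/year, while a type 
C producer contributes less than one doc per year on average: the placement 
of the 105 type A producers dominates shard balance.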

I'm thinking about using compositeId ( solrDocId = producerId!docId ) to 
send all docs of the same producer to the same shard. When a shard becomes 
too large I can use shard splitting.
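As I understand Solr's CompositeIdRouter, the route hash takes the top 16 
bits from the shard-key (producer) hash and the bottom 16 bits from the 
doc-id hash, so all docs sharing a producer prefix fall in the same shard's 
hash range. A minimal sketch of that structure (zlib.crc32 as a stand-in 
for MurmurHash3, 16 shards with equal hash ranges):

```python
import zlib

def h32(s: str) -> int:
    # Stand-in 32-bit hash; Solr actually uses MurmurHash3.
    return zlib.crc32(s.encode())

def route(composite_id: str) -> int:
    # "producerId!docId": top 16 bits come from the producer part,
    # bottom 16 bits from the doc part.
    producer, doc = composite_id.split("!", 1)
    combined = (h32(producer) & 0xFFFF0000) | (h32(doc) & 0x0000FFFF)
    # With 16 shards of equal hash range, the top 4 bits pick the shard,
    # so every doc of one producer lands on the same shard.
    return combined >> 28

shards = {route(f"producerA1!doc{i}") for i in range(1000)}
print(shards)  # a single shard index
```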

Problems:
- documents from type A producers could be unevenly distributed among the 
shards, because hashing doesn't balance well with so few keys (105); see 
the Appendix.

As a solution I could do this when a new typeA producer (producerA1) arrives:

1) client app: generate a producer code
2) client app: simulate murmurhashing and shard assignment
3) client app: check that the shard assignment is optimal (the producer 
code is assigned to the shard with the fewest type A producers); otherwise 
go to 1) and try another code
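The three steps above could be sketched like this (helper names and the 
code format are hypothetical; zlib.crc32 stands in for murmur, with the 
same hash-top-16-bits-then-modulo scheme as the appendix simulation):

```python
import itertools
import zlib

NUM_SHARDS = 16

def shard_of(producer_code: str) -> int:
    # Same scheme as the appendix simulation: top 16 bits of a
    # 32-bit hash, modulo the number of shards.
    return (zlib.crc32(producer_code.encode()) >> 16) % NUM_SHARDS

def pick_producer_code(type_a_per_shard: list) -> str:
    # 1) generate a candidate code, 2) simulate hashing and shard
    # assignment, 3) accept the first code landing on a shard with
    # the fewest type A producers; otherwise try another code.
    target = min(type_a_per_shard)
    for n in itertools.count():
        code = f"A{n:06d}"  # hypothetical producer-code format
        if type_a_per_shard[shard_of(code)] == target:
            return code

loads = [4, 10, 6, 7, 8, 6, 7, 8, 11, 1, 8, 5, 6, 5, 5, 8]
code = pick_producer_code(loads)
print(code, "-> shard", shard_of(code))
```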

When I add documents or perform searches for producerA1, I use its producer 
code in the compositeId or in the _route_ parameter respectively.
What do you think?
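Concretely, something like this (field names and ids are made up for 
illustration; the trailing "!" in _route_ routes by the shard-key part 
alone):

```python
# Indexing: the compositeId prefix does the routing.
doc = {
    "id": "producerA1!doc-42",   # producerA1 is the generated code
    "producer_id": "producerA1",
    "body": "...",
}

# Querying: restrict the search to producerA1's shard via _route_.
query_params = {
    "q": "producer_id:producerA1 AND body:invoice",
    "_route_": "producerA1!",
}

print(doc["id"], query_params["_route_"])
```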


-----------Appendix: murmurhash shard assignment simulation-----------------

import mmh3  # pip install mmh3

# Top 16 bits of the murmur3 hash of each of the 105 producer ids,
# as the CompositeIdRouter uses for the shard-key part.
hashes = [mmh3.hash(str(i)) >> 16 for i in range(105)]

num_shards = 16
shards = [0] * num_shards

# Count how many producer keys land on each shard.
for h in hashes:
    idx = h % num_shards
    shards[idx] += 1

print(shards)
print(sum(shards))

-------------

result: [4, 10, 6, 7, 8, 6, 7, 8, 11, 1, 8, 5, 6, 5, 5, 8]

So with 16 shards and 105 shard keys I can end up with a shard holding 
only 1 key next to a shard holding 11 keys.
