Thanks, Erick.   I appreciate the sanity check.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, May 28, 2015 5:50 PM
To: solr-user@lucene.apache.org
Subject: Re: optimal shard assignment with low shard key cardinality using 
compositeId to enable shard splitting

Charles:

You raise good points, and I didn't mean to say that co-locating docs based on
some criterion is never a good idea. That said, it does add administrative
complexity that I'd prefer to avoid unless necessary.

I suppose it largely depends on what the load and response SLAs are.
If there's 1 query/second peak load, the sharding overhead for queries is 
probably not noticeable. If there are 1,000 QPS, then it might be worth it.

Measure, measure, measure......

I think your composite ID understanding is fine.

Best,
Erick

On Thu, May 28, 2015 at 1:40 PM, Reitzel, Charles 
<charles.reit...@tiaa-cref.org> wrote:
> We have used a similar sharding strategy for exactly the reasons you say.
> But we are fairly certain that the number of documents per user ID is < 5,000
> and, typically, < 500. Thus, we think the overhead of distributed searches
> clearly outweighs the benefits. Would you agree? We have done some load
> testing (with hundreds of simultaneous users) and performance has been good
> with data and queries distributed evenly across shards.
>
> In Matteo's case, this model appears to apply well to user types B and C.    
> Not sure about user type A, though.    At > 100,000 docs per user per year, 
> on average, that load seems ok for one node.   But, is it enough to benefit 
> significantly from a parallel search?
>
> With a 2-part composite ID, each part will contribute 16 bits to a 32-bit
> hash value, which is then compared to the set of hash ranges for each active
> shard. Since the user ID contributes the high-order bits, it will
> dominate in matching the target shard(s). But dominance doesn't mean the
> lower-order 16 bits will always be ignored, does it? I.e., if the original
> shard has been split, perhaps multiple times, isn't it possible that one user
> ID's documents will be spread over multiple shards?
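>
> To make my reading concrete, here is a rough Python sketch (same mmh3 package
> as Matteo's appendix below, assuming it matches Solr's murmurhash3_x86_32 on
> these ASCII ids; the helper name and sample ids are just for illustration) of
> how I believe the default 2-part hash is assembled:
>
> import mmh3
>
> def composite_hash(user_id, doc_id, bits=16):
>     # My understanding of default compositeId routing: the top `bits` bits
>     # come from the hash of the shard key (user ID), the remaining
>     # low-order bits from the hash of the doc ID.
>     upper_mask = (0xFFFFFFFF >> (32 - bits)) << (32 - bits)
>     lower_mask = 0xFFFFFFFF >> bits
>     h_user = mmh3.hash(user_id) & 0xFFFFFFFF   # as unsigned 32-bit
>     h_doc = mmh3.hash(doc_id) & 0xFFFFFFFF
>     return (h_user & upper_mask) | (h_doc & lower_mask)
>
> # Two docs for the same user share the same high 16 bits but differ in the
> # low 16 bits, so a split that cuts inside that 1/65536 slice of the hash
> # ring could route them to different sub-shards.
> print(hex(composite_hash("user42", "doc1")))
> print(hex(composite_hash("user42", "doc2")))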
>
> In Matteo's case, it might make sense to allocate fewer bits to the user ID
> for type A producers. I.e., what I described above is the default for
> userId!docId. But if you use userId/8!docId (8 bits from the userId hash and
> the remaining 24 bits from the document ID hash), then couldn't one user's
> docs be split over multiple shards, even without splitting?
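>
> Using the composite_hash sketch above with bits=8 (just to illustrate the
> alternative allocation I mean):
>
> # Hypothetical 8/24 split: a user's docs now occupy a 1/256 slice of the
> # 32-bit ring rather than a 1/65536 slice; whether that slice straddles a
> # shard boundary depends on the current shard hash ranges.
> print(hex(composite_hash("user42", "doc1", bits=8)))
> print(hex(composite_hash("user42", "doc2", bits=8)))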
>
> I'm just making sure I understand correctly how composite ID sharding works.
> Have I got it right? Has any of this logic changed in 5.x?
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Thursday, May 21, 2015 11:30 AM
> To: solr-user@lucene.apache.org
> Subject: Re: optimal shard assignment with low shard key cardinality 
> using compositeId to enable shard splitting
>
> I question your base assumption:
>
> bq: So shard by document producer seems a good choice
>
>  Because what this _also_ does is force all of the work for a query onto one 
> node and all indexing for a particular producer ditto. And will cause you to 
> manually monitor your shards to see if some of them grow out of proportion to 
> others. And....
>
> I think it would be much less hassle to just let Solr distribute the docs as
> it may based on the uniqueKey and forget about it. Unless you want, say, to
> do joins etc.... There will, of course, be some overhead that you pay here,
> but unless you can measure it and it's a pain, I wouldn't add the complexity
> you're talking about, especially at the volumes you're talking about.
>
> Best,
> Erick
>
> On Thu, May 21, 2015 at 3:20 AM, Matteo Grolla <matteo.gro...@gmail.com> 
> wrote:
>> Hi,
>> I'd like some feedback on how I plan to solve the following
>> sharding problem.
>>
>>
>> I have a collection that will eventually become big
>>
>> Average document size is 1.5 KB.
>> Every year, 30 million documents will be indexed.
>>
>> Data comes from different document producers (a person, the owner of his
>> documents), and queries are almost always performed by a document
>> producer, who can only query his own documents. So shard by document
>> producer seems a good choice.
>>
>> there are 3 types of doc producer:
>> - type A: cardinality 105 (there are 105 producers of this type), producing
>> 17M docs/year (the aggregated production of all type A producers)
>> - type B: cardinality ~10k, producing 4M docs/year
>> - type C: cardinality ~10M, producing 9M docs/year
>>
>> I'm thinking about
>> using compositeId ( solrDocId = producerId!docId ) to send all docs of the
>> same producer to the same shard. When a shard becomes too large I can use
>> shard splitting.
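>>
>> For the splitting step I have in mind the SPLITSHARD collections API,
>> roughly like this (URL and collection name are just examples):
>>
>> import requests
>>
>> # Split an oversized shard in two; Solr divides the parent shard's hash
>> # range between the two new sub-shards.
>> requests.get("http://localhost:8983/solr/admin/collections",
>>              params={"action": "SPLITSHARD",
>>                      "collection": "documents",
>>                      "shard": "shard1"})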
>>
>> problems
>> - documents from type A producers could be unevenly distributed among
>> shards, because hashing doesn't spread a small number of distinct keys
>> (105) evenly; see the Appendix
>>
>> As a solution, I could do the following when a new type A producer (producerA1) arrives:
>>
>> 1) client app: generate a producer code
>> 2) client app: simulate murmur hashing and shard assignment
>> 3) client app: check that the assignment is optimal (the producer code
>> lands on the shard with the fewest type A producers); otherwise go back to
>> step 1 and try another code (see the sketch below)
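>>
>> Rough sketch of steps 1-3 (the shard count, code format and routing
>> approximation are assumptions on my part, in the same spirit as the Appendix):
>>
>> import random
>> import string
>> import mmh3
>>
>> NUM_SHARDS = 16
>> producers_per_shard = [0] * NUM_SHARDS   # current type A producers per shard
>>
>> def shard_for(code):
>>     # Approximate Solr's routing: the high 16 bits of the murmur hash of
>>     # the shard key select a slice of the ring; map that slice to one of
>>     # NUM_SHARDS equal default ranges.
>>     return ((mmh3.hash(code) & 0xFFFFFFFF) >> 16) * NUM_SHARDS // 65536
>>
>> def new_producer_code(max_tries=100):
>>     # Steps 1-3: keep generating candidate codes until one hashes to a
>>     # currently least-loaded shard, or give up and accept the last candidate.
>>     target = min(producers_per_shard)
>>     code, s = None, 0
>>     for _ in range(max_tries):
>>         code = "A" + "".join(random.choices(string.ascii_lowercase, k=8))
>>         s = shard_for(code)
>>         if producers_per_shard[s] == target:
>>             break
>>     producers_per_shard[s] += 1
>>     return code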
>>
>> when I add documents or perform searches for producerA1, I use its
>> producer code in the compositeId or in the _route_ parameter, respectively
>> (example below). What do you think?
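>>
>> Concretely, indexing and querying would look roughly like this (collection
>> URL and field names are made up; plain HTTP via the requests package):
>>
>> import requests
>>
>> SOLR = "http://localhost:8983/solr/documents"   # example collection URL
>>
>> # Index: the compositeId prefix co-locates all of producerA1's docs.
>> docs = [{"id": "producerA1!doc-001", "body_txt": "first document"}]
>> requests.post(SOLR + "/update", params={"commit": "true"}, json=docs)
>>
>> # Query: _route_ with the same prefix (note the trailing '!') sends the
>> # request only to the shard(s) covering producerA1's hash range.
>> resp = requests.get(SOLR + "/select",
>>                     params={"q": "body_txt:first", "_route_": "producerA1!"})
>> print(resp.json()["response"]["numFound"])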
>>
>>
>> ----------- Appendix: murmurhash shard assignment simulation -----------
>>
>> import mmh3
>>
>> # Simulate routing for 105 producer codes: take the top 16 bits of the
>> # murmur hash (the part the compositeId prefix controls) and count how
>> # many codes land on each of 16 shards.
>> hashes = [mmh3.hash(str(i)) >> 16 for i in range(105)]
>>
>> num_shards = 16
>> shards = [0] * num_shards
>>
>> for h in hashes:
>>     idx = h % num_shards
>>     shards[idx] += 1
>>
>> print(shards)
>> print(sum(shards))
>>
>> -------------
>>
>> result: [4, 10, 6, 7, 8, 6, 7, 8, 11, 1, 8, 5, 6, 5, 5, 8]
>>
>> so with 16 shards and 105 shard keys I can end up with one shard holding
>> only 1 key and another holding 11 keys
>>
>

