Hi! Thank you for your thoughtful response, Shawn! I have a couple of follow-up questions and points on which to check my understanding.
Thanks for explaining my misunderstanding of implicit routing. So to repeat back and check my understanding: with implicit routing, distribution may either be left up to SOLR, or you may specify the router.field parameter at collection creation time. If no router.field parameter is specified, SOLR will distribute documents to shards based solely on a hash of the docID; if router.field is defined, SOLR will distribute documents to shards based solely on a hash of the router.field value on the doc. Correct?

So let me focus a bit more on the composite ID router. The options available here are:

- single-prepend routing (tenant1!id1)
- multi-prepend routing (tenant1!piece1!id1)
- num-bits prepend routing (tenant1/B!id1)

I think the first two are relatively straightforward; the ask on the application layer is to supply one or two prepends, and then SOLR will find an appropriate shard to host the document based on a hash of the prepend(s).

I'm very interested, though, in the num-bits prepend. (By the way, I never found an agreed-upon name for this, so let me know if there is something standard I should call it.) When I originally wrote, I had a misunderstanding here, but I believe I've understood it fully now. If B is your "bits" param, then a tenant will be spread out over a 1/(2^B) fraction of the shards: if B = 0, any shard may end up hosting the doc; if B = 1, half of the shards may host it; if B = 2, one quarter of the shards may host it; and so on.

I was still a bit uncertain about the mechanism until I looked deeper into this documentation <https://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html> and this article <https://lucidworks.com/2013/06/13/solr-cloud-document-routing/>:

"[The composite ID router] creates a composite 32 bit hash by taking 16 bits from the shard key's hash and 16 bits from the document id's hash....
When a tenant is too large to fit on a single shard it can be spread across multiple shards by specifying the number of bits to use from the shard key. The syntax for this is: shard_key/num!document_id. The /num is the number of bits from the shard key to use in the composite hash."

So I think this means that the hashes of each part are computed first, and then the bits are taken from the resulting hash values. Is that correct? I believe that means that if num = 16, it would be the same as omitting the /num param. I believe it also implies that if the number of shards is not a power of 2, some irregularity in the number of shards per tenant will be experienced; e.g., with a collection of 11 shards and num = 2, 11/2^2 = 2.75, so every tenant will live on at least 3 shards, and some may end up living on 4, depending on exactly how the ranges work out.

Some of what implicit routing offers could be quite desirable if the document shard-routing ever needs to be updated. Specifically, the ID of the document could remain constant even when making updates to the routing key, which I believe would allow in-place updates to a document that change its host shard. For example, if I create a collection using implicit routing with router.field=shard_key, then insert a document with id=1 and shard_key=1, and later insert a document with id=1 and shard_key=2, the original document sent with shard_key=1 would be automatically deleted on insert of the new document. Can you confirm whether this is true? Or would the original document with id=1 and shard_key=1 not necessarily be deleted?

The only drawback I then see to using implicit routing would be that there is no equivalent to the num-bits prepend: if you want to spread the documents associated with a given tenant across fewer than all but more than one shard, it falls to the developer to update the shard key. Is this correct, or is there an equivalent notion to num-bits for implicit routing?
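To make sure I have the bit arithmetic right, here is a small Python sketch of my understanding. This is not Solr's actual implementation: it uses a stand-in hash rather than MurmurHash3, assumes shard ranges split the 32-bit ring evenly, and all helper names are my own.

```python
import hashlib
import math

def h32(s: str) -> int:
    """Stand-in 32-bit hash (NOT Solr's MurmurHash3; any roughly
    uniform hash illustrates the bit arithmetic the same way)."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

def composite_hash(shard_key: str, doc_id: str, bits: int = 16) -> int:
    """Sketch of my reading of compositeId: the top `bits` bits of the
    32-bit composite come from the shard key's hash, the remaining
    bits from the document id's hash. bits=16 matches the default
    (i.e., no /num suffix)."""
    top = (h32(shard_key) >> (32 - bits)) << (32 - bits)
    low = h32(doc_id) & ((1 << (32 - bits)) - 1)
    return top | low

def shard_index(h: int, num_shards: int) -> int:
    """Assume the 32-bit hash ring is split into equal shard ranges."""
    return (h * num_shards) >> 32

# The worked example from above: 11 shards, /num = 2.
S, B = 11, 2
print(S / 2 ** B)             # 2.75
print(math.ceil(S / 2 ** B))  # 3: every tenant spans at least 3 shards

# Empirically: all of tenant1's docs land in one contiguous 1/2**B
# slice of the ring, touching 3 shards (4 if the slice happens to
# straddle an extra range boundary).
hit = {shard_index(composite_hash("tenant1", f"doc{i}", bits=B), S)
       for i in range(10_000)}
print(sorted(hit))
```

If this sketch matches the real mechanism, it also confirms the irregularity I described: the tenant's shard count is ceil(S / 2^B), plus one when its hash slice straddles a range boundary.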
There's one other thing I want to drill down on a bit more with resource usage:

> Query performance should be about the same for any routing type. It does
> look like when you use compositeId and actually implement shard keys, you
> can limit queries to those shards, but a *general* query is going to hit
> all shards.

Currently I have experience targeting one shard at a time with queries. My goal in this architecture is to be a little more flexible, and instead keep the number of shards a given query has to hit approximately constant even as the user base and the SOLR cloud grow. I believe that will keep CPU sprawl to a minimum; more on that below.

> If your query rate is very low (or shards are distributed across a lot
> of hardware that has significant spare CPU capacity) performance isn't
> going to be dramatically different for a query that hits 2 shards versus
> one that hits 6 shards. If your query rate is high, then more shards
> probably will be noticeably slower than fewer shards.

Generally I want to plan for a relatively high query rate, but my experience shows that the SOLR clouds I've built before have very low CPU usage (spikes go to about 15%, steady state much lower), so assuming there is headroom at one shard per query is a very reasonable starting point for me right now. I guess what I'm asking is: let's say the number of shards hit goes from 1 to 2 for every shard_key, tenant, and query. Will the CPU usage double? How about 2 to 4; should I expect it to double again? My expectation is "yes", but I wanted to see what the SOLR community thinks here.

Thanks as well for explaining more on the vertical cap of a shard: I got the approximate 120M guidance and found it to be shockingly accurate on a couple of production clouds I worked on, so I may have over-generalized a bit :)

One other question I have: what is the best way to route queries? I've seen both the _route_ param and the shard.keys param.
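For concreteness, here's a sketch of what I mean by the two forms. The host, collection name, and query are placeholders of my own, not anything from a real deployment; I'm only illustrating the shape of the two parameters.

```python
from urllib.parse import urlencode

# Hypothetical query endpoint; "mycoll" and the field name are placeholders.
base = "http://localhost:8983/solr/mycoll/select"

# Form 1: the _route_ parameter. Note the trailing "!" so the value
# matches the shard-key prefix of the composite ids.
route_qs = urlencode({"q": "body:hello", "_route_": "tenant1!"})

# Form 2: the shard.keys parameter, same shape.
keys_qs = urlencode({"q": "body:hello", "shard.keys": "tenant1!"})

print(f"{base}?{route_qs}")
print(f"{base}?{keys_qs}")
```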
Is there a reason to choose one over the other?

Thanks again so much for your thoughtful response! Looking forward to hearing back.

Cheers,
Stephen

On Fri, Mar 2, 2018 at 4:51 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 3/2/2018 11:43 AM, Stephen Lewis wrote:
> > I'm wondering what information you may be able to provide on performance
> > implications of implicit routing VS composite ID routing. In particular,
> > I'm curious what the horizontal scaling behavior may be of implicit routing
> > or composite ID routing with and without the "/<bits>" param appended on.
>
> The hash calculations should probably introduce so little overhead that
> you'd never notice it.
>
> I once implemented a hash algorithm using two hash classes built into
> Java. I'm pretty sure that it was NOT a fast implementation ... and it
> could calculate over a million hashes per second on input strings of
> about 20 characters.
>
> The hash algorithm used by CompositeId (one of the MurmurHash
> implementations) is supposed to be one of the fastest algorithms in the
> world. Unless your uniqueId field values are extremely huge, I really
> doubt that hash calculation is a significant source of overhead.
>
> The use of implicit doesn't automatically mean there's no overhead for
> routing. The implicit router can still redirect documents to different
> shards, it just does it explicitly, usually with a shard name in a
> particular field, rather than by hash calculation.
>
> > A relatively simple assessment I've done below leads me to believe the
> > following is likely the case: if we have S shards and B as our "/bits"
> > param, then resource usage would Big O scale as follows (note: previously
> > I've received the advice that any shard should be capped at a max of 120M
> > documents, which is where the cap on docs/shard-key comes from)
>
> Query performance should be about the same for any routing type.
> It does look like when you use compositeId and actually implement shard
> keys, you can limit queries to those shards, but a *general* query is
> going to hit all shards.
>
> If your query rate is very low (or shards are distributed across a lot
> of hardware that has significant spare CPU capacity) performance isn't
> going to be dramatically different for a query that hits 2 shards versus
> one that hits 6 shards. If your query rate is high, then more shards
> probably will be noticeably slower than fewer shards.
>
> For the maximum docs to allow per shard: It depends. For some indexes
> and use cases, a million documents per shard might be way too big. For
> others, 500 million per shard might have incredible performance. There
> are no hard rules about this. It's entirely dependent on what you're
> actually doing.
>
> Thanks,
> Shawn

--
www.stephen-lewis.net