+1. I've been making a case for this for some time now, and it was actually a focus of my talk last week. I'd be very happy to get this into 4.0.
We've tested various num_tokens values with the algorithm on clusters of various sizes, and we've found that 16 typically works best. With lower numbers, balance is good initially but degrades as the cluster gets larger. For example, on a 60 node cluster with 8 tokens per node we saw a difference of 22% in token ownership, but on a <=12 node cluster a difference of only 12%. 16 tokens, on the other hand, wasn't perfect but generally gave a better balance regardless of cluster size, at least up to 100 nodes. TBH we should probably do some proper testing and record all the results before we pick a default (I'm happy to do this; I think we can use the original testing script for it).

But anyway, I'd say Jon is on the right track. Personally, how I'd like to see it is that we:

1. Change allocate_tokens_for_keyspace to allocate_tokens_for_rf in the same way that DSE does it, allowing a user to specify an RF to allocate from, and allowing multiple DCs.
2. Add a new boolean property random_token_allocation, which defaults to false.
3. Make allocate_tokens_for_rf default to *unset* [*].
4. Make allocate_tokens_for_rf *required* [**] if num_tokens > 1 and random_token_allocation != true.
5. Default num_tokens to 16 (or whatever we find appropriate).

[*] I think setting a default is asking for trouble. When people add new DCs/nodes we don't want to risk them adding a node with the wrong RF. I think it's safe to say that a user should have to think about this before they spin up their cluster.

[**] Following on from the above, it should be required so that we don't have people accidentally using random allocation. I think we should really be aiming to get rid of random allocation completely, but provide a new property to enable it for backwards compatibility (and for testing).
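To make point 4 concrete, here is a rough sketch of the startup check I have in mind. It's written in Python purely for illustration. The function name check_token_config is made up, and allowing num_tokens == 1 without an RF is my reading of "required if num_tokens > 1", so treat it as a sketch of the rule rather than how it would actually be implemented in C*.

```python
from typing import Optional


def check_token_config(num_tokens: int = 16,
                       allocate_tokens_for_rf: Optional[int] = None,
                       random_token_allocation: bool = False) -> None:
    """Hypothetical startup check for the proposed settings.

    Defaults mirror the proposal: num_tokens = 16,
    allocate_tokens_for_rf unset, random_token_allocation = false.
    Raises ValueError if the combination would be rejected.
    """
    if num_tokens < 1:
        raise ValueError("num_tokens must be at least 1")

    # Rule 4: with vnodes, the operator must either name an RF to
    # allocate for or explicitly opt in to random allocation.
    if (num_tokens > 1
            and not random_token_allocation
            and allocate_tokens_for_rf is None):
        raise ValueError(
            "allocate_tokens_for_rf must be set when num_tokens > 1 "
            "and random_token_allocation is not true")


# Accepted: RF given explicitly.
check_token_config(num_tokens=16, allocate_tokens_for_rf=3)
# Accepted: operator explicitly opted in to random allocation.
check_token_config(num_tokens=16, random_token_allocation=True)
```

Note that with the proposed defaults alone the check fails, which is the point: a user has to make a decision before the node will start.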
It's worth noting that a smaller number of tokens *theoretically* increases the time for replacement/rebuild, since fewer nodes take part in streaming, so if we're considering QUORUM availability with vnodes there's an argument against having a very low num_tokens. I think it's better to use NTS and racks to reduce the chance of a QUORUM outage than to bank on a lower number of tokens: with just a low number of tokens, unless you go all the way to 1, you are relying on luck that 2 nodes don't overlap. What I'm saying is that we should choose a num_tokens that gives the best distribution for most cluster sizes rather than one that "decreases" the probability of an outage.

Also I think we should continue using CASSANDRA-13701 to track this. TBH I think in general we should be a bit better at searching for and using existing tickets...

On Sat, 22 Sep 2018 at 18:13, Stefan Podkowinski <s...@apache.org> wrote:

> There already have been some discussions on this here:
> https://issues.apache.org/jira/browse/CASSANDRA-13701
>
> The mentioned blocker there on the token allocation shouldn't exist
> anymore. Although it would be good to get more feedback on it, in case
> we want to enable it by default, along with new defaults for number of
> tokens.
>
>
> On 22.09.18 06:30, Dinesh Joshi wrote:
> > Jon, thanks for starting this thread!
> >
> > I have created CASSANDRA-14784 to track this.
> >
> > Dinesh
> >
> >> On Sep 21, 2018, at 9:18 PM, Sankalp Kohli <kohlisank...@gmail.com> wrote:
> >>
> >> Putting it on JIRA is to make sure someone is assigned to it and it is
> >> tracked. Changes should be discussed over ML like you are saying.
> >>
> >> On Sep 21, 2018, at 21:02, Jonathan Haddad <j...@jonhaddad.com> wrote:
> >>
> >>>> We should create a JIRA to find what other defaults we need revisit.
> >>> Changing a default is a pretty big deal, I think we should discuss any
> >>> changes to defaults here on the ML before moving it into JIRA.
> >>> It's nice to get a bit more discussion around the change than what
> >>> happens in JIRA.
> >>>
> >>> We (TLP) did some testing on 4 tokens and found it to work surprisingly
> >>> well. It wasn't particularly formal, but we verified the load stays
> >>> pretty even with only 4 tokens as we added nodes to the cluster. Higher
> >>> token count hurts availability by increasing the number of nodes any given
> >>> node is a neighbor with, meaning any 2 nodes that fail have an increased
> >>> chance of downtime when using QUORUM. In addition, with the recent
> >>> streaming optimization it seems the token counts will give a greater chance
> >>> of a node streaming entire sstables (with LCS), meaning we'll do a better
> >>> job with node density out of the box.
> >>>
> >>> Next week I can try to put together something a little more convincing.
> >>> Weekend time.
> >>>
> >>> Jon
> >>>
> >>>
> >>> On Fri, Sep 21, 2018 at 8:45 PM sankalp kohli <kohlisank...@gmail.com>
> >>> wrote:
> >>>
> >>>> +1 to lowering it.
> >>>> Thanks Jon for starting this. We should create a JIRA to find what other
> >>>> defaults we need revisit. (Please keep this discussion for "default token"
> >>>> only.)
> >>>>
> >>>>> On Fri, Sep 21, 2018 at 8:26 PM Jeff Jirsa <jji...@gmail.com> wrote:
> >>>>>
> >>>>> Also agree it should be lowered, but definitely not to 1, and probably
> >>>>> something closer to 32 than 4.
> >>>>>
> >>>>> --
> >>>>> Jeff Jirsa
> >>>>>
> >>>>>
> >>>>>> On Sep 21, 2018, at 8:24 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:
> >>>>>>
> >>>>>> I agree that it should be lowered. What I’ve seen debated a bit in the
> >>>>>> past is the number but I don’t think anyone thinks that it should remain
> >>>>>> 256.
> >>>>>>
> >>>>>>> On Sep 21, 2018, at 7:05 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
> >>>>>>>
> >>>>>>> One thing that's really, really bothered me for a while is how we default
> >>>>>>> to 256 tokens still.
> >>>>>>> There's no experienced operator that leaves it as is
> >>>>>>> at this point, meaning the only people using 256 are the poor folks that
> >>>>>>> just got started using C*. I've worked with over a hundred clusters in the
> >>>>>>> last couple years, and I think I only worked with one that had lowered it
> >>>>>>> to something else.
> >>>>>>>
> >>>>>>> I think it's time we changed the default to 4 (or 8, up for debate).
> >>>>>>>
> >>>>>>> To improve the behavior, we need to change a couple other things. The
> >>>>>>> allocate_tokens_for_keyspace setting is... odd. It requires you have a
> >>>>>>> keyspace already created, which doesn't help on new clusters. What I'd
> >>>>>>> like to do is add a new setting, allocate_tokens_for_rf, and set it to 3 by
> >>>>>>> default.
> >>>>>>>
> >>>>>>> To handle clusters that are already using 256 tokens, we could prevent the
> >>>>>>> new node from joining unless a -D flag is set to explicitly allow
> >>>>>>> imbalanced tokens.
> >>>>>>>
> >>>>>>> We've agreed to a trunk freeze, but I feel like this is important enough
> >>>>>>> (and pretty trivial) to do now. I'd also personally characterize this as a
> >>>>>>> bug fix since 256 is horribly broken when the cluster gets to any
> >>>>>>> reasonable size, but maybe I'm alone there.
> >>>>>>>
> >>>>>>> I honestly can't think of a use case where random tokens is a good choice
> >>>>>>> anymore, so I'd be fine / ecstatic with removing it completely and
> >>>>>>> requiring either allocate_tokens_for_keyspace (for existing clusters)
> >>>>>>> or allocate_tokens_for_rf to be set.
> >>>>>>>
> >>>>>>> Thoughts? Objections?
> >>>>>>> --
> >>>>>>> Jon Haddad
> >>>>>>> http://www.rustyrazorblade.com
> >>>>>>> twitter: rustyrazorblade
> >>>>>>
> >>>>>> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> >>>>>> For additional commands, e-mail: dev-h...@cassandra.apache.org