Re: RFC: Cassandra Virtual Nodes
On 20 March 2012 04:35, Vijay wrote: > On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans wrote: > >> I'm guessing you're referring to Rick's proposal about ranges per node? >> > > May be, what i mean is little more simple than that... We can consider > every node having a multiple conservative ranges and moving those ranges > for bootstrap etc, instead of finding the mid point etc in the bootstrap > code. Once we get that working all the way to the FS/Streaming then we can > move those ranges and assign those ranges to nodes in random orders. Hope > it makes sense. I agree that this should be approached in incremental steps. Rick already raised concerns about stability issues which might arise from changing large parts of code at once. I would anticipate the first step to be, exactly as you suggest, to support multiple tokens per host instead of just one. Presumably in your suggestion you imagine these tokens to define contiguous ranges for a given host, so that the distribution model is the same as before, but bootstrap can be done incrementally. This would be a great first step. The extension to a virtual node scheme as described previously is then fairly trivial. The only additional change needed is to assign the tokens in some other way which does not restrict the ranges to being contiguous. -- Sam Overton Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
On Tue, Mar 20, 2012 at 6:40 AM, Sam Overton wrote: > On 20 March 2012 04:35, Vijay wrote: >> On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans wrote: >> >>> I'm guessing you're referring to Rick's proposal about ranges per node? >>> >> >> May be, what i mean is little more simple than that... We can consider >> every node having a multiple conservative ranges and moving those ranges >> for bootstrap etc, instead of finding the mid point etc in the bootstrap >> code. Once we get that working all the way to the FS/Streaming then we can >> move those ranges and assign those ranges to nodes in random orders. Hope >> it makes sense. > > I agree that this should be approached in incremental steps. Rick > already raised concerns about stability issues which might arise from > changing large parts of code at once. > > I would anticipate the first step to be, exactly as you suggest, to > support multiple tokens per host instead of just one. Presumably in > your suggestion you imagine these tokens to define contiguous ranges > for a given host, so that the distribution model is the same as > before, but bootstrap can be done incrementally. > > This would be a great first step. The extension to a virtual node > scheme as described previously is then fairly trivial. The only > additional change needed is to assign the tokens in some other way > which does not restrict the ranges to being contiguous. Sounds good to me. What can an upgrading user expect in the way of disruption? What would be required to move an existing cluster from one token per node to virtual nodes? Could this be made transparent? -- Eric Evans Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
I like this idea. It feels like a good 80/20 solution -- 80% of the benefits, 20% of the effort. More like 5% of the effort. I can't even enumerate all the places full vnode support would change, but an "active token range" concept would be relatively limited in scope.

Full vnodes feels a lot more like the counters quagmire, where Digg/Twitter worked on it for... 8? months, and then DataStax worked on it for about 6 months post-commit, and we're still finding the occasional bug-since-0.7 there. With the benefit of hindsight, as bad as maintaining that patchset was out of tree, committing it as early as we did was a mistake. We won't do that again. (On the bright side, git makes maintaining such a patchset easier now.)

On Mon, Mar 19, 2012 at 5:16 PM, Rick Branson wrote:
> I think if we could go back and rebuild Cassandra from scratch, vnodes would likely be implemented from the beginning. However, I'm concerned that implementing them now could be a big distraction from more productive uses of all of our time and introduce major potential stability issues into what is becoming a business critical piece of infrastructure for many people. However, instead of just complaining and pedantry, I'd like to offer a feasible alternative:
>
> Has there been consideration given to the idea of supporting a single token range for a node?
>
> While not theoretically as capable as vnodes, it seems to me to be more practical as it would have a significantly lower impact on the codebase and provides a much clearer migration path. It also seems to solve a majority of complaints regarding operational issues with Cassandra clusters.
>
> Each node would have a lower and an upper token, which would form a range that would be actively distributed via gossip. Read and replication requests would only be routed to a replica when the key of these operations matched the replica's token range in the gossip tables. Each node would locally store its own current active token range as well as a target token range it's "moving" towards.
>
> As a new node undergoes bootstrap, the bounds would be gradually expanded to allow it to handle requests for a wider range of the keyspace as it moves towards its target token range. This idea boils down to a move from hard cutovers to smoother operations by gradually adjusting active token ranges over a period of time. It would apply to token change operations (nodetool 'move' and 'removetoken') as well.
>
> Failure during streaming could be recovered at the bounds instead of restarting the whole process as the active bounds would effectively track the progress for bootstrap & target token changes. Implicitly these operations would be throttled to some degree. Node repair (AES) could also be modified using the same overall ideas to provide a more gradual impact on the cluster overall, similar to the ideas given in CASSANDRA-3721.
>
> While this doesn't spread the load over the cluster for these operations evenly like vnodes does, this is likely an issue that could be worked around by performing concurrent (throttled) bootstrap & node repair (AES) operations. It does allow some kind of "active" load balancing, but clearly this is not as flexible or as useful as vnodes, but you should be using RandomPartitioner or sort-of-randomized keys with OPP right? ;)
>
> As a side note: vnodes fail to provide solutions to node-based limitations that seem to me to cause a substantial portion of operational issues such as impact of node restarts / upgrades, GC and compaction induced latency. I think some progress could be made here by allowing a "pack" of independent Cassandra nodes to be run on a single host; somewhat (but nowhere near entirely) similar to a pre-fork model used by some UNIX-based servers.
>
> Input?
>
> --
> Rick Branson
> DataStax

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
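Rick's proposal above is, at its core, a routing rule: a replica only receives reads and writes whose token falls inside the range it currently advertises via gossip, and that range is widened step by step towards a target as streaming completes. A rough sketch of that rule follows (hypothetical class and method names, not existing Cassandra code, and ignoring some ring wrap-around details):

    import java.math.BigInteger;

    // Sketch only: the "active token range" check described in Rick's proposal.
    class ActiveRange {
        final BigInteger lower; // exclusive, following the usual ring convention
        final BigInteger upper; // inclusive

        ActiveRange(BigInteger lower, BigInteger upper) {
            this.lower = lower;
            this.upper = upper;
        }

        // True if a key's token falls inside this (possibly wrapping) range.
        boolean contains(BigInteger token) {
            if (lower.compareTo(upper) < 0)
                return token.compareTo(lower) > 0 && token.compareTo(upper) <= 0;
            return token.compareTo(lower) > 0 || token.compareTo(upper) <= 0; // wraps
        }

        // Widen the active lower bound one step towards the target range, e.g.
        // after another chunk of the keyspace has finished streaming. Assumes a
        // non-wrapping range for brevity.
        ActiveRange widenTowards(ActiveRange target, BigInteger step) {
            return new ActiveRange(lower.subtract(step).max(target.lower), upper);
        }
    }

A coordinator would consult contains() against each replica's gossiped active range before routing a request to it, which is what turns bootstrap and moves from hard cutovers into the gradual adjustment Rick describes.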
Re: RFC: Cassandra Virtual Nodes
On Tue, Mar 20, 2012 at 8:39 AM, Jonathan Ellis wrote: > I like this idea. It feels like a good 80/20 solution -- 80% of the > benefits, 20% of the effort. More like 5% of the effort. I can't > even enumerate all the places full vnode support would change, but an > "active token range" concept would be relatively limited in scope. It only addresses 1 of Sam's original 5 points, so I wouldn't call it an "80% solution". > Full vnodes feels a lot more like the counters quagmire, where > Digg/Twitter worked on it for... 8? months, and then DataStax worked > on it about for about 6 months post-commit, and we're still finding > the occasional bug-since-0.7 there. With the benefit of hindsight, as > bad as maintaining that patchset was out of tree, committing it as > early as we did was a mistake. We won't do that again. (On the > bright side, git makes maintaining such a patchset easier now.) And yet counters have become a very important feature for Cassandra; We're better off with them, than without. I think there were a number of problems with how counters went down that could be avoided here. For one, we can take a phased, incremental approach, rather than waiting 8 months to drop a large patchset. > On Mon, Mar 19, 2012 at 5:16 PM, Rick Branson wrote: >> I think if we could go back and rebuild Cassandra from scratch, vnodes >> would likely be implemented from the beginning. However, I'm concerned that >> implementing them now could be a big distraction from more productive uses >> of all of our time and introduce major potential stability issues into what >> is becoming a business critical piece of infrastructure for many people. >> However, instead of just complaining and pedantry, I'd like to offer a >> feasible alternative: >> >> Has there been consideration given to the idea of a supporting a single >> token range for a node? >> >> While not theoretically as capable as vnodes, it seems to me to be more >> practical as it would have a significantly lower impact on the codebase and >> provides a much clearer migration path. It also seems to solve a majority >> of complaints regarding operational issues with Cassandra clusters. >> >> Each node would have a lower and an upper token, which would form a range >> that would be actively distributed via gossip. Read and replication >> requests would only be routed to a replica when the key of these operations >> matched the replica's token range in the gossip tables. Each node would >> locally store it's own current active token range as well as a target token >> range it's "moving" towards. >> >> As a new node undergoes bootstrap, the bounds would be gradually expanded >> to allow it to handle requests for a wider range of the keyspace as it >> moves towards it's target token range. This idea boils down to a move from >> hard cutovers to smoother operations by gradually adjusting active token >> ranges over a period of time. It would apply to token change operations >> (nodetool 'move' and 'removetoken') as well. >> >> Failure during streaming could be recovered at the bounds instead of >> restarting the whole process as the active bounds would effectively track >> the progress for bootstrap & target token changes. Implicitly these >> operations would be throttled to some degree. Node repair (AES) could also >> be modified using the same overall ideas provide a more gradual impact on >> the cluster overall similar as the ideas given in CASSANDRA-3721. 
>> >> While this doesn't spread the load over the cluster for these operations >> evenly like vnodes does, this is likely an issue that could be worked >> around by performing concurrent (throttled) bootstrap & node repair (AES) >> operations. It does allow some kind of "active" load balancing, but clearly >> this is not as flexible or as useful as vnodes, but you should be using >> RandomPartitioner or sort-of-randomized keys with OPP right? ;) >> >> As a side note: vnodes fail to provide solutions to node-based limitations >> that seem to me to cause a substantial portion of operational issues such >> as impact of node restarts / upgrades, GC and compaction induced latency. I >> think some progress could be made here by allowing a "pack" of independent >> Cassandra nodes to be ran on a single host; somewhat (but nowhere near >> entirely) similar to a pre-fork model used by some UNIX-based servers. >> >> Input? >> >> -- >> Rick Branson >> DataStax > > > > -- > Jonathan Ellis > Project Chair, Apache Cassandra > co-founder of DataStax, the source for professional Cassandra support > http://www.datastax.com -- Eric Evans Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
> > I like this idea. It feels like a good 80/20 solution -- 80% of the benefits, 20% of the effort. More like 5% of the effort. I can't even enumerate all the places full vnode support would change, but an "active token range" concept would be relatively limited in scope.
>
> It only addresses 1 of Sam's original 5 points, so I wouldn't call it an "80% solution".

To support a form of DF, I think some tweaking of the replica placement could achieve this effect quite well. We could introduce a variable into replica placement, which I'm going to incorrectly call DF for the purposes of illustration. The key range for a node would be sub-divided by DF (1 by default) and this would be used to further distribute replica selection based on this "sub-partition".

Currently, the offset formula works out to be something like this:

    offset = replica

For RandomPartitioner, DF placement might look something like:

    offset = replica + (token % DF)

Now, I realize replica selection is actually much more complicated than this, but these formulas are for illustration purposes.

Modifying replica placement & the partitioners to support this seems straightforward, but I'm unsure of what's required to get it working for ring management operations. On the surface, it does seem like this could be added without any kind of difficult migration support.

Thoughts?
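To make the arithmetic above concrete, here is one literal reading of Rick's two formulas; this is purely illustrative (hypothetical names, and, as Rick says, real replica selection is considerably more involved):

    import java.math.BigInteger;

    // Illustration of the offset formulas in Rick's sketch, nothing more.
    class ReplicaOffsetSketch {
        // Current behaviour: the i-th replica sits at the i-th offset.
        static int offsetWithoutDf(int replica) {
            return replica;
        }

        // Sketched DF variant for RandomPartitioner: the key's token selects one
        // of DF "sub-partitions", which shifts where the replica walk starts.
        static int offsetWithDf(int replica, BigInteger token, int df) {
            int subPartition = token.mod(BigInteger.valueOf(df)).intValue();
            return replica + subPartition;
        }
    }

With the default DF of 1 the second formula collapses back to the first, which is presumably what would let existing clusters keep their current placement without a migration step.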
Re: RFC: Cassandra Virtual Nodes
On 20 March 2012 13:37, Eric Evans wrote:
> On Tue, Mar 20, 2012 at 6:40 AM, Sam Overton wrote:
>> On 20 March 2012 04:35, Vijay wrote:
>>> May be, what i mean is little more simple than that... We can consider every node having a multiple conservative ranges and moving those ranges for bootstrap etc, instead of finding the mid point etc in the bootstrap code. Once we get that working all the way to the FS/Streaming then we can move those ranges and assign those ranges to nodes in random orders. Hope it makes sense.
>>
>> I agree that this should be approached in incremental steps. Rick already raised concerns about stability issues which might arise from changing large parts of code at once.
>>
>> I would anticipate the first step to be, exactly as you suggest, to support multiple tokens per host instead of just one. Presumably in your suggestion you imagine these tokens to define contiguous ranges for a given host, so that the distribution model is the same as before, but bootstrap can be done incrementally.
>>
>> This would be a great first step. The extension to a virtual node scheme as described previously is then fairly trivial. The only additional change needed is to assign the tokens in some other way which does not restrict the ranges to being contiguous.
>
> Sounds good to me.
>
> What can an upgrading user expect in the way of disruption? What would be required to move an existing cluster from one token per node to virtual nodes? Could this be made transparent?

The disruption for an end-user would be no more than the same rolling upgrade process that they have to go through currently to upgrade to a new version. This is how I envisage it working:

* When a node is upgraded and the new version starts up in an old cluster, it would split its own token range into multiple sub-ranges by assigning itself more tokens in its own range
* These tokens could then be gossiped to any other new versions in the cluster. The old versions don't need to know about these intermediate tokens because distribution is exactly the same - node ranges are still contiguous
* Once every node has been upgraded, distribution is still the same as before, but now ranges are split into sub-ranges
* The benefits of vnodes start to become apparent when adding new nodes to the cluster - a new node bootstrapping would take an even amount of data from each other node and would not require doubling the cluster to maintain balance
* As more nodes are added to the cluster it gets closer to full vnode distribution as more of the original hosts' ranges get reassigned to new nodes

If the user wants to migrate to full vnode functionality straight away then they can do a rolling migration (decommission/bootstrap). During this migration there would be some imbalance in the cluster, but once all of the old nodes have been migrated, the cluster would be balanced.

--
Sam Overton
Acunu | http://www.acunu.com | @acunu
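The first step Sam describes, a node splitting its own range into sub-ranges on upgrade, boils down to choosing evenly spaced intermediate tokens between its predecessor's token and its own. A minimal sketch, assuming a numeric token space like RandomPartitioner's (hypothetical helper, not the actual upgrade code, and ignoring ring wrap-around):

    import java.math.BigInteger;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch: split one node's contiguous range into numTokens contiguous sub-ranges.
    class TokenRangeSplitter {
        // Returns numTokens tokens evenly spaced across (previousToken, myToken],
        // the last of which is the node's original token.
        static List<BigInteger> split(BigInteger previousToken, BigInteger myToken, int numTokens) {
            BigInteger width = myToken.subtract(previousToken);
            List<BigInteger> tokens = new ArrayList<>();
            for (int i = 1; i <= numTokens; i++) {
                BigInteger offset = width.multiply(BigInteger.valueOf(i))
                                         .divide(BigInteger.valueOf(numTokens));
                tokens.add(previousToken.add(offset));
            }
            return tokens;
        }
    }

Because the sub-ranges are still contiguous and together cover exactly the node's old range, nodes running the old version see an unchanged distribution, which is what makes the rolling upgrade non-disruptive.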
Re: RFC: Cassandra Virtual Nodes
On Tue, Mar 20, 2012 at 9:08 AM, Eric Evans wrote:
> On Tue, Mar 20, 2012 at 8:39 AM, Jonathan Ellis wrote:
>> I like this idea. It feels like a good 80/20 solution -- 80% of the benefits, 20% of the effort. More like 5% of the effort. I can't even enumerate all the places full vnode support would change, but an "active token range" concept would be relatively limited in scope.
>
> It only addresses 1 of Sam's original 5 points, so I wouldn't call it an "80% solution".

I guess a more accurate way to put this is, "only 20% of Sam's list is an actual pain point that doesn't get addressed by The Rick Proposal [TRP]."

Here's how I see Sam's list:

* Even load balancing when growing and shrinking the cluster

Nice to have, but post-bootstrap load balancing works well in practice (and is improved by TRP).

* Greater failure tolerance in streaming

Directly addressed by TRP.

* Evenly distributed impact of streaming operations

Not a problem in practice with stream throttling.

* Possibility for active load balancing

Not really a feature of vnodes per se, but as with the other load balancing point, this is also improved by TRP.

* Distributed rebuild

This is the 20% that TRP does not address. Nice to have? Yes. Can I live without it? I have so far. Is this alone worth the complexity of vnodes? No, it is not. Especially since there are probably other approaches that we can take to mitigate this, one of which Rick has suggested in a separate sub-thread.

>> Full vnodes feels a lot more like the counters quagmire, where Digg/Twitter worked on it for... 8? months, and then DataStax worked on it for about 6 months post-commit, and we're still finding the occasional bug-since-0.7 there. With the benefit of hindsight, as bad as maintaining that patchset was out of tree, committing it as early as we did was a mistake. We won't do that again. (On the bright side, git makes maintaining such a patchset easier now.)
>
> And yet counters have become a very important feature for Cassandra; We're better off with them, than without.

False dichotomy (we could have waited for a better counter design), but that's mostly irrelevant to my point that jamming incomplete code in-tree to sort out later is a bad idea.

> I think there were a number of problems with how counters went down that could be avoided here. For one, we can take a phased, incremental approach, rather than waiting 8 months to drop a large patchset.

If there are incremental improvements to be made that justify themselves independently, then I agree. Small, self-contained steps are a good thing. A good example is https://issues.apache.org/jira/browse/CASSANDRA-2319, a product of The Grand Storage Engine Redesign of 674 fame.

But, when things don't naturally break down into such mini-features, then I'm -1 on committing code that has no purpose other than to be a foundation for later commits. I've seen people get bored or assigned to other projects too often to just trust that those later commits will indeed be forthcoming. Or even if Sam [for instance] is still working hard on it, it's very easy for unforeseen difficulties to come up that invalidate the original approach. Since we were talking about counters, the original vector clock approach -- that we ended up ripping out, painfully -- is a good example.

Once bitten, twice shy.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
RE: RFC: Cassandra Virtual Nodes
So taking a step back, if we want "vnodes" why can't we just give every node 100 tokens instead of only one? Seems to me this would have less impact on the rest of the code. It would just look like you had a 500 node cluster, instead of a 5 node cluster. Your replication strategy would have to know about the physical machines so that data gets replicated right, but there is already some concept of this with the data center aware and rack aware stuff.

From what I see I think you could get most of the benefits of vnodes by implementing a new Placement Strategy that did something like this, and you wouldn't have to touch (and maybe break) code in other places. Am I crazy? Naive?

Once you had this setup, you can start to implement the vnode-like stuff on top of it. Like bootstrapping nodes in, one token at a time, and taking them on from the whole cluster, not just your neighbor. etc. etc.

-Jeremiah Jordan

From: Rick Branson [rbran...@datastax.com]
Sent: Monday, March 19, 2012 5:16 PM
To: dev@cassandra.apache.org
Subject: Re: RFC: Cassandra Virtual Nodes

I think if we could go back and rebuild Cassandra from scratch, vnodes would likely be implemented from the beginning. However, I'm concerned that implementing them now could be a big distraction from more productive uses of all of our time and introduce major potential stability issues into what is becoming a business critical piece of infrastructure for many people. However, instead of just complaining and pedantry, I'd like to offer a feasible alternative:

Has there been consideration given to the idea of supporting a single token range for a node?

While not theoretically as capable as vnodes, it seems to me to be more practical as it would have a significantly lower impact on the codebase and provides a much clearer migration path. It also seems to solve a majority of complaints regarding operational issues with Cassandra clusters.

Each node would have a lower and an upper token, which would form a range that would be actively distributed via gossip. Read and replication requests would only be routed to a replica when the key of these operations matched the replica's token range in the gossip tables. Each node would locally store its own current active token range as well as a target token range it's "moving" towards.

As a new node undergoes bootstrap, the bounds would be gradually expanded to allow it to handle requests for a wider range of the keyspace as it moves towards its target token range. This idea boils down to a move from hard cutovers to smoother operations by gradually adjusting active token ranges over a period of time. It would apply to token change operations (nodetool 'move' and 'removetoken') as well.

Failure during streaming could be recovered at the bounds instead of restarting the whole process as the active bounds would effectively track the progress for bootstrap & target token changes. Implicitly these operations would be throttled to some degree. Node repair (AES) could also be modified using the same overall ideas to provide a more gradual impact on the cluster overall, similar to the ideas given in CASSANDRA-3721.

While this doesn't spread the load over the cluster for these operations evenly like vnodes does, this is likely an issue that could be worked around by performing concurrent (throttled) bootstrap & node repair (AES) operations. It does allow some kind of "active" load balancing, but clearly this is not as flexible or as useful as vnodes, but you should be using RandomPartitioner or sort-of-randomized keys with OPP right? ;)

As a side note: vnodes fail to provide solutions to node-based limitations that seem to me to cause a substantial portion of operational issues such as impact of node restarts / upgrades, GC and compaction induced latency. I think some progress could be made here by allowing a "pack" of independent Cassandra nodes to be run on a single host; somewhat (but nowhere near entirely) similar to a pre-fork model used by some UNIX-based servers.

Input?

--
Rick Branson
DataStax
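Jeremiah's caveat that "your replication strategy would have to know about the physical machines" is the crux of the many-tokens approach: walking the ring naively could land two replicas on the same box. One hedged sketch of the skip-duplicates idea (hypothetical types and names, not an actual Cassandra placement strategy):

    import java.util.ArrayList;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch: walk the token ring clockwise from the key's position, but only
    // count a token towards the replica set if its physical host is new.
    class PhysicalHostAwarePlacementSketch {
        // ringWalk is the sequence of owning hosts in token order, starting at
        // the first token at or after the key's token.
        static List<String> pickReplicas(List<String> ringWalk, int replicationFactor) {
            Set<String> replicas = new LinkedHashSet<>();
            for (String host : ringWalk) {
                replicas.add(host); // a host already chosen is simply skipped
                if (replicas.size() == replicationFactor)
                    break;
            }
            return new ArrayList<>(replicas);
        }
    }

This is the same kind of skipping that the existing rack- and datacenter-aware placement already does, which is presumably why Jeremiah expects most of the work to fit inside a new placement strategy rather than the rest of the codebase.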
Re: RFC: Cassandra Virtual Nodes
On 19 March 2012 23:41, Peter Schuller wrote: >>> Using this ring bucket in the CRUSH topology, (with the hash function >>> being the identity function) would give the exact same distribution >>> properties as the virtual node strategy that I suggested previously, >>> but of course with much better topology awareness. >> >> I will have to re-read your orignal post. I seem to have missed something :) > > I did, and I may or may not understand what you mean. > > Are you comparing vnodes + hashing, with CRUSH + pre-partitioning by > hash + identity hash as you traverse down the topology tree? Yes. I was just trying to illustrate that it's not necessary to have CRUSH doing the partitioning and placement of primary replicas. The same functionality can be achieved by having logically separate placement (a ring with virtual nodes) and a replication strategy which implements the CRUSH algorithm for replica placement. I think you agreed with this further down your previous reply anyway, perhaps I was just being too verbose :) The reason I'm trying to make that distinction is because it will be less work than wholesale replacing the entire distribution logic in Cassandra with CRUSH. I'm not sure if that's exactly what your design is suggesting? -- Sam Overton Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
On 20 March 2012 14:50, Rick Branson wrote: > To support a form of DF, I think some tweaking of the replica placement could > achieve this effect quite well. We could introduce a variable into replica > placement, which I'm going to incorrectly call DF for the purposes of > illustration. The key range for a node would be sub-divided by DF (1 by > default) and this would be used to further distribution replica selection > based on this "sub-partition". > > Currently, the offset formula works out to be something like this: > > offset = replica > > For RandomPartitioner, DF placement might look something like: > > offset = replica + (token % DF) > > Now, I realize replica selection is actually much more complicated than this, > but these formulas are for illustration purposes. > > Modifying replica placement & the partitioners to support this seems > straightforward, but I'm unsure of what's required to get it working for ring > management operations. On the surface, it does seem like this could be added > without any kind of difficult migration support. > > Thoughts? This solution increases the DF, which has the advantage of providing some balancing when a node is down temporarily. The reads and writes it would have served are now distributed around ~DF nodes. However, it doesn't have any distributed rebuild. In fact, any distribution mechanism with one token per node cannot have distributed rebuild. Should a node fail, the next node in the ring has twice the token range so must have twice the data. This node will limit the rebuild time - 'nodetool removetoken' will have to replicate the data of the failed node onto this node. Increasing the distribution factor without speeding up rebuild increases the failure probability - both for data loss or being unable to reach required consistency levels. The failure probability is a trade-off between rebuild time and distribution factor. Lower rebuild time helps, and lower distribution factor helps. Cassandra as it is now has the longest rebuild time and lowest possible distribution factor. The original vnodes scheme is the other extreme - shortest rebuild time and largest possible distribution factor. It turns out that the rebuild time is more important, so this decreases failure probability (with some assumptions you can show it decreases by a factor RF! - I'll spare you the math but can send it if you're interested). This scheme has the longest rebuild time and a (tuneable) distribution factor, but larger than the lowest. That necessarily increases the failure probability over both Cassandra now and vnode schemes, so I'd be very careful about choosing it. Richard.
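To see the two competing factors in Richard's argument, here is a deliberately crude toy model (my own simplification, not the RF! derivation he mentions): after one node fails, its data is vulnerable until rebuild completes, and it is lost if enough of the roughly DF * (RF - 1) nodes holding the remaining replicas also fail within that window.

    // Toy model only: independent node failures at rate lambda, loss if some
    // (RF - 1) of the ~DF * (RF - 1) nodes holding the remaining replicas fail
    // before the rebuild finishes. A crude union bound for illustration.
    class RebuildTradeoffSketch {
        static double roughLossProbabilityPerFailure(double lambda, double rebuildTime, int rf, int df) {
            int vulnerableNodes = df * (rf - 1);              // grows with the distribution factor
            double pNodeFailsInWindow = lambda * rebuildTime; // shrinks with faster rebuild
            return choose(vulnerableNodes, rf - 1) * Math.pow(pNodeFailsInWindow, rf - 1);
        }

        // Binomial coefficient C(n, k), computed in floating point.
        static double choose(int n, int k) {
            double c = 1;
            for (int i = 1; i <= k; i++)
                c *= (double) (n - k + i) / i;
            return c;
        }
    }

Raising DF inflates the first factor, while full vnodes shrink rebuildTime because every node participates in the rebuild; a DF-only tweak raises the first factor without shrinking the second, which is the direction of Richard's objection. The model is only meant to show the trade-off, not to reproduce his result.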
Re: RFC: Cassandra Virtual Nodes
On 20 March 2012 14:55, Jonathan Ellis wrote: > Here's how I see Sam's list: > > * Even load balancing when growing and shrinking the cluster > > Nice to have, but post-bootstrap load balancing works well in practice > (and is improved by TRP). Post-bootstrap load balancing without vnodes necessarily streams more data than is necessary. Vnodes streams the minimal amount. In fact, post-bootstrap load balancing currently streams a constant fraction of your data - the network traffic involved in a rebalance increases linearly with the size of your cluster. With vnodes it decreases linearly. Including removing the ops overhead of running the load balance and calculating new tokens, this makes removing post-bootstrap load balancing a pretty big deal. > * Greater failure tolerance in streaming > > Directly addressed by TRP. Agreed. > * Evenly distributed impact of streaming operations > > Not a problem in practice with stream throttling. Throttling slows them down, increasing rebuild times so increasing downtime. > * Possibility for active load balancing > > Not really a feature of vnodes per se, but as with the other load > balancing point, this is also improved by TRP. Again with the caveat that more data is streamed with TRP. Vnodes removes the need for any load balancing with RP. > * Distributed rebuild > > This is the 20% that TRP does not address. Nice to have? Yes. Can I > live without it? I have so far. Is this alone worth the complexity > of vnodes? No, it is not. Especially since there are probably other > approaches that we can take to mitigate this, one of which Rick has > suggested in a separate sub-thread. Distributed rebuild means you can store more data per node with the same failure probabilities. This is frequently a limiting factor on how much data you can store per node, increasing cluster sizes unnecessarily. I'd argue that this alone is worth the complexity of vnodes. Richard.
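Richard's first point above can be put into rough numbers with a simple model (my own assumption: evenly spaced tokens, and every boundary shifts to restore perfect balance after a single node is added; hypothetical helper, not real code):

    // Back-of-the-envelope comparison of data streamed to restore balance after
    // growing an N-node cluster holding totalData bytes by one node.
    class RebalanceTrafficSketch {
        // One token per node: the k-th boundary moves from position k/N to
        // k/(N+1) of the ring, and the data between the old and new positions is
        // re-streamed. The sum works out to totalData / 2 regardless of N.
        static double oneTokenPerNode(int n, double totalData) {
            double moved = 0;
            for (int k = 1; k <= n; k++)
                moved += Math.abs((double) k / n - (double) k / (n + 1)) * totalData;
            return moved;
        }

        // Many tokens per node: the new node takes an even slice from every
        // existing node and nothing else has to move.
        static double vnodeStyle(int n, double totalData) {
            return totalData / (n + 1);
        }
    }

Under this model a full rebalance always re-streams about half of the data set, so its cost stays a constant fraction of the data (which itself grows with the cluster), while the vnode-style bootstrap moves only the new node's fair share, a fraction that shrinks as nodes are added.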
Re: RFC: Cassandra Virtual Nodes
> Each node would have a lower and an upper token, which would form a range that would be actively distributed via gossip. Read and replication requests would only be routed to a replica when the key of these operations matched the replica's token range in the gossip tables. Each node would locally store its own current active token range as well as a target token range it's "moving" towards.

How is this significantly different from just using "nodetool move" (in post-1.x) more rapidly and on smaller segments at a time?

There is the ring delay stuff which makes it unworkable to do at high granularity, but that should apply to the active range solution too.

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: RFC: Cassandra Virtual Nodes
It's reasonable that we can attach different levels of importance to these things. Taking a step back, I have two main points: 1) vnodes add enormous complexity to *many* parts of Cassandra. I'm skeptical of the cost:benefit ratio here. 1a) The benefit is lower in my mind because many of the problems solved by vnodes can be solved "well enough" for "most people," for some value of those two phrases, without vnodes. 2) I'm not okay with a "commit something half-baked and sort it out later" approach. On Tue, Mar 20, 2012 at 11:10 AM, Richard Low wrote: > On 20 March 2012 14:55, Jonathan Ellis wrote: >> Here's how I see Sam's list: >> >> * Even load balancing when growing and shrinking the cluster >> >> Nice to have, but post-bootstrap load balancing works well in practice >> (and is improved by TRP). > > Post-bootstrap load balancing without vnodes necessarily streams more > data than is necessary. Vnodes streams the minimal amount. > > In fact, post-bootstrap load balancing currently streams a constant > fraction of your data - the network traffic involved in a rebalance > increases linearly with the size of your cluster. With vnodes it > decreases linearly. > > Including removing the ops overhead of running the load balance and > calculating new tokens, this makes removing post-bootstrap load > balancing a pretty big deal. > >> * Greater failure tolerance in streaming >> >> Directly addressed by TRP. > > Agreed. > >> * Evenly distributed impact of streaming operations >> >> Not a problem in practice with stream throttling. > > Throttling slows them down, increasing rebuild times so increasing downtime. > >> * Possibility for active load balancing >> >> Not really a feature of vnodes per se, but as with the other load >> balancing point, this is also improved by TRP. > > Again with the caveat that more data is streamed with TRP. Vnodes > removes the need for any load balancing with RP. > >> * Distributed rebuild >> >> This is the 20% that TRP does not address. Nice to have? Yes. Can I >> live without it? I have so far. Is this alone worth the complexity >> of vnodes? No, it is not. Especially since there are probably other >> approaches that we can take to mitigate this, one of which Rick has >> suggested in a separate sub-thread. > > Distributed rebuild means you can store more data per node with the > same failure probabilities. This is frequently a limiting factor on > how much data you can store per node, increasing cluster sizes > unnecessarily. I'd argue that this alone is worth the complexity of > vnodes. > > Richard. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com