Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Sam Overton
On 20 March 2012 04:35, Vijay  wrote:
> On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans  wrote:
>
>> I'm guessing you're referring to Rick's proposal about ranges per node?
>>
>
> Maybe; what I mean is a little simpler than that... We can consider
> every node having multiple conservative ranges and moving those ranges
> for bootstrap etc., instead of finding the midpoint etc. in the bootstrap
> code. Once we get that working all the way to the FS/Streaming layer, we
> can move those ranges and assign those ranges to nodes in random order.
> Hope it makes sense.

I agree that this should be approached in incremental steps. Rick
already raised concerns about stability issues which might arise from
changing large parts of code at once.

I would anticipate the first step to be, exactly as you suggest, to
support multiple tokens per host instead of just one. Presumably in
your suggestion you imagine these tokens to define contiguous ranges
for a given host, so that the distribution model is the same as
before, but bootstrap can be done incrementally.

This would be a great first step. The extension to a virtual node
scheme as described previously is then fairly trivial. The only
additional change needed is to assign the tokens in some other way
which does not restrict the ranges to being contiguous.
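
To make the data-structure change concrete, here is a minimal sketch of a
ring that allows many tokens per host (class and method names are
hypothetical, not actual Cassandra internals):

    import java.math.BigInteger;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Hypothetical sketch: a ring in which each host owns several tokens.
    // With one token per host this degenerates to the current scheme.
    class MultiTokenRing {
        // Sorted map from token to the host owning the range ending at it.
        private final TreeMap<BigInteger, String> ring = new TreeMap<>();

        void addHost(String host, List<BigInteger> tokens) {
            for (BigInteger token : tokens)
                ring.put(token, host);
        }

        // A key belongs to the host holding the first token >= the key's
        // token, wrapping to the start of the ring past the last token.
        // (Assumes a non-empty ring.)
        String ownerOf(BigInteger keyToken) {
            Map.Entry<BigInteger, String> e = ring.ceilingEntry(keyToken);
            return (e != null ? e : ring.firstEntry()).getValue();
        }
    }

If each host's tokens are chosen so that its ranges stay contiguous,
distribution is unchanged from today; relaxing that choice is the only
further step needed for virtual nodes.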

-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Eric Evans
On Tue, Mar 20, 2012 at 6:40 AM, Sam Overton  wrote:
> On 20 March 2012 04:35, Vijay  wrote:
>> On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans  wrote:
>>
>>> I'm guessing you're referring to Rick's proposal about ranges per node?
>>>
>>
>> Maybe; what I mean is a little simpler than that... We can consider
>> every node having multiple conservative ranges and moving those ranges
>> for bootstrap etc., instead of finding the midpoint etc. in the bootstrap
>> code. Once we get that working all the way to the FS/Streaming layer, we
>> can move those ranges and assign those ranges to nodes in random order.
>> Hope it makes sense.
>
> I agree that this should be approached in incremental steps. Rick
> already raised concerns about stability issues which might arise from
> changing large parts of code at once.
>
> I would anticipate the first step to be, exactly as you suggest, to
> support multiple tokens per host instead of just one. Presumably in
> your suggestion you imagine these tokens to define contiguous ranges
> for a given host, so that the distribution model is the same as
> before, but bootstrap can be done incrementally.
>
> This would be a great first step. The extension to a virtual node
> scheme as described previously is then fairly trivial. The only
> additional change needed is to assign the tokens in some other way
> which does not restrict the ranges to being contiguous.

Sounds good to me.

What can an upgrading user expect in the way of disruption?  What
would be required to move an existing cluster from one token per node
to virtual nodes?  Could this be made transparent?

-- 
Eric Evans
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Jonathan Ellis
I like this idea.  It feels like a good 80/20 solution -- 80% of the
benefits, 20% of the effort.  More like 5% of the effort.  I can't
even enumerate all the places full vnode support would change, but an
"active token range" concept would be relatively limited in scope.

Full vnodes feels a lot more like the counters quagmire, where
Digg/Twitter worked on it for... 8? months, and then DataStax worked
on it for about 6 months post-commit, and we're still finding
the occasional bug-since-0.7 there.  With the benefit of hindsight, as
bad as maintaining that patchset was out of tree, committing it as
early as we did was a mistake.  We won't do that again.  (On the
bright side, git makes maintaining such a patchset easier now.)

On Mon, Mar 19, 2012 at 5:16 PM, Rick Branson  wrote:
> I think if we could go back and rebuild Cassandra from scratch, vnodes
> would likely be implemented from the beginning. However, I'm concerned that
> implementing them now could be a big distraction from more productive uses
> of all of our time and introduce major potential stability issues into what
> is becoming a business critical piece of infrastructure for many people.
> However, instead of just complaints and pedantry, I'd like to offer a
> feasible alternative:
>
> Has there been consideration given to the idea of supporting a single
> token range per node?
>
> While not theoretically as capable as vnodes, it seems to me to be more
> practical as it would have a significantly lower impact on the codebase and
> provides a much clearer migration path. It also seems to solve a majority
> of complaints regarding operational issues with Cassandra clusters.
>
> Each node would have a lower and an upper token, which would form a range
> that would be actively distributed via gossip. Read and replication
> requests would only be routed to a replica when the key of these operations
> matched the replica's token range in the gossip tables. Each node would
> locally store its own current active token range as well as a target token
> range it's "moving" towards.
>
> As a new node undergoes bootstrap, the bounds would be gradually expanded
> to allow it to handle requests for a wider range of the keyspace as it
> moves towards its target token range. This idea boils down to a move from
> hard cutovers to smoother operations by gradually adjusting active token
> ranges over a period of time. It would apply to token change operations
> (nodetool 'move' and 'removetoken') as well.
>
> Failure during streaming could be recovered at the bounds instead of
> restarting the whole process as the active bounds would effectively track
> the progress for bootstrap & target token changes. Implicitly these
> operations would be throttled to some degree. Node repair (AES) could also
> be modified using the same overall ideas to provide a more gradual impact
> on the cluster, similar to the ideas given in CASSANDRA-3721.
>
> While this doesn't spread the load over the cluster for these operations
> evenly like vnodes do, this is likely an issue that could be worked
> around by performing concurrent (throttled) bootstrap & node repair (AES)
> operations. It does allow some kind of "active" load balancing, though
> clearly not as flexible or as useful as vnodes; but you should be using
> RandomPartitioner or sort-of-randomized keys with OPP, right? ;)
>
> As a side note: vnodes fail to provide solutions to node-based limitations
> that seem to me to cause a substantial portion of operational issues such
> as impact of node restarts / upgrades, GC and compaction induced latency. I
> think some progress could be made here by allowing a "pack" of independent
> Cassandra nodes to be run on a single host; somewhat (but nowhere near
> entirely) similar to a pre-fork model used by some UNIX-based servers.
>
> Input?
>
> --
> Rick Branson
> DataStax



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Eric Evans
On Tue, Mar 20, 2012 at 8:39 AM, Jonathan Ellis  wrote:
> I like this idea.  It feels like a good 80/20 solution -- 80% of the
> benefits, 20% of the effort.  More like 5% of the effort.  I can't
> even enumerate all the places full vnode support would change, but an
> "active token range" concept would be relatively limited in scope.

It only addresses 1 of Sam's original 5 points, so I wouldn't call it
an "80% solution".

> Full vnodes feels a lot more like the counters quagmire, where
> Digg/Twitter worked on it for... 8? months, and then DataStax worked
> on it for about 6 months post-commit, and we're still finding
> the occasional bug-since-0.7 there.  With the benefit of hindsight, as
> bad as maintaining that patchset was out of tree, committing it as
> early as we did was a mistake.  We won't do that again.  (On the
> bright side, git makes maintaining such a patchset easier now.)

And yet counters have become a very important feature for Cassandra;
we're better off with them than without.

I think there were a number of problems with how counters went down
that could be avoided here.  For one, we can take a phased,
incremental approach, rather than waiting 8 months to drop a large
patchset.

> On Mon, Mar 19, 2012 at 5:16 PM, Rick Branson  wrote:
>> I think if we could go back and rebuild Cassandra from scratch, vnodes
>> would likely be implemented from the beginning. However, I'm concerned that
>> implementing them now could be a big distraction from more productive uses
>> of all of our time and introduce major potential stability issues into what
>> is becoming a business critical piece of infrastructure for many people.
>> However, instead of just complaints and pedantry, I'd like to offer a
>> feasible alternative:
>>
>> Has there been consideration given to the idea of supporting a single
>> token range per node?
>>
>> While not theoretically as capable as vnodes, it seems to me to be more
>> practical as it would have a significantly lower impact on the codebase and
>> provides a much clearer migration path. It also seems to solve a majority
>> of complaints regarding operational issues with Cassandra clusters.
>>
>> Each node would have a lower and an upper token, which would form a range
>> that would be actively distributed via gossip. Read and replication
>> requests would only be routed to a replica when the key of these operations
>> matched the replica's token range in the gossip tables. Each node would
>> locally store its own current active token range as well as a target token
>> range it's "moving" towards.
>>
>> As a new node undergoes bootstrap, the bounds would be gradually expanded
>> to allow it to handle requests for a wider range of the keyspace as it
>> moves towards its target token range. This idea boils down to a move from
>> hard cutovers to smoother operations by gradually adjusting active token
>> ranges over a period of time. It would apply to token change operations
>> (nodetool 'move' and 'removetoken') as well.
>>
>> Failure during streaming could be recovered at the bounds instead of
>> restarting the whole process as the active bounds would effectively track
>> the progress for bootstrap & target token changes. Implicitly these
>> operations would be throttled to some degree. Node repair (AES) could also
>> be modified using the same overall ideas to provide a more gradual impact
>> on the cluster, similar to the ideas given in CASSANDRA-3721.
>>
>> While this doesn't spread the load over the cluster for these operations
>> evenly like vnodes do, this is likely an issue that could be worked
>> around by performing concurrent (throttled) bootstrap & node repair (AES)
>> operations. It does allow some kind of "active" load balancing, though
>> clearly not as flexible or as useful as vnodes; but you should be using
>> RandomPartitioner or sort-of-randomized keys with OPP, right? ;)
>>
>> As a side note: vnodes fail to provide solutions to node-based limitations
>> that seem to me to cause a substantial portion of operational issues such
>> as impact of node restarts / upgrades, GC and compaction induced latency. I
>> think some progress could be made here by allowing a "pack" of independent
>> Cassandra nodes to be run on a single host; somewhat (but nowhere near
>> entirely) similar to a pre-fork model used by some UNIX-based servers.
>>
>> Input?
>>
>> --
>> Rick Branson
>> DataStax
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com



-- 
Eric Evans
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Rick Branson
> > I like this idea. It feels like a good 80/20 solution -- 80% of the
> > benefits, 20% of the effort. More like 5% of the effort. I can't
> > even enumerate all the places full vnode support would change, but an
> > "active token range" concept would be relatively limited in scope.
> 
> 
> It only addresses 1 of Sam's original 5 points, so I wouldn't call it
> an "80% solution".
> 
To support a form of DF, I think some tweaking of the replica placement could 
achieve this effect quite well. We could introduce a variable into replica 
placement, which I'm going to incorrectly call DF for the purposes of 
illustration. The key range for a node would be sub-divided by DF (1 by 
default) and this would be used to further distribute replica selection based
on this "sub-partition". 

Currently, the offset formula works out to be something like this:

offset = replica

For RandomPartitioner, DF placement might look something like:

offset = replica + (token % DF)

Now, I realize replica selection is actually much more complicated than this, 
but these formulas are for illustration purposes.
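
To make the illustration slightly more concrete, here is a minimal sketch of
how that offset could be applied when choosing replica positions on the ring
(all names hypothetical, and still ignoring the real complexity of replica
selection):

    import java.math.BigInteger;
    import java.util.ArrayList;
    import java.util.List;

    class DfPlacementSketch {
        // Illustrative only: pick RF ring positions starting from the
        // primary, shifted by the key's "sub-partition" (token % DF) so
        // that keys in different sub-partitions of a node's range are
        // replicated to different sets of neighbours.
        static List<Integer> replicaPositions(int primaryIndex, BigInteger token,
                                              int df, int rf, int ringSize) {
            int subPartition = token.mod(BigInteger.valueOf(df)).intValue();
            List<Integer> positions = new ArrayList<>();
            for (int replica = 0; replica < rf; replica++)
                // offset = replica + (token % DF), as in the formula above
                positions.add((primaryIndex + replica + subPartition) % ringSize);
            return positions;
        }
    }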

Modifying replica placement & the partitioners to support this seems 
straightforward, but I'm unsure of what's required to get it working for ring 
management operations. On the surface, it does seem like this could be added 
without any kind of difficult migration support. 

Thoughts?




Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Sam Overton
On 20 March 2012 13:37, Eric Evans  wrote:
> On Tue, Mar 20, 2012 at 6:40 AM, Sam Overton  wrote:
>> On 20 March 2012 04:35, Vijay  wrote:
>>> Maybe; what I mean is a little simpler than that... We can consider
>>> every node having multiple conservative ranges and moving those ranges
>>> for bootstrap etc., instead of finding the midpoint etc. in the bootstrap
>>> code. Once we get that working all the way to the FS/Streaming layer, we
>>> can move those ranges and assign those ranges to nodes in random order.
>>> Hope it makes sense.
>>
>> I agree that this should be approached in incremental steps. Rick
>> already raised concerns about stability issues which might arise from
>> changing large parts of code at once.
>>
>> I would anticipate the first step to be, exactly as you suggest, to
>> support multiple tokens per host instead of just one. Presumably in
>> your suggestion you imagine these tokens to define contiguous ranges
>> for a given host, so that the distribution model is the same as
>> before, but bootstrap can be done incrementally.
>>
>> This would be a great first step. The extension to a virtual node
>> scheme as described previously is then fairly trivial. The only
>> additional change needed is to assign the tokens in some other way
>> which does not restrict the ranges to being contiguous.
>
> Sounds good to me.
>
> What can an upgrading user expect in the way of disruption?  What
> would be required to move an existing cluster from one token per node
> to virtual nodes?  Could this be made transparent?
>

The disruption for an end-user would be no more than the same rolling
upgrade process that they have to go through currently to upgrade to a
new version.

This is how I envisage it working:
* When a node is upgraded and the new version starts up in an old
cluster, it would split its own token range into multiple sub-ranges
by assigning itself more tokens in its own range
* These tokens could then be gossiped to any other new versions in the
cluster. The old versions don't need to know about these intermediate
tokens because distribution is exactly the same - node ranges are
still contiguous
* Once every node has been upgraded, distribution is still the same as
before, but now ranges are split into sub-ranges
* The benefits of vnodes start to become apparent when adding new
nodes to the cluster - a new node bootstrapping would take an even
amount of data from each other node and would not require doubling the
cluster to maintain balance
* As more nodes are added to the cluster it gets closer to full vnode
distribution as more of the original hosts' ranges get reassigned to
new nodes

If the user wants to migrate to full vnode functionality straight away
then they can do a rolling migration (decommission/bootstrap). During
this migration there would be some imbalance in the cluster, but once
all of the old nodes have been migrated, the cluster would be
balanced.
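
For the range-splitting in the first step above, something like the
following would suffice (a sketch assuming a RandomPartitioner-style numeric
token space with no wrap-around; not actual Cassandra code):

    import java.math.BigInteger;
    import java.util.ArrayList;
    import java.util.List;

    class RangeSplitSketch {
        // Split the range (previousToken, myToken] into `parts` equal
        // sub-ranges by generating evenly spaced tokens inside it. The
        // last token generated is the node's original token, so the
        // union of the sub-ranges is exactly the range it already owned.
        static List<BigInteger> splitOwnRange(BigInteger previousToken,
                                              BigInteger myToken, int parts) {
            BigInteger width = myToken.subtract(previousToken);
            List<BigInteger> tokens = new ArrayList<>();
            for (int i = 1; i <= parts; i++)
                tokens.add(previousToken.add(width
                        .multiply(BigInteger.valueOf(i))
                        .divide(BigInteger.valueOf(parts))));
            return tokens;
        }
    }

Because the union of the sub-ranges equals the node's old range, old nodes
which never learn the intermediate tokens still compute exactly the same
ownership.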

-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Jonathan Ellis
On Tue, Mar 20, 2012 at 9:08 AM, Eric Evans  wrote:
> On Tue, Mar 20, 2012 at 8:39 AM, Jonathan Ellis  wrote:
>> I like this idea.  It feels like a good 80/20 solution -- 80% of the
>> benefits, 20% of the effort.  More like 5% of the effort.  I can't
>> even enumerate all the places full vnode support would change, but an
>> "active token range" concept would be relatively limited in scope.
>
> It only addresses 1 of Sam's original 5 points, so I wouldn't call it
> an "80% solution".

I guess a more accurate way to put this is, "only 20% of Sam's list is
an actual pain point that doesn't get addressed by The Rick Proposal
[TRP]."

Here's how I see Sam's list:

* Even load balancing when growing and shrinking the cluster

Nice to have, but post-bootstrap load balancing works well in practice
(and is improved by TRP).

* Greater failure tolerance in streaming

Directly addressed by TRP.

* Evenly distributed impact of streaming operations

Not a problem in practice with stream throttling.

* Possibility for active load balancing

Not really a feature of vnodes per se, but as with the other load
balancing point, this is also improved by TRP.

* Distributed rebuild

This is the 20% that TRP does not address.  Nice to have?  Yes.  Can I
live without it?  I have so far.  Is this alone worth the complexity
of vnodes?  No, it is not.  Especially since there are probably other
approaches that we can take to mitigate this, one of which Rick has
suggested in a separate sub-thread.

>> Full vnodes feels a lot more like the counters quagmire, where
>> Digg/Twitter worked on it for... 8? months, and then DataStax worked
>> on it for about 6 months post-commit, and we're still finding
>> the occasional bug-since-0.7 there.  With the benefit of hindsight, as
>> bad as maintaining that patchset was out of tree, committing it as
>> early as we did was a mistake.  We won't do that again.  (On the
>> bright side, git makes maintaining such a patchset easier now.)
>
> And yet counters have become a very important feature for Cassandra;
> we're better off with them than without.

False dichotomy (we could have waited for a better counter design),
but that's mostly irrelevant to my point that jamming incomplete code
in-tree to sort out later is a bad idea.

> I think there were a number of problems with how counters went down
> that could be avoided here.  For one, we can take a phased,
> incremental approach, rather than waiting 8 months to drop a large
> patchset.

If there are incremental improvements to be made that justify
themselves independently, then I agree.  Small, self-contained steps
are a good thing.  A good example is
https://issues.apache.org/jira/browse/CASSANDRA-2319, a product of The
Grand Storage Engine Redesign of 674 fame.

But, when things don't naturally break down into such mini-features,
then I'm -1 on committing code that has no purpose other than to be a
foundation for later commits.  I've seen people get bored or assigned
to other projects too often to just trust that those later commits
will indeed be forthcoming.  Or even if Sam [for instance] is still
working hard on it, it's very easy for unforeseen difficulties to come
up that invalidate the original approach.  Since we were talking about
counters, the original vector clock approach -- that we ended up
ripping out, painfully -- is a good example.  Once bitten, twice shy.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


RE: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Jeremiah Jordan
So taking a step back, if we want "vnodes" why can't we just give every node 
100 tokens instead of only one?  Seems to me this would have less impact on the 
rest of the code.  It would just look like you had a 500 node cluster, instead 
of a 5 node cluster.  Your replication strategy would have to know about the 
physical machines so that data gets replicated right, but there is already some 
concept of this with the data center aware and rack aware stuff.

From what I see I think you could get most of the benefits of vnodes by
implementing a new Placement Strategy that did something like this, and you
wouldn't have to touch (and maybe break) code in other places.

Am I crazy? Naive?

Once you had this setup, you could start to implement the vnode-like stuff
on top of it, like bootstrapping nodes one token at a time, taking data from
the whole cluster, not just your neighbor, etc.
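
To make the "strategy knows about the physical machines" part concrete, here
is a sketch of the walk such a Placement Strategy would do (names
hypothetical; in real code the physical host would come from the snitch):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    class HostAwarePlacementSketch {
        // Walk the (say, 500-entry) ring in token order, but only count a
        // replica when it lands on a physical machine we haven't used yet,
        // so RF copies end up on RF distinct machines rather than RF tokens.
        static List<String> replicasFor(List<String> hostAtRingPosition,
                                        int startPosition, int rf) {
            List<String> replicas = new ArrayList<>();
            Set<String> usedHosts = new HashSet<>();
            for (int i = 0; replicas.size() < rf && i < hostAtRingPosition.size(); i++) {
                String host = hostAtRingPosition.get(
                        (startPosition + i) % hostAtRingPosition.size());
                if (usedHosts.add(host)) // skip positions on an already-used machine
                    replicas.add(host);
            }
            return replicas;
        }
    }

This is essentially the same skip-and-continue walk that the rack- and
datacenter-aware placement already does, applied at the granularity of
physical hosts.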

-Jeremiah Jordan


From: Rick Branson [rbran...@datastax.com]
Sent: Monday, March 19, 2012 5:16 PM
To: dev@cassandra.apache.org
Subject: Re: RFC: Cassandra Virtual Nodes

I think if we could go back and rebuild Cassandra from scratch, vnodes
would likely be implemented from the beginning. However, I'm concerned that
implementing them now could be a big distraction from more productive uses
of all of our time and introduce major potential stability issues into what
is becoming a business critical piece of infrastructure for many people.
However, instead of just complaints and pedantry, I'd like to offer a
feasible alternative:

Has there been consideration given to the idea of supporting a single
token range per node?

While not theoretically as capable as vnodes, it seems to me to be more
practical as it would have a significantly lower impact on the codebase and
provides a much clearer migration path. It also seems to solve a majority
of complaints regarding operational issues with Cassandra clusters.

Each node would have a lower and an upper token, which would form a range
that would be actively distributed via gossip. Read and replication
requests would only be routed to a replica when the key of these operations
matched the replica's token range in the gossip tables. Each node would
locally store its own current active token range as well as a target token
range it's "moving" towards.
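
A minimal sketch of that routing check (illustrative names only; assumes
numeric tokens and ignores ring wrap-around):

    import java.math.BigInteger;

    // A replica is a valid target for a key only while the key's token
    // falls inside the replica's currently-gossiped *active* range; the
    // target range it is moving towards plays no part in routing.
    class ActiveRange {
        final BigInteger lower, upper;

        ActiveRange(BigInteger lower, BigInteger upper) {
            this.lower = lower;
            this.upper = upper;
        }

        boolean covers(BigInteger keyToken) {
            return keyToken.compareTo(lower) > 0
                && keyToken.compareTo(upper) <= 0;
        }
    }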

As a new node undergoes bootstrap, the bounds would be gradually expanded
to allow it to handle requests for a wider range of the keyspace as it
moves towards its target token range. This idea boils down to a move from
hard cutovers to smoother operations by gradually adjusting active token
ranges over a period of time. It would apply to token change operations
(nodetool 'move' and 'removetoken') as well.

Failure during streaming could be recovered at the bounds instead of
restarting the whole process as the active bounds would effectively track
the progress for bootstrap & target token changes. Implicitly these
operations would be throttled to some degree. Node repair (AES) could also
be modified using the same overall ideas to provide a more gradual impact
on the cluster, similar to the ideas given in CASSANDRA-3721.

While this doesn't spread the load over the cluster for these operations
evenly like vnodes do, this is likely an issue that could be worked
around by performing concurrent (throttled) bootstrap & node repair (AES)
operations. It does allow some kind of "active" load balancing, though
clearly not as flexible or as useful as vnodes; but you should be using
RandomPartitioner or sort-of-randomized keys with OPP, right? ;)

As a side note: vnodes fail to provide solutions to node-based limitations
that seem to me to cause a substantial portion of operational issues such
as impact of node restarts / upgrades, GC and compaction induced latency. I
think some progress could be made here by allowing a "pack" of independent
Cassandra nodes to be run on a single host; somewhat (but nowhere near
entirely) similar to a pre-fork model used by some UNIX-based servers.

Input?

--
Rick Branson
DataStax


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Sam Overton
On 19 March 2012 23:41, Peter Schuller  wrote:
>>> Using this ring bucket in the CRUSH topology (with the hash function
>>> being the identity function) would give the exact same distribution
>>> properties as the virtual node strategy that I suggested previously,
>>> but of course with much better topology awareness.
>>
>> I will have to re-read your original post. I seem to have missed something :)
>
> I did, and I may or may not understand what you mean.
>
> Are you comparing vnodes + hashing, with CRUSH + pre-partitioning by
> hash + identity hash as you traverse down the topology tree?

Yes. I was just trying to illustrate that it's not necessary to have
CRUSH doing the partitioning and placement of primary replicas. The
same functionality can be achieved by having logically separate
placement (a ring with virtual nodes) and a replication strategy which
implements the CRUSH algorithm for replica placement. I think you
agreed with this further down your previous reply anyway, perhaps I
was just being too verbose :)

The reason I'm trying to make that distinction is that it would be less
work than wholesale replacing the entire distribution logic in Cassandra
with CRUSH. I'm not sure if that's exactly what your design
is suggesting?
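
In code terms, the split would look something like this toy (entirely
illustrative; not the real CRUSH algorithm or the real strategy API): the
ring decides the primary replica, and a CRUSH-style deterministic,
hash-driven descent over the topology picks only the remaining replicas.

    import java.util.ArrayList;
    import java.util.List;

    class CrushLikeReplicationSketch {
        // racks.get(i) is the list of hosts in rack i. The ring picks the
        // primary; a deterministic hash-driven descent picks the rest, so
        // the same token always yields the same replica set (the essence
        // of CRUSH). A real implementation would also enforce rack
        // separation, weights, and proper retry semantics.
        static List<String> replicas(String primary, long token,
                                     List<List<String>> racks, int rf) {
            List<String> out = new ArrayList<>();
            out.add(primary); // chosen by the ring, not by CRUSH
            for (int attempt = 0; out.size() < rf && attempt < 8 * rf; attempt++) {
                long h = mix(token, attempt);
                List<String> rack =
                        racks.get((int) Long.remainderUnsigned(h, racks.size()));
                String host =
                        rack.get((int) Long.remainderUnsigned(mix(h, 1), rack.size()));
                if (!out.contains(host))
                    out.add(host);
            }
            return out;
        }

        static long mix(long x, long y) { // cheap stand-in for CRUSH's hash
            long z = x * 0x9E3779B97F4A7C15L + y;
            return z ^ (z >>> 31);
        }
    }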

-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Richard Low
On 20 March 2012 14:50, Rick Branson  wrote:

> To support a form of DF, I think some tweaking of the replica placement could 
> achieve this effect quite well. We could introduce a variable into replica 
> placement, which I'm going to incorrectly call DF for the purposes of 
> illustration. The key range for a node would be sub-divided by DF (1 by 
> default) and this would be used to further distribute replica selection 
> based on this "sub-partition".
>
> Currently, the offset formula works out to be something like this:
>
> offset = replica
>
> For RandomPartitioner, DF placement might look something like:
>
> offset = replica + (token % DF)
>
> Now, I realize replica selection is actually much more complicated than this, 
> but these formulas are for illustration purposes.
>
> Modifying replica placement & the partitioners to support this seems 
> straightforward, but I'm unsure of what's required to get it working for ring 
> management operations. On the surface, it does seem like this could be added 
> without any kind of difficult migration support.
>
> Thoughts?

This solution increases the DF, which has the advantage of providing
some balancing when a node is down temporarily.  The reads and writes
it would have served are now distributed across ~DF nodes.

However, it doesn't have any distributed rebuild.  In fact, any
distribution mechanism with one token per node cannot have distributed
rebuild.  Should a node fail, the next node in the ring takes over its
range, so it ends up with twice the token range and twice the data.  This
node will limit the rebuild time - 'nodetool removetoken' will have to
replicate all of the failed node's data onto this one node.

Increasing the distribution factor without speeding up rebuild
increases the failure probability - both for data loss and for being
unable to reach required consistency levels.  The failure probability is a
trade-off between rebuild time and distribution factor.  Lower rebuild
time helps, and lower distribution factor helps.

Cassandra as it is now has the longest rebuild time and lowest
possible distribution factor.  The original vnodes scheme is the other
extreme - shortest rebuild time and largest possible distribution
factor.  It turns out that the rebuild time is more important, so this
decreases failure probability (with some assumptions you can show it
decreases by a factor RF! - I'll spare you the math but can send it if
you're interested).

This scheme has the longest rebuild time and a (tuneable) distribution
factor larger than the lowest.  That necessarily increases the
failure probability over both Cassandra now and vnode schemes, so I'd
be very careful about choosing it.

Richard.


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Richard Low
On 20 March 2012 14:55, Jonathan Ellis  wrote:
> Here's how I see Sam's list:
>
> * Even load balancing when growing and shrinking the cluster
>
> Nice to have, but post-bootstrap load balancing works well in practice
> (and is improved by TRP).

Post-bootstrap load balancing without vnodes necessarily streams more
data than needed.  Vnodes stream the minimal amount.

In fact, post-bootstrap load balancing currently streams a constant
fraction of your data - the network traffic involved in a rebalance
increases linearly with the size of your cluster.  With vnodes it
decreases linearly.

Add to that the ops overhead of running the load balancing and
calculating new tokens, and removing post-bootstrap load balancing
becomes a pretty big deal.

> * Greater failure tolerance in streaming
>
> Directly addressed by TRP.

Agreed.

> * Evenly distributed impact of streaming operations
>
> Not a problem in practice with stream throttling.

Throttling slows them down, increasing rebuild times and thus downtime.

> * Possibility for active load balancing
>
> Not really a feature of vnodes per se, but as with the other load
> balancing point, this is also improved by TRP.

Again with the caveat that more data is streamed with TRP.  Vnodes
remove the need for any load balancing with RP.

> * Distributed rebuild
>
> This is the 20% that TRP does not address.  Nice to have?  Yes.  Can I
> live without it?  I have so far.  Is this alone worth the complexity
> of vnodes?  No, it is not.  Especially since there are probably other
> approaches that we can take to mitigate this, one of which Rick has
> suggested in a separate sub-thread.

Distributed rebuild means you can store more data per node with the
same failure probabilities.  This is frequently a limiting factor on
how much data you can store per node, increasing cluster sizes
unnecessarily.  I'd argue that this alone is worth the complexity of
vnodes.

Richard.


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Peter Schuller
> Each node would have a lower and an upper token, which would form a range
> that would be actively distributed via gossip. Read and replication
> requests would only be routed to a replica when the key of these operations
> matched the replica's token range in the gossip tables. Each node would
> locally store it's own current active token range as well as a target token
> range it's "moving" towards.

How is this significantly different from just using "nodetool move"
(in post-1.x) more rapidly and on smaller segments at a time?

There is the ring delay stuff which makes it unworkable to do at high
granularity, but that should apply to the active range solution too.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Jonathan Ellis
It's reasonable that we can attach different levels of importance to
these things.  Taking a step back, I have two main points:

1) vnodes add enormous complexity to *many* parts of Cassandra.  I'm
skeptical of the cost:benefit ratio here.

1a) The benefit is lower in my mind because many of the problems
solved by vnodes can be solved "well enough" for "most people," for
some value of those two phrases, without vnodes.

2) I'm not okay with a "commit something half-baked and sort it out
later" approach.

On Tue, Mar 20, 2012 at 11:10 AM, Richard Low  wrote:
> On 20 March 2012 14:55, Jonathan Ellis  wrote:
>> Here's how I see Sam's list:
>>
>> * Even load balancing when growing and shrinking the cluster
>>
>> Nice to have, but post-bootstrap load balancing works well in practice
>> (and is improved by TRP).
>
> Post-bootstrap load balancing without vnodes necessarily streams more
> data than needed.  Vnodes stream the minimal amount.
>
> In fact, post-bootstrap load balancing currently streams a constant
> fraction of your data - the network traffic involved in a rebalance
> increases linearly with the size of your cluster.  With vnodes it
> decreases linearly.
>
> Add to that the ops overhead of running the load balancing and
> calculating new tokens, and removing post-bootstrap load balancing
> becomes a pretty big deal.
>
>> * Greater failure tolerance in streaming
>>
>> Directly addressed by TRP.
>
> Agreed.
>
>> * Evenly distributed impact of streaming operations
>>
>> Not a problem in practice with stream throttling.
>
> Throttling slows them down, increasing rebuild times and thus downtime.
>
>> * Possibility for active load balancing
>>
>> Not really a feature of vnodes per se, but as with the other load
>> balancing point, this is also improved by TRP.
>
> Again with the caveat that more data is streamed with TRP.  Vnodes
> remove the need for any load balancing with RP.
>
>> * Distributed rebuild
>>
>> This is the 20% that TRP does not address.  Nice to have?  Yes.  Can I
>> live without it?  I have so far.  Is this alone worth the complexity
>> of vnodes?  No, it is not.  Especially since there are probably other
>> approaches that we can take to mitigate this, one of which Rick has
>> suggested in a separate sub-thread.
>
> Distributed rebuild means you can store more data per node with the
> same failure probabilities.  This is frequently a limiting factor on
> how much data you can store per node, increasing cluster sizes
> unnecessarily.  I'd argue that this alone is worth the complexity of
> vnodes.
>
> Richard.



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com