Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Eric Evans
On Tue, Mar 20, 2012 at 9:53 PM, Jonathan Ellis  wrote:
> It's reasonable that we can attach different levels of importance to
> these things.  Taking a step back, I have two main points:
>
> 1) vnodes add enormous complexity to *many* parts of Cassandra.  I'm
> skeptical of the cost:benefit ratio here.
>
> 1a) The benefit is lower in my mind because many of the problems
> solved by vnodes can be solved "well enough" for "most people," for
> some value of those two phrases, without vnodes.
>
> 2) I'm not okay with a "commit something half-baked and sort it out
> later" approach.

I must admit I find this a little disheartening.  The discussion has
barely started.  No one has had a chance to discuss implementation
specifics so that the rest of us could understand *how* disruptive it
would be (a necessary requirement in weighing cost:benefit), or what
an incremental approach would look like, and yet work has already
begun on shutting this down.

Unless I'm reading you wrong, your mandate (I say mandate because you
hinted at a veto elsewhere) is No to anything complex or invasive
(for some value of each).  The only alternative would seem to be a
phased or incremental approach, but you seem to be saying No to that
as well.

There seems to be quite a bit of interest in having virtual nodes (and
there has been for as long as I can remember); the only serious
reservations relate to the difficulty/complexity.  Is there really no
way to put our heads together and figure out how to properly manage
that aspect?

> On Tue, Mar 20, 2012 at 11:10 AM, Richard Low  wrote:
>> On 20 March 2012 14:55, Jonathan Ellis  wrote:
>>> Here's how I see Sam's list:
>>>
>>> * Even load balancing when growing and shrinking the cluster
>>>
>>> Nice to have, but post-bootstrap load balancing works well in practice
>>> (and is improved by TRP).
>>
>> Post-bootstrap load balancing without vnodes necessarily streams more
>> data than is necessary.  Vnodes streams the minimal amount.
>>
>> In fact, post-bootstrap load balancing currently streams a constant
>> fraction of your data - the network traffic involved in a rebalance
>> increases linearly with the size of your cluster.  With vnodes it
>> decreases linearly.
>>
>> Including removing the ops overhead of running the load balance and
>> calculating new tokens, this makes removing post-bootstrap load
>> balancing a pretty big deal.
>>
>>> * Greater failure tolerance in streaming
>>>
>>> Directly addressed by TRP.
>>
>> Agreed.
>>
>>> * Evenly distributed impact of streaming operations
>>>
>>> Not a problem in practice with stream throttling.
>>
>> Throttling slows them down, increasing rebuild times so increasing downtime.
>>
>>> * Possibility for active load balancing
>>>
>>> Not really a feature of vnodes per se, but as with the other load
>>> balancing point, this is also improved by TRP.
>>
>> Again with the caveat that more data is streamed with TRP.  Vnodes
>> removes the need for any load balancing with RP.
>>
>>> * Distributed rebuild
>>>
>>> This is the 20% that TRP does not address.  Nice to have?  Yes.  Can I
>>> live without it?  I have so far.  Is this alone worth the complexity
>>> of vnodes?  No, it is not.  Especially since there are probably other
>>> approaches that we can take to mitigate this, one of which Rick has
>>> suggested in a separate sub-thread.
>>
>> Distributed rebuild means you can store more data per node with the
>> same failure probabilities.  This is frequently a limiting factor on
>> how much data you can store per node, increasing cluster sizes
>> unnecessarily.  I'd argue that this alone is worth the complexity of
>> vnodes.
>>
>> Richard.
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com



-- 
Eric Evans
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Edward Capriolo
On Wed, Mar 21, 2012 at 9:50 AM, Eric Evans  wrote:
> On Tue, Mar 20, 2012 at 9:53 PM, Jonathan Ellis  wrote:
>> It's reasonable that we can attach different levels of importance to
>> these things.  Taking a step back, I have two main points:
>>
>> 1) vnodes add enormous complexity to *many* parts of Cassandra.  I'm
>> skeptical of the cost:benefit ratio here.
>>
>> 1a) The benefit is lower in my mind because many of the problems
>> solved by vnodes can be solved "well enough" for "most people," for
>> some value of those two phrases, without vnodes.
>>
>> 2) I'm not okay with a "commit something half-baked and sort it out
>> later" approach.
>
> I must admit I find this a little disheartening.  The discussion has
> barely started.  No one has had a chance to discuss implementation
> specifics so that the rest of us could understand *how* disruptive it
> would be (a necessary requirement in weighing cost:benefit), or what
> an incremental approach would look like, and yet work has already
> begun on shutting this down.
>
> Unless I'm reading you wrong, your mandate (I say mandate because you
> hinted at a veto elsewhere), is No to anything complex or invasive
> (for some value of each).  The only alternative would seem to be a
> phased or incremental approach, but you seem to be saying No to that
> as well.
>
> There seems to be quite a bit of interest in having virtual nodes (and
> there has been for as long as I can remember), the only serious
> reservations relate to the difficulty/complexity.  Is there really no
> way to put our heads together and figure out how to properly manage
> that aspect?
>
>> On Tue, Mar 20, 2012 at 11:10 AM, Richard Low  wrote:
>>> On 20 March 2012 14:55, Jonathan Ellis  wrote:
 Here's how I see Sam's list:

 * Even load balancing when growing and shrinking the cluster

 Nice to have, but post-bootstrap load balancing works well in practice
 (and is improved by TRP).
>>>
>>> Post-bootstrap load balancing without vnodes necessarily streams more
>>> data than is necessary.  Vnodes streams the minimal amount.
>>>
>>> In fact, post-bootstrap load balancing currently streams a constant
>>> fraction of your data - the network traffic involved in a rebalance
>>> increases linearly with the size of your cluster.  With vnodes it
>>> decreases linearly.
>>>
>>> Including removing the ops overhead of running the load balance and
>>> calculating new tokens, this makes removing post-bootstrap load
>>> balancing a pretty big deal.
>>>
 * Greater failure tolerance in streaming

 Directly addressed by TRP.
>>>
>>> Agreed.
>>>
 * Evenly distributed impact of streaming operations

 Not a problem in practice with stream throttling.
>>>
>>> Throttling slows them down, increasing rebuild times so increasing downtime.
>>>
 * Possibility for active load balancing

 Not really a feature of vnodes per se, but as with the other load
 balancing point, this is also improved by TRP.
>>>
>>> Again with the caveat that more data is streamed with TRP.  Vnodes
>>> removes the need for any load balancing with RP.
>>>
 * Distributed rebuild

 This is the 20% that TRP does not address.  Nice to have?  Yes.  Can I
 live without it?  I have so far.  Is this alone worth the complexity
 of vnodes?  No, it is not.  Especially since there are probably other
 approaches that we can take to mitigate this, one of which Rick has
 suggested in a separate sub-thread.
>>>
>>> Distributed rebuild means you can store more data per node with the
>>> same failure probabilities.  This is frequently a limiting factor on
>>> how much data you can store per node, increasing cluster sizes
>>> unnecessarily.  I'd argue that this alone is worth the complexity of
>>> vnodes.
>>>
>>> Richard.
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>
>
>
> --
> Eric Evans
> Acunu | http://www.acunu.com | @acunu

I have also thought about how I would like vnodes to work from an
operational perspective rather than a software one. I would like these
features.
1) No more RAID 0. If a machine is responsible for 4 vnodes, they
should correspond to four JBOD disks.

2) Vnodes should be able to be hot plugged. My normal Cassandra chassis
would be a 2U with 6 drive bays. Imagine I have 10 nodes. Now if my
chassis dies I should be able to take the disks out and physically
plug them into another chassis. Then in Cassandra I should be able to
run a command like:
nodetool attach '/mnt/disk6'. disk6 should contain all data and its
vnode information.

Now this would be awesome for upgrades/migrations/etc.
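For what it's worth, the hot-plug idea implies that each disk would have
to be self-describing. Here is a minimal Python sketch of what that
metadata round-trip could look like; the manifest filename, the
"nodetool attach" behaviour, and all function names are hypothetical,
not existing Cassandra functionality:

```python
import json
import os

def write_vnode_manifest(mount_point, token_ranges, origin_host):
    # Record which vnode ranges this disk holds, so the disk is
    # self-describing when plugged into a different chassis.
    # "vnode_manifest.json" is an invented filename for this sketch.
    manifest = {"token_ranges": token_ranges, "origin_host": origin_host}
    with open(os.path.join(mount_point, "vnode_manifest.json"), "w") as f:
        json.dump(manifest, f)

def attach(mount_point):
    # Sketch of what a 'nodetool attach' could do: read the manifest
    # back; a real implementation would then announce the ranges via
    # gossip and open the SSTables found on the disk.
    with open(os.path.join(mount_point, "vnode_manifest.json")) as f:
        return json.load(f)
```

The point is just that the vnode metadata would have to travel with the
disk rather than live only in the chassis that wrote it.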


Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Chris Goffinet
I'm going to agree with Eric on this one. Twitter has wanted some sort of vnode 
support for quite some time. We were even willing to do all the work. I have 
reservations about that now. We have been silent due to the community and how 
this is more like an exclusive DataStax project than an Apache one. I share 
Eric's frustration and do not like the veto/control attitude I see on this 
thread. 

Sent from my iPhone 

On Mar 21, 2012, at 6:50 AM, Eric Evans  wrote:

> On Tue, Mar 20, 2012 at 9:53 PM, Jonathan Ellis  wrote:
>> It's reasonable that we can attach different levels of importance to
>> these things.  Taking a step back, I have two main points:
>> 
>> 1) vnodes add enormous complexity to *many* parts of Cassandra.  I'm
>> skeptical of the cost:benefit ratio here.
>> 
>> 1a) The benefit is lower in my mind because many of the problems
>> solved by vnodes can be solved "well enough" for "most people," for
>> some value of those two phrases, without vnodes.
>> 
>> 2) I'm not okay with a "commit something half-baked and sort it out
>> later" approach.
> 
> I must admit I find this a little disheartening.  The discussion has
> barely started.  No one has had a chance to discuss implementation
> specifics so that the rest of us could understand *how* disruptive it
> would be (a necessary requirement in weighing cost:benefit), or what
> an incremental approach would look like, and yet work has already
> begun on shutting this down.
> 
> Unless I'm reading you wrong, your mandate (I say mandate because you
> hinted at a veto elsewhere), is No to anything complex or invasive
> (for some value of each).  The only alternative would seem to be a
> phased or incremental approach, but you seem to be saying No to that
> as well.
> 
> There seems to be quite a bit of interest in having virtual nodes (and
> there has been for as long as I can remember), the only serious
> reservations relate to the difficulty/complexity.  Is there really no
> way to put our heads together and figure out how to properly manage
> that aspect?
> 
>> On Tue, Mar 20, 2012 at 11:10 AM, Richard Low  wrote:
>>> On 20 March 2012 14:55, Jonathan Ellis  wrote:
 Here's how I see Sam's list:
 
 * Even load balancing when growing and shrinking the cluster
 
 Nice to have, but post-bootstrap load balancing works well in practice
 (and is improved by TRP).
>>> 
>>> Post-bootstrap load balancing without vnodes necessarily streams more
>>> data than is necessary.  Vnodes streams the minimal amount.
>>> 
>>> In fact, post-bootstrap load balancing currently streams a constant
>>> fraction of your data - the network traffic involved in a rebalance
>>> increases linearly with the size of your cluster.  With vnodes it
>>> decreases linearly.
>>> 
>>> Including removing the ops overhead of running the load balance and
>>> calculating new tokens, this makes removing post-bootstrap load
>>> balancing a pretty big deal.
>>> 
 * Greater failure tolerance in streaming
 
 Directly addressed by TRP.
>>> 
>>> Agreed.
>>> 
 * Evenly distributed impact of streaming operations
 
 Not a problem in practice with stream throttling.
>>> 
>>> Throttling slows them down, increasing rebuild times so increasing downtime.
>>> 
 * Possibility for active load balancing
 
 Not really a feature of vnodes per se, but as with the other load
 balancing point, this is also improved by TRP.
>>> 
>>> Again with the caveat that more data is streamed with TRP.  Vnodes
>>> removes the need for any load balancing with RP.
>>> 
 * Distributed rebuild
 
 This is the 20% that TRP does not address.  Nice to have?  Yes.  Can I
 live without it?  I have so far.  Is this alone worth the complexity
 of vnodes?  No, it is not.  Especially since there are probably other
 approaches that we can take to mitigate this, one of which Rick has
 suggested in a separate sub-thread.
>>> 
>>> Distributed rebuild means you can store more data per node with the
>>> same failure probabilities.  This is frequently a limiting factor on
>>> how much data you can store per node, increasing cluster sizes
>>> unnecessarily.  I'd argue that this alone is worth the complexity of
>>> vnodes.
>>> 
>>> Richard.
>> 
>> 
>> 
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
> 
> 
> 
> -- 
> Eric Evans
> Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Tom Wilkie
Hi Edward

> 1) No more RAID 0. If a machine is responsible for 4 vnodes, they
> should correspond to four JBOD disks.

So each vnode corresponds to a disk?  I suppose we could have a
separate data directory per disk, but I think this should be a
separate, subsequent change.

However, do note that making the vnode ~size of a disk (and only have
4-8 per machine) would make any non-hotswap rebuilds slower.  To get
the fast distributed rebuilds, you need to have at least as many
vnodes per node as you do nodes in the cluster.  And you would still
need the distributed rebuilds to deal with disk failure.
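Tom's "at least as many vnodes as nodes" point can be sketched with
back-of-envelope arithmetic; the 100 GB/hour stream rate and the
function name below are arbitrary assumptions for illustration:

```python
def rebuild_time_hours(node_data_gb, nodes, vnodes_per_node,
                       stream_rate_gb_per_hour=100.0):
    # Estimated time to restore the replicas of a failed node's data.
    # With vnodes spread across the cluster, up to
    # min(vnodes_per_node, nodes - 1) surviving nodes can stream in
    # parallel, so rebuild time stops improving once vnodes_per_node
    # reaches the cluster size.
    streamers = min(vnodes_per_node, nodes - 1)
    return node_data_gb / (streamers * stream_rate_gb_per_hour)
```

With 2 TB per node in a 50-node cluster, 4 vnodes per node caps you at
4 parallel streams, while 64 vnodes per node gets you all 49 survivors
streaming at once.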

> 2) Vnodes should be able to be hot pluged. My normal cassandra chassis
> would be a 2U with 6 drive bays. Imagine I have 10 nodes. Now if my
> chassis dies I should be able to take the disks out and physically
> plug them into another chassis. Then in cassandra I should be able to
> run a command like.
> nodetool attach '/mnt/disk6'. disk6 should contain all data an it's
> vnode information.
>
> Now this would be awesome for upgrades/migrations/etc.

You know, you're not the first person I've spoken to who has asked for
this!  I do wonder whether it is optimising for the right thing though
- in my experience, disks fail more often than machines.

Thanks

Tom


Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Jonathan Ellis
On Wed, Mar 21, 2012 at 8:50 AM, Eric Evans  wrote:
> I must admit I find this a little disheartening.  The discussion has
> barely started.  No one has had a chance to discuss implementation
> specifics so that the rest of us could understand *how* disruptive it
> would be (a necessary requirement in weighing cost:benefit), or what
> an incremental approach would look like, and yet work has already
> begun on shutting this down.

This isn't the first time vnodes has been brought up, so I've thought
at least a little bit about what that would entail for TokenMetadata,
StorageProxy, streaming, CFS, and on down the stack.  And it scares
me.  So, I wanted to make my points about cost:benefit and about not
committing with intentions to work out the details "later," up front.

But if you're still in brainstorming mode, carry on.

> Unless I'm reading you wrong, your mandate (I say mandate because you
> hinted at a veto elsewhere), is No to anything complex or invasive
> (for some value of each).  The only alternative would seem to be a
> phased or incremental approach, but you seem to be saying No to that
> as well.
>
> There seems to be quite a bit of interest in having virtual nodes (and
> there has been for as long as I can remember), the only serious
> reservations relate to the difficulty/complexity.  Is there really no
> way to put our heads together and figure out how to properly manage
> that aspect?

At the risk of putting words in your mouth, I think your concern is
that you don't want to go off and put man-months into a vnode
implementation, only to have me come back and say, "Sorry, it's too
complicated, -1."

Which is totally reasonable, I get that.

I would suggest that the best way to mitigate that is, when you are
ready, to put together as detailed an implementation plan as possible
ahead of time before you start generating patchsets.  Then we can put
some meat on the discussion more meaningful than vague "that scares
me" statements from yours truly.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Edward Capriolo
On Wed, Mar 21, 2012 at 3:24 PM, Tom Wilkie  wrote:
> Hi Edward
>
>> 1) No more raid 0. If a machine is responsible for 4 vnodes they
>> should correspond to for JBOD.
>
> So each vnode corresponds to a disk?  I suppose we could have a
> separate data directory per disk, but I think this should be a
> separate, subsequent change.
>
I think having more micro-ranges makes the process much easier.
Imagine a token ring 0-30:

Node1 | major range 0-10  | disk 1 0-2,   disk 2 3-4,   disk 3 5-7,   disk 4 8-10
Node2 | major range 11-20 | disk 1 11-12, disk 2 13-14, disk 3 15-17, disk 4 18-20
Node3 | major range 21-30 | disk 1 21-22, disk 2 23-24, disk 3 25-27, disk 4 28-30

Adding a 4th node is easy:
If you are at the data center, just take disk 4 out of each node and
place it in the new server :)
Software-wise it is the same deal. Each node streams off only disk 4
to the new node.

Now at this point disk 4 is idle and each machine should rebalance
its own data across its 4 disks.
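A rough Python sketch of that layout (the helper names are made up,
and token widths come out fractional rather than the round numbers in
the table above):

```python
def assign_micro_ranges(ring_size, nodes, disks_per_node):
    # Split a token ring into one contiguous micro-range per
    # (node, disk) pair: each node gets a major range, subdivided
    # evenly across its disks.
    width = ring_size / (nodes * disks_per_node)
    return {
        (n, d): ((n * disks_per_node + d) * width,
                 (n * disks_per_node + d + 1) * width)
        for n in range(nodes) for d in range(disks_per_node)
    }

def ranges_for_new_node(layout, nodes, disks_per_node):
    # The disk-swap trick: a new node takes the last disk's
    # micro-range from every existing node, so it ends up owning
    # non-contiguous slices from all over the ring.
    return [layout[(n, disks_per_node - 1)] for n in range(nodes)]
```

Note that the new node's ranges are deliberately non-contiguous, which
is exactly what makes this a vnode-style layout.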

> However, do note that making the vnode ~size of a disk (and only have
> 4-8 per machine) would make any non-hotswap rebuilds slower.  To get
> the fast distributed rebuilds, you need to have at least as many
> vnodes per node as you do nodes in the cluster.  And you would still
> need the distributed rebuilds to deal with disk failure.
>
>> 2) Vnodes should be able to be hot pluged. My normal cassandra chassis
>> would be a 2U with 6 drive bays. Imagine I have 10 nodes. Now if my
>> chassis dies I should be able to take the disks out and physically
>> plug them into another chassis. Then in cassandra I should be able to
>> run a command like.
>> nodetool attach '/mnt/disk6'. disk6 should contain all data an it's
>> vnode information.
>>
>> Now this would be awesome for upgrades/migrations/etc.
>
> You know, your not the first person I've spoke to who has asked for
> this!  I do wonder whether it is optimising for the right thing though
> - in my experience, disks fail more often than machines.
>
> Thanks
>
> Tom


Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Peter Schuller
> Software wise it is the same deal. Each node streams off only disk 4
> to the new node.

I think an implication on software is that if you want to make
specific selections of partitions to move, you are effectively
incompatible with deterministically generating the mapping of
partition to responsible node. I.e., it probably means the vnode
information must be kept as state. It is probably difficult to
reconcile with balancing solutions like consistent hashing/crush/etc.
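To illustrate the distinction Peter draws (a toy sketch, not
Cassandra's actual partitioner code; the md5-based ring below is only
an approximation of RandomPartitioner, and the names are invented):

```python
import hashlib
from bisect import bisect_right

def deterministic_owner(key, ring):
    # Consistent hashing: the owner is a pure function of the key and
    # the (sorted) token list, so any node can compute placement
    # locally, with no shared placement table.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)
    tokens = [t for t, _ in ring]
    return ring[bisect_right(tokens, h) % len(ring)][1]

# By contrast, hand-picked moves ("stream disk 4's partitions to the
# new chassis") produce a mapping that cannot be recomputed from the
# key alone; it has to live in a state table that is stored and
# gossiped, which is the trade-off Peter is pointing at.
placement_state = {(0, 2 ** 30): "node3"}  # hypothetical explicit entry
```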

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Edward Capriolo
I just see vnodes as a way to make the problem smaller, and by making the
problem smaller the overall system is more agile. I.e. rather than 1 node
streaming 100 GB, 4 nodes stream 25 GB each. Moves by hand are not so bad
because they take 1/4th the time.

The simplest vnode implementation is VMware.  Just make sure that 3
consecutive VM nodes do not end up on the same host. This is wasteful
because we have 4 JVMs.

I envision vnodes as a Cassandra master being a shared cache, memtables, and
manager for what we today consider a Cassandra instance. Makes it simple
to think about.

On Wednesday, March 21, 2012, Peter Schuller 
wrote:
>> Software wise it is the same deal. Each node streams off only disk 4
>> to the new node.
>
> I think an implication on software is that if you want to make
> specific selections of partitions to move, you are effectively
> incompatible with deterministically generating the mapping of
> partition to responsible node. I.e., it probably means the vnode
> information must be kept as state. It is probably difficult to
> reconcile with balancing solutions like consistent hashing/crush/etc.
>
> --
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
>


Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Vijay
>>> I envision vnodes as Cassandra master being a shared cache,memtables,
and manager for what we today consider a Cassandra  instance.

It might be kind of problematic: when you are moving a vnode, you want the
data associated with it to move too, otherwise it will be a pain to clean
up afterwards (something like nodetool cleanup). I think a vnode should be
as isolated as possible to reduce the impact when it is moving (which
will become a normal cluster operation). Just my 2 cents.

Regards,




On Wed, Mar 21, 2012 at 5:41 PM, Edward Capriolo wrote:

> I just see vnodes as a way to make the problem smaller and by making the
> problem smaller the overall system is more agile. Aka rather then 1 node
> streaming 100 gb the 4 nodes stream 25gb. Moves by hand are not so bad
> because the take 1/4th the time.
>
> The most simple vnode implementation is vmware.  Just make sure that 3 vm
> nodes consecutive nodes so not end up on the same host. This is wasteful
> because we have 4 jvms.
>
> I envision vnodes as Cassandra master being a shared cache,memtables, and
> manager for what we today consider a Cassandra  instance. Makes it simple
> to think about.
>
> On Wednesday, March 21, 2012, Peter Schuller 
> wrote:
> >> Software wise it is the same deal. Each node streams off only disk 4
> >> to the new node.
> >
> > I think an implication on software is that if you want to make
> > specific selections of partitions to move, you are effectively
> > incompatible with deterministically generating the mapping of
> > partition to responsible node. I.e., it probably means the vnode
> > information must be kept as state. It is probably difficult to
> > reconcile with balancing solutions like consistent hashing/crush/etc.
> >
> > --
> > / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
> >
>


Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Jonathan Ellis
A friend pointed out to me privately that I came across pretty harsh
in this thread.  While I stand by my technical concerns, I do want to
acknowledge that Sam's proposal here indicates a strong grasp of the
principles involved, and a deeper level of thought into the issues
than I think anyone else has brought to date.  Thanks for putting that
energy into it, Sam, and I look forward to seeing how you approach the
implementation.

On Fri, Mar 16, 2012 at 6:38 PM, Sam Overton  wrote:
> Hello cassandra-dev,
>
> This is a long email. It concerns a significant change to Cassandra, so
> deserves a thorough introduction.
>
> *The summary is*: we believe virtual nodes are the way forward. We would
> like to add virtual nodes to Cassandra and we are asking for comments,
> criticism and collaboration!
>
> Cassandra's current partitioning scheme is sub-optimal for bootstrap,
> decommission, repair and re-balance operations, and places the burden on
> users to properly calculate tokens (a common cause of mistakes), which is a
> recurring pain-point.
>
> Virtual nodes have a variety of benefits over the one-to-one mapping of
> host to key range which Cassandra currently supports.
>
> Among these benefits are:
>
> * Even load balancing when growing and shrinking the cluster
> A virtual node scheme ensures that all hosts in a cluster have an even
> portion of the total data, and a new node bootstrapped into the cluster
> will assume its share of the data. Doubling, or halving the cluster to
> ensure even load distribution would no longer be necessary.
>
> * Distributed rebuild
> When sizing a cluster, one of the considerations is the amount of time
> required to recover from a failed node. This is the exposure time, during
> which a secondary failure could cause data loss. In order to guarantee an
> upper bound on the exposure time, the amount of data which can be stored on
> each host is limited by the amount of time taken to recover the required
> replica count. At Acunu we have found that the exposure time is frequently
> the limiting factor which dictates the maximum allowed node size in
> customers' clusters.
>
> Using a virtual node scheme, the data stored on one host is not replicated
> on just RF-1 other physical hosts. Each virtual node is replicated to RF-1
> other virtual nodes which may be on a different set of physical hosts to
> replicas of other virtual nodes stored on the same host. This means data
> for one host is replicated evenly across the entire cluster.
>
> In the event of a failure then, restoring the replica count can be done in
> a fully distributed way. Each host in the cluster participates in the
> rebuild, drastically reducing the exposure time, allowing more data to be
> stored on a single host while still maintaining an acceptable upper bound
> on the likelihood of secondary failure. This reduces TCO concerns.
>
> * Greater failure tolerance in streaming
> Operations which require streaming of a large range of data, eg. bootstrap,
> decommission, repair, etc. incur a heavy cost if an error (eg. dropped
> network connection) is encountered during the streaming. Currently the
> whole range must be re-streamed, and this could constitute a very large
> amount of data. Virtual nodes reduce the impact of streaming failures,
> since each virtual node is a much smaller range of the key-space, so
> re-streaming a whole virtual node is a much cheaper process.
>
> * Evenly distributed impact of streaming operations
> Streaming operations such as bootstrap, repair, et al. would involve every
> node in the cluster. This would distribute the load of these operations
> across the whole cluster, and could be staggered so that only a small
> subset of nodes were affected at once, similar to staggered repair[1].
>
> * Possibility for active load balancing
> Load balancing in Cassandra currently involves moving a token to
> increase/reduce the amount of key-space for which a host is responsible.
> This only allows load balancing between neighbouring nodes, so it could
> involve moving more than one token just to redistribute a single overloaded
> node. Virtual nodes could allow load balancing on a much finer granularity,
> so heavily loaded portions of the key-space could be redistributed to
> lighter-loaded hosts by reassigning one or more virtual nodes.
>
>
> Implementing a virtual node scheme in Cassandra is not an insignificant
> amount of work, and it will touch a large amount of the codebase related to
> partitioning, placement, routing, gossip, and so on. We do believe that
> this is possible to do incrementally, and in such a way that there is an
> easy upgrade path for pre-virtual-node deployments.
>
> It would not however touch the storage layer. The virtual node concept is
> solely for partitioning and placement, not for segregating the data storage
> of the host, so all keys for all virtual nodes on a host would be stored in
> the same SSTables.
>
> We are not proposing the adoption of the same sc

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Zhu Han
On Tue, Mar 20, 2012 at 11:24 PM, Jeremiah Jordan <
jeremiah.jor...@morningstar.com> wrote:

> So taking a step back, if we want "vnodes" why can't we just give every
> node 100 tokens instead of only one?  Seems to me this would have less
> impact on the rest of the code.  It would just look like you had a 500 node
> cluster, instead of a 5 node cluster.  Your replication strategy would have
> to know about the physical machines so that data gets replicated right, but
> there is already some concept of this with the data center aware and rack
> aware stuff.
>
> From what I see I think you could get most of the benefits of vnodes by
> implementing a new Placement Strategy that did something like this, and you
> wouldn't have to touch (and maybe break) code in other places.
>
> Am I crazy? Naive?
>
> Once you had this setup, you can start to implement the vnode like stuff
> on top of it.  Like bootstrapping nodes in one token at a time, and taking
> them on from the whole cluster, not just your neighbor. etc. etc.
>

I second it.

Are there goals we missed which cannot be achieved by assigning
multiple tokens to a single node?
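A toy sketch of Jeremiah's proposal, with hypothetical helper names;
the host-skipping walk below stands in for the rack/host-aware
placement strategy he mentions:

```python
import random
from bisect import bisect_right

def build_ring(hosts, tokens_per_host, seed=42):
    # Give each physical host many random tokens, so a 5-host cluster
    # looks like a 500-token ring to the partitioner.
    rng = random.Random(seed)
    ring = [(rng.randrange(2 ** 32), h)
            for h in hosts for _ in range(tokens_per_host)]
    return sorted(ring)

def replicas(key_token, ring, rf):
    # Walk clockwise from the key's position, skipping tokens whose
    # physical host is already a replica -- this is why the placement
    # strategy has to know about physical machines.
    idx = bisect_right([t for t, _ in ring], key_token)
    chosen = []
    for step in range(len(ring)):
        host = ring[(idx + step) % len(ring)][1]
        if host not in chosen:
            chosen.append(host)
            if len(chosen) == rf:
                break
    return chosen
```

Because replicas are chosen per token rather than per host, a failed
host's data gets rebuilt from many donors, which is the main vnode
benefit this scheme would capture.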


>
> -Jeremiah Jordan
>
> 
> From: Rick Branson [rbran...@datastax.com]
> Sent: Monday, March 19, 2012 5:16 PM
> To: dev@cassandra.apache.org
> Subject: Re: RFC: Cassandra Virtual Nodes
>
> I think if we could go back and rebuild Cassandra from scratch, vnodes
> would likely be implemented from the beginning. However, I'm concerned that
> implementing them now could be a big distraction from more productive uses
> of all of our time and introduce major potential stability issues into what
> is becoming a business critical piece of infrastructure for many people.
> However, instead of just complaining and pedantry, I'd like to offer a
> feasible alternative:
>
> Has there been consideration given to the idea of a supporting a single
> token range for a node?
>
> While not theoretically as capable as vnodes, it seems to me to be more
> practical as it would have a significantly lower impact on the codebase and
> provides a much clearer migration path. It also seems to solve a majority
> of complaints regarding operational issues with Cassandra clusters.
>
> Each node would have a lower and an upper token, which would form a range
> that would be actively distributed via gossip. Read and replication
> requests would only be routed to a replica when the key of these operations
> matched the replica's token range in the gossip tables. Each node would
> locally store it's own current active token range as well as a target token
> range it's "moving" towards.
>
> As a new node undergoes bootstrap, the bounds would be gradually expanded
> to allow it to handle requests for a wider range of the keyspace as it
> moves towards it's target token range. This idea boils down to a move from
> hard cutovers to smoother operations by gradually adjusting active token
> ranges over a period of time. It would apply to token change operations
> (nodetool 'move' and 'removetoken') as well.
>
> Failure during streaming could be recovered at the bounds instead of
> restarting the whole process as the active bounds would effectively track
> the progress for bootstrap & target token changes. Implicitly these
> operations would be throttled to some degree. Node repair (AES) could also
> be modified using the same overall ideas provide a more gradual impact on
> the cluster overall similar as the ideas given in CASSANDRA-3721.
>
> While this doesn't spread the load over the cluster for these operations
> evenly like vnodes does, this is likely an issue that could be worked
> around by performing concurrent (throttled) bootstrap & node repair (AES)
> operations. It does allow some kind of "active" load balancing, but clearly
> this is not as flexible or as useful as vnodes, but you should be using
> RandomPartitioner or sort-of-randomized keys with OPP right? ;)
>
> As a side note: vnodes fail to provide solutions to node-based limitations
> that seem to me to cause a substantial portion of operational issues such
> as impact of node restarts / upgrades, GC and compaction induced latency. I
> think some progress could be made here by allowing a "pack" of independent
> Cassandra nodes to be ran on a single host; somewhat (but nowhere near
> entirely) similar to a pre-fork model used by some UNIX-based servers.
>
> Input?
>
> --
> Rick Branson
> DataStax
>