There is now a parent ticket for this issue in JIRA:
https://issues.apache.org/jira/browse/CASSANDRA-4119
Comments and contributions are still welcome!
Cheers,
Sam
On 16 March 2012 23:38, Sam Overton wrote:
> Hello cassandra-dev,
>
> This is a long email. It concerns a significant change to Cassandra, so
> deserves a thorough introduction.
>
> The summary is: we believe virtual nodes are the way forward. We would like
> to add virtual nodes to Cassandra and we are asking for comments, criticism
> and collaboration!
>
> Cassandra's current partitioning scheme is sub-optimal for bootstrap,
> decommission, repair and re-balance operations, and places the burden on
> users to properly calculate tokens (a common cause of mistakes), which is a
> recurring pain-point.
>
> Virtual nodes have a variety of benefits over the one-to-one mapping of host
> to key range which Cassandra currently supports.
>
> Among these benefits are:
>
> * Even load balancing when growing and shrinking the cluster
> A virtual node scheme ensures that all hosts in a cluster have an even
> portion of the total data, and a new node bootstrapped into the cluster will
> assume its share of the data. Doubling, or halving the cluster to ensure
> even load distribution would no longer be necessary.
>
> * Distributed rebuild
> When sizing a cluster, one of the considerations is the amount of time
> required to recover from a failed node. This is the exposure time, during
> which a secondary failure could cause data loss. In order to guarantee an
> upper bound on the exposure time, the amount of data which can be stored on
> each host is limited by the amount of time taken to recover the required
> replica count. At Acunu we have found that the exposure time is frequently
> the limiting factor which dictates the maximum allowed node size in
> customers' clusters.
>
> Using a virtual node scheme, the data stored on one host is not replicated
> on just RF-1 other physical hosts. Each virtual node is replicated to RF-1
> other virtual nodes which may be on a different set of physical hosts to
> replicas of other virtual nodes stored on the same host. This means data for
> one host is replicated evenly across the entire cluster.
>
> In the event of a failure then, restoring the replica count can be done in a
> fully distributed way. Each host in the cluster participates in the rebuild,
> drastically reducing the exposure time, allowing more data to be stored on a
> single host while still maintaining an acceptable upper bound on the
> likelihood of secondary failure. This reduces TCO concerns.
>
> * Greater failure tolerance in streaming
> Operations which require streaming of a large range of data, eg. bootstrap,
> decommission, repair, etc. incur a heavy cost if an error (eg. dropped
> network connection) is encountered during the streaming. Currently the whole
> range must be re-streamed, and this could constitute a very large amount of
> data. Virtual nodes reduce the impact of streaming failures, since each
> virtual node is a much smaller range of the key-space, so re-streaming a
> whole virtual node is a much cheaper process.
>
> * Evenly distributed impact of streaming operations
> Streaming operations such as bootstrap, repair, et al. would involve every
> node in the cluster. This would distribute the load of these operations
> across the whole cluster, and could be staggered so that only a small subset
> of nodes were affected at once, similar to staggered repair[1].
>
> * Possibility for active load balancing
> Load balancing in Cassandra currently involves moving a token to
> increase/reduce the amount of key-space for which a host is responsible.
> This only allows load balancing between neighbouring nodes, so it could
> involve moving more than one token just to redistribute a single overloaded
> node. Virtual nodes could allow load balancing on a much finer granularity,
> so heavily loaded portions of the key-space could be redistributed to
> lighter-loaded hosts by reassigning one or more virtual nodes.
>
>
> Implementing a virtual node scheme in Cassandra is not an insignificant
> amount of work, and it will touch a large amount of the codebase related to
> partitioning, placement, routing, gossip, and so on. We do believe that this
> is possible to do incrementally, and in such a way that there is an easy
> upgrade path for pre-virtual-node deployments.
>
> It would not however touch the storage layer. The virtual node concept is
> solely for partitioning and placement, not for segregating the data storage
> of the host, so all keys for all virtual nodes on a host would be stored in
> the same SSTables.
>
> We are not proposing the adoption of the same scheme used by Voldemort[2]
> and described in the Dynamo paper[3]. We feel this scheme is too different
> from Cassandra's current distribution model to be a viable target for
> incremental development. Their scheme also fixes the number of virtual nodes
> for the lifetime of t