Re: RFC: Cassandra Virtual Nodes
I don't like that every node will have the same portion of data.

1. We are using nodes with different HW sizes (number of disks).
2. Especially with the ordered partitioner there tend to be hotspots, and you must assign a smaller portion of data to the nodes holding hotspots.
Re: RFC: Cassandra Virtual Nodes
On 17 March 2012 11:15, Radim Kolar wrote:
> I don't like that every node will have the same portion of data.
>
> 1. We are using nodes with different HW sizes (number of disks).
> 2. Especially with the ordered partitioner there tend to be hotspots, and
> you must assign a smaller portion of data to the nodes holding hotspots.

Hi Radim,

The number of virtual nodes for each host would be configurable by the user, in much the same way that initial_token is configurable now. A host taking a larger number of virtual nodes (tokens) would have proportionately more of the data. This is how we anticipate supporting heterogeneity in cluster hardware.

Sam
--
Sam Overton
Acunu | http://www.acunu.com | @acunu
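As a concrete illustration of the proportional ownership Sam describes, here is a minimal sketch. The host names, token counts, and the num_tokens-style setting are hypothetical; only the arithmetic (a host's expected share is its token count divided by the cluster's total) follows from the post, and it assumes tokens are assigned uniformly at random.

import java.util.LinkedHashMap;
import java.util.Map;

public class OwnershipSketch {
    public static void main(String[] args) {
        // Hypothetical per-host virtual node (token) counts, e.g. set via a
        // num_tokens-style configuration option. Bigger hardware gets more
        // tokens and therefore a proportionately larger share of the data.
        Map<String, Integer> tokensPerHost = new LinkedHashMap<>();
        tokensPerHost.put("small-host", 64);
        tokensPerHost.put("medium-host", 128);
        tokensPerHost.put("big-host", 256);

        int total = tokensPerHost.values().stream().mapToInt(Integer::intValue).sum();
        for (Map.Entry<String, Integer> e : tokensPerHost.entrySet()) {
            // Expected fraction of the key-space owned by this host, assuming
            // its tokens are placed uniformly at random on the ring.
            double share = (double) e.getValue() / total;
            System.out.printf("%s: %d tokens -> %.1f%% of the data%n",
                    e.getKey(), e.getValue(), share * 100);
        }
    }
}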
Re: RFC: Cassandra Virtual Nodes
On Sat, Mar 17, 2012 at 7:38 AM, Sam Overton wrote:
> Hello cassandra-dev,
>
> This is a long email. It concerns a significant change to Cassandra, so
> deserves a thorough introduction.
>
> *The summary is*: we believe virtual nodes are the way forward. We would
> like to add virtual nodes to Cassandra and we are asking for comments,
> criticism and collaboration!
>
> Cassandra's current partitioning scheme is sub-optimal for bootstrap,
> decommission, repair and re-balance operations, and places the burden on
> users to properly calculate tokens (a common cause of mistakes), which is a
> recurring pain-point.
>
> Virtual nodes have a variety of benefits over the one-to-one mapping of
> host to key range which Cassandra currently supports.
>
> Among these benefits are:
>
> * Even load balancing when growing and shrinking the cluster
> A virtual node scheme ensures that all hosts in a cluster have an even
> portion of the total data, and a new node bootstrapped into the cluster
> will assume its share of the data. Doubling or halving the cluster to
> ensure even load distribution would no longer be necessary.
>
> * Distributed rebuild
> When sizing a cluster, one of the considerations is the amount of time
> required to recover from a failed node. This is the exposure time, during
> which a secondary failure could cause data loss. In order to guarantee an
> upper bound on the exposure time, the amount of data which can be stored on
> each host is limited by the amount of time taken to recover the required
> replica count. At Acunu we have found that the exposure time is frequently
> the limiting factor which dictates the maximum allowed node size in
> customers' clusters.
>
> Using a virtual node scheme, the data stored on one host is not replicated
> on just RF-1 other physical hosts. Each virtual node is replicated to RF-1
> other virtual nodes, which may be on a different set of physical hosts to
> replicas of other virtual nodes stored on the same host. This means data
> for one host is replicated evenly across the entire cluster.
>
> In the event of a failure, then, restoring the replica count can be done in
> a fully distributed way. Each host in the cluster participates in the
> rebuild, drastically reducing the exposure time and allowing more data to be
> stored on a single host while still maintaining an acceptable upper bound
> on the likelihood of secondary failure. This reduces TCO concerns.
>
> * Greater failure tolerance in streaming
> Operations which require streaming of a large range of data, e.g. bootstrap,
> decommission, repair, etc., incur a heavy cost if an error (e.g. a dropped
> network connection) is encountered during the streaming. Currently the
> whole range must be re-streamed, and this could constitute a very large
> amount of data. Virtual nodes reduce the impact of streaming failures,
> since each virtual node is a much smaller range of the key-space, so
> re-streaming a whole virtual node is a much cheaper process.
>
> * Evenly distributed impact of streaming operations
> Streaming operations such as bootstrap, repair, et al. would involve every
> node in the cluster. This would distribute the load of these operations
> across the whole cluster, and could be staggered so that only a small
> subset of nodes were affected at once, similar to staggered repair[1].
>
> * Possibility for active load balancing
> Load balancing in Cassandra currently involves moving a token to
> increase/reduce the amount of key-space for which a host is responsible.
> This only allows load balancing between neighbouring nodes, so it could
> involve moving more than one token just to redistribute a single overloaded
> node. Virtual nodes could allow load balancing at a much finer granularity,
> so heavily loaded portions of the key-space could be redistributed to
> more lightly loaded hosts by reassigning one or more virtual nodes.
>
> Implementing a virtual node scheme in Cassandra is not an insignificant
> amount of work, and it will touch a large amount of the codebase related to
> partitioning, placement, routing, gossip, and so on. We do believe that
> this is possible to do incrementally, and in such a way that there is an
> easy upgrade path for pre-virtual-node deployments.
>
> It would not, however, touch the storage layer. The virtual node concept is
> solely for partitioning and placement, not for segregating the data storage
> of the host, so all keys for all virtual nodes on a host would be stored in
> the same SSTables.
>
> We are not proposing the adoption of the same scheme used by Voldemort[2]
> and described in the Dynamo paper[3]. We feel this scheme is too different
> from Cassandra's current distribution model to be a viable target for
> incremental development. Their scheme also fixes the number of virtual
> nodes for the lifetime of the cluster, which can prove to be a ceiling to
> scaling the cluster if the virtual nodes grow too large.
>
> The proposed design is:
> * Assign each host T
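To make the placement and distributed-rebuild points above concrete, here is a minimal sketch, not code from the proposal: it assigns each host some number of random tokens and walks the ring to find the distinct physical hosts replicating a key. All class and method names are invented, and real placement would also have to respect topology; the point is only that consecutive vnodes usually belong to different hosts, so one host's data spreads across many others.

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.TreeMap;

public class VnodeRingSketch {
    // token -> owning physical host
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final Random random = new Random(42);

    // Assign T randomly chosen tokens to a host; T could differ per host
    // to support heterogeneous hardware.
    void addHost(String host, int numTokens) {
        for (int i = 0; i < numTokens; i++) {
            ring.put(random.nextLong(), host);
        }
    }

    // Walk the ring clockwise from the key's token, collecting the first
    // RF *distinct* physical hosts. With many vnodes per host, the replicas
    // of a failed host's ranges end up scattered over the whole cluster,
    // which is what enables a fully distributed rebuild.
    List<String> replicasFor(long keyToken, int rf) {
        Set<String> replicas = new LinkedHashSet<>();
        for (Long t : ring.tailMap(keyToken).keySet()) {
            replicas.add(ring.get(t));
            if (replicas.size() == rf) return new ArrayList<>(replicas);
        }
        for (Long t : ring.headMap(keyToken).keySet()) { // wrap around
            replicas.add(ring.get(t));
            if (replicas.size() == rf) break;
        }
        return new ArrayList<>(replicas);
    }

    public static void main(String[] args) {
        VnodeRingSketch sketch = new VnodeRingSketch();
        for (int h = 1; h <= 6; h++) sketch.addHost("host" + h, 8);
        System.out.println(sketch.replicasFor(new Random().nextLong(), 3));
    }
}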
Re: RFC: Cassandra Virtual Nodes
On Sat, Mar 17, 2012 at 11:15 AM, Radim Kolar wrote:
> I don't like that every node will have the same portion of data.
>
> 1. We are using nodes with different HW sizes (number of disks).
> 2. Especially with the ordered partitioner there tend to be hotspots, and
> you must assign a smaller portion of data to the nodes holding hotspots.

Yes, these are exactly the sorts of problems that virtual nodes are meant to make easier.

--
Eric Evans
Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
I agree that having smaller regions would help the rebalancing situation, both with RP and BOP. However, I am not sure that dividing tables across disks will give any better performance. You would have more seeking spindles, and could possibly subdivide token ranges into separate files, but the fs cache will get shared across all disks, so that is a wash.

On Saturday, March 17, 2012, Eric Evans wrote:
> On Sat, Mar 17, 2012 at 11:15 AM, Radim Kolar wrote:
>> I don't like that every node will have the same portion of data.
>>
>> 1. We are using nodes with different HW sizes (number of disks).
>> 2. Especially with the ordered partitioner there tend to be hotspots,
>> and you must assign a smaller portion of data to the nodes holding
>> hotspots.
>
> Yes, these are exactly the sorts of problems that virtual nodes are
> meant to make easier.
>
> --
> Eric Evans
> Acunu | http://www.acunu.com | @acunu
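For what subdividing token ranges into separate files could look like, here is a minimal sketch, not anything proposed in the thread: each virtual node's range is deterministically mapped to one data directory (one per disk), so all SSTables for that range live on one spindle. All paths and names are hypothetical, and whether this helps is, as noted above, unclear once the fs cache is shared.

import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class DiskAssignmentSketch {
    // Hypothetical data directories, one per physical disk.
    static final List<Path> DATA_DIRS = List.of(
            Paths.get("/data/disk1"),
            Paths.get("/data/disk2"),
            Paths.get("/data/disk3"));

    // Deterministically map a virtual node's range (identified here by its
    // starting token) to a directory, so its files stay on one disk.
    static Path dirForRange(long rangeStartToken) {
        int idx = Math.floorMod(Long.hashCode(rangeStartToken), DATA_DIRS.size());
        return DATA_DIRS.get(idx);
    }

    public static void main(String[] args) {
        System.out.println(dirForRange(-3_074_457_345_618_258_602L));
        System.out.println(dirForRange(42L));
    }
}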
Re: RFC: Cassandra Virtual Nodes
On Sat, Mar 17, 2012 at 3:22 PM, Zhu Han wrote:
> On Sat, Mar 17, 2012 at 7:38 AM, Sam Overton wrote:
>> This is a long email. It concerns a significant change to Cassandra, so
>> deserves a thorough introduction.
>>
>> *The summary is*: we believe virtual nodes are the way forward. We would
>> like to add virtual nodes to Cassandra and we are asking for comments,
>> criticism and collaboration!
>>
>> [rest of the original proposal quoted verbatim; snipped, see above]
Re: RFC: Cassandra Virtual Nodes
> *The summary is*: we believe virtual nodes are the way forward. We would
> like to add virtual nodes to Cassandra and we are asking for comments,
> criticism and collaboration!

I am very happy to see some momentum on this, and I would like to go even further than what you propose. The main reasons why I do not think simply adding vnodes and making random assignments is the best end goal are:

(1) The probability of data loss increases linearly with cluster size.
(2) It does not take network topology into account.

What follows is mostly a half-finished long text that I have been sitting on for a few months but never finished/posted. Unfortunately, due to time constraints, I do not have the possibility to go through everything in detail, update it with current information, and specifically address what you have already said, so there will be overlap with your original post. However, given your e-mail and the momentum in this thread, I really wanted to post something rather than nothing.

It would be awesome if interested parties had a chance to read the referenced CRUSH paper, and the deltas proposed below. The goals are to address everything you already wanted in your post, while also addressing:

(1) Probability of data loss
(2) Network topology awareness

The following text will first try to paint a picture of the goals that I have in mind, and then go on to the actual proposed solution. The proposition is very short and undetailed for now, and there is plenty of discussion and detail to fill in. I apologize, but again, I really want to post something now that this is being brought up.

BEGIN un-polished text ("we" = "I"):

= CRUSHing Cassandra

Author: Peter Schuller

This is a proposal for a significant re-design of some fundamentals of Cassandra, aimed at addressing a number of current issues as well as anticipating future issues. It is particularly aimed at large clusters, but as a side-effect it should improve the small-cluster experience as well.

== New terminology: Distribution factor

A Cassandra cluster is today said to have `N` nodes, and data is replicated at a particular replication factor (`RF`). The placement of replicas is such that all rows that have a certain node `N1` as their primary replica are located on a specific set of `RF-1` other nodes. In addition, `N1` holds secondary replicas of data for `RF-1` other nodes. In total, it shares data with `2RF - 2` other nodes. The number of nodes with which a node shares data is the distribution factor; in the case of Cassandra, `DF = 2RF - 2`.

== Goals

The goals this suggestion attempts to help solve include the following.

=== Goal: DF should not be tied to RF, nor N

`DF` is important for these reasons:

* The `DF` determines how many nodes are involved in re-constructing a lost node after failure; the higher the `DF`, the less of a performance impact a reconstruction has on the remaining nodes.
* The `DF` determines the read/write load impact on other nodes when a node is down.
* The `DF` affects the probability of multiple failures causing data loss, since one loses data if any `RF` nodes within a group of `DF` nodes all go down.

Having `DF` tied to `RF` as Cassandra does now has its problems. A single node failure has a significant effect on the performance characteristics of neighboring nodes (relative to the normal level of load on those neighbors). On large data sets, a failed node needing reconstruction is a significant event, as it:

* Increases the load on neighbors just from going down.
* Further increases the load on neighbors as they have to stream data (adding I/O and cache thrashing). This typically leads to the desire to throttle / rate-limit reconstruction, which lengthens the reconstruction window on top of it already being naturally bottlenecked on the neighbors.

The other extreme is to tie `DF` to `N`, such that the data contained on one node has its secondary replicas spread out over the entire ring. This is an unacceptable choice because the probability of multiple failures causing data loss increases linearly with the cluster size.

In other words, we want `DF` to be tied to neither `RF` nor `N`. Rather, `DF` should be chosen as a trade-off between its effects:

* The higher the `DF`, the higher the probability of data loss in case of multiple failures.
* The higher the `DF`, the faster a lost node can be reconstructed/replaced.
* The higher the `DF`, the smaller the impact of a node failure on the performance requirements of the other nodes.

In making this determination, one must take into account that if a larger `DF` makes reconstruction/replacement significantly faster, it also decreases the time window in which multiple failures can occur. Increasing `DF` is thus not *necessarily* increasing the total probability of data loss (for small values of `DF`).

=== Goal: Topologically aware redundancy

We maintain the goal of being topology aware for the purpose of ensuring that we p
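To make the DF trade-off above concrete, here is a Monte Carlo sketch, not code from this proposal: it compares the probability of losing some range when replicas are scattered across the whole ring (DF growing with N) versus placed on RF consecutive ring neighbours (DF = 2RF - 2, roughly today's placement). The failure model (independent node failures within one exposure window), the parameter values, and all names are invented for illustration.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class DataLossSketch {
    // Build the replica groups of a hypothetical cluster: n nodes, each
    // holding `vnodes` ranges, each range replicated on `rf` nodes.
    // spread=true picks rf random distinct nodes per range (DF ~ N);
    // spread=false uses rf consecutive nodes on the ring (DF = 2*RF - 2).
    static List<Set<Integer>> replicaGroups(int n, int vnodes, int rf, boolean spread, Random rnd) {
        List<Set<Integer>> groups = new ArrayList<>();
        for (int r = 0; r < n * vnodes; r++) {
            Set<Integer> g = new HashSet<>();
            if (spread) {
                while (g.size() < rf) g.add(rnd.nextInt(n));
            } else {
                int start = rnd.nextInt(n);
                for (int i = 0; i < rf; i++) g.add((start + i) % n);
            }
            groups.add(g);
        }
        return groups;
    }

    // Estimate P(data loss): each node fails independently with probability
    // p during the exposure window; data is lost if some replica group lies
    // entirely within the failed set.
    static double lossProbability(List<Set<Integer>> groups, int n, double p, int trials, Random rnd) {
        int losses = 0;
        for (int t = 0; t < trials; t++) {
            Set<Integer> failed = new HashSet<>();
            for (int i = 0; i < n; i++) if (rnd.nextDouble() < p) failed.add(i);
            for (Set<Integer> g : groups) {
                if (failed.containsAll(g)) { losses++; break; }
            }
        }
        return (double) losses / trials;
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        int vnodes = 16, rf = 3, trials = 10_000;
        double p = 0.05; // per-node failure probability within the window
        for (int n : new int[]{12, 48, 192}) {
            double spread = lossProbability(replicaGroups(n, vnodes, rf, true, rnd), n, p, trials, rnd);
            double chained = lossProbability(replicaGroups(n, vnodes, rf, false, rnd), n, p, trials, rnd);
            System.out.printf("N=%3d  DF~N: %.4f   DF=2RF-2: %.4f%n", n, spread, chained);
        }
    }
}

Under this simplistic model, the scattered placement loses some range far more often than the neighbour placement at every cluster size, because it creates many more distinct replica sets whose joint failure is fatal; that is the cost traded against its faster, more distributed rebuild, and it ignores the shortened exposure window noted above.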
Re: RFC: Cassandra Virtual Nodes
Point of clarification: my use of the term "bucket" is completely unrelated to the term "bucket" used in the CRUSH paper.

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)