Re: RFC: Cassandra Virtual Nodes
I don't like that every node will have the same portion of data.

1. We are using nodes with different HW sizes (number of disks).
2. Especially with the ordered partitioner there tend to be hotspots, and you must assign a smaller portion of data to the nodes holding hotspots.
Re: RFC: Cassandra Virtual Nodes
On 17 March 2012 11:15, Radim Kolar wrote:
> I don't like that every node will have the same portion of data.
>
> 1. We are using nodes with different HW sizes (number of disks).
> 2. Especially with the ordered partitioner there tend to be hotspots, and
> you must assign a smaller portion of data to the nodes holding hotspots.

Hi Radim,

The number of virtual nodes for each host would be configurable by the user, in much the same way that initial_token is configurable now. A host taking a larger number of virtual nodes (tokens) would have proportionately more of the data. This is how we anticipate supporting heterogeneity in cluster hardware.

Sam
--
Sam Overton
Acunu | http://www.acunu.com | @acunu
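As a concrete illustration of the proportional ownership Sam describes, here is a minimal sketch. The host names, token counts, and the num_tokens-style setting are hypothetical; only the arithmetic (a host's expected share is its token count divided by the cluster's total) follows from the post, and it assumes tokens are assigned uniformly at random.

import java.util.LinkedHashMap;
import java.util.Map;

public class OwnershipSketch {
    public static void main(String[] args) {
        // Hypothetical per-host virtual node (token) counts, e.g. set via a
        // num_tokens-style configuration option. Bigger hardware gets more
        // tokens and therefore a proportionately larger share of the data.
        Map<String, Integer> tokensPerHost = new LinkedHashMap<>();
        tokensPerHost.put("small-host", 64);
        tokensPerHost.put("medium-host", 128);
        tokensPerHost.put("big-host", 256);

        int total = tokensPerHost.values().stream().mapToInt(Integer::intValue).sum();
        for (Map.Entry<String, Integer> e : tokensPerHost.entrySet()) {
            // Expected fraction of the key-space owned by this host, assuming
            // its tokens are placed uniformly at random on the ring.
            double share = (double) e.getValue() / total;
            System.out.printf("%s: %d tokens -> %.1f%% of the data%n",
                    e.getKey(), e.getValue(), share * 100);
        }
    }
}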
Re: RFC: Cassandra Virtual Nodes
On Sat, Mar 17, 2012 at 7:38 AM, Sam Overton wrote:
> Hello cassandra-dev,
>
> This is a long email. It concerns a significant change to Cassandra, so
> deserves a thorough introduction.
>
> *The summary is*: we believe virtual nodes are the way forward. We would
> like to add virtual nodes to Cassandra and we are asking for comments,
> criticism and collaboration!
>
> Cassandra's current partitioning scheme is sub-optimal for bootstrap,
> decommission, repair and re-balance operations, and places the burden on
> users to properly calculate tokens (a common cause of mistakes), which is a
> recurring pain-point.
>
> Virtual nodes have a variety of benefits over the one-to-one mapping of
> host to key range which Cassandra currently supports.
>
> Among these benefits are:
>
> * Even load balancing when growing and shrinking the cluster
> A virtual node scheme ensures that all hosts in a cluster have an even
> portion of the total data, and a new node bootstrapped into the cluster
> will assume its share of the data. Doubling or halving the cluster to
> ensure even load distribution would no longer be necessary.
>
> * Distributed rebuild
> When sizing a cluster, one of the considerations is the amount of time
> required to recover from a failed node. This is the exposure time, during
> which a secondary failure could cause data loss. In order to guarantee an
> upper bound on the exposure time, the amount of data which can be stored on
> each host is limited by the amount of time taken to recover the required
> replica count. At Acunu we have found that the exposure time is frequently
> the limiting factor which dictates the maximum allowed node size in
> customers' clusters.
>
> Using a virtual node scheme, the data stored on one host is not replicated
> on just RF-1 other physical hosts. Each virtual node is replicated to RF-1
> other virtual nodes, which may be on a different set of physical hosts to
> replicas of other virtual nodes stored on the same host. This means data
> for one host is replicated evenly across the entire cluster.
>
> In the event of a failure, then, restoring the replica count can be done in
> a fully distributed way. Each host in the cluster participates in the
> rebuild, drastically reducing the exposure time and allowing more data to be
> stored on a single host while still maintaining an acceptable upper bound
> on the likelihood of secondary failure. This reduces TCO concerns.
>
> * Greater failure tolerance in streaming
> Operations which require streaming of a large range of data, e.g. bootstrap,
> decommission, repair, etc., incur a heavy cost if an error (e.g. a dropped
> network connection) is encountered during the streaming. Currently the
> whole range must be re-streamed, and this could constitute a very large
> amount of data. Virtual nodes reduce the impact of streaming failures,
> since each virtual node is a much smaller range of the key-space, so
> re-streaming a whole virtual node is a much cheaper process.
>
> * Evenly distributed impact of streaming operations
> Streaming operations such as bootstrap, repair, et al. would involve every
> node in the cluster. This would distribute the load of these operations
> across the whole cluster, and could be staggered so that only a small
> subset of nodes were affected at once, similar to staggered repair[1].
>
> * Possibility for active load balancing
> Load balancing in Cassandra currently involves moving a token to
> increase/reduce the amount of key-space for which a host is responsible.
> This only allows load balancing between neighbouring nodes, so it could
> involve moving more than one token just to redistribute a single overloaded
> node. Virtual nodes could allow load balancing at a much finer granularity,
> so heavily loaded portions of the key-space could be redistributed to
> more lightly loaded hosts by reassigning one or more virtual nodes.
>
> Implementing a virtual node scheme in Cassandra is not an insignificant
> amount of work, and it will touch a large amount of the codebase related to
> partitioning, placement, routing, gossip, and so on. We do believe that
> this is possible to do incrementally, and in such a way that there is an
> easy upgrade path for pre-virtual-node deployments.
>
> It would not, however, touch the storage layer. The virtual node concept is
> solely for partitioning and placement, not for segregating the data storage
> of the host, so all keys for all virtual nodes on a host would be stored in
> the same SSTables.
>
> We are not proposing the adoption of the same scheme used by Voldemort[2]
> and described in the Dynamo paper[3]. We feel this scheme is too different
> from Cassandra's current distribution model to be a viable target for
> incremental development. Their scheme also fixes the number of virtual
> nodes for the lifetime of the cluster, which can prove to be a ceiling to
> scaling the cluster if the virtual nodes grow too large.
>
> The proposed design is:
> * Assign each host T
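To make the placement and distributed-rebuild points above concrete, here is a minimal sketch, not code from the proposal: it assigns each host some number of random tokens and walks the ring to find the distinct physical hosts replicating a key. All class and method names are invented, and real placement would also have to respect topology; the point is only that consecutive vnodes usually belong to different hosts, so one host's data spreads across many others.

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.TreeMap;

public class VnodeRingSketch {
    // token -> owning physical host
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final Random random = new Random(42);

    // Assign T randomly chosen tokens to a host; T could differ per host
    // to support heterogeneous hardware.
    void addHost(String host, int numTokens) {
        for (int i = 0; i < numTokens; i++) {
            ring.put(random.nextLong(), host);
        }
    }

    // Walk the ring clockwise from the key's token, collecting the first
    // RF *distinct* physical hosts. With many vnodes per host, the replicas
    // of a failed host's ranges end up scattered over the whole cluster,
    // which is what enables a fully distributed rebuild.
    List<String> replicasFor(long keyToken, int rf) {
        Set<String> replicas = new LinkedHashSet<>();
        for (Long t : ring.tailMap(keyToken).keySet()) {
            replicas.add(ring.get(t));
            if (replicas.size() == rf) return new ArrayList<>(replicas);
        }
        for (Long t : ring.headMap(keyToken).keySet()) { // wrap around
            replicas.add(ring.get(t));
            if (replicas.size() == rf) break;
        }
        return new ArrayList<>(replicas);
    }

    public static void main(String[] args) {
        VnodeRingSketch sketch = new VnodeRingSketch();
        for (int h = 1; h <= 6; h++) sketch.addHost("host" + h, 8);
        System.out.println(sketch.replicasFor(new Random().nextLong(), 3));
    }
}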
Re: RFC: Cassandra Virtual Nodes
On Sat, Mar 17, 2012 at 11:15 AM, Radim Kolar wrote:
> I don't like that every node will have the same portion of data.
>
> 1. We are using nodes with different HW sizes (number of disks).
> 2. Especially with the ordered partitioner there tend to be hotspots, and
> you must assign a smaller portion of data to the nodes holding hotspots.

Yes, these are exactly the sorts of problems that virtual nodes are meant to make easier.

--
Eric Evans
Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
I agree that having smaller regions would help the rebalancing situation, both with RP and BOP. However, I am not sure that dividing tables across disks will give any better performance. You would have more seeking spindles, and could possibly subdivide token ranges into separate files, but the fs cache will get shared across all disks, so that is a wash.

On Saturday, March 17, 2012, Eric Evans wrote:
> On Sat, Mar 17, 2012 at 11:15 AM, Radim Kolar wrote:
>> I don't like that every node will have the same portion of data.
>>
>> 1. We are using nodes with different HW sizes (number of disks).
>> 2. Especially with the ordered partitioner there tend to be hotspots,
>> and you must assign a smaller portion of data to the nodes holding
>> hotspots.
>
> Yes, these are exactly the sorts of problems that virtual nodes are
> meant to make easier.
>
> --
> Eric Evans
> Acunu | http://www.acunu.com | @acunu
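For what subdividing token ranges into separate files could look like, here is a minimal sketch, not anything proposed in the thread: each virtual node's range is deterministically mapped to one data directory (one per disk), so all SSTables for that range live on one spindle. All paths and names are hypothetical, and whether this helps is, as noted above, unclear once the fs cache is shared.

import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class DiskAssignmentSketch {
    // Hypothetical data directories, one per physical disk.
    static final List<Path> DATA_DIRS = List.of(
            Paths.get("/data/disk1"),
            Paths.get("/data/disk2"),
            Paths.get("/data/disk3"));

    // Deterministically map a virtual node's range (identified here by its
    // starting token) to a directory, so its files stay on one disk.
    static Path dirForRange(long rangeStartToken) {
        int idx = Math.floorMod(Long.hashCode(rangeStartToken), DATA_DIRS.size());
        return DATA_DIRS.get(idx);
    }

    public static void main(String[] args) {
        System.out.println(dirForRange(-3_074_457_345_618_258_602L));
        System.out.println(dirForRange(42L));
    }
}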
Re: RFC: Cassandra Virtual Nodes
On Sat, Mar 17, 2012 at 3:22 PM, Zhu Han wrote:
> On Sat, Mar 17, 2012 at 7:38 AM, Sam Overton wrote:
>> This is a long email. It concerns a significant change to Cassandra, so
>> deserves a thorough introduction.
>>
>> *The summary is*: we believe virtual nodes are the way forward. We would
>> like to add virtual nodes to Cassandra and we are asking for comments,
>> criticism and collaboration!
>>
>> [rest of the original proposal quoted verbatim; snipped, see above]
Re: RFC: Cassandra Virtual Nodes
> *The summary is*: we believe virtual nodes are the way forward. We would
> like to add virtual nodes to Cassandra and we are asking for comments,
> criticism and collaboration!

I am very happy to see some momentum on this, and I would like to go even further than what you propose. The main reasons why I do not think simply adding vnodes and making random assignments is the best end goal are:

(1) The probability of data loss increases linearly with cluster size.
(2) It does not take network topology into account.

What follows is mostly a half-finished long text that I have been sitting on for a few months but never finished/posted. Unfortunately, due to time constraints, I do not have the possibility to go through everything in detail, update it with current information, and specifically address what you have already said, so there will be overlap with your original post. However, given your e-mail and the momentum in this thread, I really wanted to post something rather than nothing.

It would be awesome if interested parties had a chance to read the referenced CRUSH paper, and the deltas proposed below. The goals are to address everything you already wanted in your post, while also addressing:

(1) Probability of data loss
(2) Network topology awareness

The following text will first try to paint a picture of the goals that I have in mind, and then go on to the actual proposed solution. The proposition is very short and undetailed for now, and there is plenty of discussion and detail to fill in. I apologize, but again, I really want to post something now that this is being brought up.

BEGIN un-polished text ("we" = "I"):

= CRUSHing Cassandra

Author: Peter Schuller

This is a proposal for a significant re-design of some fundamentals of Cassandra, aimed at addressing a number of current issues as well as anticipating future issues. It is particularly aimed at large clusters, but as a side-effect it should improve the small-cluster experience as well.

== New terminology: Distribution factor

A Cassandra cluster is today said to have `N` nodes, and data is replicated at a particular replication factor (`RF`). The placement of replicas is such that all rows that have a certain node `N1` as their primary replica are located on a specific set of `RF-1` other nodes. In addition, `N1` holds secondary replicas of data for `RF-1` other nodes. In total, it shares data with `2RF - 2` other nodes. The number of nodes with which a node shares data is the distribution factor; in the case of Cassandra, `DF = 2RF - 2`.

== Goals

The goals this suggestion attempts to help solve include the following.

=== Goal: DF should not be tied to RF, nor N

`DF` is important for these reasons:

* The `DF` determines how many nodes are involved in re-constructing a lost node after failure; the higher the `DF`, the less of a performance impact a reconstruction has on the remaining nodes.
* The `DF` determines the read/write load impact on other nodes when a node is down.
* The `DF` affects the probability of multiple failures causing data loss, since one loses data if any `RF` nodes within a group of `DF` nodes all go down.

Having `DF` tied to `RF` as Cassandra does now has its problems. A single node failure has a significant effect on the performance characteristics of neighboring nodes (relative to the normal level of load on those neighbors). On large data sets, a failed node needing reconstruction is a significant event, as it:

* Increases the load on neighbors just from going down.
* Further increases the load on neighbors as they have to stream data (adding I/O and cache thrashing). This typically leads to the desire to throttle / rate-limit reconstruction, which lengthens the reconstruction window on top of it already being naturally bottlenecked on the neighbors.

The other extreme is to tie `DF` to `N`, such that the data contained on one node has its secondary replicas spread out over the entire ring. This is an unacceptable choice because the probability of multiple failures causing data loss increases linearly with the cluster size.

In other words, we want `DF` to be tied to neither `RF` nor `N`. Rather, `DF` should be chosen as a trade-off between its effects:

* The higher the `DF`, the higher the probability of data loss in case of multiple failures.
* The higher the `DF`, the faster a lost node can be reconstructed/replaced.
* The higher the `DF`, the smaller the impact of a node failure on the performance requirements of the other nodes.

In making this determination, one must take into account that if a larger `DF` makes reconstruction/replacement significantly faster, it also decreases the time window in which multiple failures can occur. Increasing `DF` is thus not *necessarily* increasing the total probability of data loss (for small values of `DF`).

=== Goal: Topologically aware redundancy

We maintain the goal of being topology aware for the purpose of ensuring that we p
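To make the DF trade-off above concrete, here is a Monte Carlo sketch, not code from this proposal: it compares the probability of losing some range when replicas are scattered across the whole ring (DF growing with N) versus placed on RF consecutive ring neighbours (DF = 2RF - 2, roughly today's placement). The failure model (independent node failures within one exposure window), the parameter values, and all names are invented for illustration.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class DataLossSketch {
    // Build the replica groups of a hypothetical cluster: n nodes, each
    // holding `vnodes` ranges, each range replicated on `rf` nodes.
    // spread=true picks rf random distinct nodes per range (DF ~ N);
    // spread=false uses rf consecutive nodes on the ring (DF = 2*RF - 2).
    static List<Set<Integer>> replicaGroups(int n, int vnodes, int rf, boolean spread, Random rnd) {
        List<Set<Integer>> groups = new ArrayList<>();
        for (int r = 0; r < n * vnodes; r++) {
            Set<Integer> g = new HashSet<>();
            if (spread) {
                while (g.size() < rf) g.add(rnd.nextInt(n));
            } else {
                int start = rnd.nextInt(n);
                for (int i = 0; i < rf; i++) g.add((start + i) % n);
            }
            groups.add(g);
        }
        return groups;
    }

    // Estimate P(data loss): each node fails independently with probability
    // p during the exposure window; data is lost if some replica group lies
    // entirely within the failed set.
    static double lossProbability(List<Set<Integer>> groups, int n, double p, int trials, Random rnd) {
        int losses = 0;
        for (int t = 0; t < trials; t++) {
            Set<Integer> failed = new HashSet<>();
            for (int i = 0; i < n; i++) if (rnd.nextDouble() < p) failed.add(i);
            for (Set<Integer> g : groups) {
                if (failed.containsAll(g)) { losses++; break; }
            }
        }
        return (double) losses / trials;
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        int vnodes = 16, rf = 3, trials = 10_000;
        double p = 0.05; // per-node failure probability within the window
        for (int n : new int[]{12, 48, 192}) {
            double spread = lossProbability(replicaGroups(n, vnodes, rf, true, rnd), n, p, trials, rnd);
            double chained = lossProbability(replicaGroups(n, vnodes, rf, false, rnd), n, p, trials, rnd);
            System.out.printf("N=%3d  DF~N: %.4f   DF=2RF-2: %.4f%n", n, spread, chained);
        }
    }
}

Under this simplistic model, the scattered placement loses some range far more often than the neighbour placement at every cluster size, because it creates many more distinct replica sets whose joint failure is fatal; that is the cost traded against its faster, more distributed rebuild, and it ignores the shortened exposure window noted above.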
Re: RFC: Cassandra Virtual Nodes
Point of clarification: my use of the term "bucket" is completely unrelated to the term "bucket" used in the CRUSH paper.

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)