Re: RFC: Cassandra Virtual Nodes
> Hi Radim, The number of virtual nodes for each host would be configurable
> by the user, in much the same way that initial_token is configurable now.
> A host taking a larger number of virtual nodes (tokens) would have
> proportionately more of the data. This is how we anticipate support for
> heterogeneity in cluster hardware.

Yes, but this is good only for the random partitioner. For the ordered partitioner you need to be able to split the token space on highly loaded servers. With virtual tokens it will move load to a random node. What if the random node is also a hotspot node? Administration will be more difficult because you don't know where the workload lands after you reduce the number of tokens held by a node.
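The weighted-token idea quoted above can be sketched with a toy consistent-hash ring: each host draws a number of random tokens proportional to its capacity weight, so its share of the data is roughly proportional to its weight. This is a hypothetical Python sketch (the names and the tokens-per-unit constant are illustrative, not Cassandra's actual implementation):

```python
import hashlib
from bisect import bisect

TOKENS_PER_UNIT = 256  # vnodes per unit of capacity (illustrative tunable)

def assign_tokens(hosts):
    """hosts: mapping of host -> relative capacity weight.
    Returns a sorted ring of (token, host) pairs."""
    ring = []
    for host, weight in hosts.items():
        for i in range(int(weight * TOKENS_PER_UNIT)):
            token = int(hashlib.md5(f"{host}:{i}".encode()).hexdigest(), 16)
            ring.append((token, host))
    ring.sort()
    return ring

def owner(ring, key):
    """The host owning `key` is the one holding the first token at or after hash(key)."""
    token = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return ring[bisect(ring, (token,)) % len(ring)][1]

# A host with weight 2 ends up with roughly twice the data of a weight-1 host.
ring = assign_tokens({"big": 2, "small": 1})
counts = {"big": 0, "small": 0}
for k in range(30000):
    counts[owner(ring, f"key{k}")] += 1
```

With enough tokens per host the ratio of keys owned converges on the ratio of weights, which is the behaviour Sam describes for heterogeneous hardware.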
Re: RFC: Cassandra Virtual Nodes
Hi Peter,

It's great to hear that others have come to some of the same conclusions! I think a CRUSH-like strategy for topologically aware replication/routing/locality is a great idea.

I think I can see three mostly orthogonal sets of functionality that we're concerned with:

a) a virtual node partitioning scheme (to support heterogeneity and management simplicity)
b) topology aware replication
c) topology aware routing

First of all, I think that while (c) depends on (b) it does not affect partitioning or replication directly, so I'm going to set that aside for the moment and talk just about the former two.

I'll summarise your design here, mainly to make sure that I understand it, but also to refer back to it:

1. The hash-space is partitioned into a fixed number of partitions.
2. The CRUSH algorithm is run - select(1, disk) - over the topology using each partition as a key, to get an assignment of partition -> physical host (primary).
2a. Adding or removing a node requires re-running CRUSH to recalculate the partition assignment (and move data).
3. The CRUSH algorithm is run - select(RF-1, disk) - over the topology using each primary host id, to get an assignment of primary host -> RF-1 replicas.
3a. Adding or removing a node requires re-running CRUSH to recalculate replica assignment (which might be a different set of hosts to before?).

Here are some thoughts (clarification: when I'm talking about buckets, I'm referring to the same concept as in the CRUSH paper):

One of my concerns about using CRUSH exactly as described in the paper is that it seems to be sub-optimal in the amount of data that it moves after modifying the topology. The authors of the paper introduce several "bucket types" (uniform, list, tree, straw) which appear to be various sub-optimal alternatives to consistent hashing, with various trade-offs. Why not use consistent hashing?
Given (2a) and (3a) I think we might end up moving way too much data when the set of replicas changes completely for a given host.

Let's suppose we introduce our own bucket type called a "ring bucket". Each item in a ring bucket is assigned an equal, non-contiguous portion of the key hash-space, which determines which keys are assigned to it. When an item is added to the ring bucket, it takes an equal portion of the hash-space from every other item already in the bucket. And vice-versa for removals. It's easy to see that this ring bucket implements consistent hashing with some unspecified virtual node scheme. Additions and removals would be optimal (only Δw/W keys require moving when the topology changes).

Using this ring bucket in the CRUSH topology (with the hash function being the identity function) would give the exact same distribution properties as the virtual node strategy that I suggested previously, but of course with much better topology awareness. This makes it evident that the partitioning scheme and a CRUSH-like replication scheme are orthogonal concerns. In the same way as NTS currently uses the ring to provide distribution at DC and rack level by conceptually separating the ring into distinct logical rings for each DC, a CrushReplicationStrategy could use the ring as its bucketing function to distribute partitions in the topology.

This brings me on to (1) and the reasons for our choice of virtual node scheme - choose N random tokens - instead of the Dynamo-like scheme that you suggest, where the partitions are fixed in advance. With the Dynamo scheme, the size of a virtual node partition will only ever grow as more data is inserted. Since the number of partitions is fixed when the cluster is created, the partition size is unbounded. There are certain advantages to having a limit on partition size. Streaming failures that cause retries do not have to resend so much data.
Streaming operations can be staggered in smaller chunks to minimise the impact on the nodes involved. Load balancing can operate on a finer granularity.

In the N-tokens-per-node scheme, adding nodes to the cluster decreases the partition size and so gives some control over how much data is stored in each partition. The average size can be reduced by adding more machines to the cluster.

The other concern you mentioned was

> The probability of data loss increases linearly with cluster size.

but you also acknowledge that

> In making this determination, one must take into account that if a
> larger `DF` makes reconstruction/replacement significantly faster,
> that also decreases the time window in which multiple failures can
> occur. Increasing `DF` is thus not *necessarily* increasing the total
> probability of data loss (for small values of `DF`).

Our calculations lead us to believe that in fact the shorter rebuild window more than compensates for the increased probability of multiple failure, so with DF=N the probability of data loss is minimised. The CRUSH paper also states: "With 2-way mirroring these two factors cancel each other out, while overall data saf[...]"
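The optimal-movement property claimed for the ring bucket (only Δw/W keys move when the topology changes) is easy to check empirically with a toy ring. This is a hypothetical sketch assuming equal weights, not Cassandra code: adding a fifth node to a four-node ring should relocate roughly 1/5 of the keys, all of them onto the new node.

```python
import hashlib
from bisect import bisect

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(hosts, vnodes=128):
    """Consistent-hash ring: each host holds `vnodes` random, non-contiguous slices."""
    return sorted((h(f"{host}:{i}"), host) for host in hosts for i in range(vnodes))

def owner(ring, key):
    """A key belongs to the host holding the first token at or after hash(key)."""
    return ring[bisect(ring, (h(key),)) % len(ring)][1]

hosts = [f"node{i}" for i in range(4)]
keys = [f"key{i}" for i in range(20000)]

ring4 = build_ring(hosts)
before = {k: owner(ring4, k) for k in keys}

ring5 = build_ring(hosts + ["node4"])  # add a fifth node
moved = [k for k in keys if owner(ring5, k) != before[k]]
# Every relocated key lands on the new node, and the moved fraction is
# roughly delta_w / W = 1/5 for equal weights.
```

Note that no key moves between the original four nodes; this is exactly the property the list/tree/straw buckets only approximate.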
Re: RFC: Cassandra Virtual Nodes
On 19 March 2012 09:23, Radim Kolar wrote:
>> Hi Radim,
>>
>> The number of virtual nodes for each host would be configurable by the
>> user, in much the same way that initial_token is configurable now. A host
>> taking a larger number of virtual nodes (tokens) would have proportionately
>> more of the data. This is how we anticipate support for heterogeneity in
>> cluster hardware.
>
> Yes, but this is good only for random partitioner. For ordered you need to
> be able split token space on highly loaded servers. With virtual tokens it
> will move load to random node.
> What if random node will be also hotspot node? Administration will be more
> difficult because you don't know where workload lands after you reduce
> number of tokens held by node.

For OPP we envisage an external management process performing active load balancing. The initial token assignment would be random within some user-specified range corresponding to the range of their keys. The load would then be monitored, and hot-spots would be moved by reassigning virtual nodes to lightly loaded machines, or by introducing new tokens into hot ranges. It makes sense that this would not be a manual process, but there would certainly be more control than just increasing or decreasing the number of tokens assigned to a node.

--
Sam Overton
Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
On Mon, Mar 19, 2012 at 4:15 PM, Sam Overton wrote:
> On 19 March 2012 09:23, Radim Kolar wrote:
>>> Hi Radim,
>>>
>>> The number of virtual nodes for each host would be configurable by the
>>> user, in much the same way that initial_token is configurable now. A host
>>> taking a larger number of virtual nodes (tokens) would have proportionately
>>> more of the data. This is how we anticipate support for heterogeneity in
>>> cluster hardware.
>>
>> Yes, but this is good only for random partitioner. For ordered you need to
>> be able split token space on highly loaded servers. With virtual tokens it
>> will move load to random node.
>> What if random node will be also hotspot node? Administration will be more
>> difficult because you don't know where workload lands after you reduce
>> number of tokens held by node.
>
> For OPP we envisage an external management process performing active
> load balancing. The initial token assignment would be random within
> some user-specified range corresponding to the range of their keys.
> The load would then be monitored and hot-spots would be moved by
> reassigning virtual nodes to lightly loaded machines, or introducing
> new tokens into hot ranges. It makes sense that this would not be a
> manual process, but there would certainly be more control than just
> increasing or decreasing the number of tokens assigned to a node.
>
> --
> Sam Overton
> Acunu | http://www.acunu.com | @acunu

For OPP the problem of load balancing is more profound. Now you need vnodes per keyspace, because you cannot expect each keyspace to have the same distribution. With three keyspaces you are unsure as to which one is causing the hotness. I think OPP should just go away.
Re: RFC: Cassandra Virtual Nodes
> For OPP the problem of load balancing is more profound. Now you need
> vnodes per keyspace, because you cannot expect each keyspace to have
> the same distribution. With three keyspaces you are unsure as to
> which one is causing the hotness. I think OPP should just go away.

That's a good point, but isn't that the same problem as trying to balance tokens with OPP currently?
Re: RFC: Cassandra Virtual Nodes
On Mon, Mar 19, 2012 at 4:24 PM, Sam Overton wrote:
>> For OPP the problem of load balancing is more profound. Now you need
>> vnodes per keyspace, because you cannot expect each keyspace to have
>> the same distribution. With three keyspaces you are unsure as to
>> which one is causing the hotness. I think OPP should just go away.
>
> That's a good point, but isn't that the same problem as trying to
> balance tokens with OPP currently?

Yes. I was bringing this up because the external management process you suggested for performing active load balancing will have to be smart enough to understand this. Right now, since this is done manually, it is the user's problem.
Re: RFC: Cassandra Virtual Nodes
I think if we could go back and rebuild Cassandra from scratch, vnodes would likely be implemented from the beginning. However, I'm concerned that implementing them now could be a big distraction from more productive uses of all of our time, and could introduce major potential stability issues into what is becoming a business-critical piece of infrastructure for many people.

Instead of just complaining and pedantry, I'd like to offer a feasible alternative: has consideration been given to the idea of supporting a single token range per node? While not theoretically as capable as vnodes, it seems to me to be more practical, as it would have a significantly lower impact on the codebase and provides a much clearer migration path. It also seems to solve a majority of complaints regarding operational issues with Cassandra clusters.

Each node would have a lower and an upper token, which would form a range that would be actively distributed via gossip. Read and replication requests would only be routed to a replica when the key of these operations matched the replica's token range in the gossip tables. Each node would locally store its own current active token range as well as a target token range it's "moving" towards. As a new node undergoes bootstrap, the bounds would be gradually expanded to allow it to handle requests for a wider range of the keyspace as it moves towards its target token range.

This idea boils down to a move from hard cutovers to smoother operations, by gradually adjusting active token ranges over a period of time. It would apply to token change operations (nodetool 'move' and 'removetoken') as well. Failure during streaming could be recovered at the bounds, instead of restarting the whole process, as the active bounds would effectively track the progress of bootstrap & target token changes. Implicitly, these operations would be throttled to some degree.
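A minimal sketch of the gradually-expanding bounds described above (hypothetical names; a real range would wrap around the token space, which is ignored here for brevity):

```python
from dataclasses import dataclass

@dataclass
class ActiveRange:
    """A node's currently-served slice of its target token range.

    The node only accepts reads/replication for tokens inside
    [lower, active_upper); bootstrap widens active_upper in steps
    until it reaches target_upper, so a streaming failure resumes
    from active_upper instead of restarting the whole transfer."""
    lower: int          # fixed lower bound of the target range
    target_upper: int   # where the node is ultimately headed
    active_upper: int   # how far it currently serves

    def serves(self, token):
        return self.lower <= token < self.active_upper

    def advance(self, step):
        """Widen the active range by one streamed chunk; True when done."""
        self.active_upper = min(self.active_upper + step, self.target_upper)
        return self.active_upper == self.target_upper

# Bootstrap a node towards the range [0, 1000) in chunks of 300 tokens.
node = ActiveRange(lower=0, target_upper=1000, active_upper=0)
while not node.advance(300):
    pass  # each iteration = one streamed chunk; a failure resumes here
```

The active bound doubles as the progress marker, which is what makes bounded recovery and implicit throttling fall out of the design.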
Node repair (AES) could also be modified using the same overall ideas to provide a more gradual impact on the cluster, similar to the ideas given in CASSANDRA-3721. While this doesn't spread the load for these operations over the cluster as evenly as vnodes do, this could likely be worked around by performing concurrent (throttled) bootstrap & node repair (AES) operations. It does allow some kind of "active" load balancing, though clearly not as flexible or as useful as with vnodes; then again, you should be using RandomPartitioner or sort-of-randomized keys with OPP, right? ;)

As a side note: vnodes fail to provide solutions to node-based limitations that seem to me to cause a substantial portion of operational issues, such as the impact of node restarts / upgrades, and GC- and compaction-induced latency. I think some progress could be made here by allowing a "pack" of independent Cassandra nodes to be run on a single host; somewhat (but nowhere near entirely) similar to the pre-fork model used by some UNIX-based servers.

Input?

--
Rick Branson
DataStax
Re: RFC: Cassandra Virtual Nodes
> a) a virtual node partitioning scheme (to support heterogeneity and
> management simplicity)
> b) topology aware replication
> c) topology aware routing

I would add (d) limiting the distribution factor to decrease the probability of data loss/multiple failures within a replica set.

> First of all, I think that while (c) depends on (b) it does not affect
> partitioning or replication directly, so I'm going to set that aside
> for the moment and talk just about the former two.

Agreed (but I think (d) relates).

> 1. The hash-space is partitioned into a fixed number of partitions
> 2. The CRUSH algorithm is run - select(1, disk) - over the topology
> using each partition as a key, to get an assignment of partition ->
> physical host (primary)
> 2a. adding or removing a node requires re-running CRUSH to recalculate
> the partition assignment (and move data)

(or any other arbitrary change, yes)

> 3. The CRUSH algorithm is run - select(RF-1, disk) - over the topology
> using each primary host id, to get an assignment of primary host ->
> RF-1 replicas
> 3a. adding or removing a node requires re-running CRUSH to recalculate
> replica assignment (which might be a different set of hosts to
> before?)

Yes. Cassandra would have a minimum of two topologies; the "current" and the "next" topology. Each would imply a mapping of partition -> replica set, and that mapping will potentially be different between the two. Reads would always be served from the "current" topology. Writes would go to the union of the current and the next topology, taking care to "tie" replicas together correctly for consistency level purposes (this is what CASSANDRA-3901 and CASSANDRA-3833 are talking about). Any topology change is treated the same from the read/write path perspective, regardless of whether you're adding a node, removing a node, adding an entire rack, or even an entire data center. No added complexity is introduced beyond the "base case".
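The read/write rule above can be sketched like so (toy Python with hypothetical names; the consistency-level "tying" logic from CASSANDRA-3901/3833 is omitted):

```python
def read_replicas(current, key):
    """Reads are served only from the current topology's mapping."""
    return current[key]

def write_replicas(current, nxt, key):
    """Writes go to the union of the current and next topologies, so no
    replica in either mapping misses a write mid-transition
    (order preserved, duplicates removed)."""
    seen, union = set(), []
    for node in current[key] + nxt[key]:
        if node not in seen:
            seen.add(node)
            union.append(node)
    return union

# Example: node D is being added, shifting one replica for this partition.
current = {"p1": ["A", "B", "C"]}
nxt = {"p1": ["B", "C", "D"]}
```

Once the transition completes, "next" becomes "current" and the extra write targets disappear; the same two-mapping rule covers every kind of topology change.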
> One of my concerns about using CRUSH exactly as described in the paper
> is that it seems to be sub-optimal in the amount of data that it moves
> after modifying the topology. The authors of the paper introduce
> several "bucket types" (uniform, list, tree, straw) which appear to be
> various sub-optimal alternatives to consistent hashing, with various
> trade-offs. Why not use consistent hashing? Given (2a) and (3a) I
> think we might end up moving way too much data when the set of
> replicas changes completely for a given host.

One of the benefits of pre-partitioning to a fixed set of partitions is that we can pre-calculate the mapping. This removes the CPU-efficiency trade-off of the straw bucket, so the straw bucket would be a good choice.

Consistent hashing: it's totally doable to use consistent hashing at each node in the topology. It is not without its own trade-offs though, because the granularity of weighting you want to support, and the accuracy of it, relates directly to the number of vnodes per child you need to keep in your consistent hashing ring. Taking granularity, accuracy target and number of children into account can easily lead to very large numbers of vnodes. (At least experimentally, from when I've implemented and played with the simple form of consistent hashing in the past. I don't currently have good mathematical evidence.)

> Let's suppose we introduce our own bucket type called a "ring bucket".
> Each item in a ring bucket is assigned an equal, non-contiguous
> portion of the key hash-space, which determines which keys are
> assigned to it. When an item is added to the ring bucket, it takes an
> equal portion of the hash-space from every other item already in the
> bucket. And vice-versa for removals. It's easy to see that this ring
> bucket implements consistent hashing with some unspecified virtual
> node scheme. Additions and removals would be optimal (only Δw/W
> keys require moving when the topology changes).
This is my understanding of what you meant by consistent hashing, and what I refer to above.

> Using this ring bucket in the CRUSH topology, (with the hash function
> being the identity function) would give the exact same distribution
> properties as the virtual node strategy that I suggested previously,
> but of course with much better topology awareness.

I will have to re-read your original post. I seem to have missed something :)

> This makes it evident that the partitioning scheme, and a CRUSH-like
> replication scheme are orthogonal concerns. In the same way as NTS
> currently uses the ring to provide distribution at DC and rack level
> by conceptually separating the ring into distinct logical rings for
> each DC, a CrushReplicationStrategy could use the ring as its
> bucketing function to distribute partitions in the topology.

Yes, agreed. The distribution factor limiting is also compatible with non-CRUSH, by hash chaining from the primary replica instead of the row key.

> This brings me on to (1) and the reasons for our [...]
Re: RFC: Cassandra Virtual Nodes
>> Using this ring bucket in the CRUSH topology, (with the hash function
>> being the identity function) would give the exact same distribution
>> properties as the virtual node strategy that I suggested previously,
>> but of course with much better topology awareness.
>
> I will have to re-read your original post. I seem to have missed something :)

I did, and I may or may not understand what you mean. Are you comparing vnodes + hashing with CRUSH + pre-partitioning by hash + identity hash as you traverse down the topology tree?

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: RFC: Cassandra Virtual Nodes
(I may comment on other things more later)

> As a side note: vnodes fail to provide solutions to node-based limitations
> that seem to me to cause a substantial portion of operational issues such
> as impact of node restarts / upgrades, GC and compaction induced latency.

Actually, it does. At least assuming DF > RF (as in the original proposal, and mine). The impact of a node suffering from a performance degradation is mitigated because the effects are spread out over DF-1 (N-1 in the original post) nodes instead of just RF nodes.

> I think some progress could be made here by allowing a "pack" of independent
> Cassandra nodes to be run on a single host; somewhat (but nowhere near
> entirely) similar to a pre-fork model used by some UNIX-based servers.

I have pretty significant knee-jerk negative reactions to that idea to be honest, even if the pack is limited to a handful of instances. In order for vnodes to be useful with random placement, we'd need much more than a handful of vnodes per node (Cassandra instances in a "pack" in that model).

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
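As a toy illustration of why larger DF also shortens the rebuild window (idealised numbers of my own, assuming per-peer streaming bandwidth is the bottleneck and all DF-1 peers stream concurrently):

```python
def rebuild_time(data_gb, df, per_node_mb_s=40):
    """Seconds to re-replicate a failed node's data when DF-1 peers
    each stream an equal share concurrently (idealised model)."""
    share_gb = data_gb / (df - 1)
    return share_gb * 1024 / per_node_mb_s

# With DF=RF=3, only 2 peers hold replicas of the failed node's data.
# With DF=100 (vnodes on a large cluster), 99 peers share the work.
t_rf = rebuild_time(1000, df=3)
t_vn = rebuild_time(1000, df=100)
```

The same spreading that dilutes the impact of a degraded node also parallelises its replacement, which is the trade-off against the higher multiple-failure exposure discussed earlier in the thread.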
Re: RFC: Cassandra Virtual Nodes
I also created a ticket, https://issues.apache.org/jira/browse/CASSANDRA-3768, with some of the reasons why I would like to see vnodes in Cassandra. It can also potentially reduce the SSTable seeks which a node has to do to query data in SizeTieredCompaction, if extended to the filesystem.

But 110% agree with Peter: we need to take incremental steps and start with the existing bootstrapping. Maybe we can start by making a set of ranges/tokens map to a node instead of one token, and then maybe build things around the movement of those ranges. I have been thinking about this for a while but am having trouble getting to a point where I am comfortable changing big chunks of code.

Regards,

On Mon, Mar 19, 2012 at 4:45 PM, Peter Schuller wrote:
> (I may comment on other things more later)
>
>> As a side note: vnodes fail to provide solutions to node-based limitations
>> that seem to me to cause a substantial portion of operational issues such
>> as impact of node restarts / upgrades, GC and compaction induced latency.
>
> Actually, it does. At least assuming DF > RF (as in the original
> proposal, and mine). The impact of a node suffering from a performance
> degradation is mitigated because the effects are spread out over DF-1
> (N-1 in the original post) nodes instead of just RF nodes.
>
>> I think some progress could be made here by allowing a "pack" of independent
>> Cassandra nodes to be run on a single host; somewhat (but nowhere near
>> entirely) similar to a pre-fork model used by some UNIX-based servers.
>
> I have pretty significant knee-jerk negative reactions to that idea to
> be honest, even if the pack is limited to a handful of instances. In
> order for vnodes to be useful with random placement, we'd need much
> more than a handful of vnodes per node (cassandra instances in a
> "pack" in that model).
>
> --
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: RFC: Cassandra Virtual Nodes
On Mon, Mar 19, 2012 at 4:45 PM, Peter Schuller wrote:
>> As a side note: vnodes fail to provide solutions to node-based limitations
>> that seem to me to cause a substantial portion of operational issues such
>> as impact of node restarts / upgrades, GC and compaction induced latency.
>
> Actually, it does. At least assuming DF > RF (as in the original
> proposal, and mine). The impact of a node suffering from a performance
> degradation is mitigated because the effects are spread out over DF-1
> (N-1 in the original post) nodes instead of just RF nodes.

You've got me on one of those after some rethinking. Any node outage (an upgrade/restart) definitely has a much smaller impact because the load is distributed more evenly, but (and correct me if I'm wrong) for things like additional latency caused by GC/compaction, those requests will just be slower, rather than timing out or getting redirected via the dynamic snitch.

>> I think some progress could be made here by allowing a "pack" of independent
>> Cassandra nodes to be run on a single host; somewhat (but nowhere near
>> entirely) similar to a pre-fork model used by some UNIX-based servers.
>
> I have pretty significant knee-jerk negative reactions to that idea to
> be honest, even if the pack is limited to a handful of instances. In
> order for vnodes to be useful with random placement, we'd need much
> more than a handful of vnodes per node (cassandra instances in a
> "pack" in that model).

Fair enough; I'm not super fond of the idea personally, but I don't see a way around the limitations of the current JVM GC without multiple processes.

After rethinking my ideas a bit, I think what I've settled on is to keep the existing node tokens, but add an additional "active token" that would be used to determine the data range that a node is ready to receive reads for. This should gain all of the benefits highlighted in my earlier post, but with less complexity of implementation. Node repair (AES) would still allow ranges to be specified.
Re: RFC: Cassandra Virtual Nodes
On Mon, Mar 19, 2012 at 9:37 PM, Vijay wrote:
> I also created a ticket,
> https://issues.apache.org/jira/browse/CASSANDRA-3768, with some of the
> reasons why I would like to see vnodes in Cassandra.
> It can also potentially reduce the SSTable seeks which a node has to do to
> query data in SizeTieredCompaction, if extended to the filesystem.
>
> But 110% agree with Peter: we need to take incremental steps and start with
> the existing bootstrapping.

I'm guessing you're referring to Rick's proposal about ranges per node?

> Maybe we can start by making a set of ranges/tokens map to a node instead of
> one token, and then maybe build things around the movement of those
> ranges.
> I have been thinking about this for a while but am having trouble getting to a
> point where I am comfortable changing big chunks of code.

It might help to see some more detail in the proposals. Both ideas seem invasive, vnodes more so, but there are many more benefits as well.

--
Eric Evans
Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans wrote:
> I'm guessing you're referring to Rick's proposal about ranges per node?

Maybe. What I mean is a little simpler than that: we can consider every node as having multiple conservative ranges, and move those ranges for bootstrap etc., instead of finding the mid-point etc. in the bootstrap code. Once we get that working all the way down to the FS/streaming layer, then we can move those ranges and assign them to nodes in random order. Hope it makes sense.