Hi all,

First off, the ground rules. :)
This is a development/design discussion. If you have general questions about virtual nodes that don't pertain to this discussion, please ask them in another thread, or on user@, and we'll get them answered there.

BACKGROUND

Currently, an upgrade from 1.1.x to 1.2 will result in no change as far as virtual nodes are concerned. Upgraded clusters will continue to operate with the single token per node they were originally configured with. If, however, you wish to upgrade your cluster to virtual nodes, you do so by setting the num_tokens parameter in cassandra.yaml to something greater than one (recommended: 256) and restarting your nodes. This results in the existing range on each node being split into num_tokens parts.

This works fine, except that the new ranges are still contiguous, and ideally need to be randomly (re)distributed for optimal effect. Enter CASSANDRA-4443[1], the creation of a so-called shuffle utility to do exactly that: redistribute the tokens to random locations. I'm ready to start on this, whatever shape it takes, but since it seems the requirements could use some shoring up, I thought I'd raise it for discussion here.

Shuffling the ranges to create a random distribution from contiguous ranges has the potential to move a *lot* of data around (all of it, basically). Doing this in an optimal way would mean never moving a range more than once. Since it is a lot of data, and since presumably we're expecting normal operations to continue in the meantime, an optimal shuffle would seem to need to maintain state. For example, one machine could serve as the "shuffle coordinator": precalculating and persisting all of the moves, starting new transfers as existing ones finish, tracking progress, etc.

In an earlier conversation, Brandon Williams suggested a simpler approach: treat shuffle as a per-node operation (possibly limited to some subset of the ranges). There would be no tracking of state.
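To make the coordinator idea concrete, here is a rough Python sketch. Everything in it (names, ring layout, plan format) is hypothetical illustration, not actual Cassandra code: it mimics the upgrade splitting each node's range into num_tokens contiguous parts, then precomputes a balanced random plan in which no token is ever moved more than once.

```python
import random

# Illustrative sketch only -- names and layout are assumptions, not
# Cassandra's actual internals. It shows why post-upgrade ranges stay
# contiguous, and how a "shuffle coordinator" might precompute a plan
# that moves each token at most once.

NUM_TOKENS = 256          # recommended num_tokens value from cassandra.yaml
RING_SIZE = 2 ** 127      # RandomPartitioner token space

def split_tokens(node_count, num_tokens=NUM_TOKENS):
    """Mimic the upgrade: each node's single contiguous range is split
    into num_tokens contiguous sub-ranges, still owned by that node."""
    ownership = {}                        # token -> owning node
    per_node = RING_SIZE // node_count
    step = per_node // num_tokens
    for node in range(node_count):
        start = node * per_node
        for i in range(num_tokens):
            ownership[start + i * step] = node
    return ownership

def shuffle_plan(ownership, seed=None):
    """Precompute a random, balanced redistribution.  A move is recorded
    only when ownership actually changes, and each token appears in at
    most one move, so no range is ever transferred twice."""
    rng = random.Random(seed)
    nodes = sorted(set(ownership.values()))
    tokens = list(ownership)
    # Round-robin targets keep the ring balanced; shuffling randomizes
    # which node gets which token.
    targets = [nodes[i % len(nodes)] for i in range(len(tokens))]
    rng.shuffle(targets)
    return [(tok, ownership[tok], dst)
            for tok, dst in zip(tokens, targets)
            if ownership[tok] != dst]

ownership = split_tokens(node_count=4)
plan = shuffle_plan(ownership, seed=42)
# A coordinator could persist `plan`, start new transfers as existing
# ones finish, and track progress against it.
```

The point of the sketch is the shape of the state: a persisted list of (token, source, destination) moves that a coordinator drains over time, which is exactly what makes the optimal approach resumable but also the more involved bit of code.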
If a shuffle failed, you would either rerun it in its entirety, or not (if, for example, you decided it had made enough progress to satisfy your requirements for distribution).

Of the two, the former has the benefit of being optimal all around, but it's a fairly involved bit of code for something that I assume will be used on a cluster at most once (is this a safe assumption?). The latter is much simpler to implement, but places more onus on the user to get right, and can result in a lot of needless retransfer of data, a poor redistribution (e.g. if a shuffle didn't complete, or if only a subset of ranges was used), or both.

So the question I pose for discussion is: What are the requirements for a shuffle operation? How optimal does it need to be? How fool-proof?

[1]: https://issues.apache.org/jira/browse/CASSANDRA-4443
[2]: http://wiki.apache.org/cassandra/VirtualNodes/Balance

--
Eric Evans
Acunu | http://www.acunu.com | @acunu