On Mon, Aug 20, 2012 at 4:55 PM, Eric Evans <eev...@acunu.com> wrote:
> Shuffling the ranges to create a random distribution from contiguous
> ranges has the potential to move a *lot* of data around (all of it,
> basically). Doing this in an optimal way would mean never moving a
> range more than once. Since it is a lot of data, and since presumably
> we're expecting normal operations to continue in the meantime, it
> would seem an optimal shuffle would need to maintain state. For
> example, one machine could serve as the "shuffle coordinator",
> precalculating and persisting all of the moves, starting new transfers
> as existing ones finish, and tracking the progress, etc.
Fortunately, we have a distributed storage system.... :)

Seriously though, creating a CF mapping vnode from->to tuples, throwing
in the full list of changes to make once, and deleting them as they
complete, would be a pretty simple way to get what we want.

Personally though, I don't think the likelihood of "shuffle coordinator"
failure during the operation is *that* high. I'd be happy to just assume
it works. But if you want to get fancy, having it perform its changes in
random or round-robin order (instead of node-at-a-time) means you could
"recover" from failure w/o any state at all, for free: restart the
shuffle after a failure, and kill it when the cluster is "sufficiently"
shuffled.

Note that cluster-shuffle would have to fail *twice* before even the
least optimal retry policy gets worse than node-shuffle, since
node-shuffle will transfer exactly 2x as much data as cluster-shuffle,
on average.

My vote would be for a stateless, round-robin cluster-shuffle.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
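P.S. A minimal sketch of the move-log CF idea, in plain Java rather than
actual Cassandra internals; the ShuffleMoveLog name, the Move tuple, and
the in-memory map standing in for the CF are all illustrative
assumptions, not anything that exists in the codebase:

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative stand-in for a CF mapping vnodes to from->to tuples.
 * The whole move plan is written once up front, and each entry is
 * deleted as its transfer completes, so a restarted coordinator can
 * pick up exactly where the old one died. A real implementation would
 * back this with a system column family; the in-memory map here is
 * only a sketch.
 */
public class ShuffleMoveLog {
    /** One pending relocation: a vnode moving between two hosts. */
    public record Move(UUID vnode, String fromHost, String toHost) {}

    private final Map<UUID, Move> pending = new ConcurrentHashMap<>();

    /** Persist the precalculated plan, once, before transfers start. */
    public void recordPlan(Iterable<Move> plan) {
        for (Move m : plan)
            pending.put(m.vnode(), m);
    }

    /** Delete a move as its transfer finishes. */
    public void markComplete(UUID vnode) {
        pending.remove(vnode);
    }

    /** After a coordinator restart, whatever is left remains to do. */
    public Iterable<Move> remaining() {
        return pending.values();
    }
}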
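And a sketch of the round-robin ordering, again with placeholder types
rather than real Cassandra classes:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/**
 * Interleave moves round-robin across source nodes instead of draining
 * one node at a time. If the shuffle dies partway through, every node
 * is then partially shuffled, so restarting from scratch and stopping
 * once the distribution is "sufficiently" random recovers without any
 * persisted state.
 */
public class RoundRobinOrder {
    /** M is whatever move/transfer type the coordinator uses. */
    public static <M> List<M> interleave(Map<String, List<M>> movesBySource) {
        List<Queue<M>> queues = new ArrayList<>();
        for (List<M> moves : movesBySource.values())
            queues.add(new ArrayDeque<>(moves));

        List<M> order = new ArrayList<>();
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            for (Queue<M> q : queues) {
                M next = q.poll(); // at most one move per node per round
                if (next != null) {
                    order.add(next);
                    progressed = true;
                }
            }
        }
        return order;
    }
}

Restart-after-failure then just means recomputing the moves still
outstanding from the current ring and running the same loop again; no
persisted state is needed.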