Hi all,

First off, the ground rules. :)

This is a development/design discussion.  If you have general
questions about virtual nodes that don't pertain to this discussion,
please ask them in another thread, or on user@ and we'll get them
answered there.

BACKGROUND

Currently, an upgrade from 1.1.x to 1.2 will result in no change as
far as virtual nodes are concerned.  Upgraded clusters will continue
to operate with the single token per node they were originally
configured with.  If, however, you wish to upgrade your cluster to
virtual nodes, you do so by setting the num_tokens parameter in
cassandra.yaml to something greater than one (recommended: 256) and
restarting your nodes.  This results in the existing range on each
node being split into num_tokens parts.  This works fine, except that
the new ranges are still contiguous; ideally they need to be randomly
(re)distributed for optimal effect.
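
For reference, the configuration change in question amounts to
nothing more than, e.g.:

    # cassandra.yaml -- a single token per node unless set higher
    num_tokens: 256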

Enter CASSANDRA-4443[1] for the creation of a so-called shuffle
utility to do exactly that: redistribute the tokens to random
locations.


I'm ready to start on this, whatever shape it takes, but since it
seems the requirements could use some shoring up, I thought I'd raise
it for discussion here.

Shuffling the ranges to create a random distribution from contiguous
ranges has the potential to move a *lot* of data around (all of it,
basically).  Doing this in an optimal way would mean never moving a
range more than once.  Since it is a lot of data, and since presumably
we're expecting normal operations to continue in the meantime, it
would seem an optimal shuffle would need to maintain state.  For
example, one machine could serve as the "shuffle coordinator",
precalculating and persisting all of the moves, starting new transfers
as existing ones finish, tracking progress, and so on.
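
To make that concrete, here is a rough sketch of what a coordinator's
main loop might look like.  Every class and method name below is made
up for illustration; none of it exists in Cassandra today:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ShuffleCoordinator
    {
        // One planned relocation of a single token range to a new node.
        static final class Move
        {
            final String range;    // the range being moved
            final String target;   // its new home
            volatile boolean done; // persisted in a real implementation

            Move(String range, String target)
            {
                this.range = range;
                this.target = target;
            }
        }

        private final Deque<Move> plan;
        private final ExecutorService transfers;

        ShuffleCoordinator(List<Move> precalculatedPlan, int concurrency)
        {
            // A real version would persist the plan before moving any
            // data, so a restarted coordinator could resume where it
            // left off instead of re-sending ranges.
            this.plan = new ArrayDeque<>(precalculatedPlan);
            this.transfers = Executors.newFixedThreadPool(concurrency);
        }

        void run() throws InterruptedException
        {
            // Drain the plan, starting new transfers as workers free up,
            // skipping anything already marked complete.
            while (!plan.isEmpty())
            {
                Move next = plan.poll();
                if (next.done)
                    continue;

                transfers.submit(() -> {
                    stream(next);
                    next.done = true; // never move this range again
                });
            }
            transfers.shutdown();
            transfers.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
        }

        private void stream(Move move)
        {
            // Placeholder for actually streaming the range.
            System.out.printf("moving %s to %s%n", move.range, move.target);
        }
    }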

When I talked with Brandon Williams earlier, he suggested a simpler
approach: treat shuffle as a per-node operation (possibly limited to
some subset of the ranges), with no tracking of state.  If a shuffle
failed, you would either rerun it in its entirety, or not (if, for
example, you decided it had made enough progress to satisfy your
requirements for distribution).
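
The stateless, per-node version might be little more than the
following (again, all names here are hypothetical):

    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class PerNodeShuffle
    {
        private static final Random RANDOM = new Random();

        // Reassign up to 'limit' of this node's ranges to random tokens.
        static void shuffleLocalRanges(List<String> localRanges, int limit)
        {
            Collections.shuffle(localRanges, RANDOM);
            int count = Math.min(limit, localRanges.size());
            for (String range : localRanges.subList(0, count))
            {
                long newToken = RANDOM.nextLong(); // arbitrary new position
                relocate(range, newToken);         // stream it; no bookkeeping
            }
        }

        private static void relocate(String range, long newToken)
        {
            // Placeholder for streaming the range to its new owner.  If the
            // node dies here, the operator reruns the whole shuffle or
            // accepts the partial redistribution.
            System.out.printf("moving %s to token %d%n", range, newToken);
        }
    }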

Of the two, the former benefits from being optimal all around, but
it's a fairly involved bit of code for something that I assume will be
used on a cluster at most once (is this a safe assumption?).  The
latter is much simpler to implement, but places more onus on the user
to get right, and can result in a lot of needless retransfer of data,
poor redistribution, or both (e.g. if a shuffle didn't complete, or if
only a subset of ranges was used).

So the question I pose for discussion is:  What are the requirements
for a shuffle operation?  How optimal does it need to be?  How
fool-proof?


[1]: https://issues.apache.org/jira/browse/CASSANDRA-4443
[2]: http://wiki.apache.org/cassandra/VirtualNodes/Balance

-- 
Eric Evans
Acunu | http://www.acunu.com | @acunu
