Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Radim Kolar




> Hi Radim,
>
> The number of virtual nodes for each host would be configurable by the
> user, in much the same way that initial_token is configurable now. A host
> taking a larger number of virtual nodes (tokens) would have proportionately
> more of the data. This is how we anticipate support for heterogeneity in
> cluster hardware.
Yes, but this is good only for the random partitioner. For the ordered 
partitioner you need to be able to split the token space on highly loaded 
servers. With virtual tokens the load will move to a random node.
What if that random node is also a hotspot node? Administration will be 
more difficult because you don't know where the workload lands after you 
reduce the number of tokens held by a node.


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Sam Overton
Hi Peter,

It's great to hear that others have come to some of the same conclusions!

I think a CRUSH-like strategy for topologically aware
replication/routing/locality is a great idea. I think I can see three
mostly orthogonal sets of functionality that we're concerned with:

a) a virtual node partitioning scheme (to support heterogeneity and
management simplicity)
b) topology aware replication
c) topology aware routing

First of all, I think that while (c) depends on (b) it does not affect
partitioning or replication directly, so I'm going to set that aside
for the moment and talk just about the former two.

I'll summarise your design here, mainly to make sure that I understand
it, but also to refer back to it:

1. The hash-space is partitioned into a fixed number of partitions
2. The CRUSH algorithm is run - select(1, disk) - over the topology
using each partition as a key, to get an assignment of partition ->
physical host (primary)
2a. adding or removing a node requires re-running CRUSH to recalculate
the partition assignment (and move data)
3. The CRUSH algorithm is run - select(RF-1, disk) - over the topology
using each primary host id, to get an assignment of primary host ->
RF-1 replicas
3a. adding or removing a node requires re-running CRUSH to recalculate
replica assignment (which might be a different set of hosts to
before?)
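
To make sure we mean the same thing by those two passes, here is a minimal
sketch in Java of the structure as I've summarised it (a plain hash stands in
for CRUSH's select(), and the host names, partition count and RF are made up):

    import java.util.*;

    // Toy stand-in for the two-pass placement summarised above; NOT the real
    // CRUSH bucket selection, just the shape of steps (2) and (3).
    public class TwoPassPlacementSketch {
        static final int PARTITIONS = 16;   // (1) fixed number of partitions
        static final int RF = 3;
        static final List<String> hosts = Arrays.asList("h1", "h2", "h3", "h4", "h5");

        // (2) partition -> primary host, keyed by the partition itself
        static String primaryFor(int partition) {
            return hosts.get(Math.floorMod(Objects.hash("primary", partition), hosts.size()));
        }

        // (3) primary host -> RF-1 further replicas, keyed by the primary's id
        static List<String> replicasFor(String primary) {
            List<String> replicas = new ArrayList<>();
            for (int i = 0; replicas.size() < RF - 1; i++) {
                String h = hosts.get(Math.floorMod(Objects.hash("replica", primary, i), hosts.size()));
                if (!h.equals(primary) && !replicas.contains(h))
                    replicas.add(h);
            }
            return replicas;
        }

        public static void main(String[] args) {
            // (2a)/(3a): any change to hosts means re-running both passes,
            // which is where the data movement concern below comes from.
            for (int p = 0; p < PARTITIONS; p++)
                System.out.println("partition " + p + " -> " + primaryFor(p)
                                   + " + " + replicasFor(primaryFor(p)));
        }
    }

The interesting part is obviously what select() does with the topology; the
sketch only shows where the two passes plug in.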

Here are some thoughts:
(clarification: when I'm talking about buckets, I'm referring to the
same concept as in the CRUSH paper!)

One of my concerns about using CRUSH exactly as described in the paper
is that it seems to be sub-optimal in the amount of data that it moves
after modifying the topology. The authors of the paper introduce
several "bucket types" (uniform, list, tree, straw) which appear to be
various sub-optimal alternatives to consistent hashing, with various
trade-offs. Why not use consistent hashing? Given (2a) and (3a) I
think we might end up moving way too much data when the set of
replicas changes completely for a given host.

Let's suppose we introduce our own bucket type called a "ring bucket".
Each item in a ring bucket is assigned an equal, non-contiguous
portion of the key hash-space, which determines which keys are
assigned to it. When an item is added to the ring bucket, it takes an
equal portion of the hash-space from every other item already in the
bucket. And vice-versa for removals. It's easy to see that this ring
bucket implements consistent hashing with some unspecified virtual
node scheme. Additions and removals would be optimal (only \deltaw/W
keys require moving when the topology changes).
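
To be concrete about what I mean, a minimal ring bucket sketch in Java (the
item names and the 64-tokens-per-item figure are arbitrary, and random tokens
stand in for whatever virtual node scheme we would actually pick):

    import java.util.*;

    // Consistent hashing with several tokens per item: a new item claims a
    // roughly equal slice of the hash-space from every existing item, and a
    // removed item's slices fall back to its ring successors.
    public class RingBucket {
        private final NavigableMap<Long, String> ring = new TreeMap<>();
        private final int tokensPerItem;
        private final Random random = new Random();

        RingBucket(int tokensPerItem) { this.tokensPerItem = tokensPerItem; }

        void addItem(String item) {
            for (int i = 0; i < tokensPerItem; i++)
                ring.put(random.nextLong(), item);
        }

        void removeItem(String item) {
            ring.values().removeIf(item::equals);
        }

        // owner of a key: walk clockwise to the next token on the ring
        String itemFor(long keyHash) {
            Map.Entry<Long, String> e = ring.ceilingEntry(keyHash);
            return (e != null ? e : ring.firstEntry()).getValue();
        }

        public static void main(String[] args) {
            RingBucket bucket = new RingBucket(64);
            for (String item : List.of("disk1", "disk2", "disk3")) bucket.addItem(item);
            // adding disk4 only moves the ~1/4 of keys that now map to it,
            // taken in roughly equal shares from disk1..disk3
            bucket.addItem("disk4");
            System.out.println(bucket.itemFor(42L));
        }
    }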

Using this ring bucket in the CRUSH topology, (with the hash function
being the identity function) would give the exact same distribution
properties as the virtual node strategy that I suggested previously,
but of course with much better topology awareness.

This makes it evident that the partitioning scheme, and a CRUSH-like
replication scheme are orthogonal concerns. In the same way as NTS
currently uses the ring to provide distribution at DC and rack level
by conceptually separating the ring into distinct logical rings for
each DC, a CrushReplicationStrategy could use the ring as its
bucketing function to distribute partitions in the topology.

This brings me on to (1) and the reasons for our choice of virtual
node scheme - choose N random tokens - instead of the Dynamo-like
scheme that you suggest where the partitions are fixed in advance.
With the Dynamo scheme, the size of a virtual node partition will only
ever grow as more data is inserted. Since the number of partitions is
fixed when the cluster is created, the partition size is unbounded.

There are certain advantages to having a limit on partition size.
Streaming failures that cause retries do not have to resend so much
data. Streaming operations can be staggered in smaller chunks to
minimise the impact on the nodes involved. Load balancing can operate
on a finer granularity.

In the N tokens per node scheme, adding nodes to the cluster decreases
the partition size and so gives some control about how much data is
stored in each partition. The average size can be reduced by adding
more machines to the cluster.
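
To put rough numbers on it (purely illustrative): with D bytes of data, M
nodes and N tokens per node there are M*N partitions of roughly D/(M*N) bytes
each, so doubling the node count halves the average partition size. With a
fixed Dynamo-style Q the average is D/Q, which only grows as D grows, no
matter how many machines are added.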

The other concern you mentioned was
> The probability of data loss increases linearly with cluster size.

but you also acknowledge that

> In making this determination, one must take into account that if a
> larger `DF` makes reconstruction/replacement significantly faster,
> that also decreases the time window in which multiple failures can
> occur. Increasing `DF` is thus not *necessarily* increasing the total
> probability of data loss (for small values of `DF`).

Our calculations lead us to believe that in fact the shorter rebuild
window more than compensates for the increased probability of multiple
failure, so with DF=N the probability of data loss is minimised.
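
One way to see this, under a very simplified failure model: when a node
fails, its partitions are rebuilt in parallel by the DF-1 nodes that share
them, so the rebuild window shrinks roughly like T/(DF-1). The number of
replica sets that could lose data together with the failed node is bounded by
the number of partitions it holds, and each of those sets must lose a further
RF-1 nodes within that window. For RF=2 the larger number of at-risk
neighbours and the shorter window roughly cancel; for RF >= 3 the window
enters to the power RF-1, so a larger DF reduces the loss probability, which
is the intuition behind preferring DF=N.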

The CRUSH paper also states:

"With 2-way mirroring these two factors cancel
each other out, while overall data saf

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Sam Overton
On 19 March 2012 09:23, Radim Kolar  wrote:
>
>>
>> Hi Radim,
>>
>> The number of virtual nodes for each host would be configurable by the
>> user, in much the same way that initial_token is configurable now. A host
>> taking a larger number of virtual nodes (tokens) would have
>> proportionately
>> more of the data. This is how we anticipate support for heterogeneity in
>> cluster hardware.
>
> Yes, but this is good only for the random partitioner. For the ordered partitioner
> you need to be able to split the token space on highly loaded servers. With
> virtual tokens the load will move to a random node.
> What if that random node is also a hotspot node? Administration will be more
> difficult because you don't know where the workload lands after you reduce
> the number of tokens held by a node.

For OPP we envisage an external management process performing active
load balancing. The initial token assignment would be random within
some user-specified range corresponding to the range of their keys.
The load would then be monitored and hot-spots would be moved by
reassigning virtual nodes to lightly loaded machines, or introducing
new tokens into hot ranges. It makes sense that this would not be a
manual process, but there would certainly be more control than just
increasing or decreasing the number of tokens assigned to a node.
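
As a very rough sketch of what one pass of such a process might look like
(hypothetical tool and data structures, not an existing Cassandra API):

    import java.util.*;

    // One balancing pass: move the busiest virtual node range from the most
    // heavily loaded host to the most lightly loaded one. A real tool would
    // sample load from metrics and drive the actual token reassignment.
    public class OppBalancerSketch {
        // load per host, per token range (ranges named by a string id here)
        static Map<String, Map<String, Double>> loadByHost = new HashMap<>();

        static void rebalanceOnce() {
            String hottest = null, coldest = null;
            double max = Double.NEGATIVE_INFINITY, min = Double.POSITIVE_INFINITY;
            for (Map.Entry<String, Map<String, Double>> e : loadByHost.entrySet()) {
                double total = e.getValue().values().stream().mapToDouble(Double::doubleValue).sum();
                if (total > max) { max = total; hottest = e.getKey(); }
                if (total < min) { min = total; coldest = e.getKey(); }
            }
            if (hottest == null || hottest.equals(coldest) || loadByHost.get(hottest).isEmpty())
                return;
            String busiest = Collections.max(loadByHost.get(hottest).entrySet(),
                                             Map.Entry.comparingByValue()).getKey();
            loadByHost.get(coldest).put(busiest, loadByHost.get(hottest).remove(busiest));
            // alternatively, introduce a new token inside the busiest range
            // instead of moving it wholesale
        }
    }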

-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Edward Capriolo
On Mon, Mar 19, 2012 at 4:15 PM, Sam Overton  wrote:
> On 19 March 2012 09:23, Radim Kolar  wrote:
>>
>>>
>>> Hi Radim,
>>>
>>> The number of virtual nodes for each host would be configurable by the
>>> user, in much the same way that initial_token is configurable now. A host
>>> taking a larger number of virtual nodes (tokens) would have
>>> proportionately
>>> more of the data. This is how we anticipate support for heterogeneity in
>>> cluster hardware.
>>
>> Yes, but this is good only for the random partitioner. For the ordered partitioner
>> you need to be able to split the token space on highly loaded servers. With
>> virtual tokens the load will move to a random node.
>> What if that random node is also a hotspot node? Administration will be more
>> difficult because you don't know where the workload lands after you reduce
>> the number of tokens held by a node.
>
> For OPP we envisage an external management process performing active
> load balancing. The initial token assignment would be random within
> some user-specified range corresponding to the range of their keys.
> The load would then be monitored and hot-spots would be moved by
> reassigning virtual nodes to lightly loaded machines, or introducing
> new tokens into hot ranges. It makes sense that this would not be a
> manual process, but there would certainly be more control than just
> increasing or decreasing the number of tokens assigned to a node.
>
> --
> Sam Overton
> Acunu | http://www.acunu.com | @acunu

For OPP the problem of load balancing is more profound. Now you need
vnodes per keyspace because you cannot expect each keyspace to have
the same distribution. With three keyspaces you are now unsure as to
which one is causing the hotness. I think OPP should just go away.


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Sam Overton
> For OPP the problem of load balancing is more profound. Now you need
> vnodes per keyspace because you cannot expect each keyspace to have
> the same distribution. With three keyspaces you are now unsure as to
> which one is causing the hotness. I think OPP should just go away.

That's a good point, but isn't that the same problem with trying to
balance tokens with OPP currently?


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Edward Capriolo
On Mon, Mar 19, 2012 at 4:24 PM, Sam Overton  wrote:
>> For OPP the problem of load balancing is more profound. Now you need
>> vnodes per keyspace because you cannot expect each keyspace to have
>> the same distribution. With three keyspaces you are now unsure as to
>> which one is causing the hotness. I think OPP should just go away.
>
> That's a good point, but isn't that the same problem with trying to
> balance tokens with OPP currently?

Yes. I was bringing this up because the external management process
you suggested for active load balancing will have to be smart
enough to understand this. Right now, since this is done manually, it is
the user's problem.


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Rick Branson
I think if we could go back and rebuild Cassandra from scratch, vnodes
would likely be implemented from the beginning. However, I'm concerned that
implementing them now could be a big distraction from more productive uses
of all of our time and could introduce major potential stability issues into
what is becoming a business-critical piece of infrastructure for many people.
Rather than just complaining and being pedantic, though, I'd like to offer a
feasible alternative:

Has consideration been given to the idea of supporting a single
token range for a node?

While not theoretically as capable as vnodes, it seems to me to be more
practical as it would have a significantly lower impact on the codebase and
provides a much clearer migration path. It also seems to solve a majority
of complaints regarding operational issues with Cassandra clusters.

Each node would have a lower and an upper token, which would form a range
that would be actively distributed via gossip. Read and replication
requests would only be routed to a replica when the key of these operations
matched the replica's token range in the gossip tables. Each node would
locally store its own current active token range as well as a target token
range it's "moving" towards.

As a new node undergoes bootstrap, the bounds would be gradually expanded
to allow it to handle requests for a wider range of the keyspace as it
moves towards its target token range. This idea boils down to a move from
hard cutovers to smoother operations by gradually adjusting active token
ranges over a period of time. It would apply to token change operations
(nodetool 'move' and 'removetoken') as well.
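
A rough sketch of what I mean, with made-up types (real code would be driven
by gossip and actual streaming progress rather than a simple fraction, and
wrap-around ranges are ignored for brevity):

    // Each node gossips an active range (what it serves reads for now) and a
    // target range (what it is moving towards), widening the former as data
    // arrives instead of cutting over in one step.
    public class ActiveRangeSketch {
        static final class Range {
            final long lower, upper;                 // (lower, upper] token bounds
            Range(long lower, long upper) { this.lower = lower; this.upper = upper; }
            boolean contains(long token) { return token > lower && token <= upper; }
        }

        Range active;   // range this node currently serves reads for
        Range target;   // range it is "moving" towards

        // widen the active range towards the target; fraction would come
        // from streaming progress, not a timer
        void expandTowardsTarget(double fraction) {
            long lower = active.lower + (long) ((target.lower - active.lower) * fraction);
            long upper = active.upper + (long) ((target.upper - active.upper) * fraction);
            active = new Range(lower, upper);
        }

        // coordinators consult this instead of a single token when routing
        boolean canServe(long token) { return active.contains(token); }
    }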

Failure during streaming could be recovered at the bounds instead of
restarting the whole process, as the active bounds would effectively track
the progress of bootstrap & target token changes. Implicitly, these
operations would be throttled to some degree. Node repair (AES) could also
be modified using the same overall ideas to provide a more gradual impact on
the cluster, similar to the ideas given in CASSANDRA-3721.

While this doesn't spread the load for these operations over the cluster as
evenly as vnodes do, this is likely an issue that could be worked
around by performing concurrent (throttled) bootstrap & node repair (AES)
operations. It does allow some kind of "active" load balancing, though clearly
not as flexible or as useful as vnodes. But you should be using
RandomPartitioner or sort-of-randomized keys with OPP anyway, right? ;)

As a side note: vnodes fail to provide solutions to node-based limitations
that seem to me to cause a substantial portion of operational issues such
as impact of node restarts / upgrades, GC and compaction induced latency. I
think some progress could be made here by allowing a "pack" of independent
Cassandra nodes to be run on a single host; somewhat (but nowhere near
entirely) similar to a pre-fork model used by some UNIX-based servers.

Input?

--
Rick Branson
DataStax


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Peter Schuller
> a) a virtual node partitioning scheme (to support heterogeneity and
> management simplicity)
> b) topology aware replication
> c) topology aware routing

I would add (d) limiting the distribution factor to decrease the
probability of data loss/multiple failures within a replica set.

> First of all, I think that while (c) depends on (b) it does not affect
> partitioning or replication directly, so I'm going to set that aside
> for the moment and talk just about the former two.

Agreed (but I think (d) relates).

> 1. The hash-space is partitioned into a fixed number of partitions
> 2. The CRUSH algorithm is run - select(1, disk) - over the topology
> using each partition as a key, to get an assignment of partition ->
> physical host (primary)
> 2a. adding or removing a node requires re-running CRUSH to recalculate
> the partition assignment (and move data)

(or any other arbitrary change, yes)

> 3. The CRUSH algorithm is run - select(RF-1, disk) - over the topology
> using each primary host id, to get an assignment of primary host ->
> RF-1 replicas
> 3a. adding or removing a node requires re-running CRUSH to recalculate
> replica assignment (which might be a different set of hosts to
> before?)

Yes. Cassandra would have a minimum of two topologies; the "current"
and the "next" topology. Each would imply a mapping of partition ->
replica set, and that mapping will potentially be different between
the two.

Reads would always be served from the "current" topology. Writes would
go to the union of the current and the next topology, taking care to
"tie" replicas together correctly for consistency level purposes (this
is what CASSANDRA-3901 and CASSANDRA-3833 are talking about).
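
A minimal sketch of that routing rule, with placeholder types (the mapping
from partition to replicas would really come from the topology and the
replication strategy, not a bare function):

    import java.util.*;
    import java.util.function.Function;

    // Reads go to the "current" topology's replicas; writes go to the union
    // of the "current" and "next" topologies' replicas for the partition.
    public class TopologyTransitionSketch {
        private final Function<Integer, Set<String>> current;  // partition -> endpoints
        private final Function<Integer, Set<String>> next;

        TopologyTransitionSketch(Function<Integer, Set<String>> current,
                                 Function<Integer, Set<String>> next) {
            this.current = current;
            this.next = next;
        }

        Set<String> readEndpoints(int partition) {
            return current.apply(partition);
        }

        Set<String> writeEndpoints(int partition) {
            Set<String> union = new HashSet<>(current.apply(partition));
            union.addAll(next.apply(partition));
            // consistency-level accounting still has to "tie" old and new
            // replicas together here (the CASSANDRA-3901 / 3833 discussion)
            return union;
        }
    }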

Any topology change is treated the same from the read/write path
perspective, regardless of whether you're adding a node, removing a
node, adding an entire rack, or even an entire data center. No added
complexity is introduced beyond the "base case".

> One of my concerns about using CRUSH exactly as described in the paper
> is that it seems to be sub-optimal in the amount of data that it moves
> after modifying the topology. The authors of the paper introduce
> several "bucket types" (uniform, list, tree, straw) which appear to be
> various sub-optimal alternatives to consistent hashing, with various
> trade-offs. Why not use consistent hashing? Given (2a) and (3a) I
> think we might end up moving way too much data when the set of
> replicas changes completely for a given host.

One of the benefits of pre-partitioning into a fixed set of partitions
is that we can pre-calculate the mapping. This removes the CPU
efficiency trade-off of the straw bucket, so the straw bucket would
be a good choice.

Consistent hashing: It's totally doable to use consistent hashing at
each node in the topology. It is not without its own trade-offs
though, because the granularity of weighting you want to support, and
the accuracy of it, relate directly to the number of vnodes per child
you need to keep in your consistent hashing ring. Taking granularity,
accuracy target and number of children into account can easily lead to
a very large number of vnodes.

(At least experimentally from when I've implemented and played with
the simple form of consistent hashing in the past. I don't currently
have good mathematical evidence.)
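
A quick toy simulation shows the trend, though (arbitrary node count and
seed; the point is only that the relative ownership error falls off roughly
like 1/sqrt(vnodes per node), so tight weighting accuracy needs a lot of
them):

    import java.util.*;

    public class OwnershipSpread {
        public static void main(String[] args) {
            int nodes = 20;
            Random rnd = new Random(1);
            for (int vnodesPerNode : new int[] {4, 64, 1024}) {
                long[] tokens = new long[nodes * vnodesPerNode];
                int[] owner = new int[tokens.length];
                for (int i = 0; i < tokens.length; i++) {
                    tokens[i] = rnd.nextLong();
                    owner[i] = i / vnodesPerNode;     // node that placed this vnode
                }
                // sort token positions and accumulate each node's share of the ring
                Integer[] idx = new Integer[tokens.length];
                for (int i = 0; i < idx.length; i++) idx[i] = i;
                Arrays.sort(idx, Comparator.comparingLong(i -> tokens[i]));
                double[] share = new double[nodes];
                for (int i = 0; i < idx.length; i++) {
                    long prev = tokens[idx[(i + idx.length - 1) % idx.length]];
                    double width = (tokens[idx[i]] - prev) / Math.pow(2, 64); // gap as ring fraction
                    if (width < 0) width += 1.0;                              // wrap-around safety
                    share[owner[idx[i]]] += width;
                }
                double worst = 0;
                for (double s : share) worst = Math.max(worst, Math.abs(s * nodes - 1.0));
                System.out.printf("vnodes per node = %4d, worst relative ownership error = %.3f%n",
                                  vnodesPerNode, worst);
            }
        }
    }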

> Let's suppose we introduce our own bucket type called a "ring bucket".
> Each item in a ring bucket is assigned an equal, non-contiguous
> portion of the key hash-space, which determines which keys are
> assigned to it. When an item is added to the ring bucket, it takes an
> equal portion of the hash-space from every other item already in the
> bucket. And vice-versa for removals. It's easy to see that this ring
> bucket implements consistent hashing with some unspecified virtual
> node scheme. Additions and removals would be optimal (only \deltaw/W
> keys require moving when the topology changes).

This is my understanding of what you meant by consistent hashing, and
what I refer to above.

> Using this ring bucket in the CRUSH topology, (with the hash function
> being the identity function) would give the exact same distribution
> properties as the virtual node strategy that I suggested previously,
> but of course with much better topology awareness.

I will have to re-read your original post. I seem to have missed something :)

> This makes it evident that the partitioning scheme, and a CRUSH-like
> replication scheme are orthogonal concerns. In the same way as NTS
> currently uses the ring to provide distribution at DC and rack level
> by conceptually separating the ring into distinct logical rings for
> each DC, a CrushReplicationStrategy could use the ring as its
> bucketing function to distribute partitions in the topology.

Yes, agreed. Also, the distribution factor limiting is compatible
with non-CRUSH schemes by hash chaining from the primary replica instead of
the row key.

> This brings me on to (1) and the reasons for our

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Peter Schuller
>> Using this ring bucket in the CRUSH topology, (with the hash function
>> being the identity function) would give the exact same distribution
>> properties as the virtual node strategy that I suggested previously,
>> but of course with much better topology awareness.
>
> I will have to re-read your original post. I seem to have missed something :)

I did, and I may or may not understand what you mean.

Are you comparing vnodes + hashing, with CRUSH + pre-partitioning by
hash + identity hash as you traverse down the topology tree?

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Peter Schuller
(I may comment on other things more later)

> As a side note: vnodes fail to provide solutions to node-based limitations
> that seem to me to cause a substantial portion of operational issues such
> as impact of node restarts / upgrades, GC and compaction induced latency. I

Actually, it does. At least assuming DF > RF (as in the original
proposal, and mine). The impact of a node suffering from a performance
degradation is mitigated because the effects are spread out over DF-1
(N-1 in the original post) nodes instead of just RF nodes.

> think some progress could be made here by allowing a "pack" of independent
> Cassandra nodes to be run on a single host; somewhat (but nowhere near
> entirely) similar to a pre-fork model used by some UNIX-based servers.

I have pretty significant knee-jerk negative reactions to that idea to
be honest, even if the pack is limited to a handful of instances. In
order for vnodes to be useful with random placement, we'd need much
more than a handful of vnodes per node (cassandra instances in a
"pack" in that model).

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Vijay
I also created a ticket,
https://issues.apache.org/jira/browse/CASSANDRA-3768, with some of the
reasons why I would like to see vnodes in Cassandra.
It can also potentially reduce the SSTable seeks which a node has to do to
query data in SizeTieredCompaction, if extended to the filesystem.

But I 110% agree with Peter: we need to take incremental steps and start with
the existing bootstrapping.
Maybe we can start by assigning a set of ranges/tokens to a node instead of
one token, and then build things around the movement of those ranges.

I have been thinking about this for a while, but I'm having trouble getting to a
point where I am comfortable changing big chunks of code.

Regards,


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Rick Branson
On Mon, Mar 19, 2012 at 4:45 PM, Peter Schuller  wrote:
> > As a side note: vnodes fail to provide solutions to node-based limitations
> > that seem to me to cause a substantial portion of operational issues such
> > as impact of node restarts / upgrades, GC and compaction induced latency. I
>
> Actually, it does. At least assuming DF > RF (as in the original
> proposal, and mine). The impact of a node suffering from a performance
> degradation is mitigated because the effects are spread out over DF-1
> (N-1 in the original post) nodes instead of just RF nodes.

You've got me on one of those after some rethinking. Any node
outage (an upgrade/restart) is definitely helped a lot by distributing
the load more evenly, but (and correct me if I'm wrong) for things
like additional latency caused by GC/compaction, those requests will
just be slower rather than timing out or getting redirected via the
dynamic snitch.

> > think some progress could be made here by allowing a "pack" of independent
> > Cassandra nodes to be run on a single host; somewhat (but nowhere near
> > entirely) similar to a pre-fork model used by some UNIX-based servers.
>
> I have pretty significant knee-jerk negative reactions to that idea to
> be honest, even if the pack is limited to a handful of instances. In
> order for vnodes to be useful with random placement, we'd need much
> more than a handful of vnodes per node (cassandra instances in a
> "pack" in that model).
>

Fair enough, I'm not super fond of the idea personally, but I don't
see a way around the limitations of the current JVM GC without
multiple processes.

After rethinking my ideas a bit, I think what I've actually
settled on is to keep the existing node tokens, but add an
additional "active token" that would be used to determine the data
range that a node is ready to receive reads for. This should gain all
of the benefits highlighted in my earlier post, but with less
implementation complexity. Node repair (AES) would still allow
ranges to be specified.


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Eric Evans
On Mon, Mar 19, 2012 at 9:37 PM, Vijay  wrote:
> I also created a ticket,
> https://issues.apache.org/jira/browse/CASSANDRA-3768, with some of the
> reasons why I would like to see vnodes in Cassandra.
> It can also potentially reduce the SSTable seeks which a node has to do to
> query data in SizeTieredCompaction, if extended to the filesystem.
>
> But I 110% agree with Peter: we need to take incremental steps and start with
> the existing bootstrapping.

I'm guessing you're referring to Rick's proposal about ranges per node?

> Maybe we can start by assigning a set of ranges/tokens to a node instead of
> one token, and then build things around the movement of those ranges.

> I have been thinking about this for a while, but I'm having trouble getting to a
> point where I am comfortable changing big chunks of code.

It might help to see some more detail in the proposals.  Both ideas
seem invasive, vnodes more so, but there are many more benefits as
well.


-- 
Eric Evans
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Vijay
On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans  wrote:

> I'm guessing you're referring to Rick's proposal about ranges per node?
>

Maybe; what I mean is a little simpler than that... We can consider
every node having multiple conservative ranges and moving those ranges
for bootstrap etc., instead of finding the mid-point etc. in the bootstrap
code. Once we get that working all the way down to the FS/streaming layer, we
can then move those ranges and assign them to nodes in random orders. Hope
it makes sense.