Re: Quantifying Virtual Node Impact on Cassandra Availability

2018-04-17 Thread Richard Low
I'm also not convinced the problems listed in the paper with removenode are
so serious. With lots of vnodes per node, removenode causes data to be
streamed into all other nodes in parallel, so it is (n-1) times quicker than
replacement for n nodes. For RF=3, the failure rate goes up with vnodes by a
factor of (n-1)/4: without vnodes, after the first failure, any of 4
neighbouring node failures loses quorum, but with vnodes any other node
failure loses quorum. The increase in speed more than offsets this, so
vnodes with removenode in fact give theoretically 4x higher availability
than no vnodes.
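
A quick back-of-envelope check of that 4x (an illustrative sketch I'm
adding; the cluster size and rebuild window are arbitrary):

    n = 100                      # cluster size; any n > 1 gives the same ratio
    window = 1.0                 # single-token replacement time, normalised

    # Without vnodes: 4 neighbouring failures lose quorum, full-length window.
    no_vnodes = 4 * window
    # With vnodes + removenode: any of the n-1 remaining nodes is critical,
    # but streaming fans out to all of them, so the window shrinks by (n-1).
    vnodes = (n - 1) * (window / (n - 1))

    print(no_vnodes / vnodes)    # -> 4.0, independent of n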

If anyone is interested in using vnodes in large clusters I'd strongly
suggest testing this out to see if the concerns in section 4.3.3 are valid.

Richard.

On 17 April 2018 at 08:29, Jeff Jirsa  wrote:

> There are two huge advantages
>
> 1) during expansion / replacement / decom, you stream from far more
> ranges. Since streaming is single threaded per stream, this enables you to
> max out machines during streaming where single token doesn’t
>
> 2) when adjusting the size of a cluster, you can often grow incrementally
> without rebalancing
>
> Streaming entire wholly covered/contained/owned sstables during range
> movements is probably a huge benefit in many use cases; it may make the
> single-threaded streaming implementation less of a concern, and likely
> works reasonably well without major changes to LCS in particular. I'm
> fairly confident there's a JIRA for this; if not, it's been discussed in
> person among various operators for years as an obvious future improvement.
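
To put rough numbers on point 1 (a toy calculation with assumed figures,
not measurements from any cluster):

    per_stream = 50    # MB/s a single-threaded stream manages (assumed)
    node_limit = 1200  # MB/s the receiving machine can absorb (assumed)
    data = 2_000_000   # MB to move onto the new/replacement node (assumed)

    for streams in (1, 4, 256):  # single token vs. a few vs. vnode-many ranges
        rate = min(streams * per_stream, node_limit)
        print(streams, "streams ->", round(data / rate / 3600, 1), "hours")

With one range you crawl at a single stream's pace; with hundreds of vnode
ranges the receiving machine's own limit becomes the bottleneck.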
>
> --
> Jeff Jirsa
>
>
> > On Apr 17, 2018, at 8:17 AM, Carl Mueller 
> wrote:
> >
> > Do vnodes address anything besides sparing cluster planners from doing
> > manual token range management on nodes? Do we have a centralized list of
> > advantages they provide beyond that?
> >
> > There seem to be lots of downsides: 2i performance, the availability
> > issues above, etc.
> >
> > I also wonder whether, with vnodes (and manually managed tokens... I'll
> > return to this), node recovery scenarios are being hampered by sstables
> > having the hash ranges of the vnodes intermingled in the same set of
> > sstables. I wondered in another thread why sstables aren't separated into
> > sets by the vnode ranges they represent. For a manually managed contiguous
> > token range, you could separate the sstables into a fixed number of sets,
> > a kind of vnode-light.
> >
> > So if there was rebalancing or reconstruction, you could sneakernet or
> > reliably send entire sstable sets that would belong in a range.
> >
> > I also think this would improve compactions and repairs. Compactions
> > would be naturally parallelizable in all compaction schemes, and repairs
> > would have natural subsets for merkle tree calculations.
> >
> > Granted, sending sstables might result in "overstreaming" due to data
> > replication across the sstables, but you wouldn't pay CPU and random I/O
> > to look up the data - just sequential transfers.
> >
> > For manually managed tokens with subdivided sstables, if there was
> > rebalancing, you would have the "fringe" edges of the hash range
> > subdivided already, and would only need to deal with the data in the
> > border areas of the token range; again, you could sneakernet / dumb
> > transfer the tables and then let the new node remove the unneeded data in
> > future repairs. (Compaction does not remove data that is no longer
> > managed by a node, only repair does? Or does only nodetool cleanup do
> > that?)
> >
> > Pre-subdivided sstables for manually managed tokens would REALLY pay big
> > dividends in large-scale cluster expansion. Say you wanted to double or
> > triple the cluster. Since the sstables are already split by some numeric
> > factor that has lots of even divisors (60 for RF 2, 3, 4, 5), you simply
> > bulk copy the already-subdivided sstables for the new nodes' hash ranges
> > and you'd basically be done. With AWS EBS volumes, that could just be a
> > drive detach / drive attach.
> >
> >> On Tue, Apr 17, 2018 at 7:37 AM, kurt greaves 
> wrote:
> >>
> >> Great write-up. Glad someone finally did the math for us. I don't think
> >> this will come as a surprise to many of the developers. Availability is
> >> only one issue raised by vnodes. Load distribution and performance are
> >> also pretty big concerns.
> >>
> >> I'm always a proponent of fixing vnodes, and of removing them as a
> >> default until we do. Happy to help on this; we have ideas in mind that
> >> I'll create tickets for at some point...
> >>
> >>> On Tue., 17 Apr. 2018, 06:16 Joseph Lynch, 
> wrote:
> >>>
> >>> If the blob link on github doesn't work for the pdf (looks like mobile
> >>> might not like it), try:
> >>>
> >>>
> >>> https://github.com/jolynch/python_performance_toolkit/raw/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf
> >>>
> >>> -Joey

Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-11 Thread Richard Low
+1. Although lots of people are still using Thrift, it's not a good use of
time to maintain two interfaces when one is clearly better. But, yes,
retaining Thrift for some time is important.


On 11 March 2014 17:27, sankalp kohli  wrote:

> RIP Thrift :)
> +1 with "We will retain it for backwards compatibility". Hopefully most
> people will move out of thrift by 2.1
>
>
> On Tue, Mar 11, 2014 at 10:18 AM, Brandon Williams 
> wrote:
>
> > As someone who has written a thrift wrapper, +1
> >
> >
> > On Tue, Mar 11, 2014 at 12:00 PM, Jonathan Ellis 
> > wrote:
> >
> > > CQL3 is almost two years old now and has proved to be the better API
> > > that Cassandra needed.  CQL drivers have caught up with and passed the
> > > Thrift ones in terms of features, performance, and usability.  CQL is
> > > easier to learn and more productive than Thrift.
> > >
> > > With static columns and LWT batch support [1] landing in 2.0.6, and
> > > UDT in 2.1 [2], I don't know of any use cases for Thrift that can't be
> > > done in CQL.  Contrariwise, CQL makes many things easy that are
> > > difficult to impossible in Thrift.  New development is overwhelmingly
> > > done using CQL.
> > >
> > > To date we have had an unofficial and poorly defined policy of "add
> > > support for new features to Thrift when that is 'easy.'"  However,
> > > even relatively simple Thrift changes can create subtle complications
> > > for the rest of the server; for instance, allowing Thrift range
> > > tombstones would make filter conversion for CASSANDRA-6506 more
> > > difficult.
> > >
> > > Thus, I think it's time to officially close the book on Thrift.  We
> > > will retain it for backwards compatibility, but we will commit to
> > > adding no new features or changes to the Thrift API after 2.1.0.  This
> > > will help send an unambiguous message to users and eliminate any
> > > remaining confusion from supporting two APIs.  If any new use cases
> > > come to light that can be done with Thrift but not CQL, we will commit
> > > to supporting those in CQL.
> > >
> > > (To a large degree, this merely formalizes what is already de facto
> > > reality.  Most thrift clients have not even added support for
> > > atomic_batch_mutate and cas from 2.0, and popular clients like
> > > Astyanax are migrating to the native protocol.)
> > >
> > > Reasonable?
> > >
> > > [1] https://issues.apache.org/jira/browse/CASSANDRA-6561
> > > [2] https://issues.apache.org/jira/browse/CASSANDRA-5590
> > >
> > > --
> > > Jonathan Ellis
> > > Project Chair, Apache Cassandra
> > > co-founder, http://www.datastax.com
> > > @spyced
> > >
> >
>


Re: [VOTE] Release Apache Cassandra 2.0.7

2014-04-16 Thread Richard Low
+1 (non-binding) on getting 2.0.7 out soon


On 16 April 2014 07:44, Sylvain Lebresne  wrote:

> On Mon, Apr 14, 2014 at 8:32 PM, Pavel Yaskevich 
> wrote:
>
> > Can I push new release of the thrift-server before we roll 2.0.7?
> >
>
> When the vote email goes out, the artifacts are already rolled per se, so
> pushing anything means re-rolling the artifacts and the vote. So I guess
> the question is: does this fix some regression compared to 2.0.6? If it
> doesn't, then I'd say 2.0.7 has been long enough in coming that I'd rather
> get it out; it has enough important fixes over 2.0.6.
>
> --
> Sylvain
>
>
> >
> >
> > On Mon, Apr 14, 2014 at 10:57 AM, Jonathan Ellis 
> > wrote:
> >
> > > +1
> > > On Apr 14, 2014 10:39 AM, "Sylvain Lebresne" 
> > wrote:
> > >
> > > > sha1: 7dbbe9233ce83c2a473ba2510c827a661de99400
> > > > Git:
> > > > http://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/2.0.7-tentative
> > > > Artifacts:
> > > > https://repository.apache.org/content/repositories/orgapachecassandra-1009/org/apache/cassandra/apache-cassandra/2.0.7/
> > > > Staging repository:
> > > > https://repository.apache.org/content/repositories/orgapachecassandra-1009/
> > > >
> > > > The artifacts as well as the debian package are also available here:
> > > > http://people.apache.org/~slebresne/
> > > >
> > > > The vote will be open for 72 hours (longer if needed).
> > > >
> > > > [1]: http://goo.gl/6yg6Xh (CHANGES.txt)
> > > > [2]: http://goo.gl/GxmBC9 (NEWS.txt)
> > > >
> > >
> >
>


Re: [VOTE] Release Apache Cassandra 2.0.8

2014-05-06 Thread Richard Low
There's a small mistake in CHANGES.txt - these changes from the 1.2 branch
were already in 2.0.7:

 * Continue assassinating even if the endpoint vanishes (CASSANDRA-6787)
 * Schedule schema pulls on change (CASSANDRA-6971)
 * Non-droppable verbs shouldn't be dropped from OTC (CASSANDRA-6980)
 * Shutdown batchlog executor in SS#drain() (CASSANDRA-7025)

Richard.

On 6 May 2014 02:12, Sylvain Lebresne  wrote:

> Since a fair amount of bug fixes have been committed since 2.0.7 I propose
> the
> following artifacts for release as 2.0.8.
>
> sha1: 7dbbe9233ce83c2a473ba2510c827a661de99400
> Git:
>
> http://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/2.0.8-tentative
> Artifacts:
>
> https://repository.apache.org/content/repositories/orgapachecassandra-1011/org/apache/cassandra/apache-cassandra/2.0.8/
> Staging repository:
> https://repository.apache.org/content/repositories/orgapachecassandra-1011/
>
> The artifacts as well as the debian package are also available here:
> http://people.apache.org/~slebresne/
>
> The vote will be open for 72 hours (longer if needed).
>
> [1]: http://goo.gl/G3O7pF (CHANGES.txt)
> [2]: http://goo.gl/xBvQJU (NEWS.txt)
>


Announcing Acunu

2011-01-31 Thread Richard Low
Hello,

Just thought I'd drop everyone a quick line to let you know that Acunu
are looking for some talented devs to work on Cassandra.

Acunu are working on a storage platform for Big Data, including a
modified version of Cassandra on top of a native in-kernel key-value
store, with a bunch of deployment, management and monitoring tools.
In the coming months we're looking to open source our core storage
engine and submit patches back to the project.

Acunu's not ready for production use yet, but we're expanding our beta
right now, and are looking for people to put it through its paces. You
can read more at http://www.acunu.com/

Thanks

Richard

--
Richard Low
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Richard Low
On 20 March 2012 14:50, Rick Branson  wrote:

> To support a form of DF, I think some tweaking of the replica placement
> could achieve this effect quite well. We could introduce a variable into
> replica placement, which I'm going to incorrectly call DF for the purposes
> of illustration. The key range for a node would be sub-divided by DF (1 by
> default) and this would be used to further distribute replica selection
> based on this "sub-partition".
>
> Currently, the offset formula works out to be something like this:
>
> offset = replica
>
> For RandomPartitioner, DF placement might look something like:
>
> offset = replica + (token % DF)
>
> Now, I realize replica selection is actually much more complicated than this, 
> but these formulas are for illustration purposes.
>
> Modifying replica placement & the partitioners to support this seems 
> straightforward, but I'm unsure of what's required to get it working for ring 
> management operations. On the surface, it does seem like this could be added 
> without any kind of difficult migration support.
>
> Thoughts?
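
A minimal sketch of the quoted idea (my own reading of the illustration
formulas; the even token ranges and node count are assumptions, and this is
not Cassandra's real placement code):

    import random

    NODES = 8

    def replicas(token, rf, df):
        """Rick's sketch: offset = replica + (token % DF). The primary owner
        is the node whose (here, even) range contains the token."""
        primary = token * NODES // 2**64  # owner by range, not by modulus
        offset = token % df               # the DF "sub-partition"
        return [(primary + offset + r) % NODES for r in range(rf)]

    random.seed(0)
    hits = set()
    for _ in range(1000):  # tokens landing in node 0's primary range
        token = random.randrange(0, 2**64 // NODES)
        hits.update(replicas(token, rf=3, df=4))
    print(sorted(hits))    # -> [0, 1, 2, 3, 4, 5]: rf+df-1 nodes, not just rf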

This solution increases the DF, which has the advantage of providing
some balancing when a node is down temporarily.  The reads and writes
it would have served are now distributed across ~DF nodes.

However, it doesn't have any distributed rebuild.  In fact, any
distribution mechanism with one token per node cannot have distributed
rebuild.  Should a node fail, the next node in the ring has twice the
token range so must have twice the data.  This node will limit the
rebuild time - 'nodetool removetoken' will have to replicate the data
of the failed node onto this node.

Increasing the distribution factor without speeding up rebuild
increases the failure probability - both for data loss or being unable
to reach required consistency levels.  The failure probability is a
trade-off between rebuild time and distribution factor.  Lower rebuild
time helps, and lower distribution factor helps.

Cassandra as it is now has the longest rebuild time and lowest
possible distribution factor.  The original vnodes scheme is the other
extreme - shortest rebuild time and largest possible distribution
factor.  It turns out that the rebuild time is more important, so this
decreases failure probability (with some assumptions you can show it
decreases by a factor of RF! - I'll spare you the math but can send it
if you're interested).

This scheme has the longest rebuild time and a (tuneable) distribution
factor larger than the minimum.  That necessarily increases the failure
probability over both Cassandra as it is now and the vnode scheme, so
I'd be very careful about choosing it.
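
To see the trade-off concretely, here is a toy model I'm adding (strong
independence assumptions, and the constants differ from the more careful
RF! calculation above - it's the shape that matters):

    from math import comb

    def p_loss(df, T, rf=3, lam=1e-4):
        """P(rf-1 of the ~df nodes sharing the dead node's data also fail
        within the rebuild window T), small-probability approximation."""
        p = lam * T  # chance a given node fails during the window
        return comb(df, rf - 1) * p ** (rf - 1)

    n = 100
    print("single token:", p_loss(df=4, T=1.0))                # slow, small DF
    print("vnodes:      ", p_loss(df=n - 1, T=1.0 / (n - 1)))  # fast, large DF
    print("this scheme: ", p_loss(df=8, T=1.0))                # slow, larger DF

The intermediate scheme comes out worst of the three, matching the
conclusion above.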

Richard.


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Richard Low
On 20 March 2012 14:55, Jonathan Ellis  wrote:
> Here's how I see Sam's list:
>
> * Even load balancing when growing and shrinking the cluster
>
> Nice to have, but post-bootstrap load balancing works well in practice
> (and is improved by TRP).

Post-bootstrap load balancing without vnodes necessarily streams more
data than needed.  Vnodes stream the minimal amount.

In fact, post-bootstrap load balancing currently streams a constant
fraction of your data - the network traffic involved in a rebalance
increases linearly with the size of your cluster.  With vnodes it
decreases linearly.
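
A toy simulation I'm adding to illustrate the constant-fraction claim
(model assumption: n evenly spaced single tokens rebalanced to n+1 evenly
spaced tokens, with nodes keeping their ring order):

    def moved_fraction(n):
        """Fraction of the ring whose owner changes when rebalancing n evenly
        spaced single tokens to n+1 (node i owns [i/n, (i+1)/n), etc.)."""
        cuts = sorted({i / n for i in range(n)} | {i / (n + 1) for i in range(n + 1)})
        cuts.append(1.0)
        moved = 0.0
        for a, b in zip(cuts, cuts[1:]):
            mid = (a + b) / 2
            if int(mid * n) != int(mid * (n + 1)):  # old owner != new owner
                moved += b - a
        return moved

    for n in (4, 16, 64, 256):
        print(n, round(moved_fraction(n), 3), "vs vnode minimum", round(1 / (n + 1), 3))

About half the ring changes owner at every cluster size, while the vnode
minimum, 1/(n+1), keeps shrinking.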

Add in the ops overhead of running the load balance and calculating new
tokens, and removing post-bootstrap load balancing becomes a pretty big
deal.

> * Greater failure tolerance in streaming
>
> Directly addressed by TRP.

Agreed.

> * Evenly distributed impact of streaming operations
>
> Not a problem in practice with stream throttling.

Throttling slows them down, increasing rebuild times and therefore downtime.

> * Possibility for active load balancing
>
> Not really a feature of vnodes per se, but as with the other load
> balancing point, this is also improved by TRP.

Again with the caveat that more data is streamed with TRP.  Vnodes
remove the need for any load balancing with RP.

> * Distributed rebuild
>
> This is the 20% that TRP does not address.  Nice to have?  Yes.  Can I
> live without it?  I have so far.  Is this alone worth the complexity
> of vnodes?  No, it is not.  Especially since there are probably other
> approaches that we can take to mitigate this, one of which Rick has
> suggested in a separate sub-thread.

Distributed rebuild means you can store more data per node with the
same failure probabilities.  Rebuild time is frequently the limiting
factor on how much data you can store per node, increasing cluster
sizes unnecessarily.  I'd argue that this alone is worth the
complexity of vnodes.

Richard.


Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Richard Low
On 22 March 2012 05:48, Zhu Han  wrote:

> I second it.
>
> Are there any goals we missed which cannot be achieved by assigning
> multiple tokens to a single node?

This is exactly the proposed solution.  The discussion is about how to
implement this, and the methods of choosing tokens and replication
strategy.

Richard.