Re: Stabilising Internode Messaging in 4.0

2019-04-12 Thread Pavel Yaskevich
On Thu, Apr 11, 2019 at 10:15 PM Joshua McKenzie wrote:

> As one of the two people that re-wrote all our unit tests to try and help
> Sylvain get 8099 out the door, I think it's inaccurate to compare the scope
> and potential stability impact of this work to the truly sweeping work that
> went into 8099 (not to downplay the scope and extent of this work here).
>
> TBH, one of the big reasons we tend to drop such large PRs is the fact that
> > Cassandra's code is highly intertwined and it makes it hard to precisely
> > change things. We need to iterate towards interfaces that allow us to
> > iterate quickly and reduce the amount of highly intertwined code. It helps
> > with testing as well. I want us to have a meaningful discussion around it
> > before we drop a big PR.
>
> This has been a huge issue with our codebase since at least back when I
> first encountered it five years ago. To date, while we have made progress
> on this front, it's been nowhere near sufficient to mitigate the issues in
> the codebase and allow for large, meaningful changes in smaller incremental
> patches or commits. Having yet another discussion around this (there have
> been many, many of them over the years) as a blocker for significant work
> to go into the codebase is an unnecessary and dangerous blocker. Not to say
> we shouldn't formalize a path to try and make incremental progress to
> improve the situation, far from it, but blocking other progress on a
> decade's worth of accumulated hygiene problems isn't going to make the
> community focus on fixing those problems imo, it'll just turn away
> contributions.
>

> So let me second jd (and many others') opinion here: "it makes sense to get
> it right the first time, rather than applying bandaids to 4.0 and rewriting
> things for 4.next". And fwiw, asking people who have already done a huge
> body of work to reformat that work into a series of commits or to break up
> that work in a fashion that's more to the liking of people not involved in
> either the writing of the patch or reviewing of it doesn't make much sense
> to me. As I am neither an assignee nor reviewer on this contribution, I
> leave it up to the parties involved to do things professionally and with a
> high standard of quality. Admittedly, a large code change merging in like
> this has implications for rebasing on anyone else's work that's in flight,
> but be it one commit merged or 50, or be it one JIRA ticket or ten, the
> end-result is the same; any large contribution in any format will ripple
> outwards and require re-work from others in the community.
>
> The one thing I *would* strongly argue for is performance benchmarking of
> the new messaging code on a representative sample of different
> general-purpose queries, LWT's, etc, preferably in a 3 node RF=3 cluster,
> plus a healthy suite of jmh micro-benches (assuming they're not already in
> the diff. If they are, disregard / sorry). From speaking with Aleksey
> offline about this work, my understanding is that that's something they
> plan on doing before putting a bow on things.
>
> In the balance between "fear change because it destabilizes" and "go forth
> blindly into that dark night, rewriting All The Things", I think the
> Cassandra project's willingness to jettison the old and introduce the new
> has served it well in keeping relevant as the years have gone by. I'd hate
> to see that culture of progress get mired in a dogmatic adherence to
> requirements on commit counts, lines of code allowed / expected on a given
> patch set, or any other metrics that might stymie the professional needs of
> some of the heaviest contributors to the project.
>

+1. Based on all of the discussion here and in JIRA, it seems to me that
we'd be doing a big disservice to the users by outright rejecting the
changes just based on +/- LoC or the complexity of review. From the points
raised, it seems like enabling encryption by default (or even making it the
only available option?), upstreaming Netty-related changes, possible steps
to improve the codebase, as well as how the changes should be formatted to
aid the reviewers, could all be discussed separately.

I think at the end of the day it _might be_ reasonable for the PMC to have
a final say on the matter, maybe even on a point-by-point basis.

>

>
> On Wed, Apr 10, 2019 at 5:03 PM Oleksandr Petrov <
> oleksandr.pet...@gmail.com>
> wrote:
>
> > Sorry to pick only a few points to address, but I think these are ones
> > where I can contribute productively to the discussion.
> >
> > > In principle, I agree with the technical improvements you
> > > mention (backpressure / checksumming / etc). These things should be
> > > there. Are they a hard requirement for 4.0?
> >
> > One thing that comes to mind is protocol versioning and consistency. If
> > changes adding checksumming and handshake do not make it to 4.0, we grow
> > the upgrade matrix and have to put changes to the separate protocol
> > version. I'm not sure how many other internode protocol changes we have
> > planned for 4.next, but this is definitely something we should keep in
> > mind.

Re: Stabilising Internode Messaging in 4.0

2019-04-12 Thread Benedict Elliott Smith
I don’t have a lot to add to Josh’s contribution, except that I’d like to 
really hammer home that many people were a party to 8099, and as a project we 
learned a great deal from the experience.  It’s a very complex topic that does 
not lend itself to simple comparisons, but I think anyone who participated in 
that work would find it strange to see these two pieces of work compared.  I 
think it would also be helpful if we stopped using it as some kind of bogeyman. 
 It seems too easy to forget how much positive change came out of 8099, and how 
many bugs we have since avoided because of it.  A lot of people put a herculean 
effort into making it happen: by my recollection Sylvain alone spent perhaps a 
year of his time on the initial and follow-up work.  Most of the active 
contributors participated in some way, for months in many cases.  Every time we 
talk about it in this way it denigrates a lot of good work.  Using it as a 
rhetorical device without seemingly appreciating what was involved, or where it 
went wrong, is even less helpful.

On a personal note, I found a couple of the responses to this thread 
disappointing.  As far as I can tell, neither engaged with my email, in which I 
justify our approach on most of their areas of concern.  Nor accepted the 
third-party reviewer’s comments that the patch is manageable to review and of 
acceptable scope.  Nor seemingly read the patch with care to reach their own 
conclusion, with the one concrete factual assertion about the code being false.

We’re trying to build a more positive and constructive community here than 
there has been in the past.  I want to encourage and welcome critical feedback, 
but I think it is incumbent on critics to do some basic research and to engage 
with the target of their criticism - lest they appear to have a goal of 
frustrating a body of work rather than improving it.  Please take a moment to 
read my email, take a closer look at the patch itself, and then engage with us 
on Jira with specific constructive feedback, and concrete positive suggestions.

I'd like to thank everyone else for taking the time to provide their thoughts, 
and we hope to address any lingering concerns.  I would love to hear your 
feedback on our testing and documentation plan [1] that we have put together 
and are executing on.

[1] 
https://cwiki.apache.org/confluence/display/CASSANDRA/4.0+Internode+Messaging+Test+Plan
 



> On 12 Apr 2019, at 08:56, Pavel Yaskevich  wrote:
> 
> On Thu, Apr 11, 2019 at 10:15 PM Joshua McKenzie wrote:
> 
>> As one of the two people that re-wrote all our unit tests to try and help
>> Sylvain get 8099 out the door, I think it's inaccurate to compare the scope
>> and potential stability impact of this work to the truly sweeping work that
>> went into 8099 (not to downplay the scope and extent of this work here).
>> 
>> TBH, one of the big reasons we tend to drop such large PRs is the fact that
>>> Cassandra's code is highly intertwined and it makes it hard to precisely
>>> change things. We need to iterate towards interfaces that allow us to
>>> iterate quickly and reduce the amount of highly intertwined code. It helps
>>> with testing as well. I want us to have a meaningful discussion around it
>>> before we drop a big PR.
>> 
>> This has been a huge issue with our codebase since at least back when I
>> first encountered it five years ago. To date, while we have made progress
>> on this front, it's been nowhere near sufficient to mitigate the issues in
>> the codebase and allow for large, meaningful changes in smaller incremental
>> patches or commits. Having yet another discussion around this (there have
>> been many, many of them over the years) as a blocker for significant work
>> to go into the codebase is an unnecessary and dangerous blocker. Not to say
>> we shouldn't formalize a path to try and make incremental progress to
>> improve the situation, far from it, but blocking other progress on a
>> decade's worth of accumulated hygiene problems isn't going to make the
>> community focus on fixing those problems imo, it'll just turn away
>> contributions.
>> 
> 
>> So let me second jd (and many others') opinion here: "it makes sense to get
>> it right the first time, rather than applying bandaids to 4.0 and rewriting
>> things for 4.next". And fwiw, asking people who have already done a huge
>> body of work to reformat that work into a series of commits or to break up
>> that work in a fashion that's more to the liking of people not involved in
>> either the writing of the patch or reviewing of it doesn't make much sense
>> to me. As I am neither an assignee nor reviewer on this contribution, I
>> leave it up to the parties involved to do things professionally and with a
>> high standard of quality. Admittedly, a large code change merging in like
>> this has implications for rebasing on anyone else's work that's in flight,
>> but be it one commit merged or 50, or be it one JIRA ticket or ten, the
>> end-result is the same; any large contribution in any format will ripple
>> outwards and require re-work from others in the community.

Re: Stabilising Internode Messaging in 4.0

2019-04-12 Thread Blake Eggleston
Well said Josh. You’ve pretty much summarized my thoughts on this as well.

+1 to moving forward with this

> On Apr 11, 2019, at 10:15 PM, Joshua McKenzie  wrote:
> 
> As one of the two people that re-wrote all our unit tests to try and help
> Sylvain get 8099 out the door, I think it's inaccurate to compare the scope
> and potential stability impact of this work to the truly sweeping work that
> went into 8099 (not to downplay the scope and extent of this work here).
> 
> TBH, one of the big reasons we tend to drop such large PRs is the fact that
>> Cassandra's code is highly intertwined and it makes it hard to precisely
>> change things. We need to iterate towards interfaces that allow us to
>> iterate quickly and reduce the amount of highly intertwined code. It helps
>> with testing as well. I want us to have a meaningful discussion around it
>> before we drop a big PR.
> 
> This has been a huge issue with our codebase since at least back when I
> first encountered it five years ago. To date, while we have made progress
> on this front, it's been nowhere near sufficient to mitigate the issues in
> the codebase and allow for large, meaningful changes in smaller incremental
> patches or commits. Having yet another discussion around this (there have
> been many, many of them over the years) as a blocker for significant work
> to go into the codebase is an unnecessary and dangerous blocker. Not to say
> we shouldn't formalize a path to try and make incremental progress to
> improve the situation, far from it, but blocking other progress on a
> decade's worth of accumulated hygiene problems isn't going to make the
> community focus on fixing those problems imo, it'll just turn away
> contributions.
> 
> So let me second jd (and many others') opinion here: "it makes sense to get
> it right the first time, rather than applying bandaids to 4.0 and rewriting
> things for 4.next". And fwiw, asking people who have already done a huge
> body of work to reformat that work into a series of commits or to break up
> that work in a fashion that's more to the liking of people not involved in
> either the writing of the patch or reviewing of it doesn't make much sense
> to me. As I am neither an assignee nor reviewer on this contribution, I
> leave it up to the parties involved to do things professionally and with a
> high standard of quality. Admittedly, a large code change merging in like
> this has implications for rebasing on anyone else's work that's in flight,
> but be it one commit merged or 50, or be it one JIRA ticket or ten, the
> end-result is the same; any large contribution in any format will ripple
> outwards and require re-work from others in the community.
> 
> The one thing I *would* strongly argue for is performance benchmarking of
> the new messaging code on a representative sample of different
> general-purpose queries, LWT's, etc, preferably in a 3 node RF=3 cluster,
> plus a healthy suite of jmh micro-benches (assuming they're not already in
> the diff. If they are, disregard / sorry). From speaking with Aleksey
> offline about this work, my understanding is that that's something they
> plan on doing before putting a bow on things.
> 
> In the balance between "fear change because it destabilizes" and "go forth
> blindly into that dark night, rewriting All The Things", I think the
> Cassandra project's willingness to jettison the old and introduce the new
> has served it well in keeping relevant as the years have gone by. I'd hate
> to see that culture of progress get mired in a dogmatic adherence to
> requirements on commit counts, lines of code allowed / expected on a given
> patch set, or any other metrics that might stymie the professional needs of
> some of the heaviest contributors to the project.
> 
> On Wed, Apr 10, 2019 at 5:03 PM Oleksandr Petrov wrote:
> 
>> Sorry to pick only a few points to address, but I think these are ones
>> where I can contribute productively to the discussion.
>> 
>>> In principle, I agree with the technical improvements you
>>> mention (backpressure / checksumming / etc). These things should be there.
>>> Are they a hard requirement for 4.0?
>> 
>> One thing that comes to mind is protocol versioning and consistency. If
>> changes adding checksumming and handshake do not make it to 4.0, we grow
>> the upgrade matrix and have to put changes to the separate protocol
>> version. I'm not sure how many other internode protocol changes we have
>> planned for 4.next, but this is definitely something we should keep in
>> mind.
>> 
>>> 2. We should not be measuring complexity in LoC with the exception that
>>> all 20k lines *do need to be reviewed* (not just the important parts and
>>> because code refactoring tools have bugs too) and more lines take more
>>> time.
>> 
>> Everything should definitely be reviewed. But with different rigour: one
>> thing is to review byte arithmetic and protocol formats and a completely
>> different thing is to verify that Verb moved from one place

TLP tools for stress testing and building test clusters in AWS

2019-04-12 Thread Jon Haddad
I don't want to derail the discussion about Stabilizing Internode
Messaging, so I'm starting this as a separate thread.  There was a
comment that Josh made [1] about doing performance testing with real
clusters as well as a lot of microbenchmarks, and I'm 100% in support
of this.  We've been working on some tooling at TLP for the last
several months to make this a lot easier.  One of the goals has been
to help improve the 4.0 testing process.

The first tool we have is tlp-stress [2].  It's designed with a "get
started in 5 minutes" mindset.  My goal was to ship a stress tool with
real workloads out of the box that can be easily tweaked, similar to how
fio allows you to design a disk workload and tweak it with parameters.
Included are stress workloads that stress LWTs (two different types),
materialized views, counters, time series, and key-value workloads.  Each
workload can be modified easily to change compaction strategies,
concurrent operations, and the number of partitions.
We can run workloads for a set number of iterations or a custom
duration.  We've used this *extensively* at TLP to help our customers
and most of our blog posts that discuss performance use it as well.
It exports data to CSV and also sets up Prometheus automatically for
metrics collection / aggregation.  As an example, we were able to
determine that the compression length set on the paxos tables imposes
a significant overhead when using the Locking LWT workload, which
simulates locking and unlocking of rows.  See CASSANDRA-15080 for
details.
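
To give a flavor of the CLI, a run looks roughly like the following (the
flag spellings here are approximate; check the docs [3] for the exact
syntax):

    # run the key/value workload for an hour, overriding compaction
    # and the partition count (illustrative values only)
    tlp-stress run KeyValue \
        --duration 1h \
        --partitions 10M \
        --compaction "{'class': 'LeveledCompactionStrategy'}"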

We have documentation [3] on the TLP website.

The second tool we've been working on is tlp-cluster [4].  This tool
is designed to help provision AWS instances for the purposes of
testing.  To be clear, I don't expect, or want, this tool to be used
for production environments.  It's designed to assist with the
Cassandra build process by generating deb packages or re-using the
ones that have already been uploaded.  Here's a short list of the
things you'll care about:

1. Create instances in AWS for Cassandra using any instance size and
number of nodes.  Also create tlp-stress instances and a box for
monitoring
2. Use any available build of Cassandra, with a quick option to change
YAML config.  For example: tlp-cluster use 3.11.4 -c
concurrent_writes:256
3. Do custom builds just by pointing to a local Cassandra git repo.
They can be used the same way as #2.
4. tlp-stress is automatically installed on the stress box.
5. Everything's installed with pure bash.  I considered something more
complex, but since this is for development only, it turns out the
simplest tool possible works well and it means it's easily
configurable.  Just drop in your own bash script named with a leading
number in the XX_script_name.sh format and it gets run (see the example
script after this list).
6. The monitoring box is running Prometheus.  It auto scrapes
Cassandra using the Instaclustr metrics library.
7. Grafana is also installed automatically.  There are a couple of sample
graphs there now.  We plan on having better default graphs soon.
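
As a concrete example of the drop-in mechanism from point 5 (the file name
and contents here are made up purely for illustration), a file saved as
42_raise_limits.sh next to the other scripts would get picked up and run
in order during provisioning:

    #!/usr/bin/env bash
    # 42_raise_limits.sh -- hypothetical drop-in provisioning hook
    set -euo pipefail
    # raise the open-file limit for the cassandra user before it starts
    echo 'cassandra - nofile 100000' | sudo tee /etc/security/limits.d/cassandra.conf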

For the moment it installs Java 8 only, but that should be easily
fixable to use Java 11 to test ZGC (it's on my radar).

Documentation for tlp-cluster is here [5].

There are still some things to work out in the tool, and we've been
working hard to smooth out the rough edges.  I still haven't announced
anything WRT tlp-cluster on the TLP blog, because I don't think it's
quite ready for public consumption, but I think the folks on this list
are smart enough to see the value in it even if it has a few warts
still.

I don't consider myself familiar enough with the networking patch to
give it a full review, but I am qualified to build tools to help test
it and go through the testing process myself.  From what I can tell
the patch is moving the codebase in a positive direction and I'd like
to help build confidence in it so we can get it merged in.

We'll continue to build out and improve the tooling with the goal of
making it easier for people to jump into the QA side of things.

Jon

[1] 
https://lists.apache.org/thread.html/742009c8a77999f4b62062509f087b670275f827d0c1895bf839eece@%3Cdev.cassandra.apache.org%3E
[2] https://github.com/thelastpickle/tlp-stress
[3] http://thelastpickle.com/tlp-stress/
[4] https://github.com/thelastpickle/tlp-cluster
[5] http://thelastpickle.com/tlp-cluster/




Re: TLP tools for stress testing and building test clusters in AWS

2019-04-12 Thread Aleksey Yeshchenko
Hey Jon,

This sounds exciting and pretty useful, thanks.

Looking forward to using tlp-stress for validating 15066 performance.

We should touch base some time next week to pick a comprehensive set of 
workloads and versions, perhaps?


> On 12 Apr 2019, at 16:34, Jon Haddad  wrote:
> 
> I don't want to derail the discussion about Stabilizing Internode
> Messaging, so I'm starting this as a separate thread.  There was a
> comment that Josh made [1] about doing performance testing with real
> clusters as well as a lot of microbenchmarks, and I'm 100% in support
> of this.  We've been working on some tooling at TLP for the last
> several months to make this a lot easier.  One of the goals has been
> to help improve the 4.0 testing process.
> 
> The first tool we have is tlp-stress [2].  It's designed with a "get
> started in 5 minutes" mindset.  My goal was to ship a stress tool with
> real workloads out of the box that can be easily tweaked, similar to how
> fio allows you to design a disk workload and tweak it with parameters.
> Included are stress workloads that stress LWTs (two different types),
> materialized views, counters, time series, and key-value workloads.  Each
> workload can be modified easily to change compaction strategies,
> concurrent operations, and the number of partitions.
> We can run workloads for a set number of iterations or a custom
> duration.  We've used this *extensively* at TLP to help our customers
> and most of our blog posts that discuss performance use it as well.
> It exports data to CSV and also sets up Prometheus automatically for
> metrics collection / aggregation.  As an example, we were able to
> determine that the compression length set on the paxos tables imposes
> a significant overhead when using the Locking LWT workload, which
> simulates locking and unlocking of rows.  See CASSANDRA-15080 for
> details.
> 
> We have documentation [3] on the TLP website.
> 
> The second tool we've been working on is tlp-cluster [4].  This tool
> is designed to help provision AWS instances for the purposes of
> testing.  To be clear, I don't expect, or want, this tool to be used
> for production environments.  It's designed to assist with the
> Cassandra build process by generating deb packages or re-using the
> ones that have already been uploaded.  Here's a short list of the
> things you'll care about:
> 
> 1. Create instances in AWS for Cassandra using any instance size and
> number of nodes.  Also create tlp-stress instances and a box for
> monitoring
> 2. Use any available build of Cassandra, with a quick option to change
> YAML config.  For example: tlp-cluster use 3.11.4 -c
> concurrent_writes:256
> 3. Do custom builds just by pointing to a local Cassandra git repo.
> They can be used the same way as #2.
> 4. tlp-stress is automatically installed on the stress box.
> 5. Everything's installed with pure bash.  I considered something more
> complex, but since this is for development only, it turns out the
> simplest tool possible works well and it means it's easily
> configurable.  Just drop in your own bash script starting with a
> number in a XX_script_name.sh format and it gets run.
> 6. The monitoring box is running Prometheus.  It auto scrapes
> Cassandra using the Instaclustr metrics library.
> 7. Grafana is also installed automatically.  There are a couple of sample
> graphs there now.  We plan on having better default graphs soon.
> 
> For the moment it installs Java 8 only, but that should be easily
> fixable to use Java 11 to test ZGC (it's on my radar).
> 
> Documentation for tlp-cluster is here [5].
> 
> There are still some things to work out in the tool, and we've been
> working hard to smooth out the rough edges.  I still haven't announced
> anything WRT tlp-cluster on the TLP blog, because I don't think it's
> quite ready for public consumption, but I think the folks on this list
> are smart enough to see the value in it even if it has a few warts
> still.
> 
> I don't consider myself familiar enough with the networking patch to
> give it a full review, but I am qualified to build tools to help test
> it and go through the testing process myself.  From what I can tell
> the patch is moving the codebase in a positive direction and I'd like
> to help build confidence in it so we can get it merged in.
> 
> We'll continue to build out and improve the tooling with the goal of
> making it easier for people to jump into the QA side of things.
> 
> Jon
> 
> [1] 
> https://lists.apache.org/thread.html/742009c8a77999f4b62062509f087b670275f827d0c1895bf839eece@%3Cdev.cassandra.apache.org%3E
> [2] https://github.com/thelastpickle/tlp-stress
> [3] http://thelastpickle.com/tlp-stress/
> [4] https://github.com/thelastpickle/tlp-cluster
> [5] http://thelastpickle.com/tlp-cluster/
> 

Re: Stabilising Internode Messaging in 4.0

2019-04-12 Thread Jordan West
I understand these non-technical discussions are not what everyone wants to
focus on, but they are extremely pertinent to the stability of the project.
What I would like to see before merging this in is below. They are all
reasonable asks in my opinion that will still result in the patch being
merged — only w/ even more confidence in its quality. More details on my
thoughts behind it follow.

1. Additional third party/independent reviewers added and required before
merge.

2. The patch should at a minimum be broken up (commit-wise, not ticket
wise) so the newly added features (CRC/backpressure/virtual tables) can be
reviewed independent of the large patch set and can be easily included or
excluded based on community discussion or reviewer feedback — using the
exception process for new features we have used in the past during this
freeze (merging 13304 the day after freeze, the zstd changes, etc).

3. As Sankalp mentioned, and I believe is in progress, a test plan should
be published and executed (at least part of it should be executed before
the merge; it's possible some will happen post-merge, but this should be
minimal).

4. A (design) doc should be published to make it easier for reviewers to
approach the code.

With the above said, I apologize if the 8099 comment was a parallel that
was too close to home for some. I am sure they are not a direct comparison
but the parallel I was trying to draw (and continue to) is that the
development process that led to a project like 8099 continues to be used
here. My focus for Cassandra is improving Quality and Stability of 4.0 (and
ideally the 3.x series along the way) — especially in light of the recent
status email I sent that included over 25 bugs found in the last 6 months.

There is no question that the majority of this patch should go in — the bug
fixes are necessary and we have no alternatives written. The question to
me, besides increasing confidence in the patch, is can the authors make
this better for reviewers by putting some more effort into the
non-technical aspects of the patch and should the new features be included
given the already risky changes proposed in this patch and the demand to
review them?  The goal of the freeze was to reduce the time spent on new
work so we can improve what existed. We do have a process for exceptions to
that and if the community feels strongly about these features then we
should follow that process — which would involve isolating the changes from
the larger patch and having them be considered separately.

Further, as someone who has reviewed 13304 and found a bug others didn’t, I
don’t think having the code authors dictate the complexity or timeframe of
the review makes sense. That's not to say I didn’t read the email as
suggested.  I encourage you to consider that it's possible my experience
informed my contrary point of view, and how that sort of denigrating and
unneeded comment affects the community.

Anyways, part of those improvements have to come from how we design and
develop the database. Not just how we test and run it. Having worked on
several large projects on multiple databases (Cassandra’s SASI*, Riak’s
Cluster Metadata / Membership, Riak’s bucket types, and Riak’s ring
resizing feature, among others) and for large companies (those projects I
can’t talk about), I am sure it is possible to design and develop features
with a better process than the one used here. It is certainly possible, and
hugely beneficial, to break up code into smaller commits (Google also feels
this way: https://arxiv.org/pdf/1702.01715.pdf) and it's not unreasonable to
ask by any means. It should be a prerequisite. A patch like this requires
more from the authors than simply writing the code. Holding ourselves to a
higher standard will increase the quality of the database dramatically. The
same way committing to real testing has done (I again refer to all of the
bugs found during the freeze that were not found previously).

Hopefully it's clear from the above that I am very supportive of getting the
majority of these changes in. I think it would benefit the future of the
project if we did that in a more deliberate way than how risky changes like
this, late in the cycle, were handled in the past. We have an opportunity
to change that here and it would benefit the project significantly.
Cassandra’s willingness to jettison code has kept it relevant. Its process
for doing so, however, has had negative effects on the database's brand — the
deployment of the 3.x series was directly affected by presumptions (real or
otherwise) of quality. We could take this as an opportunity to fix that and
keep the nimble aspects of the database alive at the same time.

Jordan

* Unfortunately w/ SASI we did contribute one big commit publicly but there
was a better commit history during development that could have been shared
and I would have liked to see us make it more digestible (I think we would
have found more bugs before merge).

On Fri, Apr 12, 2019 at 8:21 AM 

Re: TLP tools for stress testing and building test clusters in AWS

2019-04-12 Thread Benedict Elliott Smith
+1

I’m also just as excited to see some standardised workloads and test bed.  At 
the moment we’re benefiting from some large contributors doing their own 
proprietary performance testing, which is super valuable and something we’ve 
lacked before.  But I’m also keen to see some more representative workloads 
that are reproducible by anybody in the community take shape.


> On 12 Apr 2019, at 18:09, Aleksey Yeshchenko wrote:
> 
> Hey Jon,
> 
> This sounds exciting and pretty useful, thanks.
> 
> Looking forward to using tlp-stress for validating 15066 performance.
> 
> We should touch base some time next week to pick a comprehensive set of 
> workloads and versions, perhaps?
> 
> 
>> On 12 Apr 2019, at 16:34, Jon Haddad  wrote:
>> 
>> I don't want to derail the discussion about Stabilizing Internode
>> Messaging, so I'm starting this as a separate thread.  There was a
>> comment that Josh made [1] about doing performance testing with real
>> clusters as well as a lot of microbenchmarks, and I'm 100% in support
>> of this.  We've been working on some tooling at TLP for the last
>> several months to make this a lot easier.  One of the goals has been
>> to help improve the 4.0 testing process.
>> 
>> The first tool we have is tlp-stress [2].  It's designed with a "get
>> started in 5 minutes" mindset.  My goal was to ship a stress tool with
>> real workloads out of the box that can be easily tweaked, similar to how
>> fio allows you to design a disk workload and tweak it with parameters.
>> Included are stress workloads that stress LWTs (two different types),
>> materialized views, counters, time series, and key-value workloads.  Each
>> workload can be modified easily to change compaction strategies,
>> concurrent operations, and the number of partitions.
>> We can run workloads for a set number of iterations or a custom
>> duration.  We've used this *extensively* at TLP to help our customers
>> and most of our blog posts that discuss performance use it as well.
>> It exports data to CSV and also sets up Prometheus automatically for
>> metrics collection / aggregation.  As an example, we were able to
>> determine that the compression length set on the paxos tables imposes
>> a significant overhead when using the Locking LWT workload, which
>> simulates locking and unlocking of rows.  See CASSANDRA-15080 for
>> details.
>> 
>> We have documentation [3] on the TLP website.
>> 
>> The second tool we've been working on is tlp-cluster [4].  This tool
>> is designed to help provision AWS instances for the purposes of
>> testing.  To be clear, I don't expect, or want, this tool to be used
>> for production environments.  It's designed to assist with the
>> Cassandra build process by generating deb packages or re-using the
>> ones that have already been uploaded.  Here's a short list of the
>> things you'll care about:
>> 
>> 1. Create instances in AWS for Cassandra using any instance size and
>> number of nodes.  Also create tlp-stress instances and a box for
>> monitoring
>> 2. Use any available build of Cassandra, with a quick option to change
>> YAML config.  For example: tlp-cluster use 3.11.4 -c
>> concurrent_writes:256
>> 3. Do custom builds just by pointing to a local Cassandra git repo.
>> They can be used the same way as #2.
>> 4. tlp-stress is automatically installed on the stress box.
>> 5. Everything's installed with pure bash.  I considered something more
>> complex, but since this is for development only, it turns out the
>> simplest tool possible works well and it means it's easily
>> configurable.  Just drop in your own bash script starting with a
>> number in a XX_script_name.sh format and it gets run.
>> 6. The monitoring box is running Prometheus.  It auto scrapes
>> Cassandra using the Instaclustr metrics library.
>> 7. Grafana is also installed automatically.  There are a couple of sample
>> graphs there now.  We plan on having better default graphs soon.
>> 
>> For the moment it installs Java 8 only, but that should be easily
>> fixable to use Java 11 to test ZGC (it's on my radar).
>> 
>> Documentation for tlp-cluster is here [5].
>> 
>> There are still some things to work out in the tool, and we've been
>> working hard to smooth out the rough edges.  I still haven't announced
>> anything WRT tlp-cluster on the TLP blog, because I don't think it's
>> quite ready for public consumption, but I think the folks on this list
>> are smart enough to see the value in it even if it has a few warts
>> still.
>> 
>> I don't consider myself familiar enough with the networking patch to
>> give it a full review, but I am qualified to build tools to help test
>> it and go through the testing process myself.  From what I can tell
>> the patch is moving the codebase in a positive direction and I'd like
>> to help build confidence in it so we can get it merged in.
>> 
>> We'll continue to build out and improve the tooling with the goal of
>> making it easier for people to jump into the QA side of things.
>> 
>

Re: Stabilising Internode Messaging in 4.0

2019-04-12 Thread Sam Tunnicliffe
+1 Thanks for articulating that so well Josh.

Sam

> On 12 Apr 2019, at 16:19, Blake Eggleston wrote:
> 
> Well said Josh. You’ve pretty much summarized my thoughts on this as well.
> 
> +1 to moving forward with this
> 
>> On Apr 11, 2019, at 10:15 PM, Joshua McKenzie  wrote:
>> 
>> As one of the two people that re-wrote all our unit tests to try and help
>> Sylvain get 8099 out the door, I think it's inaccurate to compare the scope
>> and potential stability impact of this work to the truly sweeping work that
>> went into 8099 (not to downplay the scope and extent of this work here).
>> 
>> TBH, one of the big reasons we tend to drop such large PRs is the fact that
>>> Cassandra's code is highly intertwined and it makes it hard to precisely
>>> change things. We need to iterate towards interfaces that allow us to
>>> iterate quickly and reduce the amount of highly intertwined code. It helps
>>> with testing as well. I want us to have a meaningful discussion around it
>>> before we drop a big PR.
>> 
>> This has been a huge issue with our codebase since at least back when I
>> first encountered it five years ago. To date, while we have made progress
>> on this front, it's been nowhere near sufficient to mitigate the issues in
>> the codebase and allow for large, meaningful changes in smaller incremental
>> patches or commits. Having yet another discussion around this (there have
>> been many, many of them over the years) as a blocker for significant work
>> to go into the codebase is an unnecessary and dangerous blocker. Not to say
>> we shouldn't formalize a path to try and make incremental progress to
>> improve the situation, far from it, but blocking other progress on a
>> decade's worth of accumulated hygiene problems isn't going to make the
>> community focus on fixing those problems imo, it'll just turn away
>> contributions.
>> 
>> So let me second jd (and many others') opinion here: "it makes sense to get
>> it right the first time, rather than applying bandaids to 4.0 and rewriting
>> things for 4.next". And fwiw, asking people who have already done a huge
>> body of work to reformat that work into a series of commits or to break up
>> that work in a fashion that's more to the liking of people not involved in
>> either the writing of the patch or reviewing of it doesn't make much sense
>> to me. As I am neither an assignee nor reviewer on this contribution, I
>> leave it up to the parties involved to do things professionally and with a
>> high standard of quality. Admittedly, a large code change merging in like
>> this has implications for rebasing on anyone else's work that's in flight,
>> but be it one commit merged or 50, or be it one JIRA ticket or ten, the
>> end-result is the same; any large contribution in any format will ripple
>> outwards and require re-work from others in the community.
>> 
>> The one thing I *would* strongly argue for is performance benchmarking of
>> the new messaging code on a representative sample of different
>> general-purpose queries, LWT's, etc, preferably in a 3 node RF=3 cluster,
>> plus a healthy suite of jmh micro-benches (assuming they're not already in
>> the diff. If they are, disregard / sorry). From speaking with Aleksey
>> offline about this work, my understanding is that that's something they
>> plan on doing before putting a bow on things.
>> 
>> In the balance between "fear change because it destabilizes" and "go forth
>> blindly into that dark night, rewriting All The Things", I think the
>> Cassandra project's willingness to jettison the old and introduce the new
>> has served it well in keeping relevant as the years have gone by. I'd hate
>> to see that culture of progress get mired in a dogmatic adherence to
>> requirements on commit counts, lines of code allowed / expected on a given
>> patch set, or any other metrics that might stymie the professional needs of
>> some of the heaviest contributors to the project.
>> 
>> On Wed, Apr 10, 2019 at 5:03 PM Oleksandr Petrov wrote:
>> 
>>> Sorry to pick only a few points to address, but I think these are ones
>>> where I can contribute productively to the discussion.
>>> 
>>>> In principle, I agree with the technical improvements you
>>>> mention (backpressure / checksumming / etc). These things should be there.
>>>> Are they a hard requirement for 4.0?
>>> 
>>> One thing that comes to mind is protocol versioning and consistency. If
>>> changes adding checksumming and handshake do not make it to 4.0, we grow
>>> the upgrade matrix and have to put changes to the separate protocol
>>> version. I'm not sure how many other internode protocol changes we have
>>> planned for 4.next, but this is definitely something we should keep in
>>> mind.
>>> 
>>>> 2. We should not be measuring complexity in LoC with the exception that
>>>> all 20k lines *do need to be reviewed* (not just the important parts and
>>>> because code refactoring tools have bugs too) and more lines take more
>>>> time.
>>> 
>>> Everything should definitely be reviewed. But with different rigour: one
>>> thing is to review byte arithmetic and protocol formats and a completely
>>> different thing is to verify that Verb moved from one place

Re: Stabilising Internode Messaging in 4.0

2019-04-12 Thread Pavel Yaskevich
On Fri, Apr 12, 2019 at 10:15 AM Jordan West  wrote:

> I understand these non-technical discussions are not what everyone wants to
> focus on, but they are extremely pertinent to the stability of the project.
> What I would like to see before merging this in is below. They are all
> reasonable asks in my opinion that will still result in the patch being
> merged — only w/ even more confidence in its quality. More details on my
> thoughts behind it follow.
>
> 1. Additional third party/independent reviewers added and required before
> merge.
>
> 2. The patch should at a minimum be broken up (commit-wise, not ticket
> wise) so the newly added features (CRC/backpressure/virtual tables) can be
> reviewed independent of the large patch set and can be easily included or
> excluded based on community discussion or reviewer feedback — using the
> exception process for new features we have used in the past during this
> freeze (merging 13304 the day after freeze, the zstd changes, etc).
>
> 3. As Sankalp mentioned, and I believe is in progress, a test plan should
> be published and executed (at least part of it should be executed before
> the merge; it's possible some will happen post-merge, but this should be
> minimal).
>
> 4. A (design) doc should be published to make it easier for reviewers to
> approach the code.
>

I haven't actually looked at the code, but these seem like reasonable asks.

I'd expect mechanical changes to be split into separate commits, as well as
(at least) the categories (e.g. CRC and framing, backpressure) that
Benedict/Aleksey outlined. A design doc would be great, at least for
posterity's sake.
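
For what it's worth, carving those out needn't mean redoing the work. A
rough sketch of how the existing branch could be replayed as a series of
reviewable commits with stock git (the branch names below are made up):

    # start from trunk and pull in the full patch, unstaged
    git checkout -b 15066-split origin/trunk
    git merge --squash 15066-full   # hypothetical branch holding the whole patch
    git reset                       # leave everything as unstaged changes
    # stage and commit the mechanical moves/renames first
    git add -p
    git commit -m "Mechanical refactor: moves/renames, no behavioral change"
    # then each category outlined above becomes its own commit
    git add -p
    git commit -m "Add CRC framing to internode messages"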


> With the above said, I apologize if the 8099 comment was a parallel that
> was too close to home for some. I am sure they are not a direct comparison
> but the parallel I was trying to draw (and continue to) is that the
> development process that led to a project like 8099 continues to be used
> here. My focus for Cassandra is improving Quality and Stability of 4.0 (and
> ideally the 3.x series along the way) — especially in light of the recent
> status email I sent that included over 25 bugs found in the last 6 months.
>
> There is no question that the majority of this patch should go in — the bug
> fixes are necessary and we have no alternatives written. The question to
> me, besides increasing confidence in the patch, is can the authors make
> this better for reviewers by putting some more effort into the
> non-technical aspects of the patch and should the new features be included
> given the already risky changes proposed in this patch and the demand to
> review them?  The goal of the freeze was to reduce the time spent on new
> work so we can improve what existed. We do have a process for exceptions to
> that and if the community feels strongly about these features then we
> should follow that process — which would involve isolating the changes from
> the larger patch and having them be considered separately.
>
> Further, as someone who has reviewed 13304 and found a bug others didn’t, I
> don’t think having the code authors dictate the complexity or timeframe of
> the review makes sense. That's not to say I didn’t read the email as
> suggested.  I encourage you to consider that it's possible my experience
> informed my contrary point of view, and how that sort of denigrating and
> unneeded comment affects the community.
>
> Anyways, part of those improvements have to come from how we design and
> develop the database. Not just how we test and run it. Having worked on
> several large projects on multiple databases (Cassandra’s SASI*, Riak’s
> Cluster Metadata / Membership, Riak’s bucket types, and Riak’s ring
> resizing feature, among others) and for large companies (those projects I
> can’t talk about), I am sure it is possible to design and develop features
> with a better process than the one used here. It is certainly possible, and
> hugely beneficial, to break up code into smaller commits (Google also feels
> this way: https://arxiv.org/pdf/1702.01715.pdf) and it's not unreasonable to
> ask by any means. It should be a prerequisite. A patch like this requires
> more from the authors than simply writing the code. Holding ourselves to a
> higher standard will increase the quality of the database dramatically. The
> same way committing to real testing has done (I again refer to all of the
> bugs found during the freeze that were not found previously).
>
> Hopefully it's clear from the above that I am very supportive of getting the
> majority of these changes in. I think it would benefit the future of the
> project if we did that in a more deliberate way than how risky changes like
> this, late in the cycle, were handled in the past. We have an opportunity
> to change that here and it would benefit the project significantly.
> Cassandra’s willingness to jettison code has kept it relevant. Its process
> for doing so, however, has had negative effects on the database's brand — the
> deployment of the 3.x series was directly affected by presumptions (real or
> otherwise) of quality. We could take this as an opportunity to fix that and
> keep the nimble aspects of the database alive at the same time.

Re: Cassandra 4.0 Quality and Stability Update

2019-04-12 Thread Jordan West
Hi Dinesh,

Great question! Unfortunately I don’t have a great definition of what
“alpha” means in the Cassandra community, so it's hard for me to answer that
directly. However, I am of the opinion that we are not yet at the point of
being able to branch trunk — we are finding too many bugs at too quick a
pace still and have yet to make enough significant progress on the test
plan [1] previously linked. I do think it would be beneficial to cut an
official build (maybe after internode messaging settles down) as a preview
for the community and to make it easier for folks to run on dev/test
hardware. In the Riak community we call these “pre” builds (Riak 2.0.0preX)
and they were nothing more than a stable place on trunk released
periodically until we reached a point where we branched.

Regarding metrics, the first major step towards that was Benedict’s and
others' work (thanks, all!) to re-organize JIRA. We now have a better set of
inputs to automatically build reports around release quality metrics, etc.
We have yet to take this and turn it into JIRA reports but I am working
with Scott Andreas on it — I don’t have a timeframe just yet but I hope
soon. If you would like to help please let me know.

In the meantime, Scott and I have kept a list which is where the data I
used came from. We absolutely need to make this public and the efforts
mentioned above will accomplish that.

Jordan

[1]
https://cwiki.apache.org/confluence/display/CASSANDRA/4.0+Quality%3A+Components+and+Test+Plans

On Thu, Apr 11, 2019 at 4:21 PM Dinesh Joshi  wrote:

> Hey Jordan,
>
> Thanks for the update! Do you have a sense of where we are in terms of
> stability and where we need to be in order to cut an alpha? I also
> remember a discussion on measuring release quality [1]. Not sure where we
> landed on it. Any idea how we are doing on that front?
>
> Thanks,
>
> Dinesh
>
> [1]
> https://lists.apache.org/thread.html/3a444be1a3097c0c55d15268ccb0fe7aab83ef276b87bf55bf4f3bc2@%3Cdev.cassandra.apache.org%3E
>
> > On Apr 10, 2019, at 8:25 AM, Jordan West  wrote:
> >
> > In September, the community chose to freeze trunk to begin working on
> > Quality and Stability with the goal of releasing the most stable Cassandra
> > major in the project’s history. While lots of work has been ongoing and
> > folks could follow along with progress on JIRA, I thought it would be
> > useful to cover what has been accomplished so far, since I’ve spent a good
> > amount of time working with others on various testing projects.
> >
> > During this time we have made significant progress on improving the Quality
> > and Stability of Cassandra — not only Cassandra 4.0 but also the Cassandra
> > 3.x series and future Cassandra releases. Additionally, testing has
> > provided the opportunity for new community members and committers to
> > contribute. While not comprehensive, the community has found at least 25
> > bugs that can be classified as either Data Loss, Corruption, Incorrect
> > Response, Loss of Stability, Loss of Availability, Concurrency Issues,
> > Performance Issues, or Lack of Safety. These bugs have been found by a
> > variety of methodologies, including commonly used ones like unit testing
> > and canary deployments. However, the majority of the bugs have been found
> > or confirmed using new methodologies like the ones described in some
> > recent blog posts [1] [2].
> >
> > Additionally, the state of the test suites and test tooling has improved.
> > CASSANDRA-14806 [3] brought some much welcomed improvements to the circleci
> > workflow and made it easier for people to run (d)tests on supported
> > platforms (jdk8/11), and the work to get upgrade tests running found
> > several bugs including CASSANDRA-14958 [4].
> >
> > While we have made significant progress, there is still more to do before
> > we can be truly confident in a Cassandra 4.0 release. Some ongoing and
> > outstanding work includes:
> >
> > * Improving the state of the cqlsh tests [5]
> > * There is ongoing discussion on the new MessagingService [6] which will
> > require significant review and testing
> > * Additional upgrade testing for Cassandra 4.0 including additional
> > support for upgrade testing using in-jvm dtests [7]
> > * Work to increase coverage of important areas and new features in
> > Cassandra 4.0 [8]
> >
> > While the list above may seem short, the last item contains a long list of
> > important areas the community has previously discussed adding coverage to.
> > If you are looking for areas to contribute, this is a great starting point.
> > If there is a name listed next to an area you are interested in, I would
> > encourage you to reach out to them to discuss how you can help further
> > increase the community’s confidence in the Quality and Stability of
> > Cassandra.
> >
> > Below is an incomplete list of many of the severe bugs found during this
> > part of the release cycle. Thanks again to all of the community members
> > who contributed to 

Re: Stabilising Internode Messaging in 4.0

2019-04-12 Thread Benedict Elliott Smith
I would once again exhort everyone making these kinds of comment to actually 
read the code, and to comment on Jira.  Preferably with a justification by 
reference to the code for how or why it would improve the patch.

As far as a design document is concerned, it’s very unclear what is being 
requested.  We already had plans, as Jordan knows, to produce a wiki page for 
posterity, and a blog post closer to release.  However, I have never heard of 
this as a requirement for review, or for commit.  We have so far taken two 
members of the community through the patch over video chat, and would be more 
than happy to do the same for others.  So far nobody has had any difficulty 
getting to grips with its structure.

If the project wants to modify its normal process for putting a patch together, 
this is a whole different can of worms, and I am strongly -1.  I’m not sure 
what precedent we’re trying to set by imposing arbitrary constraints pre-commit 
for work that has already met the project’s inclusion criteria.


> On 12 Apr 2019, at 18:58, Pavel Yaskevich  wrote:
> 
> I haven't actually looked at the code





Re: Stabilising Internode Messaging in 4.0

2019-04-12 Thread Jordan West
Since there seems to be an assumption that I haven’t read the code, let me
clarify: I am working on making time to be a reviewer on this and I have
already spent a few hours with the patch before I sent any replies, likely
more than most who are replying here. Again, just because I disagree on
non-technical matters does not mean I haven’t considered the technical. I
am sharing what I think is necessary for the authors
to make review higher quality. I will not compromise my review standards on
a patch like this as I have said already. Telling me to review it to talk
more about it directly ignores my feedback and requires me to acquiesce all
of my concerns, which as I said I won’t do as a reviewer.

And yes I am arguing for changing how the Cassandra community approaches
large patches. In the same way the freeze changed how we approached major
releases and the decision to do so has been a net benefit as measured by
quality and stability. Existing community members have already chimed in in
support of things like better commit hygiene.

The past approaches haven’t prioritized quality and stability and it really
shows. What I and others here are suggesting has worked all over our
industry and is adopted by companies big (like Google, as I linked
previously) and small (like many startups I and others have worked for).
Everything we want to do: better testing, better review, better code, is
made easier with better design review, better discussion, and more
digestible patches among many of the other things suggested in this thread.

Jordan

On Fri, Apr 12, 2019 at 12:01 PM Benedict Elliott Smith wrote:

> I would once again exhort everyone making these kinds of comment to
> actually read the code, and to comment on Jira.  Preferably with a
> justification by reference to the code for how or why it would improve the
> patch.
>
> As far as a design document is concerned, it’s very unclear what is being
> requested.  We already had plans, as Jordan knows, to produce a wiki page
> for posterity, and a blog post closer to release.  However, I have never
> heard of this as a requirement for review, or for commit.  We have so far
> taken two members of the community through the patch over video chat, and
> would be more than happy to do the same for others.  So far nobody has had
> any difficulty getting to grips with its structure.
>
> If the project wants to modify its normal process for putting a patch
> together, this is a whole different can of worms, and I am strongly -1.
> I’m not sure what precedent we’re trying to set by imposing arbitrary
> constraints pre-commit for work that has already met the project’s
> inclusion criteria.
>
>
> > On 12 Apr 2019, at 18:58, Pavel Yaskevich  wrote:
> >
> > I haven't actually looked at the code
>
>
>
>


Re: Stabilising Internode Messaging in 4.0

2019-04-12 Thread Blake Eggleston
It seems like one of the main points of contention isn’t so much the 
content of the patch, but more about the amount of review this patch has/will 
receive relative to its perceived risk. If it’s the latter, then I think it 
would be more effective to explain why that’s the case, and what level of 
review would be more appropriate.

I’m personally +0  on requiring additional review. I feel like the 3 people 
involved so far have sufficient expertise, and trust them to be responsible, 
including soliciting additional reviews if they feel they’re needed.

If dev@ does collectively want more eyes on this, I’d suggest we solicit 
reviews from people who are very familiar with the messaging code, and let them 
decide what additional work and documentation they’d need to make a review 
manageable, if any. Everyone has their own review style, and there’s no need to 
ask for a bunch of additional work if it’s not needed.

> On Apr 12, 2019, at 12:46 PM, Jordan West  wrote:
> 
> Since there seems to be an assumption that I haven’t read the code, let me
> clarify: I am working on making time to be a reviewer on this and I have
> already spent a few hours with the patch before I sent any replies, likely
> more than most who are replying here. Again, just because I disagree on
> non-technical matters does not mean I haven’t considered the technical. I
> am sharing what I think is necessary for the authors
> to make review higher quality. I will not compromise my review standards on
> a patch like this as I have said already. Telling me to review it to talk
> more about it directly ignores my feedback and requires me to acquiesce all
> of my concerns, which as I said I won’t do as a reviewer.
> 
> And yes I am arguing for changing how the Cassandra community approaches
> large patches. In the same way the freeze changed how we approached major
> releases and the decision to do so has been a net benefit as measured by
> quality and stability. Existing community members have already chimed in in
> support of things like better commit hygiene.
> 
> The past approaches haven’t prioritized quality and stability and it really
> shows. What I and others here are suggesting has worked all over our
> industry and is adopted by companies big (like Google, as I linked
> previously) and small (like many startups I and others have worked for).
> Everything we want to do: better testing, better review, better code, is
> made easier with better design review, better discussion, and more
> digestible patches among many of the other things suggested in this thread.
> 
> Jordan
> 
> On Fri, Apr 12, 2019 at 12:01 PM Benedict Elliott Smith 
> wrote:
> 
>> I would once again exhort everyone making these kinds of comments to
>> actually read the code, and to comment on Jira.  Preferably with a
>> justification by reference to the code for how or why it would improve the
>> patch.
>> 
>> As far as a design document is concerned, it’s very unclear what is being
>> requested.  We already had plans, as Jordan knows, to produce a wiki page
>> for posterity, and a blog post closer to release.  However, I have never
>> heard of this as a requirement for review, or for commit.  We have so far
>> taken two members of the community through the patch over video chat, and
>> would be more than happy to do the same for others.  So far nobody has had
>> any difficulty getting to grips with its structure.
>> 
>> If the project wants to modify its normal process for putting a patch
>> together, this is a whole different can of worms, and I am strongly -1.
>> I’m not sure what precedent we’re trying to set by imposing arbitrary
>> constraints pre-commit for work that has already met the project’s
>> inclusion criteria.
>> 
>> 
>>> On 12 Apr 2019, at 18:58, Pavel Yaskevich  wrote:
>>> 
>>> I haven't actually looked at the code
>> 
>> 
>> 
>> 





Re: Stabilising Internode Messaging in 4.0

2019-04-12 Thread Benedict Elliott Smith
Can you start a new thread to build consensus on your proposals for modifying 
the commit process?

I do not share your views, as already laid out in my first email.  The 
community makes these decisions through building consensus, and potentially a 
vote of the PMC.  This scope of change requires its own thread of discussion.



> On 12 Apr 2019, at 20:46, Jordan West  wrote:
> 
> Since there seems to be an assumption that I haven’t read the code, let me
> clarify: I am working on making time to be a reviewer on this, and I have
> already spent a few hours with the patch before I sent any replies, likely
> more than most who are replying here. Again, the fact that I disagree on
> non-technical matters does not mean I haven’t considered the technical. I
> am sharing what I think is necessary for the authors to make review higher
> quality. I will not compromise my review standards on a patch like this,
> as I have said already. Telling me to review it in order to talk more
> about it directly ignores my feedback and requires me to set aside all of
> my concerns, which, as I said, I won’t do as a reviewer.
> 
> And yes, I am arguing for changing how the Cassandra community approaches
> large patches, in the same way the freeze changed how we approached major
> releases; that decision has been a net benefit as measured by quality and
> stability. Existing community members have already chimed in in support of
> things like better commit hygiene.
> 
> The past approaches haven’t prioritized quality and stability, and it
> really shows. What I and others here are suggesting has worked all over
> our industry and has been adopted by companies big (like Google, as I
> linked previously) and small (like many startups I and others have worked
> for). Everything we want to do (better testing, better review, better
> code) is made easier with better design review, better discussion, and
> more digestible patches, among the many other things suggested in this
> thread.
> 
> Jordan
> 
> On Fri, Apr 12, 2019 at 12:01 PM Benedict Elliott Smith 
> wrote:
> 
>> I would once again exhort everyone making these kinds of comments to
>> actually read the code, and to comment on Jira.  Preferably with a
>> justification by reference to the code for how or why it would improve the
>> patch.
>> 
>> As far as a design document is concerned, it’s very unclear what is being
>> requested.  We already had plans, as Jordan knows, to produce a wiki page
>> for posterity, and a blog post closer to release.  However, I have never
>> heard of this as a requirement for review, or for commit.  We have so far
>> taken two members of the community through the patch over video chat, and
>> would be more than happy to do the same for others.  So far nobody has had
>> any difficulty getting to grips with its structure.
>> 
>> If the project wants to modify its normal process for putting a patch
>> together, this is a whole different can of worms, and I am strongly -1.
>> I’m not sure what precedent we’re trying to set by imposing arbitrary
>> constraints pre-commit for work that has already met the project’s
>> inclusion criteria.
>> 
>> 
>>> On 12 Apr 2019, at 18:58, Pavel Yaskevich  wrote:
>>> 
>>> I haven't actually looked at the code
>> 
>> 
>> 
>> 





Re: Stabilising Internode Messaging in 4.0

2019-04-12 Thread Pavel Yaskevich
It seems to me that the cornerstone here is the development process.
If the work and review had been done openly (e.g. on JIRA or GitHub), we
wouldn't be having this post-factum conversation, because all of the
progress would have been visible, and it would then make sense to just
squash before committing if so preferred.

It's indeed not really bisect-friendly to drop squashed changes into the
repository, but I've been guilty of that myself with SASI for a number
of reasons, so I can't blame the authors for this without sounding
hypocritical.

As I mentioned before, it would be great if we could establish a process
for how development is supposed to happen, like other projects do. But
that, like most if not all of the other concerns, could be discussed
separately.


On Fri, Apr 12, 2019 at 1:25 PM Benedict Elliott Smith 
wrote:

> Can you start a new thread to build consensus on your proposals for
> modifying the commit process?
>
> I do not share your views, as already laid out in my first email.  The
> community makes these decisions through building consensus, and potentially
> a vote of the PMC.  This scope of change requires its own thread of
> discussion.
>
>
>
> > On 12 Apr 2019, at 20:46, Jordan West  wrote:
> >
> > Since there seems to be an assumption that I haven’t read the code, let
> > me clarify: I am working on making time to be a reviewer on this, and I
> > have already spent a few hours with the patch before I sent any replies,
> > likely more than most who are replying here. Again, the fact that I
> > disagree on non-technical matters does not mean I haven’t considered the
> > technical. I am sharing what I think is necessary for the authors to
> > make review higher quality. I will not compromise my review standards on
> > a patch like this, as I have said already. Telling me to review it in
> > order to talk more about it directly ignores my feedback and requires me
> > to set aside all of my concerns, which, as I said, I won’t do as a
> > reviewer.
> >
> > And yes, I am arguing for changing how the Cassandra community
> > approaches large patches, in the same way the freeze changed how we
> > approached major releases; that decision has been a net benefit as
> > measured by quality and stability. Existing community members have
> > already chimed in in support of things like better commit hygiene.
> >
> > The past approaches haven’t prioritized quality and stability, and it
> > really shows. What I and others here are suggesting has worked all over
> > our industry and has been adopted by companies big (like Google, as I
> > linked previously) and small (like many startups I and others have
> > worked for). Everything we want to do (better testing, better review,
> > better code) is made easier with better design review, better
> > discussion, and more digestible patches, among the many other things
> > suggested in this thread.
> >
> > Jordan
> >
> > On Fri, Apr 12, 2019 at 12:01 PM Benedict Elliott Smith <
> bened...@apache.org>
> > wrote:
> >
> >> I would once again exhort everyone making these kinds of comments to
> >> actually read the code, and to comment on Jira.  Preferably with a
> >> justification by reference to the code for how or why it would improve
> the
> >> patch.
> >>
> >> As far as a design document is concerned, it’s very unclear what is
> being
> >> requested.  We already had plans, as Jordan knows, to produce a wiki
> page
> >> for posterity, and a blog post closer to release.  However, I have never
> >> heard of this as a requirement for review, or for commit.  We have so
> far
> >> taken two members of the community through the patch over video chat,
> and
> >> would be more than happy to do the same for others.  So far nobody has
> had
> >> any difficulty getting to grips with its structure.
> >>
> >> If the project wants to modify its normal process for putting a patch
> >> together, this is a whole different can of worms, and I am strongly -1.
> >> I’m not sure what precedent we’re trying to set by imposing arbitrary
> >> constraints pre-commit for work that has already met the project’s
> >> inclusion criteria.
> >>
> >>
> >>> On 12 Apr 2019, at 18:58, Pavel Yaskevich  wrote:
> >>>
> >>> I haven't actually looked at the code
> >>
> >>
> >>
> >>
>
>
>
>


Re: TLP tools for stress testing and building test clusters in AWS

2019-04-12 Thread Jon Haddad
I'd be more than happy to hop on a call next week to give you both
(and anyone else interested) a tour of our dev tools.  Maybe something
early morning on my end, which should be your evening, could work?

I can set up a Zoom conference to get everyone acquainted.  We can
record and post it for any who can't make it.

I'm thinking Tuesday, Wednesday, or Thursday morning, 9AM Pacific (5pm
London)?  If anyone's interested, please reply with what dates work.
I'll be sure to post the details back here with the zoom link in case
anyone wants to join that didn't get a chance to reply, as well as a
link to the recorded call.

Jon

On Fri, Apr 12, 2019 at 10:41 AM Benedict Elliott Smith
 wrote:
>
> +1
>
> I’m also just as excited to see some standardised workloads and a test bed.  At 
> the moment we’re benefiting from some large contributors doing their own 
> proprietary performance testing, which is super valuable and something we’ve 
> lacked before.  But I’m also keen to see some more representative workloads 
> that are reproducible by anybody in the community take shape.
>
>
> > On 12 Apr 2019, at 18:09, Aleksey Yeshchenko  
> > wrote:
> >
> > Hey Jon,
> >
> > This sounds exciting and pretty useful, thanks.
> >
> > Looking forward to using tlp-stress for validating 15066 performance.
> >
> > We should touch base some time next week to pick a comprehensive set of 
> > workloads and versions, perhaps?
> >
> >
> >> On 12 Apr 2019, at 16:34, Jon Haddad  wrote:
> >>
> >> I don't want to derail the discussion about Stabilizing Internode
> >> Messaging, so I'm starting this as a separate thread.  There was a
> >> comment that Josh made [1] about doing performance testing with real
> >> clusters as well as a lot of microbenchmarks, and I'm 100% in support
> >> of this.  We've been working on some tooling at TLP for the last
> >> several months to make this a lot easier.  One of the goals has been
> >> to help improve the 4.0 testing process.
> >>
> >> The first tool we have is tlp-stress [2].  It's designed with a "get
> >> started in 5 minutes" mindset.  My goal was to ship a stress tool that
> >> comes with real workloads out of the box and can be easily tweaked,
> >> similar to how fio allows you to design a disk workload and tweak it
> >> with parameters.  Included are workloads that stress LWTs (two
> >> different types), materialized views, counters, time series, and
> >> key-value access patterns.  Each workload can be modified easily to
> >> change compaction strategies, concurrent operations, and the number of
> >> partitions.  We can run workloads for a set number of iterations or a
> >> custom duration.  We've used this *extensively* at TLP to help our
> >> customers, and most of our blog posts that discuss performance use it
> >> as well.  It exports data to a CSV format and automatically sets up
> >> Prometheus for metrics collection / aggregation.  As an example, we
> >> were able to determine that the compression length set on the paxos
> >> tables imposes a significant overhead when using the Locking LWT
> >> workload, which simulates locking and unlocking of rows.  See
> >> CASSANDRA-15080 for details.
> >>
> >> We have documentation [3] on the TLP website.
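
For a sense of the workflow, a minimal tlp-stress invocation might look like
the following.  This is a sketch inferred from the description above; the
exact subcommand and flag names are assumptions and may differ from the
shipped tool, so treat the documentation [3] as authoritative.

    # Run the key-value workload for one hour across 10M partitions,
    # with half the operations as reads (flag names assumed, not verified).
    tlp-stress run KeyValue --duration 1h --partitions 10M --readrate 0.5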
> >>
> >> The second tool we've been working on is tlp-cluster [4].  This tool
> >> is designed to help provision AWS instances for the purposes of
> >> testing.  To be clear, I don't expect, or want, this tool to be used
> >> for production environments.  It's designed to assist with the
> >> Cassandra build process by generating deb packages or re-using the
> >> ones that have already been uploaded.  Here's a short list of the
> >> things you'll care about:
> >>
> >> 1. Create instances in AWS for Cassandra using any instance size and
> >> number of nodes.  Also create tlp-stress instances and a box for
> >> monitoring
> >> 2. Use any available build of Cassandra, with a quick option to change
> >> YAML config.  For example: tlp-cluster use 3.11.4 -c
> >> concurrent_writes:256
> >> 3. Do custom builds just by pointing to a local Cassandra git repo.
> >> They can be used the same way as #2.
> >> 4. tlp-stress is automatically installed on the stress box.
> >> 5. Everything's installed with pure bash.  I considered something more
> >> complex, but since this is for development only, it turns out the
> >> simplest tool possible works well, and it means it's easily
> >> configurable.  Just drop in your own bash script starting with a
> >> number, in a XX_script_name.sh format, and it gets run (see the
> >> sketch after this list).
> >> 6. The monitoring box is running Prometheus.  It auto scrapes
> >> Cassandra using the Instaclustr metrics library.
> >> 7. Grafana is also installed automatically.  There are a couple of
> >> sample graphs there now.  We plan on having better default graphs soon.
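
As an illustration of item 5, a provisioning hook could be an ordinary bash
file dropped in alongside the others.  The file name and contents below are
hypothetical, shown only to illustrate the XX_script_name.sh convention:

    #!/usr/bin/env bash
    # 25_install_extra_tools.sh -- hypothetical custom provisioning step;
    # numbered scripts are picked up and run in numeric order.
    set -euo pipefail

    # Install a couple of extra debugging utilities on the node.
    sudo apt-get update
    sudo apt-get install -y htop sysstat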
> >>
> >> For the moment it installs Java 8 only, but that should be easily
> >> fixable to use Java 11 to test ZGC (it's on my radar).
> >>
> >> Documentation for tlp-cluster is here [5].
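
Pulling the pieces together, an end-to-end session might look roughly like
the following.  The subcommand names are assumptions extrapolated from the
example in item 2 above, not a definitive reference; see the documentation
[5] for the real workflow.

    # Provision instances, pick a build, install it, and start the cluster
    # (subcommand names assumed, not verified against the tool).
    tlp-cluster init
    tlp-cluster up
    tlp-cluster use 3.11.4 -c concurrent_writes:256
    tlp-cluster install
    tlp-cluster start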
> >>
> >> There's still some 

Re: Stabilising Internode Messaging in 4.0

2019-04-12 Thread Nate McCall
As someone who has been here a (very) long time and worked on C* in
production environments back to version 0.4, I'll be frank: this large
patch, taken by itself, scares the shit out of me. In a complex system,
any large change will have side effects that are impossible to
anticipate. I have seen this hold true too many times.

That said, I think we all agree that internode has been a source of
warts since Facebook spun this out 10+ years ago, and that we are all
tired of applying band-aids.

As has been discussed elsewhere in this thread (and this is the super
crucial point for me), we also have a substantially better testing
story, internally and externally, coming together than at any point in
the project's past.

This next part is partially selfish, but I want to be 100% clear that it
is in the immediate interests of the project's future:
I am getting on stage in about a month to keynote the first
Cassandra-focused event with any notable attendance we have had for a
long time. We are then all going out to Vegas in Sept. to discuss the
future of our project, and ideally to have some cool use cases to show a
bunch of users.

For both of these, we need a story to tell. It needs to be clear and
cohesive. And I think it's super important to get in front of these
people and have part of this story be "we took three years because we
didn't compromise on quality." If we don't have our shit together here,
I think we will start losing users at a much faster pace, and we
seriously risk becoming "that thing you can run only if you are a
large company and can put a bunch of people on it who know it
intimately." Whether that is the case or not, *it will* be the
perception. We are just running out of time.

So back to this patch: on the surface, it fixes a lot of stuff and
puts us on the right track for the future. I'm willing to set aside
the number of times I've been burned over the past decade, because I
think we are, as a whole community, in a much better position to
find, report and fix the issues this patch will introduce, and to do
so much faster than we ever have.

I do want to end this with one more point, because it needs to be
called out: a couple of people (even if I know them personally,
consider them friends, and count them both among the best engineers
I've ever met) going off in a room and producing something in
isolation is more or less a giant "f*k you" to the open source process
and our community as a whole. Internode is a particularly complex,
messy, baggage-ridden component, and there is an argument to be made
that uninterrupted concentration was the only way to achieve this, but
it must be understood that actions like this are perceived as
toe-stepping and a devaluation of opinions, and are generally not
conducive to holding a community together. Again, I doubt this was the
intention, but it is the perception. Please let's avoid this in the
future.

In sum, +1. I wish this process were smoother, but we're running out of time.

-Nate
