> It'd be great to expand this, but it's been somewhat difficult to do, since last time a bootstrap test was attempted, it immediately uncovered enough issues to keep us busy fixing them for quite some time. Maybe it's about time to try that again.
I'm going to go with a "yes please". :)

On Wed, Nov 3, 2021 at 9:27 AM Oleksandr Petrov <oleksandr.pet...@gmail.com> wrote:

> I'll merge 16262 and the Harry blog post that accompanies it shortly. Having 16262 merged will significantly reduce the amount of resistance one has to overcome in order to write a fuzz test. But this, of course, only covers short/small/unit-test-like tests.
>
> For longer-running tests, I guess for now we will have to rely on folks (hopefully) running long fuzz tests and reporting issues. But eventually it'd be great to have enough automation around it that anyone could do that, and where test results are public.
>
> In regard to long-running tests, currently with Harry we can run three kinds of long-running tests:
> 1. Stress-like concurrent write workload, followed by periods of quiescence and then validation
> 2. Writes with injected faults, followed by repair and validation
> 3. Stress-like concurrent read/write workload with fault injection, without validation, for finding rare edge conditions / triggering possible exceptions
>
> This means that quorum read and write paths (for all kinds of schemas, including all possible kinds of read and write queries), compactions, repairs, read-repairs and hints are covered fairly well. However, things like bootstrap and other kinds of range movements aren't. It'd be great to expand this, but it's been somewhat difficult to do, since last time a bootstrap test was attempted, it immediately uncovered enough issues to keep us busy fixing them for quite some time. Maybe it's about time to try that again.
>
> For short tests, you can think of Harry as a tool to save you time and allow focusing on higher-level test meaning rather than creating a schema and coming up with specific values to insert/select.
>
> Thanks
> --Alex
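As a rough illustration of test kind 1 above (concurrent writes, quiescence, then validation), here is a toy, self-contained sketch of the pattern. It is not Harry's actual API: the deterministic valueFor generator and the ConcurrentHashMap standing in for the cluster are assumptions made purely for illustration; Harry's real model and generators are far richer.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class WriteQuiesceValidate
    {
        // Stand-in for the cluster under test (illustrative assumption).
        static final ConcurrentMap<Long, Long> cluster = new ConcurrentHashMap<>();
        static final int THREADS = 8;
        static final long OPS_PER_THREAD = 100_000;

        // Deterministic generator: the expected value of any key can be
        // recomputed later without recording each individual write.
        static long valueFor(long key)
        {
            return key * 31 + 7;
        }

        public static void main(String[] args) throws Exception
        {
            // Phase 1: stress-like concurrent write workload.
            ExecutorService writers = Executors.newFixedThreadPool(THREADS);
            for (int t = 0; t < THREADS; t++)
            {
                final long base = t * OPS_PER_THREAD; // each thread owns a key range
                writers.submit(() -> {
                    for (long i = 0; i < OPS_PER_THREAD; i++)
                        cluster.put(base + i, valueFor(base + i));
                });
            }
            writers.shutdown();
            writers.awaitTermination(1, TimeUnit.HOURS);

            // Phase 2: quiescence. In a real cluster this is where compaction,
            // hint delivery, repair and the like would be allowed to settle.

            // Phase 3: validation - recompute the expectation for every key.
            for (long key = 0; key < THREADS * OPS_PER_THREAD; key++)
                if (!Long.valueOf(valueFor(key)).equals(cluster.get(key)))
                    throw new AssertionError("mismatch at key " + key);
            System.out.println("validated " + cluster.size() + " keys");
        }
    }

The point of phase 3 is that validation needs no oracle beyond the generator itself, which is what makes this style of test cheap to extend to new schemas and query shapes.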
> On Tue, Nov 2, 2021 at 5:30 PM Ekaterina Dimitrova <e.dimitr...@gmail.com> wrote:
>
> > Did I hear my name?
> > Sorry Josh, you are wrong :-) 2 out of 30 in two months were real bugs discovered by flaky tests, and one of them was very hard to hit. So 6-7%. I think the report I sent back then didn't come through, so the topic was cleared up in a follow-up mail by Benjamin; with a lot of sweat, but we kept to the promised 4.0 standard.
> >
> > Now back to this topic:
> > - Green CI without enough test coverage is, unfortunately, nothing more than green CI to me. I know this is an elephant, but I won't sleep well tonight if I don't mention it.
> > - I believe the looping of tests mentioned by Berenguer can help to verify that no new weird flakiness is introduced by newly added tests. And of course it helps a lot while fixing flaky tests; I think that's clear.
> >
> > > I think that it would be great if each such test (or test group) was guaranteed to have a ticket and some preliminary analysis was done to confirm it is just a test problem before releasing the new version
> >
> > Probably not a bad idea - preliminary analysis. But we need to get into the cadence of regularly checking our CI; divide and conquer on a regular basis between all of us. Not to mention that it is much easier to follow up on recently introduced issues with the people who worked on them than to try to find out what happened a year ago in a rush before a release. I agree it is not about the number, but about what stands behind it.
> >
> > Requiring all tests to run before every merge - we can easily add this in Circle, but there are many people who don't have access to high resources, so again they won't be able to run absolutely everything. In the end, everything is up to the diligence of the reviewers/committers. Plus, the official CI is Jenkins, and we know there are different infra-related failures in the different CIs. Not an easy topic, indeed. I support running all tests, just keeping in mind all the related issues/complications.
> >
> > I would say that in my mind upgrade tests are particularly important to be green before a release, too.
> >
> > Seems to me we have the tools, but now it is time to organize the rhythm in an efficient manner.
> >
> > Best regards,
> > Ekaterina
> >
> > On Tue, 2 Nov 2021 at 11:06, Joshua McKenzie <jmcken...@apache.org> wrote:
> >
> > > To your point, Jacek, I believe in the run-up to 4.0 Ekaterina did some analysis, and something like 18% (correct me if I'm wrong here) of the test failures we were considering "flaky tests" were actual product defects in the database. With that in mind, we should be uncomfortable cutting a release if we have 6 test failures, since there's every likelihood one of them is a surfaced bug.
> > >
> > > > ensuring our best practices are followed for every merge
> > >
> > > I totally agree, but I also don't think we have this codified (unless I'm just completely missing something - very possible! ;)). It seems like we have different circle configs, different sets of jobs being run, Harry / Hunter (maybe?) / ?? run on some but not all commits and/or all branches, and manual performance testing on specific releases, but nothing surfaced formally to the project as a reproducible suite like we used to have years ago (primitive though it was at the time in what it covered).
> > >
> > > If we *don't* have this clarified right now, I think there's significant value in enumerating and at least documenting what our agreed-upon best practices are, so we can start holding ourselves and each other accountable to that bar. Given some of the incredible but sweeping work coming down the pike, this strikes me as a thing we need to be proactive and vigilant about so as not to regress.
> > >
> > > ~Josh
> > >
> > > On Tue, Nov 2, 2021 at 3:49 AM Jacek Lewandowski <lewandowski.ja...@gmail.com> wrote:
> > >
> > > > > we already have a way to confirm flakiness on circle by running the test repeatedly N times. Like 100 or 500. That has proven to work very well so far, at least for me. #collaborating #justfyi
> > > >
> > > > It does not prove that it is test flakiness. It can still be a bug in the code which occurs intermittently under some rare conditions.
> > > >
> > > > - - -- --- ----- -------- -------------
> > > > Jacek Lewandowski
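The repeat-N-times approach discussed in this exchange can also be reproduced locally with a plain JUnit 4 rule, roughly as below. RepeatRule is a name invented for this sketch, not an existing project utility, and - as Jacek notes - repetition only demonstrates intermittency; it cannot tell you whether the fault is in the test or in the production code.

    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.rules.TestRule;
    import org.junit.runner.Description;
    import org.junit.runners.model.Statement;

    public class SuspectedFlakyTest
    {
        // Repeats each test body N times; the first failure stops the loop.
        static class RepeatRule implements TestRule
        {
            private final int times;

            RepeatRule(int times)
            {
                this.times = times;
            }

            @Override
            public Statement apply(Statement base, Description description)
            {
                return new Statement()
                {
                    @Override
                    public void evaluate() throws Throwable
                    {
                        for (int i = 0; i < times; i++)
                            base.evaluate(); // a flaky failure surfaces as a normal test failure
                    }
                };
            }
        }

        @Rule
        public final RepeatRule repeat = new RepeatRule(500); // "Like 100 or 500"

        @Test
        public void suspectedFlakyBehaviour()
        {
            // the assertion under suspicion goes here
        }
    }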
> > > > On Tue, Nov 2, 2021 at 7:46 AM Berenguer Blasi <berenguerbl...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > we already have a way to confirm flakiness on circle by running the test repeatedly N times. Like 100 or 500. That has proven to work very well so far, at least for me. #collaborating #justfyi
> > > > >
> > > > > On the 60+ failures it is not as bad as it looks. Let me explain. I have been tracking failures in 4.0 and trunk daily; it's grown into a habit of mine after the 4.0 push. And 4.0 and trunk were hovering solidly around <10 failures (you can check the Jenkins CI graphs). The random bisect or fix was needed, leaving behind 3 or 4 tests that have already defeated 2 or 3 committers - so, the really tough guys. I am reasonably convinced that once the fix for the 60+ failures merges, we'll be back to the <10 failures with relatively little effort.
> > > > >
> > > > > So we're just in the middle of a 'fix', but overall we shouldn't be as bad as it looks now, as we've been quite good at keeping CI green-ish, imo.
> > > > >
> > > > > Also +1 to releasable branches, which - whatever we settle on it meaning - is not a wall of failures, because of the reasons explained, like the hidden costs etc.
> > > > >
> > > > > My 2cts.
> > > > >
> > > > > On 2/11/21 6:07, Jacek Lewandowski wrote:
> > > > > > > I don't think this means guaranteeing there are no failing tests (though ideally this would also happen), but rather ensuring our best practices are followed for every merge. 4.0 took so long to release because of the amount of hidden work that was created by merging work that didn't meet the standard for release.
> > > > > >
> > > > > > Tests are sometimes considered flaky because they fail intermittently, but that may not be down to an insufficiently consistent test implementation; it can reveal a real problem in the production code. I have seen that in various codebases, and I think it would be great if each such test (or test group) was guaranteed to have a ticket, and some preliminary analysis was done to confirm it is just a test problem, before releasing the new version.
> > > > > >
> > > > > > > Historically we have also had significant pressure to backport features to earlier versions due to the cost and risk of upgrading. If we maintain broader version compatibility for upgrade, and reduce the risk of adopting newer versions, then this pressure is also reduced significantly. Though perhaps we will stick to our guns here anyway, as there seems to be renewed pressure to limit work in GA releases to bug fixes exclusively. It remains to be seen if this holds.
> > > > > >
> > > > > > Are there any precise requirements for supported upgrade and downgrade paths?
> > > > > >
> > > > > > Thanks
> > > > > > - - -- --- ----- -------- -------------
> > > > > > Jacek Lewandowski
> > > > > >
> > > > > > On Sat, Oct 30, 2021 at 4:07 PM bened...@apache.org <bened...@apache.org> wrote:
> > > > > >
> > > > > > > > How do we define what "releasable trunk" means?
> > > > > > >
> > > > > > > For me, the major criterion is ensuring that work is not merged that is known to require follow-up work, or could reasonably have been known to require follow-up work if better QA practices had been followed.
> > > > > > >
> > > > > > > So, a big part of this is ensuring we continue to exceed our targets for improved QA.
> > > > > > > For me, this means trying to weave tools like Harry and the Simulator into our development workflow early on, but we'll see how well these tools gain broader adoption. This also means a focus, in general, on the possible negative effects of a change.
> > > > > > >
> > > > > > > I think we could do with producing guidance documentation for how to approach QA, where we can record our best practices and evolve them as we discover flaws or pitfalls, either for ergonomics or for bug discovery.
> > > > > > >
> > > > > > > > What are the benefits of having a releasable trunk as defined here?
> > > > > > >
> > > > > > > If we want to have any hope of meeting reasonable release cadences _and_ the high project quality we expect today, then I think a ~shippable trunk policy is an absolute necessity.
> > > > > > >
> > > > > > > I don't think this means guaranteeing there are no failing tests (though ideally this would also happen), but rather ensuring our best practices are followed for every merge. 4.0 took so long to release because of the amount of hidden work that was created by merging work that didn't meet the standard for release.
> > > > > > >
> > > > > > > Historically we have also had significant pressure to backport features to earlier versions due to the cost and risk of upgrading. If we maintain broader version compatibility for upgrade, and reduce the risk of adopting newer versions, then this pressure is also reduced significantly. Though perhaps we will stick to our guns here anyway, as there seems to be renewed pressure to limit work in GA releases to bug fixes exclusively. It remains to be seen if this holds.
> > > > > > >
> > > > > > > > What are the costs?
> > > > > > >
> > > > > > > I think the costs are quite low, perhaps even negative. Hidden work produced by merges that break things can be much more costly than getting the work right the first time, as attribution is much more challenging.
> > > > > > >
> > > > > > > One cost that is created, however, is for version compatibility, as we cannot say "well, this is a minor version bump so we don't need to support downgrade". But I think we should be investing in this anyway for operator simplicity and confidence, so I actually see this as a benefit as well.
> > > > > > >
> > > > > > > > Full disclosure: running face-first into 60+ failing tests on trunk
> > > > > > >
> > > > > > > I have to apologise here. CircleCI did not uncover these problems, apparently due to some way it resolves dependencies, and so I am responsible for a significant number of these and have been quite sick since.
> > > > > > >
> > > > > > > I think a push to eliminate flaky tests will probably help here in future, though, and perhaps the project needs to have some (low) threshold of flaky or failing tests at which point we block merges to force a correction.
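A merge-blocking threshold like the one suggested above could be enforced by a gate as small as the sketch below, which sums failure counts from JUnit XML reports and exits non-zero over budget. The report directory default, the zero-failure default, and the assumption that each report's root <testsuite> element carries "failures" and "errors" attributes (as Ant-style reports do) are all assumptions of this sketch, not the project's actual tooling.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Element;

    public class FailureGate
    {
        // Reads an integer attribute, treating a missing attribute as zero.
        static int count(Element suite, String attr)
        {
            String v = suite.getAttribute(attr);
            return v.isEmpty() ? 0 : Integer.parseInt(v);
        }

        public static void main(String[] args) throws Exception
        {
            Path reports = Paths.get(args.length > 0 ? args[0] : "build/test/output"); // assumed location
            int threshold = args.length > 1 ? Integer.parseInt(args[1]) : 0;

            int failing = 0;
            try (Stream<Path> files = Files.walk(reports))
            {
                for (Path p : (Iterable<Path>) files.filter(f -> f.toString().endsWith(".xml"))::iterator)
                {
                    // Ant-style JUnit reports summarise counts on the root <testsuite>.
                    Element suite = DocumentBuilderFactory.newInstance()
                                                          .newDocumentBuilder()
                                                          .parse(p.toFile())
                                                          .getDocumentElement();
                    failing += count(suite, "failures") + count(suite, "errors");
                }
            }
            System.out.println(failing + " failing tests, threshold " + threshold);
            if (failing > threshold)
                System.exit(1); // a non-zero exit is what would block the merge in CI
        }
    }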
> > > > > > >
> > > > > > > From: Joshua McKenzie <jmcken...@apache.org>
> > > > > > > Date: Saturday, 30 October 2021 at 14:00
> > > > > > > To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> > > > > > > Subject: [DISCUSS] Releasable trunk and quality
> > > > > > >
> > > > > > > We as a project have gone back and forth on the topic of quality and the notion of a releasable trunk for quite a few years. If people are interested, I'd like to rekindle this discussion a bit and see if we're happy with where we are as a project, or if we think there are steps we should take to change the quality bar going forward. The following questions have been rattling around for me for a while:
> > > > > > >
> > > > > > > 1. How do we define what "releasable trunk" means? All reviewed by M committers? Passing N% of tests? Passing all tests plus some other metrics (manual testing, raising the number of reviewers, test coverage, usage in dev or QA environments, etc.)? Something else entirely?
> > > > > > >
> > > > > > > 2. With a definition settled upon in #1, what steps, if any, do we need to take to get from where we are to having *and keeping* that releasable trunk? Anything to codify there?
> > > > > > >
> > > > > > > 3. What are the benefits of having a releasable trunk as defined here? What are the costs? Is it worth pursuing? What are the alternatives (for instance: a freeze before a release plus a stabilization focus by the community, i.e. the 4.0 push, or the tock in tick-tock)?
> > > > > > >
> > > > > > > Given the large volumes of work coming down the pike with CEPs, this seems like a good time to at least check in on this topic as a community.
> > > > > > >
> > > > > > > Full disclosure: running face-first into 60+ failing tests on trunk when going through the commit process for denylisting this week brought this topic back up for me (reminds me of when I went to merge CDC back in 3.6 and those test failures riled me up... I sense a pattern ;))
> > > > > > >
> > > > > > > Looking forward to hearing what people think.
> > > > > > >
> > > > > > > ~Josh
>
> --
> alex p