I'll merge 16262 and the Harry blog post that accompanies it shortly. Having 16262 merged will significantly reduce the friction involved in writing a fuzz test. But this, of course, only covers short/small/unit-test-like tests.
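To make that concrete, here is a rough sketch of the shape such a short test takes. This is deliberately not Harry's actual API - just plain JUnit with a seeded generator and a made-up RoundTripFuzzTest - to illustrate the structure (generate values, apply them, validate against a model) that Harry takes care of for you:

import org.junit.Test;

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

import static org.junit.Assert.assertEquals;

/**
 * Illustrative sketch only: a seeded, generator-driven round-trip test.
 * Harry generates the schema, values and queries for you; this just shows
 * the general shape of a short fuzz test.
 */
public class RoundTripFuzzTest
{
    @Test
    public void randomWritesAreReadable()
    {
        long seed = 1L;                            // fixed seed => reproducible failures
        Random random = new Random(seed);
        Map<Long, Long> model = new HashMap<>();   // model of the expected state
        Map<Long, Long> sut = new HashMap<>();     // stand-in for the system under test

        for (int i = 0; i < 10_000; i++)
        {
            long pk = random.nextLong() % 100;     // small key space => plenty of overwrites
            long value = random.nextLong();
            model.put(pk, value);                  // record what we expect to read back
            sut.put(pk, value);                    // in a real test this would be an INSERT
        }

        // Validation: everything the model knows about must read back identically.
        for (Map.Entry<Long, Long> entry : model.entrySet())
            assertEquals("seed=" + seed, entry.getValue(), sut.get(entry.getKey()));
    }
}

With Harry, the generation and the model checking are what the tool does for you, so the test body shrinks to picking a seed and deciding what to validate.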
For longer-running tests, I guess for now we will have to rely on folks (hopefully) running long fuzz tests and reporting issues. But eventually it'd be great to have enough automation around it that anyone could do that, with test results being public.

In regard to long-running tests, currently with Harry we can run three kinds:

1. Stress-like concurrent write workload, followed by periods of quiescence and then validation
2. Writes with injected faults, followed by repair and validation
3. Stress-like concurrent read/write workload with fault injection, without validation, for finding rare edge conditions / triggering possible exceptions

This means that quorum read and write paths (for all kinds of schemas, including all possible kinds of read and write queries), compactions, repairs, read repairs and hints are covered fairly well. However, things like bootstrap and other kinds of range movements aren't. It'd be great to expand this, but it's been somewhat difficult to do: the last time a bootstrap test was attempted, it immediately uncovered enough issues to keep us busy fixing them for quite some time. Maybe it's about time to try that again.

For short tests, you can think of Harry as a tool that saves you time and lets you focus on the higher-level meaning of the test rather than on creating a schema and coming up with specific values to insert/select.

Thanks,
--Alex

On Tue, Nov 2, 2021 at 5:30 PM Ekaterina Dimitrova <e.dimitr...@gmail.com> wrote:

> Did I hear my name?
>
> Sorry Josh, you are wrong :-) 2 out of 30 in two months were real bugs discovered by flaky tests, and one of them was very hard to hit. So 6-7%. I think the report I sent back then didn't come through, so the topic was cleared up in a follow-up mail by Benjamin; with a lot of sweat, but we kept to the promised 4.0 standard.
>
> Now back to this topic:
> - Green CI without enough test coverage is nothing more than green CI, unfortunately, to me. I know this is an elephant, but I won't sleep well tonight if I don't mention it.
> - I believe the looping of tests mentioned by Berenguer can help with verifying that no new weird flakiness is introduced by newly added tests. And of course it helps a lot when fixing flaky tests, I think that's clear.
>
> > I think that it would be great if each such test (or test group) was guaranteed to have a ticket and some preliminary analysis was done to confirm it is just a test problem before releasing the new version
>
> Probably not a bad idea - preliminary analysis. But we need to get into the cadence of regularly checking our CI; divide and conquer on a regular basis between all of us. Not to mention it is way easier to follow up on recently introduced issues with the people who worked on them than to try to find out, in a rush before a release, what happened a year ago. I agree it is not about the number but about what stands behind it.
>
> As for requiring all tests to run before every merge: we can easily add this in Circle, but many people don't have access to high resources, so they won't be able to run absolutely everything. In the end, everything comes down to the diligence of the reviewers/committers. Plus, the official CI is Jenkins, and we know there are different infra-related failures in the different CIs. Not an easy topic, indeed. I support running all tests, just keeping in mind all the related issues/complications.
>
> I would say that, in my mind, upgrade tests are particularly important to be green before a release, too.
>
> Seems to me we have the tools; now it is time to organize the rhythm in an efficient manner.
>
> Best regards,
> Ekaterina
>
>
> On Tue, 2 Nov 2021 at 11:06, Joshua McKenzie <jmcken...@apache.org> wrote:
>
> > To your point Jacek, I believe in the run-up to 4.0 Ekaterina did some analysis and something like 18% (correct me if I'm wrong here) of the test failures we were considering "flaky tests" were actual product defects in the database. With that in mind, we should be uncomfortable cutting a release if we have 6 test failures, since there's every likelihood one of them is a surfaced bug.
> >
> > > ensuring our best practices are followed for every merge
> >
> > I totally agree, but I also don't think we have this codified (unless I'm just completely missing something - very possible! ;)). It seems like we have different Circle configs, different sets of jobs being run, Harry / Hunter (maybe?) / ?? run on some but not all commits and/or branches, and manual performance testing on specific releases, but nothing surfaced formally to the project as a reproducible suite like we used to have years ago (primitive though it was at the time in what it covered).
> >
> > If we *don't* have this clarified right now, I think there's significant value in enumerating and at least documenting what our agreed-upon best practices are, so we can start holding ourselves and each other accountable to that bar. Given some of the incredible but sweeping work coming down the pike, this strikes me as something we need to be proactive and vigilant about so as not to regress.
> >
> > ~Josh
> >
> > On Tue, Nov 2, 2021 at 3:49 AM Jacek Lewandowski <lewandowski.ja...@gmail.com> wrote:
> >
> > > > we already have a way to confirm flakiness on circle by running the test repeatedly N times. Like 100 or 500. That has proven to work very well so far, at least for me. #collaborating #justfyi
> > >
> > > It does not prove that it is test flakiness. It can still be a bug in the code which occurs intermittently under some rare conditions.
> > >
> > > - - -- --- ----- -------- -------------
> > > Jacek Lewandowski
> > >
> > >
> > > On Tue, Nov 2, 2021 at 7:46 AM Berenguer Blasi <berenguerbl...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > we already have a way to confirm flakiness on circle by running the test repeatedly N times. Like 100 or 500. That has proven to work very well so far, at least for me. #collaborating #justfyi
> > > >
> > > > On the 60+ failures: it is not as bad as it looks. Let me explain. I have been tracking failures in 4.0 and trunk daily; it's grown into a habit after the 4.0 push. 4.0 and trunk were hovering solidly around <10 failures (you can check the Jenkins CI graphs). The occasional bisect or fix was needed, leaving behind the 3 or 4 tests that have already defeated 2 or 3 committers - the really tough ones. I am reasonably convinced that once the fix for the 60+ failures merges we'll be back to <10 failures with relatively little effort.
> > > >
> > > > So we're just in the middle of a 'fix', but overall we shouldn't be as bad as it looks right now, as we've been quite good at keeping CI green-ish imo.
> > > >
> > > > Also +1 to releasable branches, which - whatever we settle on it meaning - is not a wall of failures, because of the reasons explained, like the hidden costs etc.
> > > >
> > > > My 2cts.
> > > >
> > > > On 2/11/21 6:07, Jacek Lewandowski wrote:
> > > > >> I don't think this means guaranteeing there are no failing tests (though ideally this would also happen), but about ensuring our best practices are followed for every merge. 4.0 took so long to release because of the amount of hidden work that was created by merging work that didn't meet the standard for release.
> > > > >>
> > > > > Tests are sometimes considered flaky because they fail intermittently, but this may not be caused by an insufficiently consistent test implementation and can reveal a real problem in the production code. I have seen that in various codebases, and I think that it would be great if each such test (or test group) was guaranteed to have a ticket and some preliminary analysis was done to confirm it is just a test problem before releasing the new version.
> > > > >
> > > > >> Historically we have also had significant pressure to backport features to earlier versions due to the cost and risk of upgrading. If we maintain broader version compatibility for upgrade, and reduce the risk of adopting newer versions, then this pressure is also reduced significantly. Though perhaps we will stick to our guns here anyway, as there seems to be renewed pressure to limit work in GA releases to bug fixes exclusively. It remains to be seen if this holds.
> > > > >
> > > > > Are there any precise requirements for supported upgrade and downgrade paths?
> > > > >
> > > > > Thanks
> > > > > - - -- --- ----- -------- -------------
> > > > > Jacek Lewandowski
> > > > >
> > > > >
> > > > > On Sat, Oct 30, 2021 at 4:07 PM bened...@apache.org <bened...@apache.org> wrote:
> > > > >
> > > > >>> How do we define what "releasable trunk" means?
> > > > >>
> > > > >> For me, the major criterion is ensuring that work is not merged that is known to require follow-up work, or could reasonably have been known to require follow-up work if better QA practices had been followed.
> > > > >>
> > > > >> So, a big part of this is ensuring we continue to exceed our targets for improved QA. For me this means trying to weave tools like Harry and the Simulator into our development workflow early on, but we'll see how well these tools gain broader adoption. This also means a focus, in general, on the possible negative effects of a change.
> > > > >>
> > > > >> I think we could do with producing guidance documentation for how to approach QA, where we can record our best practices and evolve them as we discover flaws or pitfalls, either for ergonomics or for bug discovery.
> > > > >>
> > > > >>> What are the benefits of having a releasable trunk as defined here?
> > > > >>
> > > > >> If we want to have any hope of meeting reasonable release cadences _and_ the high project quality we expect today, then I think a ~shippable trunk policy is an absolute necessity.
> > > > >>
> > > > >> I don't think this means guaranteeing there are no failing tests (though ideally this would also happen), but about ensuring our best practices are followed for every merge. 4.0 took so long to release because of the amount of hidden work that was created by merging work that didn't meet the standard for release.
> > > > >>
> > > > >> Historically we have also had significant pressure to backport features to earlier versions due to the cost and risk of upgrading. If we maintain broader version compatibility for upgrade, and reduce the risk of adopting newer versions, then this pressure is also reduced significantly. Though perhaps we will stick to our guns here anyway, as there seems to be renewed pressure to limit work in GA releases to bug fixes exclusively. It remains to be seen if this holds.
> > > > >>
> > > > >>> What are the costs?
> > > > >>
> > > > >> I think the costs are quite low, perhaps even negative. Hidden work produced by merges that break things can be much more costly than getting the work right the first time, as attribution is much more challenging.
> > > > >>
> > > > >> One cost that is created, however, is for version compatibility, as we cannot say "well, this is a minor version bump so we don't need to support downgrade". But I think we should be investing in this anyway for operator simplicity and confidence, so I actually see this as a benefit as well.
> > > > >>
> > > > >>> Full disclosure: running face-first into 60+ failing tests on trunk
> > > > >>
> > > > >> I have to apologise here. CircleCI did not uncover these problems, apparently due to some way it resolves dependencies, and so I am responsible for a significant number of these and have been quite sick since.
> > > > >>
> > > > >> I think a push to eliminate flaky tests will probably help here in future, though, and perhaps the project needs to have some (low) threshold of flaky or failing tests at which point we block merges to force a correction.
> > > > >>
> > > > >>
> > > > >> From: Joshua McKenzie <jmcken...@apache.org>
> > > > >> Date: Saturday, 30 October 2021 at 14:00
> > > > >> To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> > > > >> Subject: [DISCUSS] Releasable trunk and quality
> > > > >>
> > > > >> We as a project have gone back and forth on the topic of quality and the notion of a releasable trunk for quite a few years. If people are interested, I'd like to rekindle this discussion a bit and see if we're happy with where we are as a project or if we think there are steps we should take to change the quality bar going forward. The following questions have been rattling around for me for a while:
> > > > >>
> > > > >> 1. How do we define what "releasable trunk" means? All reviewed by M committers? Passing N% of tests? Passing all tests plus some other metrics (manual testing, raising the number of reviewers, test coverage, usage in dev or QA environments, etc.)? Something else entirely?
> > > > >>
> > > > >> 2. With a definition settled upon in #1, what steps, if any, do we need to take to get from where we are to having *and keeping* that releasable trunk? Anything to codify there?
> > > > >>
> > > > >> 3. What are the benefits of having a releasable trunk as defined here? What are the costs? Is it worth pursuing? What are the alternatives (for instance: a freeze before a release plus a stabilization focus by the community, i.e. the 4.0 push or the tock in tick-tock)?
> > > > >>
> > > > >> Given the large volume of work coming down the pike with CEPs, this seems like a good time to at least check in on this topic as a community.
> > > > >>
> > > > >> Full disclosure: running face-first into 60+ failing tests on trunk when going through the commit process for denylisting this week brought this topic back up for me (reminds me of when I went to merge CDC back in 3.6 and those test failures riled me up... I sense a pattern ;)).
> > > > >>
> > > > >> Looking forward to hearing what people think.
> > > > >>
> > > > >> ~Josh
> > > > >>
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > > > For additional commands, e-mail: dev-h...@cassandra.apache.org
> > > >

--
alex p