> It’s hard to gate commit on a clean CI run when there’s flaky tests
I agree, and this is also why so much effort went into the 4.0 release to remove as many of them as possible. Just over a month ago we did not really have a flaky test problem (outside of the sporadic timeout issues; my CircleCI runs were consistently green), and now the "flaky tests" I see are all actual bugs (I have been root causing 2 of the 3 I reported), and some (but not all) of the flakiness was triggered by changes made in the past month.

Right now people do not believe a failing test was caused by their patch and attribute it to flakiness, which then causes the builds to actually become flaky, which then leaves a different author to come fix the issue; this is the behavior I would love to see go away. If we find a flaky test, we should do the following:

1) Check whether it has already been reported and who is working to fix it. Can we block this patch on the test being fixed? Flaky tests due to timing issues are normally resolved very quickly; real bugs take longer.

2) If it has not been reported, ask why not. If you are the first to see the issue, there is a good chance your patch caused it, so you should root cause it. If you are not the first to see it, why did others not report it (we tend to be good about this, even to the point that Brandon has to mark the new tickets as dups…)?

I have committed when there was flakiness, and I have caused flakiness; I am not saying I am perfect or that I always do the above, just that if we all moved to this model we could start relying on CI. The biggest impact on our stability would be people actually root causing flaky tests.

> I think we're going to need a system that
> understands the difference between success, failure, and timeouts

I am curious how such a system can know that a timeout is not an actual failure. There was a bug in 4.0 with time serialization in messaging that would cause the message to get dropped; if I remember correctly, this presented itself as a timeout (Jon Meredith or Yifan Cai fixed that bug, I believe).
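To make that concrete, here is a rough sketch of the kind of classification step I assume such a system would perform (plain Java; everything here is hypothetical and not code from our tree or from either CI system). The best it can do is put timeouts in their own bucket for follow-up; it cannot tell you the timeout was not really a dropped message:

import java.util.concurrent.TimeoutException;

public class ResultClassifier
{
    // Hypothetical outcome buckets; a real integration would map these onto
    // whatever the report format (e.g. JUnit XML) lets us record.
    enum Outcome { SUCCESS, FAILURE, TIMEOUT }

    // A null throwable means the test passed. Timeouts get their own bucket
    // so a report can mark them as "suspect" rather than lumping them in with
    // assertion failures, but (as above) a timeout can still hide a real bug.
    static Outcome classify(Throwable error)
    {
        if (error == null)
            return Outcome.SUCCESS;

        // Walk the cause chain: timeouts are often wrapped in other exceptions.
        for (Throwable t = error; t != null; t = t.getCause())
        {
            if (t instanceof TimeoutException)
                return Outcome.TIMEOUT;
        }
        return Outcome.FAILURE;
    }

    public static void main(String[] args)
    {
        System.out.println(classify(null));                                    // SUCCESS
        System.out.println(classify(new AssertionError("expected 3, got 2"))); // FAILURE
        System.out.println(classify(new RuntimeException(
                new TimeoutException("no response after 30s"))));              // TIMEOUT
    }
}

Whether the TIMEOUT bucket gets retried, quarantined, or surfaced on a dashboard is a policy question; the point is only that it should not be silently folded into FAILURE.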
> On Nov 3, 2021, at 10:56 AM, Brandon Williams <dri...@gmail.com> wrote:
>
> On Wed, Nov 3, 2021 at 12:35 PM bened...@apache.org <bened...@apache.org>
> wrote:
>>
>> The largest number of test failures turn out (as pointed out by David) to be
>> due to how arcane it was to trigger the full test suite. Hopefully we can
>> get on top of that, but I think a significant remaining issue is a lack of
>> trust in the output of CI. It’s hard to gate commit on a clean CI run when
>> there’s flaky tests, and it doesn’t take much to misattribute one failing
>> test to the existing flakiness (I tend to compare to a run of the trunk
>> baseline for comparison, but this is burdensome and still error prone). The
>> more flaky tests there are the more likely this is.
>>
>> This is in my opinion the real cost of flaky tests, and it’s probably worth
>> trying to crack down on them hard if we can. It’s possible the Simulator may
>> help here, when I finally finish it up, as we can port flaky tests to run
>> with the Simulator and the failing seed can then be explored
>> deterministically (all being well).
>
> I totally agree that the lack of trust is a driving problem here, even
> in knowing which CI system to rely on. When Jenkins broke but Circle
> was fine, we all assumed it was a problem with Jenkins, right up until
> Circle also broke.
>
> In testing a distributed system like this I think we're always going
> to have failures, even on non-flaky tests, simply because the
> underlying infrastructure is variable with transient failures of its
> own (the network is reliable!) We can fix the flakies where the fault
> is in the code (and we've done this to many already) but to get more
> trustworthy output, I think we're going to need a system that
> understands the difference between success, failure, and timeouts, and
> in the latter case knows how to at least mark them differently.
> Simulator may help, as do the in-jvm dtests, but there is ultimately
> no way to cover everything without doing some things the hard, more
> realistic way where sometimes shit happens, marring the almost-perfect
> runs with noisy doubt, which then has to be sifted through to
> determine if there was a real issue.