Because I've been working on a few of them, here's what I think would make them a lot easier to fix, and therefore improve our test coverage and make the sheriffs much happier:

1) make it easier to figure out from bugzilla/treeherder when and where the failure first occurred - I don't just want the first instance that got reported to bmo; IME that is not always the first time it happened, just the first time it got filed.

In other words, can I query treeherder in some way (we have structured logs now, right, and all this stuff is in a DB somewhere?) with a test name and a regex, and have it tell me where the test first failed with a message matching that regex?
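To make this concrete, here's roughly the query I'd like to be able to write, as a Python sketch. The /api/failures/ endpoint, its parameters and the response fields are all made up - I don't know what the structured-log storage actually exposes - and the test name and regex are just examples:

import re
import requests

# Hypothetical query: "where did this test first fail with a message
# matching this regex?" - the endpoint and field names are invented.
TREEHERDER = "https://treeherder.mozilla.org"
resp = requests.get(TREEHERDER + "/api/failures/", params={
    "test": "browser/components/sessionstore/test/browser_backup_recovery.js",
    "repo": "mozilla-inbound",
})
resp.raise_for_status()

message_re = re.compile(r"Test timed out .* leaked \d+ window\(s\)")
matching = [f for f in resp.json() if message_re.search(f.get("message", ""))]

# Sort by push date to find the earliest occurrence, not the earliest
# one that happened to get filed in bmo.
matching.sort(key=lambda f: f["push_timestamp"])
if matching:
    first = matching[0]
    print("first failed on", first["revision"], first["platform"],
          first["push_timestamp"])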

2) make it easier to figure out from bugzilla/treeherder when and where the failure happens

Linux only? Debug only? (non-)e10s only?

These questions are reasonably OK to answer right now by expanding all the TBPL comments and using 'find in page'.

Harder questions to figure out are:

How often does this happen on which platform? That is, is it more likely to happen on debug, linux, asan, ...? This helps with figuring out optimal strategies for testing fixes and/or regression hunting.

I'm thinking a table with OS vs. debug/opt/asan/pgo vs. e10s/non-e10s and numbers in the cells would already go a long way.
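Even a quick script over whatever failure records we can pull out would get us that table. A sketch, with invented field names standing in for wherever the starred-failure data actually lives:

from collections import Counter

# Cross-tabulate starred failures by platform, build flavour and e10s-ness.
# `failures` stands in for data pulled from Treeherder/OrangeFactor; the
# field names are made up.
failures = [
    {"platform": "linux64", "build": "debug", "e10s": True},
    {"platform": "linux64", "build": "debug", "e10s": False},
    {"platform": "win7", "build": "pgo", "e10s": True},
    # ... hundreds more ...
]

counts = Counter((f["platform"], f["build"], f["e10s"]) for f in failures)

platforms = sorted({f["platform"] for f in failures})
columns = [(b, e) for b in ("opt", "debug", "asan", "pgo")
           for e in (True, False)]

# Print a simple OS x build-type/e10s table.
header = ["platform"] + ["%s/%s" % (b, "e10s" if e else "non-e10s")
                         for b, e in columns]
print("\t".join(header))
for p in platforms:
    row = [p] + [str(counts[(p, b, e)]) for b, e in columns]
    print("\t".join(row))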

3) numbers on how frequently a test fails

"But we have this in orange-factor" I hear you say. Sure, but that tells me how often it got starred, not a percentage ("failed 1% of the time on Linux debug, 2% of the time on Windows 7 pgo, ..."), and so I can't know how often to retrigger until I try. It also makes it hard to estimate when the intermittent started being intermittent because it's rarely the cset from (1) - given failure in 1 out of N runs, the likely regression range is correlated with N (can't be bothered doing the exact probability math right now).

This is an increasing problem: we run more and more jobs every month, so the failure rate at which an intermittent starts annoying the sheriffs keeps getting lower and lower.
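For the record, the back-of-the-envelope version of that math in Python - the failure rate p here is pulled out of thin air, the point is just how the numbers scale:

import math

# If a test fails roughly 1 in N runs, how far back might the real
# regression be, and how many retriggers does a push need before I can
# call it "good"? p = 1/N is a guess in itself.
p = 1 / 20.0          # e.g. fails ~1 in 20 runs on the affected platform
confidence = 0.95

# Runs between the regressing push and the first *observed* failure are
# roughly geometric with mean 1/p, so the likely regression range grows
# with N.
print("expected gap to first observed failure: ~%.0f runs" % (1 / p))

# Same math for bisection: a push only counts as "good" once enough green
# retriggers have piled up that a real failure would almost certainly
# have shown up, i.e. (1 - p)**k <= 1 - confidence.
k = math.ceil(math.log(1 - confidence) / math.log(1 - p))
print("retriggers needed for %.0f%% confidence: %d" % (confidence * 100, k))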

4) automate regression hunting (aka mozregression for intermittent infra-only failures)

See https://bugzilla.mozilla.org/show_bug.cgi?id=1099095 for an example of how this works manually. We have APIs for retriggering now, right? We have APIs for distinguishing relevant failures in logs from unrelated orange, too. With the above, it should even be possible to narrow down which platforms to retrigger on (I ended up just using winxp/win7/linux debug because they seemed most prominent, but I was too lazy to build the table from (2) by hand), and how often to retrigger to get reasonable confidence in the resulting ranges (see (3)).

Right now, doing this manually costs me probably a full day or two of actually poring over results and such, with obviously a lot more wall-clock time spent waiting on the retriggers themselves. Automating this could reduce it to 10 minutes of putting the data together and setting off the requisite automation, plus it could theoretically strategize by running retriggers at non-peak times.
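For what it's worth, the loop I do by hand boils down to something like this Python sketch; the retrigger-and-check-the-logs step is a callable parameter because I don't know what shape the real retrigger/log-parsing APIs have, so everything here is illustrative:

import math

def find_regressing_push(pushes, retrigger_and_check, p, confidence=0.95):
    """pushes: oldest-to-newest, pushes[0] known good, pushes[-1] known bad.

    retrigger_and_check(push, times) should retrigger the relevant job
    `times` times on that push and return True if any run hit the failure
    (both of those steps are hand-waving over real APIs).
    """
    # Retriggers needed before a green push can be trusted as "good":
    # (1 - p)**k <= 1 - confidence, i.e. roughly 3/p for 95% confidence.
    k = math.ceil(math.log(1 - confidence) / math.log(1 - p))
    lo, hi = 0, len(pushes) - 1      # invariant: pushes[lo] good, pushes[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if retrigger_and_check(pushes[mid], k):
            hi = mid                 # reproduced: the regressor is at or before mid
        else:
            lo = mid                 # k green runs: treat mid as good
    return pushes[hi]                # first push where the failure reproduces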

5) rr or similar recording of failing test runs

We've talked about this before on this newsgroup, but it's been a long time. Is this feasible and/or currently in the pipeline?


Do we have projects for any of this, and if not, can we start some? Do other people have other ideas on how to make this stuff easier, especially considering my note under (3) about the (implicit) threshold percentage going lower and lower, which makes this harder and harder (i.e. understanding and fixing bugs that show up in <1% of runs)?

~ Gijs