Because I've been working on a few of them, here's what I think would
make them a lot easier to fix, and therefore improve our test coverage
and make the sheriffs much happier:
1) make it easier to figure out from bugzilla/treeherder when and where
the failure first occurred
- I don't want to know about the first report that landed in bmo - IME,
that is not always the first time the failure happened, just the first
time it got filed.
In other words, can I query treeherder in some way (we have structured
logs now, right? And all this stuff is in a DB somewhere?) with a test
name and a regex, and have it tell me where the test first failed with a
message matching that regex?
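
To make that concrete, here's roughly the query I'd like to be able to
script. This is only a sketch: the 'failure-lines' endpoint, its
parameters and the response fields are all made up, since I don't know
what the real structured-log API (if any) looks like - but it shows the
shape of the question:

import re
import requests

TREEHERDER = "https://treeherder.mozilla.org/api"

def first_failure(repo, test_name, message_regex):
    # Hypothetical endpoint: all recorded failure lines for this test,
    # ordered oldest-first by push timestamp.
    resp = requests.get(
        "%s/project/%s/failure-lines/" % (TREEHERDER, repo),
        params={"test": test_name, "order": "push_timestamp"},
    )
    resp.raise_for_status()
    pattern = re.compile(message_regex)
    for line in resp.json():
        if pattern.search(line["message"]):
            # First push/job whose failure message matched the regex.
            return line["push_revision"], line["job_id"]
    return None

# Made-up test name and message, just to show the intended usage:
print(first_failure("mozilla-inbound", "browser_foo.js",
                    r"Test timed out .* waiting for popup"))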
2) make it easier to figure out from bugzilla/treeherder when and where
the failure happens
Linux only? Debug only? (non-)e10s only?
These questions are reasonably OK to answer right now by expanding all
the TBPL comments and using 'find in page'.
Harder questions to figure out are: how often does this happen on which
platform? That is, is it more likely to happen on debug, Linux, ASan,
...? This helps with figuring out the optimal strategy for testing fixes
and/or hunting the regression.
I'm thinking a table with OS vs. debug/opt/asan/pgo vs. e10s/non-e10s
and numbers in the cells would already go a long way.
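
Producing that table is the easy bit once we can get at per-failure
(OS, build type, e10s) tuples; the hard part is extracting those from
treeherder/OrangeFactor in the first place. A throwaway sketch of the
aggregation, with hand-typed data standing in for whatever the real
source would give us:

from collections import Counter

# Stand-in data: one (OS, build type, e10s) tuple per starred failure.
failures = [
    ("linux64", "debug", "e10s"),
    ("linux64", "debug", "non-e10s"),
    ("linux64", "asan", "e10s"),
    ("winxp", "opt", "non-e10s"),
    ("win7", "pgo", "non-e10s"),
]

counts = Counter(failures)
oses = sorted({os_ for (os_, _, _) in failures})
builds = ["opt", "debug", "asan", "pgo"]

for e10s in ("e10s", "non-e10s"):
    print("\n" + e10s)
    print("%-10s" % "" + "".join("%7s" % b for b in builds))
    for os_ in oses:
        row = "".join("%7d" % counts[(os_, b, e10s)] for b in builds)
        print("%-10s" % os_ + row)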
3) numbers on how frequently a test fails
"But we have this in orange-factor" I hear you say. Sure, but that tells
me how often it got starred, not a percentage ("failed 1% of the time on
Linux debug, 2% of the time on Windows 7 pgo, ..."), and so I can't know
how often to retrigger until I try. It also makes it hard to estimate
when the intermittent started being intermittent because it's rarely the
cset from (1) - given failure in 1 out of N runs, the likely regression
range is correlated with N (can't be bothered doing the exact
probability math right now).
This is an increasing problem: we run more and more jobs every month, so
the failure rate at which an intermittent starts annoying the sheriffs
keeps getting lower.
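
To put rough numbers on the retrigger question (my own back-of-the-
envelope, not anything treeherder tells us today): if a test fails a
given push with probability p, then seeing zero failures in n runs
happens with probability (1 - p)^n, so declaring a push "clean" with 95%
confidence takes n >= log(0.05)/log(1 - p), i.e. roughly 3/p runs:

import math

def retriggers_needed(failure_rate, confidence=0.95):
    # Smallest n such that (1 - failure_rate) ** n <= 1 - confidence,
    # i.e. enough runs that an all-green push is probably really clean.
    return int(math.ceil(math.log(1 - confidence) /
                         math.log(1 - failure_rate)))

for rate in (0.10, 0.05, 0.02, 0.01):
    print("fails %2.0f%% of the time -> ~%d runs per push for 95%% confidence"
          % (rate * 100, retriggers_needed(rate)))

So a 1% intermittent needs on the order of 300 runs per push before "it
didn't fail here" means much, which is exactly why knowing the rate up
front matters for both retriggering and regression hunting.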
4) automate regression hunting (aka mozregression for intermittent
infra-only failures)
see https://bugzilla.mozilla.org/show_bug.cgi?id=1099095 for an example
of how this works manually. We have APIs for retriggering now, right? We
have APIs for distinguishing relevant failures in logs from unrelated
orange, too. With the above, it should even be possible to narrow down
which platforms to retrigger on (I ended up just using winxp/win7/linux
debug because they seemed the most prominent, but I was too lazy to
manually create (2)), and how often to retrigger to get reasonable
confidence in the ranges (3). A rough sketch of the driver loop I have
in mind is at the end of this point.
Right now, doing this manually costs me probably a full day or two of
actually poring over results and such, with obviously a lot more time
spent waiting on the retriggers themselves. Automating it could reduce
that to 10 minutes of putting together the data and setting off the
requisite automation, plus it could theoretically strategize to run the
retriggers at non-peak times.
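
For what it's worth, that driver loop looks roughly like the below.
Everything here is hypothetical glue: shows_failure() would sit on top
of whatever retrigger and log-parsing APIs we actually have, and
runs_per_push would come out of the math in (3).

def find_first_bad_push(pushes, shows_failure, runs_per_push):
    # `pushes` is ordered oldest-first; we assume the intermittent never
    # shows up on pushes[0] and does show up on pushes[-1].
    # `shows_failure(push, n)` (hypothetical) retriggers the relevant
    # job(s) n times on `push`, waits, scans the logs for the known
    # failure signature, and returns True if any run hit it.
    lo, hi = 0, len(pushes) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if shows_failure(pushes[mid], runs_per_push):
            hi = mid   # failure already exists here; look earlier
        else:
            lo = mid   # probably clean; look later
    return pushes[hi]

The "probably" in that last comment is the catch: a green result is only
as trustworthy as runs_per_push makes it, so (3) and (4) really need
each other. But the machine can happily burn those runs at non-peak
times while nobody is looking at the trees.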
5) rr or similar recording of failing test runs
We've talked about this before on this newsgroup, but it's been a long
time. Is this feasible and/or currently in the pipeline?
Do we have projects on any of this, and if not, can we start some? Do
other people have other ideas on how to make this stuff easier,
especially considering my note under (3) about the (implicit) threshold
percentage going lower and lower, which makes this harder and harder
(i.e. understanding and fixing bugs that show up on <1% of runs)?
~ Gijs