Thank you for posting this - I've become increasingly worried by the number of 
real regressions that we are likely missing due to the current situation.

Like you, I used to read every single dev.tree-management mail & try to act on 
those that looked real. However, after many months of feeling it wasn't making 
any difference at all, I stopped checking regularly, since it was a very 
unproductive use of my time.

The main issues were:

1) Many seemingly irrelevant tests (eg codesize, number of constructors, ...) 
and noisy tests causing the signal-to-noise ratio of dev.tree-management to be 
very low, resulting in the list receiving hundreds of regression emails a 
week. This made it a struggle both to find time for them all on a daily basis 
and to remember which tests were reliable and so should be acted upon more 
urgently.

2) Large amounts of coalescing causing the regression range to be wide, 
resulting in either vast amounts of sheriff time spent on retriggers/TryServer 
runs, or developers believing the regression must have been caused by someone 
else in the range and so not looking into it themselves (see the bisection 
sketch just after this list).

3) Lack of documentation for working out even what the test in question does 
and how to run it locally. I filed bug 770460 to deal with this, which jhammel 
has kindly done, so the situation is a bit better than it was: 
https://wiki.mozilla.org/Buildbot/Talos

4) The combination of the above causing:
    a) Devs either not believing the regression could be anything other than 
noise/someone else's patch, or else trying really hard to investigate the 
regression but being unable to, due to gaps in the documentation/difficulty in 
reproducing locally.
    b) Sheriffs not having enough conviction in the reliability of a reported 
regression to feel able to hassle said devs about (a).
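
To make the coalescing problem in (2) concrete, here is a rough Python sketch 
of the bisection a sheriff effectively has to do by hand today to narrow a 
coalesced range down to a single push. The run_talos/is_regressed callables 
are entirely hypothetical stand-ins for whatever retrigger mechanism and 
judgement we actually use - this is a sketch of the workflow, not existing 
tooling:

    from typing import Callable, Sequence

    def bisect_regression_range(
        pushes: Sequence[str],
        run_talos: Callable[[str], float],
        is_regressed: Callable[[float], bool],
    ) -> str:
        """Narrow a coalesced regression range down to the offending push.

        `pushes` is the ordered list of revisions from the last known-good
        Talos run to the first known-bad one.  run_talos(rev) retriggers the
        suite on that revision and blocks until it reports a value;
        is_regressed(value) decides whether that value shows the regression.
        """
        lo, hi = 0, len(pushes) - 1          # lo = known good, hi = known bad
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if is_regressed(run_talos(pushes[mid])):
                hi = mid                     # regression is at or before mid
            else:
                lo = mid                     # regression landed after mid
        return pushes[hi]                    # first push showing the regression

Each iteration of that loop costs a full build + Talos cycle, which is why 
wide coalesced ranges translate so directly into lost sheriff time and machine 
hours.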

As an example, I filed bug 778718 (30% Windows Ts regression since 1st March) a 
month ago, yet it has sat inactive for a while now and I sadly don't see us 
making any progress on it before we release the regressed builds :-(

--

On Thursday, 30 August 2012 01:01:26 UTC+1, Matt Brubeck  wrote:
> * Less is more.  We can pay more attention to tests if every alert is 
> for something we care about.  We can *track* stuff like Trace Malloc 
> Allocs if there are people who find the data useful in their work, but 
> we should not *alert* on it unless it is a key user-facing metric.

On Thursday, 30 August 2012 02:20:24 UTC+1, Dave Mandelin  wrote:
> - First, and most important, fix the test suite so that it measures only 
> things that are useful and meaningful to developers and users. We can easily 
> take a first cut at this if engineering teams go over the tests related to 
> their work, and tell A-Team which are not useful. Over time, I think we need 
> to get a solid understanding of what performance looks like to users, what 
> things to test, and how to test them soundly. This may require dedicated 
> performance engineers or a performance product manager.

On Thursday, 30 August 2012 17:11:25 UTC+1, Ehsan Akhgari  wrote:
> Note that some of the work to differentiate between false positives 
> and real regressions needs to be done by the engineers, similar to the 
> work required to investigate correctness problems.  And people need to 
> accept that seemingly benign changes may also cause real performance 
> regressions, so it's not always possible to glance over a changeset and 
> say "nah, this can't be my fault."  :-)

I agree entirely with each of the above.


On Thursday, 30 August 2012 16:53:08 UTC+1, Ehsan Akhgari  wrote:
> The way that we handle similar situations with test failures is to close 
> inbound, back out enough patches so that things go green again, and then 
> reopen, and ask the people that had been backed out to retry landing 
> after testing on try.
> 
> This is entirely possible for Talos regressions as well if we want to, 
> it just takes longer, and will mean more patches will get backed out.

On Thursday, 30 August 2012 22:42:07 UTC+1, Taras Glek  wrote:
> ** We will extend the current merge criteria of last green PGO changeset 
> to also include "good Talos numbers"

Once we have the main issues resolved, I agree this will be the best route to 
victory - and in fact I'll be happy to take the lead on proactively checking 
dev.tree-management before merging.

However, at this precise point in time I feel that blocking merges on talos 
"green" is not yet viable, since the reasons at the start of my post still 
hold. I'm hopeful that many of them can be mitigated over the next couple of 
weeks, but for now I think we're going to have to deal with just the most 
obvious talos regressions in the days shortly after each merge, rather than 
blocking the merges on them.

In addition, as far as I'm aware it takes several talos runs (after the 
regression) before the regression mails are triggered (presumably more so for 
noisy tests), so we would have to wait several hours longer than the usual "PGO 
green on all platforms" before we could be sure it was safe to merge. Add to 
this the end-to-end time of retriggers to counteract coalescing - and I suspect 
this would cause a fair amount of grief between sheriffs and devs, if our 
experience with delaying merges for unit test failures is anything to go by. 
Can someone who knows talos better than I do clarify exactly when the 
regression emails are sent?
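
For what it's worth, my (possibly wrong) mental model is that the analysis 
compares a window of runs before each push with a window of runs after it, 
something like the toy Python below; the window size and threshold here are 
numbers I've made up, not Talos' real values. If that's roughly right, an 
alert cannot fire until that many runs after the suspect push have completed, 
hence the lag:

    from statistics import mean

    def detect_regressions(values, window=5, threshold=0.03):
        """Toy moving-window detector (made-up window/threshold, not Talos'
        real analysis): flag index i if the mean of the `window` runs from i
        onwards is more than `threshold` worse than the mean of the `window`
        runs before it."""
        alerts = []
        for i in range(window, len(values) - window + 1):
            before = mean(values[i - window:i])
            after = mean(values[i:i + window])
            if (after - before) / before > threshold:   # larger value = slower
                alerts.append(i)
        return alerts

    # A step from ~100 to ~110 is only flagged once 5 post-step runs exist:
    print(detect_regressions([100, 101, 99, 100, 102, 110, 111, 109, 110, 112]))

Corrections welcome - this is only meant to illustrate why the mails lag, and 
roughly how long a merge would be blocked waiting for them.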


On Friday, 31 August 2012 16:38:10 UTC+1, Ehsan Akhgari  wrote:
> On 12-08-31 5:37 AM, Justin Lebar wrote:
> > Sorry to continue beating this horse, but I don't think it's quite dead yet:
> > One of the best things we could do to make finding these regressions
> > easier is to never coalesce Talos on mozilla-inbound.  It's crazy to
> > waste developer time bisecting Talos locally when we don't run it on
> > every push.
> 
> In order to help kill that horse, I filed bug 787447 and CCed John on 
> it.  :-)

I would absolutely love it if we were able to do this, but with our current 
capacity issues (which really do need resolving, but that's another thread, to 
which I'm replying after this), we just don't have anywhere near the spare 
machine time. 

What would be useful, however, is an easy way to mark a regression range (eg 
the one given in a dev.tree-management email) as needing retriggers on every 
push to combat coalescing, with those retriggers then scheduled for our quieter 
machine times, eg weekends/outside PDT hours. If we did this we still wouldn't 
be able to block mozilla-inbound merges on 'green' talos, but we would at least 
be in a better position than we are now, where the regression ranges are so 
wide that they are not easily actionable.
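
Purely to sketch the shape of the thing I mean (the quiet-hours window and the 
schedule_retrigger hook below are placeholders I've invented, not an existing 
buildbot/self-serve API), something like:

    import datetime

    # Placeholder values/hooks: the quiet-hours window and schedule_retrigger
    # are illustrative inventions, not an existing buildbot/self-serve API.
    QUIET_START_UTC = 3     # ~20:00 PDT the previous evening
    QUIET_END_UTC = 14      # ~07:00 PDT

    def in_quiet_hours(when: datetime.datetime) -> bool:
        """Weekends, or the (made-up) overnight-for-PDT window, count as quiet."""
        return when.weekday() >= 5 or QUIET_START_UTC <= when.hour < QUIET_END_UTC

    def queue_range_retriggers(pushes, suite, schedule_retrigger):
        """Queue a Talos retrigger of `suite` for every coalesced push in a
        regression range, deferred to the next quiet slot so it doesn't eat
        into peak-hours machine time.  schedule_retrigger(rev, suite, when)
        is a stand-in for whatever hook would actually do the scheduling."""
        when = datetime.datetime.utcnow()
        while not in_quiet_hours(when):
            when += datetime.timedelta(hours=1)   # step forward to quiet time
        for rev in pushes:
            schedule_retrigger(rev, suite, when)

The point is just that the retriggers for a whole range could be batched and 
deferred rather than competing with peak load; the actual scheduling mechanism 
would obviously need releng/A-Team input.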

Best wishes,

Ed
