On Wednesday, August 29, 2012 4:03:24 PM UTC-7, Ehsan Akhgari wrote:
> Hi everyone,
> 
> What happens currently is that many developers ignore the Talos 
> regression emails that go to dev-tree-management, 

Talos is widely disliked and distrusted by developers, because it's hard to 
understand what it's really measuring, and there are lots of false alarms. 
Metrics and A-Team have been doing a ton of work to improve this. In 
particular, I told them that some existing Talos JS tests were not useful to 
us, and they deleted them. And v2 is going to have exactly the tests we want, 
with regression alarms. So Talos can (and will) be fixed for developers.
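
To make "regression alarms" concrete, here is roughly the kind of windowed 
check I have in mind over per-push test results. The window size and 
threshold below are made up for illustration; I'm not claiming this is what 
Talos does today.

# Sketch of a windowed regression check over per-push Talos means.
# WINDOW and THRESHOLD are illustrative values, not Talos's real ones.
from statistics import mean, stdev

WINDOW = 12      # pushes on each side of the candidate change point
THRESHOLD = 2.0  # alarm when the shift exceeds two baseline stdevs

def regression_alarm(series, i):
    """True if the metric (lower is better) jumps upward at push i."""
    if i < WINDOW or i + WINDOW > len(series):
        return False                  # not enough history on either side
    before = series[i - WINDOW:i]
    after = series[i:i + WINDOW]
    noise = stdev(before) or 1e-9     # guard against a perfectly flat baseline
    shift = mean(after) - mean(before)
    return shift / noise > THRESHOLD

This is also where the false alarms come from: on a noisy test you either 
set THRESHOLD high and miss real regressions, or set it low and alarm on 
noise.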

> and in many cases regressions of a few percent slide in without being 
> tracked.  This trend of relatively big performance regressions becomes 
> more evident every time we do an uplift, which means that six weeks' 
> worth of development gets compared to the previous version.
> 
> In the past, a few people (myself included) have tried to go through 
> these emails and notify the people responsible.  This process has 
> proved ineffective, because (1) the problem is not officially owned by 
> anyone (currently the only person going through those emails is 
> mbrubeck), and (2) given the difficulty of diagnosing and reproducing 
> performance regressions, many people think their patches are unlikely 
> to have caused a regression, and therefore no investigation gets done.

Yeah, that's no good.

> Some people have noted in the past that some Talos measurements are not 
> representative of anything that users would see, that the Talos numbers 
> are noisy, and that we don't have good tools to deal with these types 
> of regressions.  There might be some truth to all of these, but I 
> believe that the bigger problem is that nobody owns watching over these 
> numbers, and as a result we take regressions in some benchmarks which 
> can actually be representative of what our users experience.

The interesting thing is that we basically have no idea if that's true for any 
given Talos alarm.

> I don't believe that the current situation is acceptable, especially 
> with the recent focus on performance (through the Snappy project), and I 
> would like to ask people if they have any ideas on what we can do to fix 
> this.  The fix might be turning off some Talos tests if they're really 
> not useful, asking someone or a group of people to go over these test 
> results, getting them better tools, etc.  But _something_ needs to 
> happen here.

I would say:

- First, and most important, fix the test suite so that it measures only things 
that are useful and meaningful to developers and users. We can easily take a 
first cut at this if engineering teams go over the tests related to their work, 
and tell A-Team which are not useful. Over time, I think we need to get a solid 
understanding of what performance looks like to users, what things to test, and 
how to test them soundly. This may require dedicated performance engineers or a 
performance product manager.

- Second, as you say, get an owner for performance regressions. There are lots 
of ways we could do this. I think it would integrate fairly easily into our 
existing processes if we (automatically or by a designated person) filed a bug 
for each regression and marked it tracking (so the release managers would own 
followup). Alternatively, we could have a designated person own followup. I'm not 
sure if that has any advantages, but release managers would probably know. But 
doing any of this is going to severely annoy engineers unless we get the false 
positive rate under control.
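
If we went the automatic route, the filing step itself is small. Here's a 
sketch of what I mean; the endpoint, field names, and component are my 
guesses at a Bugzilla-style REST API, not a description of any existing 
tooling:

# Sketch: file a tracking bug for each confirmed Talos regression.
# The endpoint, fields, and component choice are assumptions, not a
# description of existing infrastructure.
import json
import urllib.request

BUGZILLA = "https://bugzilla.mozilla.org/rest/bug"  # assumed endpoint

def file_regression_bug(test, platform, pct, pushlog_url, api_key):
    bug = {
        "product": "Testing",        # placeholder triage location
        "component": "Talos",
        "version": "unspecified",
        "summary": "%.1f%% %s regression on %s" % (pct, test, platform),
        "description": "Regression range: %s" % pushlog_url,
        "keywords": ["regression", "perf"],
    }
    req = urllib.request.Request(
        "%s?api_key=%s" % (BUGZILLA, api_key),
        data=json.dumps(bug).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]  # bug number to mark tracking

Marking the bug tracking would still be a human (or a second API call); the 
point is only that none of this needs new infrastructure.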

- Speaking of false positives, we should seriously start tracking them. We 
should keep track of each Talos regression found and its outcome. (It would be 
great to track false negatives too, but it's a lot harder to catch and 
record them accurately.) That way we'd actually know whether we have a few 
false positives or a lot, and whether they are concentrated in certain 
tests. And we could use that information to improve the false positive 
rate over time.
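
The bookkeeping for this is genuinely small. As a sketch (the schema and 
the outcome labels are invented here):

# Sketch: log each Talos alarm and its eventual outcome, then report
# a per-test false positive rate. Schema and labels are invented.
import sqlite3

db = sqlite3.connect("talos_alarms.db")
db.execute("""CREATE TABLE IF NOT EXISTS alarms (
    test TEXT, push TEXT,
    outcome TEXT CHECK (outcome IN
        ('confirmed', 'false_positive', 'open')))""")

def record_alarm(test, push, outcome="open"):
    db.execute("INSERT INTO alarms VALUES (?, ?, ?)", (test, push, outcome))
    db.commit()

def false_positive_rates():
    """Per-test false positive rate over resolved alarms, worst first."""
    return db.execute("""
        SELECT test,
               AVG(outcome = 'false_positive') AS fp_rate,
               COUNT(*) AS resolved
        FROM alarms
        WHERE outcome != 'open'
        GROUP BY test
        ORDER BY fp_rate DESC""").fetchall()

Even a table this simple, filled in by whoever resolves each alert, would 
tell us within a cycle or two which tests are crying wolf.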

Dave