Some people have noted in the past that some Talos measurements are not representative of anything that users would actually see, that the Talos numbers are noisy, and that we don't have good tools to deal with these types of regressions. There may be some truth to all of these, but I believe the bigger problem is that nobody owns watching over these numbers, and as a result we take regressions in some benchmarks which can actually be representative of what our users experience.
I was recently hit by most of the shortcomings you mention while trying to upgrade clang. Fortunately, I found the issue on try, but I will admit that comparing Talos numbers on try is something I only do when I expect a problem.
I still intend to write a blog post once I am done with the update and have more data, but here are some interesting points that have shown up so far:
* compare-talos and compare.py were out of date. I was really lucky that one of the benchmarks that still had the old name was the one that showed the regression. I have started a script that I hope will be more resilient to future changes (bug 786504).
* our builds are *really* hard to reproduce. The build I was downloading from try was faster than the one I was doing locally. In despair, I decided to fix at least part of this first. I found that our build output depends on the way the bots use ccache (they set CCACHE_BASEDIR, which changes __FILE__), on the build directory (it shows up in debug info that is not stripped), and on whether the file system is case sensitive. (A small sketch of the __FILE__ issue follows this list.)
* testing on Linux showed even more bizarre cases where small changes cause performance problems. In particular, adding a nop *after the last ret* in a function made the JS interpreter faster on SunSpider. The nop was just enough to push the function's size across the next 16-byte boundary, and that changed the address of every function linked after it. (See the second sketch after this list.)
* the histograms of some benchmarks don't look like a normal distribution (https://plus.google.com/u/0/108996039294665965197/posts/8GyqMEZHHVR). I still have to read the paper mentioned in the comments.
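To make the __FILE__ point above concrete, here is a minimal sketch; the file name and the commands below are mine, chosen for illustration, not something taken from our build system. The mechanism is simply that __FILE__ expands to whatever path was handed to the compiler, and CCACHE_BASEDIR makes ccache rewrite absolute paths into relative ones before invoking it, so bot and local builds of identical sources embed different strings:

/* file_path_probe.c -- hypothetical name, for illustration only.
 *
 * assert() and friends bake __FILE__ into the binary, and __FILE__ is just
 * the path that appeared on the compile command line. If that path differs
 * between two otherwise identical builds, the object files differ too. */
#include <assert.h>
#include <stdio.h>

int main(void) {
    /* Prints the compile-time path, not anything about the machine the
     * program happens to run on. */
    printf("__FILE__ as baked into this build: %s\n", __FILE__);
    assert(1 + 1 == 2); /* a failing assert would also embed __FILE__ */
    return 0;
}

Compiling it once as `cc /tmp/src/file_path_probe.c` and once as `cc file_path_probe.c` (from inside /tmp/src) produces binaries whose embedded strings differ; `strings a.out | grep file_path_probe` makes that visible. CCACHE_BASEDIR triggers exactly this kind of path rewriting behind your back.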
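And here is a sketch of the alignment effect from the SunSpider bullet. Everything here (names, the nop count, the PAD macro) is made up for illustration; the point is only that a tiny code-size change in one function can move every function the linker places after it:

/* align_probe.c -- hypothetical illustration of the layout effect above. */
#include <stdio.h>
#include <stdint.h>

#ifndef PAD
#define PAD 0   /* rebuild with -DPAD=1 to grow `first` by a few bytes */
#endif

__attribute__((noinline)) int first(int x) {
#if PAD
    __asm__ volatile("nop; nop; nop; nop"); /* stand-in for the extra nop */
#endif
    return x + 1;
}

__attribute__((noinline)) int second(int x) {
    return x * 2;
}

int main(void) {
    /* ASLR only shifts the page-aligned load base, so the low 12 bits of
     * these addresses are decided at link time; comparing the two builds
     * shows whether `second` (and everything after it) moved. */
    printf("first:  ...%03x\n", (unsigned)((uintptr_t)first  & 0xfff));
    printf("second: ...%03x\n", (unsigned)((uintptr_t)second & 0xfff));
    return (first(1) + second(2) == 6) ? 0 : 1;
}

Building it twice, `cc -O2 align_probe.c && ./a.out` and `cc -O2 -DPAD=1 align_probe.c && ./a.out` (or comparing `nm a.out` between the builds), may or may not move `second`, depending on whether the padding pushes `first` across an alignment boundary; tweak the nop count and you can watch the address jump. Whether such a shift makes a benchmark faster or slower is essentially arbitrary, which is what makes these regressions so hard to reason about.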
I don't believe that the current situation is acceptable, especially with the recent focus on performance (through the Snappy project), and I would like to ask whether people have any ideas on what we can do to fix this. The fix might be turning off some Talos tests if they're really not useful, asking a person or a group of people to go over these test results and giving them better tools, etc. But _something_ needs to happen here.
There are many things we can do to make perf debugging/testing better, but I don't think that is the main thing we need to do to solve the problem. The tools we have do work: try is slow and Talos is noisy, but it is possible to detect and debug regressions.
What I think we need to do is differentiate between tests that we expect to match user experience and synthetic tests. Synthetic tests *are* useful, as they can much more easily pinpoint what changed, even if it is something as silly as the address of some function. The difference is that we don't want to regress on the tests that match user experience. IMHO we *can* regress on synthetic ones as long as we know what is going on. And yes, if a particular synthetic test is too brittle, then we should remove it.
With that distinction in place we can then handle perf regressions in a similar way to how we handle test failures: revert the offending patch and make the original developer responsible for tracking the problem down. If a patch is known to regress a synthetic benchmark, a comment on the commit along the lines of "renaming this file causes __FILE__ to change in an assert message and produces a spurious regression on md5" should be sufficient. It is not the developer's *fault* that the change causes a problem, but IMHO it should still be his responsibility to track it.
Cheers, Ehsan
Cheers, Rafael