Some people have noted in the past that some Talos measurements are not representative of anything that users would actually see, that the Talos numbers are noisy, and that we don't have good tools to deal with these types of regressions. There may be some truth to all of these, but I believe the bigger problem is that nobody owns watching over these numbers, and as a result we take regressions in some benchmarks which can actually be representative of what our users experience.
I was recently hit by most of the shortcomings you mention while trying to upgrade clang. Fortunately, I found the issue on try, but I will admit that comparing Talos numbers on try is something I only do when I expect a problem.
I still intend to write a blog post once I am done with the update and have more data, but here are some interesting points that have shown up so far:
* compare-talos and compare.py were out of date. I was really lucky that one of the benchmarks that still had the old name was the one that showed the regression. I have started a script that I hope will be more resilient to future changes (bug 786504).
* our builds are *really* hard to reproduce. The build I was downloading from try was faster than the one I was doing locally. In despair, I decided to fix at least part of this first. I found that our build output depends on the way the bots use ccache (they set CCACHE_BASEDIR, which changes __FILE__), on the build directory (it shows up in debug info that is not stripped), and on whether the file system is case sensitive. (A small sketch of the __FILE__ issue follows this list.)
* testing on Linux showed even more bizarre cases where small changes cause performance problems. In particular, adding a nop *after the last ret* in a function made the JS interpreter faster on SunSpider. The nop was just enough to push the function's size across the next 16-byte boundary, and that changed the address of every function linked after it. (See the second sketch after this list.)
* the histograms of some benchmarks don't look like a normal distribution (https://plus.google.com/u/0/108996039294665965197/posts/8GyqMEZHHVR). I still have to read the paper mentioned in the comments.
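To make the __FILE__ point above concrete, here is a minimal sketch; the file name and the commands below are mine, chosen for illustration, not something taken from our build system. The mechanism is simply that __FILE__ expands to whatever path was handed to the compiler, and CCACHE_BASEDIR makes ccache rewrite absolute paths into relative ones before invoking it, so bot and local builds of identical sources embed different strings:

/* file_path_probe.c -- hypothetical name, for illustration only.
 *
 * assert() and friends bake __FILE__ into the binary, and __FILE__ is just
 * the path that appeared on the compile command line. If that path differs
 * between two otherwise identical builds, the object files differ too. */
#include <assert.h>
#include <stdio.h>

int main(void) {
    /* Prints the compile-time path, not anything about the machine the
     * program happens to run on. */
    printf("__FILE__ as baked into this build: %s\n", __FILE__);
    assert(1 + 1 == 2); /* a failing assert would also embed __FILE__ */
    return 0;
}

Compiling it once as `cc /tmp/src/file_path_probe.c` and once as `cc file_path_probe.c` (from inside /tmp/src) produces binaries whose embedded strings differ; `strings a.out | grep file_path_probe` makes that visible. CCACHE_BASEDIR triggers exactly this kind of path rewriting behind your back.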
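And here is a sketch of the alignment effect from the SunSpider bullet. Everything here (names, the nop count, the PAD macro) is made up for illustration; the point is only that a tiny code-size change in one function can move every function the linker places after it:

/* align_probe.c -- hypothetical illustration of the layout effect above. */
#include <stdio.h>
#include <stdint.h>

#ifndef PAD
#define PAD 0   /* rebuild with -DPAD=1 to grow `first` by a few bytes */
#endif

__attribute__((noinline)) int first(int x) {
#if PAD
    __asm__ volatile("nop; nop; nop; nop"); /* stand-in for the extra nop */
#endif
    return x + 1;
}

__attribute__((noinline)) int second(int x) {
    return x * 2;
}

int main(void) {
    /* ASLR only shifts the page-aligned load base, so the low 12 bits of
     * these addresses are decided at link time; comparing the two builds
     * shows whether `second` (and everything after it) moved. */
    printf("first:  ...%03x\n", (unsigned)((uintptr_t)first  & 0xfff));
    printf("second: ...%03x\n", (unsigned)((uintptr_t)second & 0xfff));
    return (first(1) + second(2) == 6) ? 0 : 1;
}

Building it twice, `cc -O2 align_probe.c && ./a.out` and `cc -O2 -DPAD=1 align_probe.c && ./a.out` (or comparing `nm a.out` between the builds), may or may not move `second`, depending on whether the padding pushes `first` across an alignment boundary; tweak the nop count and you can watch the address jump. Whether such a shift makes a benchmark faster or slower is essentially arbitrary, which is what makes these regressions so hard to reason about.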
I don't believe that the current situation is acceptable, especially with the recent focus on performance (through the Snappy project), and I would like to ask whether people have any ideas on what we can do to fix this. The fix might be turning off some Talos tests if they're really not useful, asking a person or a group of people to go over these test results and giving them better tools, etc. But _something_ needs to happen here.
There are many things we can do to make perf debugging/testing better, but I don't think that is the main thing we need to do to solve the problem. The tools we have do work: try is slow and Talos is noisy, but it is possible to detect and debug regressions.
What I think we need to do is differentiate between tests that we expect to match user experience and synthetic tests. Synthetic tests *are* useful, as they can much more easily pinpoint what changed, even if it is something as silly as the address of some function. The difference is that we don't want to regress on the tests that match user experience. IMHO we *can* regress on synthetic ones as long as we know what is going on. And yes, if a particular synthetic test is too brittle, then we should remove it.
With that distinction in place we can then handle perf regressions in a similar way to how we handle test failures: revert the offending patch and make the original developer responsible for tracking the problem down. If a patch is known to regress a synthetic benchmark, a comment on the commit along the lines of "renaming this file causes __FILE__ to change in an assert message and produces a spurious regression on md5" should be sufficient. It is not the developer's *fault* that the change causes a problem, but IMHO it should still be his responsibility to track it.
Cheers, Ehsan
Cheers, Rafael