I'd like to propose some changes to how we report and triage Talos
alerts. Over the past couple of years, Joel Maher (with occasional
assistance from me and others) has taken over the job of triaging
and responding to ("sheriffing") Talos regressions. He's done this
through a handful of existing systems:

1. Graphserver: https://graphs.mozilla.org - A system for visualizing
the results of Talos tests.
2. Graphserver alerts: A system that monitors Graphserver for sustained
regressions and improvements in particular Talos tests, and emails
developers and posts to a newsgroup (m.d.tree-alerts) when it detects them.
3. AlertManager: http://alertmanager.allizom.org:8080/alerts.html - A
system Joel created that ingests the emails produced by Graphserver alerts
and presents them in a dashboard.

Over the past year, I've been working on a system called Perfherder,
which aims to subsume the above functionality, streamline this process,
and make it easier for developers to understand and respond to these
reports. The last pieces have been coming together, the final one being
a better interface for triaging and responding to alerts, which you can
see here:
https://treeherder.mozilla.org/perf.html#/alerts

We've found that the automated alert emails have not been an effective
way of getting developers to respond to performance regressions. Indeed,
they may have had the opposite effect: because there are so many
"downstream" alerts (notifications produced due to merges and uplifts),
people have largely been ignoring them, thinking that Graphserver alerts
(and perhaps Talos in general?) are "just noise".

This isn't true: the alert subsystem *does* in fact produce
correct reports of regressions and improvements in most cases (there are
certainly exceptions), but I think history has shown that we need at
least some hands-on work by a performance sheriff (someone like Joel) to
triage these results and file bugs appropriately. Over the past few
years, doing this with the help of the AlertManager dashboard has proven
much more effective at getting results than the automated emails, so
we're going to continue with that approach as we transition over to
Perfherder.

Here's my rough roadmap for changing the current system:

1. Effective immediately (well, as soon as possible): Stop emailing
developers Graphserver regression reports. These reports will continue
to be sent to mozilla.dev.tree-alerts (mostly so that they can be picked
up by the existing AlertManager system while we transition over to
Perfherder).
2. End of January: Sheriffs will start using Perfherder to triage
performance alerts and file bugs (we're almost ready to do this now,
pending a few important features, like prepopulating a bug based on a
template). Hopefully the only thing developers will notice is
easier-to-understand bugs, thanks to the various improvements of
Perfherder over Graphserver. :)
3. End of Q1: After two months of running Graphserver side by side with
Perfherder, we will stop submitting Talos data to Graphserver.
Graphserver will keep running in read-only mode, for historical purposes.

Will
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform