I'd like to propose some changes in how we report and triage Talos alerts. Over the past couple of years, Joel Maher (with occasional assistance from me and others) has taken over the job of triaging and responding to ("sheriffing") Talos regressions. He's done this through a bunch of existing systems:

1. Graphserver: https://graphs.mozilla.org - A system for visualizing the results of Talos tests.
2. Graphserver alerts: A system that monitors Graphserver for sustained regressions and improvements in particular Talos tests, and emails developers and posts to a newsgroup (m.d.tree-alerts) when it detects them.
3. AlertManager: http://alertmanager.allizom.org:8080/alerts.html - A system Joel created that ingests the emails produced by Graphserver alerts.

Over the past year, I've been working on a system called Perfherder, which aims to subsume the above functionality, streamlining this process and making it easier for developers to understand and respond to these reports. The final pieces have been coming together, the last one being a better interface for triaging and responding to alerts, which you can see here:

https://treeherder.mozilla.org/perf.html#/alerts

We've found that the automated alert emails have not been an effective way of getting developers to respond to performance regressions. Indeed, they may have had the opposite effect: because there are so many "downstream" alerts (notifications produced due to merges and uplifts), people have largely been ignoring them, thinking that Graphserver alerts (and perhaps Talos in general?) are "just noise".

This isn't true: the alert subsystem *does* produce correct reports of regressions and improvements in most cases (there are certainly exceptions), but I think history has shown that we need at least some hands-on work by a performance sheriff (someone like Joel) to triage these results and file bugs appropriately. Over the past few years, doing this with the help of the AlertManager dashboard has proven much more effective at getting results than the automated emails, so we're going to continue with that approach as we transition over to Perfherder.

Here's my rough roadmap for changing the current system:

1. Effective immediately (well, as soon as possible): Stop emailing developers Graphserver regression reports. These reports will continue to be sent to mozilla.dev.tree-alerts (mostly so that they can be picked up by the existing AlertManager system while we transition over to Perfherder).
2. End of January: Sheriffs will start using Perfherder to triage performance alerts and file bugs (we're almost ready to do this now, pending a few important features, like prepopulating a bug based on a template). Hopefully the only thing developers will notice is easier-to-understand bugs, thanks to the various improvements of Perfherder over Graphserver. :)
3. End of Q1: After 2 months of running side-by-side with Perfherder, we will stop submitting Talos data to Graphserver. Graphserver will keep on running read-only, for historical purposes.

Will