Proposed changes to Talos (performance) alerting

William Lachance Wed, 06 Jan 2016 14:55:43 -0800

I'd like to propose some changes in how we report and triage Talos alerts. Over the past couple years, Joel Maher (with occasional assistance from myself and others) has taken over the job of triaging and responding to ("sheriffing") Talos regressions. He's done this through a bunch of existing systems:

1. Graphserver: https://graphs.mozilla.org - a system for visualizing the results of Talos tests
2. Graphserver alerts: A system that monitors Graphserver for sustained regressions and improvements in particular Talos tests, and emails developers and posts to a newsgroup (m.d.tree-alerts) when it detects them.
3. AlertManager: http://alertmanager.allizom.org:8080/alerts.html - A system Joel created that ingests the emails produced by Graphserver alerts

Over the past year, I've been working on a system called Perfherder which aims to subsume the above functionality, streamlining this process and making it easier for developers to understand and respond to these reports. The final pieces have been coming together, the last one being a better interface for triaging and responding to alerts, which you can see here:


https://treeherder.mozilla.org/perf.html#/alerts

We've found that the automated alert emails have not been an effective way of getting developers to respond to performance regressions. Indeed, they might have had the opposite effect: because there are so many "downstream" alerts (notifications produced due to merges and uplifts) people have been largely ignoring them, thinking that Graphserver alerts (and perhaps Talos in general?) are "just noise".

In fact, this isn't true: the alert subsystem *does* produce correct reports of regressions and improvements in most cases (there are certainly exceptions), but I think history has shown that we need at least some hands-on work by a performance sheriff (someone like Joel) to triage these results and file bugs appropriately. Doing this with the help of the AlertManager dashboard has proven much more effective than the automated e-mails for getting results over the past few years, so we're going to continue with that approach as we transition over to Perfherder.


Here's my rough roadmap for changing the current system:

1. Effective immediately (well, as soon as possible): Stop emailing developers Graphserver regression reports. These reports will continue to be sent to mozilla.dev.tree-alerts (mostly so that they can be picked up by the existing AlertManager system while we transition over to Perfherder).
2. End of January: Sheriffs will start using Perfherder to triage performance alerts and file bugs (we're almost ready to do this now, pending a few important features, like prepopulating a bug based on a template). Hopefully the only thing developers will notice is easier-to-understand bugs due to the various improvements of Perfherder over Graphserver. :)
3. End of Q1: After 2 months of running side-by-side with Perfherder, we will stop submitting Talos data to Graphserver. Graphserver will keep on running read-only, for historical purposes.


Will
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
