The TaskCluster team's goal is to provide a reliable, efficient system to
support software engineers working to make Mozilla's products awesome. Tree
closures prevent engineers from getting that work done. The team treats
tree closure downtime as a critical metric and treats any bugs that arise
as high-priority items.

Over the past week there have been several prolonged tree closures linked
to TaskCluster. To address this issue, the TaskCluster team, together with
gps, met to identify some of the root causes and next steps for preventing
this in the future.


During our retrospective on last week, the team identified some common
themes that can lead to a tree closure. These include:

   1. Misconfiguration of worker types
      - Symptoms: large backlogs for a given worker type

   2. Failure within the platform causing tasks not to be scheduled, or to
      be scheduled too often
      - Symptoms vary widely: dependent tasks not being scheduled, a single
        retrigger causing multiple retriggers, or worker backlogs not being
        cleared as quickly as they should be

   3. High peak load causing large backlogs (reopening of trees, a batch of
      pushes, etc.)
      - Symptoms: tasks not being claimed and executed in a timely manner,
        large backlogs


To help prevent downtime in the future, the team has a list of work it will
be addressing. The following bugs have been filed and will see activity in
the coming days:

   - Bug 1295173 - delete unused workerTypes

   - Bug 1292647, 1264956 - increase validation of worker type definitions by
     the provisioner

   - Bug 1295179 - add lib-monitor library to provisioner and add alerts for
     pending backlog spikes

   - Bug 1295180 - Add metrics for oldest scheduled date of a task in the
     backlog

   - Bug 1295181 - Review and adjust existing alerts/metrics for pending
     counts/timing to ensure they are accurate

   - Bug 1295184 - set worker terminate rate to 2 billing cycles of idle

   - Bug 1295185 - work with AWS solutions architect on increasing rate
     limits and investigate alternative spot request options
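
For anyone curious what the backlog metrics and alerts in Bug 1295179 and
Bug 1295180 might look like in practice, here is a rough sketch. It is only
an illustration: the function names, the spike factor, and the minimum
pending count are all assumptions made up for this example, and none of it
is the provisioner's or lib-monitor's actual API.

```python
from datetime import datetime, timedelta, timezone


def oldest_pending_age(scheduled_times, now):
    """Return how long the oldest pending task has been waiting.

    scheduled_times: datetimes at which pending tasks were scheduled
    (hypothetical input; a real service would read these from its queue).
    Returns a zero timedelta when the backlog is empty.
    """
    if not scheduled_times:
        return timedelta(0)
    return now - min(scheduled_times)


def is_backlog_spike(pending_history, current_pending, factor=3.0, floor=100):
    """Flag a spike when the pending count is well above its recent average.

    factor and floor are illustrative thresholds, not values from the bugs.
    The floor keeps tiny worker types from alerting on small absolute counts.
    """
    if current_pending <= floor:
        return False
    if not pending_history:
        # No history to compare against: exceeding the floor is enough.
        return True
    baseline = sum(pending_history) / len(pending_history)
    return current_pending > factor * baseline


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    scheduled = [now - timedelta(minutes=5), now - timedelta(minutes=90)]
    print("oldest pending age:", oldest_pending_age(scheduled, now))
    print("spike:", is_backlog_spike([100, 120, 110], 900))
```

Comparing against a recent average rather than a fixed number is one way to
cope with worker types whose normal backlog sizes differ a lot; a fixed
threshold per worker type would be a reasonable alternative.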



Additionally, the team has discussed some longer-term goals that it will be
working on. These include improved communication plans and alert processes,
as well as increasing the rate at which we can provide machines to work on
tasks.

As always, feel free to reach out to the team in #taskcluster to discuss.

-Greg
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
