The TaskCluster team's goal is to provide a reliable, efficient system to support software engineers working to make Mozilla's products awesome. Tree closures prevent engineers from getting that work done. The team treats tree closure downtime as a critical metric and treats any bugs that arise as high priority items.

Over the past week there have been several prolonged tree closures linked to TaskCluster. To address this, the TaskCluster team, together with gps, met to identify some of the root causes and next steps for preventing this in the future.

During our retrospective of last week, the team identified some common themes that can lead to a tree closure. These include:

1. Misconfiguration of worker types
   - symptoms: large backlogs for a given worker type
2. Failure within the platform causing tasks to not be scheduled, or to be scheduled too often
   - symptoms: these range widely: dependent tasks not being scheduled, a single retrigger causing multiple retriggers, or worker backlogs not being cleared as quickly as they should be
3. High peak load causing large backlogs (reopening of trees, batch of pushes, etc.)
   - symptoms: tasks not being claimed and executed in a timely manner, large backlogs

To help prevent downtime in the future, we have a list of work that the team will be addressing. The following bugs have been filed and will see activity in the coming days:

- Bug 1295173 - delete unused workerTypes
- Bug 1292647, 1264956 - increase validation of worker type definitions by the provisioner
- Bug 1295179 - add lib-monitor library to provisioner and add alerts for pending backlog spikes
- Bug 1295180 - add metrics for oldest scheduled date of a task in the backlog
- Bug 1295181 - review and adjust existing alerts/metrics for pending counts/timing to ensure they are accurate
- Bug 1295184 - set worker terminate rate to 2 billing cycles of idle
- Bug 1295185 - work with AWS solutions architect on increasing rate limits and investigate alternative spot request options

Additionally, the team has discussed some longer term goals that it will be working on. These include improved communication plans and alert processes, as well as increasing the rate at which we can provide machines to work on tasks.
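
As a rough illustration of the kind of check the monitoring bugs above (Bug 1295179 and Bug 1295180) are aiming for, here is a minimal Python sketch. It is not the provisioner's code and does not use lib-monitor; the `PendingTask` type, the function names, and the thresholds (1000 pending tasks, 30 minutes) are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Sequence


@dataclass
class PendingTask:
    """Hypothetical record for a task waiting in a worker type's backlog."""
    task_id: str
    scheduled: datetime  # when the task became pending


def oldest_pending_age(tasks: Sequence[PendingTask], now: datetime) -> Optional[timedelta]:
    """Return how long the oldest pending task has waited, or None if the backlog is empty."""
    if not tasks:
        return None
    return now - min(t.scheduled for t in tasks)


def backlog_alerts(
    tasks: Sequence[PendingTask],
    now: datetime,
    max_pending: int = 1000,                      # assumed threshold
    max_age: timedelta = timedelta(minutes=30),   # assumed threshold
) -> List[str]:
    """Return alert messages for a backlog that is too large or too old."""
    alerts: List[str] = []
    if len(tasks) > max_pending:
        alerts.append(f"pending count {len(tasks)} exceeds {max_pending}")
    age = oldest_pending_age(tasks, now)
    if age is not None and age > max_age:
        alerts.append(f"oldest pending task has waited {age}, limit is {max_age}")
    return alerts


if __name__ == "__main__":
    now = datetime(2016, 8, 15, 12, 0, tzinfo=timezone.utc)
    backlog = [
        PendingTask("task-a", now - timedelta(hours=1)),
        PendingTask("task-b", now - timedelta(minutes=10)),
    ]
    for message in backlog_alerts(backlog, now, max_pending=1):
        print(message)
```

Tracking the age of the oldest pending task alongside the raw count matters because a small backlog that never drains (theme 2 above) would not trip a count-based alert on its own.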

As always, feel free to reach out to the team in #taskcluster to discuss.

-Greg
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform