I'd like to get some feedback on changes that we (release engineering and the taskcluster team) are planning around how we build Firefox. I apologize that this is a bit long, as I'm trying to include the necessary background. I have some questions at the end about which I'd like to have some discussion.
Before I get into it, a word of apology. We haven't done a great job of talking about this work -- I've talked to many members of the build module individually, but in so doing perhaps not shared all of the required background or established a common understanding. It's easy to get so deeply into something that you assume everyone knows about it, and I fear that's what may have happened. So if some of this comes as a surprise or feels like a fait accompli, I apologize. It's certainly not finished, and it's all software, so we can always change it. We're working to be more communicative and inclusive in the future.

= Buildbot and TaskCluster =

As you may know, we currently use Buildbot to schedule build (and test) jobs across all of our platforms. This has had a number of issues, including difficulty in reproducing the build or test environments outside of Buildbot, difficulty in testing or deploying changes to the build process (especially around scheduling and host configuration), and difficulty scaling. One of the issues we struggled with internally was making requested upgrades to otherwise "frozen" build systems: often the upgrade simply wasn't available for the ancient version of the operating system we were running. You may have your own issues to add to this list -- I'd be interested to hear them, to see if we are addressing them, or can address them, in this change!

During the development of Firefox OS, though, a parallel system called TaskCluster (https://docs.taskcluster.net) was developed. It's a cloud-first job-scheduling system designed with simplicity in mind: tasks go in as structured data and run, producing logs, artifacts, and even other tasks. Here's a list of the design goals for TaskCluster:

* Establish a developer-first workflow for builds and tests.
* Speed and flexibility of deployment: minimize server-side configuration requirements; new TaskCluster platform tasks can be deployed in days or weeks, where the Buildbot platform takes months or quarters.
* Reproducibility: the ability to define the complete OS environment / image outside of Releng- or IT-controlled infrastructure, and increased transparency in deployment.
* Self-service: no Releng involvement needed to change tasks; scheduling is defined in-tree.
* Extensibility: a general platform for developing tools we haven't thought of yet.

The bit of TaskCluster that is probably most salient for this audience is this: a TaskCluster task essentially boils down to "run this shell command in this operating system image," and the scheduling system gets both the shell command and the operating system image from the gecko tree itself. That means you can change just about everything about the build process, including the operating system (host environment), from in-tree -- even by pushing to try! As an example, here [1] is a recent try job of mine.

We are currently in the process of transitioning from Buildbot to TaskCluster. This is, of course, an enormous project, and requires a lot of hard work and attention from Releng, the A-Team, the TaskCluster team, and other teams impacted by the changes. I also think it's going to be enormously rewarding, and will free us of a lot of the constraints and issues I mentioned above. Ideally everyone wins -- build, Releng, A-Team, TaskCluster, all developers, even IT. So if you see something here as "losing", aside from the inevitable friction of change, please speak up.

= Linux Builds =

Zooming in a little bit, let's talk about Linux builds of Firefox Desktop. Mac, Windows, Fennec, B2G, etc. are all in various states of progress, and we can talk about those too, but my focus right now is on Linux builds, and Glandium has raised some questions about them that I'd like to address here.

For tasks that run on Linux, we can use Docker, which means the "operating system image" is a Docker image. In fact, we have a method for building those Docker images from in-tree specifications, and plans[2] to support automatically rebuilding them on pushes to try. I've built a working CentOS 6.7 image specification[3] based on the mock environments used in Buildbot, and I'm working on greening that up and putting it into Treeherder as a Tier-2 build.
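Since a task boils down to "run this command in this image," and the image is an ordinary Docker image built from an in-tree specification, a developer can replicate the production build environment locally just by pulling and running that image. The image name and tag below are made up for illustration -- the real values appear in the task definition:

  # Pull the exact image a task ran in (the name/tag here are hypothetical).
  docker pull taskcluster/centos6-build:0.0.1

  # Get an interactive shell in that same environment...
  docker run -ti taskcluster/centos6-build:0.0.1 bash

  # ...and run the same in-tree build script the task would have run (more
  # on that script below), with the same environment variables the task sets.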
Mock is not used at all -- the build runs directly in the Docker image, using a "worker" user account. TaskCluster invokes a script that's baked into the Docker image, which knows enough to check out the required revisions of the required sources (build/tools and gecko) and then execute an in-tree script (testing/taskcluster/scripts/builder/build-linux.sh). That build-linux.sh translates a whole bunch of parameters from environment variables into Mozharness configuration, then invokes the proper Mozharness script, which performs the build as usual. All of that is easily tested in try -- that's what I've been doing for a few weeks now!

This approach has lots of advantages, and (we think) solves a few of the issues I mentioned above:

* Since everything is in-tree, everyone can self-serve. There's no need to wait on resources from another team to modify the build environment, e.g., to upgrade valgrind.
* Since everything is in-tree, it can be handled like any other commit: tried, backed out, bisected, put on trains, merged, etc.
* Downloading and running a Docker image is a well-known process, so it's easy for devs to precisely replicate the production build environment when necessary.
* Inputs and outputs are precisely specified, so builds are repeatable. And because each gecko revision specifies exactly the Docker image used to build it, you can even bisect over host configuration changes!

To address the issue of difficult upgrades, and to support the security team's desire that we not run insecure versions of packages, I have suggested that we rebuild the Docker images weekly, regardless of whether there are configuration changes to the images. This would incorporate any upstream package updates, but not upgrade to a new major distro version (so no unexpected upgrade to CentOS 7). Mechanically, a bumper script would increment a "VERSION" file somewhere in-tree, causing an automatic rebuild[2] of the image. The bump would thus show up in Treeherder, in version-control history, and in Perfherder, and could be bisected over and blamed for build or test failures or performance regressions, just like any other changeset. Reverting the changeset would revert to the earlier image, and the changeset would ride the trains just like any change.
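To give a sense of how small that weekly bump would be, the bumper script could amount to something like the sketch below. The VERSION file's location and the commit message are placeholders -- those details aren't settled yet:

  # Sketch only: the VERSION file's location and commit message are hypothetical.
  f=testing/docker/centos6-build/VERSION
  new=$(( $(cat "$f") + 1 ))
  echo "$new" > "$f"
  hg commit -m "No bug - periodic rebuild of the centos6-build docker image"
  # Landing this changeset triggers the automatic image rebuild[2]; backing
  # it out reverts to the previous image, and it rides the trains as usual.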
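One more sketch, circling back to the build flow described a few paragraphs up: the translation that build-linux.sh performs is conceptually simple. The variable and config-file names below are illustrative rather than exact (the real ones live in the in-tree script and the task definitions), but the shape is roughly:

  # Illustrative only -- these are not the real variable or config names.
  # The task definition hands parameters to the image as environment vars...
  : ${MOZHARNESS_SCRIPT:=fx_desktop_build.py}
  : ${MOZHARNESS_CONFIG:=builds/releng_base_linux_64_builds.py}

  # ...and build-linux.sh turns them into a Mozharness invocation, which
  # performs the build the same way it does under Buildbot today.
  python2.7 mozharness/scripts/"$MOZHARNESS_SCRIPT" \
      --config "$MOZHARNESS_CONFIG"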
= Questions =

Glandium has already raised a few questions about this plan. I'll list them here, but reserve my suggestions for a later reply. Please do respond to these questions, and add any other comments or questions that you might have, so that we can identify, discuss, and address any other risks or downsides to this approach.

1. A weekly bump may mean that devs trying to use the latest-and-greatest are constantly downloading new docker images, which can be large.

2. Glandium has also expressed some concern about the way testing/docker/centos6-build/system-setup.sh installs software: downloading source tarballs and building them directly. Alternatives are to download hand-built RPMs directly (as is done for valgrind, yasm, and freetype, since they were already available in that format) or to host and maintain a custom Yum repository.

= Thanks =

Thanks for reading this long email! I look forward to hearing your thoughts.

Dustin

= References =

[1] https://tools.taskcluster.net/task-inspector/#FAaH6Fc4TpSObTm6T_JEYw/0
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1132346
[3] https://bugzilla.mozilla.org/show_bug.cgi?id=1189892