I'd like to get some feedback on changes that we (release engineering
and the TaskCluster team) are planning around how we build Firefox.  I
apologize that this is a bit long, but I'm trying to include the
necessary background.  There are some questions at the end that I'd
like to discuss.

Before I get into it, a word of apology.  We haven't done a great job
of talking about this work: I've talked to many members of the build
module individually, but in so doing I perhaps haven't shared all of
the required background or established a common understanding.  It's
easy to get so deeply into
something that you assume everyone knows about it, and I fear that's
what may have happened.  So if some of this comes as a surprise or
feels like a fait accompli, I apologize.  It's certainly not finished,
and it's all software, so we can always change it.  We're working to
be more communicative and inclusive in the future.

= Buildbot and TaskCluster =

As you may know, we currently use Buildbot to schedule build (and
test) jobs across all of our platforms.  This has had a number of
issues, including difficulty in reproducing the build or test
environments outside of Buildbot, difficulty in testing or deploying
changes to the build process (especially around scheduling and host
configuration), and difficulty scaling.  One of the issues we
struggled with internally was the difficulty of making requested
upgrades to otherwise "frozen" build systems: often the requested
upgrade was simply not available for the ancient version of the
operating system we were running.

You may have your own issues to add to this list -- I'd be interested
to hear them, to see if we are addressing them, or can address them,
in this change!

During the development of Firefox OS, though, a parallel system
called TaskCluster (https://docs.taskcluster.net) emerged.  It's a
cloud-first job-scheduling system designed with simplicity in mind:
tasks go in as structured data and run, producing logs, artifacts, and
even other tasks.  Here are the design goals for TaskCluster:

 * Establish a developer-first workflow for builds and tests.
 * Speed and flexibility of deployment: minimize server-side
configuration requirements; new TaskCluster platform tasks can be
deployed in days or weeks, where the Buildbot platform takes months or
quarters.
 * Reproducibility: the ability to define the complete OS environment
or image outside of Releng- or IT-controlled infrastructure, and
increased transparency in deployment.
 * Self-service: no Releng involvement needed to change tasks, and
scheduling is defined in-tree.
 * Extensibility: a general platform for building tools we haven't
thought of yet.

The bit of TaskCluster that is probably most salient for this audience
is this: a TaskCluster task essentially boils down to "run this shell
command in this operating system image," and the scheduling system
gets both the shell command and the operating system image from the
gecko tree itself.  That means you can change just about everything
about the build process, including the operating system (host
environment), from in-tree -- even by pushing to try!  As an example,
here [1] is a recent try job of mine.
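
To make that concrete, here is a rough sketch of what a task "boils
down to" on the worker side.  The image name and command below are
made-up examples, not the real in-tree values:

    # Conceptually, the worker pulls the image named by the task and
    # runs the task's command inside it -- roughly equivalent to:
    docker run --rm example-registry/desktop-build:0.1 \
      bash -c '/home/worker/bin/build.sh'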

We are currently in the process of transitioning from Buildbot to
TaskCluster.  This is, of course, an enormous project, and requires a
lot of hard work and attention from Releng, A-Team, the TaskCluster
team, and from other teams impacted by the changes.  I also think it's
going to be enormously rewarding, and free us of a lot of the
constraints and issues I mentioned above.  Ideally everyone wins --
build, Releng, A-Team, TaskCluster, all developers, even IT.  So if
you see something here as "losing", aside from the inevitable friction
of change, please speak up.

= Linux Builds =

Zooming in a little bit, let's talk about Linux builds of Firefox
Desktop.  Mac, Windows, Fennec, B2G, etc. are all in various states of
progress, and we can talk about those too.  My focus right now is on
Linux builds, and Glandium has raised some questions about them that
I'd like to address here.

For tasks that run on Linux, we can use Docker.  That means that the
"operating system image" is a Docker image.  In fact, we have a method
for building those Docker images using in-tree specifications and
plans[2] to support automatically rebuilding them on pushes to try.
I've built a working CentOS 6.7 image specification[3] based on the
mock environments used in Buildbot, and I'm working on greening that
up and putting it into Treeherder as a Tier-2 build.
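
As an aside, if you have Docker installed you can build such an image
yourself from its in-tree specification -- the directory below is the
one holding the system-setup.sh script mentioned later in this mail,
though treat the exact invocation as a sketch rather than gospel:

    # Build the CentOS 6.7 build image from its in-tree definition
    docker build -t centos6-build testing/docker/centos6-build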

Mock is not used at all -- the build runs directly in the Docker
image, using a "worker" user account.  TaskCluster invokes a script
baked into the Docker image; that script knows enough to check out the
required revisions of the required sources (build/tools and gecko) and
then execute an in-tree script
(testing/taskcluster/scripts/builder/build-linux.sh).  That
build-linux.sh translates a whole bunch of parameters from environment
variables into Mozharness configuration, then invokes the proper
Mozharness script, which performs the build as usual.  All of that is
easily tested in try -- that's what I've been doing for a few weeks
now!
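
To give a feel for that translation step, here's a heavily simplified
sketch.  The variable names, script name, and paths are examples of
the sort of thing involved, not necessarily the exact values used
in-tree:

    # build-linux.sh, roughly: turn task-supplied environment variables
    # into a Mozharness invocation
    : ${MOZHARNESS_SCRIPT:?the task definition must set this}
    : ${MOZHARNESS_CONFIG:?the task definition must set this}

    cd $HOME/workspace
    python2.7 tools/mozharness/scripts/$MOZHARNESS_SCRIPT \
      --config-file $MOZHARNESS_CONFIG \
      --work-dir $HOME/workspace/build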

This approach has lots of advantages, and (we think) solves a few of
the issues I mentioned above:

 * Since everything is in-tree, everyone can self-serve.  There's no
need to wait on resources from another team to modify the build
environment, e.g., to upgrade valgrind.

 * Since everything is in-tree, it can be handled like any other
commit: tried, backed out, bisected, put on trains, merged, etc.

 * Downloading and running a Docker image is a well-known process, so
it's easy for devs to precisely replicate the production build
environment when necessary (see the sketch after this list).

 * Inputs and outputs are precisely specified, so builds are
repeatable.  And because each gecko revision specifies exactly the
docker image used to build it, you can even bisect over host
configuration changes!
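
The "replicate the build environment" point deserves a concrete
sketch.  The registry, image name, and tag below are placeholders --
the idea is just that the image a given revision used can be pulled
and entered directly:

    # Pull the exact image the automation used for a revision, then get
    # an interactive shell inside it to check out and build by hand
    docker pull example-registry/centos6-build:0.0.13
    docker run -ti --rm example-registry/centos6-build:0.0.13 bash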

To address the issue of difficult upgrades, and to support the
security team's desire that we not run insecure versions of packages,
I have suggested that we rebuild the Docker images weekly, regardless
of whether there are configuration changes to the images.  This would
incorporate any upstream package updates, but not upgrade to a new
major distro version (so no unexpected upgrade to CentOS 7).
Mechanically, a bumper script would increment a "VERSION" file
somewhere in-tree, causing an automatic rebuild[2] of the image.  Thus
the "bump" would show up in Treeherder, in version-control history,
and in Perfherder, and could be bisected over and blamed for build or
test failures or performance regressions, just like any other
changeset.  Reverting the changeset would revert to the earlier image.
The changeset would ride the trains just like any other change.
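
For illustration, the bumper could be something as small as the
following, assuming (purely for simplicity) that VERSION holds a bare
integer and lives alongside the image definition:

    #!/bin/sh -e
    # Hypothetical weekly bumper: increment the image's VERSION file and
    # commit the change, which triggers an automatic rebuild of the image.
    FILE=testing/docker/centos6-build/VERSION
    new=$(( $(cat $FILE) + 1 ))
    echo $new > $FILE
    hg commit -m "Bump centos6-build image to version $new (weekly rebuild)" $FILE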

= Questions =

Glandium has already raised a few questions about this plan.  I'll
list them here, but reserve my suggestions for a later reply.  Please
do respond to these questions, and add any other comments or questions
that you might have so that we can identify, discuss, and address any
other risks or downsides to this approach.

1. A weekly bump may mean devs trying to use the latest-and-greatest
are constantly downloading new docker images, which can be large.

2. Glandium has also expressed some concern about the way
testing/docker/centos6-build/system-setup.sh installs software:
downloading source tarballs and building them directly.  Alternatives
are to download hand-built RPMs directly (as is done for valgrind,
yasm, and freetype, since they were already available in that format)
or to host and maintain a custom Yum repository.
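
To make the options in question 2 concrete, here's what each style
looks like in a setup script.  URLs, package names, and versions are
placeholders:

    # (a) build from a source tarball, as system-setup.sh largely does now
    curl -LO https://example.com/sources/tool-1.2.3.tar.gz
    tar xzf tool-1.2.3.tar.gz
    cd tool-1.2.3 && ./configure && make && make install && cd ..

    # (b) install a hand-built RPM, as is done for valgrind, yasm, freetype
    curl -LO https://example.com/rpms/tool-1.2.3-1.el6.x86_64.rpm
    yum localinstall -y --nogpgcheck tool-1.2.3-1.el6.x86_64.rpm

    # (c) with a custom Yum repository configured in the image, it's just
    yum install -y tool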

= Thanks =

Thanks for reading this long email!  I look forward to hearing your thoughts.

Dustin

= References =

[1] https://tools.taskcluster.net/task-inspector/#FAaH6Fc4TpSObTm6T_JEYw/0
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1132346
[3] https://bugzilla.mozilla.org/show_bug.cgi?id=1189892