First, thank you for spending the time to compose this. It is an
excellent write-up. Responses inline.
On Fri, Sep 4, 2015 at 1:24 PM, Dustin Mitchell <dus...@mozilla.com> wrote:
I'd like to get some feedback on changes that we (release engineering
and the taskcluster team) are planning around how we build Firefox. I
apologize that this is a bit long, as I'm trying to include the
necessary background. I have some questions at the end about which
I'd like to have some discussion.
Before I get into it, a word of apology. We haven't done a great job
of talking about this work --
I've talked to many members of the build module individually, but in
so doing perhaps not shared all of the required background or
established a common understanding. It's easy to get so deeply into
something that you assume everyone knows about it, and I fear that's
what may have happened. So if some of this comes as a surprise or
feels like a fait accompli, I apologize. It's certainly not finished,
and it's all software, so we can always change it. We're working to
be more communicative and inclusive in the future.
= Buildbot and TaskCluster =
As you may know, we currently use Buildbot to schedule build (and
test) jobs across all of our platforms. This has had a number of
issues, including difficulty in reproducing the build or test
environments outside of Buildbot, difficulty in testing or deploying
changes to the build process (especially around scheduling and host
configuration), and difficulty scaling. One of the issues we struggled
with internally was the difficulty in making requested upgrades to
otherwise "frozen" build systems: often the available upgrade was not
available for the ancient version of the operating system we were
runnning.
You may have your own issues to add to this list -- I'd be interested
to hear them, to see if we are addressing them, or can address them,
in this change!
During the development of Firefox OS, though, another parallel system
called TaskCluster (https://docs.taskcluster.net) was developed. It's
a cloud-first job-scheduling system designed with simplicity in mind:
tasks go in as structured data and run, producing logs, artifacts, and
even other tasks. Here's a list of the design goals for TaskCluster:
* Establish a developer-first workflow for builds and tests.
* Speed/flexibility of deployment: Minimize server-side config
requirements; new TaskCluster platform tasks can be deployed in days
or weeks, while the Buildbot platform takes months or quarters.
* Reproducibility: Ability to define complete OS environment / image
outside of Releng- or IT-controlled infra; Increased transparency in
deployment
* Self-service: no releng needed to change any tasks, in-tree
scheduling
* Extensibility: Provide a general platform for building tools
we haven't thought of yet
The bit of TaskCluster that is probably most salient for this audience
is this: a TaskCluster task essentially boils down to "run this shell
command in this operating system image," and the scheduling system
gets both the shell command and the operating system image from the
gecko tree itself. That means you can change just about everything
about the build process, including the operating system (host
environment), from in-tree -- even by pushing to try! As an example,
here [1] is a recent try job of mine.
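To make that concrete, the conceptual shape of a task is roughly the
following (the image name and command here are made up for
illustration, not the actual in-tree definitions):

    # Conceptually, a docker-worker task reduces to: run this command
    # in this image, where both come from the gecko tree.
    docker run --rm taskcluster/centos6-build:0.1.0 \
        bash -c 'checkout-sources.sh && build-linux.sh'

The real task definitions carry more metadata (scopes, artifacts,
routes, and so on), but the command-in-an-image core is the part that
matters here.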
We are currently in the process of transitioning from Buildbot to
TaskCluster. This is, of course, an enormous project, and requires a
lot of hard work and attention from Releng, A-Team, the TaskCluster
team, and from other teams impacted by the changes. I also think it's
going to be enormously rewarding, and free us of a lot of the
constraints and issues I mentioned above. Ideally everyone wins --
build, Releng, A-Team, TaskCluster, all developers, even IT. So if
you see something here as "losing", aside from the inevitable friction
of change, please speak up.
= Linux Builds =
Zooming in a little bit, let's talk about Linux builds of Firefox
Desktop. Mac, Windows, Fennec, B2G etc. are all in various states of
progress, and we can talk about those too. My focus right now is on
Linux builds, and Glandium has raised some questions about them that
I'd like to address here.
For tasks that run on Linux, we can use Docker. That means that the
"operating system image" is a Docker image. In fact, we have a method
for building those Docker images using in-tree specifications and
plans[2] to support automatically rebuilding them on pushes to try.
I've built a working CentOS 6.7 image specification[3] based on the
mock environments used in buildbot, and I'm working on greening that
up and putting it into TreeHerder as a Tier-2 build.
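To give a flavor of what those in-tree specifications look like, here
is a stripped-down sketch (the contents are illustrative, not the
actual centos6-build definition, which lives under testing/docker/ in
the tree):

    # Dockerfile (illustrative only):
    FROM centos:6.7
    ADD system-setup.sh /tmp/system-setup.sh
    RUN bash /tmp/system-setup.sh   # install toolchain, create 'worker' user
    USER worker

    # Build and tag the image locally for testing:
    docker build -t centos6-build:test testing/docker/centos6-build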
Mock is not used at all -- the build runs directly in the docker
image, using a "worker" user account. Taskcluster invokes a script
that's baked into the docker image, which knows enough to check out the
required revisions of the required sources (build/tools and gecko),
then execute an in-tree script
(testing/taskcluster/scripts/builder/build-linux.sh). That
build-linux.sh translates a whole bunch of parameters from environment
variables into Mozharness configuration, then invokes the proper
Mozharness script, which performs the build as usual. All of that is
easily tested in try -- that's what I've been doing for a few weeks
now!
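Roughly, the hand-off from environment variables to Mozharness looks
something like this (the variable names and invocation are an
illustrative sketch; the real logic is in the in-tree build-linux.sh):

    # Sketch: the task definition says which mozharness script and
    # config(s) to use; the wrapper just forwards them.
    : ${MOZHARNESS_SCRIPT:?set by the task definition}
    : ${MOZHARNESS_CONFIG:?set by the task definition}

    cd $WORKSPACE/build
    python mozharness/scripts/$MOZHARNESS_SCRIPT \
        $(for cfg in $MOZHARNESS_CONFIG; do echo --config-file $cfg; done)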
This approach has lots of advantages, and (we think) solves a few of
the issues I mentioned above:
* Since everything is in-tree, everyone can self-serve. There's no
need to wait on resources from another team to modify the build
environment, e.g., to upgrade valgrind.
* Since everything is in-tree, it can be handled like any other
commit: tried, backed out, bisected, put on trains, merged, etc.
This is all fantastic. We've wanted to move in this direction for years.
It will enable all kinds of people to experiment with new and crazy
ideas without having to bother anybody on the automation side of things
(in theory). It should make possible changes and experiments that were
otherwise too costly to perform. It is huge for productivity.
* Downloading and running a docker image is a well-known process, so
it's easy for devs to precisely replicate the production build
environment when necessary (see the sketch just after this list)
* Inputs and outputs are precisely specified, so builds are
repeatable. And because each gecko revision specifies exactly the
docker image used to build it, you can even bisect over host
configuration changes!
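To illustrate that replication point: once you know which image a
revision references, reproducing the production build environment is
just a pull and a run (the image name and tag here are hypothetical;
use whatever the tree references for the revision you care about):

    # Pull the exact image the revision was built with and get a shell
    # inside it to inspect or re-run the build steps by hand.
    docker pull quay.io/mozilla/centos6-build:0.1.5
    docker run -ti --rm quay.io/mozilla/centos6-build:0.1.5 bash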
To address the issue of difficult upgrades, and to support the
security team's desire that we not run insecure versions of packages,
I have suggested that we rebuild the docker images weekly, regardless
of whether there are configuration changes to the images. This would
incorporate any upstream package updates, but not upgrade to a new
major distro version (so no unexpected upgrade to CentOS 7).
Mechanically, a bumper script would increment a "VERSION" file
somewhere in-tree, causing an automatic rebuild[2] of the image. Thus
the "bump" would show up in treeherder, in version-control history,
and in perfherder and could be bisected over and blamed for build or
test failures or performance regressions, just like any other
changeset. Reverting the changeset would revert to the earlier image.
The changeset would ride the trains just like any change.
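For concreteness, the bumper could be as simple as something like the
following (the path, versioning scheme, and commit message are
illustrative; no such script exists yet):

    # Hypothetical weekly bumper: touch the in-tree VERSION file so the
    # image gets rebuilt and the change rides the trains like any other.
    file=testing/docker/centos6-build/VERSION
    next=$(($(cat $file) + 1))
    echo $next > $file
    hg commit -m "No bug - bump centos6-build image to $next for weekly package refresh"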
= Questions =
Glandium has already raised a few questions about this plan. I'll
list them here, but reserve my suggestions for a later reply. Please
do respond to these questions, and add any other comments or questions
that you might have so that we can identify, discuss, and address any
other risks or downsides to this approach.
1. A weekly bump may mean devs trying to use the latest-and-greatest
are constantly downloading new docker images, which can be large.
2. Glandium has also expressed some concern about the way
testing/docker/centos6-build/system-setup.sh is installing software:
downloading source tarballs and building them directly. Other
alternatives are to download hand-built RPMs directly (as is done for
valgrind, yasm, and freetype, since they were already available in
that format) or to host and maintain a custom Yum repository.
My professional opinion is that the layered images approach used by
Docker out of the box is extremely sub-optimal and should be avoided at
all costs. One of the primary reasons is that you end up having to
download gigabytes of image layers and/or distro packages over and over
and over again. This will be a very real concern for developers who
don't have super fast Internet connections. This includes some Mozilla
offices. A one-time cost to obtain the source code and system
dependencies, plus the ongoing cost to keep these up to date, is
acceptable. But many barely tolerate it today. If we throw Docker's
inefficiency into the mix, I worry about the consequences.
I think it is a worthwhile investment to build out a Docker image
management infrastructure that doesn't abuse the Internet so much. This
almost certainly entails caching tarballs, packages, etc. locally and
then having the image build process leverage that local cache. I've
heard of a few projects that are basically transparent Yum/Apt/PyPI
caching proxies that do just this. Unfortunately, I can't find links to
them right now. There are also efforts to invent better Docker image
building techniques. I chatted with someone from Red Hat about this a few
months ago and it sounded like we were on the same page about an
approach that composes images from cached/shared assets. I /think/
Project Atomic (http://www.projectatomic.io/) was looking into things
like building Docker images using the Yum "database" and cached RPMs
from the host machine. Not sure how far they got.
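As a rough illustration of what I mean (the proxy host is hypothetical,
and this assumes a Yum-based image like the centos6-build one):

    # During the image build, point yum at a local caching proxy so that
    # repeated rebuilds don't re-download the same RPMs over the Internet.
    echo "proxy=http://package-cache.local:3128" >> /etc/yum.conf
    yum install -y gcc make   # now served from the local cache when possible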
As for how exactly packages/binaries should make their way to images, we
have a few options.
I know this is going to sound crazy, but I think containers remove most
of the necessity of a system packaging tool. Man pages, services, config
files, etc are mostly irrelevant in containers - especially ones that
build Firefox. Even most of the support binaries you'll find in
containers are unused. System packaging is focused on managing a
standalone system with many running services and with a configuration
that is dynamic over time. The type of containers we're talking about
only need to do one thing (build Firefox). So most of the benefits of a
system packaging tool are overhead and cruft to the container world. The
system packaging tools will need to adapt to this brave new container
world. But until they do, I wouldn't feel obligated to use a system
packaging tool, at least not in the traditional sense. For example, I
would use the lower-level `dpkg --unpack` or `dpkg --install` instead of
`apt-get install`, because apt-get provides little to no benefit for
containers.
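Sketching what that looks like (the package, URL, and checksum are
placeholders, not something to copy verbatim):

    # Fetch a single .deb, verify it against a pinned checksum, and
    # unpack it without running the full apt/dpkg configuration machinery.
    curl -sSLO http://ftp.debian.org/debian/pool/main/y/yasm/yasm_1.2.0-2_amd64.deb
    echo "<expected-sha256>  yasm_1.2.0-2_amd64.deb" | sha256sum -c -
    dpkg --unpack yasm_1.2.0-2_amd64.deb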
Continuing this train of thought, I don't think there is anything wrong
with defining images in terms of a manifest of tarballs, RPMs, or debs
that should be manually uncompressed into / (possibly with filtering
involved so you don't install unused files like man pages). The
manifests should have embedded checksums for *everything* to protect
against MitM attacks and undetected changes. This manifest approach is
low level and fast. Caching the individual components is trivial.
Instead of downloading whole new images every week, you are downloading
the individual packages that changed. This should use less bandwidth.
This approach also gives you full control. It doesn't bind you to the
system packager's world view that you are building a full-fledged
system. If you squint hard enough, this approach kinda resembles
tooltool, just carried out to the extreme.
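To make the manifest idea concrete, here is roughly what I have in mind
(the format, URLs, and checksums are all made up for illustration; this
is a sketch, not an existing tool):

    # A plain-text manifest: pinned checksum plus URL, one archive per line.
    cat > packages.manifest << 'EOF'
    <sha256-of-gcc-tarball>  https://package-mirror.local/gcc-4.8.5.tar.xz
    <sha256-of-binutils>     https://package-mirror.local/binutils-2.25.tar.xz
    EOF

    # Fetch (ideally from a local cache), verify against the pinned
    # checksum, and unpack into /, filtering out files we don't need.
    while read sha url; do
        f=$(basename "$url")
        curl -sSLo "$f" "$url"
        echo "$sha  $f" | sha256sum -c -
        tar -C / -xf "$f" --exclude='./usr/share/man/*'
    done < packages.manifest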
For the record, I've used this approach at a previous company. We had
plain text manifests listing archives to install. There was an
additional layer to run scripts after the base image was built to
perform any additional customization. It worked insanely well. I wish
Docker had used this approach from the beginning. But I
understand why they didn't: it was easier to lean on existing system
packaging tools, even if they aren't (yet) suited for a container world.
As for where to get the packages from, we should favor using bits
produced by reputable vendors, notably Red Hat and Debian. I especially
like Debian because many of their packages are now reproducible and
deterministic. This provides a lot of defense against Trusting Trust
attacks. In theory, we could follow in the footsteps of Tor and enable a
bit-identical Firefox build. No other browser can provide this to the
degree we can. It makes the tin foil hat crowd happy. But the real
benefit is to developers not having "drift" between build environments.
And you don't need to build everything from source to achieve that. So
as cool as Gitian (the deterministic build tool used by the Tor project)
is, we shouldn't lose sleep over not using it. Come to think of it, if we
use Debian packages for building Firefox, we should be reproducible via
the transitive property!
Whew, that was a lot of words. I hope I gave you something to think
about. I'm sure others will disagree with my opinions on the futility of
system packaging in a container world :)