On Tue, Sep 8, 2015 at 7:38 AM, Dustin Mitchell <dus...@mozilla.com> wrote:
> Thanks! Greg, I agree with a lot of what you have said. Docker has
> some design issues, but it has advantages too. Among those, it caches
> well and starts new containers very quickly. I don't think we could
> match that speed with an approach that manipulated individual files.
>
> If you think of Docker as a way of "freezing" the built images in a
> way that is traceable and can be deployed quickly, I think it makes a
> bit more sense.

I do. I wasn't suggesting we build images at job time: they should
definitely be pre-built and cached. (OK, maybe there is a TaskCluster
Graph-like job that ensures they are built and cached at the very
beginning of overall job execution -- but certainly not during a job
itself: that would likely add too much overhead and redundancy.)

> Regarding layers -- we can use them where they're useful, and not
> where they're problematic. In the implementation I've put together,
> all of the package installation happens in one Docker "layer", so
> there's no redundant caching of files that are later deleted, etc.
> It's worth noting that the overall size of the desktop-build docker
> image is 1.4GB -- about 60% smaller than a mozilla-central checkout.
>
> One place the layering helps is in caching. The current
> implementation comprises three layers: the base `centos6` image from
> RedHat (200MB); `centos6-build`, which installs a long list of build
> dependencies (1.2GB); and `desktop-build`, which installs a few shell
> scripts to make things go (negligible). Rather than re-creating that
> centos6-build image frequently, we could insert a
> `centos6-build-updates` layer after it, which just involves a `yum
> update` run. That layer will be fairly small and grow slowly, since
> centos6 gets so few updates. Then developers on slow connections
> would only need to download `centos6-build-updates` and
> `desktop-build` to have the latest-and-greatest build image.
>
> Axel, you raise a good point about hackfests. Docker has an `export`
> command which could be used to create a tarball to put on a USB
> stick. That has the side-effect of squashing layers, but for this
> particular purpose I don't think that will hurt. The effect is
> similar to the VM image you mentioned. I think there's also some
> means of carrying layers around other than HTTP (so hackers could
> prime their docker image caches from the USB stick), but I can't find
> it at the moment.
>
> Regarding the image build system, I see this running maybe 5-10 times
> a week (once automatically, plus a few try jobs as someone hacks on
> upgrading this or that library and finally lands to inbound), so I
> don't think it's a huge pain in the Internet. Aside from the RPMs,
> the packages we install *are* cached locally, and content is verified
> for everything (acknowledging weaknesses in the yum signature
> system).
>
> When it comes to building packages that aren't easily available
> upstream (and, note, we require CentOS 6 for production builds, and
> CentOS 6 is not, to my knowledge, available from Debian!), I agree
> that skipping the package database poses no issues. I would be happy
> with another solution that offloads the package building, as long as
> it is automated and traceable (no "built it on my laptop and pushed
> it") and can be done in try by non-privileged users. That's tricky
> with yum/apt repositories. Nix might be able to do it? Alternately,
> with some effort we could create other TaskCluster tasks to build
> each dependency as artifacts, then combine those artifacts together
> into the docker image.
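> As a rough sketch of that last idea (the artifact URL, checksum, and
> image names here are all hypothetical -- this is just the shape of
> it):
>
>     # Fetch a dependency artifact produced by an upstream TaskCluster
>     # task, verify it, and bake it into the build image.
>     wget https://tc.example.org/task/GCC_TASK_ID/artifacts/gcc.tar.xz
>     echo "<expected-sha256>  gcc.tar.xz" | sha256sum -c -
>
>     cat > Dockerfile <<'EOF'
>     FROM centos6-build:latest
>     # ADD auto-extracts local tar archives into the image -- no
>     # package manager involved.
>     ADD gcc.tar.xz /tools/
>     EOF
>     docker build -t desktop-build:candidate .
>
> The checksum step is what keeps it traceable: the image contents can
> be tied back to the exact tasks that produced them.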
> In general, lots of great ideas, but of course I can't promise to
> implement all of them. Still, I don't think the system I've outlined
> *precludes* any of those great ideas, nor requires implementing them
> -- in fact it enables a great many of them. Especially deterministic
> builds :)
>
> Dustin
>
> On Fri, Sep 4, 2015 at 7:09 PM, Axel Hecht <l...@mozilla.com> wrote:
> > On 9/5/15 12:06 AM, Gregory Szorc wrote:
> >>
> >> First, thank you for spending the time to compose this. It is an
> >> excellent write-up. Responses inline.
> >>
> >> On Fri, Sep 4, 2015 at 1:24 PM, Dustin Mitchell <dus...@mozilla.com
> >> <mailto:dus...@mozilla.com>> wrote:
> >>
> >> I'd like to get some feedback on changes that we (release
> >> engineering and the taskcluster team) are planning around how we
> >> build Firefox. I apologize that this is a bit long, as I'm trying to
> >> include the necessary background. I have some questions at the end
> >> about which I'd like to have some discussion.
> >>
> >> Before I get into it, a word of apology. We haven't done a great job
> >> of talking about this work -- I've talked to many members of the
> >> build module individually, but in so doing perhaps not shared all of
> >> the required background or established a common understanding. It's
> >> easy to get so deeply into something that you assume everyone knows
> >> about it, and I fear that's what may have happened. So if some of
> >> this comes as a surprise or feels like a fait accompli, I apologize.
> >> It's certainly not finished, and it's all software, so we can always
> >> change it. We're working to be more communicative and inclusive in
> >> the future.
> >>
> >> = Buildbot and TaskCluster =
> >>
> >> As you may know, we currently use Buildbot to schedule build (and
> >> test) jobs across all of our platforms. This has had a number of
> >> issues, including difficulty in reproducing the build or test
> >> environments outside of Buildbot, difficulty in testing or deploying
> >> changes to the build process (especially around scheduling and host
> >> configuration), and difficulty scaling. One of the issues we
> >> struggled with internally was the difficulty of making requested
> >> upgrades to otherwise "frozen" build systems: often the requested
> >> upgrade was not available for the ancient version of the operating
> >> system we were running.
> >>
> >> You may have your own issues to add to this list -- I'd be
> >> interested to hear them, to see if we are addressing them, or can
> >> address them, in this change!
> >>
> >> During the development of Firefox OS, though, another parallel
> >> system called TaskCluster (https://docs.taskcluster.net) was
> >> developed. It's a cloud-first job-scheduling system designed with
> >> simplicity in mind: tasks go in as structured data and run,
> >> producing logs, artifacts, and even other tasks. Here's a list of
> >> the design goals for TaskCluster:
> >>
> >> * Establish a developer-first workflow for builds and tests.
> >> * Speed/flexibility of deployment: minimize server-side config
> >>   requirements; new TaskCluster platform tasks can be deployed in
> >>   days or weeks, where the Buildbot platform takes months or
> >>   quarters.
> >> * Reproducibility: ability to define the complete OS environment /
> >>   image outside of Releng- or IT-controlled infra; increased
> >>   transparency in deployment.
> >> * Self-service: no Releng needed to change any tasks; in-tree
> >>   scheduling.
> >> * Extensibility: develop a general platform for building tools we
> >>   haven't thought of yet.
> >>
> >> The bit of TaskCluster that is probably most salient for this
> >> audience is this: a TaskCluster task essentially boils down to "run
> >> this shell command in this operating system image," and the
> >> scheduling system gets both the shell command and the operating
> >> system image from the gecko tree itself. That means you can change
> >> just about everything about the build process, including the
> >> operating system (host environment), from in-tree -- even by pushing
> >> to try! As an example, here [1] is a recent try job of mine.
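> >> (To make that concrete: conceptually, the worker that runs such a
> >> task does little more than the following. The image name and the
> >> command are illustrative only, not the real task definition:
> >>
> >>     docker pull example-registry/desktop-build:0.1.0  # cached after the first pull
> >>     docker run example-registry/desktop-build:0.1.0 \
> >>         bash -c 'checkout-sources && ./build.sh'      # hypothetical scripts
> >>
> >> Both the image reference and the command come from the task
> >> definition in the gecko tree.)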
> >> We are currently in the process of transitioning from Buildbot to
> >> TaskCluster. This is, of course, an enormous project, and requires a
> >> lot of hard work and attention from Releng, A-Team, the TaskCluster
> >> team, and from other teams impacted by the changes. I also think
> >> it's going to be enormously rewarding, and will free us of a lot of
> >> the constraints and issues I mentioned above. Ideally everyone wins
> >> -- build, Releng, A-Team, TaskCluster, all developers, even IT. So
> >> if you see something here as "losing", aside from the inevitable
> >> friction of change, please speak up.
> >>
> >> = Linux Builds =
> >>
> >> Zooming in a little bit, let's talk about Linux builds of Firefox
> >> Desktop. Mac, Windows, Fennec, B2G, etc. are all in various states
> >> of progress, and we can talk about those too. My focus right now is
> >> on Linux builds, and Glandium has raised some questions about them
> >> that I'd like to address here.
> >>
> >> For tasks that run on Linux, we can use Docker. That means that the
> >> "operating system image" is a Docker image. In fact, we have a
> >> method for building those Docker images using in-tree
> >> specifications, and plans[2] to support automatically rebuilding
> >> them on pushes to try. I've built a working CentOS 6.7 image
> >> specification[3] based on the mock environments used in buildbot,
> >> and I'm working on greening that up and putting it into TreeHerder
> >> as a Tier-2 build.
> >>
> >> Mock is not used at all -- the build runs directly in the docker
> >> image, using a "worker" user account. TaskCluster invokes a script
> >> that's baked into the docker image, which knows enough to check out
> >> the required revisions of the required sources (build/tools and
> >> gecko), then execute an in-tree script
> >> (testing/taskcluster/scripts/builder/build-linux.sh). That
> >> build-linux.sh translates a whole bunch of parameters from
> >> environment variables into Mozharness configuration, then invokes
> >> the proper Mozharness script, which performs the build as usual. All
> >> of that is easily tested in try -- that's what I've been doing for a
> >> few weeks now!
> >>
> >> This approach has lots of advantages, and (we think) solves a few of
> >> the issues I mentioned above:
> >>
> >> * Since everything is in-tree, everyone can self-serve. There's no
> >>   need to wait on resources from another team to modify the build
> >>   environment, e.g., to upgrade valgrind.
> >>
> >> * Since everything is in-tree, it can be handled like any other
> >>   commit: tried, backed out, bisected, put on trains, merged, etc.
> >>
> >> This is all fantastic. We've wanted to move in this direction for
> >> years. It will enable all kinds of people to experiment with new and
> >> crazy ideas without having to bother anybody on the automation side
> >> of things (in theory). This should enable all kinds of changes and
> >> experiments that were otherwise too costly to perform. It is huge
> >> for productivity.
> >>
> >> * Downloading and running a docker image is a well-known process, so
> >>   it's easy for devs to precisely replicate the production build
> >>   environment when necessary.
> >>
> >> * Inputs and outputs are precisely specified, so builds are
> >>   repeatable. And because each gecko revision specifies exactly the
> >>   docker image used to build it, you can even bisect over host
> >>   configuration changes!
> >>
> >> To address the issue of difficult upgrades, and to support the
> >> security team's desire that we not run insecure versions of
> >> packages, I have suggested that we rebuild the docker images weekly,
> >> regardless of whether there are configuration changes to the images.
> >> This would incorporate any upstream package updates, but not upgrade
> >> to a new major distro version (so no unexpected upgrade to CentOS
> >> 7). Mechanically, a bumper script would increment a "VERSION" file
> >> somewhere in-tree, causing an automatic rebuild[2] of the image.
> >> Thus the "bump" would show up in treeherder, in version-control
> >> history, and in perfherder, and could be bisected over and blamed
> >> for build or test failures or performance regressions, just like any
> >> other changeset. Reverting the changeset would revert to the earlier
> >> image. The changeset would ride the trains just like any other
> >> change.
> >>
> >> = Questions =
> >>
> >> Glandium has already raised a few questions about this plan. I'll
> >> list them here, but reserve my suggestions for a later reply. Please
> >> do respond to these questions, and add any other comments or
> >> questions that you might have, so that we can identify, discuss, and
> >> address any other risks or downsides to this approach.
> >>
> >> 1. A weekly bump may mean devs trying to use the latest-and-greatest
> >> are constantly downloading new docker images, which can be large.
> >>
> >> 2. Glandium has also expressed some concern at the way
> >> testing/docker/centos6-build/system-setup.sh is installing software:
> >> downloading source tarballs and building them directly. Other
> >> alternatives are to download hand-built RPMs directly (as is done
> >> for valgrind, yasm, and freetype, since they were already available
> >> in that format) or to host and maintain a custom Yum repository.
> >>
> >> My professional opinion is that the layered-images approach used by
> >> Docker out of the box is extremely sub-optimal and should be avoided
> >> at all costs. One of the primary reasons is that you end up having
> >> to download gigabytes of image layers and/or distro packages over
> >> and over and over again. This will be a very real concern for
> >> developers who don't have super-fast Internet connections. This
> >> includes some Mozilla offices. A one-time cost to obtain the source
> >> code and system dependencies, plus the ongoing cost of keeping these
> >> up to date, is acceptable. But many barely tolerate it today. If we
> >> throw Docker's inefficiency into the mix, I worry about the
> >> consequences.
> >> I think it is a worthwhile investment to build out a Docker image
> >> management infrastructure that doesn't abuse the Internet so much.
> >> This almost certainly entails caching tarballs, packages, etc.
> >> locally and then having the image build process leverage that local
> >> cache. I've heard of a few projects that are basically transparent
> >> Yum/Apt/PyPI caching proxies that do just this. Unfortunately, I
> >> can't find links to them right now. There are also efforts to invent
> >> better Docker image building techniques. I chatted with someone from
> >> RedHat about this a few months ago, and it sounded like we were on
> >> the same page about an approach that composes images from
> >> cached/shared assets. I /think/ Project Atomic
> >> (http://www.projectatomic.io/) was looking into things like building
> >> Docker images using the Yum "database" and cached RPMs from the host
> >> machine. Not sure how far they got.
> >>
> >> As for how exactly packages/binaries should make their way into
> >> images, we have a few options.
> >>
> >> I know this is going to sound crazy, but I think containers remove
> >> most of the need for a system packaging tool. Man pages, services,
> >> config files, etc. are mostly irrelevant in containers -- especially
> >> ones that build Firefox. Even most of the support binaries you'll
> >> find in containers are unused. System packaging is focused on
> >> managing a standalone system with many running services and with a
> >> configuration that changes over time. The type of containers we're
> >> talking about only needs to do one thing (build Firefox). So most of
> >> the benefits of a system packaging tool are overhead and cruft in
> >> the container world. The system packaging tools will need to adapt
> >> to this brave new container world. But until they do, I wouldn't
> >> feel obligated to use a system packaging tool, at least not in the
> >> traditional sense. E.g., I would use the lower-level `dpkg --unpack`
> >> or `dpkg --install` instead of `apt-get install`, because apt-get
> >> provides little to no benefit for containers.
> >>
> >> Continuing this train of thought, I don't think there is anything
> >> wrong with defining images in terms of a manifest of tarballs, RPMs,
> >> or debs that should be manually uncompressed into / (possibly with
> >> filtering involved so you don't install unused files like man
> >> pages). The manifests should have embedded checksums for
> >> *everything* to protect against MitM attacks and undetected changes.
> >> This manifest approach is low-level and fast. Caching the individual
> >> components is trivial. Instead of downloading whole new images every
> >> week, you download only the individual packages that changed. This
> >> should use less bandwidth. This approach also gives you full
> >> control. It doesn't bind you to the system packager's world view
> >> that you are building a full-fledged system. If you squint hard
> >> enough, this approach kinda resembles tooltool, just carried out to
> >> the extreme.
> >>
> >> For the record, I've used this approach at a previous company. We
> >> had plain-text manifests listing archives to install. There was an
> >> additional layer to run scripts after the base image was built, to
> >> perform any additional customization. It worked insanely well. I
> >> wish Docker had used this approach from the beginning. But I
> >> understand why they didn't: it was easier to lean on existing system
> >> packaging tools, even if they aren't (yet) suited for a container
> >> world.
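> >> Concretely, the whole "install" step can collapse to something like
> >> this (the manifest format and the tools assumed to be in the image
> >> are illustrative; assume lines of "sha256 url" pairs):
> >>
> >>     # Verify and unpack each entry of the manifest into /.
> >>     while read -r sha url; do
> >>         f=$(basename "$url")
> >>         curl -sfLO "$url"
> >>         echo "$sha  $f" | sha256sum -c - || exit 1
> >>         case "$f" in
> >>             *.deb) dpkg --unpack "$f" ;;                   # skip apt entirely
> >>             *.rpm) rpm2cpio "$f" | (cd / && cpio -idmu) ;; # skip yum entirely
> >>             *)     tar -xf "$f" -C / ;;
> >>         esac
> >>     done < manifest.txt
> >>
> >> No dependency resolution, no repository metadata -- just "verify and
> >> unpack", which is also trivially cacheable per file.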
> >> As for where to get the packages from, we should favor using bits
> >> produced by reputable vendors, notably RedHat and Debian. I
> >> especially like Debian because many of their packages are now
> >> reproducible and deterministic. This provides a lot of defense
> >> against Trusting Trust attacks. In theory, we could follow in the
> >> footsteps of Tor and enable a bit-identical Firefox build. No other
> >> browser can provide this to the degree we can. It makes the tin-foil-
> >> hat crowd happy. But the real benefit is to developers, who won't
> >> have "drift" between build environments. And you don't need to build
> >> everything from source to achieve that. So as cool as Gitian (Tor's
> >> deterministic build tool) is, we shouldn't lose sleep over not using
> >> it. Come to think of it, if we use Debian packages for building
> >> Firefox, we should be reproducible via the transitive property!
> >>
> >> Whew, that was a lot of words. I hope I gave you something to think
> >> about. I'm sure others will disagree with my opinions on the
> >> futility of system packaging in a container world :)
> >
> > I fully agree with the gist of it here.
> >
> > We want to increase participation, and thus kicking things off needs
> > to be cheap.
> >
> > It also needs a chance to succeed on flaky Internet connections.
> >
> > We also need to support hackfest environments. You've got
> > 20/30/50/100 people in a single office, sharing one Internet
> > connection, and they all want to start hacking at the same time.
> >
> > When we did the last hackathon in Berlin, we had a VM with a
> > precompiled build.
> >
> > A few minutes of USB-stick goodness, and people were launching a
> > build.
> >
> > Axel
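(One note on the USB-stick case: unlike `docker export`, the `docker
save` and `docker load` commands preserve image layers, so they should
cover the "carrying layers around other than HTTP" mechanism Dustin
couldn't find. A sketch, with an illustrative image name:

    # On a machine that already has the image cached:
    docker save desktop-build:latest | gzip > /mnt/usb/desktop-build.tar.gz
    # On each hackfest laptop:
    gunzip -c /mnt/usb/desktop-build.tar.gz | docker load

After priming caches that way, a later weekly bump should only pull
the changed layers over the network.)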
_______________________________________________
dev-builds mailing list
dev-builds@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-builds