Re: [rfc] Trying to make sense of opt-in Node integration in mozilla-central

Gregory Szorc Fri, 20 Apr 2018 11:50:19 -0700

On Fri, Apr 20, 2018 at 10:24 AM, Nicholas Alexander <nalexan...@mozilla.com> wrote:


> Colleagues,
>
> I have a patch series that adds an opt-in --enable-node-environment
> configure flag and, when that flag is set, uses Node (via Webpack) to
> generate the Activity Stream content bundle.  This patch series does not
> try to solve a few hard problems:
>
> 1) vendoring Node modules into the tree
> 2) installing $topobjdir/node_modules at build time efficiently.
>
> There's a green artifact build of my prototype at
>
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=d138f854139f2e389867b01d2f2afe59f2975783
>
> I owe some folks (dmose, jlaster) details on what is needed in order to
> land this opt-in prototype and, more importantly, how to make the
> prototype not opt-in.  To that end, I talked to most of the build peers
> (chmanchester, gps, mshal, ted) yesterday in YVR.
>
> The results of that discussion were not what I expected.
>
> First, some context.  The build system is investing heavily into
> capturing the full dependency DAG (Directed Acyclic Graph) in order to
> produce correct builds.  The current build backends, and in particular
> the dominant RecursiveMake build backend, do not capture the full DAG.
> Capturing the full DAG is required to use "modern" build systems like
> Tup, Buck, or Bazel.  Any sub-component of the build must therefore
> either correspond to edges in the DAG (these inputs, these outputs) or,
> if it does its own caching and invalidation, expose its internal DAG.
> In the current build system, C compiler invocations are the prototype of
> the first situation and cargo is the prototype of the second situation.
> What I did not know is that the build peers are contributing code to
> cargo to have it expose its internal DAG, and that all of the "modern"
> build systems (in particular Buck) need this functionality to integrate
> against cargo.
>
> Second, my appraisal of the situation.
>
> Integrating Node will be very challenging.  On the one hand, |yarn install|
> (or |npm install|) is, like cargo, in the second situation -- it is its own
> build system that does its own caching and invalidation.  That means
> that to integrate into the build system it must expose its internal DAG.
> It's possible that yarn could expose its own DAG, but Node modules can
> define arbitrary pre- and post-install scripts, which are essential to
> the module ecosystem.  I can't imagine us being able to capture the
> "leaf DAG" of every installed module -- there are no rules out at the
> leaves.
>
> On the other hand, the most general form of integration (which I have
> been pursuing) is to enable the build system to invoke arbitrary yarn
> verbs (like `GENERATED_FILES[...].script = 'yarn.py';
> GENERATED_FILES[...].flags = 'run arbitrary_yarn_verb'`).  Arbitrary yarn
> verbs are, well, arbitrary -- they could be simple, like C compiler
> invocations, or they could be build systems in their own right, like
> Webpack.  For arbitrary yarn verbs, I don't think it's feasible to
> extract DAGs from the Node ecosystem tools involved.
>
> Third, what is to be done.
>
> The build peers most invested in the transition to a "modern" build
> system (here, Tup) are chmanchester and mshal.  They conclude that it is
> not possible to integrate build systems into each other without
> significant work exposing internal DAGs (which we are willing to do for
> cargo).  They instead propose that build systems not integrate but
> instead run in serial.  That is, the "Node bits" run either first (and
> provide inputs to the rest of the build system) or run second (and
> consume outputs from the rest of the build system).  Of course, that
> arrangement sacrifices parallelism and throughput, but at least the
> final output will be correct.
>
> This leads me to propose that we treat |yarn install| as a separate
> build system that runs before the main build system.  It manages its own
> caching and invalidation, and produces $topobjdir/node_modules.  |yarn
> install| is intended to efficiently determine that its output is
> up-to-date, so perhaps the overhead of running it every time we build
> will be acceptable.  (Otherwise, we try to find ways to invoke it less
> frequently.)
>
> We then have a choice.  We can either push _all_ Node invocations into
> the first build system and accept what I expect to be a big performance
> penalty in practice; or we can restrict the Node integration in the main
> build system to commands that we are confident are not their own build
> systems.
>
> The former is fully general but will require non-trivial effort to
> implement in the build system, I expect -- perhaps a new build backend,
> specialized to Node, and some glue code in |mach build| to manage
> ordering the systems.  In addition, such an arrangement could never
> allow Node bits to depend on regular build system bits, since the Node
> bits would always happen first.  That might make some sense right now,
> since all of the Node projects we're integrating are stand-alone (usually on
> GitHub!) but as more of the core Firefox front-end functionality
> leverages Node that will look worse and worse.  Even exposing
> AppConstants.jsm to Node could be fraught (if the actual contents are
> required, for example to tree shake on the basis of build flags).
>
> The latter is restrictive -- for example, we might support only Rollup
> but not Webpack, since Rollup is more clearly inputs-to-outputs and
> Webpack is more focused on incremental builds -- and requires labour to
> audit and add support for new tools.  However, it requires fewer up-front
> build system modifications and is easier to transition to gradually.
>
> Fourth, my conclusion.
>
> I prefer working within the existing build system and invoking Node
> commands rather than arbitrary yarn verbs.
>
> The fast path to landing this as an opt-in therefore looks like:
>
> - adding a new "node" build tier before "pre-export" that runs |yarn
> install|
> - restricting to audited Node-consuming commands like |node webpack| and
>   |node rollup| in the build system.
>
> After that we can tackle vendoring Node modules into the tree, which does
> not appear to have anything fundamental blocking it.
>
> Phew!  That's a wall of text.  Please correct me if I'm misunderstanding
> things, or if my explanations need clarification.  As I said, the
> results of this discussion were not what I expected, so this is mostly
> new to me :/
>
> I'll wait to collect some feedback on this summary before trying to
> figure out next steps.
>

This is a good summary and captures the fears that many of us build system
maintainers have with integrating 3rd party tools that behave like or are
build systems.

I think it helps to divide the problem space into 2 parts:

a) Enabling the running of Node in the build system (i.e. Node package
management)
b) Running Node in the build system

We kind of have a precedent for both parts with Python. For Python, we
create a virtualenv in the objdir during configure. Then, for each Python
process we run, we either teach the build system about the dependencies and
outputs explicitly (often via moz.build files or custom Python code running
during moz.build evaluation time), or we do it inline in the Python that is
executed at build time (e.g. by spitting out a make dependencies file).
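
To make the "inline" option concrete, here is a minimal sketch of a build-time
Python script reporting the files it read as a make dependencies file. The
function name and file names are hypothetical and this is not the actual
mozbuild helper; it only illustrates the shape of the output.

```python
def format_make_deps(target, deps):
    """Return make rules declaring that `target` depends on `deps`.

    Hypothetical helper for illustration; not part of mozbuild.
    """
    def escape(path):
        # Make splits prerequisites on whitespace, so escape spaces.
        return path.replace(" ", "\\ ")

    ordered = sorted(deps)
    lines = ["%s: %s" % (escape(target), " ".join(escape(d) for d in ordered))]
    # Emit an empty rule per dependency so that deleting a dependency
    # does not make the next build fail with "no rule to make target".
    for dep in ordered:
        lines.append("%s:" % escape(dep))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_make_deps("bundle.js", {"src/b.js", "src/a.js"}), end="")
```

A Node wrapper could do the same thing if the tool it runs can report which
files it read.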

I think "a" should live in configure *or* should be invoked at build time
by the Python code backing `mach build`. I don't think it should live in
the pre-export tier because that is specific to the make backend and having
it live there will require us to implement logic for invoking it in every
backend. This "step" of the build is common to all backends and should not
live in a single backend.
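
As a rough sketch of what a backend-independent step could look like, the
Python driver could run package installation before handing off to whichever
backend is active, and skip it when the manifests have not changed. Everything
here (function names, the stamp file, the `install` callback) is an assumption
for illustration, not existing `mach` code; in practice `install` would wrap
something like |yarn install|.

```python
import hashlib


def manifest_digest(manifest_paths):
    """Hash the contents of the package manifests (e.g. package.json, yarn.lock)."""
    h = hashlib.sha256()
    for path in sorted(manifest_paths):
        h.update(path.encode("utf-8") + b"\0")
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()


def ensure_node_modules(manifest_paths, stamp_path, install):
    """Run `install()` only if the manifests changed since the last run.

    Returns True if the install ran, False if it was skipped.
    """
    digest = manifest_digest(manifest_paths)
    try:
        with open(stamp_path) as f:
            if f.read() == digest:
                return False  # Up to date: nothing to do.
    except FileNotFoundError:
        pass  # No stamp yet: first run.
    install()
    # Record the digest only after a successful install.
    with open(stamp_path, "w") as f:
        f.write(digest)
    return True
```

Because this runs in the driver rather than in a make tier, no backend needs
its own logic for it.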

"b" is a harder problem because it isn't well-defined. We *will* need every
Node invocation during the build to define its dependencies and
contribution to the overall DAG. We can do things like assume all installed
Node module files [managed by "a"] are dependencies for *every* Node
process. But for the non-obvious inputs and for all outputs, we'll need to
annotate those somehow. There's no getting around that. Well, we could
skip enumerating those inputs and outputs. But then we would have to invoke
Node processes on every build, which would make no-op and light builds slower.
We don't like making builds slower by running things that shouldn't need to
run. So we'll end up enumerating inputs and outputs.
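
To illustrate why the enumeration matters: once inputs and outputs are known,
deciding whether a Node command has to run at all is cheap. The following is a
minimal mtime-based sketch with a hypothetical function name; a real backend
(make, Tup) does this bookkeeping itself from the declared DAG edges.

```python
import os


def needs_rebuild(inputs, outputs):
    """Return True if the command producing `outputs` from `inputs` must run.

    The command must run if there are no declared outputs, if any output
    is missing, or if any input is newer than the oldest output. A missing
    input raises OSError, since that is a broken declaration rather than a
    stale build.
    """
    if not outputs:
        return True
    try:
        oldest_output = min(os.path.getmtime(p) for p in outputs)
    except OSError:
        return True  # At least one output does not exist yet.
    return any(os.path.getmtime(p) > oldest_output for p in inputs)
```

Without the input and output lists there is nothing to compare, and the only
safe answer is to run the command every time.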

_______________________________________________
dev-builds mailing list
dev-builds@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-builds
