[bug #41781] Provide a fast fail path when a target is compromized during a parallel build

Paul D. Smith Wed, 05 Mar 2014 10:01:34 -0800

Update of bug #41781 (project make):

             Assigned to:                    None => psmith                 
           Triage Status:                    None => Medium Effort


    _______________________________________________________

Follow-up Comment #1:

There's some confusion in the reading material you mention, and in the bug
report, about how GNU make actually works.

It's not true that enabling parallel builds will magically turn on the -k
(keep-going) flag and that it can't be turned off.  Make with parallel builds
enabled, and without -k, handles failed builds in exactly the same way that it
would otherwise: it stops building things "as soon as it can".

When a parallel make instance detects a failed build, it will not start any
new jobs; it waits for all the currently-running parallel jobs to complete,
then exits with an error code.  Make doesn't try to kill any running jobs, it
waits for them to finish: killing jobs can lead to corrupted builds if the
recipe doesn't expect to be killed (arguably this is a bug in the recipe,
since the user could always use CTRL-C, but nevertheless it's not expected
that make will kill things).

The problem you're seeing likely happens because you are using recursive make
instances.  Suppose you have a makefile which runs three recursive instances
of make.  You start it with -j2, so you have make instance A (the top one)
spawning instances B and C (the children).  The third recursive make instance
is not started (yet) because there are only 2 job slots available.

Now say that during the run of make instance B a job fails.  In that case,
instance B detects that a failure happened and it won't start any new jobs and
it will exit with an error as soon as all currently running jobs _it started_
are completed.  When B exits with an error, instance A (the root) sees the
failure and it won't start up the final recursive make at all.  However,
already-running make instance C has no idea that B saw an error so C keeps
building all its targets as usual, then exits.
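
To make the sequence above concrete, here is a rough toy model in Python.
It is only a sketch, not how make is implemented: the names SubMake and
run_top_level, the job lists, and the "one job per child per tick" timing
are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SubMake:
    """Hypothetical child make that runs its jobs one at a time."""
    name: str
    jobs: list          # each entry: True (job succeeds) or False (job fails)
    ran: int = 0
    failed: bool = False

    def step(self):
        """Run the next job; return False once this instance has exited."""
        if self.failed or self.ran == len(self.jobs):
            return False
        ok = self.jobs[self.ran]
        self.ran += 1
        if not ok:
            self.failed = True   # stop starting jobs; exit on the next tick
        return True


def run_top_level(children, slots):
    """Model instance A: start children while slots are free and no error was seen."""
    pending = list(children)
    running = []
    saw_error = False
    while pending or running:
        # A only starts a new recursive make if it has not seen a failure yet.
        while pending and len(running) < slots and not saw_error:
            running.append(pending.pop(0))
        if not running:
            break
        for child in list(running):
            if not child.step():
                running.remove(child)
                saw_error = saw_error or child.failed
    return saw_error, [c.name for c in pending]


b = SubMake("B", [True, False, True, True])   # second job fails
c = SubMake("C", [True] * 5)                  # never learns about B's error
d = SubMake("D", [True] * 3)                  # third recursive make
error, never_started = run_top_level([b, c, d], slots=2)
print(error, never_started, b.ran, c.ran, d.ran)
```

In this model B stops after its failing job and D is never started, but C
still runs every one of its jobs, which is the behavior described above.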

If you have a large amount of parallelism and a recursive make environment,
then you can get a large number of directories building happily along, unaware
that one of their siblings has detected an error.

I've been thinking about this problem (which is not really the point of the
patch you mention: that patch allows some finite number of errors to be
ignored, fewer than the "all" you get with -k).

I think the right solution is that when a make instance gets an error, it
notifies all the other make instances about the error.  I think the right way
to do that is to start putting back a different token into the jobserver
queue, which means "error detected".  Then when other make instances get that
token they'll know one of their siblings had an error and they can stop
building.  This means that make will still not kill any existing jobs that are
running, but it won't start any new ones either, even in recursive
environments.

The big open question is, if a make instance detects that some other make
instance failed and so it wants to stop early, what should its exit code be? 
I think it must exit with an error code, even though it, itself, did not
detect any error, because it didn't completely build its target; we have to
signal that to the parent in case the parent is relying on that for further
processing.  On the other hand that means you'll get a list of error messages
for all the recursive make invocations, which might be unpleasant.
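
As a rough illustration of the idea (nothing like this exists in make
today), the toy Python model below shares one token queue between two
sibling instances.  Everything in it is an assumption for the sake of the
sketch: the names Jobserver, Instance and simulate, the "E" marker, the
choice to have a sibling put the error token back so others can see it too,
and the use of exit status 2 for an instance that stops early.  Jobs are
treated as instantaneous, so it only models the "don't start new jobs" part.

```python
from collections import deque

JOB_TOKEN = "+"     # ordinary job-slot token
ERROR_TOKEN = "E"   # hypothetical marker meaning "error detected"


class Jobserver:
    """Toy shared token queue; not GNU make's real implementation."""

    def __init__(self, slots):
        self.tokens = deque(JOB_TOKEN for _ in range(slots))

    def acquire(self):
        return self.tokens.popleft() if self.tokens else None

    def release(self, token):
        self.tokens.append(token)


class Instance:
    """Hypothetical make instance taking one token per job."""

    def __init__(self, name, server, jobs):
        self.name, self.server, self.jobs = name, server, list(jobs)
        self.started = 0
        self.stopped = False
        self.exit_status = 0

    def step(self):
        """Try to run one job; return False once nothing more will be started."""
        if self.stopped or not self.jobs:
            return False
        token = self.server.acquire()
        if token is None:
            return True                        # no free slot; retry later
        if token == ERROR_TOKEN:
            # A sibling failed.  Assumption: pass the marker on, stop, exit non-zero.
            self.server.release(ERROR_TOKEN)
            self.stopped, self.exit_status = True, 2
            return False
        ok = self.jobs.pop(0)
        self.started += 1
        if ok:
            self.server.release(JOB_TOKEN)
        else:
            # Our own job failed: put back the error marker instead of the slot.
            self.server.release(ERROR_TOKEN)
            self.stopped, self.exit_status = True, 2
        return True


def simulate(slots, job_lists):
    """Round-robin the instances until none of them will start another job."""
    server = Jobserver(slots)
    instances = [Instance(n, server, j) for n, j in job_lists.items()]
    active = list(instances)
    while active:
        active = [inst for inst in active if inst.step()]
    return {i.name: (i.started, i.exit_status) for i in instances}


print(simulate(2, {"B": [True, False, True], "C": [True] * 5}))
```

Here C starts only two of its five jobs before it picks up the error marker,
and it exits non-zero even though none of its own jobs failed, which is
exactly the trade-off raised in the open question above.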

    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?41781>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/


_______________________________________________
Bug-make mailing list
Bug-make@gnu.org
https://lists.gnu.org/mailman/listinfo/bug-make
