Hi, [please keep me CCed, I'm not subscribed to bug-make]
I've noticed a problem with make 3.80 while building GCC. I can
reproduce it with a small makefile, and also with current CVS of GNU
make. First I describe the symptoms, and then the bug. The former is
a bit long, so you might skip to the description of the bug, which is
obvious once you know where to look.

See this Makefile:
----------------------------
.PHONY: all fail1 fail2 fail3 ok1 ok2 ok3
all: fail1 ok1 fail2 ok2 fail3 ok3

fail1 fail2 fail3:
	echo Fail
	exit 1

ok1 ok2 ok3:
	echo Ok
	sleep 2
	echo ok done
----------------------------

So, we have a mixture of failing and succeeding commands, where the
succeeding commands need quite some time to finish. Running make on the
above in parallel will sometimes result in make not waiting for all
started jobs before exiting. A multi-CPU machine increases the
likelihood of this happening. A higher number for -jN increases it too
(I can usually reproduce it just fine with -j6, i.e. with the maximum
parallelism for this makefile, but others might have to add more
targets).

This is an example of the bug:

% make -r -j5 ; echo "============================="; pp sleep
echo Fail
Fail
exit 1
echo Ok
echo Fail
echo Ok
echo Fail
Ok
sleep 2
Fail
exit 1
make: *** [fail3] Error 1
make: *** Waiting for unfinished jobs....
make: *** [fail1] Error 1
Ok
sleep 2
Fail
exit 1
make: *** [fail2] Error 1
=============================
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
matz     14483  0.0  0.1  7112  736 pts/0    S    06:02   0:00 sleep 2
matz     14485  0.0  0.1  7112  736 pts/0    S    06:02   0:00 sleep 2

Note how even after 'make' has stopped there are still two sleeps
running on the system.

The above example is of course harmless. But this also happens if the
commands are sub-makes, which then hang around without a controlling
parent make. And worse, a make can return to the shell (with an error)
while some sub-makes are still building stuff in some directories. If
one keeps working after the top-level make has returned, one might see
confusing effects from those sub-makes (e.g.
files magically appearing in subdirs, command output in the terminal,
and generally annoying things). Killing all these sub-makes by hand can
be cumbersome if there are many (I have machines where I can build GCC
with a parallelism of 32, and something like the above happened to me.
I preferred to wait until the sub-makes were done on their own instead
of hunting them down).

To demonstrate the above effect with sub-makes involved, just change
the top-level Makefile to:
----------------------------------
.PHONY: all fail1 fail2 fail3 ok1 ok2 ok3
all: fail1 ok1 fail2 ok2 fail3 ok3

ok1 ok2 ok3 fail1 fail2 fail3:
	$(MAKE) -C $@
----------------------------------

where each */Makefile contains the same commands as above,
appropriately separated for the ok* and fail* subdirs. An example
output would look like:

% ./make/make/make -r -j6 ; pp sleep
/tmp/par-make/./make/make/make -C fail1
/tmp/par-make/./make/make/make -C ok1
/tmp/par-make/./make/make/make -C fail2
/tmp/par-make/./make/make/make -C ok2
/tmp/par-make/./make/make/make -C fail3
/tmp/par-make/./make/make/make -C ok3
make[1]: Entering directory `/tmp/par-make/ok1'
make[1]: Entering directory `/tmp/par-make/fail2'
make[1]: Entering directory `/tmp/par-make/fail1'
make[1]: Entering directory `/tmp/par-make/ok2'
make[1]: Entering directory `/tmp/par-make/fail3'
make[1]: Entering directory `/tmp/par-make/ok3'
Fail /tmp/par-make/fail2
exit 1
Ok /tmp/par-make/ok3
Ok /tmp/par-make/ok1
Fail /tmp/par-make/fail3
exit 1
Fail /tmp/par-make/fail1
Ok /tmp/par-make/ok2
exit 1
make[1]: *** [all] Error 1
make[1]: Leaving directory `/tmp/par-make/fail2'
make: *** [fail2] Error 2
make: *** Waiting for unfinished jobs....
make[1]: *** [all] Error 1
make[1]: Leaving directory `/tmp/par-make/fail3'
make[1]: *** [all] Error 1
make[1]: Leaving directory `/tmp/par-make/fail1'
make: *** [fail3] Error 2
make: *** [fail1] Error 2
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
matz      9765  0.0  0.0  7120  740 pts/5    S    05:18   0:00 sleep 2
matz      9766  0.0  0.0  7120  740 pts/5    S    05:18   0:00 sleep 2
matz      9769  0.0  0.0  7120  740 pts/5    S    05:18   0:00 sleep 2
[EMAIL PROTECTED] % Ok /tmp/par-make/ok3 done
make[1]: Leaving directory `/tmp/par-make/ok3'
Ok /tmp/par-make/ok1 done
Ok /tmp/par-make/ok2 done
make[1]: Leaving directory `/tmp/par-make/ok1'
make[1]: Leaving directory `/tmp/par-make/ok2'

Note how the prompt is already there, followed by output from the
sub-makes still working in ok[123].

I'll spare us the output of running make with the -d option. What
happens is that make suddenly exits although there are still job slots
in use.

I know why this happens. The problem is the interaction between die()
and reap_children() when multiple failing jobs are in the queue and the
user does not use -k. Let's suppose there are five job slots in use
(all three failing jobs and two ok jobs). The first failing one will
trigger "reap_children(0, 0)" at some point, and then the chain of
events goes like so:

reap_children (0, /*err= */ 0)
  # reap the failing child fail1
  #   if (!err && child_failed && !keep_going_flag)
  #     die (2);
  die (2)
    # this is the first call, hence dying is 0, ergo it does:
    #   dying = 1
    #   for (err = (status != 0); job_slots_used > 0; err = 0)
    #     reap_children (1, err);
    # status == 2, hence err will be 1 in the first call
    reap_children (1, 1)
      # suppose this will get the second failing job, fail2
      #   if (!err && child_failed && !keep_going_flag)
      #     die (2);
      # as err == 1, this will not call die(2).
      # Instead it sets block = 0,
      # repeats the loop, and exits it, as no other children are dead,
      # so we return to the above die (2) activation
    # We are in this loop again:
    #   for (err = (status != 0); job_slots_used > 0; err = 0)
    #     reap_children (1, err);
    # right now job_slots_used is 3 (the last fail job, and the two ok jobs)
    # this time, the second iteration, i.e. err is now 0, so we do:
    reap_children (1, 0)
      # We now reap the third failing child, fail3
      # err is 0, hence we do this:
      #   if (!err && child_failed && !keep_going_flag)
      #     die (2);
      die (2)
        # as dying is set, we jump over the cleanup
        # and just do:
        exit (2)

Voila. We don't wait for the two last jobs ok1 and ok2. Note that the
timing here is critical. If both remaining fail jobs are already done
during the second reap_children invocation, they will both be reaped by
that activation, and hence don't lead to a recursive die() call in the
last reap_children() invocation.

The problem is that the 'err' variable is used to control two things,
namely whether the 'Waiting for unfinished jobs....' warning should be
printed, _and_ whether die() should be called recursively. As the
warning should be printed only once, 'err' is reset after the first
iteration. But that leads to a recursive invocation of die(), which
just exits the whole make and never completes the waiting loop in the
outer die() activation.

I used the patch below to fix this problem. It produces no regressions
in the testsuite.

It might perhaps be a good idea to test that job_slots_used is 0 right
before doing the exit() in die(). It would have caught this bug.

I hope this makes sense.

Ciao,
Michael.
-- 
Index: job.c
===================================================================
RCS file: /cvsroot/make/make/job.c,v
retrieving revision 1.166
diff -u -p -r1.166 job.c
--- job.c 26 Jun 2005 03:31:30 -0000 1.166
+++ job.c 31 Jul 2005 03:50:43 -0000
@@ -475,9 +475,12 @@ reap_children (int block, int err)
 
       if (err && block)
         {
+          static printed = 0;
           /* We might block for a while, so let the user know why. */
           fflush (stdout);
-          error (NILF, _("*** Waiting for unfinished jobs...."));
+          if (!printed)
+            error (NILF, _("*** Waiting for unfinished jobs...."));
+          printed = 1;
         }
 
       /* We have one less dead child to reap. As noted in
Index: main.c
===================================================================
RCS file: /cvsroot/make/make/main.c,v
retrieving revision 1.210
diff -u -p -r1.210 main.c
--- main.c 12 Jul 2005 04:35:13 -0000 1.210
+++ main.c 31 Jul 2005 03:50:44 -0000
@@ -2990,7 +2990,7 @@ die (int status)
         print_version ();
 
       /* Wait for children to die. */
-      for (err = (status != 0); job_slots_used > 0; err = 0)
+      for (err = (status != 0); job_slots_used > 0;)
         reap_children (1, err);
 
       /* Let the remote job module clean up its state. */
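
P.S. For anyone who wants to poke at the control flow without building
make, here is a small C model of the chain of events described in this
report. It is only a sketch under stated assumptions: struct sim,
sim_run() and the tick-based clock are made up for illustration and are
not part of make, and the two functions keep only the parts of
reap_children() and die() that matter here (no -k, no signals, no
remote jobs). sim_run() returns how many jobs would still be running at
the moment exit() is reached.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_JOBS 8

/* Toy state; field names echo the report, the struct is hypothetical. */
struct sim {
    int  finish[MAX_JOBS];  /* non-negative tick at which job i terminates */
    bool failed[MAX_JOBS];  /* true if job i exits non-zero */
    bool reaped[MAX_JOBS];
    int  n_jobs;
    int  now;               /* simulated clock */
    int  job_slots_used;
    bool dying;
    bool exited;            /* stands in for exit(), which never returns */
    int  orphans;           /* job_slots_used at the moment of exit */
    bool reset_err;         /* true: original "err = 0" loop, false: patched */
};

static void sim_die(struct sim *s);

/* Index of an unreaped job that has already terminated, or -1. */
static int pick_finished(const struct sim *s)
{
    for (int i = 0; i < s->n_jobs; i++)
        if (!s->reaped[i] && s->finish[i] <= s->now)
            return i;
    return -1;
}

/* Earliest finish tick among unreaped jobs; caller guarantees one exists. */
static int earliest_pending(const struct sim *s)
{
    int best = -1;
    for (int i = 0; i < s->n_jobs; i++)
        if (!s->reaped[i] && (best < 0 || s->finish[i] < best))
            best = s->finish[i];
    return best;
}

/* Reduced reap_children(): blocks for at most one child per call. */
static void sim_reap_children(struct sim *s, bool block, bool err)
{
    while (s->job_slots_used > 0 && !s->exited) {
        int j = pick_finished(s);
        if (j < 0) {
            if (!block)
                return;                    /* nothing dead, do not wait */
            s->now = earliest_pending(s);  /* blocking wait */
            j = pick_finished(s);
        }
        s->reaped[j] = true;
        s->job_slots_used--;
        if (!err && s->failed[j])          /* this model never uses -k */
            sim_die(s);
        block = false;
    }
}

/* Reduced die(): always called with a non-zero status in this model. */
static void sim_die(struct sim *s)
{
    if (!s->dying) {
        s->dying = true;
        bool err = true;
        while (s->job_slots_used > 0 && !s->exited) {
            sim_reap_children(s, true, err);
            if (s->reset_err)
                err = false;               /* the original "err = 0" step */
        }
    }
    if (!s->exited) {                      /* record state at exit() */
        s->exited = true;
        s->orphans = s->job_slots_used;
    }
}

/* Run one scenario; returns the number of jobs left running at exit. */
int sim_run(const int *finish, const bool *failed, int n_jobs, bool reset_err)
{
    struct sim s = {0};
    assert(n_jobs > 0 && n_jobs <= MAX_JOBS);
    for (int i = 0; i < n_jobs; i++) {
        s.finish[i] = finish[i];
        s.failed[i] = failed[i];
    }
    s.n_jobs = s.job_slots_used = n_jobs;
    s.reset_err = reset_err;
    while (!s.exited && s.job_slots_used > 0) {
        s.now = earliest_pending(&s);      /* idle until a child dies */
        sim_reap_children(&s, false, false);
    }
    return s.exited ? s.orphans : 0;
}
```

With three failing jobs finishing at ticks 1, 2 and 3 and two ok jobs
finishing at tick 20, sim_run(..., true) returns 2 (the two ok jobs are
abandoned), while sim_run(..., false) returns 0. If the second and third
failing jobs finish at the same tick, the original loop also returns 0,
which matches the timing dependence noted in the report.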