URL: <http://savannah.gnu.org/bugs/?40725>
Summary: Make could completely freeze during a parallel build in some particular conditions
Project: make
Submitted by: fviard
Submitted on: Wed 27 Nov 2013 19:16:14 GMT
Severity: 3 - Normal
Item Group: Bug
Status: None
Privacy: Public
Assigned to: None
Open/Closed: Open
Discussion Lock: Any
Component Version: 3.81
Operating System: POSIX-Based
Fixed Release: None
Triage Status: None

_______________________________________________________

Details:

I have a particular setup where I run make to build a simple package inside Scratchbox2, through a fakeroot, inside a chroot.
I wasn't able to reproduce this issue outside of this setup, but it may only be a question of timing, since make runs more slowly in my special setup.

In my case, I try to build an old version of "dosfstools" with "make -j 8".
The makefile looks like this:
-----------------
    DESTDIR =
    PREFIX = /usr/local
    SBINDIR = $(PREFIX)/sbin
    DOCDIR = $(PREFIX)/share/doc
    MANDIR = $(PREFIX)/share/man

    OPTFLAGS = -O2 -fomit-frame-pointer $(shell getconf LFS_CFLAGS)
    WARNFLAGS = -Wall -Wextra -Wno-sign-compare -Wno-missing-field-initializers -Wmissing-prototypes -Wstrict-prototypes
    DEBUGFLAGS =
    CFLAGS += $(OPTFLAGS) $(WARNFLAGS) $(DEBUGFLAGS)

    VPATH = src

    all: build

    build: dosfsck dosfslabel mkdosfs

    dosfsck: boot.o check.o common.o fat.o file.o io.o lfn.o dosfsck.o

    dosfslabel: boot.o check.o common.o fat.o file.o io.o lfn.o dosfslabel.o

    mkdosfs: mkdosfs.o
    ...
-----------------

I don't see this issue if I replace this line:
    OPTFLAGS = -O2 -fomit-frame-pointer $(shell getconf LFS_CFLAGS)
with this one:
    OPTFLAGS = -O2 -fomit-frame-pointer -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64

When make is frozen, I can see the following process tree:
    ...
    |-sh---fakeroot---sb2-monitor-+-bash---bash---make---qemu-arm

Basically, qemu-arm is what actually runs the "getconf" command.
The interesting point is that "qemu-arm" is in Zombie state.
So it has already completed, but make hasn't yet done a waitpid for it.

I did my work with Make 3.81, but I noticed no change to the following parts of the code in the latest versions of Make.

In job.c, in the new_job() function, there is the following piece of code inside a while loop:

    /* Make sure we have a dup'd FD.  */
    if (job_rfd < 0)
      {
        DB (DB_JOBS, ("Duplicate the job FD\n"));
        job_rfd = dup (job_fds[0]);
      }
    [...]
    /* Reap anything that's currently waiting.  */
    reap_children (0, 0);

    /* Kick off any jobs we have waiting for an opportunity that
       can run now (i.e., waiting for load).  */
    start_waiting_jobs ();

    /* If our "free" slot has become available, use it; we don't need an
       actual token.  */
    if (!jobserver_tokens)
      break;

    /* There must be at least one child already, or we have no business
       waiting for a token.  */
    if (!children)
      fatal (NILF, "INTERNAL: no children as we go to sleep on read\n");
    [...]
    /* Set interruptible system calls, and read() for a job token.  */
    set_child_handler_action_flags (1, waiting_jobs != NULL);
    got_token = read (job_rfd, &token, 1);
    saved_errno = errno;
    set_child_handler_action_flags (0, waiting_jobs != NULL);

Basically, set_child_handler_action_flags() makes system calls interruptible by signals before the read (no automatic restart) and turns automatic restarting back on after the read. It also installs the following signal handler for the rest of the execution of the process:

    RETSIGTYPE
    child_handler (int sig UNUSED)
    {
      ++dead_children;

      if (job_rfd >= 0)
        {
          close (job_rfd);
          job_rfd = -1;
        }
      [...]
    }

The idea here is to be able to interrupt the blocking read if something happened to the child.

Later, if the process is able to acquire a "work slot", the shell command will be executed through the "func_shell" function of "function.c" (or the "func_shell_base" function in Make 4.0).

[ First, the pipedes pipe is created, and the shell command is run holding the write side of the pipe.
Then, the make (parent) process does a blocking read of pipedes[0] (read side, child output) until the child process completes. ]

    for (;;)
      {
        [...]
        EINTRLOOP (cc, read (pipedes[0], &buffer[i], maxlen - i));
        if (cc <= 0)
          break;
      }
    buffer[i] = '\0';

    /* Close the read side of the pipe.  */
    [...]
    (void) close (pipedes[0]);

Here, the blocking read relies on interrupted system calls not being retried automatically, so that any early SIGCHLD received by the program interrupts it and it does not risk staying blocked forever on the read.

So we arrive at the issue.
Sometimes, make is frozen at the following "close" call:
    (void) close (pipedes[0]);

But if I put a "printf" just before the close, the issue is no longer reproducible.

Looking at gdb when make is stuck, I can see the following backtrace:
----------------------------
0xb76557c4 in accept () at ../sysdeps/unix/sysv/linux/i386/socket.S:57
57      in ../sysdeps/unix/sysv/linux/i386/socket.S
(gdb) bt
#0  0xb76557c4 in accept () at ../sysdeps/unix/sysv/linux/i386/socket.S:57
#1  0xb783cec8 in ?? ()
#2  0xb77648c6 in rcmd_af (ahost=0xb783cec8, rport=0, locuser=0x0, remuser=0x0, cmd=0xb780d8a9 "\201\303\377\340\002", fd2p=0xb783b9a8, af=<value optimized out>) at rcmd.c:236
#3  0xb780d8da in ?? ()
#4  0xb780db1a in ?? ()
#5  0xb77fe415 in ?? ()
#6  0x08054430 in child_handler (sig=17) at job.c:436
#7  <signal handler called>
#8  0xb77648db in rcmd_af (ahost=0xb783cec8, rport=1, locuser=0x0, remuser=0xb783b9a8 "h\250\005", cmd=0xb780d919 "\201Ï\340\002", fd2p=0xb783b9a8, af=<value optimized out>) at rcmd.c:286
#9  0xb780d949 in ?? ()
#10 0xb77fe415 in ?? ()
#11 0x0805106e in func_shell (o=0x949f849 "ssing-field-initializer", argv=0xbf93f560, funcname=0x8067158 "shell") at function.c:1737
----------------------------

It may not be very clear from this output, but what happened is that we enter the close function to close pipedes[0] in frame #11, and during its execution the "child_handler" signal handler is triggered by the termination of the child subprocess.
Inside this signal handler, since "job_rfd" holds a valid descriptor, another close is called to close the pipe file descriptor identified by job_rfd.

So make is stuck trying to execute a close inside the signal handler, which was itself executed inside another close for a different file descriptor.

So, I have 2 theories:
1) There is something not signal safe inside my libc or environment.
2) In my particular setup, the environment often produces exactly the timing needed for the read to terminate because the child's output has ended, with the SIGCHLD signal then arriving just as the close function is called. (Bad luck :p)

Anyway, I think that there is something wrong in the current code that should be fixed, even if the issue is not really reproducible on other setups.

So, I have 2 proposed solutions that work correctly for the current code in "job.c":

1) Always close job_rfd after the read, so that the close in the signal handler can never be executed later than this read call.
-------------
    set_child_handler_action_flags (1, waiting_jobs != NULL);
    got_token = read (job_rfd, &token, 1);
    saved_errno = errno;
-> +if (job_rfd >= 0){
-> +    close (job_rfd);
-> +    job_rfd = -1;
-> +}
    set_child_handler_action_flags (0, waiting_jobs != NULL);
-------------
(Because I don't know whether it is really useful to preserve job_rfd for the next iteration just to avoid calling dup() again.)

2) Add a new variable that acts as a kind of "lock", to be sure that in child_handler the close will only be called during the read call of interest.
-------------
- RETSIGTYPE
- child_handler (int sig UNUSED)
- {
-   ++dead_children;
-
-   if (job_rfd >= 0)
-     {
-       close (job_rfd);
-       job_rfd = -1;
-     }
-   [...]
- }
+ int should_handler_close_rfd = 0;
+ RETSIGTYPE
+ child_handler (int sig UNUSED)
+ {
+   ++dead_children;
+
+   if (job_rfd >= 0 && should_handler_close_rfd == 1)
+     {
+       close (job_rfd);
+       job_rfd = -1;
+     }
+   [...]
+ }
...
  /* Set interruptible system calls, and read() for a job token.  */
- set_child_handler_action_flags (1, waiting_jobs != NULL);
- got_token = read (job_rfd, &token, 1);
- saved_errno = errno;
- set_child_handler_action_flags (0, waiting_jobs != NULL);
+ set_child_handler_action_flags (1, waiting_jobs != NULL);
+ should_handler_close_rfd = 1;
+ got_token = read (job_rfd, &token, 1);
+ saved_errno = errno;
+ should_handler_close_rfd = 0;
+ set_child_handler_action_flags (0, waiting_jobs != NULL);
-------------

(Option 2 is my favorite one.)

I hope you will be able to make sense of this big bug report :-)
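
For illustration only, the idea behind option 2 can be tried outside of make with a small standalone C sketch. This is not a patch against job.c: the names handler_may_close_rfd and demo_child_exit are hypothetical, the globals are simplified stand-ins for make's, and the handler is installed directly with sigaction() (without SA_RESTART) instead of going through set_child_handler_action_flags(). One assumption worth noting: the sketch raises the guard flag before fork(), because a SIGCHLD that arrives while the flag is still lowered would leave the descriptor open and the read could then block; where exactly the flag would have to be raised in new_job() relative to reap_children() is left open here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-ins for make's globals (not the real job.c code). */
static volatile sig_atomic_t job_rfd = -1;
static volatile sig_atomic_t handler_may_close_rfd = 0;
static volatile sig_atomic_t dead_children = 0;

static void child_handler(int sig)
{
    (void) sig;
    ++dead_children;
    /* Close the dup'd descriptor only while the guard flag is raised. */
    if (handler_may_close_rfd && job_rfd >= 0) {
        close(job_rfd);
        job_rfd = -1;
    }
}

/* Forks a child that exits at once.  Returns 1 if the handler closed
   job_rfd, 0 if it left it alone, -1 on a setup or wake-up failure. */
int demo_child_exit(int allow_close)
{
    int fds[2], closed, woke = 1;
    struct sigaction sa, old_sa;
    sigset_t chld;
    pid_t pid;
    char token;

    assert(allow_close == 0 || allow_close == 1);

    /* Install the handler without SA_RESTART so read() is interruptible. */
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = child_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, &old_sa) < 0)
        return -1;
    sigemptyset(&chld);
    sigaddset(&chld, SIGCHLD);
    sigprocmask(SIG_UNBLOCK, &chld, NULL);

    /* The parent keeps the write end open, so read() can never see EOF. */
    if (pipe(fds) < 0)
        return -1;
    job_rfd = dup(fds[0]);
    if (job_rfd < 0)
        return -1;

    /* Raise the guard before fork() so an early SIGCHLD is not missed. */
    handler_may_close_rfd = allow_close;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0);

    if (allow_close) {
        /* Fails with EINTR (signal during the read) or EBADF (signal
           arrived first and the handler already closed the descriptor). */
        ssize_t got = read(job_rfd, &token, 1);
        int saved_errno = errno;
        handler_may_close_rfd = 0;
        woke = (got < 0 && (saved_errno == EINTR || saved_errno == EBADF));
    }

    /* Reap the child; retry if the SIGCHLD handler interrupts the wait. */
    while (waitpid(pid, NULL, 0) < 0 && errno == EINTR)
        continue;

    closed = (job_rfd < 0);
    if (!closed) {
        close(job_rfd);
        job_rfd = -1;
    }
    close(fds[0]);
    close(fds[1]);
    sigaction(SIGCHLD, &old_sa, NULL);
    return woke ? closed : -1;
}
```

Called from a small main(), demo_child_exit(1) should return 1 (the handler closed the descriptor and the otherwise endless read was woken up), while demo_child_exit(0) should return 0 (the handler ran but left the descriptor alone, which is the behavior wanted outside of the token read).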