> The trace I just captured looks like this: > > 08:12:54.424327 clone(child_stack=0, > flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, > child_tidptr=0x7fdf2ad8e9f0) = 15397 > 08:12:54.424394 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 > 08:12:54.424432 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 > 08:12:54.424463 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 > 08:12:54.424486 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 > 08:12:54.424510 rt_sigaction(SIGINT, {0x43c990, [], SA_RESTORER, > 0x301f833140}, {SIG_DFL, [], SA_RESTORER, 0x301f833140}, 8) = 0 > 08:12:54.424541 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], > 0, NULL) = 15397 > 08:12:54.424668 --- SIGINT (Interrupt) @ 0 (0) --- > 08:12:54.424680 rt_sigreturn(0x2) = 15397 > 08:12:54.424702 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, > 0x301f833140}, {0x43c990, [], SA_RESTORER, 0x301f833140}, 8) = 0 > 08:12:54.424739 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 > 08:12:54.424761 --- SIGCHLD (Child exited) @ 0 (0) --- > 08:12:54.424772 wait4(-1, 0x7fff0a9956dc, WNOHANG, NULL) = -1 ECHILD > (No child processes) > 08:12:54.424793 rt_sigreturn(0xffffffffffffffff) = 0 > 08:12:54.424824 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 > 08:12:54.424850 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 > 08:12:54.424881 rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0 > > and I think one reason why the race is hard to get rid of is simply > that system call return is _the_ common point of signal handling in > UNIX (technically, obviously any return to user space, but there are > no appreciable interrupts etc going on, and there _are_ a lot of > system calls). The above trace is one that my patch would have handled > correctly (it has no EINTR).
The one thing that jumps out at me here is the number of signal-handling system calls. wait_sigint_handler gets installed and removed as the SIGINT handler a lot. I wonder if we would see an improvement if I used a global SIGINT handler that understood what to do with wait_sigint_received. That's a change for down the road. > And there aren't a lot of different system calls in the tight loop > here: that's literally the whole loop of the "while : ; do /bin/true; > done" thing. So most of the time, SIGINT will happen at _one_ of those > system calls, and so even if you set that "waiting_for_child" flag > right around the call to "wait4()", the race is actually much easier > to hit than you'd think from "it's just a few CPU instructions". > > So the above trace is one that my patch would have handled correctly > (since it has no EINTR). Maybe a combination of the two approaches > would work even better? I think we're eventually going to converge on a hybrid solution. Chet -- ``The lyf so short, the craft so long to lerne.'' - Chaucer ``Ars longa, vita brevis'' - Hippocrates Chet Ramey, ITS, CWRU c...@case.edu http://cnswww.cns.cwru.edu/~chet/