On Mon, Mar 7, 2011 at 7:52 AM, Chet Ramey <chet.ra...@case.edu> wrote: > > Let's take a step back and approach this a different way. Instead of > trying to intuit whether or not the child did anything with the SIGINT, > let's try to make the race condition smaller.
I can still easily see the porblem. I think your patch is an improvement over plain bash, but I can still see the problem. How much harder it is to see is hard to say - it's pretty subjective. But I think my patch (the second one, the one that had the "working" EINTR logic) made it _harder_ to trigger than yours does. The trace I just captured looks like this: 08:12:54.424327 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fdf2ad8e9f0) = 15397 08:12:54.424394 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 08:12:54.424432 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 08:12:54.424463 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 08:12:54.424486 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 08:12:54.424510 rt_sigaction(SIGINT, {0x43c990, [], SA_RESTORER, 0x301f833140}, {SIG_DFL, [], SA_RESTORER, 0x301f833140}, 8) = 0 08:12:54.424541 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 15397 08:12:54.424668 --- SIGINT (Interrupt) @ 0 (0) --- 08:12:54.424680 rt_sigreturn(0x2) = 15397 08:12:54.424702 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x301f833140}, {0x43c990, [], SA_RESTORER, 0x301f833140}, 8) = 0 08:12:54.424739 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 08:12:54.424761 --- SIGCHLD (Child exited) @ 0 (0) --- 08:12:54.424772 wait4(-1, 0x7fff0a9956dc, WNOHANG, NULL) = -1 ECHILD (No child processes) 08:12:54.424793 rt_sigreturn(0xffffffffffffffff) = 0 08:12:54.424824 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 08:12:54.424850 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 08:12:54.424881 rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0 and I think one reason why the race is hard to get rid of is simply that system call return is _the_ common point of signal handling in UNIX (technically, obviously any return to user space, but there are no appreciable interrupts etc going on, and there _are_ a lot of system calls). The above trace is one that my patch would have handled correctly (it has no EINTR). And there aren't a lot of different system calls in the tight loop here: that's literally the whole loop of the "while : ; do /bin/true; done" thing. So most of the time, SIGINT will happen at _one_ of those system calls, and so even if you set that "waiting_for_child" flag right around the call to "wait4()", the race is actually much easier to hit than you'd think from "it's just a few CPU instructions". So the above trace is one that my patch would have handled correctly (since it has no EINTR). Maybe a combination of the two approaches would work even better? Linus