Re: "wait" loses signals
Date: Fri, 21 Feb 2020 10:07:25 -0500
From: Chet Ramey
Message-ID:

| That's just not reasonable. You're saying signals that are received before
| the wait builtin begins executing (say, while the command is being parsed,
| or the shell is doing some other bookkeeping task) should be considered
| to have arrived while the wait builtin is executing. I'm pretty sure that's
| not consistent with the letter or the spirit of the standard.

It quite clearly isn't consistent. What the standard says is:

    When the shell is waiting, by means of the wait utility, for asynchronous commands to complete, the reception of a signal for which a trap has been set shall cause the wait utility to return immediately with an exit status >128, immediately after which the trap associated with that signal shall be taken.

Note: "when the shell is waiting for an asynchronous command to complete" (when that happens as a result of the user/script executing the wait utility) then ...

What Denys is failing to realise, is that the standard describes what shells do (or more accurately perhaps, did, in the late 1980's or early 1990's) not what someone might want them to do. And that is, when the wait/waitpid/wait3/wait4/waitid/wait6 (whatever the shell uses) system call returns EINTR, the wait utility exited with a status indicating it was interrupted by that signal (status > 128 means 128+SIGno) and runs the trap.

Because that is what shells actually did - the alternative being to simply restart the wait on EINTR like many other system calls that are interrupted by signals are conventionally restarted. Like it or not, that's what shells did, what most still do, and what the standard says must be done.

Apart from that, and not interrupting a wait for a foreground process, the standard says very little about when traps should be run, and sorry Harald, but your "as soon as" from ...
har...@gigawatt.nl said: | In the same way, I think that except when overridden by 2.11, the "when" | in "Otherwise, the argument action shall be read and executed by the | shell when one of the corresponding conditions arises." should be | interpreted as "as soon as". The only way to do that literally would be to run the trap from the signal handler, as that is "as soon as" the condition arises. But I think we all know that is simply not possible. So let's read that as "as soon as possible after" instead. That's getting more reasonable, but someone needs to decide just what is possible - will running the trap handler mess up the shell's internal state while a new command is parsed and executed? Eg: what if we had VAR=$(grep -c some_string file*.c) and a (trapped) signal arrives while grep is running (more correctly, while the process running the command substitution, which runs grep, is running). We know we cannot interrupt the wait for that foreground process to run the trap handler, so we don't - but do we execute the trap handler before we assign the answer to VAR ? This kind of thing is why shells in general only normally even look to see if there is a trap handler waiting to run after completing executing commands, not in the middle of one. The relevance of this is that if a signal arrives while the wait command is executing (or as Chet suggested, while doing whatever housekeeping is needed to prepare to run it, like looking to see what command comes next) but before the relevant wait*() system call is running, the trap won't be run until after the wait command completes. That's the way shells have always worked, and the way the standard (for that very reason) says can be relied upon by scripts - which is much of its purpose, to tell script writers what they can expect will work, and what will not necessarily work. 
Now the standard doesn't preclude a shell from looking for pending traps as frequently as it wants to, every second line of C code in the shell could be if (traps_pending) run_trap_handler(); But most shell authors (I believe) wouldn't consider that reasonable. The standard also doesn't preclude a shell from taking extra measures to push the arrival of a signal in the wait utility down to occur in the wait system call (or whatever replaces it). Old shells didn't do that, as there simply was no mechanism for that, and using SIGCHLD was always problematic because of its quite different implementation of different (now ancient) systems, hence we have what we have. The standard is not a legislature, and does not change the rules just because what is there doesn't look reasonable, or you don't like it. If you want things changed, convince the major shell maintainers that this race condition is something they should make their shell go slower to fix (because that's really all it takes on modern systems) and wait for them to comply. When most major shells (perhaps all major shells, and some of the others) have implemented what you want
Re: "wait" loses signals
On 2/24/20 9:59 AM, Robert Elz wrote:
> And that is, when the wait/waitpid/wait3/wait4/waitid/wait6 (whatever the
> shell uses) system call returns EINTR, the wait utility exited with a
> status indicating it was interrupted by that signal (status > 128 means
> 128+SIGno) and runs the trap.

This is racy. Even if you try to code it as tightly as possible:

    if (got_sigs) { handle signals }
    got_sigs = 0;
    pid = waitpid(...); /* without WNOHANG */
    if (pid < 0 && errno == EINTR) { handle signals }

since signals can be delivered not only while waitpid() syscall is in kernel, but also when we are only about to enter the kernel - and in this case, the shell's sighandler will set the flag variable, but then we enter the kernel *and sleep*.

> Because that is what shells actually did - the alternative being to simply
> restart the wait on EINTR like many other system calls that are interrupted
> by signals are conventionally restarted. Like it or not, that's what shells
> did, what most still do, and what the standard says must be done.

Standard does not say that. It says "when the shell is waiting for an asynchronous command to complete", it does not say "when the shell is waiting in a waitpid() syscall".

Yes, you are right, you can argue that shell is minimally fulfilling standard's requirement if it does something like my code example.

I am arguing that it can be made better: it can be coded so that signal has no time window to arrive before waitpid() but have its trap delayed to after "wait" builtin ends (which might be "never", mind you).
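For illustration only, the check-then-wait pattern quoted above could be made concrete roughly as follows. This is a minimal sketch, not code from any real shell: the names got_sig, trap_signal, take_signal and naive_wait are hypothetical, and the handler is assumed to be installed without SA_RESTART so that waitpid() fails with EINTR. The comment marks the window in which a signal sets the flag and the process then sleeps anyway.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* All names here are hypothetical; this is not taken from any real shell. */
static volatile sig_atomic_t got_sig;   /* last trapped signal, 0 if none */

static void on_signal(int signo)
{
    got_sig = signo;
}

/* Install the handler without SA_RESTART so waitpid() returns EINTR. */
static int trap_signal(int signo)
{
    struct sigaction sa;

    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signo, &sa, NULL);
}

/* Consume the recorded signal and turn it into a "wait" exit status. */
static int take_signal(void)
{
    int signo = got_sig;

    got_sig = 0;
    return 128 + signo;
}

/* Returns the child's exit status, 128+signo if a trapped signal was
 * noticed, or -1 on error. */
static int naive_wait(pid_t pid)
{
    int status;

    if (got_sig)                    /* signal recorded before we started */
        return take_signal();
    /*
     * RACE WINDOW: a signal delivered here sets got_sig, but nothing
     * looks at the flag again before waitpid() blocks.
     */
    for (;;) {
        pid_t r = waitpid(pid, &status, 0);

        if (r == pid)
            break;
        if (r < 0 && errno != EINTR)
            return -1;
        if (got_sig)                /* EINTR caused by a trapped signal */
            return take_signal();
    }
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    return 128 + WTERMSIG(status);
}
```

A caller would install the handler once with trap_signal(SIGUSR1) and then call naive_wait(pid) for a forked child. A signal recorded before the call, or one that interrupts waitpid(), yields 128+signo; one that lands in the marked window is only noticed after the child exits.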
Re: "wait" loses signals
Date: Mon, 24 Feb 2020 11:50:55 +0100
From: Denys Vlasenko
Message-ID: <47762f41-e393-30cd-50ed-43c6bdd29...@redhat.com>

| This is racy. Even if you try to code it as tightly as possible:

Absolutely, I agree. The question is more whether it really matters.

| Standard does not say that. It says "when the shell is waiting for an
| asynchronous command to complete", it does not say "when the shell is
| waiting in a waitpid() syscall".

That's because the standard has no notion of "system calls", just functions, but the shell is not actually waiting (it is doing something else) until the system call causes it to pause if the desired (or any) child is not ready for reaping.

| Yes, you are right, you can argue that shell is minimally fulfilling
| standard's requirement if it does something like my code example.

It doesn't even need to do that. As I said, the standard's primary purpose is to advise script writers what they can depend upon the shell providing. And a wait utility that is race free wrt traps is not one of those things. That's because what scripts can rely upon is based upon what shells implement (or implemented at the time - with some more recent additions for some more modern functionality that has been widely adopted). Even now, as was demonstrated, most shells have this "issue" - hence the standard simply cannot tell users that they can rely on something else. Any attempt to read it otherwise than that is simply wrong, and obviously so (though sometimes it is possible to argue that the wording used does not express the intent obviously enough - or occasionally - at all, but when that happens, all you will ever get as the best possible result is corrected wording that says what it intended to say in the first place).

The standard also serves to advise shell authors what they need to do to provide a shell which should run all conformant shell applications, but it would be grossly unfair (and improper) to require of new shells something that old ones didn't do.
But that side of it is less relevant to this discussion, except that it doesn't tell shell authors to make sure there are no race conditions wrt traps in the wait utility (it would do that in quite different language than this, but that would be the point, if it were there). | I am arguing that it can be made better: That part is arguable | it can be coded so that signal has no time window to arrive before | waitpid() but have its trap delayed to after "wait" builtin ends | (which might be "never", mind you). It can be so coded, but when done (correctly, and assuming a trapped signal has arrived) it won't be never, the signal will interrupt the sys call that actually pauses (which will most likely not be wait*() in this case, but that's irrelevant) and the wait would correctly exit. A few shells have done that. The question is whether it is worth going to that extra effort - or in other words, is it really better. As best I can tell, it only really matters to shell scripts attempting to use signals/traps as an IPC mechanism, and that I simply don't believe they should be doing - programs that need that kind of functionality should be written in a language that provides more suitable mechanisms (and usually not only for simple one bit message passing that a signal offers). There are lots of programming languages around, they each have their particular niche - the reason their inventors created them in the first place. Use an appropriate one, rather than attempting to shoehorn some feature that is needed into a language that was never intended for it - just because you happen to be a big fan of that language. Spread your wings, learn a new one - the hard part about any programming isn't the programming language, it is getting the desired concepts and structures straight - do that and any competent programmer can make a working program in any suitable language (ie: not expecting anyone to write an operating system in COBOL) fairly quickly. 
They'll make it better after they get used to the idioms of the language, but providing the method needed to solve the problem is known first (that's usually the hard part, for anything non trivial) the actual coding into a working, if not necessarily ideal, form is simple. kre
Re: "wait" loses signals
> There are lots of programming languages around, they each have their > particular > niche - the reason their inventors created them in the first place. Use > an > appropriate one, rather than attempting to shoehorn some feature that is > needed > into a language that was never intended for it - just because you happen > to > be a big fan of that language. Spread your wings, learn a new one That is a poor excuse for not fixing bugs. Maybe you can torture the standards into confessing that this behavior isn't a bug. This behavior nevertheless surprises people and nevertheless precludes various things people want to do with a shell. Don't you think it's better that programs work reliably than that they don't? Of course something working intuitively 99.9% of the time and hanging 0.1% of the time is a bug. It's not appropriate to treat that 0.1% hang as some kind of cosmic punishment for using shell in a manner you find inappropriate: remember when we believed in mechanism, not policy? Nor is the presence of the bug in other shells adequate justification for leaving this one in a bad state. I've never understood the philosophy of people who want to leave bugs unfixed. No, it's not that much trouble to fix the bug. The techniques for fixing this kind of signal race are well-known. In particular, instead of waitpid, you use a self-pipe and signal the pipe in the signal handler, and you have a signal handler for SIGCHLD. If we had a pwaitpid (like pselect) we could use that too. If I could get Chet to look at my patches, I'd fix it myself.
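The self-pipe approach mentioned above might look roughly like the following C sketch. It is an illustration under stated assumptions, not a patch for bash or any other shell: the names self_pipe, notify, self_pipe_init and self_pipe_wait are hypothetical, only one trapped signal is handled, and both pipe ends are made non-blocking. The handler writes the signal number into the pipe, so a signal that arrives before the process sleeps leaves a byte behind and poll() returns at once.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* All names here are hypothetical; this is not taken from any real shell. */
static int self_pipe[2] = { -1, -1 };

/* Handler for SIGCHLD and the trapped signal: record it in the pipe. */
static void notify(int signo)
{
    int saved = errno;
    unsigned char b = (unsigned char)signo;
    ssize_t n = write(self_pipe[1], &b, 1);   /* async-signal-safe */

    (void)n;    /* a full pipe already guarantees a wakeup */
    errno = saved;
}

static int self_pipe_init(int trapped_signo)
{
    struct sigaction sa;

    if (pipe(self_pipe) < 0)
        return -1;
    for (int i = 0; i < 2; i++) {
        int fl = fcntl(self_pipe[i], F_GETFL);

        if (fl < 0 || fcntl(self_pipe[i], F_SETFL, fl | O_NONBLOCK) < 0)
            return -1;
    }
    sa.sa_handler = notify;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGCHLD, &sa, NULL) < 0)
        return -1;
    return sigaction(trapped_signo, &sa, NULL);
}

/* Returns the child's exit status, 128+signo if the trapped signal
 * arrived (before or during the wait), or -1 on error. */
static int self_pipe_wait(pid_t pid)
{
    for (;;) {
        struct pollfd pfd = { .fd = self_pipe[0], .events = POLLIN };
        unsigned char b;
        int status;
        pid_t r;

        /* Drain recorded signals; a trapped signal ends the wait. */
        while (read(self_pipe[0], &b, 1) == 1)
            if (b != SIGCHLD)
                return 128 + b;

        r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return WIFEXITED(status) ? WEXITSTATUS(status)
                                     : 128 + WTERMSIG(status);
        if (r < 0)
            return -1;

        /* Sleep on the pipe, not in waitpid(): a byte written at any
         * earlier point makes poll() return immediately. */
        if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
            return -1;
    }
}
```

The cost relative to a plain blocking waitpid() is a few extra system calls per wait (read, waitpid with WNOHANG, poll), which is the trade-off debated in this thread.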
Re: "wait" loses signals
Date: Mon, 24 Feb 2020 04:58:31 -0800
From: "Daniel Colascione"
Message-ID: <07d1441d41280e6f9535048d6485.squir...@dancol.org>

| That is a poor excuse for not fixing bugs.

Only if they are bugs.

| Maybe you can torture the standards into confessing that this
| behavior isn't a bug.

No torture required. Once again, the standard documents the way users can expect shells to behave. That is what a standard is - a common set of agreed operations (or whatever is appropriate for the object being standardised). It does not (or should not) ever invent new stuff and require it. Shells have always worked this way, so that is how the standard is written - that is what users can expect to happen (that is why it is called a "standard" after all).

Once again, you are free to attempt to get shells to change this way of working, and if you can get a suitable set to agree, and implement something new, that more meets your needs, then perhaps that might one day become the standard, and later appear in the standards document. New and/or changed features do happen, especially when they don't break backwards compatibility, which this wouldn't.

| This behavior nevertheless surprises people

Lots of things surprise people.

| and nevertheless precludes various things
| people want to do with a shell.

That was my point, that you just labelled a poor excuse. Not everything is suitable for implementation in sh. Sometimes you simply have to go elsewhere. Wanting to do it in shell doesn't make it reasonable or possible. I want the shell to feed my dog, where is the dogfood option?

| Don't you think it's better that programs
| work reliably than that they don't?

Yes, when they are written correctly.

| Of course something working intuitively 99.9% of the time and
| hanging 0.1% of the time is a bug.

Nonsense. An alternative explanation is that your intuition is wrong, and that it often works that way is just by chance.
The program is broken because it is making unjustified assumptions about how things are specified to work. This is the kind of common error that people who program (in any language) by guesswork often make "I saw Fred did this, and I tried it, and it worked for me like I thought it would, so it must do this similar thing like I think it will too". Rubbish. | I've never understood the philosophy of people who want to leave | bugs unfixed. Nor have I, except sometimes perhaps when it comes to costs. But the issue here is whether this is a bug. Your belief that it is does not make it so. | No, it's not that much trouble to fix the bug. It isn't, if it needs fixing - but any fix for this will slow the shell (for what that matters, but some people care). Further there are simpler cheaper techniques than the one described. | If we had a pwaitpid (like pselect) we could use that too. Yes, if. If that existed a fix would be almost cost free. If. I suspect that before you can get bash (note: I am no authority and have no voice in these decisions, I work on a different shell) to make use of something like that it would need to be implemented in quite a lot of systems, including the commercial ones, which tend to be very conservative about adding new features for fun. kre
Re: "wait" loses signals
> Date:Mon, 24 Feb 2020 04:58:31 -0800 > From:"Daniel Colascione" > Message-ID: <07d1441d41280e6f9535048d6485.squir...@dancol.org> > > | That is a poor excuse for not fixing bugs. > > Only if they are bugs. That executing traps except in case you lose one rare race is painfully obvious. > > | Maybe you can torture the standards into confessing that this > | behavior isn't a bug. > > No torture required. Once again, the standard documents the way users > can expect shells to behave. I refuse to let the standard cap the quality of a shell's implementation. Missing signals this way is pure negative. It doesn't add to any capability or help any user. It can only make computing unreliable and hurt real users trying to automate things with shell. > That is what a standard is - a common set > of agreed operations A standard is a bare minimum. > attempt to get shells to change this way of > working, and if you can get a suitable set to agree, and implement > something > new, that more meets your needs, then perhaps that might one day become > the > standard, This opposition to doing more than the bare minimum that the standard requires makes this task all the much harder. > | This behavior nevertheless surprises people > > Lots of things surprise people. Sometimes people deserve to be surprised. This isn't one of those times. > | and nevertheless precludes various things > | people want to do with a shell. > > That was my point, that you just labelled a poor excuse. Not everything > is suitable for implementation in sh. Sometimes you simply have to go > elsewhere. Making people go elsewhere *on purpose* by refusing to fix bugs is not good software engineering. > Wanting to do it in shell doesn't make it reasonable or > possible. It is reasonable and possible. All that's needed is to make an existing operation that's almost perfectly reliable in fact perfectly reliable, and as I've mentioned, it's not that hard. 
> I want the shell to feed my dog, where is the dogfood option?

We're talking about fixing an existing shell feature, not adding a new one.

> | Don't you think it's better that programs
> | work reliably than that they don't?
>
> Yes, when they are written correctly.

By fixing this bug, we make a class of programs correct automatically.

> | Of course something working intuitively 99.9% of the time and
> | hanging 0.1% of the time is a bug.
>
> Nonsense. An alternative explanation is that your intuition is wrong,
> and that it often works that way is just by chance.

We're talking about a documented feature that users expect to work a certain way and that almost always *does* work that way and that diverges from this behavior only under rare circumstances. Not the same as spacebar heating.

> The program is
> broken because it is making unjustified assumptions about how things are
> specified to work.

This moralistic outlook is not helpful. It doesn't *matter* whether a program is right or wrong or making unjustified assumptions or not. Punishing programs does not make the world better. When a piece of infrastructure can transform these programs from incorrect to correct at next to zero cost, it behooves that infrastructure component to do that.

> This is the kind of common error that people who
> program (in any language) by guesswork often make "I saw Fred did this,
> and I tried it, and it worked for me like I thought it would, so it
> must do this similar thing like I think it will too". Rubbish.

Ever hear of the "pit of success" [1]? It's the idea that software gets better when you make the intuitive thing happen to be the correct thing. Why should we require a degree of cleverness greater than what a domain requires? Why *not* make it so that, to the greatest extent possible, "I saw Fred do this" leads people to good patterns? Like I said before, making things difficult on purpose doesn't actually achieve anything.
[1] https://docs.microsoft.com/en-us/archive/blogs/brada/the-pit-of-success > > | I've never understood the philosophy of people who want to leave > | bugs unfixed. > > Nor have I, except sometimes perhaps when it comes to costs. But the > issue here is whether this is a bug. Your belief that it is does not make > it so. Your belief that this behavior is acceptable doesn't make it so --- except under a pointlessly literal interpretation of the standards. > | No, it's not that much trouble to fix the bug. > > It isn't, if it needs fixing - but any fix for this will slow the shell > (for what that matters, but some people care). Further there are simpler > cheaper techniques than the one described. The fix for this issue will not meaningfully affect the speed of the shell. Instead of waiting on waitpid directly, we wait on a pipe. Plenty of programs do this already. Micro-optimizing for system call count will hardly slow the shell: other factors matter a lot
Re: "wait" loses signals
On 2/24/20 3:59 AM, Robert Elz wrote:
> The relevance of this is that if a signal arrives while the wait command
> is executing (or as Chet suggested, while doing whatever housekeeping is
> needed to prepare to run it, like looking to see what command comes next)
> but before the relevant wait*() system call is running, the trap won't
> be run until after the wait command completes.

There are two separate cases here: if the signal arrives before the wait command has begun executing (during `housekeeping') or if it arrives after the wait command has begun running but before it calls whatever system call it uses to wait for children.

The second case is relatively easy to solve; Jilles wrote a message detailing the alternatives. Bash uses the longjmp-out-of-the-trap-signal-handler mechanism. The trap handler only has to know that the wait builtin is running and that there's a valid saved environment to longjmp to.

The first case is trickier: there's always going to be a window between the time the shell checks for pending traps and the time the wait builtin starts to run. You can't really close it unless you're willing to run the trap out of the signal handler, which everyone agrees is a bad idea, but you can squeeze it down to practically nothing.

I think I've got a way to close that and make signals that arrive in that first case act as if they arrived `while the shell is waiting by means of the wait utility'. It's not much code and not disruptive. With that, bash runs the original test script (100,000 iterations) on RHEL7 and macOS without a `stray' sleep. It's in the git devel branch.

I'm going to defer the question of whether or not that's the `right' thing to do -- people have been trying to make signals into an IPC mechanism since Berkeley introduced `reliable signals'.

Can we all take a breath now?
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
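As a rough illustration of the general "longjmp out of the signal handler" technique described in the message above, here is a C sketch. It is not bash's actual code: the names wait_env, wait_active, pending_sig, jump_wait_init and jump_wait are hypothetical, and the extra pending_sig check (treating a signal recorded just before the wait as if it arrived during it) is an assumption of this sketch. One known limitation is noted in the comments.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <setjmp.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* All names here are hypothetical; this is not bash's implementation. */
static sigjmp_buf wait_env;                 /* valid only while wait_active */
static volatile sig_atomic_t wait_active;   /* is the "wait" running? */
static volatile sig_atomic_t pending_sig;   /* last trapped signal, or 0 */

static void on_trap(int signo)
{
    pending_sig = signo;
    if (wait_active) {
        wait_active = 0;
        siglongjmp(wait_env, 1);    /* abandon the wait immediately */
    }
}

static int jump_wait_init(int signo)
{
    struct sigaction sa;

    sa.sa_handler = on_trap;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signo, &sa, NULL);
}

static int take_pending(void)
{
    int signo = pending_sig;

    pending_sig = 0;
    return 128 + signo;
}

/* Returns the child's exit status, 128+signo if a trapped signal ended
 * the wait, or -1 on error. */
static int jump_wait(pid_t pid)
{
    int status;

    /* savemask=1: the signal mask is restored when we jump back here. */
    if (sigsetjmp(wait_env, 1) != 0)
        return take_pending();      /* reached via siglongjmp */

    wait_active = 1;
    if (pending_sig) {              /* recorded before wait_active was set */
        wait_active = 0;
        return take_pending();
    }
    /* From here on a trapped signal jumps out, whether it lands just
     * before waitpid() or while waitpid() is blocked. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {
            wait_active = 0;
            return -1;
        }
    }
    /* Limitation of this sketch: a signal arriving right here still
     * jumps out, and the status of the already reaped child is lost. */
    wait_active = 0;
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    return 128 + WTERMSIG(status);
}
```

The handler does almost nothing unless wait_active is set, which matches the description that it only needs to know that the wait is running and that a valid saved environment exists.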
Re: "wait" loses signals
On 2/24/20 7:58 AM, Daniel Colascione wrote:
> No, it's not that much trouble to fix the bug. The techniques for fixing
> this kind of signal race are well-known. In particular, instead of
> waitpid, you use a self-pipe and signal the pipe in the signal handler,
> and you have a signal handler for SIGCHLD.

You've just substituted a real IPC mechanism (pipes) for the one people are trying to make signals into.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
Re: "wait" loses signals
On 2/24/20 5:18 PM, Chet Ramey wrote:
> The first case is trickier: there's always going to be a window between
> the time the shell checks for pending traps and the time the wait builtin
> starts to run. You can't really close it unless you're willing to run the
> trap out of the signal handler, which everyone agrees is a bad idea, but
> you can squeeze it down to practically nothing.

dash uses something along these lines:

    sigfillset(&mask);
    sigprocmask(SIG_SETMASK, &mask, &mask);  /* block everything; mask now holds the old mask */
    while (!pending_sig)
        sigsuspend(&mask);                   /* atomically unblock and sleep */
    sigprocmask(SIG_SETMASK, &mask, NULL);   /* restore the old mask */
    if (pending_sig)
        handle_signals(pending_sig);
    pid = waitpid(... WNOHANG);

It sleeps in sigsuspend(), not in waitpid(). This way we wait for both signals *and* children (by virtue of getting SIGCHLD for them).
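A self-contained variant of the sigsuspend() idea might look like the C sketch below. It is only an approximation written for this thread, not dash's source: the names pending_sig, suspend_wait_init and suspend_wait are hypothetical, a single trapped signal is assumed, and the caller's original mask is assumed not to block that signal or SIGCHLD. Because every check happens with signals blocked, a handler can only run inside sigsuspend(), so there is no window between testing the flag and going to sleep.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* All names here are hypothetical; this is not dash's implementation. */
static volatile sig_atomic_t pending_sig;   /* last trapped signal, or 0 */

static void on_signal(int signo)
{
    /* SIGCHLD needs a handler only so that sigsuspend() returns. */
    if (signo != SIGCHLD)
        pending_sig = signo;
}

static int suspend_wait_init(int trapped_signo)
{
    struct sigaction sa;

    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGCHLD, &sa, NULL) < 0)
        return -1;
    return sigaction(trapped_signo, &sa, NULL);
}

/* Returns the child's exit status, 128+signo if the trapped signal
 * arrived (before or during the wait), or -1 on error. */
static int suspend_wait(pid_t pid)
{
    sigset_t all, old;
    int status, result;

    sigfillset(&all);
    sigprocmask(SIG_SETMASK, &all, &old);   /* no handler runs from here on */
    for (;;) {
        pid_t r;

        if (pending_sig) {                  /* checked with signals blocked */
            result = 128 + pending_sig;
            pending_sig = 0;
            break;
        }
        r = waitpid(pid, &status, WNOHANG); /* never sleeps here */
        if (r == pid) {
            result = WIFEXITED(status) ? WEXITSTATUS(status)
                                       : 128 + WTERMSIG(status);
            break;
        }
        if (r < 0) {
            result = -1;
            break;
        }
        /* Atomically restore the old mask and sleep; a signal that became
         * pending while blocked is delivered at once. */
        sigsuspend(&old);
    }
    sigprocmask(SIG_SETMASK, &old, NULL);
    return result;
}
```

Compared with a blocking waitpid(), this adds two sigprocmask() calls and at least one extra waitpid() per wait, which is the kind of overhead discussed earlier in the thread.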
Re: "wait" loses signals
On 24/02/2020 08:59, Robert Elz wrote: har...@gigawatt.nl said: | In the same way, I think that except when overridden by 2.11, the "when" | in "Otherwise, the argument action shall be read and executed by the | shell when one of the corresponding conditions arises." should be | interpreted as "as soon as". The only way to do that literally would be to run the trap from the signal handler, as that is "as soon as" the condition arises. But I think we all know that is simply not possible. So let's read that as "as soon as possible after" instead. Sure. That's getting more reasonable, but someone needs to decide just what is possible - will running the trap handler mess up the shell's internal state while a new command is parsed and executed? Eg: what if we had VAR=$(grep -c some_string file*.c) and a (trapped) signal arrives while grep is running (more correctly, while the process running the command substitution, which runs grep, is running). We know we cannot interrupt the wait for that foreground process to run the trap handler, so we don't - but do we execute the trap handler before we assign the answer to VAR ? Although 2.11 that you referred to states "When a signal for which a trap has been set is received while the shell is waiting for the completion of a utility executing a foreground command", that is not what any shell implements. Instead, what shells implement is more like "while the shell is waiting for the completion of a foreground command". Consider for instance (sleep 5): the sleep command run in a subshell. The parent shell is not waiting for the completion of a utility executing a foreground command, the parent shell is waiting for the completion of the subshell, which is not a utility. Nevertheless, shells do not run any trap action until after the subshell has completed. This is just sloppy wording in the standard. 
It is probably written this way so that it is clear that given { foo; bar; }, if a signal is received while foo is running, any trap action runs before bar. The whole compound command shouldn't be considered the foreground command, only foo should be. In your example, I would expect the whole of VAR=$(...) to be considered the foreground command that the shell is waiting for, and that is what almost all shells do. A notable exception is zsh. This kind of thing is why shells in general only normally even look to see if there is a trap handler waiting to run after completing executing commands, not in the middle of one. The relevance of this is that if a signal arrives while the wait command is executing (or as Chet suggested, while doing whatever housekeeping is needed to prepare to run it, like looking to see what command comes next) but before the relevant wait*() system call is running, the trap won't be run until after the wait command completes. That's the way shells have always worked, and the way the standard (for that very reason) says can be relied upon by scripts - which is much of its purpose, to tell script writers what they can expect will work, and what will not necessarily work. You say "have always worked", but I'd like to point out that this whole thing started because I was looking at code that Herbert Xu had changed in dash to avoid this race back in 2009. That's over 10 years ago now. The behaviour of dash before that, and several shells now, can not, or at least not now, be said to be how shells have always worked. Cheers, Harald van Dijk