Re: "wait" loses signals

2020-02-24 Thread Robert Elz
Date: Fri, 21 Feb 2020 10:07:25 -0500
From: Chet Ramey
Message-ID:  

  | That's just not reasonable. You're saying signals that are received before
  | the wait builtin begins executing (say, while the command is being parsed,
  | or the shell is doing some other bookkeeping task) should be considered
  | to have arrived while the wait builtin is executing. I'm pretty sure that's
  | not consistent with the letter or the spirit of the standard.

It quite clearly isn't consistent; what the standard says is:

 When the shell is waiting, by means of the wait utility, for
 asynchronous commands to complete, the reception of a signal for
 which a trap has been set shall cause the wait utility to return
 immediately with an exit status >128, immediately after which the
 trap associated with that signal shall be taken.

Note: "when the shell is waiting for an asynchronous command to complete"
(when that happens as a result of the user/script executing the wait utility)
then ...

What Denys is failing to realise, is that the standard describes what shells
do (or more accurately perhaps, did, in the late 1980's or early 1990's)
not what someone might want them to do.

And that is, when the wait/waitpid/wait3/wait4/waitid/wait6 (whatever the
shell uses) system call returns EINTR, the wait utility exits with a
status indicating it was interrupted by that signal (status > 128 means
128+SIGno) and runs the trap.

Because that is what shells actually did - the alternative being to
simply restart the wait on EINTR like many other system calls that are
interrupted by signals are conventionally restarted.

Like it or not, that's what shells did, what most still do, and what
the standard says must be done.
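
As a rough illustration of the behaviour described above, here is a minimal
C sketch of a wait that reports 128+signo when waitpid() fails with EINTR.
It is not taken from any shell's source; the names last_signal, note_signal,
trap_signal and wait_builtin are hypothetical, and job control, multiple
children and the actual running of the trap action are left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Number of the last trapped signal, written only by the handler. */
static volatile sig_atomic_t last_signal = 0;

static void note_signal(int signo)
{
    last_signal = signo;
}

/* Hypothetical helper: trap signo WITHOUT SA_RESTART, so that a blocked
 * waitpid() fails with EINTR instead of being restarted by the kernel. */
static int trap_signal(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = note_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signo, &sa, NULL);
}

/* Hypothetical model of "wait PID": the child's exit status, or
 * 128 + signo if a trapped signal interrupts the waitpid() call. */
static int wait_builtin(pid_t pid)
{
    int status;

    assert(pid > 0);
    last_signal = 0;
    /* A signal that lands here, before the kernel sleeps, is not noticed
     * until the child exits: this is the race debated in the thread. */
    if (waitpid(pid, &status, 0) < 0) {
        if (errno == EINTR && last_signal != 0)
            return 128 + last_signal;   /* caller would run the trap next */
        return 127;                     /* e.g. ECHILD: no such child */
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : 128 + WTERMSIG(status);
}
```

A caller would install the handler with trap_signal(SIGUSR1), fork a child,
and treat a return value above 128 from wait_builtin() as "interrupted, run
the trap now".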

Apart from that, and not interrupting a wait for a foreground process,
the standard says very little about when traps should be run, and sorry
Harald, but your "as soon as" from ...

har...@gigawatt.nl said:
  | In the same way, I think that except when overridden by 2.11, the "when"
  | in "Otherwise, the argument action shall be read and executed by the
  | shell when one of the corresponding conditions arises." should be
  | interpreted as "as soon as". 

The only way to do that literally would be to run the trap from the signal
handler, as that is "as soon as" the condition arises.   But I think we all
know that is simply not possible.   So let's read that as "as soon as
possible after" instead.   That's getting more reasonable, but someone needs
to decide just what is possible - will running the trap handler mess up the
shell's internal state while a new command is parsed and executed?

Eg: what if we had
VAR=$(grep  -c some_string file*.c)
and a (trapped) signal arrives while grep is running (more correctly, while
the process running the command substitution, which runs grep, is running).
We know we cannot interrupt the wait for that foreground process to run the
trap handler, so we don't - but do we execute the trap handler before we
assign the answer to VAR ?

This kind of thing is why shells in general only normally even look to
see if there is a trap handler waiting to run after completing executing
commands, not in the middle of one.

The relevance of this is that if a signal arrives while the wait command
is executing (or as Chet suggested, while doing whatever housekeeping is
needed to prepare to run it, like looking to see what command comes next)
but before the relevant wait*() system call is running, the trap won't
be run until after the wait command completes.

That's the way shells have always worked, and the way the standard (for that
very reason) says can be relied upon by scripts - which is much of its
purpose, to tell script writers what they can expect will work, and what
will not necessarily work.

Now the standard doesn't preclude a shell from looking for pending traps
as frequently as it wants to, every second line of C code in the shell could
be
if (traps_pending) run_trap_handler();

But most shell authors (I believe) wouldn't consider that reasonable.

The standard also doesn't preclude a shell from taking extra measures to
push the arrival of a signal in the wait utility down to occur in the wait
system call (or whatever replaces it).   Old shells didn't do that, as there
simply was no mechanism for that, and using SIGCHLD was always problematic
because of its quite different implementations on different (now ancient)
systems, hence we have what we have.   The standard is not a legislature,
and does not change the rules just because what is there doesn't look
reasonable, or you don't like it.

If you want things changed, convince the major shell maintainers that this
race condition is something they should make their shell go slower to
fix (because that's really all it takes on modern systems) and wait for
them to comply.   When most major shells (perhaps all major shells, and
some of the others) have implemented what you want

Re: "wait" loses signals

2020-02-24 Thread Denys Vlasenko

On 2/24/20 9:59 AM, Robert Elz wrote:

> And that is, when the wait/waitpid/wait3/wait4/waitid/wait6 (whatever the
> shell uses) system call returns EINTR, the wait utility exits with a
> status indicating it was interrupted by that signal (status > 128 means
> 128+SIGno) and runs the trap.


This is racy. Even if you try to code it as tightly as possible:

   if (got_sigs) { handle signals }
   got_sigs = 0;
   pid = waitpid(...);  /* without WNOHANG */
   if (pid < 0 && errno == EINTR) { handle signals }

since signals can be delivered not only while waitpid() syscall
is in kernel, but also when we are only about to enter the kernel
- and in this case, the shell's sighandler will set the flag variable,
but then we enter the kernel *and sleep*.


> Because that is what shells actually did - the alternative being to
> simply restart the wait on EINTR like many other system calls that are
> interrupted by signals are conventionally restarted.
>
> Like it or not, that's what shells did, what most still do, and what
> the standard says must be done.


Standard does not say that. It says "when the shell is waiting for an
asynchronous command to complete", it does not say "when the shell is
waiting in a waitpid() syscall".

Yes, you are right, you can argue that shell is minimally fulfilling
standard's requirement if it does something like my code example.

I am arguing that it can be made better: it can be coded so that
signal has no time window to arrive before waitpid() but have its
trap delayed to after "wait" builtin ends (which might be "never", mind you).




Re: "wait" loses signals

2020-02-24 Thread Robert Elz
Date: Mon, 24 Feb 2020 11:50:55 +0100
From: Denys Vlasenko
Message-ID:  <47762f41-e393-30cd-50ed-43c6bdd29...@redhat.com>

  | This is racy. Even if you try to code it as tightly as possible:

Absolutely, I agree.   The question is more whether it really matters.

  | Standard does not say that. It says "when the shell is waiting for an
  | asynchronous command to complete", it does not say "when the shell is
  | waiting in a waitpid() syscall".

That's because the standard has no notion of "system calls", just functions,
but the shell is not actually waiting (it is doing something else) until
the system call causes it to pause if the desired (or any) child is not
ready for reaping.

  | Yes, you are right, you can argue that shell is minimally fulfilling
  | standard's requirement if it does something like my code example.

It doesn't even need to do that.   As I said, the standard's primary purpose
is to advise script writers what they can depend upon the shell providing.
And a wait utility that is race free wrt traps is not one of those things.  That's
because what scripts can rely upon is based upon what shells implement (or
implemented at the time - with some more recent additions for some more
modern functionality that has been widely adopted).

Even now, as was demonstrated, most shells have this "issue" - hence the
standard simply cannot tell users that they can rely on something else.
Any attempt to read it otherwise than that is simply wrong, and obviously
so (though sometimes it is possible to argue that the wording used does
not express the intent obviously enough - or occasionally - at all, but
when that happens, all you will ever get as the best possible result is
corrected wording that says what it intended to say in the first place).

The standard also serves to advise shell authors what they need to do to
provide a shell which should run all conformant shell applications, but it
would be grossly unfair (and improper) to require of new shells something
that old ones didn't do.  But that side of it is less relevant to this
discussion, except that it doesn't tell shell authors to make sure there
are no race conditions wrt traps in the wait utility (it would do that in
quite different language than this, but that would be the point, if it were
there).

  | I am arguing that it can be made better:

That part is arguable

  | it can be coded so that signal has no time window to arrive before
  | waitpid() but have its trap delayed to after "wait" builtin ends
  | (which might be "never", mind you).

It can be so coded, but when done (correctly, and assuming a trapped signal
has arrived) it won't be never, the signal will interrupt the sys call that
actually pauses (which will most likely not be wait*() in this case, but that's
irrelevant) and the wait would correctly exit.  A few shells have done that.

The question is whether it is worth going to that extra effort - or in other
words, is it really better.

As best I can tell, it only really matters to shell scripts attempting to
use signals/traps as an IPC mechanism, and that I simply don't believe they
should be doing - programs that need that kind of functionality should be
written in a language that provides more suitable mechanisms (and usually
not only for simple one bit message passing that a signal offers).

There are lots of programming languages around, they each have their particular
niche - the reason their inventors created them in the first place.  Use an
appropriate one, rather than attempting to shoehorn some feature that is needed
into a language that was never intended for it - just because you happen to
be a big fan of that language.   Spread your wings, learn a new one - the hard
part about any programming isn't the programming language, it is getting the
desired concepts and structures straight - do that and any competent programmer
can make a working program in any suitable language (ie: not expecting anyone
to write an operating system in COBOL) fairly quickly.   They'll make it
better after they get used to the idioms of the language, but providing
the method needed to solve the problem is known first (that's usually the
hard part, for anything non trivial) the actual coding into a working, if
not necessarily ideal, form is simple.

kre




Re: "wait" loses signals

2020-02-24 Thread Daniel Colascione


> There are lots of programming languages around, they each have their
> particular
> niche - the reason their inventors created them in the first place.  Use
> an
> appropriate one, rather than attempting to shoehorn some feature that is
> needed
> into a language that was never intended for it - just because you happen
> to
> be a big fan of that language.   Spread your wings, learn a new one

That is a poor excuse for not fixing bugs. Maybe you can torture the
standards into confessing that this behavior isn't a bug. This behavior
nevertheless surprises people and nevertheless precludes various things
people want to do with a shell. Don't you think it's better that programs
work reliably than that they don't? Of course something working
intuitively 99.9% of the time and hanging 0.1% of the time is a bug. It's
not appropriate to treat that 0.1% hang as some kind of cosmic punishment
for using shell in a manner you find inappropriate: remember when we
believed in mechanism, not policy? Nor is the presence of the bug in other
shells adequate justification for leaving this one in a bad state. I've
never understood the philosophy of people who want to leave bugs unfixed.

No, it's not that much trouble to fix the bug. The techniques for fixing
this kind of signal race are well-known. In particular, instead of
waitpid, you use a self-pipe and signal the pipe in the signal handler,
and you have a signal handler for SIGCHLD. If we had a pwaitpid (like
pselect) we could use that too. If I could get Chet to look at my patches,
I'd fix it myself.
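
For readers unfamiliar with the self-pipe technique mentioned above, a
minimal C sketch follows. It is an illustration under stated assumptions,
not the patch referred to in this message: the names self_pipe,
wake_handler, self_pipe_init, watch_signal and wait_with_self_pipe are
hypothetical, only one child is waited for, and stale bytes left in the pipe
from earlier signals are not drained.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int self_pipe[2] = { -1, -1 };

/* Handler for SIGCHLD and for every trapped signal: queue the signal
 * number as one byte. write() is async-signal-safe. */
static void wake_handler(int signo)
{
    int saved_errno = errno;
    unsigned char byte = (unsigned char)signo;
    ssize_t n = write(self_pipe[1], &byte, 1);

    (void)n;    /* if the pipe is full, the reader will wake up anyway */
    errno = saved_errno;
}

static int self_pipe_init(void)
{
    int flags;

    if (pipe(self_pipe) < 0)
        return -1;
    /* Non-blocking write end: the handler must never sleep. */
    flags = fcntl(self_pipe[1], F_GETFL);
    if (flags < 0 || fcntl(self_pipe[1], F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;
    return 0;
}

static int watch_signal(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = wake_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Hypothetical "wait PID": child's exit status, or 128 + signo if a
 * watched signal other than SIGCHLD is seen first. */
static int wait_with_self_pipe(pid_t pid)
{
    assert(self_pipe[0] >= 0);
    for (;;) {
        int status;
        unsigned char byte;
        ssize_t n;
        pid_t r = waitpid(pid, &status, WNOHANG);

        if (r == pid)
            return WIFEXITED(status) ? WEXITSTATUS(status)
                                     : 128 + WTERMSIG(status);
        if (r < 0)
            return 127;
        /* Sleep on the pipe, not in waitpid(): a signal that arrived a
         * moment ago has already left a byte, so read() returns at once. */
        n = read(self_pipe[0], &byte, 1);
        if (n < 0 && errno == EINTR)
            continue;
        if (n != 1)
            return 127;
        if (byte != SIGCHLD)
            return 128 + byte;
        /* SIGCHLD: loop and poll the child again. */
    }
}
```

The point of the design is that the blocking call is read() on the pipe, so
a signal delivered just before the sleep is never lost; it simply makes the
read return immediately.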




Re: "wait" loses signals

2020-02-24 Thread Robert Elz
Date: Mon, 24 Feb 2020 04:58:31 -0800
From: "Daniel Colascione"
Message-ID:  <07d1441d41280e6f9535048d6485.squir...@dancol.org>

  | That is a poor excuse for not fixing bugs.

Only if they are bugs.

  | Maybe you can torture the standards into confessing that this
  | behavior isn't a bug.

No torture required.  Once again, the standard documents the way users
can expect shells to behave.  That is what a standard is - a common set
of agreed operations (or whatever is appropriate for the object being
standardised).   It does not (or should not) ever invent new stuff and
require it.   Shells have always worked this way, so that is how the
standard is written - that is what users can expect to happen (that is why
it is called a "standard" after all).

Once again, you are free to attempt to get shells to change this way of
working, and if you can get a suitable set to agree, and implement something
new, that more meets your needs, then perhaps that might one day become the
standard, and later appear in the standards document.   New and/or changed
features do happen, especially when they don't break backwards compatibility,
which this wouldn't.

  | This behavior nevertheless surprises people

Lots of things surprise people.

  | and nevertheless precludes various things
  | people want to do with a shell.

That was my point, that you just labelled a poor excuse.   Not everything
is suitable for implementation in sh.   Sometimes you simply have to go
elsewhere.  Wanting to do it in shell doesn't make it reasonable or possible.

I want the shell to feed my dog, where is the dogfood option?

  | Don't you think it's better that programs
  | work reliably than that they don't?

Yes, when they are written correctly.

  | Of course something working intuitively 99.9% of the time and
  | hanging 0.1% of the time is a bug.

Nonsense.   An alternative explanation is that your intuition is wrong,
and that it often works that way is just by chance.   The program is
broken because it is making unjustified assumptions about how things are
specified to work.   This is the kind of common error that people who
program (in any language) by guesswork often make "I saw Fred did this,
and I tried it, and it worked for me like I thought it would, so it
must do this similar thing like I think it will too".   Rubbish.

  | I've never understood the philosophy of people who want to leave
  | bugs unfixed.

Nor have I, except sometimes perhaps when it comes to costs.   But the
issue here is whether this is a bug.  Your belief that it is does not make
it so.

  | No, it's not that much trouble to fix the bug.

It isn't, if it needs fixing - but any fix for this will slow the shell
(for what that matters, but some people care).  Further there are simpler
cheaper techniques than the one described.

  | If we had a pwaitpid (like pselect) we could use that too.

Yes, if.   If that existed a fix would be almost cost free.  If.
I suspect that before you can get bash (note: I am no authority and have
no voice in these decisions, I work on a different shell) to make use
of something like that it would need to be implemented in quite a lot
of systems, including the commercial ones, which tend to be very conservative
about adding new features for fun.

kre




Re: "wait" loses signals

2020-02-24 Thread Daniel Colascione
> Date:Mon, 24 Feb 2020 04:58:31 -0800
> From:"Daniel Colascione" 
> Message-ID:  <07d1441d41280e6f9535048d6485.squir...@dancol.org>
>
>   | That is a poor excuse for not fixing bugs.
>
> Only if they are bugs.

That executing traps except in the case where you lose one rare race is a
bug is painfully obvious.

>
>   | Maybe you can torture the standards into confessing that this
>   | behavior isn't a bug.
>
> No torture required.  Once again, the standard documents the way users
> can expect shells to behave.

I refuse to let the standard cap the quality of a shell's implementation.
Missing signals this way is pure negative. It doesn't add to any
capability or help any user. It can only make computing unreliable and
hurt real users trying to automate things with shell.

> That is what a standard is - a common set
> of agreed operations

A standard is a bare minimum.

> attempt to get shells to change this way of
> working, and if you can get a suitable set to agree, and implement
> something
> new, that more meets your needs, then perhaps that might one day become
> the
> standard,

This opposition to doing more than the bare minimum that the standard
requires makes this task all the harder.

>   | This behavior nevertheless surprises people
>
> Lots of things surprise people.

Sometimes people deserve to be surprised. This isn't one of those times.

>   | and nevertheless precludes various things
>   | people want to do with a shell.
>
> That was my point, that you just labelled a poor excuse.   Not everything
> is suitable for implementation in sh.   Sometimes you simply have to go
> elsewhere.

Making people go elsewhere *on purpose* by refusing to fix bugs is not
good software engineering.

> Wanting to do it in shell doesn't make it reasonable or
> possible.

It is reasonable and possible. All that's needed is to make an existing
operation that's almost perfectly reliable in fact perfectly reliable, and
as I've mentioned, it's not that hard.

> I want the shell to feed my dog, where is the dogfood option?

We're talking about fixing an existing shell feature, not adding a new one.

>   | Don't you think it's better that programs
>   | work reliably than that they don't?
>
> Yes, when they are written correctly.

By fixing this bug, we make a class of programs correct automatically.

>
>   | Of course something working intuitively 99.9% of the time and
>   | hanging 0.1% of the time is a bug.
>
> Nonsense.   An alternative explanation is that your intuition is wrong,
> and that it often works that way is just by chance.

We're talking about a documented feature that users expect to work a
certain way and that almost always *does* work that way and that diverges
from this behavior only under rare circumstances. Not the same as spacebar
heating.

> The program is
> broken because it is making unjustified assumptions about how things are
> specified to work.

This moralistic outlook is not helpful. It doesn't *matter* whether a
program is right or wrong or making unjustified assumptions or not.
Punishing programs does not make the world better.
When a piece of infrastructure can transform these programs from incorrect
to correct at next to zero cost, it behooves that infrastructure component
to do that.

> This is the kind of common error that people who
> program (in any language) by guesswork often make "I saw Fred did this,
> and I tried it, and it worked for me like I thought it would, so it
> must do this similar thing like I think it will too".   Rubbish.

Ever hear of the "pit of success" [1]? It's the idea that software gets better
when you make the intuitive thing happen to be the correct thing. Why
should we require a degree of cleverness greater than what a domain
requires? Why *not* make it so that, to the greatest extent possible,
"I saw Fred do this" leads people to good patterns? Like I
said before, making things difficult on purpose doesn't actually achieve
anything.

[1] https://docs.microsoft.com/en-us/archive/blogs/brada/the-pit-of-success

>
>   | I've never understood the philosophy of people who want to leave
>   | bugs unfixed.
>
> Nor have I, except sometimes perhaps when it comes to costs.   But the
> issue here is whether this is a bug.  Your belief that it is does not make
> it so.

Your belief that this behavior is acceptable doesn't make it so --- except
under a pointlessly literal interpretation of the standards.

>   | No, it's not that much trouble to fix the bug.
>
> It isn't, if it needs fixing - but any fix for this will slow the shell
> (for what that matters, but some people care).  Further there are simpler
> cheaper techniques than the one described.

The fix for this issue will not meaningfully affect the speed of the
shell. Instead of waiting on waitpid directly, we wait on a pipe. Plenty
of programs do this already. Micro-optimizing for system call count will
hardly slow the shell: other factors matter a lot 

Re: "wait" loses signals

2020-02-24 Thread Chet Ramey
On 2/24/20 3:59 AM, Robert Elz wrote:

> The relevance of this is that if a signal arrives while the wait command
> is executing (or as Chet suggested, while doing whatever housekeeping is
> needed to prepare to run it, like looking to see what command comes next)
> but before the relevant wait*() system call is running, the trap won't
> be run until after the wait command completes.

There are two separate cases here: if the signal arrives before the wait
command has begun executing (during `housekeeping') or if it arrives
after the wait command has begun running but before it calls whatever
system call it uses to wait for children.

The second case is relatively easy to solve; Jilles wrote a message
detailing the alternatives. Bash uses the longjmp-out-of-the-trap-signal-
handler mechanism. The trap handler only has to know that the wait builtin
is running and that there's a valid saved environment to longjmp to.
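
A minimal C sketch of the general longjmp-out-of-the-handler idea may help
here. It is not bash's code: wait_env, wait_env_valid, pending_signal,
trap_handler, trap_signal and wait_with_longjmp are hypothetical names, and
the sketch ignores the case where the signal lands after waitpid() has
already reaped the child but before the flag is cleared, which a real shell
would have to handle.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static sigjmp_buf wait_env;                        /* saved environment */
static volatile sig_atomic_t wait_env_valid = 0;   /* is the wait running? */
static volatile sig_atomic_t pending_signal = 0;

static void trap_handler(int signo)
{
    pending_signal = signo;
    if (wait_env_valid) {
        wait_env_valid = 0;
        /* Abandon whatever the wait was doing, sleeping or not. */
        siglongjmp(wait_env, 1);
    }
}

static int trap_signal(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = trap_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Hypothetical "wait PID": child's exit status, or 128 + signo if a
 * trapped signal arrives once the wait has begun. */
static int wait_with_longjmp(pid_t pid)
{
    int status;
    pid_t r;

    assert(pid > 0);
    /* savemask = 1: siglongjmp() restores the signal mask, so the signal
     * that was blocked while its handler ran is unblocked again. */
    if (sigsetjmp(wait_env, 1) != 0)
        return 128 + pending_signal;    /* arrived here from trap_handler() */
    wait_env_valid = 1;
    /* From here on a trapped signal aborts the wait, whether it lands just
     * before waitpid() enters the kernel or while it sleeps there. */
    r = waitpid(pid, &status, 0);
    wait_env_valid = 0;
    if (r < 0)
        return 127;
    return WIFEXITED(status) ? WEXITSTATUS(status) : 128 + WTERMSIG(status);
}
```

Jumping out of a handler is only safe when the interrupted code is itself
async-signal-safe, which is why the flag is set for no longer than the
waitpid() call.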

The first case is trickier: there's always going to be a window between
the time the shell checks for pending traps and the time the wait builtin
starts to run. You can't really close it unless you're willing to run the
trap out of the signal handler, which everyone agrees is a bad idea, but
you can squeeze it down to practically nothing.

I think I've got a way to close that and make signals that arrive in that
first case act as if they arrived `while the shell is waiting by means of
the wait utility'. It's not much code and not disruptive.

With that, bash runs the original test script (100,000 iterations) on RHEL7
and macOS without a `stray' sleep. It's in the git devel branch.

I'm going to defer the question of whether or not that's the `right' thing
to do -- people have been trying to make signals into an IPC mechanism
since Berkeley introduced `reliable signals'.

Can we all take a breath now?

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/



Re: "wait" loses signals

2020-02-24 Thread Chet Ramey
On 2/24/20 7:58 AM, Daniel Colascione wrote:

> No, it's not that much trouble to fix the bug. The techniques for fixing
> this kind of signal race are well-known. In particular, instead of
> waitpid, you use a self-pipe and signal the pipe in the signal handler,
> and you have a signal handler for SIGCHLD. 

You've just substituted a real IPC mechanism (pipes) for the one people
are trying to make signals into.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/



Re: "wait" loses signals

2020-02-24 Thread Denys Vlasenko

On 2/24/20 5:18 PM, Chet Ramey wrote:

> The first case is trickier: there's always going to be a window between
> the time the shell checks for pending traps and the time the wait builtin
> starts to run. You can't really close it unless you're willing to run the
> trap out of the signal handler, which everyone agrees is a bad idea, but
> you can squeeze it down to practically nothing.


dash uses something along these lines:

   sigfillset(&mask);
   sigprocmask(SIG_SETMASK, &mask, &mask);
   while (!pending_sig)
       sigsuspend(&mask);
   sigprocmask(SIG_SETMASK, &mask, NULL);
   if (pending_sig)
       handle_signals(pending_sig);
   pid = waitpid(... WNOHANG);

It sleeps in sigsuspend(), not in waitpid(). This way we wait for both
signals *and* children (by virtue of getting SIGCHLD for them).
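
The fragment above can be fleshed out into a compilable C sketch along the
following lines. This is an illustration of the sigsuspend() pattern, not
dash's actual source: trapped_signal, on_signal, watch_signal and
wait_with_sigsuspend are hypothetical names, and the sketch assumes the
caller has installed handlers for SIGCHLD and for each trapped signal and
has none of them blocked.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Number of the last trapped signal, or 0 if none is pending. */
static volatile sig_atomic_t trapped_signal = 0;

static void on_signal(int signo)
{
    /* SIGCHLD needs a handler only so that it interrupts sigsuspend();
     * its default disposition would not wake the shell up. */
    if (signo != SIGCHLD)
        trapped_signal = signo;
}

static int watch_signal(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Hypothetical "wait PID": child's exit status, or 128 + signo if a
 * trapped signal is pending or arrives first. */
static int wait_with_sigsuspend(pid_t pid)
{
    sigset_t all, old;
    int result;

    assert(pid > 0);
    sigfillset(&all);
    sigprocmask(SIG_SETMASK, &all, &old);   /* no handler can run now */
    for (;;) {
        int status;
        pid_t r;

        if (trapped_signal != 0) {
            result = 128 + trapped_signal;
            trapped_signal = 0;
            break;
        }
        r = waitpid(pid, &status, WNOHANG);
        if (r == pid) {
            result = WIFEXITED(status) ? WEXITSTATUS(status)
                                       : 128 + WTERMSIG(status);
            break;
        }
        if (r < 0) {
            result = 127;
            break;
        }
        /* Atomically restore the old mask and sleep. Signals that became
         * pending since the checks above are delivered immediately, so
         * there is no window in which one can be missed. */
        sigsuspend(&old);
    }
    sigprocmask(SIG_SETMASK, &old, NULL);
    return result;
}
```

Note that this sketch treats a trapped signal that was recorded before the
call began as if it had arrived during the wait.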




Re: "wait" loses signals

2020-02-24 Thread Harald van Dijk

On 24/02/2020 08:59, Robert Elz wrote:

> har...@gigawatt.nl said:
>    | In the same way, I think that except when overridden by 2.11, the "when"
>    | in "Otherwise, the argument action shall be read and executed by the
>    | shell when one of the corresponding conditions arises." should be
>    | interpreted as "as soon as".
>
> The only way to do that literally would be to run the trap from the signal
> handler, as that is "as soon as" the condition arises.   But I think we all
> know that is simply not possible.   So let's read that as "as soon as
> possible after" instead.


Sure.


> That's getting more reasonable, but someone needs
> to decide just what is possible - will running the trap handler mess up the
> shell's internal state while a new command is parsed and executed?
>
> Eg: what if we had
> VAR=$(grep  -c some_string file*.c)
> and a (trapped) signal arrives while grep is running (more correctly, while
> the process running the command substitution, which runs grep, is running).
> We know we cannot interrupt the wait for that foreground process to run the
> trap handler, so we don't - but do we execute the trap handler before we
> assign the answer to VAR ?


Although 2.11 that you referred to states "When a signal for which a 
trap has been set is received while the shell is waiting for the 
completion of a utility executing a foreground command", that is not 
what any shell implements. Instead, what shells implement is more like 
"while the shell is waiting for the completion of a foreground command". 
Consider for instance (sleep 5): the sleep command run in a subshell. 
The parent shell is not waiting for the completion of a utility 
executing a foreground command, the parent shell is waiting for the 
completion of the subshell, which is not a utility. Nevertheless, shells 
do not run any trap action until after the subshell has completed.


This is just sloppy wording in the standard. It is probably written this 
way so that it is clear that given { foo; bar; }, if a signal is 
received while foo is running, any trap action runs before bar. The 
whole compound command shouldn't be considered the foreground command, 
only foo should be.


In your example, I would expect the whole of VAR=$(...) to be considered 
the foreground command that the shell is waiting for, and that is what 
almost all shells do. A notable exception is zsh.



> This kind of thing is why shells in general only normally even look to
> see if there is a trap handler waiting to run after completing executing
> commands, not in the middle of one.
>
> The relevance of this is that if a signal arrives while the wait command
> is executing (or as Chet suggested, while doing whatever housekeeping is
> needed to prepare to run it, like looking to see what command comes next)
> but before the relevant wait*() system call is running, the trap won't
> be run until after the wait command completes.
>
> That's the way shells have always worked, and the way the standard (for that
> very reason) says can be relied upon by scripts - which is much of its
> purpose, to tell script writers what they can expect will work, and what
> will not necessarily work.


You say "have always worked", but I'd like to point out that this whole 
thing started because I was looking at code that Herbert Xu had changed 
in dash to avoid this race back in 2009. That's over 10 years ago now. 
The behaviour of dash before that, and several shells now, can not, or 
at least not now, be said to be how shells have always worked.


Cheers,
Harald van Dijk