Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait'
This is probably the strangest bug (or maybe pair of bugs) I've run into in nearly five years of breaking shells by developing modernish.

I've traced it to an interaction between bash >= 4.2 (i.e.: bash with shopt -s lastpipe) and variants of the Almquist shell, at least: dash, gwsh, Busybox ash, FreeBSD sh, and NetBSD 9.0rc2 sh.

Symptom: if 'return' is invoked on bash in the last element of a pipe executed in the main shell environment, then if you subsequently 'exec' an Almquist shell variant so that it has the same PID, its 'wait' builtin breaks.

I can consistently reproduce this on Linux, macOS, FreeBSD, NetBSD 9.0rc2, OpenBSD, and Solaris.

To reproduce this, you need bash >= 4.2, some Almquist shell variant, and these two test scripts:

---begin test.bash---
fn() {
	: | return
}
shopt -s lastpipe || exit
fn
exec "${1:-dash}" test.ash
---end test.bash---

---begin test.ash---
echo '*ash-begin'
: &
echo '*ash-middle'
wait "$!"
echo '*ash-end'
---end test.ash---

When executing test.bash with dash, gwsh, Busybox ash, or FreeBSD sh, then test.ash simply waits forever on executing 'wait "$!"'.

$ bash test.bash
*ash-begin
*ash-middle
(nothing until ^C)

NetBSD sh behaves differently. NetBSD 8.1 sh (as installed on sdf.org and sdf-eu.org) seems to act completely normally, but NetBSD 9.0rc2 sh (on my VirtualBox test VM) segfaults. Output on NetBSD 9.0rc2:

$ bash test.bash /bin/sh
*ash-begin
*ash-middle
[1]   Segmentation fault      bash test.bash sh

I don't know if the different NetBSD sh behaviour is because the older NetBSD sh doesn't have the bug, or because some factor on the sdf*.org systems causes it to not be triggered.

To me, this smells like the use of some uninitialised value on various Almquist shells. Tracing that is beyond my expertise though.

Whether this also represents a bug in bash or not, I can't say. But no other shells trigger this that I've found, not even ksh93 and zsh which also execute the last element of a pipe in the main shell environment.
- Martijn

-- 
modernish -- harness the shell
https://github.com/modernish/modernish
Re: Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait'
On 06/02/2020 16:12, Martijn Dekker wrote:
> This is probably the strangest bug (or maybe pair of bugs) I've run into in nearly five years of breaking shells by developing modernish.
>
> I've traced it to an interaction between bash >= 4.2 (i.e.: bash with shopt -s lastpipe) and variants of the Almquist shell, at least: dash, gwsh, Busybox ash, FreeBSD sh, and NetBSD 9.0rc2 sh.
>
> Symptom: if 'return' is invoked on bash in the last element of a pipe executed in the main shell environment, then if you subsequently 'exec' an Almquist shell variant so that it has the same PID, its 'wait' builtin breaks.
>
> I can consistently reproduce this on Linux, macOS, FreeBSD, NetBSD 9.0rc2, OpenBSD, and Solaris.
>
> To reproduce this, you need bash >= 4.2, some Almquist shell variant, and these two test scripts:
>
> ---begin test.bash---
> fn() {
> 	: | return
> }
> shopt -s lastpipe || exit
> fn
> exec "${1:-dash}" test.ash
> ---end test.bash---
>
> ---begin test.ash---
> echo '*ash-begin'
> : &
> echo '*ash-middle'
> wait "$!"
> echo '*ash-end'
> ---end test.ash---
>
> When executing test.bash with dash, gwsh, Busybox ash, or FreeBSD sh, then test.ash simply waits forever on executing 'wait "$!"'.

Nice test. bash leaves the process in a state where SIGCHLD is blocked, and the various ash-based shells do not unblock it. Because of that, they do not pick up on the fact that the child process has terminated.

I would consider this a bug both in bash and in the ash-based shells.

Cheers,
Harald van Dijk
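One way to check the blocked-SIGCHLD diagnosis on Linux is to read the `SigBlk` field of `/proc/<pid>/status`, which holds the blocked-signal mask as a hexadecimal number. The sketch below is only an illustration and not part of the original report: the helper name `sig_in_mask` is made up, it relies on the Linux-only `/proc` layout, and it assumes SIGCHLD is signal 17, which holds on most Linux architectures but not all of them.

```shell
# Hypothetical helper: succeed if signal number $2 is set in the
# hexadecimal signal mask $1 (as found in the SigBlk field).
# Assumption: the default of 17 is SIGCHLD on most Linux architectures.
sig_in_mask() {
	signo=${2:-17}
	# Bit (signo - 1) of the mask corresponds to signal number signo.
	[ $(( (0x$1 >> ($signo - 1)) & 1 )) -eq 1 ]
}

# Linux-only: report whether the current shell has SIGCHLD blocked.
if [ -r "/proc/$$/status" ]; then
	blocked=$(awk '/^SigBlk:/ { print $2 }' "/proc/$$/status")
	if sig_in_mask "$blocked"; then
		echo "SIGCHLD is blocked"
	else
		echo "SIGCHLD is not blocked"
	fi
fi
```

Placed at the top of test.ash, this would be expected to report a blocked SIGCHLD when the script is started through test.bash, if the diagnosis above applies to the shell being tested.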
Re: Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait'
    Date:        Thu, 6 Feb 2020 19:29:41 +
    From:        Harald van Dijk
    Message-ID:

  | Nice test.

Yes!

  | and the various ash-based shells do not unblock it.

We do now; the fix for that will be in 9.0 when it is released. ("now" as in as of the past half hour...)

  | Because of that,
  | they do not pick up on the fact that the child process has terminated.

It was actually a race condition; for me it 'worked' about half the time (it seems to depend on whether the wait happens in the parent before or after the sub-process exits).

kre

ps: that core dump was an "impossible to happen" condition that this actually made happen. That will be fixed as well, both by actually now making it impossible like it was supposed to be (by not blocking or ignoring SIGCHLD, ever) and by testing for it happening anyway... The secondary fix for that one is still to be committed after I investigate some more - I know what happened, I just need to make sure what will happen now if this situation, which should never occur, ever does happen again.

That the 8.1 NetBSD sh seems to work is more just an artifact of how it runs the race, I believe (or guess) - the wait & process invocation code has changed a lot in 9 (well, 9.0RC2 for now), which seems to have made the race a close call, instead of one sided. But that was not an artifact of the environment for the test; it happens for me on a real -8(ish) type system as well.

kre
Re: Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait'
    Date:        Thu, 6 Feb 2020 16:12:06 +
    From:        Martijn Dekker
    Message-ID:  <10e3756b-5e8f-ba00-df0d-b36c93fa2...@inlv.org>

  | NetBSD sh behaves differently. NetBSD 8.1 sh (as installed on sdf.org
  | and sdf-eu.org) seems to act completely normally, but NetBSD 9.0rc2 sh
  | (on my VirtualBox test VM) segfaults. Output on NetBSD 9.0rc2:

I have updated my opinion on that: I think it is "don't have the bug", though it is possible a blocked SIGCHLD acts differently on NetBSD than on other systems.

On NetBSD it seems to affect nothing (the shell does not rely upon receiving SIGCHLD, so not getting it is irrelevant), and the wait code, when given an arg (as your script did), would always wait until that process exited, and return as soon as it did.

None of that is changed in -9 ... but the wait command now has -n, which also works with a list of pids. While waiting for any process in its list to exit, it gets told each time a process is reaped (from lower level code) which job that process was from (new code of mine), so it can see if the process that completed finished one of the jobs for which it is waiting.

I wasn't expecting to see exiting children that are not the shell's children, which is what happens here - the

	: | return

creates a child (of bash) to run the ':' command, then the function returns without waiting for that one. You then exec the NetBSD shell, which inherits that child (a child of the same process) but is unaware of it. If that one happens to exit while the ash script running on the NetBSD sh is doing the wait command, core would dump. (Fix for that is now in the tree.)

If the bash-invoked ':' command exited some other time and was noticed (e.g. between commands) as having finished, it would simply have been ignored. I saw both happen.

kre
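The "inherited but unknown child" situation described above can be illustrated without bash or lastpipe. The following is only a sketch under stated assumptions, not the reproduction from this thread: the function name `inherited_child_demo` is made up, `sh` stands for whatever POSIX shell is installed, and the `sleep 1` child is a stand-in for the ':' process that bash left behind.

```shell
# Hypothetical demo: start a background child, then replace the shell
# with a fresh one via exec. The new shell keeps the same PID, so the
# kernel still treats 'sleep' as its child, but the new shell's job
# table knows nothing about it.
inherited_child_demo() {
	sh -c '
		sleep 1 >/dev/null 2>&1 &
		exec sh -c "wait; echo \"wait status: \$?\""
	'
}

inherited_child_demo
```

Because the new shell has no known jobs, a plain 'wait' typically returns at once instead of waiting for the inherited 'sleep'. The crash discussed above concerns the less common case where such an unknown child is reaped while 'wait' is busy waiting for a job the shell does know about.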