Configuration Information [Automatically generated, do not change]: Machine: x86_64 OS: linux-gnu Compiler: gcc Compilation CFLAGS: -g -O2 -Wno-parentheses -Wno-format-security uname output: Linux chatoyancy 5.6.13-100.fc30.x86_64 #1 SMP Fri May 15 00:36:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux Machine Type: x86_64-pc-linux-gnu
Bash Version: 5.1 Patch Level: 4 Release Status: release Description: From Bash 4.3 to the `devel' branch, `read -t timeout' fails to time out and blocks the execution forever at a small probability. The current implementation tries to stop the execution of read(2) by SIGALRM which is arranged by setitimer(2) or alarm(2). The problem is that SIGALRM can arrive when Bash is not calling read(2) or, when read(2) is called but still trying to set up the file descriptor for read. This problem of timeout failure depends on a race condition, and the probability of the timeout failure largely depends on the following conditions: - Operating system, CPU, etc. - Bash versions, Bash release status (release, maint, etc.) - Constructs (subshell, etc.) used near `read -t'. Commands executed before `read -t' - The type of file descriptors, such as files, pipes, and sockets. Usually, the probability is very low, but I found that sometimes the probability of the timeout failure can be about several percents, which is not negligible. Repeat-By: Originally, this issue was reported to a devel branch of my shell program by `eximus (3ximus)' in the following comment: https://github.com/akinomyoga/ble.sh/issues/82#issuecomment-770390986 Here is the reduced test case I created: #!/bin/bash rm -f a.pipe mkfifo a.pipe exec 9<> a.pipe rm -f a.pipe for c in {0..2000}; do (eval "echo {0..$c}" & read -u 9 -t 0.001) >/dev/null printf $'\r\e[Kok %d' "$c" done echo The expected behavior is to count up to 2000 without stopping, but the actual Bash's behavior is to randomly stop at some number before reaching 2000. I've tested with various versions of Bash in various systems. Here are the observations: - The problem starts to occur in Bash 4.3. The devel branch of Bash also exhibits the same problem. - The release status (which is specified in `configure.ac') largely matters. As far as `maint' is used, the failure probability is quite small. However, once the release status is changed to `release', the probability rises very much. - I could consistently reproduce it in Linux, FreeBSD, and Cygwin. I couldn't reproduce it in Minix. Among them, the probability is largest in Cygwin. - It also occurs with different setups of stdin for the `read' builtin including `exec 9< <(sleep 100)', `exec 9< /dev/udp/0.0.0.0/80', etc. Each setup results in a different probability of the timeout failure. - The failure probability also depends on the amount of output from `echo' in the above test case. There is a certain range of the amount where the failure probability becomes large. The range depends on other conditions. - The above construct `(COMMAND & read -t TIMEOUT) < SLOWFD' caused a significantly large probability of the timeout failure, but the timeout failures were also observed in simpler constructs very rarely in past (particularly in Cygwin). I've summarized the detailed tests and investigations at the following comment. https://github.com/akinomyoga/ble.sh/issues/82#issuecomment-774770516 I just briefly summarize it in this mail. Fix: I first checked the commit where the problem starts to occur, which was the following one: > commit 10e784337238e6081ca5f9bdc8c21492f7b89388 > Author: Chet Ramey <c...@caleb.ins.cwru.edu> > Date: Mon Mar 4 08:10:00 2013 -0500 > > commit bash-20130208 snapshot > This is the relevant log in ChangeLog: > 2/9 > --- > > builtins/read.def > - sigalrm_seen, alrmbuf: now global so the rest of the shell (trap.c) > can use them > - sigalrm: just sets flag, no longer longjmps to alrmbuf; problem was > longjmp without manipulating signal mask, leaving SIGALRM blocked > > quit.h > - move CHECK_ALRM macro here from builtins/read.def so trap.c: > check_signals() can call it > > trap.c > - check_signals: add call to CHECK_ALRM before QUIT > - check_signals_and_traps: call check_signals() instead of including > CHECK_ALRM and QUIT inline. Integrating check for read builtin's > SIGALRM (where zread call to check_signals_and_traps can see it) > fixes problem reported by Mike Frysinger <vap...@gentoo.org> In the older code, `longjmp' was called in the signal handler for SIGALRT, but it was changed to be called from `check_signals'. This was introduced after the discussion at: https://lists.gnu.org/archive/html/bug-bash/2013-02/msg00016.html Actually, another report in 2018 has already the same issue as follows: https://lists.gnu.org/archive/html/bug-bash/2018-01/msg00114.html The corresponding fix was at > commit 0275a139abe94c198eb04b05b39ca74c137bfc65 > Author: Chet Ramey <chet.ra...@case.edu> > Date: Mon Feb 5 10:34:47 2018 -0500 > > commit bash-20180202 snapshot The relevant ChangeLog is > 1/31 > ---- > lib/sh/zread.c > - zread,zreadintr: call check_signals() before calling read() to > minimize the race window between signal delivery, signal handling, > and a blocking read(2). Partial fix for FIFO read issue reported by > Oyvind Hvidsten <oyvind.hvids...@dhampir.no> In this commit, a line `check_signals ()' has been added immediately before the call of read(2) in `zread (lib/sh/zread.c)' to check the arrival of SIGALRM. However, this is mere `a partial fix' (as in the above ChangeLog), and there is still a possibility that SIGALRM arrives between the check and the call of `read(2)' as explained in the following comment: https://lists.gnu.org/archive/html/bug-bash/2018-01/msg00128.html I examined the effect of the above change, but it seems it doesn't have much effect on the above test case (in Repeat-By section of this mail). Considering the fact that the failure probability depends on the type of the file descriptor, I guess read(2) takes some time to set up the file descriptor before it starts to accept signals to cancel the read with EINTR. I suspect that the implementation by `SIGALRT' has too strong limitation to solve this issue. a Maybe one could revert the strategy to the one before the 20130209 change, but it actually leads to an undefined behavior to use `longjmp' in signal handlers [e.g., see the following link]. https://wiki.sei.cmu.edu/confluence/display/c/SIG30-C.+Call+only+asynchronous-safe+functions+within+signal+handlers b Maybe one could recursively generate `SIGALRM' by calling `raise(2)' in the signal handler so that `read(2)' wouldn't miss `SIGALRM', but it doesn't look a good solution and is also an undefined behavior in C standard (though it seems to be allowed in the POSIX standard, I'm not sure whether all the existing systems do exactly follow the POSIX for this particular one). Instead, I believe, it is more natural to use `select(2)', which is already used to implement `read -t 0'. In the attached patch `0001-Use-select-2-for-the-read-timeout.patch', I used `select(2)' to implement the timeout of `read(2)'. When `select(2)' is unavailable (i.e., `HAVE_SELECT' is defined in `config.h'), it still falls back to the old strategy with `SIGALRM', but I believe most of modern systems support `select(2)'. I tested the behavior with the above test case, and also tested the behavior by hand. Could you take a look at the patch? By the way, is there a reason that SIGINT is written by its value `2' in `bash_event_hook (bashline.c)'? I suggest to replace it with `SIGINT' as in the second patch `0002-replace-2-by-SIGINT.patch'. -- Koichi
0002-replace-2-by-SIGINT.patch
Description: Binary data
0001-Use-select-2-for-the-read-timeout.patch
Description: Binary data